
Best way to handle large lists?

I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
Oct 3 '06 #1
Chaz Ginger <cg********@hotmail.com> wrote:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
How about:

smallSet = set(smallList)
something = [ i for i in bigList if i not in smallSet ]

Use timeit.py on some representative data to see what difference that
makes.
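
For instance, a quick comparison along these lines (a minimal sketch; the
list sizes are made up, substitute your real data):

import timeit

setup = """
import random
bigList = [str(i) for i in range(10000)]
smallList = [str(random.randrange(10000)) for i in range(100)]
smallSet = set(smallList)
"""

list_scan = timeit.Timer('[i for i in bigList if i not in smallList]', setup)
set_lookup = timeit.Timer('[i for i in bigList if i not in smallSet]', setup)

# best of three repeats of ten runs each; the set version should win easily
print(min(list_scan.repeat(3, 10)))
print(min(set_lookup.repeat(3, 10)))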
Oct 3 '06 #2
Chaz Ginger <cg********@hotmail.com> writes:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
diff = list(set(bigList) - set(smallList))

Note that this doesn't get you the elements in any particular order.
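
A tiny sketch of that caveat, with made-up values:

bigList = ['c.example.com', 'a.example.com', 'b.example.com']
smallList = ['b.example.com']

diff = list(set(bigList) - set(smallList))
# diff holds 'a.example.com' and 'c.example.com', but in arbitrary order;
# sort it, or use the comprehension form, if order matters
diff.sort()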
Oct 3 '06 #3
Chaz Ginger wrote:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
>
> Thanks.
> Chaz
Hi!

If you have a big list, you can use a dbm-style database. They are very
quick: bsddb, flashdb, etc. See SleepyCat, or see the Python help.

The 'in' operator is very slow on large lists, but bsddb uses hash
values, so lookups are very quick. The SleepyCat database has many
extras; you can set the cache size and many other parameters.

Or, if you don't like dbm-style databases, you can use SQLite. It is
also quick, and you can use SQL commands. It is a little slower than
bsddb, but it works like an SQL server, and you can improve the speed
with special parameters.
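
For what it's worth, a minimal sketch of the SQLite route, using the
sqlite3 module that ships with Python 2.5+ (the table and column names
here are made up for illustration):

import sqlite3

bigList = ['a.example.com', 'b.example.com', 'c.example.com']
smallList = ['b.example.com']

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE big (name TEXT)')
conn.execute('CREATE TABLE small (name TEXT)')
conn.executemany('INSERT INTO big VALUES (?)', [(x,) for x in bigList])
conn.executemany('INSERT INTO small VALUES (?)', [(x,) for x in smallList])

# the difference is computed inside the database; neither table is modified
rows = conn.execute(
    'SELECT name FROM big WHERE name NOT IN (SELECT name FROM small)')
diff = [name for (name,) in rows]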

dd

Oct 3 '06 #4
I don't know enough about Python internals, but the suggested solutions
all seem to involve scanning bigList. Can this presumably linear
operation be avoided by using dict or similar to find all occurrences of
smallList items in bigList and then deleting those occurrences?

Bill Williams

In article <prrUg.1662$We.477@trndny08>,
Chaz Ginger <cg********@hotmail.com> wrote:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
>
> Thanks.
> Chaz
Oct 3 '06 #5
I don't know much about the Python internals either, so this may be the
blind leading the blind, but aren't dicts much slower to work with than
lists, and therefore wouldn't your suggestion to use dicts be much
slower? I think it's something to do with the comparative overhead of
using keys in dicts rather than positional indexes in lists/arrays...

At least that is what I thought.

Can anyone confirm this?

-h

Hari Sekhon

Bill Williams wrote:
> I don't know enough about Python internals, but the suggested solutions
> all seem to involve scanning bigList. Can this presumably linear
> operation be avoided by using dict or similar to find all occurrences of
> smallList items in bigList and then deleting those occurrences?
>
> Bill Williams
>
> [original question snipped]
Oct 3 '06 #6
Bill Williams enlightened us with:
> I don't know enough about Python internals, but the suggested
> solutions all seem to involve scanning bigList. Can this presumably
> linear operation be avoided by using dict or similar to find all
> occurrences of smallList items in bigList and then deleting those
> occurrences?
And how would that beat O(n)? Every element of bigList has to be
scanned at one point, either to compare it to every earlier element in
bigList and eliminate it, or to compare it to every element in
smallList.

Run benchmarks on the suggestions, and see which is fastest for
yourself.

Sybren
--
Sybren Stüvel
Stüvel IT - http://www.stuvel.eu/
Oct 3 '06 #7
I've done that and decided that Python's list comprehension isn't the
way to go. I was hoping that perhaps someone had some experience with a
C or C++ library with a Python interface that would make a difference.

Chaz

Sybren Stuvel wrote:
> Bill Williams enlightened us with:
>> I don't know enough about Python internals, but the suggested
>> solutions all seem to involve scanning bigList. Can this presumably
>> linear operation be avoided by using dict or similar to find all
>> occurrences of smallList items in bigList and then deleting those
>> occurrences?
>
> And how would that beat O(n)? Every element of bigList has to be
> scanned at one point, either to compare it to every earlier element in
> bigList and eliminate it, or to compare it to every element in
> smallList.
>
> Run benchmarks on the suggestions, and see which is fastest for
> yourself.
>
> Sybren
Oct 3 '06 #8
Sybren Stuvel <sy*******@YOURthirdtower.com.imagination> writes:
>> I don't know enough about Python internals, but the suggested
>> solutions all seem to involve scanning bigList. Can this presumably
>> linear operation be avoided by using dict or similar to find all
>> occurrences of smallList items in bigList and then deleting those
>> occurrences?
>
> And how would that beat O(n)? Every element of bigList has to be
> scanned at one point, either to compare it to every earlier element in
> bigList and eliminate it, or to compare it to every element in
> smallList.
Maybe the application should use sets instead of lists for these
collections.
Oct 3 '06 #9
Paul Rubin wrote:
> Sybren Stuvel <sy*******@YOURthirdtower.com.imagination> writes:
>>> I don't know enough about Python internals, but the suggested
>>> solutions all seem to involve scanning bigList. Can this presumably
>>> linear operation be avoided by using dict or similar to find all
>>> occurrences of smallList items in bigList and then deleting those
>>> occurrences?
>>
>> And how would that beat O(n)? Every element of bigList has to be
>> scanned at one point, either to compare it to every earlier element in
>> bigList and eliminate it, or to compare it to every element in
>> smallList.
>
> Maybe the application should use sets instead of lists for these
> collections.
What would sets do for me over lists?

Chaz
Oct 3 '06 #10
Chaz Ginger wrote:
> I have a system that has a few lists that are very large (thousands or
> tens of thousands of entries) and some that are rather small. Many times
> I have to produce the difference between a large list and a small one,
> without destroying the integrity of either list. I was wondering if
> anyone has any recommendations on how to do this and keep performance
> high? Is there a better way than
>
> [ i for i in bigList if i not in smallList ]
>
> Thanks.
> Chaz

IMHO the only way to speed things up is to know more about the
actual data in the lists (e.g. are the elements unique, can they
be sorted, etc.) and take advantage of all that information to
come up with a "faster" algorithm. If they are unique, sets
might be a good choice. If they are sorted, the bisect module
might help. The specifics about the list(s) may yield a faster
method.
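
For instance, a minimal bisect sketch, assuming both lists can be kept
sorted (the domain names are made up):

import bisect

def contains(sorted_seq, item):
    # binary search: O(log n) per test instead of the O(n) scan done by 'in'
    i = bisect.bisect_left(sorted_seq, item)
    return i < len(sorted_seq) and sorted_seq[i] == item

bigSorted = sorted(['c.example.com', 'a.example.com', 'b.example.com'])
smallSorted = sorted(['b.example.com'])

diff = [x for x in bigSorted if not contains(smallSorted, x)]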

-Larry
Oct 3 '06 #11
Chaz Ginger wrote:
> What would sets do for me over lists?
It's faster to tell whether something is in a set or dict than in a list
(for some minimum size).

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/
Oct 3 '06 #12
Larry Bates wrote:
> Chaz Ginger wrote:
>> [original question snipped]


> IMHO the only way to speed things up is to know more about the
> actual data in the lists (e.g. are the elements unique, can they
> be sorted, etc.) and take advantage of all that information to
> come up with a "faster" algorithm. If they are unique, sets
> might be a good choice. If they are sorted, the bisect module
> might help. The specifics about the list(s) may yield a faster
> method.
>
> -Larry
Each item in the list is a fully qualified domain name, e.g.
foo.bar.com. The order in the list has no importance. That is about all
there is to the lists, other than to say the number of items in a list
can top out at about 10,000.

Chaz
Oct 3 '06 #13

"Chaz Ginger" <cg********@hotmail.com> wrote in message
news:45**************@hotmail.com...
> Each item in the list is a fully qualified domain name, e.g.
> foo.bar.com. The order in the list has no importance.
So you don't actually need to use lists at all, then.
You can just use sets and write:

newSet = bigSet - littleSet
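
A tiny sketch of that, with made-up names; the difference leaves both
operands untouched, which covers the "without destroying the integrity
of either list" requirement:

bigSet = set(['a.example.com', 'b.example.com', 'c.example.com'])
littleSet = set(['b.example.com'])

newSet = bigSet - littleSet   # same as bigSet.difference(littleSet)
print(newSet)      # a.example.com and c.example.com
print(bigSet)      # unchanged
print(littleSet)   # unchanged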
Oct 3 '06 #14
Jeremy Sanders wrote:
Chaz Ginger wrote:
>What would sets do for me over lists?

It's faster to tell whether something is in a set or dict than in a list
(for some minimum size).
As a footnote, this program

import random
num = 100000

a = set( range(num) )
for i in range(100000):
    x = random.randint(0, num-1) in a

completes in less than a second, whereas

import random
num = 100000

a = range(num)
for i in range(100000):
    x = random.randint(0, num-1) in a

takes a long time on my computer.

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/
Oct 3 '06 #15
Jeremy Sanders wrote:
> Jeremy Sanders wrote:
>> Chaz Ginger wrote:
>>> What would sets do for me over lists?
>>
>> It's faster to tell whether something is in a set or dict than in a list
>> (for some minimum size).

> [benchmark code snipped; see previous post]
>
> Jeremy
Thanks Jeremy. I am in the process of converting my stuff to use sets! I
wouldn't have thought it would have made that big a deal! I guess it is
live and learn.

Peace,
Chaz
Oct 3 '06 #16
Hari Sekhon wrote:
> That is surprising since I read on this list recently that lists were
> faster than dicts

depends on what you're doing with them, of course.

> It was one reason that was cited as to why local vars are better than
> global vars.

L[int] is indeed a bit faster than D[string] (but not much), but that
doesn't mean that you can loop over *all* items in a list faster than
you can look up a single key in a dictionary.
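
A rough sketch of that distinction (sizes are arbitrary; this is not a
careful benchmark):

import timeit

setup = """
L = [str(i) for i in range(10000)]
D = dict((s, None) for s in L)
"""

# membership in a list scans the elements; membership in a dict is a
# single hash lookup
print(timeit.Timer("'9999' in L", setup).timeit(1000))
print(timeit.Timer("'9999' in D", setup).timeit(1000))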

</F>

Oct 3 '06 #17
Hi!

> Thanks Jeremy. I am in the process of converting my stuff to use sets! I
> wouldn't have thought it would have made that big a deal! I guess it is
> live and learn.

If you have simple records but a large amount of data, you can try
dbhash. That way you won't run out of memory...

dd
import dbhash
import time
import random
import gc
import sys

itemcount = 250000

db = dbhash.open('test.dbh', 'w')
for i in range(itemcount):
    db[str(i)] = str(i)

littlelist = []
littleset = set()

while len(littlelist) < 1000:
    x = str(random.randint(0, itemcount - 1))
    if not (x in littlelist):
        littlelist.append(x)
        littleset.add(x)

def DBHash():
    gc.collect()
    hk = db.has_key
    st = time.time()
    newlist = []
    for val in littlelist:
        if hk(val):
            newlist.append(val)
    et = time.time()
    print "Size", len(newlist)
    newlist.sort()
    print "Hash", hash(str(newlist))
    print "Time", "%04f" % (et - st)
    print

def Set():
    gc.collect()
    largeset = set()
    for i in range(itemcount):
        largeset.add(str(i))

    st = time.time()
    newset = largeset.intersection(littleset)
    newsetlist = []
    while newset:
        newsetlist.append(newset.pop())
    et = time.time()
    print "Size", len(newsetlist)
    newsetlist.sort()
    print "Hash", hash(str(newsetlist))
    print "Time", "%04f" % (et - st)

DBHash()
Set()

Oct 4 '06 #18
>> Maybe the application should use sets instead of lists for these
>> collections.
>
> What would sets do for me over lists?

Searching for an element in a list is O(n); searching for an element in
a set is O(1) (for reasonably distributed elements).

Harald

Oct 4 '06 #19
Hari Sekhon wrote:
> So are you saying that using a dict means a faster search since you
> only need to look up one value?
>
> I would think that you would have to look through the keys and stop at
> the first key that matches, since each key has to be unique, so perhaps
> if it is nearer the front of the set of keys it would be a quicker
> lookup?
http://en.wikipedia.org/wiki/Hash_table
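
To make that concrete, here is a toy table in the spirit of that page; it
is an illustration only, not how CPython's dict is actually implemented:

class ToyHashTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for i in range(nbuckets)]

    def _bucket(self, key):
        # the key's hash picks one bucket directly; no scan over all keys
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        self._bucket(key).append(key)

    def __contains__(self, key):
        # only the handful of keys in one bucket are ever compared
        return key in self._bucket(key)

t = ToyHashTable()
t.add('foo.bar.com')
print('foo.bar.com' in t)      # True
print('baz.example.com' in t)  # False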

</F>

Oct 4 '06 #20
