very large dictionaries

I need to do a search through about 50 million records, each of which
are less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of tree?

-- robin
Jul 18 '05 #1
1) If each record is 100 bytes plus a 10-byte key,
that's 5.5GB of memory at a minimum. That is an
unusually large amount of memory.

2) You can't really "search" through a dictionary
fast, but you can index into it via its key
very fast. Then again, a database read with an
indexed key on a table is also very quick, and it
has none of the startup cost of loading 50 million
records just to build an in-memory index.

We would really need to know more about what
you mean by "search through" to make an intelligent
suggestion.

Larry Bates
Syscon, Inc.

"robin" <es***********@yahoo.com> wrote in message
news:gk********************************@4ax.com...
> I need to do a search through about 50 million records, each of which
> are less than 100 bytes wide. A database is actually too slow for
> this, so I thought of optimising the data and putting it all in
> memory.
>
> There is a single key field, so a dictionary is an obvious choice for
> a structure, since Python optimises these nicely.
>
> But is there a better choice? Is it worth building some sort of tree?
>
> -- robin

Jul 18 '05 #2
robin <es***********@yahoo.com> wrote:
> I need to do a search through about 50 million records, each of which
> are less than 100 bytes wide. A database is actually too slow for
> this, so I thought of optimising the data and putting it all in
> memory.
>
> There is a single key field, so a dictionary is an obvious choice for
> a structure, since Python optimises these nicely.
>
> But is there a better choice? Is it worth building some sort of tree?


50M x 100 bytes = 5000MB = 5GB. Do you have 5GB of memory?

Since you are talking about key/value records, you can choose from GDBM
(gdbm), Berkeley DB (dbhash), or the disk-based dictionary front-end
(shelve). You can now access a GDBM database from the Bash shell. :-)
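
As a minimal sketch of the shelve route (the filename is made up; a
Shelf behaves like a dict backed by a dbm file on disk, so you trade
RAM for disk seeks):

import shelve

# Build the table once.
db = shelve.open('match_table')
db['some_key'] = ('field1', 'field2', 'field3')
db.close()

# Later: look up keys without holding 5GB in memory.
db = shelve.open('match_table', flag='r')
print(db.get('some_key'))
db.close()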

--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
No, I will not fix your computer! I'll reformat your harddisk, though.
Jul 18 '05 #3
robin wrote:
> I need to do a search through about 50 million records, each of
> which are less than 100 bytes wide. A database is actually too slow
> for this, so I thought of optimising the data and putting it all in
> memory.

You'll hardly find something faster than a real database.

> There is a single key field, so a dictionary is an obvious choice
> for a structure, since Python optimises these nicely.
>
> But is there a better choice? Is it worth building some sort of
> tree?


It depends on the structure of the data. What kind of data do you
have, what are you looking for, and on what machine (in particular, how
much memory is available) will the whole thing run?

Mathias
Jul 18 '05 #4
William Park wrote:
> Since you are talking about key/value records, you can choose from GDBM
> (gdbm), Berkeley DB (dbhash), or the disk-based dictionary front-end
> (shelve). You can now access a GDBM database from the Bash shell. :-)


If you just need fast lookup, maybe cdb
(http://pilcrow.madison.wi.us/#pycdb) can be of use: two disk accesses
per lookup.

--
Ciao,
Matteo
Jul 18 '05 #5
Let me specify further. Our actual data is enormous, and tests with
Postgres and MySQL indicate that the time taken just to build a single
index is on the order of hours, which is too long. After analysis we
believe that the important lookup info can be compressed to about 40
million records of 48 bytes each. Furthermore, we have the flexibility
to partition this into four somewhat equal parts. Each will hence be
about 460MB in size. Even with structural overhead I see no reason
this couldn't fit into memory. This is our match file.

Our process involves taking an input disk file, and traversing it one
record at a time. This we can sort by the same partition criteria used
to split the match file (so that each match file can be loaded but
once). For each input record we build a series of keys and compare
them to the appropriate match file; thus there are multiple lookups
per input record. An algorithm then determines which match record we
wish to pick and we write an ID to the input.

There is more to it than this, but these are the major elements. The
input table may be millions of records long, but likely will be much
smaller.

The total processing time will be a sum of:
time to sort/index input file
time to traverse input file
time to load in all parts of the match table
time to build keys on input records
time to search match table for each key
time to write match key ID
overhead time of routine

A new wrinkle is that the match table may have duplicate keys, so
storing it as a dictionary is not an option.
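
To make the lookup loop concrete, here is a toy sketch (the names and
data are invented; in the sketch I sidestep the duplicate-key problem
by mapping each key to a list of candidate records, which is not
necessarily the structure we will end up using):

# Toy match partition: duplicate keys pile up as lists of candidate IDs.
match = {}
for key, match_id in [('AAA', 1), ('BBB', 2), ('BBB', 3)]:
    match.setdefault(key, []).append(match_id)

def keys_for(record):
    # Stand-in for our real key-building rules.
    return [record[:3], record[3:6]]

for record in ['AAAxyz', 'BBBqrs']:        # stand-in input records
    candidates = []
    for key in keys_for(record):
        candidates.extend(match.get(key, []))
    if candidates:
        chosen = candidates[0]             # stand-in for the real picking algorithm
        print((record, chosen))            # in reality: write the ID back to the input record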

The data is alphanumeric.

Assume an arbitrarily powerful computer, since adding a GB of RAM is
not an issue.

The question I had, for those with experience with large data sets, is
what structure would best handle this problem?

-- robin
Jul 18 '05 #6
Matteo Dell'Amico <de***@toglimi.linux.it> wrote:
> If you just need fast lookup, maybe cdb
> (http://pilcrow.madison.wi.us/#pycdb) can be of use: two disk accesses
> per lookup.


I see that this allows duplicate keys and so may be exactly what is
needed. I will check it out further. Thank you!

-- robin
Jul 18 '05 #7

From what I understand...

- The match table is static (or does not change often)
- Your input data changes every time
(so you don't want to load it into a Postgres database every time)

Your problem :
- read the input data.
- for each input row, generate several keys
- look up each key in the match table
- take appropriate action

My solution :
- read the input data.
- for each input row, generate several keys
- save the keys in a tempfile, in this format:
(key, record id)
Generate the tempfile in partitions:
generate N tempfiles which will each contain one partition of the keys.
- For each tempfile:
- Load it
(you choose the number of partitions such that each file fits into about
1/4 of your RAM).
- Group by key (dict(key: [record ids]));
again, it fits into RAM.
Record IDs are already sorted because they were generated in that order.
- Sort on keys
(again, fast because it fits in RAM). See the sketch just below.
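
A rough sketch of that per-partition step, with made-up data:

# One partition's tempfile contents, as (key, record id) pairs.
pairs = [('BBB', 7), ('AAA', 3), ('BBB', 2), ('AAA', 9)]

# Group record ids by key; duplicates simply pile up in the list.
grouped = {}
for key, rec_id in pairs:
    grouped.setdefault(key, []).append(rec_id)

# Walk the keys in sorted order, ready for the merge against the match table.
for key in sorted(grouped):
    print((key, grouped[key]))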

Now you have a sorted key set which you want to look up in a match table
which is ALSO sorted
(you sorted it beforehand, because it doesn't change often).

Now start a key_pointer at the beginning of your key list, and a
match_pointer at the beginning of your match table.
You are basically merging two sorted lists. This does not involve any
searching, just a very simple algorithm.
The match table itself does not need to be partitioned.

If key_pointer.key < match_pointer.key:
advance key_pointer
if at the end of your tempfile partition, load the next one and sort it.
elif key_pointer.key > match_pointer.key:
advance match_pointer
elif key_pointer.key == match_pointer.key:
you have a match: record the id.

What do you think? I have already used this algorithm and it's very powerful.
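
For what it's worth, here is a runnable sketch of that merge in Python
(the function name and example data are made up; it assumes both sides
are already sorted by key and emits the cross product for duplicate
keys on either side):

def merge_join(keys, matches):
    """keys and matches are lists of (key, id) pairs, each sorted by key.
    Yields (record_id, match_id) for every key present on both sides."""
    i = j = 0
    while i < len(keys) and j < len(matches):
        k, m = keys[i][0], matches[j][0]
        if k < m:
            i += 1        # key list is behind: advance it (load the next partition when it runs out)
        elif k > m:
            j += 1        # match table is behind: advance it
        else:
            # Collect every entry sharing this key on both sides,
            # then emit the cross product.
            i2 = i
            while i2 < len(keys) and keys[i2][0] == k:
                i2 += 1
            j2 = j
            while j2 < len(matches) and matches[j2][0] == k:
                j2 += 1
            for _, rec_id in keys[i:i2]:
                for _, match_id in matches[j:j2]:
                    yield rec_id, match_id
            i, j = i2, j2

keys    = [('AAA', 1), ('BBB', 2), ('BBB', 3), ('CCC', 4)]
matches = [('BBB', 'm1'), ('BBB', 'm2'), ('DDD', 'm3')]
print(list(merge_join(keys, matches)))
# -> [(2, 'm1'), (2, 'm2'), (3, 'm1'), (3, 'm2')]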



robin wrote:

> Let me specify further. Our actual data is enormous, and tests with
> Postgres and MySQL indicate that the time taken just to build a single
> index is on the order of hours, which is too long. After analysis we
> believe that the important lookup info can be compressed to about 40
> million records of 48 bytes each. Furthermore, we have the flexibility
> to partition this into four somewhat equal parts. Each will hence be
> about 460MB in size. Even with structural overhead I see no reason
> this couldn't fit into memory. This is our match file.
>
> Our process involves taking an input disk file, and traversing it one
> record at a time. This we can sort by the same partition criteria used
> to split the match file (so that each match file can be loaded but
> once). For each input record we build a series of keys and compare
> them to the appropriate match file; thus there are multiple lookups
> per input record. An algorithm then determines which match record we
> wish to pick and we write an ID to the input.
>
> There is more to it than this, but these are the major elements. The
> input table may be millions of records long, but likely will be much
> smaller.
>
> The total processing time will be a sum of:
> time to sort/index input file
> time to traverse input file
> time to load in all parts of the match table
> time to build keys on input records
> time to search match table for each key
> time to write match key ID
> overhead time of routine
>
> A new wrinkle is that the match table may have duplicate keys, so
> storing it as a dictionary is not an option.
>
> The data is alphanumeric.
>
> Assume an arbitrarily powerful computer, since adding a GB of RAM is
> not an issue.
>
> The question I had, for those with experience with large data sets, is
> what structure would best handle this problem?
>
> -- robin


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
Jul 18 '05 #8

