
very large dictionaries

I need to do a search through about 50 million records, each of which
is less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of tree?

-- robin
Jul 18 '05 #1
1) If the key is 10 bytes on top of each 100-byte record, that's
5.5 GB (50 million x 110 bytes) of memory at a minimum. That is an
unusually large amount of memory.

2) You can't really "search" through dictionaries, but you can
index into them by key very fast. Then again, a database read
against an indexed table is also very quick, and it avoids the
startup cost of loading 50 million records into memory to build
an index.
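
As a sanity check on that 5.5 GB figure, Python's per-entry dictionary
overhead can be measured on a small sample before committing to the full
load; a minimal sketch, with made-up 10-byte keys and 100-byte values
standing in for the real records:

import sys

def approx_footprint(n_sample=100000):
    # Hypothetical sample: 10-byte keys mapping to distinct 100-byte records.
    sample = {("k%09d" % i).encode(): ("%0100d" % i).encode()
              for i in range(n_sample)}
    # sys.getsizeof(dict) covers only the hash table itself,
    # so add the sizes of the keys and values explicitly.
    total = sys.getsizeof(sample)
    total += sum(sys.getsizeof(k) + sys.getsizeof(v)
                 for k, v in sample.items())
    per_entry = total / float(n_sample)
    print("~%d bytes per entry, ~%.1f GB for 50 million records"
          % (per_entry, per_entry * 50e6 / 1e9))

approx_footprint()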

We would really need to know more about what
you mean by "search through" to make an intelligent
suggestion.

Larry Bates
Syscon, Inc.

"robin" <es***********@yahoo.com> wrote in message
news:gk********************************@4ax.com...
I need to do a search through about 50 million records, each of which
is less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of tree?

-- robin

Jul 18 '05 #2
robin <es***********@yahoo.com> wrote:
I need to do a search through about 50 million records, each of which
is less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all in
memory.

There is a single key field, so a dictionary is an obvious choice for
a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of tree?


50M x 100 bytes = 5000 MB = 5 GB. Have you got 5 GB of memory?

Since you are talking about key/value records, you can choose from GDBM
(gdbm), Berkeley DB (dbhash), or the disk-based dictionary front-end
(shelve). You can now access a GDBM database from the Bash shell. :-)
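
For what it's worth, a minimal sketch of the shelve route (the file name
and record layout are placeholders):

import shelve

# Build the on-disk dictionary once; shelve pickles the values.
db = shelve.open("match_table.db")
db["some_key"] = ("field1", "field2", 42)
db.close()

# Later runs reopen the file and index by key without loading
# all 50 million records into memory first.
db = shelve.open("match_table.db", flag="r")
print(db.get("some_key"))
db.close()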

--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
No, I will not fix your computer! I'll reformat your harddisk, though.
Jul 18 '05 #3
robin wrote:
I need to do a search through about 50 million records, each of
which is less than 100 bytes wide. A database is actually too slow
for this, so I thought of optimising the data and putting it all in
memory.
You'll hardly find something faster than a real database.
There is a single key field, so a dictionary is an obvious choice
for a structure, since Python optimises these nicely.

But is there a better choice? Is it worth building some sort of
tree?


It depends on the structure of the data. What kind of data do you
have, what are you looking for and on what machine (at least how
much memory is available) will the whole thing run?

Mathias
Jul 18 '05 #4
William Park wrote:
Since you are talking about key/value records, you can choose from GDBM
(gdbm), Berkeley DB (dbhash), or the disk-based dictionary front-end
(shelve). You can now access a GDBM database from the Bash shell. :-)


If you just need fast lookup, maybe cdb
(http://pilcrow.madison.wi.us/#pycdb) can be of use: two disk accesses
per lookup.
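
A sketch of what that could look like; the calls below (cdbmake, init,
get, getall) are my recollection of the python-cdb interface at that URL
rather than something spelled out in this thread, and the file names are
placeholders:

import cdb  # the python-cdb module linked above

# cdb files are write-once: build the constant database in one pass.
maker = cdb.cdbmake("match.cdb", "match.cdb.tmp")
maker.add("ABC123", "record 1")
maker.add("ABC123", "record 2")   # duplicate keys are allowed
maker.add("XYZ789", "record 3")
maker.finish()

# Each lookup is roughly two disk accesses.
db = cdb.init("match.cdb")
print(db.get("ABC123"))      # first record stored under the key
print(db.getall("ABC123"))   # every record stored under the key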

--
Ciao,
Matteo
Jul 18 '05 #5
Let me specify further. Our actual data is enormous, and tests with
Postgres and MySQL indicate that the time taken just to build a single
index is on the order of hours, which is too long. After analysis we
believe that the important lookup info can be compressed to about 40
million records of 48 bytes each. Furthermore, we have the flexibility
to partition this into four somewhat equal parts. Each will hence be
about 460MB in size. Even with structural overhead I see no reason
this couldn't fit into memory. This is our match file.
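
The partition criterion itself isn't spelled out here, so purely as a
sketch, a stable hash of the key stands in for it below; the file names
and key position are made up:

import zlib

NUM_PARTS = 4

def partition_of(key):
    # crc32 is a cheap, stable hash of an alphanumeric key.
    return zlib.crc32(key.encode("ascii")) % NUM_PARTS

# Split the match file into four roughly equal pieces.
outputs = [open("match_part_%d.dat" % p, "w") for p in range(NUM_PARTS)]
with open("match_file.dat") as src:
    for line in src:
        key = line[:12]   # placeholder: assume the key is the first 12 characters
        outputs[partition_of(key)].write(line)
for f in outputs:
    f.close()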

Our process involves taking an input disk file, and traversing it one
record at a time. This we can sort by the same partition criteria used
to split the match file (so that each match file can be loaded but
once). For each input record we build a series of keys and compare
them to the appropriate match file; thus there are multiple lookups
per input record. An algorithm then determines which match record we
wish to pick and we write an ID to the input.

There is more to it than this, but these are the major elements. The
input table may be millions of records long, but likely will be much
smaller.

The total processing time will be a sum of:
time to sort/index input file
time to traverse input file
time to load in all parts of the match table
time to build keys on input records
time to search match table for each key
time to write match key ID
overhead time of routine

A new wrinkle is that the match table may have duplicate keys, so
storing it as a dictionary is not an option.
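
If the in-memory route were still on the table, one common way around
duplicate keys is a dictionary that maps each key to a list of records
(the same grouping a reply below suggests); a minimal sketch with made-up
records:

from collections import defaultdict

# Hypothetical (key, record) pairs read from the match file.
match_rows = [
    ("ABC123", "record 1"),
    ("ABC123", "record 2"),   # duplicate key
    ("XYZ789", "record 3"),
]

match_table = defaultdict(list)
for key, record in match_rows:
    match_table[key].append(record)

print(match_table["ABC123"])   # ['record 1', 'record 2']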

The data is alphanumeric.

Assume an arbitrarily powerful computer, since adding a GB of RAM is
not an issue.

The question I had, for those with experience with large data sets, is
what structure would best handle this problem?

-- robin
Jul 18 '05 #6
Matteo Dell'Amico <de***@toglimi.linux.it> wrote:
If you just need fast lookup, maybe cdb
(http://pilcrow.madison.wi.us/#pycdb) can be of use: two disk accesses
per lookup.


I see that this allows duplicate keys and so may be exactly what is
needed. I will check it out further. Thank you!

-- robin
Jul 18 '05 #7

From what I understand...

- The match table is static (or does not change often)
- Your input data changes every time
(so you don't want to load it into a Postgres database every time)

Your problem :
- read the input data.
- for each input row, generate several keys
- lookup each key in the match table
- take appropriate action

My solution :
- read the input data.
- for each input row, generate several keys
- save the keys in a file (tempfile) in this format :
(key, record id)
Generate tempfile in partitions :
Generate N tempfiles which will each contain a partition of the keys.
- For each tempfile :
- Load it
(you choose a number of partitions such that the files fit into about
1/4 your RAM.)
- group by keys (dict( key: [record ids] ))
again it fits into RAM.
Record IDs are already sorted because they were generated this way.
- sort on keys
(again, fast because it fits in RAM)

Now you have a sorted key set which you want to look up in the match table,
which is ALSO sorted
(you sorted it beforehand because it doesn't change often).

Now start a key_pointer at the beginning of your key list, and a
match_pointer at the beginning of your match table.
You are basically merging two sorted lists. This does not involve any
searching, just a very simple algorithm.
Your match table is not partitioned.

If key_pointer.key < match_pointer.key:
    advance key_pointer
    if at end of your tempfile partition, load the next one and sort it.
elif key_pointer.key > match_pointer.key:
    advance match_pointer
elif key_pointer.key == match_pointer.key:
    you have a match; record the id.

What do you think ? I already used this algorithm and it's very powerful.
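
A runnable sketch of that merge, assuming both sides have already been
loaded and sorted as lists of (key, id) pairs (partition reloading is left
out); on a hit the inner loop walks every match-table row sharing the key,
since duplicates are allowed:

def merge_match(keys, match):
    """keys:  sorted list of (key, input_record_id) pairs
    match: sorted list of (key, match_record_id) pairs, duplicates allowed
    Yields (input_record_id, match_record_id) for every hit."""
    i = j = 0
    while i < len(keys) and j < len(match):
        k_key, rec_id = keys[i]
        m_key, match_id = match[j]
        if k_key < m_key:
            i += 1                      # this key has no match; next key
        elif k_key > m_key:
            j += 1                      # match rows are behind; catch up
        else:
            # Duplicate keys: report every match-table row with this key.
            jj = j
            while jj < len(match) and match[jj][0] == k_key:
                yield rec_id, match[jj][1]
                jj += 1
            i += 1                      # keep j so the next key can reuse the run

# Example with made-up data:
keys = [("AAA", 1), ("BBB", 2), ("BBB", 3), ("ZZZ", 4)]
match = [("AAA", 10), ("BBB", 20), ("BBB", 21), ("CCC", 30)]
print(list(merge_match(keys, match)))
# [(1, 10), (2, 20), (2, 21), (3, 20), (3, 21)]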



Let me specify further. Our actual data is enormous, and tests with
Postgres and MySQL indicate that the time taken just to build a single
index is on the order of hours, which is too long. After analysis we
believe that the important lookup info can be compressed to about 40
million records of 48 bytes each. Furthermore, we have the flexibility
to partition this into four somewhat equal parts. Each will hence be
about 460MB in size. Even with structural overhead I see no reason
this couldn't fit into memory. This is our match file.

Our process involves taking an input disk file, and traversing it one
record at a time. This we can sort by the same partition criteria used
to split the match file (so that each match file can be loaded but
once). For each input record we build a series of keys and compare
them to the appropriate match file; thus there are multiple lookups
per input record. An algorithm then determines which match record we
wish to pick and we write an ID to the input.

There is more to it than this, but these are the major elements. The
input table may be millions of records long, but likely will be much
smaller.

The total processing time will be a sum of:
time to sort/index input file
time to traverse input file
time to load in all parts of the match table
time to build keys on input records
time to search match table for each key
time to write match key ID
overhead time of routine

A new wrinkle is that the match table may have duplicate keys, so
storing it as a dictionary is not an option.

The data is alphanumeric.

Assume an arbitrarily powerful computer, since adding a GB of RAM is
not an issue.

The question I had, for those with experience with large data sets, is
what structure would best handle this problem?

-- robin


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
Jul 18 '05 #8

