
Large Dictionaries

Hi all.

I need to store a large number (10M) of keys in a hash table,
keyed by a tuple of (long_integer, integer). The standard Python
dictionary works well for small numbers of keys, but starts to
perform badly for me at around 5M insertions:

# keys   dictionary   metakit   (both using psyco)
------   ----------   -------
  1M         8.8s       22.2s
  2M        24.0s       43.7s
  5M       115.3s      105.4s

Has anyone written a fast hash module that is better suited to
large datasets?

p.s. Disk-based DBs are out of the question because most
key lookups will result in a miss, and lookup time is
critical for this application.
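
For reference, a minimal sketch of the kind of timing loop that could
produce numbers like these (the random key generation and the key
distribution here are assumptions, not details from the original test):

import random, time

def time_inserts(n):
    # Insert n (long_integer, integer) tuple keys and time the loop.
    d = {}
    start = time.time()
    for i in xrange(n):
        d[(random.getrandbits(64), i % 1000)] = None
    return time.time() - start

for millions in (1, 2, 5):
    print "%dM keys: %.1fs" % (millions, time_inserts(millions * 1000000))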

Cheers,
Chris
May 15 '06
Thomas Ganss wrote:
Klaas wrote:
4. Insert your keys in sorted order.

This advice is questionable ...

My gut feeling on this matter is:
if the insert times for pre-sorted values are far better
than the times for unsorted values, there is a chance
that the resulting tree is unbalanced: only one compare
operation after each insert and no re-balancing of the tree.

Re-indexing will probably give you far better access times
in this case. Another option is to drop non-RI indexes used only
for query optimization and recreate them after the batch insert.


Don't use your gut for such issues. Pre-sorted data is such
a common special case (in fact, the only easily describable
special case) that many systems handle this specially. For
example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

If you are using your gut without testing, go ahead and
presort. In any case, reading documentation and testing beats
gut feelings every time.
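
A quick way to check that claim (this timing harness is an
illustration, not from the post):

import random, time

def time_sort(seq):
    data = list(seq)          # sort a fresh copy each run
    start = time.time()
    data.sort()
    return time.time() - start

n = 1000000
rand = [random.random() for i in xrange(n)]
print "random:     %.2fs" % time_sort(rand)
print "pre-sorted: %.2fs" % time_sort(sorted(rand))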

--Scott David Daniels
sc***********@acm.org
May 29 '06 #41
Roy Smith wrote:
My guess would be that each resize grows the table by a constant
factor, which, IIRC, works out to amortized O(n).
Right. For <50000 entries, the size is multiplied by 4 each time,
and doubled each time for large dictionaries.
It's not as good as
creating the dict the right size in the first place, but it's really
not that bad. Somewhat more troubling is that it can lead to memory
fragmentation problems.
Depends on your operating system, of course. You get over a page size
fairly quickly, and then many systems use anonymous mappings, where
a range of pages gets mapped from swap on allocation, and unmapped
on deallocation. So while this can cause address space fragmentation,
it doesn't cause memory fragmentation (i.e. memory that is allocated
by the process but can't be used).
I don't understand why Python doesn't have a way to give a size hint
when creating a dict. It seems so obvious to be able to say
"d = dict(size=10*1000*1000)" if you know beforehand that's how many keys
you're going to add.


There is no need for it. If dicts don't work well in some cases, those
cases should be investigated and fixed, instead of working around the
problem by pushing it onto the developer.

As you said, it *should* take time linear in the number of
insertions. If it doesn't (and it appears not to), that may be due to
hash collisions, or due to the time for computing hashes not being
constant.
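
One way to probe that, as a sketch (the batch size and the plain
integer keys are assumptions): time inserts in fixed-size batches. If
insertion is amortized O(1), the per-batch times stay roughly flat
even as the dict resizes.

import time

d = {}
batch = 500000
for b in xrange(10):
    start = time.time()
    for i in xrange(b * batch, (b + 1) * batch):
        d[i] = None
    print "batch %d: %.2fs" % (b, time.time() - start)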

Regards,
Martin
May 30 '06 #42
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?
Jun 5 '06 #43
Lawrence D'Oliveiro wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


Isn't that just your definition of "reasonable"?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Love me, love my blog http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Jun 5 '06 #44
In article <ld***********************@lust.ihug.co.nz>,
Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:

For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


Read some of the old discussions in the python-dev archives.
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

"I saw `cout' being shifted "Hello world" times to the left and stopped
right there." --Steve Gonedes
Jun 5 '06 #45

Lawrence D'Oliveiro wrote:
In article <44********@nntp0.pdx.net>,
Scott David Daniels <sc***********@acm.org> wrote:
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


An already sorted list can be pathological for Quicksort, depending on
how you code it.

Iain

Jun 5 '06 #46
[Scott David Daniels]
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.

O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.

Python's current sorting algorithm exploits pre-existing order of many
kinds. For example, take a sorted list, "cut it in half, and riffle
shuffle the two halves together"; e.g.,

[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

That turns out to be a supernaturally good case too.
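
That interleaving generalizes to any even length; a small sketch of
the construction (the helper name is made up):

def riffle(n):
    # Interleave the two halves of range(n).
    half = n // 2
    out = []
    for i in range(half):
        out.append(i)
        out.append(half + i)
    return out

print riffle(16)
# -> [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]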
Jun 5 '06 #47
In article <ma**************************************@python.org>,
Tim Peters <ti********@gmail.com> wrote:
[Scott David Daniels]
For example, time timsort (Python's internal sort) on pre-sorted
data; you'll find it is handled faster than random data.


O(N) vs O(N log N), in fact.

[Lawrence D'Oliveiro]
But isn't that how a reasonable sorting algorithm should behave? Less
work to do if the data is already sorted?


For example, the O(N log N) heapsort is unreasonable compared to the
O(N**2) bubblesort by that reasoning (pre-sorted is actually a bad
case for heapsort, but a good case for bubblesort)? O(N log N)
sorting algorithms helped by pre-existing order are uncommon, unless
they do extra work to detect and exploit pre-existing order.


Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof, whereas some other
sorts, quicksort being a prime example, require great care to
avoid pathological cases.
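
A sketch of the kind of care meant here, using a deliberately naive
first-element-pivot quicksort (an illustration, not anyone's
production code):

def naive_quicksort(seq):
    # First-element pivot: on sorted input every partition is
    # maximally unbalanced, so recursion depth reaches len(seq)
    # and total work is O(N**2).
    if len(seq) <= 1:
        return seq
    pivot = seq[0]
    less = [x for x in seq[1:] if x < pivot]
    more = [x for x in seq[1:] if x >= pivot]
    return naive_quicksort(less) + [pivot] + naive_quicksort(more)

print naive_quicksort([3, 1, 2])   # fine on small, unordered input
# naive_quicksort(range(2000)) would exceed Python's default
# recursion limit, because sorted input is this version's worst case.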

--
Jim Segrave (je*@jes-2.demon.nl)

Jun 5 '06 #48
[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,
Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.
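
A tiny check of that max-heap observation (the helper function is an
assumption):

def is_max_heap(seq):
    # True if every parent is >= both of its children.
    n = len(seq)
    for i in xrange(n):
        for j in (2 * i + 1, 2 * i + 2):
            if j < n and seq[j] > seq[i]:
                return False
    return True

print is_max_heap(range(10, 0, -1))   # True: reverse-sorted input
print is_max_heap(range(1, 11))       # False: sorted input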

BTW, most people have never heard of Smoothsort, which was Dijkstra's
excruciating attempt to make heapsort exploit pre-existing order:

http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796.PDF

That's one of a hundred algorithms I rejected for Python ;-)
whereas some other sorts, quicksort being a prime example, require
great care to avoid pathological cases.


Indeed, Python gave up on quicksort for that reason. While the
samplesort hybrid that came between quicksort and the current sort
also had theoretical O(N**2) worst-case behavior, it was exceedingly
difficult to create such an input, and (unlike as for every quicksort
variant Python tried) no real-life case of undue sloth was ever
reported against it.
Jun 5 '06 #49
In article <ma**************************************@python.org>,
Tim Peters <ti********@gmail.com> wrote:
[Jim Segrave]
Actually, presorted lists are not a bad case for heapsort - it's quite
immune to any existing order or lack thereof,


Write a heapsort and time it. It's not a difference in O() behavior,
but more memory movement is required for a sorted list because
transforming the list into a max-heap at the start requires shuffling
the largest elements from the end of the list to its start. Initially
reverse-sorted is a "good case" for heapsort in the same sense
(because the list is already a max-heap then; initially sorted is as
far from being a max-heap as is possible). "good" and "bad" are both
O(N log N) here, though.


I did implement a crude heapsort before posting this and measured the
number of comparisons and the number of swaps.

==================================================

#!/usr/local/bin/python

import random

class HeapSorter(object):
    def __init__(self, seq):
        self.compares = 0
        self.swaps = 0
        self.length = len(seq)
        self.seq = seq

    def __sift(self, i):
        # Sift seq[i] down, swapping with its larger child, until
        # the max-heap property holds below it.
        while True:
            j = 2 * i + 1
            if j + 1 < self.length:
                self.compares += 1
                if self.seq[j + 1] > self.seq[j]:
                    j = j + 1
            if j >= self.length:
                return

            self.compares += 1
            if self.seq[i] > self.seq[j]:
                return

            self.swaps += 1
            self.seq[i], self.seq[j] = (self.seq[j], self.seq[i])
            i = j

    def __makeheap(self):
        # Build the max-heap bottom-up from the last internal node.
        for i in range(self.length / 2, -1, -1):
            self.__sift(i)

    def sort(self):
        self.__makeheap()
        # Repeatedly move the current max to the end, shrink the
        # heap, and re-sift the root.
        for i in range(self.length - 1, -1, -1):
            self.swaps += 1
            self.seq[i], self.seq[0] = (self.seq[0], self.seq[i])
            self.length -= 1
            self.__sift(0)

if __name__ == '__main__':

    s = list(range(100000))
    random.shuffle(s)

    heap = HeapSorter(s)
    heap.sort()
    print "Random list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)
    heap = HeapSorter(s)
    heap.sort()
    print "Sorted list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)

    s.reverse()
    heap = HeapSorter(s)
    heap.sort()
    print "Reverse Sorted list: compares %d, swaps %d" % \
        (heap.compares, heap.swaps)

==================================================

Which gave the following results:

/usr/home/jes% python ./heapsort
Random list: compares 3019969, swaps 1575263
Sorted list: compares 3112517, swaps 1650855
Reverse Sorted list: compares 2926640, swaps 1497435

Assuming compares and swaps dominate other bookkeeping, then heapsort
is quite stable in its performance, although the constant in the O(N log N)
is much larger than for other sorts.
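
To put that larger constant in perspective, one could time the class
above against the built-in sort (a sketch assuming the HeapSorter
class from the listing above is in scope; absolute numbers will vary
by machine):

import random, time

s = list(range(100000))
random.shuffle(s)

a, b = list(s), list(s)
start = time.time()
HeapSorter(a).sort()
heap_time = time.time() - start

start = time.time()
b.sort()                      # timsort, the built-in sort
builtin_time = time.time() - start

print "heapsort: %.2fs, built-in sort: %.2fs" % (heap_time, builtin_time)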

--
Jim Segrave (je*@jes-2.demon.nl)

Jun 5 '06 #50
