473,699 Members | 2,715 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Sorted and reversed on huge dict ?

Hello,

i would like to sort(ed) and reverse(d) the result of many huge
dictionaries (a single dictionary will contain ~ 150000 entries). Keys
are words, values are count (integer).

i'm wondering if i can have a 10s of these in memory, or if i should
proceed one after the other.

but moreover i'm interested in saving theses as values, keys sorted and
reversed (ie most frequent first), i can do it with sort from unix
command but i wonder how i should do it with python to be memory
friendly.

can it be done by using :

from itertools import izip
pairs = izip(d.itervalu es(), d.iterkeys())
for v, k in reversed(sorted (pairs)):
print k, v

or will it be the same as building the whole list ?

Nov 3 '06 #1
13 2345
vd*****@yahoo.f r writes:
i would like to sort(ed) and reverse(d) the result of many huge
dictionaries (a single dictionary will contain ~ 150000 entries). Keys
are words, values are count (integer).

i'm wondering if i can have a 10s of these in memory,
Depends on how much memory your machine has.
or if i should proceed one after the other.
Obviously that's more memory friendly if you can do it that way.
from itertools import izip
pairs = izip(d.itervalu es(), d.iterkeys())
for v, k in reversed(sorted (pairs)):
print k, v

or will it be the same as building the whole list ?
I think the above is pretty good. sorted necessarily builds and
returns a list, but itervalues/iterkeys, izip, and reversed, just
build small iterator objects.

If the lists are really large, your next step after the above is
probably to use an external sort, but 150000 entries is not that many,
and anyway if sorting is a strain on available memory, then having the
dicts in memory at all will probably also be a strain. Maybe you
should start looking into storing the dict contents externally, such
as in a database.
Nov 3 '06 #2
vd*****@yahoo.f r wrote:
i would like to sort(ed) and reverse(d) the result of many huge
dictionaries (a single dictionary will contain ~ 150000 entries). Keys
are words, values are count (integer).
not sure 150k entries qualify as huge, though, unless you're short on
memory.
i'm wondering if i can have a 10s of these in memory, or if i should
proceed one after the other.
why not just try it out?
but moreover i'm interested in saving theses as values, keys sorted and
reversed (ie most frequent first), i can do it with sort from unix
command but i wonder how i should do it with python to be memory
friendly.

can it be done by using :

from itertools import izip
pairs = izip(d.itervalu es(), d.iterkeys())
for v, k in reversed(sorted (pairs)):
print k, v

or will it be the same as building the whole list ?
sorted() needs access to all data, so it'll build a list, even if you
feed it a generator. you will have to test it yourself, but I suspect
that the most memory-efficient way to do what you want could be:

items = d.items()
items.sort(key= operator.itemge tter(1), reverse=True)

the items list would require a couple of megabytes for 150k dictionary
entries, or so. the key map needs some memory too, but the rest of the
sort is done in place.

</F>

Nov 3 '06 #3
Fredrik Lundh <fr*****@python ware.comwrites:
items = d.items()
items.sort(key= operator.itemge tter(1), reverse=True)

the items list would require a couple of megabytes for 150k dictionary
entries, or so. the key map needs some memory too, but the rest of
the sort is done in place.
I think the OP's method avoided the key map.
Nov 3 '06 #4
Paul Rubin wrote:
> items = d.items()
items.sort(key= operator.itemge tter(1), reverse=True)

the items list would require a couple of megabytes for 150k dictionary
entries, or so. the key map needs some memory too, but the rest of
the sort is done in place.

I think the OP's method avoided the key map.
right. I should have written "most efficient", not memory efficient.
the items+itemgette r approach is a bit faster, but uses a few extra
megabytes of memory temporarily, during the sort.

</F>

Nov 3 '06 #5
thanks for your replies :)

so i just have tried, even if i think it will not go to the end =i
was wrong : it is around 1.400.000 entries by dict...

but maybe if keys of dicts are not duplicated in memory it can be done
(as all dicts will have the same keys, with different (count) values)?

memory is 4Gb of ram, is there a good way to know how much ram is used
directly from python (or should i rely on 'top' and other unix
command? by now around 220mb is used for around 200.000 words handled
in 15 dicts)

Nov 3 '06 #6
vd*****@yahoo.f r writes:
but maybe if keys of dicts are not duplicated in memory it can be done
(as all dicts will have the same keys, with different (count) values)?
There will still be a pointer for each key, the strings themselves
won't be duplicated.
memory is 4Gb of ram,
That sounds like enough ram to hold all your stuff easily.
is there a good way to know how much ram is used
directly from python (or should i rely on 'top' and other unix
command?
I think try resource.getrus age()
by now around 220mb is used for around 200.000 words handled in 15
dicts)
That sounds very manageable given a 4gb machine. Otherwise, since
it sounds like you're trying to scan some big text corpus and figure
out word frequencies, you could do it the old fashioned way. Write
the words one per line into a big file, then sort the file with the
Unix sort utility (which is an external sort), then read the sorted
file (or pipe it through "uniq -c") to figure out the counts.
Nov 3 '06 #7
vd*****@yahoo.f r wrote:
thanks for your replies :)

so i just have tried, even if i think it will not go to the end =i
was wrong : it is around 1.400.000 entries by dict...

but maybe if keys of dicts are not duplicated in memory it can be done
(as all dicts will have the same keys, with different (count) values)?
Definitely a good strategy. The easiest way is to use intern(key) when
storing the values. (This will only work if you are using 8bit
strings. You'll have to maintain your own object cache if you are
using unicode).

I've reduced the memory requirements of very similar apps this way.
-Mike

Nov 4 '06 #8

so it still unfinished :) around 1GB for 1033268 words :) (comes from a
top unix command)

Paul i was also thinking on doing it like that by pip-ing to 'sort |
uniq -c | sort -nr' , but i'm pleased if Python can handle it. (well
but maybe Python is slower? will check later...)

Klaas i do not know about intern construct, i will have look, but
when googling i first found a post from Raymond Hettinger so i'm going
to mess my mental space :)
http://mail.python.org/pipermail/pyt...er/040433.html

best regards.

Nov 4 '06 #9
vd*****@yahoo.f r wrote:
Klaas i do not know about intern construct, i will have look, but
when googling
the description in the documentation is pretty concise, I think:

http://effbot.org/pyref/intern

</F>

Nov 4 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
3565
by: kristofera | last post by:
I am trying to do a distinct grouping of some nodes sorted by a numeric value but for some reason the distinct (preceding-sibling filter) is applied to the result as if not sorted. If I don't use the preceding-sibling filter, the nodes are properly sorted. Is this a bug in the xslt processor I'm using (.net framework 1.1) or is this correct behaviour? Any alternative solutions that will show me the lowest priced item in each category?...
1
2011
by: Dan | last post by:
In .NET, what's the most efficient way to enumerate through all the files in a directory sorted by time (from oldest to newest)?
6
5509
by: Kamilche | last post by:
I have a code snippet here that prints a dict in an arbitrary order. (Certain keys first, with rest appearing in sorted order). I didn't want to subclass dict, that's error-prone, and overkill for my needs. I just need something that returns a value like dict.__str__, with a key ordering I specify. If you have any opinions on how it could be made better, I'm all ears!
2
2022
by: shearichard | last post by:
Hi - I want to take something like ... lstIn = lstIn.append({'COM_AUTOID': 1, 'PRG_AUTOID': 10, 'LEA_AUTOID': 1000}) lstIn.append({'COM_AUTOID': 1, 'PRG_AUTOID': 11, 'LEA_AUTOID': 2000}) lstIn.append({'COM_AUTOID': 1, 'PRG_AUTOID': 11, 'LEA_AUTOID': 2001}) lstIn.append({'COM_AUTOID': 1, 'PRG_AUTOID': 11, 'LEA_AUTOID': 2003}) lstIn.append({'COM_AUTOID': 1, 'PRG_AUTOID': 12, 'LEA_AUTOID': 3000}) lstIn.append({'COM_AUTOID': 1,...
13
2155
by: Steven Bethard | last post by:
Jean-Paul Calderone <exarkun@divmod.comwrote: Interesting. Could you give a few illustrations of this? (I didn't run into the same problem at all, so I'm curious.) Steve
11
11405
by: John | last post by:
I am coding a radix sort in python and I think that Python's dictionary may be a choice for bucket. The only problem is that dictionary is a mapping without order. But I just found that if the keys are numeric, the keys themselves are ordered in the dictionary. part of my code is like this: radix={} for i in range(256):
2
1578
by: loial | last post by:
The following code gives the error d=sortedmachines TypeError: list indices must be integers What works for the unsorted dictionary does not work for the sorted dictionary. Can anyone help?
8
2588
by: Guy | last post by:
Is there a better way to search identical elements in a sorted array list than the following: iIndex = Array.BinarySearch( m_Array, 0, m_Array.Count, aSearchedObject ); aFoundObject= m_Array; m_ResultArray.Add ( aFoundObject);
1
1077
by: ++imanshu | last post by:
Hi, Is there a reason why two similarly named functions Sorted and Reversed return different types of data or is it an accident. Thanks, ++imanshu
0
9172
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
9032
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
0
7745
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6532
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5869
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4374
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4626
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3054
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
2344
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.