Sorted and reversed on huge dict?

Hello,

I would like to apply sorted() and reversed() to the results of many
huge dictionaries (a single dictionary will contain ~150,000 entries).
Keys are words, values are counts (integers).

I'm wondering if I can have tens of these in memory, or if I should
process them one after the other.

Moreover, I'm interested in saving these as (value, key) pairs, sorted
and reversed (i.e. most frequent first). I can do it with the Unix sort
command, but I wonder how I should do it in Python in a memory-friendly
way.

Can it be done by using:

from itertools import izip
pairs = izip(d.itervalues(), d.iterkeys())
for v, k in reversed(sorted(pairs)):
    print k, v

Or will it be the same as building the whole list?

Nov 3 '06 #1
vd*****@yahoo.fr writes:
> I would like to apply sorted() and reversed() to the results of many
> huge dictionaries (a single dictionary will contain ~150,000 entries).
> Keys are words, values are counts (integers).
>
> I'm wondering if I can have tens of these in memory,
Depends on how much memory your machine has.
> or if I should process them one after the other.
Obviously that's more memory-friendly if you can do it that way.
> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> Or will it be the same as building the whole list?
I think the above is pretty good. sorted() necessarily builds and
returns a list, but itervalues/iterkeys, izip, and reversed just
build small iterator objects.
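(As an aside: if only the top N most frequent words are needed,
heapq.nlargest keeps an N-item heap instead of sorting all ~150,000
pairs. A minimal sketch in the same Python 2 idiom, assuming d maps
words to counts as above:)

import heapq
from itertools import izip

def top_words(d, n=100):
    # nlargest scans the pairs once and keeps only the n largest,
    # so extra memory stays O(n) instead of O(len(d))
    pairs = izip(d.itervalues(), d.iterkeys())
    return heapq.nlargest(n, pairs)   # [(count, word), ...], biggest first

for v, k in top_words(d, 10):
    print k, v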

If the lists are really large, your next step after the above is
probably to use an external sort, but 150000 entries is not that many,
and anyway if sorting is a strain on available memory, then having the
dicts in memory at all will probably also be a strain. Maybe you
should start looking into storing the dict contents externally, such
as in a database.
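(A minimal sketch of that direction, using only the stdlib: shelve
gives a dbm-backed, dict-like object, so the counts live on disk rather
than in RAM. The filename is invented for illustration.)

import shelve

# counts behaves like a dict but stores its entries on disk
counts = shelve.open("wordcounts.db")
for word in ["spam", "egg", "spam"]:
    counts[word] = counts.get(word, 0) + 1
print counts["spam"]   # 2
counts.close()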
Nov 3 '06 #2
vd*****@yahoo.fr wrote:
> I would like to apply sorted() and reversed() to the results of many
> huge dictionaries (a single dictionary will contain ~150,000 entries).
> Keys are words, values are counts (integers).
Not sure 150k entries qualify as huge, though, unless you're short on
memory.
> I'm wondering if I can have tens of these in memory, or if I should
> process them one after the other.
Why not just try it out?
> Moreover, I'm interested in saving these as (value, key) pairs, sorted
> and reversed (i.e. most frequent first). I can do it with the Unix
> sort command, but I wonder how I should do it in Python in a
> memory-friendly way.
>
> Can it be done by using:
>
> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> Or will it be the same as building the whole list?
sorted() needs access to all data, so it'll build a list even if you
feed it a generator. You will have to test it yourself, but I suspect
that the most memory-efficient way to do what you want could be:

import operator

items = d.items()
items.sort(key=operator.itemgetter(1), reverse=True)

The items list would require a couple of megabytes for 150k dictionary
entries, or so. The key map needs some memory too, but the rest of the
sort is done in place.
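(Saving the sorted result afterwards is then just a loop over the list;
a minimal sketch, with the output filename invented for illustration:)

out = open("counts_sorted.txt", "w")
for word, count in items:            # items as sorted above
    out.write("%s %d\n" % (word, count))
out.close()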

</F>

Nov 3 '06 #3
Fredrik Lundh <fr*****@pythonware.com> writes:
> items = d.items()
> items.sort(key=operator.itemgetter(1), reverse=True)
>
> The items list would require a couple of megabytes for 150k dictionary
> entries, or so. The key map needs some memory too, but the rest of
> the sort is done in place.
I think the OP's method avoided the key map.
Nov 3 '06 #4
Paul Rubin wrote:
>> items = d.items()
>> items.sort(key=operator.itemgetter(1), reverse=True)
>>
>> The items list would require a couple of megabytes for 150k
>> dictionary entries, or so. The key map needs some memory too, but
>> the rest of the sort is done in place.
>
> I think the OP's method avoided the key map.
Right. I should have written "most efficient", not "most
memory-efficient". The items+itemgetter approach is a bit faster, but
uses a few extra megabytes of memory temporarily, during the sort.

</F>

Nov 3 '06 #5
Thanks for your replies :)

So I have just tried it, even though I thought it would not run to
completion => I was wrong: it is around 1,400,000 entries per dict...

But maybe it can be done if the keys of the dicts are not duplicated in
memory (as all the dicts will have the same keys, with different count
values)?

Memory is 4 GB of RAM. Is there a good way to know how much RAM is used
directly from Python (or should I rely on 'top' and other Unix
commands)? By now around 220 MB is used for around 200,000 words
handled in 15 dicts.

Nov 3 '06 #6
vd*****@yahoo.fr writes:
> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all the dicts will have the same keys, with different
> count values)?
There will still be a pointer for each key, but the strings themselves
won't be duplicated.
> Memory is 4 GB of RAM,
That sounds like enough RAM to hold all your stuff easily.
> Is there a good way to know how much RAM is used directly from Python
> (or should I rely on 'top' and other Unix commands)?
I'd try resource.getrusage().
> By now around 220 MB is used for around 200,000 words handled in 15
> dicts.
That sounds very manageable given a 4 GB machine. Otherwise, since it
sounds like you're trying to scan some big text corpus and figure out
word frequencies, you could do it the old-fashioned way. Write the
words one per line into a big file, then sort the file with the Unix
sort utility (which is an external sort), then read the sorted file (or
pipe it through "uniq -c") to figure out the counts.
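(That pipeline is also easy to drive from Python; a minimal sketch,
assuming a one-word-per-line file named words.txt and a Unix userland:)

import subprocess

# sort the words, collapse runs into counts, then order by count, descending
pipeline = "sort words.txt | uniq -c | sort -rn"
proc = subprocess.Popen(pipeline, shell=True, stdout=subprocess.PIPE)
for line in proc.stdout:
    count, word = line.split()       # uniq -c prints "  <count> <word>"
    print word, count
proc.wait()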
Nov 3 '06 #7
vd*****@yahoo.fr wrote:
> Thanks for your replies :)
>
> So I have just tried it, even though I thought it would not run to
> completion => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all the dicts will have the same keys, with different
> count values)?
Definitely a good strategy. The easiest way is to use intern(key) when
storing the values. (This will only work if you are using 8-bit
strings. You'll have to maintain your own object cache if you are
using unicode.)
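(A minimal sketch of the effect, assuming plain 8-bit string keys:
interned strings are shared objects, so N dicts over one vocabulary
store each word string only once.)

d1, d2 = {}, {}
w = intern("".join(["some", "word"]))   # built at runtime, then interned
d1[w] = 3
d2[intern("someword")] = 5              # intern returns the same shared object
assert d1.keys()[0] is d2.keys()[0]     # identical object, not just equal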

I've reduced the memory requirements of very similar apps this way.
-Mike

Nov 4 '06 #8

So it's still unfinished :) Around 1 GB for 1,033,268 words :) (comes
from the Unix top command).

Paul, I was also thinking of doing it like that by piping to 'sort |
uniq -c | sort -nr', but I'm pleased if Python can handle it. (Well,
maybe Python is slower? Will check later...)

Klaas, I do not know about the intern construct, I will have a look,
but when googling I first found a post from Raymond Hettinger, so I'm
going to mess with my mental space :)
http://mail.python.org/pipermail/pyt...er/040433.html

Best regards.

Nov 4 '06 #9
vd*****@yahoo.fr wrote:
> Klaas, I do not know about the intern construct, I will have a look,
> but when googling
The description in the documentation is pretty concise, I think:

http://effbot.org/pyref/intern

</F>

Nov 4 '06 #10
Paul Rubin wrote:
>> Is there a good way to know how much RAM is used directly from
>> Python (or should I rely on 'top' and other Unix commands)?
>
> I'd try resource.getrusage().
That should work (under BSD anyway) but unfortunately doesn't under
Linux :-(

From the man page:

CONFORMING TO
SVr4, 4.3BSD. POSIX.1-2001 specifies getrusage(), but only specifies
the fields ru_utime and ru_stime.

and

The above struct was taken from 4.3BSD Reno. Not all fields are
meaningful under Linux. In Linux 2.4 only the fields ru_utime,
ru_stime, ru_minflt, and ru_majflt are maintained. Since Linux 2.6,
ru_nvcsw and ru_nivcsw are also maintained.

I.e. none of the memory-based ones are filled in.

This Linux-only code works, though:

import resource

def memory_used():
    """Return the memory used by this process under Linux, in bytes."""
    return int(file("/proc/self/statm").read().split()[0]) * resource.getpagesize()

E.g.

>>> import resource
>>> def memory_used():
...     return int(file("/proc/self/statm").read().split()[0]) * resource.getpagesize()
...
>>> print memory_used()
4575232
>>> a = 10000000 * "x"
>>> print memory_used()
14577664
>>>
If anyone knows a (unix) portable way of doing this I'd be interested!

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Nov 4 '06 #11

So it has worked :) And it lasted 12h04:56, with 15 dicts of 1,133,755
keys each. I do not know how much RAM was used, as I was not always
monitoring it.

Thanks for all the replies. I'm going to study intern and the other
suggestions; I also hope someone will bring up a Pythonic way to know
memory usage :)

Best.

Nov 4 '06 #12

Just to be sure about intern, it is used as:

>>> d, f = {}, {}
>>> s = "this is a string"
>>> d[intern(s)] = 1
>>> f[intern(s)] = 1

So actually the keys in d and f are pointers to the same interned
string? If so, it can be interesting.
>>> print intern.__doc__
intern(string) -> string

``Intern'' the given string. This enters the string in the (global)
table of interned strings whose purpose is to speed up dictionary
lookups. Return the string itself or the previously interned string
object with the same value.

About the comment here, "(Changed in version 2.3: Interned strings used
to be immortal, but you now need to keep a reference to the interned
string around.)": if the string is used as a key, it is still
referenced, am I right?

Nov 4 '06 #13
<vd*****@yahoo.fr> wrote:
> So I have just tried it, even though I thought it would not run to
> completion => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all the dicts will have the same keys, with different
> count values)?
I've had programs handling dicts with 1.7 million items and it isn't a
great problem, provided you're careful not to duplicate data. Creating
a copy of keys() in a separate list, for example, will inflate memory
usage noticeably.
> Memory is 4 GB of RAM, [ ... ]
Unless you're on a 64-bit OS, that's irrelevant. You'll hit the 2 GB
per-process limit before you start putting a strain on real memory.

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Nov 6 '06 #14
