Bytes IT Community

Sorted and reversed on huge dict ?

Hello,

I would like to apply sorted() and reversed() to the contents of many
huge dictionaries (a single dictionary will contain ~150,000 entries).
Keys are words, values are counts (integers).

I'm wondering if I can keep tens of these in memory, or if I should
process them one after the other.

Moreover, I'm interested in saving these with the keys sorted by value
in reverse order (i.e. most frequent first). I can do it with the Unix
sort command, but I wonder how I should do it in Python in a
memory-friendly way.

Can it be done with:

from itertools import izip
pairs = izip(d.itervalues(), d.iterkeys())
for v, k in reversed(sorted(pairs)):
    print k, v

or will that amount to building the whole list anyway?

Nov 3 '06 #1


vd*****@yahoo.fr writes:
> I would like to apply sorted() and reversed() to the contents of many
> huge dictionaries (a single dictionary will contain ~150,000 entries).
> Keys are words, values are counts (integers).
>
> I'm wondering if I can keep tens of these in memory,

Depends on how much memory your machine has.

> or if I should process them one after the other.

Obviously that's more memory-friendly, if you can do it that way.

> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> or will that amount to building the whole list anyway?

I think the above is pretty good. sorted() necessarily builds and
returns a list, but itervalues/iterkeys, izip, and reversed just build
small iterator objects.

If the lists are really large, your next step after the above is
probably an external sort, but 150,000 entries is not that many, and
anyway, if sorting is a strain on available memory, then having the
dicts in memory at all will probably also be a strain. Maybe you
should start looking into storing the dict contents externally, such
as in a database.
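For what it's worth, a minimal sketch of the database idea using the
stdlib sqlite3 module (the table layout and names here are purely
illustrative, and the snippet is written for current Python):

```python
import sqlite3

# In-memory database for the sketch; pass a filename instead to persist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (word TEXT PRIMARY KEY, n INTEGER)")

d = {"the": 42, "of": 17, "aardvark": 1}   # toy stand-in for a huge dict
conn.executemany("INSERT INTO counts VALUES (?, ?)", d.items())

# Let the database do the sorting; only one row at a time lives in Python.
for word, n in conn.execute("SELECT word, n FROM counts ORDER BY n DESC"):
    print(word, n)
```

The point is that the sort happens inside SQLite, so Python never holds
the full sorted list in memory.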
Nov 3 '06 #2

vd*****@yahoo.fr wrote:
> I would like to apply sorted() and reversed() to the contents of many
> huge dictionaries (a single dictionary will contain ~150,000 entries).
> Keys are words, values are counts (integers).

not sure 150k entries qualify as huge, though, unless you're short on
memory.

> I'm wondering if I can keep tens of these in memory, or if I should
> process them one after the other.

why not just try it out?

> Moreover, I'm interested in saving these with the keys sorted by value
> in reverse order (i.e. most frequent first). I can do it with the Unix
> sort command, but I wonder how I should do it in Python in a
> memory-friendly way.
>
> Can it be done with:
>
> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> or will that amount to building the whole list anyway?

sorted() needs access to all the data, so it'll build a list even if
you feed it a generator. you will have to test it yourself, but I
suspect that the most memory-efficient way to do what you want could be:

import operator

items = d.items()
items.sort(key=operator.itemgetter(1), reverse=True)

the items list would require a couple of megabytes for 150k dictionary
entries, or so. the key map needs some memory too, but the rest of the
sort is done in place.

</F>

Nov 3 '06 #3

Fredrik Lundh <fr*****@pythonware.com> writes:
> items = d.items()
> items.sort(key=operator.itemgetter(1), reverse=True)
>
> the items list would require a couple of megabytes for 150k dictionary
> entries, or so. the key map needs some memory too, but the rest of
> the sort is done in place.

I think the OP's method avoided the key map.
Nov 3 '06 #4

Paul Rubin wrote:
>> items = d.items()
>> items.sort(key=operator.itemgetter(1), reverse=True)
>>
>> the items list would require a couple of megabytes for 150k dictionary
>> entries, or so. the key map needs some memory too, but the rest of
>> the sort is done in place.
>
> I think the OP's method avoided the key map.

right. I should have written "most efficient", not "most memory
efficient". the items+itemgetter approach is a bit faster, but uses a
few extra megabytes of memory temporarily, during the sort.
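For what it's worth, both approaches produce the same ordering; a toy
comparison (written with the modern zip/items spellings, since izip and
itervalues no longer exist in Python 3):

```python
from operator import itemgetter

d = {"the": 42, "of": 17, "aardvark": 1}   # toy stand-in for a huge dict

# OP's approach: swap to (value, key) pairs, sort, reverse; no key map.
pairs = sorted(zip(d.values(), d.keys()), reverse=True)
by_pairs = [(k, v) for v, k in pairs]

# items+itemgetter approach: sort the items by value via a key function.
by_items = sorted(d.items(), key=itemgetter(1), reverse=True)

assert by_pairs == by_items   # same result, different time/memory profiles
```

Note that the two can differ on ties: the (value, key) sort breaks ties
by key, while the itemgetter sort is stable and keeps insertion order.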

</F>

Nov 3 '06 #5

Thanks for your replies :)

So I have just tried it, even though I thought it would not run to the
end => I was wrong: it is around 1,400,000 entries per dict...

But maybe if the keys of the dicts are not duplicated in memory it can
be done (as all dicts will have the same keys, with different (count)
values)?

Memory is 4 GB of RAM. Is there a good way to know how much RAM is
used, directly from Python (or should I rely on 'top' and other Unix
commands)? By now around 220 MB is used for around 200,000 words
handled in 15 dicts.

Nov 3 '06 #6

vd*****@yahoo.fr writes:
> But maybe if the keys of the dicts are not duplicated in memory it can
> be done (as all dicts will have the same keys, with different (count)
> values)?

There will still be a pointer for each key, but the strings themselves
won't be duplicated.

> Memory is 4 GB of RAM.

That sounds like enough RAM to hold all your stuff easily.

> Is there a good way to know how much RAM is used, directly from Python
> (or should I rely on 'top' and other Unix commands)?

I'd try resource.getrusage().

> By now around 220 MB is used for around 200,000 words handled in 15
> dicts.

That sounds very manageable given a 4 GB machine. Otherwise, since it
sounds like you're trying to scan some big text corpus and figure out
word frequencies, you could do it the old-fashioned way. Write the
words one per line into a big file, then sort the file with the Unix
sort utility (which is an external sort), then read the sorted file
(or pipe it through "uniq -c") to figure out the counts.
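The "uniq -c" step of that pipeline can also be reproduced in Python
with itertools.groupby, which only ever holds one run of equal words at
a time; a sketch, where a small list stands in for the already-sorted
file (written for current Python):

```python
from itertools import groupby

# Toy stand-in for a big one-word-per-line file that has already been
# sorted (e.g. by the external Unix sort).
sorted_words = ["of", "of", "the", "the", "the", "zebra"]

# "uniq -c" in Python: count each run of equal items in the sorted stream.
counts = [(word, sum(1 for _ in run)) for word, run in groupby(sorted_words)]

# Most frequent first, like "sort -nr" on the counted output.
counts.sort(key=lambda pair: pair[1], reverse=True)
print(counts)
```

With a real file you would iterate over the file object instead of a
list, so the full word stream never needs to be in memory at once.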
Nov 3 '06 #7

vd*****@yahoo.fr wrote:
> Thanks for your replies :)
>
> So I have just tried it, even though I thought it would not run to the
> end => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe if the keys of the dicts are not duplicated in memory it can
> be done (as all dicts will have the same keys, with different (count)
> values)?

Definitely a good strategy. The easiest way is to use intern(key) when
storing the values. (This will only work if you are using 8-bit
strings; you'll have to maintain your own object cache if you are
using unicode.)

I've reduced the memory requirements of very similar apps this way.

-Mike
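A sketch of what interning buys (note that in current Python the
builtin has moved to sys.intern; the Python 2 code in this thread would
call intern() directly, and the object-identity behaviour shown is
CPython's):

```python
import sys

# Two equal strings built at runtime: equal values, distinct objects.
w1 = "".join(["spam", "eggs"])
w2 = "".join(["spam", "eggs"])
assert w1 == w2 and w1 is not w2

d, f = {}, {}
d[sys.intern(w1)] = 1   # both dicts end up pointing at the single
f[sys.intern(w2)] = 2   # interned copy of the string

assert next(iter(d)) is next(iter(f))   # one string object, two dicts
```

With many dicts sharing the same large vocabulary, this means each word
is stored once rather than once per dict.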

Nov 4 '06 #8

So it is still unfinished :) it is at around 1 GB for 1,033,268 words
(figure from the top Unix command).

Paul, I was also thinking of doing it like that, by piping to 'sort |
uniq -c | sort -nr', but I'm pleased if Python can handle it (well,
maybe Python is slower? will check later...).

Klaas, I did not know about the intern construct, I will have a look;
but when googling I first found a post from Raymond Hettinger, so I'm
going to mess with my mental space :)
http://mail.python.org/pipermail/pyt...er/040433.html

Best regards.

Nov 4 '06 #9

vd*****@yahoo.fr wrote:
> Klaas, I did not know about the intern construct, I will have a look;
> but when googling

the description in the documentation is pretty concise, I think:

http://effbot.org/pyref/intern

</F>

Nov 4 '06 #10

Paul Rubin wrote:
>> Is there a good way to know how much RAM is used, directly from Python
>> (or should I rely on 'top' and other Unix commands)?
>
> I'd try resource.getrusage().

That should work (under BSD anyway) but unfortunately doesn't under
Linux :-(

From the man page:

    CONFORMING TO
        SVr4, 4.3BSD. POSIX.1-2001 specifies getrusage(), but only
        specifies the fields ru_utime and ru_stime.

and

        The above struct was taken from 4.3BSD Reno. Not all fields are
        meaningful under Linux. In Linux 2.4 only the fields ru_utime,
        ru_stime, ru_minflt, and ru_majflt are maintained. Since Linux
        2.6, ru_nvcsw and ru_nivcsw are also maintained.

I.e. none of the memory-based fields are filled in.

This Linux-only code works, though:

import resource

def memory_used():
    """Return the memory used by this process under Linux, in bytes"""
    return int(file("/proc/self/statm").read().split()[0]) * resource.getpagesize()

E.g.

>>> import resource
>>> def memory_used():
...     return int(file("/proc/self/statm").read().split()[0]) * resource.getpagesize()
...
>>> print memory_used()
4575232
>>> a = 10000000 * "x"
>>> print memory_used()
14577664
>>>

If anyone knows a (unix) portable way of doing this I'd be interested!
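For the record, later kernels than the ones discussed here (Linux
2.6.32 onwards) do fill in ru_maxrss, so on current Unix systems the
resource module can report peak resident set size without reading
/proc; a sketch, with the caveat that the field's units vary by
platform:

```python
import resource
import sys

usage = resource.getrusage(resource.RUSAGE_SELF)

# ru_maxrss is peak resident set size: kilobytes on Linux, bytes on macOS.
scale = 1 if sys.platform == "darwin" else 1024
print("peak RSS in bytes:", usage.ru_maxrss * scale)
```

This reports the peak rather than the current usage, so the
/proc/self/statm approach above is still the one to use for
point-in-time measurements on Linux.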

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Nov 4 '06 #11

So it has worked :) and it took 12h04m56s: 15 dicts with 1,133,755
keys. I do not know how much RAM was used, as I was not always
monitoring it.

Thanks for all the replies. I'm going to study intern and the other
suggestions; I also hope someone will bring up a Pythonic way to know
memory usage :)

Best.

Nov 4 '06 #12

Just to be sure about intern, it is used as:

>>> d, f = {}, {}
>>> s = "this is a string"
>>> d[intern(s)] = 1
>>> f[intern(s)] = 1

so the keys in d and f are actually pointers to the same interned
string? If so, it can be interesting.

>>> print intern.__doc__
intern(string) -> string

``Intern'' the given string.  This enters the string in the (global)
table of interned strings whose purpose is to speed up dictionary
lookups.  Return the string itself or the previously interned string
object with the same value.

About the comment here: "(Changed in version 2.3: Interned strings used
to be immortal, but you now need to keep a reference to the interned
string around.)". If the string is used as a key, it is still
referenced, am I right?

Nov 4 '06 #13

<vd*****@yahoo.fr> wrote:
> So I have just tried it, even though I thought it would not run to the
> end => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe if the keys of the dicts are not duplicated in memory it can
> be done (as all dicts will have the same keys, with different (count)
> values)?

I've had programs handling dicts with 1.7 million items and it isn't
a great problem, provided you're careful not to duplicate data.
Creating a copy of keys() in a separate list, for example, will
inflate memory usage noticeably.

> Memory is 4 GB of RAM, [ ... ]

Unless you're on a 64-bit OS, that's irrelevant. You'll hit the 2 GB
per-process limit before you start putting a strain on real memory.

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
___ | "Frankly I have no feelings towards penguins one way or the other"
\X/ | -- Arthur C. Clarke
her nu becomež se bera eadward ofdun hlęddre heafdes bęce bump bump bump
Nov 6 '06 #14
