very large dictionary

Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Simon
Aug 1 '08 #1
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?
What does "load a dictionary" mean? Was it saved with the `pickle`
module?

How about using a database instead of a dictionary?

Ciao,
Marc 'BlackJack' Rintsch
Aug 1 '08 #2
What does "load a dictionary" mean?

I had a file bigrams.py with a content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
....

}

In another file I said:

from bigrams import bigrams
How about using a database instead of a dictionary?
If there is no other way to do it, I will have to learn how to use
databases in Python. I would prefer to be able to use the same type of
scripts with data of all sizes, though.
Aug 1 '08 #3
Simon Strobl:
I had a file bigrams.py with a content like below:
bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...
}
In another file I said:
from bigrams import bigrams
Probably there's a limit in the module size here. You can try to
change your data format on disk, creating a text file like this:
", djy" 75
", djz" 57
", djzoom" 165
....
Then in a module you can create an empty dict, read the lines of the
data with:
for line in somefile:
    part, n = line.rsplit(" ", 1)
    somedict[part.strip('"')] = int(n)

Otherwise you may have to use a BigTable, a DB, etc.
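To flesh that out, a minimal sketch of both directions in Python 2 (the function
and file names are mine, and it assumes keys never contain a double quote):

def dump_bigrams(bigrams, path):
    # write the dict out once as plain text, one '"key" count' pair per line
    out = open(path, "w")
    try:
        for key, count in bigrams.iteritems():
            out.write('"%s" %d\n' % (key, count))
    finally:
        out.close()

def load_bigrams(path):
    # rebuild the dict line by line instead of importing a huge .py module
    somedict = {}
    for line in open(path):
        part, n = line.rsplit(" ", 1)
        somedict[part.strip('"')] = int(n)
    return somedict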

If there is no other way to do it, I will have to learn how to use
databases in Python. I would prefer to be able to use the same type of
scripts with data of all sizes, though.
I understand; I don't know if there are documented limits for dicts in
64-bit Python.

Bye,
bearophile
Aug 1 '08 #4
Simon Strobl <Si**********@gmail.com> wrote:
>I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?
Let's just eliminate one thing here: this server is running a
64-bit OS, isn't it? Because if it's a 32-bit OS, the blunt
answer is "You can't, no matter how much physical memory you
have" and you're going to have to go down the database route
(or some approach which stores the mapping on disk and only
loads items into memory on demand).

--
\S -- si***@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
"Frankly I have no feelings towards penguins one way or the other"
-- Arthur C. Clarke
her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
Aug 1 '08 #5
On Fri, 01 Aug 2008 14:47:17 +0100, Sion Arrowsmith wrote:
Simon Strobl <Si**********@gmail.com> wrote:
>>I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Let's just eliminate one thing here: this server is running a 64-bit OS,
isn't it? Because if it's a 32-bit OS, the blunt answer is "You can't,
no matter how much physical memory you have" and you're going to have to
go down the database route (or some approach which stores the mapping on
disk and only loads items into memory on demand).
I very highly doubt he has 128GB of main memory and is running a 32bit OS.
Aug 1 '08 #6
Simon Strobl wrote:
Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?

Simon
Take a look at the Python bsddb module. Using btree tables is fast, and
it has the benefit that once the table is open, the programming interface
is identical to a normal dictionary.

http://docs.python.org/lib/bsddb-objects.html
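As a rough sketch of what that looks like (the file name here is made up, and
note that bsddb stores both keys and values as strings, so the counts need
converting):

import bsddb

db = bsddb.btopen("bigrams.db", "c")   # "c" creates the file if it does not exist
db[", djy"] = "75"                     # values must be strings
db[", dk"] = "28893"
print int(db[", djy"])                 # convert back to int when reading
db.close()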

Sean
Aug 2 '08 #8
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
Hello,

I tried to load a 6.8G large dictionary on a server that has 128G of
memory. I got a memory error. I used Python 2.5.2. How can I load my
data?
How do you know the dictionary takes 6.8G?

I'm going to guess an answer to my own question. In a later post, Simon
wrote:

[quote]
I had a file bigrams.py with a content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
....

}
[end quote]
I'm guessing that the file is 6.8G of *text*. How much memory will it
take to import that? I don't know, but probably a lot more than 6.8G. The
compiler has to read the whole file in one giant piece, analyze it,
create all the string and int objects, and only then can it create the
dict. By my back-of-the-envelope calculations, the pointers alone will
require about 5GB, never mind the objects they point to.
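To put a rough number on that (the entry count is pure guesswork): if the 6.8G
file averages around 34 bytes per line, it holds roughly 200 million entries. On
a 64-bit build each dict entry is a hash value plus two pointers, 24 bytes, so
200 million * 24 bytes is already about 4.8GB for the table entries alone, before
the over-allocation the dict keeps to stay sparse and before the string and int
objects themselves.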

I suggest trying to store your data as data, not as Python code. Create a
text file "bigrams.txt" with one key/value per line, like this:

djy : 75
djz : 57
djzoom : 165
dk : 28893
....

Then import it like so:

bigrams = {}
for line in open('bigrams.txt', 'r'):
    key, value = line.split(':')
    bigrams[key.strip()] = int(value.strip())
This will be slower, but because it only needs to read the data one line
at a time, it might succeed where trying to slurp all 6.8G in one piece
will fail.

--
Steven
Aug 2 '08 #9
On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Si**********@gmail.com> wrote:
>What does "load a dictionary" mean?

I had a file bigrams.py with a content like below:

bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...

}

In another file I said:

from bigrams import bigrams
>How about using a database instead of a dictionary?

If there is no other way to do it, I will have to learn how to use
databases in Python.
If you use Berkeley DB ("import bsddb"), you don't have to learn much.
These databases look very much like dictionaries string:string, only
they are disk-backed.

(I assume here that Berkeley DB supports 7GB data sets.)

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se R'lyeh wgah'nagl fhtagn!
Aug 3 '08 #10
On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn <gr********@snipabacken.se> wrote:
On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Si**********@gmail.com> wrote:
....
>>If there is no other way to do it, I will have to learn how to use
>>databases in Python.
>
>If you use Berkeley DB ("import bsddb"), you don't have to learn much.
>These databases look very much like dictionaries string:string, only
>they are disk-backed.
.... all of which Sean pointed out elsewhere in the thread.

Oh well. I guess pointing it out twice doesn't hurt. bsddb has been
very pleasant to work with for me. I normally avoid database
programming like the plague.

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se R'lyeh wgah'nagl fhtagn!
Aug 3 '08 #11
On 3 Aug 2008 20:40:02 GMT, Jorgen Grahn <gr********@snipabacken.se> wrote:
On 3 Aug 2008 20:36:33 GMT, Jorgen Grahn <gr********@snipabacken.se> wrote:
On Fri, 1 Aug 2008 01:05:07 -0700 (PDT), Simon Strobl <Si**********@gmail.com> wrote:

...
>>>If there is no other way to do it, I will have to learn how to use
>>>databases in Python.
>>
>>If you use Berkeley DB ("import bsddb"), you don't have to learn much.
>>These databases look very much like dictionaries string:string, only
>>they are disk-backed.
>
>... all of which Sean pointed out elsewhere in the thread.
>
>Oh well. I guess pointing it out twice doesn't hurt. bsddb has been
>very pleasant to work with for me. I normally avoid database
>programming like the plague.

13.4 shelve -- Python object persistence

A ``shelf'' is a persistent, dictionary-like object. The difference
with ``dbm'' databases is that the values (not the keys!) in a shelf
can be essentially arbitrary Python objects -- anything that the
pickle module can handle. This includes most class instances,
recursive data types, and objects containing lots of shared
sub-objects. The keys are ordinary strings....

[...]
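A minimal sketch of using it for this case (the file name is invented; shelve
keys must be strings, which fits the bigram data):

import shelve

d = shelve.open("bigrams.shelf")   # creates/opens the shelf file on disk
d[", djy"] = 75                    # values can be any picklable object
d[", dk"] = 28893
print d[", djy"]
d.close()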
Aug 3 '08 #12

On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
(You might want to post this to comp.lang.python rather than to me --
I am just another c.l.p reader. If you already have done to, please
disregard this.)
Yeah, I hit "reply" by mistake and didn't realize it. My bad.
>>(I assume here that Berkeley DB supports 7GB data sets.)

If I remember correctly, BerkeleyDB is limited to a single file size
of 2GB.

Sounds likely. But with some luck maybe they have increased this in
later releases? There seem to be many competing Berkeley releases.
It's worth investigating, but that leads me to:
>I haven't caught the earlier parts of this thread, but do I
understand correctly that someone wants to load a 7GB dataset into
the
form of a dictionary?

Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
don't know.

To the OP: how did you measure this?

--
Avi

Aug 3 '08 #13
On 4 Aug., 00:51, Avinash Vora <avinashv...@gmail.com> wrote:
On Aug 4, 2008, at 4:12 AM, Jörgen Grahn wrote:
(You might want to post this to comp.lang.python rather than to me --
I am just another c.l.p reader. If you already have done to, please
disregard this.)

Yeah, I hit "reply" by mistake and didn't realize it. My bad.
>(I assume here that Berkeley DB supports 7GB data sets.)
If I remember correctly, BerkeleyDB is limited to a single file size
of 2GB.
Sounds likely. But with some luck maybe they have increased this in
later releases? There seem to be many competing Berkeley releases.

It's worth investigating, but that leads me to:
I haven't caught the earlier parts of this thread, but do I
understand correctly that someone wants to load a 7GB dataset into
the
form of a dictionary?
Yes, he claimed the dictionary was 6.8 GB. How he measured that, I
don't know.

To the OP: how did you measure this?
I created a python file that contained the dictionary. The size of
this file was 6.8GB. I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create
a .pyc-file

Simon
Aug 4 '08 #14
On Mon, 04 Aug 2008 07:02:16 -0700, Simon Strobl wrote:
I created a python file that contained the dictionary. The size of this
file was 6.8GB.
Ah, that's what I thought you had done. That's not a dictionary. That's a
text file containing the Python code to create a dictionary.

My guess is that a 7GB text file will require significantly more memory
once converted to an actual dictionary: in my earlier post, I estimated
about 5GB for pointers. Total size of the dictionary is impossible to
estimate accurately without more information, but I'd guess that 10GB or
20GB wouldn't be unreasonable.

Have you considered that the operating system imposes per-process limits
on memory usage? You say that your server has 128 GB of memory, but that
doesn't mean the OS will make anything like that available.
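(On Unix you can ask what the OS will actually grant the process -- a quick
sketch using the standard resource module; RLIMIT_AS is not available on every
platform:)

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)   # address-space limit in bytes
print "soft:", soft, "hard:", hard                    # -1 means unlimited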

And I don't know how to even start estimating how much temporary memory
is required to parse and build such an enormous Python program. Not only
is it a 7GB program, but it is 7GB in one statement.

I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create a
.pyc-file
Probably a good example of premature optimization. Out of curiosity, how
long does it take to create it from a text file?

--
Steven
Aug 4 '08 #15
Have you considered that the operating system imposes per-process limits
on memory usage? You say that your server has 128 GB of memory, but that
doesn't mean the OS will make anything like that available.
According to our system administrator, I can use all of the 128G.
I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create a
.pyc-file

Probably a good example of premature optimization.
Well, as I was using Python, I did not expect to have to care about
the language's internal affairs that much. I thought I could simply do
the same thing no matter how large my files get. In other words, I
thought Python was really scalable.
Out of curiosity, how
long does it take to create it from a text file?
I do not remember this exactly. But I think it was not much more than
an hour.

Aug 5 '08 #16
On Mon, 04 Aug 2008 11:02:16 -0300, Simon Strobl <Si**********@gmail.com>
wrote:
I created a python file that contained the dictionary. The size of
this file was 6.8GB. I thought it would be practical not to create the
dictionary from a text file each time I needed it. I.e. I thought
loading the .pyc-file should be faster. Yet, Python failed to create
a .pyc-file
Looks like the marshal format (used to create the .pyc file) can't handle
sizes that big - and that limitation will stay for a while:
http://mail.python.org/pipermail/pyt...ay/073161.html
So follow any of the previous suggestions and store your dictionary as
data, not code.

--
Gabriel Genellina

Aug 5 '08 #17
On Tue, 05 Aug 2008 01:20:08 -0700, Simon Strobl wrote:

I thought it would be practical not to create the dictionary from a
text file each time I needed it. I.e. I thought loading the .pyc-file
should be faster. Yet, Python failed to create a .pyc-file

Probably a good example of premature optimization.

Well, as I was using Python, I did not expect to have to care about the
language's internal affairs that much. I thought I could simply do the
same thing no matter how large my files get. In other words, I thought
Python was really scalable.
Yeah, it really is a pain when abstractions leak.

http://www.joelonsoftware.com/articl...tractions.html

>Out of curiosity, how
long does it take to create it from a text file?

I do not remember this exactly. But I think it was not much more than an
hour.
Hmmm... longer than I expected. Perhaps not as premature as I thought.
Have you tried the performance of the pickle and marshal modules?
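(For the record, a sketch of what the pickle side of that experiment might look
like -- the file name is made up, and cPickle with the binary protocol is the
fast variant:)

import cPickle

# dump once, after building the dict from the text file
f = open("bigrams.pkl", "wb")
cPickle.dump(bigrams, f, 2)   # 2 = binary pickle protocol
f.close()

# later runs: load the pickled dict instead of importing bigrams.py
f = open("bigrams.pkl", "rb")
bigrams = cPickle.load(f)
f.close()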

--
Steven
Aug 5 '08 #18


Simon Strobl wrote:
>
Well, as I was using Python, I did not expect to have to care about
the language's internal affairs that much. I thought I could simply do
the same thing no matter how large my files get. In other words, I
thought Python was really scalable.
Python the language is indefinitely scalable. Finite implementations
are not. CPython is a C program compiled to a system executable. Most
OSes run executables with a fairly limited call stack space.

CPython programs are, when possible, cached as .pyc files. The
existence and format of .pyc's is an internal affair of the CPython
implementation. They are most definitely not a language requirement or
language feature.

Have you tried feeding multigigabyte source code files to other
compilers? Most, if not all, could be broken by the 'right' big-enough
code.

tjr

Aug 5 '08 #19
Simon Strobl wrote:
(snip)
I would prefer to be able to use the same type of
scripts with data of all sizes, though.
Since computers have limited RAM, this will remain a wish. You
obviously can't expect to deal with terabytes of data the way you do
with a 1kb text file.
Aug 6 '08 #20
Bruno Desthuilliers wrote:
Simon Strobl wrote:
(snip)
I would prefer to be able to use the same type of
scripts with data of all sizes, though.

Since computers have limited RAM, this will remain a wish. You
obviously can't expect to deal with terabytes of data the way you do
with a 1kb text file.
You can, you just start off handling the multi-GB case and you're set.
Databases are really easy; I often use them for manipulating pretty
small amounts of data because it's just an easy way to group and join etc.
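(For example, a short sketch with the sqlite3 module that ships with Python 2.5
-- the file and table names are invented:)

import sqlite3

conn = sqlite3.connect("bigrams.db")
conn.execute("CREATE TABLE IF NOT EXISTS bigrams (key TEXT PRIMARY KEY, freq INTEGER)")
conn.execute("INSERT OR REPLACE INTO bigrams VALUES (?, ?)", (", djy", 75))
conn.commit()
print conn.execute("SELECT freq FROM bigrams WHERE key = ?", (", djy",)).fetchone()[0]
conn.close()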

Aug 6 '08 #21
