
cPickle alternative?

Hello,
I have a huge problem with loading a very simple structure into memory.
It is a list of tuples; it takes 6 MB and consists of 100,000 elements.

import cPickle

plik = open("mealy", "r")
mealy = cPickle.load(plik)
plik.close()


This takes about 30 seconds!
How can I accelerate it?

Thanks in adv.

Jul 18 '05 #1


Drochom wrote:
Hello,
I have a huge problem with loading a very simple structure into memory.
It is a list of tuples; it takes 6 MB and consists of 100,000 elements.

import cPickle

plik = open("mealy", "r")
mealy = cPickle.load(plik)
plik.close()

This takes about 30 seconds!
How can I accelerate it?

Thanks in adv.


What protocol did you pickle your data with? The default (protocol 0,
ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
save your data with the new protocol 2 -- it's likely to be fastest.
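
For example, a minimal sketch of re-saving with protocol 2 (assuming "mealy" is the already-loaded list; the output name "mealy2" is just an example):

import cPickle

plik = open("mealy2", "wb")     # protocols >= 1 are binary, so open in binary mode
cPickle.dump(mealy, plik, 2)    # protocol 2 requires Python 2.3+
plik.close()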
Alex

Jul 18 '05 #2

Hi,

I have no idea! I used a similar scheme the other day and made some
benchmarks (I *like* benchmarks!)

About 6 MB took 4 seconds dumping as well as loading on an 800 MHz P3 laptop.
When using binary mode it went down to about 1.5 seconds (and space to 2 MB).

This is OK, because I generally have trouble being faster than 1 MB/sec
with my 2" drive, processor and Python ;-)

Python 2.3 seems to have an even more effective "protocol mode 2".

Maybe your structures are *very* complex??

Kindly
Michael P

"Drochom" <pe******@gazeta.pl> schrieb im Newsbeitrag
news:bh**********@atlantis.news.tpi.pl...
[original question quoted in full -- snipped; see #1]

Jul 18 '05 #3

Drochom wrote:
What protocol did you pickle your data with? The default (protocol 0,
ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
save your data with the new protocol 2 -- it's likely to be fastest.
Alex


Thanks:)
I'm using the default protocol. I'm not sure I can upgrade that simply,
because I'm using many modules for Py2.2.


Then use protocol 1 instead -- that has been the binary pickle protocol
for a long time, and works perfectly on Python 2.2.x :-)
(and it's much faster than protocol 0 -- the text protocol)
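
For example, a minimal sketch of converting the existing file once (the output name "mealy.bin" is made up for illustration):

import cPickle

data = cPickle.load(open("mealy", "r"))   # the old protocol-0 (text) file
f = open("mealy.bin", "wb")               # binary mode is required for protocol 1
cPickle.dump(data, f, 1)
f.close()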

--Irmen

Jul 18 '05 #4

Drochom wrote:
Thanks for the help:)
Here is a simple example. Frankly speaking, it's a graph with 100,000 nodes.

STRUCTURE:
[(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
 (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]


Perhaps this matches your spec:

from random import randrange
import pickle, cPickle, time

source = [(chr(randrange(33, 127)), randrange(100000), randrange(i + 50))
          for i in range(100000)]

def timed(module, flag, name='file.tmp'):
    start = time.time()
    dest = file(name, 'wb')
    module.dump(source, dest, flag)
    dest.close()
    mid = time.time()
    dest = file(name, 'rb')
    result = module.load(dest)
    dest.close()
    stop = time.time()
    assert source == result
    return mid - start, stop - mid

On 2.2:
timed(pickle, 0): (7.8, 5.5)
timed(pickle, 1): (9.5, 6.2)
timed(cPickle, 0): (0.41, 4.9)
timed(cPickle, 1): (0.15, .53)

On 2.3:
timed(pickle, 0): (6.2, 5.3)
timed(pickle, 1): (6.6, 5.4)
timed(pickle, 2): (6.5, 3.9)

timed(cPickle, 0): (6.2, 5.3)
timed(cPickle, 1): (.88, .69)
timed(cPickle, 2): (.80, .67)

(Not tightly controlled -- I'd guess 1.5 digits)

-Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #5


"Michael Peuser" <mp*****@web.de> wrote in message
news:bh*************@news.t-online.com...
OK -- I modified my test program; let it run on your machine.
It took 1.5 seconds -- I made it 1 million records to get to 2 MByte.
Kindly
Michael
------------------
import cPickle as Pickle
from time import clock

# generate 1,000,000 records
r = [(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
     (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]

x = []
for i in xrange(1000000):
    x.append(r)

print len(x), "records"

t0 = clock()
f = open("test", "w")
Pickle.dump(x, f, 1)
f.close()
print "out=", clock() - t0

t0 = clock()
f = open("test")
x = Pickle.load(f)
f.close()
print "in=", clock() - t0
---------------------


Hi, I'm really grateful for your help.
I've modified your code a bit; check your times and tell me what they are.

TRY THIS:

import cPickle as Pickle
from time import clock
from random import randrange

x = []
for i in xrange(20000):
    c = []
    for j in xrange(randrange(2, 25)):
        c.append((chr(randrange(33, 120)), randrange(1, 100000), randrange(1, 3)))
    c = tuple(c)
    x.append(c)
    if i % 1000 == 0: print i   # it will help you to survive the waiting...
print len(x), "records"

t0 = clock()
f = open("test", "w")
Pickle.dump(x, f, 0)
f.close()
print "out=", clock() - t0

t0 = clock()
f = open("test")
x = Pickle.load(f)
f.close()
print "in=", clock() - t0

Thanks once again:)

Jul 18 '05 #6

Hello,

If speed is important, you may want to do different things depending on,
e.g., what is in those tuples, and whether they are all the same length,
etc. E.g., if they were all fixed-length tuples of integers, you could do
hugely better than storing the data as a list of tuples.

Those tuples have different lengths indeed.

You could store the whole thing in an mmap image, with a length-prefixed
pickle string in the front representing index info.

If I only knew how to do it... :-)

Find a way to avoid doing it? Or doing much of it?
What are your access needs once the data is accessible?

My structure stores a finite state automaton with a Polish dictionary
(a lexicon, to be more precise) and it should be loaded once, but fast!
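
A rough sketch of that length-prefixed mmap layout (untested, and every name here is invented for illustration):

import cPickle
import mmap
import struct

def save(index, payload, name):
    # index: any picklable structure describing where records sit in payload
    blob = cPickle.dumps(index, 1)
    f = open(name, "wb")
    f.write(struct.pack("<I", len(blob)))   # 4-byte length prefix
    f.write(blob)                           # pickled index info up front
    f.write(payload)                        # raw record bytes
    f.close()

def load(name):
    f = open(name, "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n = struct.unpack("<I", m[:4])[0]
    index = cPickle.loads(m[4:4 + n])
    return index, m, 4 + n                  # records start at offset 4+n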

Thx
Regards,
Przemo Drochomirecki

Jul 18 '05 #7

I forgot to explain why I use tuples instead of lists:
I was squeezing a lexicon => minimization of the automaton => using a
dictionary => using hashable objects => using tuples (lists aren't hashable).
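
For example, in an interactive session:

>>> d = {}
>>> d[('a', 4, 0)] = 1      # a tuple is hashable: fine as a dict key
>>> d[['a', 4, 0]] = 1      # a list is not
Traceback (most recent call last):
  ...
TypeError: list objects are unhashable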
Regards,
Przemo Drochomirecki
Jul 18 '05 #8


[Scott David Daniels' benchmark code and timings quoted in full -- snipped; see #5]

Hello, and thanks -- your code was extremely helpful:)

Regards
Przemo Drochomirecki
Jul 18 '05 #9

On Sat, 16 Aug 2003 00:41:42 +0200, "Drochom" <pe******@gazeta.pl> wrote:
[quoted exchange snipped -- see #7]

I wonder how much space it would take to store the complete Polish word
list, one entry each, in a Python dictionary. 300k words of 6-7 characters
avg? Say 2 MB plus the dict hash stuff. I bet it would be fast.

Is that in effect what you are doing, except sort of like a regex state
machine to match words character by character?
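
A quick sketch to test that guess (the word-list file "polish_words.txt", one word per line, is hypothetical):

import time

t0 = time.time()
words = {}
for line in open("polish_words.txt"):
    words[line.strip()] = 1     # dict used as a set; Python 2.2 has no set type
print len(words), "words loaded in", time.time() - t0, "seconds"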

Regards,
Bengt Richter
Jul 18 '05 #10

Hi Drochom,

(1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
the "new Pickle" in 2.3 -- same result: "EOF error" when loading back...)
Maybe someone is interested in fixing this ....

(2) I ran your code and -- as you noticed -- it takes some time to *generate*
the data structure. To be fair, pickle has to do the same when loading, so it
cannot be *significantly* faster!!!
The size of the file was 5.5 MB.

(3) Timings (2.2):
Generation of data: 18 secs
Dumping: 3.2 secs
Loading: 19.4 secs

(4) I couldn't refrain from running it under 2.3:
Generation of data: 8.5 secs !!!!
Dumping: 6.4 secs !!!!
Loading: 5.7 secs

So your program might really improve when changing to 2.3 -- and if anyone
can fix the cPickle bug, protocol mode 2 will be even more efficient.

Kindly
Michael

"Drochom" <pe******@gazeta.pl> schrieb im Newsbeitrag
news:bh**********@nemesis.news.tpi.pl...
[....] TRY THIS: [benchmark code quoted in full -- snipped; see #6]

Jul 18 '05 #11

Drochom wrote:
import cPickle

plik = open("mealy","r")
mealy = cPickle.load(plik)
plik.close()
this takes about 30 seconds!
How can I accelerate it?


Perhaps it's worth looking into PyTables:

<http://pytables.sourceforge.net/doc/PyCon.html#section4>
Cheers,

// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #12

"Michael Peuser" <mp*****@web.de> writes:
Hi Drochom,

(1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
the "new Pickle" in 2.3 -- same result: "EOF error" when loading back...)
Maybe someone is interested in fixing this ....

[snip]
f=open ("test","w") [snip] f=open ("test")

[snip]

Note that on windows, you must open binary files using binary mode
when reading and writing them, like so:

f = open('test', 'wb')
f = open('test', 'rb')
^^^^

If you don't do this, binary data will be corrupted by the automatic
conversion of '\n' to '\r\n' by win32. This is very likely what is
causing the above error.
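
For instance, a tiny demonstration (the file name 'demo.bin' is made up):

f = open('demo.bin', 'w')       # text mode: wrong for binary data
f.write('\n')                   # on win32 this writes '\r\n' -- two bytes
f.close()
print len(open('demo.bin', 'rb').read())   # 2 on Windows, 1 elsewhere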

--
Tim Evans
Jul 18 '05 #13

So stupid of me :-(((

Now here are the benchmarks I got from Drochom's dataset. I think it should
suffice to use the binary mode of 2.2. (I checked the 2.3 data on a different
disk the other day -- that made them not comparable!! I now use the same disk
for the tests.)

Timings (2.2.2):
Generation of data: 18 secs
Dumping: 3 secs
Loading: 18.5 secs
Filesize: 5.5 MB

Binary dump: 2.4 secs
Binary load: 3 secs
Filesize: 2.8 MB

2.3:
Generation of data: 9 secs
Dumping: 2.4 secs
Loading: 2.8 secs
Binary dump: 1 sec
Binary load: 1.9 secs
Filesize: 2.8 MB

Mode 2 dump: 0.9 secs
Mode 2 load: 1.7 secs
Filesize: 2.6 MB

The much faster time for generating the data in 2.3 could be due to an
improved random generator (?). That had always been quite slow.

Kindly
Michael P

"Tim Evans" <t.*****@paradise.net.nz> schrieb im Newsbeitrag
news:87************@cassandra.evansnet...
"Michael Peuser" <mp*****@web.de> writes:
Hi Drochem,

(1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
the "new Pickle" in 2.3 - same result: "EOF error" when loading back...) May be there is someone interested in fixing this ....

[snip] f=open ("test","w") [snip] f=open ("test")

[snip]

Note that on windows, you must open binary files using binary mode
when reading and writing them, like so:

f = open('test', 'wb')
f = open('test', 'rb')
^^^^

If you don't do this binary data will be corrupted by the automatic
conversion of '\n' to '\r\n' by win32. This is very likely what is
causing the above error.

--
Tim Evans

Jul 18 '05 #14
