pickle alternative

simonwittber

I've written a simple module which serializes these python types:

IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

It is available for perusal here:

http://aspn.activestate.com/ASPN/Coo.../Recipe/415503

It appears to work faster than pickle, however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?
Sw.

Jul 19 '05 #1


Andrew Dalke

simonwittber wrote:

> I've written a simple module which serializes these python types:
>
> IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

For simple data types consider "marshal" as an alternative to "pickle".

> It appears to work faster than pickle, however, the decode process is
> much slower (5x) than the encode process. Has anyone got any tips on
> ways I might speed this up?

def dec_int_type(data):
    value = int(unpack('!i', data.read(4))[0])
    return value

That 'int' isn't needed -- unpack returns an int not a string
representation of the int.

BTW, your code won't work on 64 bit machines.

def enc_long_type(obj):
    return "%s%s%s" % ("B", pack("!L", len(str(obj))), str(obj))

There's no need to compute str(long) twice -- for large longs
it takes a lot of work to convert to base 10. For that matter,
it's faster to convert to hex, and the hex form is more compact.
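
As a rough illustration of the hex suggestion, here is a minimal sketch written for Python 3 (where long and int are one type). The names enc_big_int and dec_big_int are hypothetical and are not part of the recipe; the "B" type code and the 4-byte length prefix follow the recipe's layout, and negative values are deliberately left out.

```python
import struct

def enc_big_int(n):
    """Encode a non-negative int as b'B' + 4-byte length + hex digits."""
    if n < 0:
        raise ValueError("this sketch handles non-negative ints only")
    # Convert once and reuse the result for both the length and the payload.
    digits = format(n, "x").encode("ascii")
    return b"B" + struct.pack("!L", len(digits)) + digits

def dec_big_int(data):
    """Inverse of enc_big_int."""
    if data[:1] != b"B":
        raise ValueError("not a big-int record")
    (size,) = struct.unpack("!L", data[1:5])
    return int(data[5:5 + size].decode("ascii"), 16)

n = 2 ** 200
encoded = enc_big_int(n)
assert dec_big_int(encoded) == n
# The hex form needs fewer digits than the base-10 form.
assert len(format(n, "x")) < len(str(n))
```

The conversion happens exactly once per value, which is the point of the first tip; the shorter payload is a side benefit of the second.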

Every decode you do requires several function calls. While
less elegant, you'll likely get better performance (test it!)
if you minimize that; try something like this

import struct
from cStringIO import StringIO

def decode(data):
    return _decode(StringIO(data).read)

def _decode(read, unpack = struct.unpack):
    code = read(1)
    if not code:
        raise IOError("reached the end of the file")
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "L":
        # unpack returns a tuple, so take element [0]
        count = unpack("!i", read(4))[0]
        return [_decode(read) for i in range(count)]
    if code == "D":
        count = unpack("!i", read(4))[0]
        # each decoded item is assumed to be a (key, value) pair
        return dict([_decode(read) for i in range(count)])
    ...
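
For readers on Python 3, the same single-function dispatch can be sketched as follows. This is only an illustration under stated assumptions: it reads bytes through io.BytesIO, uses "!L" for counts, and stores a dict as alternating key and value records (the layout used later in this thread), none of which is taken from the original recipe.

```python
import io
import struct

def decode(data):
    return _decode(io.BytesIO(data).read)

def _decode(read, unpack=struct.unpack):
    # One function handles every type code, avoiding a call per type.
    code = read(1)
    if not code:
        raise EOFError("reached the end of the data")
    if code == b"I":
        return unpack("!i", read(4))[0]
    if code == b"F":
        return unpack("!f", read(4))[0]
    if code == b"L":
        count = unpack("!L", read(4))[0]
        return [_decode(read) for _ in range(count)]
    if code == b"D":
        count = unpack("!L", read(4))[0]
        # Assumed layout: key, value, key, value, ...
        items = [_decode(read) for _ in range(count * 2)]
        return dict(zip(items[0::2], items[1::2]))
    raise ValueError("unknown type code %r" % code)

# Hand-built record: a list holding the int 7 and the dict {1: 2}.
sample = (b"L" + struct.pack("!L", 2)
          + b"I" + struct.pack("!i", 7)
          + b"D" + struct.pack("!L", 1)
          + b"I" + struct.pack("!i", 1)
          + b"I" + struct.pack("!i", 2))
assert decode(sample) == [7, {1: 2}]
```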

Andrew
da***@dalkescientific.com

Jul 19 '05 #2

simonwittber

> For simple data types consider "marshal" as an alternative to "pickle".

From the marshal documentation:

Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.

> BTW, your code won't work on 64 bit machines.

Any idea how this might be solved? The number of bytes used has to be
consistent across platforms. I guess this means I cannot use the struct
module?

> There's no need to compute str(long) twice -- for large longs
> it takes a lot of work to convert to base 10. For that matter,
> it's faster to convert to hex, and the hex form is more compact.

Thanks for the tip.

Sw.

Jul 19 '05 #3

Andrew Dalke

simonwittber wrote:

> From the marshal documentation: Warning: The marshal module is not
> intended to be secure against erroneous or maliciously constructed
> data. Never unmarshal data received from an untrusted or
> unauthenticated source.

Ahh, I had forgotten that. Though I can't recall what an attack
might be, I think it's because the C code hasn't been fully vetted
for unexpected error conditions.

> Any idea how this might be solved? The number of bytes used has to be
> consistent across platforms. I guess this means I cannot use the struct
> module?

How do you want to solve it? Should a 64 bit machine be able to read
a data stream made on a 32 bit machine? What about vice versa? How
are floats interconverted?

You could preface the output stream with a description of the encoding
used: version number, size of float, size of int (which should always
be sizeof float these days, I think). Read these then use that
information to figure out which decode/dispatch function to use.
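
A small sketch of the header idea, written for Python 3. One relevant detail from the struct documentation: when the format string starts with "!" (or "<", ">", "="), struct uses standard sizes instead of the platform's native C sizes, so "!i" is 4 bytes on both 32-bit and 64-bit builds. The MAGIC marker, FORMAT_VERSION and the header layout below are hypothetical choices made for illustration only.

```python
import struct

# Standard sizes apply whenever a byte-order prefix such as "!" is used.
assert struct.calcsize("!i") == 4
assert struct.calcsize("!q") == 8
assert struct.calcsize("!d") == 8

MAGIC = b"SENC"                    # hypothetical stream marker
FORMAT_VERSION = 1                 # hypothetical version number
HEADER = struct.Struct("!4sBBB")   # magic, version, int size, float size

def write_header(int_fmt="!q", float_fmt="!d"):
    """Describe the encoding used for the rest of the stream."""
    return HEADER.pack(MAGIC, FORMAT_VERSION,
                       struct.calcsize(int_fmt), struct.calcsize(float_fmt))

def read_header(data):
    """Return (version, int_size, float_size) or raise ValueError."""
    magic, version, int_size, float_size = HEADER.unpack(data[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a recognised stream")
    return version, int_size, float_size
```

A decoder could read this header first and then pick the matching set of unpack formats, as described above.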

Andrew
da***@dalkescientific.com

Jul 19 '05 #4

simonwittber

> Ahh, I had forgotten that. Though I can't recall what an attack
> might be, I think it's because the C code hasn't been fully vetted
> for unexpected error conditions.

I tried out the marshal module anyway.

marshal can serialize small structures very quickly, however, using the
below test value:

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

marshal took 7.90 seconds to serialize it into a 5000061 length string.
decode took 0.08 seconds.

The aforementioned recipe took 2.53 seconds to serialize it into a
5000087 length string. decode took 5.16 seconds, which is much longer
than marshal!!

Sw.

Jul 19 '05 #5

Andrew Dalke

simonwittber wrote:

> marshal can serialize small structures very quickly, however, using the
> below test value:
>
> value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]
>
> marshal took 7.90 seconds to serialize it into a 5000061 length string.
> decode took 0.08 seconds.

Strange. Here's what I found:

>>> value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]
>>> import time, marshal
>>> t1=time.time(); s=marshal.dumps(value);t2=time.time()
>>> t2-t1
0.22474002838134766
>>> len(s)
5000061
>>> t1=time.time(); new_value=marshal.loads(s);t2=time.time()
>>> t2-t1
0.3606879711151123
>>> new_value == value
True

I can't reproduce your large times for marshal.dumps. Could you
post your test code?

Andrew
da***@dalkescientific.com

Jul 19 '05 #6

simonwittber

> I can't reproduce your large times for marshal.dumps. Could you
> post your test code?

Certainly:

import sencode
import marshal
import time

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

t = time.clock()
x = marshal.dumps(value)
print "marshal enc T:", time.clock() - t

t = time.clock()
x = marshal.loads(x)
print "marshal dec T:", time.clock() - t

t = time.clock()
x = sencode.dumps(value)
print "sencode enc T:", time.clock() - t

t = time.clock()
x = sencode.loads(x)
print "sencode dec T:", time.clock() - t
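
For anyone repeating the marshal half of this measurement on a current interpreter, the following is a minimal Python 3 sketch. time.clock is no longer available in recent Python 3 releases, so it uses time.perf_counter instead; the helper name time_call is hypothetical, and the test value is ten times smaller than the one above to keep the run short. The timings it prints will differ from those quoted in this thread.

```python
import marshal
import time

def time_call(func, *args):
    """Return (elapsed_seconds, result) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result

value = list(range(100000)) + [{1: 2, 3: 4, 5: 6}, {"simon": "wittber"}]

enc_time, blob = time_call(marshal.dumps, value)
dec_time, restored = time_call(marshal.loads, blob)

print("marshal enc T: %.4f" % enc_time)
print("marshal dec T: %.4f" % dec_time)
```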

Jul 19 '05 #7

Andrew Dalke

simonwittber posted his test code.

I took the code from the cookbook, called it "sencode" and
added these two lines

dumps = encode
loads = decode

I then ran your test code (unchanged except that my newsreader
folded the "value = ..." line) and got

marshal enc T: 0.21
marshal dec T: 0.4
sencode enc T: 7.76
sencode dec T: 11.56

This is with Python 2.3; the stock one provided by Apple
for my Mac.

I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.

BTW, I tried the performance approach I outlined earlier.
The numbers aren't much better

marshal enc T: 0.2
marshal dec T: 0.38
sencode2 enc T: 7.16
sencode2 dec T: 9.49

I changed the format a little bit; dicts are treated a bit
differently.

from struct import pack, unpack
from cStringIO import StringIO

class EncodeError(Exception):
    pass

class DecodeError(Exception):
    pass

def encode(data):
    f = StringIO()
    _encode(data, f.write)
    return f.getvalue()

def _encode(data, write, pack = pack):
    # The original code uses the equivalent of "type(data) is list"
    # I preserve that behavior

    T = type(data)

    if T is int:
        write("I")
        write(pack("!i", data))
    elif T is list:
        write("L")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is tuple:
        write("T")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is str:
        write("S")
        write(pack("!L", len(data)))
        write(data)
    elif T is long:
        # hex(5L) is '0x5L'; strip the '0x' prefix and the 'L' suffix.
        # Negative longs are not handled by this slice.
        s = hex(data)[2:-1]
        write("B")
        write(pack("!i", len(s)))
        write(s)
    elif T is type(None):
        write("N")
    elif T is float:
        write("F")
        write(pack("!f", data))
    elif T is dict:
        write("D")
        write(pack("!L", len(data)))
        # Keys and values are written as alternating records
        for k, v in data.items():
            _encode(k, write)
            _encode(v, write)
    else:
        raise EncodeError((data, T))

def decode(s):
    """
    Decode a binary string into the original Python types.
    """
    buffer = StringIO(s)
    return _decode(buffer.read)

def _decode(read, unpack = unpack):
    code = read(1)
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "D":
        size = unpack("!L", read(4))[0]
        # Read alternating keys and values, then pair them up
        x = [_decode(read) for i in range(size*2)]
        return dict(zip(x[0::2], x[1::2]))
    if code == "T":
        size = unpack("!L", read(4))[0]
        return tuple([_decode(read) for i in range(size)])
    if code == "L":
        size = unpack("!L", read(4))[0]
        return [_decode(read) for i in range(size)]
    if code == "N":
        return None
    if code == "S":
        size = unpack("!L", read(4))[0]
        return read(size)
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "B":
        size = unpack("!L", read(4))[0]
        return long(read(size), 16)
    raise DecodeError(code)

dumps = encode
loads = decode

I wonder if this could be improved by a "struct2" module
which could compile a pack/unpack format once. Eg,

float_struct = struct2.struct("!f")

float_struct.pack(f)
return float_struct.unpack('?\x80\x00\x00')[0]

which might be the same as

return float_struct.unpack1('?\x80\x00\x00')
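
Later versions of the standard library (Python 2.5 onwards) include struct.Struct, which is close to the "struct2" idea: the format string is compiled once and the resulting object exposes pack and unpack methods. A minimal Python 3 sketch, reusing the float_struct name from above:

```python
import struct

# The format string is parsed once, when the Struct object is created.
float_struct = struct.Struct("!f")

packed = float_struct.pack(1.0)
assert packed == b"?\x80\x00\x00"          # big-endian 32-bit float for 1.0
assert float_struct.unpack(packed)[0] == 1.0
assert float_struct.size == 4
```

Whether this actually speeds up the decoder is something to measure rather than assume; unpack still returns a tuple, so the [0] index remains.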

Andrew
da***@dalkescientific.com

Jul 19 '05 #8

simonwittber

Andrew Dalke wrote:

> This is with Python 2.3; the stock one provided by Apple
> for my Mac.

Ahh that is the difference. I'm running Python 2.4. I've checked my
benchmarks on a friend's machine, also in Python 2.4, and got the
same results as on my machine.

> I expected the numbers to be like this because the marshal
> code is used to make and read the .pyc files and is supposed
> to be pretty fast.

It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.

Thanks for your feedback Andrew!

Sw.

Jul 19 '05 #9

Andrew Dalke

simonwittber wrote:

> It would appear that the new version 1 format introduced in Python 2.4
> is much slower than version 0, when using the dumps function.

Interesting. Hadn't noticed that change. Is dump(StringIO()) as
slow?
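
One way to probe the version question is to pass the optional version argument to marshal.dumps and compare the formats directly. The sketch below is written for Python 3 and only checks that each format round-trips and reports the encoded size; it does not reproduce the Python 2.4 timings discussed here, and the cause of that slowdown is not established in this thread.

```python
import marshal

value = list(range(1000)) + [{1: 2, 3: 4, 5: 6}, {"simon": "wittber"}]

sizes = {}
for version in (0, 1, 2):
    # The second argument selects the marshal data format version.
    blob = marshal.dumps(value, version)
    assert marshal.loads(blob) == value
    sizes[version] = len(blob)

print(sizes)
```

Wrapping each dumps call in a timer would turn this into a per-version benchmark.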

Andrew
da***@dalkescientific.com

Jul 19 '05 #10
