pickle alternative

I've written a simple module which serializes these python types:

IntType, TupleType, StringType, FloatType, LongType, ListType, DictType

It is available for perusal here:

http://aspn.activestate.com/ASPN/Coo.../Recipe/415503

It appears to work faster than pickle, however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?
Sw.

Jul 19 '05 #1
14 Replies
simonwittber wrote:
I've written a simple module which serializes these python types:

IntType, TupleType, StringType, FloatType, LongType, ListType, DictType
For simple data types consider "marshal" as an alternative to "pickle".
It appears to work faster than pickle, however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?

def dec_int_type(data):
    value = int(unpack('!i', data.read(4))[0])
    return value

That 'int' isn't needed -- unpack returns an int, not a string
representation of the int.

BTW, your code won't work on 64 bit machines.
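
That is, on a 64-bit build a Python int can hold values that don't
fit the 4-byte standard-size "!i" format, so encoding them fails.
An untested illustration:

from struct import pack

n = 2**40        # on a 64-bit build this is still an int, not a long
pack("!i", n)    # fails: 2**40 does not fit in the 4 bytes "!i" packs
pack("!q", n)    # "!q" always packs 8 bytes, on any platform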

def enc_long_type(obj):
    return "%s%s%s" % ("B", pack("!L", len(str(obj))), str(obj))

There's no need to compute str(long) twice -- for large longs
it takes a lot of work to convert to base 10. For that matter,
it's faster to convert to hex, and the hex form is more compact.
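
For example, something like this (untested; it assumes a
non-negative long):

def enc_long_type(obj):
    # hex() is much cheaper than str() for a big long, and the result
    # is shorter.  hex(10L) == '0xaL', so strip the '0x' and the 'L'.
    s = hex(obj)[2:-1]
    return "B" + pack("!L", len(s)) + s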

Every decode you do requires several function calls. While
less elegant, you'll likely get better performance (test it!)
if you minimize that; try something like this

from cStringIO import StringIO
import struct

def decode(data):
    return _decode(StringIO(data).read)

def _decode(read, unpack = struct.unpack):
    code = read(1)
    if not code:
        raise IOError("reached the end of the file")
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "L":
        count = unpack("!i", read(4))[0]
        return [_decode(read) for i in range(count)]
    if code == "D":
        count = unpack("!i", read(4))[0]
        # assumes each of the count entries decodes to a (key, value) tuple
        return dict([_decode(read) for i in range(count)])
    ...

Andrew
da***@dalkescientific.com

Jul 19 '05 #2
> For simple data types consider "marshal" as an alternative to "pickle".
From the marshal documentation:

Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.
BTW, your code won't work on 64 bit machines.
Any idea how this might be solved? The number of bytes used has to be
consistent across platforms. I guess this means I cannot use the struct
module?
There's no need to compute str(long) twice -- for large longs
it takes a lot of work to convert to base 10. For that matter,
it's faster to convert to hex, and the hex form is more compact.


Thanks for the tip.

Sw.

Jul 19 '05 #3
simonwittber wrote:
From the marshal documentation:

Warning: The marshal module is not intended to be secure against
erroneous or maliciously constructed data. Never unmarshal data
received from an untrusted or unauthenticated source.


Ahh, I had forgotten that. Though I can't recall what an attack
might be, I think it's because the C code hasn't been fully vetted
for unexpected error conditions.
Any idea how this might be solved? The number of bytes used has to be
consistent across platforms. I guess this means I cannot use the struct
module?


How do you want to solve it? Should a 64 bit machine be able to read
a data stream made on a 32 bit machine? What about vice versa? How
are floats interconverted?

You could preface the output stream with a description of the encoding
used: version number, size of float, size of int (which should always
be sizeof float these days, I think). Read these then use that
information to figure out which decode/dispatch function to use.
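
An untested sketch of such a header (the magic string, field layout,
and helper names here are all invented for illustration):

import struct
import sys

HEADER_MAGIC = "SW"   # hypothetical magic bytes identifying the format
HEADER_VERSION = 1    # bump whenever the wire format changes

def write_header(write):
    # One byte each for the format version and the byte width of a
    # native Python int, so the reader can pick between "!i" and "!q".
    if sys.maxint > 2**31 - 1:
        int_size = 8
    else:
        int_size = 4
    write(HEADER_MAGIC)
    write(struct.pack("!BB", HEADER_VERSION, int_size))

def read_header(read):
    if read(2) != HEADER_MAGIC:
        raise ValueError("not a recognized stream")
    version, int_size = struct.unpack("!BB", read(2))
    return version, int_size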

Andrew
da***@dalkescientific.com

Jul 19 '05 #4

Ahh, I had forgotten that. Though I can't recall what an attack
might be, I think it's because the C code hasn't been fully vetted
for unexpected error conditions.


I tried out the marshal module anyway.

marshal can serialize small structures very quickly, however, using the
below test value:

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

marshal took 7.90 seconds to serialize it into a 5000061 length string.
decode took 0.08 seconds.

The aforementioned recipe took 2.53 seconds to serialize it into a
5000087 length string. decode took 5.16 seconds, which is much longer
than marshal!!

Sw.

Jul 19 '05 #5
simonwittber wrote:
marshal can serialize small structures very quickly, however, using the
below test value:

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

marshal took 7.90 seconds to serialize it into a 5000061 length string.
decode took 0.08 seconds.


Strange. Here's what I found:

>>> value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]
>>> import time, marshal
>>> t1 = time.time(); s = marshal.dumps(value); t2 = time.time()
>>> t2 - t1
0.22474002838134766
>>> len(s)
5000061
>>> t1 = time.time(); new_value = marshal.loads(s); t2 = time.time()
>>> t2 - t1
0.3606879711151123
>>> new_value == value
True


I can't reproduce your large times for marshal.dumps. Could you
post your test code?

Andrew
da***@dalkescientific.com

Jul 19 '05 #6
> I can't reproduce your large times for marshal.dumps. Could you
post your test code?

Certainly:

import sencode
import marshal
import time

value = [r for r in xrange(1000000)] + [{1:2,3:4,5:6},{"simon":"wittber"}]

t = time.clock()
x = marshal.dumps(value)
print "marshal enc T:", time.clock() - t

t = time.clock()
x = marshal.loads(x)
print "marshal dec T:", time.clock() - t

t = time.clock()
x = sencode.dumps(value)
print "sencode enc T:", time.clock() - t
t = time.clock()
x = sencode.loads(x)
print "sencode dec T:", time.clock() - t

Jul 19 '05 #7
simonwittber posted his test code.

I took the code from the cookbook, called it "sencode" and
added these two lines

dumps = encode
loads = decode

I then ran your test code (unchanged except that my newsreader
folded the "value = ..." line) and got

marshal enc T: 0.21
marshal dec T: 0.4
sencode enc T: 7.76
sencode dec T: 11.56

This is with Python 2.3; the stock one provided by Apple
for my Mac.

I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.

BTW, I tried the performance approach I outlined earlier.
The numbers aren't much better

marshal enc T: 0.2
marshal dec T: 0.38
sencode2 enc T: 7.16
sencode2 dec T: 9.49

I changed the format a little bit; dicts are treated a bit
differently.

from struct import pack, unpack
from cStringIO import StringIO

class EncodeError(Exception):
    pass

class DecodeError(Exception):
    pass

def encode(data):
    f = StringIO()
    _encode(data, f.write)
    return f.getvalue()

def _encode(data, write, pack = pack):
    # The original code used the equivalent of "type(data) is list";
    # I preserve that behavior.

    T = type(data)

    if T is int:
        write("I")
        write(pack("!i", data))
    elif T is list:
        write("L")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is tuple:
        write("T")
        write(pack("!L", len(data)))
        # Assumes len and 'for ... in' aren't lying
        for item in data:
            _encode(item, write)
    elif T is str:
        write("S")
        write(pack("!L", len(data)))
        write(data)
    elif T is long:
        s = hex(data)[2:-1]
        write("B")
        write(pack("!i", len(s)))
        write(s)
    elif T is type(None):
        write("N")
    elif T is float:
        write("F")
        write(pack("!f", data))
    elif T is dict:
        write("D")
        write(pack("!L", len(data)))
        for k, v in data.items():
            _encode(k, write)
            _encode(v, write)
    else:
        raise EncodeError((data, T))

def decode(s):
    """
    Decode a binary string into the original Python types.
    """
    buffer = StringIO(s)
    return _decode(buffer.read)

def _decode(read, unpack = unpack):
    code = read(1)
    if code == "I":
        return unpack("!i", read(4))[0]
    if code == "D":
        size = unpack("!L", read(4))[0]
        x = [_decode(read) for i in range(size*2)]
        return dict(zip(x[0::2], x[1::2]))
    if code == "T":
        size = unpack("!L", read(4))[0]
        return tuple([_decode(read) for i in range(size)])
    if code == "L":
        size = unpack("!L", read(4))[0]
        return [_decode(read) for i in range(size)]
    if code == "N":
        return None
    if code == "S":
        size = unpack("!L", read(4))[0]
        return read(size)
    if code == "F":
        return unpack("!f", read(4))[0]
    if code == "B":
        size = unpack("!L", read(4))[0]
        return long(read(size), 16)
    raise DecodeError(code)

dumps = encode
loads = decode

I wonder if this could be improved by a "struct2" module
which could compile a pack/unpack format once. E.g.,

float_struct = struct2.struct("!f")

float_struct.pack(f)
return float_struct.unpack('?\x80\x00\x00')[0]

which might be the same as

return float_struct.unpack1('?\x80\x00\x00')
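
A sketch of that API; struct.Struct, added to the standard library
in Python 2.5, provides exactly this compile-once behavior:

from struct import Struct  # Python 2.5 and later

float_struct = Struct("!f")            # compiled once, reused many times

data = float_struct.pack(1.0)          # -> '?\x80\x00\x00'
value = float_struct.unpack(data)[0]   # -> 1.0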

Andrew
da***@dalkescientific.com

Jul 19 '05 #8

Andrew Dalke wrote:
This is with Python 2.3; the stock one provided by Apple
for my Mac.

Ahh, that is the difference. I'm running Python 2.4. I've checked my
benchmarks on a friends machine, also in Python 2.4, and received the
same results as my machine.
I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.


It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.

Thanks for your feedback Andrew!

Sw.

Jul 19 '05 #9
simonwittber wrote:
It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.


Interesting. Hadn't noticed that change. Is dump(StringIO()) as
slow?

Andrew
da***@dalkescientific.com

Jul 19 '05 #10
si**********@gmail.com wrote:
Andrew Dalke wrote:
This is with Python 2.3; the stock one provided by Apple
for my Mac.


Ahh that is the difference. I'm running Python 2.4. I've checked my
benchmarks on a friends machine, also in Python 2.4, and received the
same results as my machine.
I expected the numbers to be like this because the marshal
code is used to make and read the .pyc files and is supposed
to be pretty fast.


It would appear that the new version 1 format introduced in Python 2.4
is much slower than version 0, when using the dumps function.


Not so for me. My benchmarks suggest no change between 2.3 and 2.4.

Reinhold
Jul 19 '05 #11
Running stest.py produced these results for me:

marshal enc T: 12.5195908977
marshal dec T: 0.134508715493
sencode enc T: 3.75118904777
sencode dec T: 5.86602012267
11.9369997978
0.109000205994
True

Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
on win32

Notice the slow "marshal enc"oding.
Overall this recipe is faster than marshal for me.

Mark

Jul 19 '05 #12
si**********@gmail.com writes:
It appears to work faster than pickle, however, the decode process is
much slower (5x) than the encode process. Has anyone got any tips on
ways I might speed this up?


I think you should implement it as a C extension and/or write a PEP.
This has been an unfilled need in Python for a while (SF RFE 467384).

Note that using marshal is inappropriate, not only for security
reasons, but because marshal is explicitly NOT guaranteed to
interoperate across differing Python versions. You cannot assume that
an object marshalled in Python 2.4 will unmarshal correctly in 2.5.
Jul 19 '05 #13
Ok, I've attached the proto PEP below.

Comments on the proto PEP and the implementation are appreciated.

Sw.

Title: Secure, standard serialization of simple Python types.

Abstract

This PEP suggests the addition of a module to the standard library,
which provides a serialization class for simple Python types.

Copyright

This document is placed in the public domain.

Motivation

The standard library currently provides two modules which are used
for object serialization. Pickle is not secure by its very nature,
and the marshal module is clearly marked as being not secure in the
documentation. The marshal module does not guarantee compatibility
between Python versions. The proposed module will only serialize
simple built-in Python types, and provide compatibility across
Python versions.

See RFE 467384 (on SourceForge) for more discussion on the above
issues.

Specification

The proposed module should use the same API as the marshal module.

dump(value, file)
    # serialize value, and write it to an open file object
load(file)
    # read data from a file object, deserialize, and return an object
dumps(value)
    # return the string that would be written to the file by dump
loads(value)
    # deserialize a string and return an object
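
A usage sketch (assuming the module is importable as "gherkin", per
the reference implementation below; the sample value is arbitrary):

import gherkin

value = {"simon": [1, 2.5, (3L, None)]}

# string round trip
s = gherkin.dumps(value)
assert gherkin.loads(s) == value

# file round trip
f = open("data.gherkin", "wb")
gherkin.dump(value, f)
f.close()
f = open("data.gherkin", "rb")
assert gherkin.load(f) == value
f.close()
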
Reference Implementation

http://metaplay.dyndns.org:82/~simon/gherkin.py.txt

Rationale

The marshal documentation explicitly states that it is unsuitable
for unmarshalling untrusted data. It also explicitly states that
the format is not compatible across Python versions.

Pickle is compatible across versions, but also unsafe for loading
untrusted data. Exploits demonstrating pickle vulnerability exist.

xmlrpclib provides serialization functions, but is unsuitable when
serializing large data structures, or when high performance is a
requirement. If performance is an issue, a C-based accelerator
module can be installed. If size is an issue, gzip can be used;
however, this creates a mutually exclusive size/performance
trade-off.

Other existing formats, such as JSON and Bencode (BitTorrent), do
not handle some marginally complex Python structures and/or all
the simple Python types.

Time and space efficiency, and security do not have to be mutually
exclusive features of a serializer. Python does not provide, in the
standard library, a serializer which can work safely with untrusted
data which is time and space efficient. The proposed gherkin module
goes some way to achieving this. The format is simple enough to
easily write interoperable implementations across platforms.

Jul 21 '05 #14
On 4 Jul 2005 19:45:07 -0700, rumours say that si**********@gmail.com
might have written:
Time and space efficiency, and security do not have to be mutually
exclusive features of a serializer. Python does not provide, in the
standard library, a serializer which can work safely with untrusted
data which is time and space efficient. The proposed gherkin module
goes some way to achieving this. The format is simple enough to
easily write interoperable implementations across platforms.


I cannot readily check the source code because your web server listens
on port 82 (we're behind a strict firewall), so I don't know if my
following question has a reason to exist, but there we go:

Have you considered basing your module on xdrlib, which is more of a
cross-language standard?
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
Jul 21 '05 #15
