
Pickling Unicode

bigturtle
P: 19
Using Python 2.6, I am trying to pickle a dictionary (for Chinese pinyin) which contains both Unicode characters in the range 128-255 and 4-byte Unicode characters. I get allergic reactions from pickle.dump() under all protocols.

Here’s a simple test program:
# -*- coding: utf-8 -*-
# Program 1 (protocol 0), program 2 (protocol 2)
import codecs
import pickle

PickleFile = codecs.open('PFile.utf', 'w', 'utf-8')
Str1 = u'lǘelü'
pickle.dump(Str1, PickleFile, protocol=0)   # Error here!
PickleFile.close()

1. Attempting to run this gives the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 10: ordinal not in range(128)
This is understandable, since protocol 0 is strictly ASCII and 0xfc is the character 'ü'.

2. With protocol=2 (or -1) I get a different, more mysterious error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
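Poking at pickle.dumps() suggests where both errors come from (my reading, so take it with a grain of salt): the codecs wrapper first coerces whatever byte string pickle hands it to Unicode with the default ASCII codec before re-encoding it as UTF-8, and both pickle streams contain non-ASCII bytes. In protocol 0 that byte is my 'ü'; in protocol 2 the 0x80 is simply the leading protocol opcode, not one of my characters at all. A rough check:

# -*- coding: utf-8 -*-
import pickle

Str1 = u'lǘelü'
print repr(pickle.dumps(Str1, 0))   # a byte str with the raw 0xfc ('ü') in it
print repr(pickle.dumps(Str1, 2))   # starts with '\x80', the protocol-2 opcode byte
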

Well, let's try to use pickle.dumps() (which DOES work) and store the resulting string in a file.
# -*- coding: utf-8 -*-
# Program 3. Using pickle.dumps()
import codecs
import pickle

Str1 = u'lǘelü'
PickleStr1 = pickle.dumps(Str1)   # So far so good!

SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
SPickleFile.write(PickleStr1)   # Error here!
SPickleFile.close()

3. Running this program, I get the error “can’t decode byte 0xfc in position 10” as in program 1.

Isn’t this horribly, and uselessly, frustrating?? The pickle module has been around long enough not to stub its toes on this dinky example. Or is there something I have missed?

There is a long discussion of this in Issue 2980 on the Python tracker ("Pickle stream for unicode object may contain non-ASCII characters"), which seems to address this problem but does not solve it as far as I can see.

Thank you all for your help & understanding.
Jan 20 '09 #1
7 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
I was able to pickle and unpickle your unicode characters by using cPickle, encoding the string to UTF-8, and decoding the loaded string. I am not sure if this will help you though.
import codecs
import cPickle

str1 = u'luelü'
print
print str1, repr(str1)
# encode the unicode string to a UTF-8 byte string before pickling
str1U = str1.encode("UTF-8")
print repr(str1U)

PickleStr1 = cPickle.dumps(str1U)
SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
SPickleFile.write(PickleStr1)
SPickleFile.close()

# load the byte string back and decode it to unicode again
f = codecs.open('SpFile.utf', 'r', 'utf-8')
str2U = cPickle.load(f)
print repr(str2U)
str2 = str2U.decode("UTF-8")
print str2, repr(str2)
Output:
>>>
luelü u'luel\xfc'
'luel\xc3\xbc'
'luel\xc3\xbc'
luelü u'luel\xfc'
>>>
Jan 20 '09 #2

bigturtle
P: 19
Finally got your solution to work. There are a couple of things I don't understand about it.

I like the idea: you flatten the Unicode out of the string by encoding it to UTF-8 bytes, which pickle then writes out as plain ASCII, store that in a file, then read it back and reverse the process. Here's a version that works for me.
# -*- coding: utf-8 -*-
import codecs
import cPickle

str1 = u'lǘelü'
print "Pickling"
print "str1 [" + repr(str1) + "]"
str1U = str1.encode("UTF-8")
print "str1U [" + repr(str1U) + "]"
PickleStr1 = cPickle.dumps(str1U)
SPickleFile = codecs.open('SpFile.utf', 'w')
SPickleFile.write(PickleStr1)
SPickleFile.close()

print "\nUnpickling"
f = codecs.open('SpFile.utf', 'r')
str2U = cPickle.load(f)
print "str2U [" + repr(str2U) + "]"
str2 = str2U.decode("UTF-8")
print "str2 [" + repr(str2) + "]"
Output:
Pickling
str1 [u'l\u01d8el\xfc']
str1U ['l\xc7\x98el\xc3\xbc']

Unpickling
str2U ['l\xc7\x98el\xc3\xbc']
str2 [u'l\u01d8el\xfc']
Comments:

1. I can't print Unicode strings at all using "print". How do you do it?

2. You specify "UTF-8" both on your input file and your output file, but I think this can't be right. On the output file it doesn't matter since the file is anyhow ASCII. But on the input file it's fatal. (After all, the whole point is that the contents are ASCII.) You get the error

str2U = cPickle.load(f)
UnpicklingError: pickle data was truncated


3. I didn't think it was possible to dump a string using pickle.dumps() and load it back in using pickle.load(). But it works, much to my surprise! The alternative is to replace your assignment to str2U by
PickleStr2 = f.read()
str2U = cPickle.loads(PickleStr2)
Thanks for your help. Sorry not to reply earlier, but now I have settled down in China and have a bit of time.
Jan 27 '09 #3

bvdet
Expert Mod 2.5K+
P: 2,851
bigturtle,

Canada to China - that's a big move! I wish I could better explain Unicode behavior, but I am learning about it myself. I do not use any Unicode in my work. Python 3.0 unifies Unicode and 8-bit strings into the str type.

I don't understand why you cannot print Unicode strings, unless the behavior of 2.6 is different from 2.3, which is what I am using.
>>> str1 = u'luelü'
>>> print str1
luelü
>>>
Jan 27 '09 #4

bigturtle
P: 19
In your example, your test string includes 'ü' (code point 252 = u'\xfc') but not 'ǘ' (u'\u01d8'). It appears that there are three classes of characters:
. 7-bit ASCII (0-127 = u'\x00' - u'\x7F')
. "upper ASCII" (128-255 = u'\x80' - u'\xFF')
. full 2-byte Unicode (u'\u0100' - u'\uFFFF')
Codes 128-255 give problems to some routines because they are neither straight ASCII nor 2-byte Unicode.
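A quick way to see the three classes side by side (purely illustrative, nothing pickle-specific) is to look at each character's UTF-8 encoding:

# -*- coding: utf-8 -*-
# one character from each class and its UTF-8 byte encoding
for ch in [u'l', u'ü', u'ǘ']:
    print repr(ch), '->', repr(ch.encode('utf-8'))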

For me, all the codes above 127 give trouble with print, depending on where they fall. Here's my complete source file. The second line, which declares the encoding of the source file as UTF-8, is required. Note the 'ǘ' in the test string.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

str1 = u'lǘelü'
print str1
Output:
    print str1
UnicodeEncodeError: 'ascii' codec can't encode character u'\u01d8' in position 1: ordinal not in range(128)
If I change the test string to u'lüelǘ', I get this output:
    print str1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
This seems to show that both the non-ASCII characters make "print" choke: it chokes on whichever one comes first.

HOWEVER, if I use the test string u'luelü' with no 2-byte Unicode char, as in your reply, it prints with no problem. Go figure!
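If I had to guess, sys.stdout on my console is using an 8-bit encoding (something like cp1252 or latin-1) that covers codes 128-255 but nothing beyond, which would explain why u'luelü' prints and u'lǘelü' does not. The workarounds I have seen suggested elsewhere (sketched below, not guaranteed on every console) are to encode explicitly before printing, or to wrap sys.stdout once with a UTF-8 writer:

# -*- coding: utf-8 -*-
import codecs
import sys

str1 = u'lǘelü'

# Option 1: encode by hand each time (helps only if the console understands UTF-8)
print str1.encode('utf-8')

# Option 2: wrap sys.stdout once, then plain print of unicode objects works
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print str1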

It might be useful to know what your system does with my two test strings above... if we care.

Finally: If Python 3.0 has finally unified the string type to include Unicode, that's a real good reason for me to change. Thanks for the tip!
Jan 28 '09 #5

bigturtle
P: 19
I have now switched to Python 3.0 and find that most of my problems have gone away. The pickle module works fine for Unicode, since all strings are anyhow Unicode. So no more u'...' in front of Unicode strings.

FYI, here are some things I had to watch out for. There is no longer any need for the codecs module to open Unicode files, since the built-in open() now takes an encoding argument, so input files are specified by

FH = open(FileName, 'r', encoding='utf-8')

and output files the same with 'w'. Pickle files have to be specified as binary:

PFH = open(PickleFileName, 'rb')

or 'wb', depending.
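Put together, a minimal Python 3 round trip for a pinyin-style dictionary might look like the sketch below (the file names and the sample entries are made up):

import pickle

pinyin = {'绿': 'lǜ', '旅': 'lǚ'}   # plain str keys and values, all Unicode

# pickle files are opened in binary mode, with no encoding argument
with open('pinyin.pkl', 'wb') as pfh:
    pickle.dump(pinyin, pfh)

with open('pinyin.pkl', 'rb') as pfh:
    loaded = pickle.load(pfh)
assert loaded == pinyin

# ordinary text files take an explicit encoding instead
with open('pinyin.txt', 'w', encoding='utf-8') as tfh:
    for char, reading in pinyin.items():
        tfh.write(char + ' ' + reading + '\n')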

The print statement has changed to the print() function, and it fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?

Thanks for all your help.
Feb 1 '09 #6

bvdet
Expert Mod 2.5K+
P: 2,851
@bigturtle
That's a good question, and I don't know the answer. Have you looked at sys.setdefaultencoding(name) or codecs.StreamWriter(stream[, errors]) and codecs.getwriter(encoding)? There may be a way to redefine print() to handle your encoding problem. Also look into io.
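Along the lines of the io suggestion, a sketch like this might do it in Python 3, though I haven't tried it myself and it still depends on the console actually displaying UTF-8:

import io
import sys

# rebind sys.stdout with an explicit encoding so print() stops
# falling back to the default (often ASCII) codec
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
print('lǘelü')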

HTH, BV
Feb 2 '09 #7

P: 1
Peace, friends,

You're mixing things up: serialization (pickling) gives you a binary representation of any Python object, Unicode text included.

If you open a file in text mode, and tell Python that it contains text encoded as UTF-8, then obviously you shouldn't be writing binary data (byte arrays, "bytes" in Python 3), such as pickled stuff, to it.

Put your pickles in a binary file. What you read/write from/to a UTF-8 encoded file is Unicode text ("str" in Python 3, right?), which gets automatically encoded and decoded for you.
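In Python 2 terms, that boils down to something like this sketch (the file name is invented, and I haven't run it against your exact data):

# -*- coding: utf-8 -*-
import pickle

Str1 = u'lǘelü'

# the pickle goes into a plain binary file, with no codec anywhere in the path
pf = open('PFile.pkl', 'wb')
pickle.dump(Str1, pf, protocol=2)
pf.close()

pf = open('PFile.pkl', 'rb')
assert pickle.load(pf) == Str1
pf.close()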

OK?
;o)
Feb 5 '09 #8
