
Pickling Unicode

bigturtle
P: 19
Using Python 2.6, I am trying to pickle a dictionary (for Chinese pinyin) which contains both Unicode characters in the range 128-255 and 4-byte Unicode characters. I get allergic reactions from pickle.dump() under all protocols.

Here’s a simple test program:
# -*- coding: utf-8 -*-
# Program 1 (protocol 0), program 2 (protocol 2)
import codecs
import pickle

PickleFile = codecs.open('PFile.utf', 'w', 'utf-8')
Str1 = u'lǘelü'
pickle.dump(Str1, PickleFile, protocol=0)   # Error here!
PickleFile.close()

1. Attempting to run this gives the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 10: ordinal not in range(128)
This is understandable, since protocol 0 is strictly ASCII and 0xfc is the character 'ü'.

2. With protocol=2 (or -1) I get a different, more mysterious error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
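Poking at pickle.dumps() suggests where both errors come from (my reading, so take it with a grain of salt): the codecs wrapper first coerces whatever byte string pickle hands it to Unicode with the default ASCII codec before re-encoding it as UTF-8, and both pickle streams contain non-ASCII bytes. In protocol 0 that byte is my 'ü'; in protocol 2 the 0x80 is simply the leading protocol opcode, not one of my characters at all. A rough check:

# -*- coding: utf-8 -*-
import pickle

Str1 = u'lǘelü'
print repr(pickle.dumps(Str1, 0))   # a byte str with the raw 0xfc ('ü') in it
print repr(pickle.dumps(Str1, 2))   # starts with '\x80', the protocol-2 opcode byte
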

Well, let's try to use pickle.dumps() (which DOES work) and store the resulting string in a file.
# -*- coding: utf-8 -*-
# Program 3. Using pickle.dumps()
import codecs
import pickle

Str1 = u'lǘelü'
PickleStr1 = pickle.dumps(Str1)   # So far so good!

SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
SPickleFile.write(PickleStr1)   # Error here!
SPickleFile.close()

3. Running this program, I get the error “can’t decode byte 0xfc in position 10” as in program 1.

Isn’t this horribly, and uselessly, frustrating?? The pickle module has been around long enough not to stub its toes on this dinky example. Or is there something I have missed?

There is a long discussion of this in Issue 2980 on the Python tracker ("Pickle stream for unicode object may contain non-ASCII characters"), which seems to address this problem but does not solve it as far as I can see.

Thank you all for your help & understanding.
Jan 20 '09 #1
7 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
I was able to pickle and unpickle your unicode characters by using cPickle, encoding the string to UTF-8, and decoding the loaded string. I am not sure if this will help you though.
import codecs
import cPickle

str1 = u'luelü'
print
print str1, repr(str1)
# encode the unicode string to a UTF-8 byte string before pickling
str1U = str1.encode("UTF-8")
print repr(str1U)

PickleStr1 = cPickle.dumps(str1U)
SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
SPickleFile.write(PickleStr1)
SPickleFile.close()

# load the byte string back and decode it to unicode again
f = codecs.open('SpFile.utf', 'r', 'utf-8')
str2U = cPickle.load(f)
print repr(str2U)
str2 = str2U.decode("UTF-8")
print str2, repr(str2)
Output:
>>>
luelü u'luel\xfc'
'luel\xc3\xbc'
'luel\xc3\xbc'
luelü u'luel\xfc'
>>>
Jan 20 '09 #2

bigturtle
P: 19
Finally got your solution to work. There are a couple of things I don't understand about it.

I like the idea: you flatten the Unicode out of the string by encoding it to UTF-8 bytes, which pickle then writes out as plain ASCII, store that in a file, then read it back and reverse the process. Here's a version that works for me.
# -*- coding: utf-8 -*-
import codecs
import cPickle

str1 = u'lǘelü'
print "Pickling"
print "str1 [" + repr(str1) + "]"
str1U = str1.encode("UTF-8")
print "str1U [" + repr(str1U) + "]"
PickleStr1 = cPickle.dumps(str1U)
SPickleFile = codecs.open('SpFile.utf', 'w')
SPickleFile.write(PickleStr1)
SPickleFile.close()

print "\nUnpickling"
f = codecs.open('SpFile.utf', 'r')
str2U = cPickle.load(f)
print "str2U [" + repr(str2U) + "]"
str2 = str2U.decode("UTF-8")
print "str2 [" + repr(str2) + "]"
Output:
Pickling
str1 [u'l\u01d8el\xfc']
str1U ['l\xc7\x98el\xc3\xbc']

Unpickling
str2U ['l\xc7\x98el\xc3\xbc']
str2 [u'l\u01d8el\xfc']
Comments:

1. I can't print Unicode strings at all using "print". How do you do it?

2. You specify "UTF-8" both on your input file and your output file, but I think this can't be right. On the output file it doesn't matter since the file is anyhow ASCII. But on the input file it's fatal. (After all, the whole point is that the contents are ASCII.) You get the error

str2U = cPickle.load(f)
UnpicklingError: pickle data was truncated


3. I didn't think it was possible to dump a string using pickle.dumps() and load it back in using pickle.load(). But it works, much to my surprise! The alternative is to replace your assignment to str2U by
PickleStr2 = f.read()
str2U = cPickle.loads(PickleStr2)
Thanks for your help. Sorry not to reply earlier, but now I have settled down in China and have a bit of time.
Jan 27 '09 #3

bvdet
Expert Mod 2.5K+
P: 2,851
bigturtle,

Canada to China - that's a big move! I wish I could better explain Unicode behavior, but I am learning about it myself. I do not use any Unicode in my work. Python 3.0 unifies Unicode and 8-bit strings into the str type.

I don't understand why you cannot print Unicode strings, unless the behavior of 2.6 is different from 2.3, which is what I am using.
>>> str1 = u'luelü'
>>> print str1
luelü
>>>
Jan 27 '09 #4

bigturtle
P: 19
In your example, your test string includes 'ü' (code point 252 = u'\xfc') but not 'ǘ' (u'\u01d8'). It appears that there are three classes of characters:
. 7-bit ASCII (0-127 = u'\x00' - u'\x7F')
. "upper ASCII" (128-255 = u'\x80' - u'\xFF')
. full 2-byte Unicode (u'\u0100' - u'\uFFFF')
Codes 128-255 give problems to some routines because they are neither straight ASCII nor 2-byte Unicode.
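A quick way to see the three classes side by side (purely illustrative, nothing pickle-specific) is to look at each character's UTF-8 encoding:

# -*- coding: utf-8 -*-
# one character from each class and its UTF-8 byte encoding
for ch in [u'l', u'ü', u'ǘ']:
    print repr(ch), '->', repr(ch.encode('utf-8'))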

For me, all the codes above 127 give trouble with print, depending on where they fall. Here's my complete source file. The second line, which declares the encoding of the source file as UTF-8, is required. Note the 'ǘ' in the test string.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

str1 = u'lǘelü'
print str1
Output:
    print str1
UnicodeEncodeError: 'ascii' codec can't encode character u'\u01d8' in position 1: ordinal not in range(128)
If I change the test string to u'lüelǘ', I get this output:
    print str1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
This seems to show that both the non-ASCII characters make "print" choke: it chokes on whichever one comes first.

HOWEVER, if I use the test string u'luelü' with no 2-byte Unicode char, as in your reply, it prints with no problem. Go figure!
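If I had to guess, sys.stdout on my console is using an 8-bit encoding (something like cp1252 or latin-1) that covers codes 128-255 but nothing beyond, which would explain why u'luelü' prints and u'lǘelü' does not. The workarounds I have seen suggested elsewhere (sketched below, not guaranteed on every console) are to encode explicitly before printing, or to wrap sys.stdout once with a UTF-8 writer:

# -*- coding: utf-8 -*-
import codecs
import sys

str1 = u'lǘelü'

# Option 1: encode by hand each time (helps only if the console understands UTF-8)
print str1.encode('utf-8')

# Option 2: wrap sys.stdout once, then plain print of unicode objects works
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print str1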

It might be useful to know what your system does with my two test strings above... if we care.

Finally: If Python 3.0 has finally unified the string type to include Unicode, that's a real good reason for me to change. Thanks for the tip!
Jan 28 '09 #5

bigturtle
P: 19
I have now switched to Python 3.0 and find that most of my problems have gone away. The pickle module works fine for Unicode, since all strings are anyhow Unicode. So no more u'...' in front of Unicode strings.

FYI, here are some things I had to watch out for. There is no longer any need for the codecs module to open Unicode files, since the built-in open() now takes an encoding argument, so input files are specified by

FH = open(FileName, 'r', encoding='utf-8')

and output files the same with 'w'. Pickle files have to be specified as binary:

PFH = open(PickleFileName, 'rb')

or 'wb', depending.
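Put together, a minimal Python 3 round trip for a pinyin-style dictionary might look like the sketch below (the file names and the sample entries are made up):

import pickle

pinyin = {'绿': 'lǜ', '旅': 'lǚ'}   # plain str keys and values, all Unicode

# pickle files are opened in binary mode, with no encoding argument
with open('pinyin.pkl', 'wb') as pfh:
    pickle.dump(pinyin, pfh)

with open('pinyin.pkl', 'rb') as pfh:
    loaded = pickle.load(pfh)
assert loaded == pinyin

# ordinary text files take an explicit encoding instead
with open('pinyin.txt', 'w', encoding='utf-8') as tfh:
    for char, reading in pinyin.items():
        tfh.write(char + ' ' + reading + '\n')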

The print statement has changed to the print() function, and it fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?

Thanks for all your help.
Feb 1 '09 #6

bvdet
Expert Mod 2.5K+
P: 2,851
@bigturtle
That's a good question, and I don't know the answer. Have you looked at sys.setdefaultencoding(name) or codecs.StreamWriter(stream[, errors]) and codecs.getwriter(encoding)? There may be a way to redefine print() to handle your encoding problem. Also look into io.
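Along the lines of the io suggestion, a sketch like this might do it in Python 3, though I haven't tried it myself and it still depends on the console actually displaying UTF-8:

import io
import sys

# rebind sys.stdout with an explicit encoding so print() stops
# falling back to the default (often ASCII) codec
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
print('lǘelü')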

HTH, BV
Feb 2 '09 #7

P: 1
Peace, friends,

You're mixing things up: serialization (pickling) gives you a binary representation of any Python object, Unicode text included.

If you open a file in text mode, and tell Python that it contains text encoded as UTF-8, then obviously you shouldn't be writing binary data (byte arrays, "bytes" in Python 3), such as pickled stuff, to it.

Put your pickles in a binary file. What you read/write from/to a UTF-8 encoded file is Unicode text ("str" in Python 3, right?), which gets automatically encoded and decoded for you.
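In Python 2 terms, that boils down to something like this sketch (the file name is invented, and I haven't run it against your exact data):

# -*- coding: utf-8 -*-
import pickle

Str1 = u'lǘelü'

# the pickle goes into a plain binary file, with no codec anywhere in the path
pf = open('PFile.pkl', 'wb')
pickle.dump(Str1, pf, protocol=2)
pf.close()

pf = open('PFile.pkl', 'rb')
assert pickle.load(pf) == Str1
pf.close()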

OK?
;o)
Feb 5 '09 #8
