Pickling Unicode

bigturtle
Using Python 2.6, I am trying to pickle a dictionary (for Chinese pinyin) which contains both Unicode characters in the range 128-255 and 4-byte Unicode characters. I get allergic reactions from pickle.dump() under all protocols.

Here’s a simple test program:
    # Program 1 (protocol 0), program 2 (protocol 2)
    import codecs
    import pickle

    PickleFile = codecs.open('PFile.utf', 'w', 'utf-8')
    Str1 = u'lǘelü'
    pickle.dump(Str1, PickleFile, protocol=0)  # Error here!
    PickleFile.close()

1. Attempting to run this gives the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 10: ordinal not in range(128)
This is understandable, since protocol 0 is strictly ASCII and 0xfc is the character 'ü'.

2. With protocol=2 (or -1) I get a different, more mysterious error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
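A note on where byte 0x80 comes from: a pickle written with protocol 2 or higher begins with a marker byte, 0x80, followed by the protocol number. A codecs writer expects text, so it first tries to decode the byte string as ASCII and fails on that very first byte. The sketch below reproduces both decode failures in memory. It assumes Python 3 (the thread starts on 2.6), where pickle.dumps() returns bytes; the helper name ascii_failure is made up for illustration.

```python
import pickle

text = 'l\u01d8el\u00fc'  # the test string from the post, written with escapes


def ascii_failure(data):
    """Return the UnicodeDecodeError raised by decoding data as ASCII, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return exc
    return None


# Protocol 2: the stream opens with the marker byte 0x80 and the version byte.
data2 = pickle.dumps(text, protocol=2)
err2 = ascii_failure(data2)
print(data2[:2], err2)

# Protocol 0: mostly printable, but code points 128-255 are written as raw bytes.
data0 = pickle.dumps(text, protocol=0)
err0 = ascii_failure(data0)
print(data0, err0)
```

So neither protocol produces a stream that is safe to treat as ASCII text, which is what Issue 2980 is about.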

Well, let's try to use pickle.dumps() (which DOES work) and store the resulting string in a file.
    # Program 3. Using pickle.dumps()
    import codecs
    import pickle

    Str1 = u'lǘelü'
    PickleStr1 = pickle.dumps(Str1)  # So far so good!

    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1)  # Error here!
    SPickleFile.close()

3. Running this program, I get the error “can’t decode byte 0xfc in position 10” as in program 1.

Isn’t this horribly, and uselessly, frustrating?? The pickle module has been around long enough not to stub its toes on this dinky example. Or is there something I have missed?

There is a long discussion of this in Python tracker Issue 2980, "Pickle stream for unicode object may contain non-ASCII characters", which seems to address this problem but, as far as I can see, does not solve it.

Thank you all for your help & understanding.
Jan 20 '09 #1
bvdet
2,851 Expert Mod 2GB
I was able to pickle and unpickle your unicode characters by using cPickle, encoding the string to UTF-8, and decoding the loaded string. I am not sure if this will help you though.
    import codecs
    import cPickle

    str1 = u'lǘelü'
    print
    print str1, repr(str1)
    str1U = str1.encode("UTF-8")
    print repr(str1U)

    PickleStr1 = cPickle.dumps(str1U)
    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1)
    SPickleFile.close()


    f = codecs.open('SpFile.utf', 'r', 'utf-8')
    str2U = cPickle.load(f)
    print repr(str2U)
    str2 = str2U.decode("UTF-8")
    print str2, repr(str2)
Output:
    >>> 
    luelü u'luel\xfc'
    'luel\xc3\xbc'
    'luel\xc3\xbc'
    luelü u'luel\xfc'
    >>> 
Jan 20 '09 #2
Finally got your solution to work. There are a couple of things I don't understand about it.

I like the idea: you flatten all the Unicode out of the string by changing all the Unicode to ASCII encodings, store it in a file as ASCII, then read it in and reverse the process. Here's a version that works for me.
    import codecs
    import cPickle

    str1 = u'lǘelü'
    print "Pickling"
    print "str1 [" + repr(str1) + "]"
    str1U = str1.encode("UTF-8")
    print "str1U [" + repr(str1U) + "]"
    PickleStr1 = cPickle.dumps(str1U)
    SPickleFile = codecs.open('SpFile.utf', 'w')
    SPickleFile.write(PickleStr1)
    SPickleFile.close()

    print "\nUnpickling"
    f = codecs.open('SpFile.utf', 'r')
    str2U = cPickle.load(f)
    print "str2U [" + repr(str2U) + "]"
    str2 = str2U.decode("UTF-8")
    print "str2 [" + repr(str2) + "]"
Output:
    Pickling
    str1 [u'l\u01d8el\xfc']
    str1U ['l\xc7\x98el\xc3\xbc']

    Unpickling
    str2U ['l\xc7\x98el\xc3\xbc']
    str2 [u'l\u01d8el\xfc']
Comments:

1. I can't print Unicode strings at all using "print". How do you do it?

2. You specify "UTF-8" both on your input file and your output file, but I think this can't be right. On the output file it doesn't matter since the file is anyhow ASCII. But on the input file it's fatal. (After all, the whole point is that the contents are ASCII.) You get the error

str2U = cPickle.load(f)
UnpicklingError: pickle data was truncated


3. I didn't think it was possible to dump a string using pickle.dumps() and load it back in using pickle.load(). But it works, much to my surprise! The alternative is to replace your assignment to str2U by

    PickleStr2 = f.read()
    str2U = cPickle.loads(PickleStr2)
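On point 3, the two pairs of functions read and write the same pickle format; the only difference is whether the bytes are returned as a value or go through a file object. A minimal sketch in Python 3 (an assumption, the code above is Python 2), using in-memory streams in place of files and made-up dictionary entries:

```python
import io
import pickle

# Hypothetical sample entries, for illustration only.
value = {'l\u00fc': 1, 'l\u01d8': 2}

# dumps() returns the pickle as a bytes object.
pickled = pickle.dumps(value)

# A binary file holding exactly those bytes can be read with load() ...
via_load = pickle.load(io.BytesIO(pickled))

# ... or read in full and handed to loads().
via_loads = pickle.loads(io.BytesIO(pickled).read())

# The other direction works too: write with dump(), read back with loads().
stream = io.BytesIO()
pickle.dump(value, stream)
via_dump = pickle.loads(stream.getvalue())

print(via_load == value, via_loads == value, via_dump == value)
```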
Thanks for your help. Sorry not to reply earlier, but now I have settled down in China and have a bit of time.
Jan 27 '09 #3
bvdet
2,851 Expert Mod 2GB
bigturtle,

Canada to China - that's a big move! I wish I could better explain Unicode behavior, but I am learning about it myself. I do not use any Unicode in my work. In Python 3.0 the str type is always Unicode, and 8-bit data gets a separate bytes type.

I don't understand why you cannot print Unicode strings, unless the behavior of 2.6 is different from 2.3, which is what I am using.
    >>> str1 = u'luelü'
    >>> print str1
    luelü
    >>> 
Jan 27 '09 #4
In your example, you have included in your test string 'ü' (code 252 = u'\xFC', which is Latin-1 rather than true ASCII) but not 'ǘ' (Unicode u'\u01D8'). It appears that there are three classes of characters:
. 7-bit ASCII (0-127 = u'\x00' - u'\x7F')
. "upper ASCII" (128-255 = u'\x80' - u'\xFF')
. full 2-byte Unicode (u'\u0100' - u'\uFFFF')
Codes 128-255 give problems to some routines because they are neither straight ASCII nor 2-byte Unicode.
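For what it's worth, all three classes are ordinary Unicode code points; what differs is which encodings can represent them. The range 128-255 fits in Latin-1 but not in ASCII, and in UTF-8 both 'ü' and 'ǘ' take two bytes each, which matches the 'l\xc7\x98el\xc3\xbc' output earlier in the thread. A small sketch, assuming Python 3; the helper name describe is made up for illustration:

```python
def describe(ch):
    """Return (code point, UTF-8 bytes, fits in Latin-1?) for one character."""
    try:
        ch.encode('latin-1')
        fits_latin1 = True
    except UnicodeEncodeError:
        fits_latin1 = False
    return ord(ch), ch.encode('utf-8'), fits_latin1


# 'l', then u-umlaut (U+00FC), then u-umlaut with acute (U+01D8).
table = {ch: describe(ch) for ch in 'l\u00fc\u01d8'}

for ch, (code, utf8, fits) in table.items():
    # ascii() keeps the output printable even on an ASCII-only console.
    print(ascii(ch), hex(code), utf8, 'Latin-1' if fits else 'beyond Latin-1')
```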

In your example all the codes above 127 give trouble for me, depending on the test string. Here's my complete source file. The second line, which declares the encoding of the source file as UTF-8, is required. Note 'ǘ' in the test string.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    str1 = u'lǘelü'
    print str1
Output:
        print str1
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u01d8' in position 1: ordinal not in range(128)
If I change the test string to u'lüelǘ', I get this output:
        print str1
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
This seems to show that both the non-ASCII characters make "print" choke: it chokes on whichever one comes first.
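The error message suggests that print is encoding the string to ASCII before writing it, and the encoder stops at the first character it cannot represent. That step can be reproduced without print at all. A sketch assuming Python 3 and an ASCII output encoding (which is what the messages above report); first_unencodable is a made-up helper name:

```python
def first_unencodable(text, encoding='ascii'):
    """Return (position, character) of the first character the encoding rejects, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc.start, text[exc.start]
    return None


# The two test strings from this post, written with escapes.
print(ascii(first_unencodable('l\u01d8el\u00fc')))
print(ascii(first_unencodable('l\u00fcel\u01d8')))
print(ascii(first_unencodable('luel')))

# One way to keep output from failing: escape what the encoding cannot hold.
escaped = 'l\u01d8el\u00fc'.encode('ascii', 'backslashreplace')
print(escaped)
```

The 'backslashreplace' error handler trades readability for safety: the text comes out as escape sequences rather than raising.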

HOWEVER, if I use the test string u'luelü' with no 2-byte Unicode char, as in your reply, it prints with no problem. Go figure!

It might be useful to know what your system does with my two test strings above... if we care.

Finally: If Python 3.0 has finally unified the string type to include Unicode, that's a real good reason for me to change. Thanks for the tip!
Jan 28 '09 #5
I have now switched to Python 3.0 and find that most of my problems have gone away. The pickle module works fine for Unicode, since all strings are anyhow Unicode. So no more u'...' in front of Unicode strings.

FYI here are some things I had to watch out for. The codecs module is no longer needed to open files, since the built-in open() takes an encoding argument, and so Unicode input files are specified by

FH = open(FileName, 'r', encoding='utf-8')

and output files the same with 'w'. Pickle files have to be specified as binary:

PFH = open(PickleFileName, 'rb')

or 'wb', depending.
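Put together, the two kinds of file look roughly like this. It is only a sketch: the file names and the dictionary entries are invented for illustration, and a temporary directory is used so nothing is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical sample data, for illustration only.
pinyin = {'l\u00fc': 0, 'l\u01d8': 2}

with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, 'words.txt')       # hypothetical name
    pickle_path = os.path.join(folder, 'words.pickle')  # hypothetical name

    # Text file: open() encodes and decodes UTF-8 for us.
    with open(text_path, 'w', encoding='utf-8') as fh:
        fh.write('\n'.join(sorted(pinyin)))
    with open(text_path, 'r', encoding='utf-8') as fh:
        words = fh.read().split('\n')

    # Pickle file: binary mode, no encoding argument.
    with open(pickle_path, 'wb') as pfh:
        pickle.dump(pinyin, pfh)
    with open(pickle_path, 'rb') as pfh:
        restored = pickle.load(pfh)

print(words == sorted(pinyin), restored == pinyin)
```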

The print command has changed to print(), and fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?

Thanks for all your help.
Feb 1 '09 #6
bvdet
2,851 Expert Mod 2GB
@bigturtle
That's a good question, and I don't know the answer. Have you looked at sys.setdefaultencoding(name) or codecs.StreamWriter(stream[, errors]) and codecs.getwriter(encoding)? There may be a way to redefine print() to handle your encoding problem. Also look into io.

HTH, BV
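To follow up on the io suggestion: in Python 3 one possible approach is to wrap the binary stream underneath stdout in an io.TextIOWrapper with an explicit encoding. The sketch below shows the idea on an in-memory stream so the result can be inspected; for the real console the binary stream would be sys.stdout.buffer, which is assumed to be available. The function name utf8_writer is made up.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline='\n' turns off platform newline translation, to keep the bytes predictable.
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


raw = io.BytesIO()        # stand-in for sys.stdout.buffer
out = utf8_writer(raw)
print('l\u01d8el\u00fc', file=out)
out.flush()               # push the encoded text through to the binary stream

captured = raw.getvalue()
print(captured)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') should achieve the same thing without the wrapper. Whether the characters then display correctly still depends on the terminal.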
Feb 2 '09 #7
Stress
1
Peace, friends,

You're mixing things up: serialization (pickling) gives you a binary representation of any Python object, Unicode text included.

If you open a file in text mode, and tell Python that it contains text encoded as UTF-8, then obviously you shouldn't be writing binary data (byte arrays, "bytes" in Python 3), such as pickled stuff, to it.

Put your pickles in a binary file. What you read/write from/to a UTF-8 encoded file is Unicode text ("str" in Python 3, right?), that gets automatically de/encoded for you.
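A minimal sketch of that distinction, assuming Python 3 and using in-memory streams in place of real files: the text stream rejects the pickled bytes outright, while the binary stream stores them and gives them back unchanged.

```python
import io
import pickle

pickled = pickle.dumps('l\u01d8el\u00fc')  # bytes, not text

# A text stream refuses bytes in Python 3.
try:
    io.StringIO().write(pickled)
    text_stream_accepts_bytes = True
except TypeError:
    text_stream_accepts_bytes = False

# A binary stream takes the pickle as-is.
binary = io.BytesIO()
binary.write(pickled)
binary.seek(0)
restored = pickle.load(binary)

print(type(pickled).__name__, text_stream_accepts_bytes, ascii(restored))
```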

OK?
;o)
Feb 5 '09 #8
