Bytes IT Community

Some questions about decode/encode

I'll use Chinese characters as an example here.

>>> s1 = '你好吗'
>>> repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>> b1 = s1.decode('GBK')

My first question is: what strategy does 'decode' use to separate the
characters? I mean, since s1 is a multi-byte string, how does it
determine whether to split the string every 2 bytes or every 1 byte?
My second question is: has anyone tested very long MBCS decodes? I
tried to decode a long (20+ MB) XML file yesterday, which turned out
very strange and caused SAX to fail to parse the decoded string.
However, when I use another text editor to convert the file to UTF-8,
SAX parses the content successfully.

I'm not sure whether some special byte sequence or the sheer length of
the text caused this problem. Or maybe it's a bug in Python 2.5?

Jan 24 '08 #1
15 Replies


glacier <ro*******@gmail.com> writes:
> I use Chinese characters as an example here.
>
> >>> s1 = '你好吗'
> >>> repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>> b1 = s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to separate the
> characters? I mean, since s1 is a multi-byte string, how does it
> determine whether to split the string every 2 bytes or every 1 byte?
The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.
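Ben's point can be seen directly at the interpreter. A minimal sketch, written in Python 3 syntax where the bytes/text split is explicit (the thread itself uses Python 2, where `str.decode` did the same job):

```python
# The six GBK bytes from the original post map to exactly three characters.
# The codec knows only byte <-> character mappings, nothing about "words".
data = b"\xc4\xe3\xba\xc3\xc2\xf0"
text = data.decode("gbk")
assert len(data) == 6
assert len(text) == 3
print(text)  # 你好吗
```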

--
\ "When I get new information, I change my position. What, sir, |
`\ do you do with new information?" -- John Maynard Keynes |
_o__) |
Ben Finney
Jan 24 '08 #2

Ben Finney <bi****************@benfinney.id.au> writes:
> glacier <ro*******@gmail.com> writes:
> > My first question is: what strategy does 'decode' use to separate
> > the characters? I mean, since s1 is a multi-byte string, how does
> > it determine whether to split the string every 2 bytes or every 1 byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.
To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney
Jan 24 '08 #3

On Jan 24, 1:41 PM, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
According to your reply, what will happen if I try to decode a long
string piecewise? I mean:

######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

May the code above produce any bogus characters in s1?
Thanks :)


Jan 24 '08 #4

En Thu, 24 Jan 2008 04:52:22 -0200, glacier <ro*******@gmail.com> escribió:
> According to your reply, what will happen if I try to decode a long
> string piecewise? I mean:
>
> ######################################
> a = '你好吗' * 100000
> s1 = u''
> cur = 0
> while cur < len(a):
>     d = min(len(a) - cur, 1023)
>     s1 += a[cur:cur+d].decode('mbcs')
>     cur += d
> ######################################
>
> May the code above produce any bogus characters in s1?
Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see
http://docs.python.org/lib/string-methods.html#l2h-237
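If chunked decoding really is needed (e.g. for a huge file), the safe tool is an incremental decoder, which buffers a trailing partial character instead of raising. A sketch in Python 3 syntax (the thread uses Python 2, whose `codecs` module offers the same machinery):

```python
import codecs

data = "你好吗".encode("gbk") * 1000            # a long GBK byte string
decoder = codecs.getincrementaldecoder("gbk")()
parts = []
for i in range(0, len(data), 1023):             # 1023 deliberately splits characters
    parts.append(decoder.decode(data[i:i + 1023]))
parts.append(decoder.decode(b"", final=True))   # flush any buffered tail
text = "".join(parts)
assert text == "你好吗" * 1000                  # no UnicodeDecodeError, no bogus output
```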

--
Gabriel Genellina

Jan 24 '08 #5

On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> My second question is: has anyone tested very long MBCS decodes? I
> tried to decode a long (20+ MB) XML file yesterday, which turned out
> very strange and caused SAX to fail to parse the decoded string.

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.

> However, when I use another text editor to convert the file to UTF-8,
> SAX parses the content successfully.

Because now you feed SAX bytes instead of a unicode string.

Ciao,
Marc 'BlackJack' Rintsch
Jan 24 '08 #6

On Jan 24, 2:49 pm, glacier <rong.x...@gmail.com> wrote:
> I use Chinese characters as an example here.
>
> >>> s1 = '你好吗'
> >>> repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>> b1 = s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to separate the
> characters? I mean, since s1 is a multi-byte string, how does it
> determine whether to split the string every 2 bytes or every 1 byte?
The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, it's a 1-byte character.
2. Current byte 0x81 to 0xFE inclusive: the current byte and the next
byte make up a two-byte character.
3. Current byte 0x80: undefined (or used, e.g. in cp936, for the 1-byte
euro character).
4. Current byte 0xFF: undefined.
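The lead-byte rules above can be turned into a small boundary scanner. This is only a sketch of the strategy John describes, not the codec's actual implementation (Python 3 syntax, where indexing a bytes object yields ints):

```python
def gbk_char_lengths(data):
    """Walk a GBK byte string and return the byte length of each character.

    Sketch only: a truncated trailing lead byte is not handled.
    """
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # rule 1: ASCII range, 1-byte character
            lengths.append(1)
            i += 1
        elif 0x81 <= lead <= 0xFE:     # rule 2: lead byte of a 2-byte character
            lengths.append(2)
            i += 2
        else:                          # rules 3-4: 0x80 / 0xFF undefined
            raise ValueError("undefined lead byte 0x%02X at offset %d" % (lead, i))
    return lengths

print(gbk_char_lengths(b"ab\xc4\xe3\xba\xc3"))  # [1, 1, 2, 2]
```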

Cheers,
John

Jan 24 '08 #7

On Jan 24, 1:44 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > My second question is: has anyone tested very long MBCS decodes? I
> > tried to decode a long (20+ MB) XML file yesterday, which turned out
> > very strange and caused SAX to fail to parse the decoded string.
>
> That's because SAX wants bytes, not a decoded string. Don't decode it
> yourself.

encode() converts a unicode string to a regular string. decode()
converts a regular string to a unicode string. So I think what Marc
is saying is that SAX needs a regular string (i.e. bytes), not a decoded
string (i.e. a unicode string).
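In modern Python 3 terms, where the two types are named `bytes` and `str`, that direction reads as:

```python
# encode: text -> bytes; decode: bytes -> text.
# (In Python 2, 'unicode' plays the role of str and 'str' the role of bytes.)
raw = "你好吗".encode("gbk")   # text encoded to GBK bytes
text = raw.decode("gbk")       # bytes decoded back to text
assert isinstance(raw, bytes)
assert isinstance(text, str)
```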


Jan 24 '08 #8

On Jan 24, 3:29 PM, "Gabriel Genellina" <gagsl-....@yahoo.com.ar> wrote:
> En Thu, 24 Jan 2008 04:52:22 -0200, glacier <rong.x...@gmail.com> escribió:
> > According to your reply, what will happen if I try to decode a long
> > string piecewise? I mean:
> >
> > ######################################
> > a = '你好吗' * 100000
> > s1 = u''
> > cur = 0
> > while cur < len(a):
> >     d = min(len(a) - cur, 1023)
> >     s1 += a[cur:cur+d].decode('mbcs')
> >     cur += d
> > ######################################
> >
> > May the code above produce any bogus characters in s1?
>
> Don't do that. You might be splitting the input string at a point that is
> not a character boundary. You won't get bogus output, decode will raise a
> UnicodeDecodeError instead.
> You can control how errors are handled, see
> http://docs.python.org/lib/string-methods.html#l2h-237
>
> --
> Gabriel Genellina
Thanks Gabriel,

I guess I understand what will happen if I don't split the string at
a character boundary.
I'm not sure whether the decode method itself can mis-split at a
boundary. Can you tell me?

Thanks a lot.
Jan 27 '08 #9

On Jan 24, 4:44 PM, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Wed, 23 Jan 2008 19:49:01 -0800, glacier wrote:
> > My second question is: has anyone tested very long MBCS decodes? I
> > tried to decode a long (20+ MB) XML file yesterday, which turned out
> > very strange and caused SAX to fail to parse the decoded string.
>
> That's because SAX wants bytes, not a decoded string. Don't decode it
> yourself.
>
> > However, when I use another text editor to convert the file to UTF-8,
> > SAX parses the content successfully.
>
> Because now you feed SAX bytes instead of a unicode string.

Yep. I fed SAX the unicode string since SAX didn't support my
encoding (GBK).

Is there a better way to solve this?
I mean, if I shouldn't convert the GBK string to a unicode string, what
should I do to make SAX work?

Thanks, Marc.
:)
Jan 27 '08 #10

On Jan 24, 5:51 PM, John Machin <sjmac...@lexicon.net> wrote:
> On Jan 24, 2:49 pm, glacier <rong.x...@gmail.com> wrote:
> [...]
>
> The usual strategy for encodings like GBK is:
> 1. If the current byte is less than 0x80, it's a 1-byte character.
> 2. Current byte 0x81 to 0xFE inclusive: the current byte and the next
> byte make up a two-byte character.
> 3. Current byte 0x80: undefined (or used, e.g. in cp936, for the 1-byte
> euro character).
> 4. Current byte 0xFF: undefined.
>
> Cheers,
> John

Thanks John, I will try to write a function to test whether the strategy
above caused the problem I described in the first post :)

Jan 27 '08 #11

On Sun, 27 Jan 2008 02:18:48 -0800, glacier wrote:
> Yep. I fed SAX the unicode string since SAX didn't support my
> encoding (GBK).

If the `decode()` method supports it, IMHO SAX should too.

> Is there a better way to solve this?
> I mean, if I shouldn't convert the GBK string to a unicode string, what
> should I do to make SAX work?
Decode it and then encode it to utf-8 before feeding it to the parser.
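A minimal sketch of that recoding step, in Python 3 syntax (the `Collector` handler and the document here are made up for illustration; the point is that the parser only ever sees UTF-8 bytes):

```python
import xml.sax
import xml.sax.handler

# A GBK-encoded document; expat does not understand the 'gbk' declaration.
gbk_doc = '<?xml version="1.0" encoding="gbk"?><data>你好吗</data>'.encode("gbk")

# Recode: decode GBK, drop the old declaration, re-encode as UTF-8.
body = gbk_doc.decode("gbk").split("?>", 1)[1]
utf8_doc = ('<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")

class Collector(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chars = []
    def characters(self, content):
        self.chars.append(content)

handler = Collector()
xml.sax.parseString(utf8_doc, handler)
assert "".join(handler.chars) == "你好吗"
```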

Ciao,
Marc 'BlackJack' Rintsch
Jan 27 '08 #12

On Jan 27, 9:17 pm, glacier <rong.x...@gmail.com> wrote:
> I guess I understand what will happen if I don't split the string at
> a character boundary.
> I'm not sure whether the decode method itself can mis-split at a
> boundary. Can you tell me?
*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :-)
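One simple way to run that well-formedness check is a strict decode, which raises on the first malformed sequence and reports its byte offset. A sketch in Python 3 syntax (`check_gbk` is a made-up helper name):

```python
def check_gbk(data):
    """Return None if data is well-formed GBK, else the offset of the first bad byte."""
    try:
        data.decode("gbk")  # strict error handling is the default
        return None
    except UnicodeDecodeError as exc:
        return exc.start

assert check_gbk("你好吗".encode("gbk")) is None
assert check_gbk(b"ok\xc4") == 2   # truncated two-byte character at offset 2
```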
Jan 27 '08 #13

On Jan 27, 7:20 PM, John Machin <sjmac...@lexicon.net> wrote:
> *IF* the file is well-formed GBK, then the codec will not mess up when
> decoding it to Unicode. The usual cause of mess is a combination of a
> human and a text editor :-)
I guess I should first check whether the file I used to test is
well-formed GBK :)
Jan 27 '08 #14

On Jan 28, 2:53 pm, glacier <rong.x...@gmail.com> wrote:
> Thanks, John.
> There's no doubt that you proved SAX didn't support GBK encoding.
> But can you give some suggestions on how to make SAX parse a GBK
> string?
Yes, the same suggestion as was given to you by others very early in
this thread, the same as I demonstrated in the middle of proving that
SAX doesn't support a GBK-encoded input file.

Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
declaration doesn't have an unsupported encoding. Your handler will
get data encoded as UTF-8. Recode that to GBK if needed.

Here's a cut down version of the previous script, focussed on
demonstrating that the recoding strategy works.

C:\junk>type gbksax2.py
import xml.sax, xml.sax.saxutils
import cStringIO
unistr = u''.join(unichr(0x4E00+i) + unichr(ord('W')+i) for i in range(4))
gbkstr = unistr.encode('gbk')
print 'This is a GBK-encoded string: %r' % gbkstr
utf8str = gbkstr.decode('gbk').encode('utf8')
print 'Now recoded as UTF-8 to be fed to a SAX parser: %r' % utf8str
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
utf8doc = xml_template % ('utf-8', unistr.encode('utf8'))
f = cStringIO.StringIO()
handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
xml.sax.parseString(utf8doc, handler)
result = f.getvalue()
f.close()
start = result.find('<data>') + 6
end = result.find('</data>')
mydata = result[start:end]
print "SAX output (UTF-8): %r" % mydata
print "SAX output recoded to GBK: %r" % mydata.decode('utf8').encode('gbk')

C:\junk>gbksax2.py
This is a GBK-encoded string: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'
Now recoded as UTF-8 to be fed to a SAX parser: '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output (UTF-8): '\xe4\xb8\x80W\xe4\xb8\x81X\xe4\xb8\x82Y\xe4\xb8\x83Z'
SAX output recoded to GBK: '\xd2\xbbW\xb6\xa1X\x81@Y\xc6\xdfZ'

HTH,
John
Jan 28 '08 #15

On Jan 28, 2:31 pm, John Machin <sjmac...@lexicon.net> wrote:
> [...]
> Suggestion: Recode your input from GBK to UTF-8. Ensure that the XML
> declaration doesn't have an unsupported encoding. Your handler will
> get data encoded as UTF-8. Recode that to GBK if needed.
> [...]
>
> HTH,
> John
Thanks a lot, John :)
I'll try it.
Jan 28 '08 #16
