  #1  
Old August 2nd, 2006, 03:45 PM
Sébastien Boisgérault
Guest
 
Posts: n/a
Default ElementTree and Unicode


I guess I am doing something wrong... Any clue?

>>> from elementtree.ElementTree import *
>>> element = Element("string", value=u"\x00")
>>> xml = tostring(element)
>>> XML(xml)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 960, in XML
    parser.feed(text)
  File "/usr/lib/python2.4/site-packages/elementtree/ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 15

Cheers,

SB

  #2  
Old August 2nd, 2006, 03:55 PM
Richard Brodie
Guest
 
Posts: n/a
Default Re: ElementTree and Unicode


"Sébastien Boisgérault" <Sebastien.Boisgerault@gmail.com> wrote in message
news:1154530195.741884.34350@h48g2000cwc.googlegroups.com...
> >>> element = Element("string", value=u"\x00")

I'm not as familiar with elementtree.ElementTree as I perhaps
should be. However, you appear to be trying to insert a null
character into an XML document. Should you succeed in this
quest, the resulting document will be ill-formed, and any
conforming parser will choke on it.
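
For readers on a current Python, the failure can be reproduced with the standard-library `xml.etree.ElementTree` module. This is a minimal sketch that assumes Python 3 rather than the standalone `elementtree` package on Python 2.4 used in the original post. It shows that serialisation does not validate the attribute value, while parsing the result fails.

```python
import xml.etree.ElementTree as ET

# Serialisation does not check attribute values, so the NUL
# character is written verbatim into the output bytes.
element = ET.Element("string", value="\x00")
serialized = ET.tostring(element)

# Parsing the serialised document fails, because U+0000 is not
# a legal character anywhere in an XML document.
try:
    ET.fromstring(serialized)
    parsed_ok = True
except ET.ParseError as exc:
    parsed_ok = False
    print("parser rejected the document:", exc)
```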


  #3  
Old August 2nd, 2006, 04:25 PM
Sébastien Boisgérault
Guest
 
Posts: n/a
Default Re: ElementTree and Unicode


Richard Brodie wrote:
> "Sébastien Boisgérault" <Sebastien.Boisgerault@gmail.com> wrote in message
> news:1154530195.741884.34350@h48g2000cwc.googlegroups.com...
>
> > >>> element = Element("string", value=u"\x00")
>
> I'm not as familiar with elementtree.ElementTree as I perhaps
> should be. However, you appear to be trying to insert a null
> character into an XML document. Should you succeed in this
> quest, the resulting document will be ill-formed, and any
> conforming parser will choke on it.

I am trying to embed *arbitrary* (unicode) strings inside
an XML document. Of course I'd like to be able to reconstruct
them later from the XML document... If the naive way to do it does
not work, can anyone suggest a way to do it?

SB

  #4  
Old August 2nd, 2006, 06:35 PM
Marc 'BlackJack' Rintsch
Guest
 
Posts: n/a
Default Re: ElementTree and Unicode

In <1154532671.351968.142890@b28g2000cwb.googlegroups.com>, Sébastien
Boisgérault wrote:
> I am trying to embed *arbitrary* (unicode) strings inside
> an XML document. Of course I'd like to be able to reconstruct
> it later from the xml document ... If the naive way to do it does
> not work, can anyone suggest a way to do it ?
Encode it as UTF-8 and then as Base64. AFAIK that is the only reliable way
to put an arbitrary string into XML and get exactly the same string back again.
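
The two steps might look like this in Python 3. This is a minimal sketch, and the helper names `encode_for_xml` and `decode_from_xml` are made up for illustration. The Base64 alphabet contains only ASCII letters, digits, `+`, `/` and `=`, all of which are legal XML characters.

```python
import base64

def encode_for_xml(text):
    """UTF-8 encode, then Base64 encode; returns an ASCII-only str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def decode_from_xml(token):
    """Reverse of encode_for_xml: Base64 decode, then UTF-8 decode."""
    return base64.b64decode(token).decode("utf-8")

# A string containing characters that XML forbids (U+0000, U+FFFF).
original = "\x00 caf\u00e9 \uffff"
token = encode_for_xml(original)
restored = decode_from_xml(token)
```

One caveat: in Python 3 a `str` containing a lone surrogate cannot be encoded as strict UTF-8, so such input would need separate handling.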

Ciao,
Marc 'BlackJack' Rintsch
  #5  
Old August 2nd, 2006, 08:15 PM
Martin v. Löwis
Guest
 
Posts: n/a
Default Re: ElementTree and Unicode

Sébastien Boisgérault wrote:
> I am trying to embed *arbitrary* (unicode) strings inside
> an XML document. Of course I'd like to be able to reconstruct
> it later from the xml document ... If the naive way to do it does
> not work, can anyone suggest a way to do it ?

XML does not support arbitrary Unicode characters; a few control
characters are excluded. See the definition of Char in

http://www.w3.org/TR/2004/REC-xml-20040204

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]
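
The Char production translates directly into a range check. A minimal Python 3 sketch follows; the function names `is_xml_char` and `is_xml_safe` are hypothetical and not part of ElementTree.

```python
def is_xml_char(ch):
    """True if the single character ch matches production [2] Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def is_xml_safe(text):
    """True if every character of text may appear in an XML 1.0 document."""
    return all(is_xml_char(ch) for ch in text)
```

A check like this can tell a sender up front whether a string can be embedded as-is or needs an encoding step.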

Now, one might think you could use a character reference
(e.g. &#0;) to refer to the "missing" characters, but this is not so:

[66] CharRef ::= '&#' [0-9]+ ';'
               | '&#x' [0-9a-fA-F]+ ';'

Well-formedness constraint: Legal Character
Characters referred to using character references must match the
production for Char.
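
The constraint can be observed with an expat-based parser. The sketch below assumes Python 3 and the standard-library `xml.etree.ElementTree`; the helper `parses` is made up for illustration. A reference to the tab character is accepted, while a reference to U+0000 is rejected.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

tab_ok = parses('<string value="&#9;"/>')   # U+0009 matches Char
nul_ok = parses('<string value="&#0;"/>')   # U+0000 does not match Char
```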

As others have explained, if you want to transmit arbitrary characters,
you need to encode them as text in some way. One obvious solution
would be to encode the Unicode data as UTF-8 first, and then encode
the UTF-8 bytes using base64. The receiver of the XML document must
then do the reverse.
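
A sender and receiver pair along these lines could be sketched as follows, assuming Python 3 and the standard-library `xml.etree.ElementTree`. The function names `to_xml` and `from_xml` and the `encoding="base64"` marker attribute are illustrative choices, not an established convention.

```python
import base64
import xml.etree.ElementTree as ET

def to_xml(text):
    """Sender: UTF-8 encode, Base64 encode, store as an attribute."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # The "encoding" attribute is a hypothetical marker that tells the
    # receiver how to interpret "value".
    element = ET.Element("string", value=payload, encoding="base64")
    return ET.tostring(element)

def from_xml(document):
    """Receiver: parse, then Base64 decode and UTF-8 decode the value."""
    element = ET.fromstring(document)
    return base64.b64decode(element.get("value")).decode("utf-8")

# The string from the original post now survives the round trip.
document = to_xml("\x00")
restored = from_xml(document)
```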

Regards,
Martin
  #7  
Old August 2nd, 2006, 11:55 PM
Sébastien Boisgérault
Guest
 
Posts: n/a
Default Re: ElementTree and Unicode


OK! Thanks a lot for this helpful information.

Cheers,

SB

 
