  #1  
July 18th, 2005, 10:00 PM
Kevin Dangoor

XML with Unicode: what am I doing wrong?

This is a follow-up to a blog post I wrote the other day:
http://www.blueskyonmars.com/archive...ementtidy.html

I started out working in the context of elementtidy, but now I am
running into trouble in general Python-XML areas, so I thought I'd toss
the question out here. The code below is fairly self-explanatory. I have
a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
compatible. I use Tidy to convert it to XHTML, and this particular setup
returns a unicode instance rather than a string.

import _elementtidy as et
from xml.parsers import expat

data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
html = et.fixup(data)[0]
parser = expat.ParserCreate()
parser.Parse(html)

UnicodeEncodeError: 'ascii' codec can't encode character '\ub5' in
position 542: ordinal not in range(128)

If I set my default encoding to utf8 in sitecustomize.py, it works just
fine. I'm thinking that I can't be the only one trying to pass unicode
to expat... Is there something else I need to do here?

Thanks,
Kevin
Blazing Things
  #2  
July 18th, 2005, 10:00 PM
Diez B. Roggisch

Re: XML with Unicode: what am I doing wrong?

> I started out working in the context of elementtidy, but now I am
> running into trouble in general Python-XML areas, so I thought I'd toss
> the question out here. The code below is fairly self-explanatory. I have
> a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
> compatible. I use Tidy to convert it to XHTML, and this particular setup
> returns a unicode instance rather than a string.
>
> import _elementtidy as et
> from xml.parsers import expat
>
> data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
> html = et.fixup(data)[0]
> parser = expat.ParserCreate()
> parser.Parse(html)
>
> UnicodeEncodeError: 'ascii' codec can't encode character '\ub5' in
> position 542: ordinal not in range(128)
>
> If I set my default encoding to utf8 in sitecustomize.py, it works just
> fine. I'm thinking that I can't be the only one trying to pass unicode
> to expat... Is there something else I need to do here?

You are confusing unicode with UTF-8. Expat can parse the latter; the
former is Python's internal text type. Passing a unicode object to
something that needs a byte string triggers an implicit conversion, which
fails because the default encoding is ASCII.

Do this:

parser.Parse(html.encode('utf-8'))
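
To make the conversion step concrete, here is a minimal sketch written for Python 3, where `str` plays the role that `unicode` plays in the code above. The `html` value is a made-up stand-in for the Tidy output, not the original snippet. It tries to show two things: encoding the text to ASCII fails in the same way the implicit conversion did, and encoding it explicitly to UTF-8 gives expat bytes it can parse.

```python
from xml.parsers import expat

# Hypothetical stand-in for the Tidy output: text containing U+00B5 (micro sign).
html = "<p>5 \u00b5m</p>"

# Encoding to ASCII fails, which is what the implicit conversion ran into.
try:
    html.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encode explicitly to UTF-8 and hand expat the resulting bytes.
chunks = []
parser = expat.ParserCreate()
parser.CharacterDataHandler = chunks.append  # collect text nodes as they arrive
parser.Parse(html.encode("utf-8"), True)     # True marks the final chunk

text = "".join(chunks)
print(ascii_failed, repr(text))
```

The character data handler may be called more than once for a single text node, which is why the pieces are collected in a list and joined afterwards.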

--
Regards,

Diez B. Roggisch
  #3  
July 18th, 2005, 10:00 PM
Just

Re: XML with Unicode: what am I doing wrong?

In article <ctr7ae$ioj$03$1@news.t-online.com>,
"Diez B. Roggisch" <deetsNOSPAM@web.de> wrote:

> > I started out working in the context of elementtidy, but now I am
> > running into trouble in general Python-XML areas, so I thought I'd toss
> > the question out here. The code below is fairly self-explanatory. I have
> > a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
> > compatible. I use Tidy to convert it to XHTML, and this particular setup
> > returns a unicode instance rather than a string.
> >
> > import _elementtidy as et
> > from xml.parsers import expat
> >
> > data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
> > html = et.fixup(data)[0]
> > parser = expat.ParserCreate()
> > parser.Parse(html)
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character '\ub5' in
> > position 542: ordinal not in range(128)
> >
> > If I set my default encoding to utf8 in sitecustomize.py, it works just
> > fine. I'm thinking that I can't be the only one trying to pass unicode
> > to expat... Is there something else I need to do here?
>
> you confuse unicode with utf8. Expat can parse the latter - the former is
> internal to python. And passing it to something that needs a string will
> result in a conversion - which fails because of the ascii encoding.
>
> Do this:
>
> parser.Parse(html.encode('utf-8'))

Possibly preceded by

parser = expat.ParserCreate('utf-8')

...so the parser uses utf-8 regardless of the encoding declared in the
document, in case that's not utf-8.
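
A small sketch of why the override can matter, again written for Python 3. The document and the `parse_text` helper are made up for illustration: the XML declaration claims iso-8859-1, but the bytes are really UTF-8, which is the mismatch you can get after re-encoding a unicode string yourself.

```python
from xml.parsers import expat

def parse_text(data, encoding=None):
    """Parse XML bytes and return the concatenated character data.

    Hypothetical helper; encoding=None lets expat follow the document's
    own declaration, any other value overrides it.
    """
    chunks = []
    parser = expat.ParserCreate(encoding)
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)

# Hypothetical document: declared as iso-8859-1 but actually encoded as UTF-8.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00b5</p>'.encode("utf-8")

declared = parse_text(doc)             # expat trusts the declaration
overridden = parse_text(doc, "utf-8")  # encoding passed to ParserCreate wins
print(repr(declared), repr(overridden))
```

Without the override, the two UTF-8 bytes of the micro sign should be read as two separate Latin-1 characters; with it, the text should come back as the single intended character.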

Just
 
