encoding in lxml

Hey,

I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc.xpath('/html/body/something')

Now I'm guaranteed that the xpath_nodes list contains only one
element. So I read its content:

xpath_nodes[0].text

And I get an exception here. The exception is coming from the text
property of an Element object. The problem is that the text contains a
non-UTF-8 character. lxml seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?

Regards,

Mike
Nov 3 '08 #1
2 Replies


Hi Mike,
> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.
There will be a host of more lightweight solutions, but you can opt
to sanitize incoming HTML with HTML Tidy (Python bindings available).

It will replace invalid UTF-8 bytes with U+FFFD. It will not
guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only
the occasional wrong byte, even decoding (with fallback) and encoding
again using the standard codecs module will do.
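
For what it's worth, the decode-and-re-encode round trip could look roughly like the sketch below. The function name to_clean_utf8 is made up for illustration, and using errors="replace" is only one reading of "with fallback": it substitutes U+FFFD for each undecodable byte, much like the Tidy behaviour described above.

```python
def to_clean_utf8(raw, errors="replace"):
    """Return `raw` re-encoded as valid UTF-8.

    `to_clean_utf8` is a hypothetical helper name, not part of any library.
    errors="replace" turns each undecodable byte into U+FFFD;
    errors="ignore" silently drops it instead.
    """
    # Decode leniently so bad bytes never raise UnicodeDecodeError.
    text = raw.decode("utf-8", errors=errors)
    # Encoding a str back to UTF-8 always yields well-formed bytes.
    return text.encode("utf-8")


if __name__ == "__main__":
    dirty = b"<p>caf\xe9 ok</p>"  # \xe9 is Latin-1 for "é", invalid as UTF-8 here
    print(to_clean_utf8(dirty))
```

The cleaned bytes should then be safe to hand to the parser, at the cost of losing whatever character the bad byte was meant to be.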

Regards,
Peter
Nov 3 '08 #2

jasiu85 wrote:
> I have a problem with character encoding in LXML. Here's how it goes:
>
> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.
You can instantiate your own HTML parser and pass encoding="utf-8". That way,
when it's not UTF-8, you will get an exception at parse time, which allows you
to reparse the document with another encoding (say, ISO-8859-1) to get the
correct content.
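
A rough sketch of the same try-then-fall-back idea follows. Note that it is a variant, not exactly the above: it does the strict UTF-8 check in plain Python before parsing, because that step is certain to raise UnicodeDecodeError on bad input, and then hands the decoded string to lxml. The names decode_html and parse_html are made up for illustration, and ISO-8859-1 as the second choice is just the example encoding mentioned above.

```python
def decode_html(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    `decode_html` is a hypothetical helper; the candidate list is an assumption.
    ISO-8859-1 accepts every byte value, so it works as a last resort.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc  # strict decoding: bad bytes raise
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def parse_html(raw):
    """Parse raw bytes with lxml after picking an encoding (hypothetical helper)."""
    # Imported lazily so decode_html() stays usable without lxml installed.
    from lxml import etree

    text, _ = decode_html(raw)
    return etree.HTML(text)
```

With this, parse_html(string_with_document).xpath('/html/body/something') should give nodes whose .text is an ordinary Unicode string. Passing encoding="utf-8" to etree.HTMLParser, as suggested above, keeps the check inside the parser instead.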

Stefan
Nov 3 '08 #3
