UTF8 & HTMLParser

Hello all,

I'm writing a Python script which fetches an HTML page (using wget),
and then parses the retrieved page using a custom htmllib HTMLParser.

The page I fetch is encoded in UTF-8, and my text handler currently
looks like this:

    def handle_data(self, text):
        if self.inOption:
            self.currentName = text

However, I would like to convert the "text" (which is UTF-8) to
Latin-1. How do I do that? I've been trying to figure it out for some
time now, and I'm just getting frustrated. :-(

--
Kind Regards,
Jan Danielsson
Te audire non possum. Musa sapientum fixa est in aure.
Dec 1 '06 #1
I should have mentioned: the problem appears to be that I can't find
a way to make Python understand that "text" (the argument above) is
in fact already UTF-8.

--
Kind Regards,
Jan Danielsson
Te audire non possum. Musa sapientum fixa est in aure.
Dec 1 '06 #2
Jan Danielsson wrote:
> However, I would like to convert the "text" (which is utf8)
> to latin-1. How do I do that?

How about:

latin = unicode(text, 'utf-8').encode('iso-8859-1')

Please see help(u''.encode) for details about error handling. You
might also want to trap errors in a try-except statement.

Cheers,

--
Klaus Alexander Seistrup
http://klaus.seistrup.dk/
Dec 1 '06 #3
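
The conversion and error handling suggested in #3 could be sketched roughly as follows. Note the assumptions: this is written for Python 3, where unicode() no longer exists and the equivalent is bytes.decode() followed by str.encode(); the helper name utf8_to_latin1 and the sample strings are made up for illustration and are not from the thread.

```python
def utf8_to_latin1(raw, errors="strict"):
    """Re-encode UTF-8 bytes as ISO-8859-1 (Latin-1) bytes.

    Hypothetical helper, not part of htmllib or the original script.
    """
    # Step 1: tell Python the bytes are UTF-8 by decoding them to text.
    # Invalid UTF-8 input raises UnicodeDecodeError here.
    text = raw.decode("utf-8")
    # Step 2: encode the text as Latin-1. `errors` decides what happens
    # to characters Latin-1 cannot represent ("strict", "replace", "ignore").
    return text.encode("iso-8859-1", errors)


# "Malm\u00f6" contains o-umlaut, which exists in Latin-1.
print(utf8_to_latin1("Malm\u00f6".encode("utf-8")))  # b'Malm\xf6'

# "\u20ac" (the euro sign) has no Latin-1 equivalent, so strict mode fails.
try:
    utf8_to_latin1("\u20ac5".encode("utf-8"))
except UnicodeEncodeError as exc:
    print("not representable in Latin-1:", exc.reason)

# With errors="replace" the unsupported character becomes "?".
print(utf8_to_latin1("\u20ac5".encode("utf-8"), errors="replace"))  # b'?5'
```

In the handler from #1, the same two steps would be applied to "text" before storing it in self.currentName. Whether to use "strict" or "replace" depends on whether silently losing characters is acceptable.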