Here's what I'm trying to do:
- scrape some HTML content from various sources
The issue I'm running into:
- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.
Things I have tried include encode()/decode(), and replacement lookup
tables (i.e. something like http://groups-beta.google.com/group/...991de6ced3406b
). However, I am still unable to convert the characters to something
meaningful. In the case of the lookup table, this failed, as all of
the improperly encoded characters were returning as ? rather than
their original encoding.
I'm using urllib and htmllib to open, read, and parse the HTML
fragments, with Python 2.3 on OS X 10.3.
Any ideas or pointers would be greatly appreciated.
-Dylan Schiemann http://www.dylanschiemann.com/
Dylan wrote: Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character references for everything else.
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
absolutely certain not to ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
range(128,160)
6. use cp1252
7. use Latin-1
In order, from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, decoding will definitely succeed;
consider this a failure if you get any control characters
(from range(128, 160)), and in that case continue to step 6.
Step 7 then tries latin-1 again, this time accepting the result.
When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
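In modern Python 3 terms (where bytes and str are distinct types), the decode-then-reencode step above can be sketched like this; the helper name is just for illustration:

```python
def to_ascii(raw, encoding):
    """Decode raw bytes with a guessed encoding, then re-encode to
    pure ASCII, turning everything non-ASCII into numeric character
    references (e.g. curly quotes become &#8220; / &#8221;)."""
    return raw.decode(encoding).encode("ascii", "xmlcharrefreplace")

# cp1252 bytes 0x93/0x94 are the Word-style curly quotes
print(to_ascii(b"curly \x93quotes\x94", "cp1252"))
# b'curly &#8220;quotes&#8221;'
```

The resulting byte string is safe to embed in any HTML page regardless of its declared charset, since it contains only ASCII.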
Regards,
Martin
Martin v. Löwis wrote: [snip - the decode/encode suggestion and the seven-step encoding-guessing strategy, quoted in full]
I have a similar problem, with characters like äöüÄÖÜß and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes without even giving any encoding information in the header. But
your solution sounds quite good; I just do not know if
- it works with the characters I mentioned
- what encoding do you have in the end
- and how exactly are you doing all this? All with somestring.decode()
or...? Can you please give an example for these 7 steps?
Thanks in advance for the help
Chris
Christian Ergh wrote: - it works with the characters i mentioned
It does.
- what encoding do you have in the end
US-ASCII
- and how exactly are you doing all this? All with somestring.decode() or... Can you please give an example for these 7 steps?
I could, but I don't have the time - just try to come up with some
code, and I'll try to comment on it.
Regards,
Martin
Martin v. Löwis wrote: [snip - the decode/encode suggestion and the seven-step encoding-guessing strategy, quoted in full]
Something like this?
Chris
import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = ???
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 128:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a for-loop is for:
for char in data:
    if 127 < ord(char) < 128:
        break
else:
    try:
        data = data.encode('latin-1')
    except:
        pass
Only saves you one line of code, but you don't have to keep track of a
'flag' variable. Generally, I find that when I want to set a 'flag'
variable, I can usually do it with a for/else instead.
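A minimal sketch of that pattern (modern Python 3, hypothetical function name): the else clause runs only when the loop finishes without hitting break.

```python
def has_c1_controls(text):
    """Return True if text contains any character in the C1 control
    range (code points 128..159), using for/else instead of a flag."""
    for ch in text:
        if 127 < ord(ch) < 160:
            break  # found a control character; skips the else clause
    else:
        return False  # loop completed with no break: all characters clean
    return True

print(has_c1_controls("plain ascii"))    # False
print(has_c1_controls("bad \x85 byte"))  # True
```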
Steve
[1] Messed up indentation happens in a lot of clients if you have tabs
in your code. If you can replace tabs with spaces before posting, this
usually solves the problem.
Steven Bethard wrote: [snip - the for/else suggestion, quoted in full]
Even more off-topic:
>>> for char in data:
...     if 127 < ord(char) < 128:
...         break
...     print char
127.5
:-)
Peter
Peter Otten wrote: [snip - Steven's for/else suggestion and the 127.5 example, quoted in full]
Well yes, that happens when doing a quick hack and not reviewing it - the 128
has to be 160, of course...
Once more, the indentation should be correct now, and the 128 is gone too. So,
something like this?
Chris
import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 160:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
Christian Ergh wrote:
A simple way to try out different encodings in a given order:
# -*- coding: latin-1 -*-
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass

st = 'Test characters æøå ÆØÅ'
encodings = ['utf-8', 'latin-1', 'ascii', ]
print get_encoded(st, encodings)
(u'Test characters \xe6\xf8\xe5 \xc6\xd8\xc5', 'latin-1')
--
hilsen/regards Max M, Denmark http://www.mxm.dk/
IT's Mad Science
- snip - [the get_encoded function above] - snip -
This works fine, but afterwards you still have three possible encodings (or
even more - looking at the data on the net, you'll see a lot of
encodings...) - what we need is just one for all.
Chris
Dylan wrote: [snip - the original problem description]
Finally: for me this works, all inside my own class, and the module has
a logger; for reuse you would need to fix this stuff... I am updating a
PostgreSQL database, in case someone wonders about the __setattr__, and
my class inherits from SQLObject.
def doDecode(self, st):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            stEncoded = st.decode(encoding)
            return stEncoded
        except UnicodeError:
            pass

def setAttribute(self, name, data):
    import HTMLFilter
    data = self.doDecode(data)
    try:
        data = data.encode('ascii', "xmlcharrefreplace")
    except:
        log.warn('new method did not fit')
    try:
        if '&#' in data:
            data = HTMLFilter.HTMLDecode(data)
    except UnicodeDecodeError:
        log.debug('HTML decoding failed!!!')
    try:
        data = data.encode('utf-8')
    except:
        log.warn('new utf 8 method did not fit')
    try:
        self.__setattr__(name, data)
    except:
        log.debug('1. try failed: ')
        log.warning(type(data))
        log.debug(data)
        log.warning('Some unicode error while updating')
Forgot a part... You need the encoding list:
encodings = [
    'utf-8',
    'latin-1',
    'ascii',
    'cp1252',
]
Christian Ergh wrote: Once more, the indentation should be correct now, and the 128 is gone too. So, something like this?
Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.
Also, it might be possible to do this in a for loop, e.g.
for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                 "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
    try:
        data = data.encode(encoding)
        break
    except UnicodeError:
        pass
You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
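One way to sketch that special case (modern Python 3, hypothetical helper name): decode with latin-1, which never fails, then treat any C1 control character in the result as a failure so the fallback chain can move on.

```python
def decode_latin1_no_controls(raw):
    """Decode bytes as latin-1, but raise UnicodeError if the result
    contains C1 control characters (code points 128..159) - those
    usually mean the data was really cp1252 or something else."""
    text = raw.decode("latin-1")  # never raises: latin-1 accepts any byte
    if any(127 < ord(ch) < 160 for ch in text):
        raise UnicodeError("C1 control characters present")
    return text
```

Using this as the "Latin-1-no-controls" step means a cp1252 curly quote (byte 0x93) is rejected here and correctly picked up by the cp1252 step that follows.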
# if it is not in the pagecode, how do i get the encoding of the page? pageencoding = '???'
You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header.
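In modern Python 3, urllib.request exposes the response headers as an email.message.Message, so the declared charset can be read with get_content_charset(). A sketch that builds the header by hand instead of making a network request:

```python
from email.message import Message

# Stand-in for resp.headers from urllib.request.urlopen(url)
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

# Returns the charset parameter, normalized to lower case,
# or None if the server sent no charset at all.
charset = headers.get_content_charset()
print(charset)  # iso-8859-1
```

The None case is exactly the situation this thread is about, so a fallback chain is still needed when the header is silent.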
xmlencoding = 'whatever i parsed out of the file' htmlmetaencoding = 'whatever i parsed out of the metatag'
Depending on the library you use, these aren't that trivial, either.
Regards,
Martin
Max M wrote: A simple way to try out different encodings in a given order:
The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
somewhat redundant. The 'ASCII' case is never considered, since
Latin-1 effectively works as a catch-all encoding (as all byte
sequences can be considered Latin-1 - whether they are meaningful
data is a different question).
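The catch-all behaviour is easy to verify (modern Python 3 sketch): latin-1 maps every byte value 0-255 to a code point, so decoding with it can never raise, and any encoding listed after it in a fallback chain is unreachable.

```python
raw = bytes(range(256))       # every possible byte value
text = raw.decode("latin-1")  # always succeeds - no UnicodeError possible
assert text.encode("latin-1") == raw  # and it round-trips losslessly
print(len(text))  # 256
```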
Regards,
Martin