UnicodeEncodeError while reading xml file (newbie question)

I just spent a whole day trying to read an xml file and I got stuck
with the following error:

Exception Type: UnicodeEncodeError
Exception Value: 'charmap' codec can't encode characters in position 164-167: character maps to <undefined>
Exception Location: C:\Python25\lib\encodings\cp1252.py in encode, line 12

The string that could not be encoded/decoded was: H_C="����" A_C

After some tests I can say with confidence that the error comes up
when Python reaches those Greek characters after H_C="

The code that reads the file goes like this:

from xml.etree import ElementTree as ET

def read_xml(request):
    data = open('live.xml', 'r').read()
    data = data.decode('utf-8', 'replace')
    data = ET.XML(data)

I've tried every combination of str.decode and str.encode I could
think of, but nothing worked.

Can someone please help?
Jun 27 '08 #1
On Jun 8, 10:12 am, nikosk <nikos.nikos.nikos.ni...@gmail.com> wrote:
[...]
Can someone please help ?
Perhaps, with some more information:
(1) the *full* traceback
(2) what encoding is declared at the top of the XML file
(3) why you think you need to have "data.decode(.....)" at all
(4) why you think your input file is encoded in UTF-8 [which may be
answered by (2)]
(5) why you are using 'replace', which would cover up (for a while)
any non-UTF-8 bytes in your file
(6) what "those greek characters" *really* are -- after fiddling with
encodings in my browser, the best I can make of that is four capital
gamma characters, each followed by a garbage byte or a '?'. Do
something like:

print repr(open('yourfile.xml', 'rb').read()[before_pos:after_pos])

(7) are you expecting non-ASCII characters after H_C=? What
characters? When you open your XML file in a browser, what do you see
there?
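
For point (6), here is a minimal Python 3 sketch of the same check (the thread itself uses Python 2.5, where print is a statement). The helper name raw_window and the sample bytes are made up for illustration; the only idea is to look at the undecoded bytes around a suspicious offset rather than at whatever a browser or console chooses to render.

```python
def raw_window(raw, pos, radius=8):
    """Return the undecoded bytes around byte offset `pos`."""
    start = max(0, pos - radius)
    return raw[start:pos + radius]

# Hypothetical sample standing in for open('yourfile.xml', 'rb').read()
sample = 'H_C="ΚΙΝΑ" A_C'.encode("utf-8")

# Byte offsets 5..12 hold the attribute value: four Greek capitals,
# two bytes each in UTF-8.
value_bytes = sample[5:13]

print(repr(value_bytes))
print(repr(raw_window(sample, 5, radius=5)))
```

The repr() output shows escaped byte values, so it does not depend on the terminal's code page.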
Jun 27 '08 #2
You won't believe how helpful your reply was. I was looking for a
problem that did not exist.
You wrote: (3) why you think you need to have "data.decode(.....)" at all
and after that: (7) are you expecting non-ASCII characters after H_C= ?
what characters? when you open your xml file in a browser, what do you
see there?
So I went back to see why I was doing this in the first place (I couldn't
remember after struggling for so many hours) and opened the file in
Internet Explorer.
The browser wouldn't open it because it didn't like the encoding declared
in the <?xml tag:
"System does not support the specified encoding. Error processing
resource 'http://scores24live.com/xml/live.xml'. Line 1, ..."
(IE was the only program that complained; Firefox and some other tools
opened it without hassle.)

Then I went back and looked for the original message that got me
struggling, and it was this:
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

From then on it was easy to see that the encoding declared in the XML
was wrong:
<?xml version="1.0" encoding="utf8"?>

When I switched that to:
<?xml version="1.0" encoding="utf-8"?>

everything just worked.
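
A rough Python 3 sketch of that fix, for anyone who cannot edit the file at its source. The names fix_declared_encoding and parse_xml_bytes and the sample document are hypothetical, and whether a current expat build still rejects "utf8" is not something this thread establishes; rewriting the label to the registered name "utf-8" should be harmless either way. Note that the raw bytes go straight to the parser, with no manual decode() step, which is what question (3) was hinting at.

```python
import re
from xml.etree import ElementTree as ET

# Matches encoding="utf8" (either quote style, any case) inside an XML
# declaration at the very start of the document.
_DECL = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])utf8\2', re.IGNORECASE)

def fix_declared_encoding(raw):
    """Rewrite a non-standard 'utf8' label in the declaration to 'utf-8'."""
    return _DECL.sub(rb'\g<1>\g<2>utf-8\g<2>', raw, count=1)

def parse_xml_bytes(raw):
    """Parse raw bytes; the parser decodes them using the declaration."""
    return ET.fromstring(fix_declared_encoding(raw))

# Hypothetical one-element document with the mislabelled declaration.
sample = '<?xml version="1.0" encoding="utf8"?><m H_C="ΚΙΝΑ" A_C="x"/>'.encode("utf-8")
root = parse_xml_bytes(sample)
value = root.get("H_C")  # a text string, already decoded by the parser
```

For a file on disk, the same helper could be fed open('live.xml', 'rb').read().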

I can't thank you enough for opening my eyes...

PS: The UnicodeEncodeError must have something to do with Java's UTF-8
implementation (the XML is produced by Dom4j on a J2EE server).
The characters I posted in the original message should have read "ΚΙΝΑ"
(China in Greek), but after I copy-pasted them into the post they came up
like this: H_C="����" A_C, which is weird because this page is UTF-8
encoded, which means that characters should be 1 or 2 bytes long.
From the message you can see that instead of 4 characters it reads 8,
which means there was extra information in the string.

If the above is true then it might be something for the Python developers
to address in the language. If someone wishes to investigate further,
here are the links for info on Java's UTF-8 and the file that caused the
UnicodeEncodeError:
http://en.wikipedia.org/wiki/UTF-8 (the java section)
http://java.sun.com/javase/6/docs/ap...modified-utf-8

the xml file : http://dsigned.gr/live.xml
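
One possible reading of the PS, which nothing in this thread confirms: the original error reports a run of four positions (164-167), which would match four correctly decoded Greek letters failing to encode into cp1252, and "8 instead of 4" is what ordinary UTF-8 gives when its bytes are viewed one byte per character. The Python 3 sketch below only illustrates that standard behaviour; it says nothing about Java's modified UTF-8, and the helper name encode_error is made up.

```python
greek = "ΚΙΝΑ"  # the four letters the poster expected to see

def encode_error(text, codec):
    """Return (start, end, reason) if `text` cannot be encoded, else None."""
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.start, exc.end, exc.reason
    return None

# cp1252 (Western European) has no Greek capitals, so encoding fails.
cp1252_failure = encode_error(greek, "cp1252")

# In UTF-8 each of these letters takes two bytes: 4 characters, 8 bytes.
utf8_bytes = greek.encode("utf-8")

# Reading those bytes with a single-byte codec yields 8 characters.
misread = utf8_bytes.decode("latin-1")

print(cp1252_failure, len(utf8_bytes), len(misread))
```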

On Jun 8, 3:50 am, John Machin <sjmac...@lexicon.net> wrote:
[...]
Jun 27 '08 #3
