UnicodeEncodeError while reading xml file (newbie question)

I just spent a whole day trying to read an xml file and I got stuck
with the following error:

Exception Type: UnicodeEncodeError
Exception Value: 'charmap' codec can't encode characters in position
164-167: character maps to <undefined>
Exception Location: C:\Python25\lib\encodings\cp1252.py in encode,
line 12

The string that could not be encoded/decoded was: H_C="�� ��" A_C

After some tests I can say with confidence that the error comes up
when Python finds those Greek characters after H_C="

The code that reads the file goes like this:

from xml.etree import ElementTree as ET

def read_xml(request):
    data = open('live.xml', 'r').read()
    data = data.decode('utf-8', 'replace')
    data = ET.XML(data)

I've tried all the combinations of str.decode str.encode I could
think of but nothing.

Can someone please help ?
Jun 27 '08 #1
Perhaps, with some more information:
(1) the *full* traceback
(2) what encoding is mentioned up the front of the XML file
(3) why you think you need to have "data.decode(...)" at all
(4) why you think your input file is encoded in utf8 [which may be
answered by (2)]
(5) why you are using 'replace' (which would cover up (for a while)
any non-utf8 characters in your file)
(6) what "those greek characters" *really* are -- after fiddling with
encodings in my browser the best I can make of that is four capital
gamma characters each followed by a garbage byte or a '?'. Do
something like:

print repr(open('yourfile.xml', 'rb').read()[before_pos:after_pos])

(7) are you expecting non-ASCII characters after H_C= ? what
characters? when you open your xml file in a browser, what do you see
there?
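
For anyone following along, the one-liner in (6) can be wrapped in a small helper. This is only a sketch, written for Python 3 (the thread itself is on Python 2.5); the name show_raw_bytes and its parameters are made up for illustration.

```python
def show_raw_bytes(path, before_pos, after_pos):
    """Return the repr of the raw bytes found at path[before_pos:after_pos].

    Hypothetical helper: the file is opened in binary mode ('rb'), so no
    decoding and no newline translation happens before we look at it.
    """
    with open(path, 'rb') as handle:
        raw = handle.read()
    return repr(raw[before_pos:after_pos])


# Usage (the file name and the offsets are placeholders):
#   print(show_raw_bytes('yourfile.xml', 150, 180))
```

The repr shows every non-ASCII byte as a \xNN escape, so what is really in the file can be compared with what the browser or the console chooses to display.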
Jun 27 '08 #2
You won't believe how helpful your reply was. I was looking for a
problem that did not exist.

You wrote: (3) why you think you need to have "data.decode(...)" at all
and after that: (7) are you expecting non-ASCII characters after H_C= ?
what characters? when you open your xml file in a browser, what do you
see there?

So I went back to see why I was doing this in the first place (I
couldn't remember after struggling for so many hours) and I opened the
file in Internet Explorer. The browser wouldn't open it because it
didn't like the encoding declared in the <?xml tag:

"System does not support the specified encoding. Error processing
resource 'http://scores24live.com/xml/live.xml'. Line 1, ..."

(IE was the only program that complained; FF and some other tools
opened it without hassle.)

Then I went back and looked for the original message that got me
struggling, and it was this:

xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

From then on it was easy to see that it was the xml encoding that was
wrong:

<?xml version="1.0" encoding="utf8" ?>

When I switched that to:

<?xml version="1.0" encoding="utf-8"?>

everything just worked.
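
If the declaration cannot be fixed at the source, the same repair could be applied to the bytes before parsing. What follows is a rough sketch for Python 3, not what I actually ran; ALIASES and normalize_declared_encoding are hypothetical names, and the alias table only covers the one spelling seen in this thread.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical alias table: only the spelling from this thread is listed.
ALIASES = {b'utf8': b'utf-8'}

# Matches the encoding pseudo-attribute of an XML declaration that starts the document.
_DECL_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def normalize_declared_encoding(raw):
    """Rewrite a non-standard encoding name in the XML declaration of raw bytes."""
    def _fix(match):
        prefix, quote, name = match.group(1), match.group(2), match.group(3)
        canonical = ALIASES.get(name.lower(), name)  # unknown names are left alone
        return prefix + quote + canonical + quote

    return _DECL_ENCODING.sub(_fix, raw, count=1)


# Demo with an in-memory document instead of live.xml.
sample = '<?xml version="1.0" encoding="utf8" ?><m H_C="\u039a\u0399\u039d\u0391"/>'.encode('utf-8')
fixed = normalize_declared_encoding(sample)
root = ET.fromstring(fixed)  # parse the bytes directly, no manual decode step
print(fixed[:40])
print(root.get('H_C'))
```

Note that the bytes go straight to the parser, which picks the decoder from the declaration, so the data.decode(...) call from the original code is not needed.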

I can't thank you enough for opening my eyes...

PS.: The UnicodeEncodeError must have something to do with Java's
UTF-8 implementation (the xml is produced by Dom4j on a J2EE server).
Those characters I posted in the original message should have read
"ΚΙΝΑ" (China in Greek), but after I copy-pasted them into the post
they came up like this: H_C="�� ��" A_C, which is weird because this
page is UTF-8 encoded, which means that characters should be 1 or 2
bytes long. From the message you can see that instead of 4 characters
it reads 8, which means there was extra information in the string.
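
The 4-versus-8 count can be checked with a few lines of Python 3. This is only a sketch of one possible reading that nobody in the thread has confirmed: the four Greek letters take two bytes each in UTF-8, and the cp1252 codec named in the traceback has no mapping for them. The helper name describe and its target parameter are made up for illustration.

```python
def describe(text, target='cp1252'):
    """Return (character count, UTF-8 byte count, can `target` encode the text?)."""
    utf8_bytes = text.encode('utf-8')
    try:
        text.encode(target)
        encodable = True
    except UnicodeEncodeError:
        # Same exception type as in the original traceback.
        encodable = False
    return len(text), len(utf8_bytes), encodable


# '\u039a\u0399\u039d\u0391' is the word "ΚΙΝΑ" written with escapes.
print(describe('\u039a\u0399\u039d\u0391'))  # -> (4, 8, False)
```

If this reading is right, the 8 units are simply the UTF-8 bytes of 4 characters, and the error would come from something trying to encode the parsed text into Windows-1252 for output. Where exactly that happened is not shown in the thread.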

If the above is true then it might be something for Python developers
to address in the language. If someone wishes to investigate further,
here are the links for info on Java's UTF-8 and the file that caused
the UnicodeEncodeError:
http://en.wikipedia.org/wiki/UTF-8 (the java section)
http://java.sun.com/javase/6/docs/ap...modified-utf-8

the xml file : http://dsigned.gr/live.xml

Jun 27 '08 #3
