usage of <string>.encode('utf-8','xmlcharrefreplace')?

J Peyret

Well, as usual I am confused by unicode encoding errors.

I have a string with problematic characters in it which I'd like to
put into a postgresql table.
That results in a postgresql error so I am trying to fix things with
<string>.encode

>>> s = 'he Company\xef\xbf\xbds ticker'
>>> print s
he Companyï¿½s ticker
>>>

Trying for an encode:

>>> print s.encode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

OK, that's pretty much as expected, I know this is not valid utf-8.
But I should be able to fix this with the errors parameter of the
encode method.

>>> error_replace = 'xmlcharrefreplace'
>>> print s.encode('utf-8', error_replace)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

Same exact error I got without the errors parameter.

Did I mistype the error handler name? Nope.

>>> codecs.lookup_error(error_replace)
<built-in function xmlcharrefreplace_errors>

Same results with 'ignore' as an error handler.

>>> print s.encode('utf-8', 'ignore')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

And with a bogus error handler:

>>> print s.encode('utf-8', 'bogus')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

This all looks unusually complicated for Python.
Am I missing something incredibly obvious?
How does one use the errors parameter on strings' encode method?

Also, why are the exceptions above complaining about the 'ascii' codec
if I am asking for 'utf-8' conversion?

Version and environment below. Should I try to update my python from
somewhere?

./$ python
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2

Cheers

Feb 19 '08 #1


Carsten Haese

On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote

> OK, that's pretty much as expected, I know this is not valid utf-8.

Actually, the string *is* valid UTF-8, but you're confused about encoding and
decoding. Encoding is the process of turning a Unicode object into a byte
string. Decoding is the process of turning a byte string into a Unicode object.
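As a rough illustration of why the traceback names the 'ascii' codec even though 'utf-8' was requested: in Python 2, calling encode() on a byte string first decodes it implicitly with the default codec (ascii), and that hidden step is what fails. The sketch below uses Python 3 syntax (an assumption; the thread itself is on Python 2.5), where the hidden step has to be written out by hand.

```python
# Python 3 sketch: the byte string from the original post.
s = b'he Company\xef\xbf\xbds ticker'

# Python 2's str.encode() ran the equivalent of this decode first.
# It fails before the 'utf-8' codec or any error handler is reached.
try:
    s.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding with the codec the bytes are actually in succeeds.
text = s.decode('utf-8')
print(repr(text))
```

The error handler passed to encode() never gets a chance to run, which would explain why 'ignore' and even 'bogus' produced the same traceback.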

You need to decode your byte string into a Unicode object, and then encode the
result to a byte string in a different encoding. For example:

>>> s = 'he Company\xef\xbf\xbds ticker'
>>> s.decode("utf-8").encode("ascii", "xmlcharrefreplace")
'he Company&#65533;s ticker'
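For comparison, here is a minimal sketch of the same decode-then-encode recipe in Python 3 syntax (an assumption; the session above is Python 2), trying three of the standard error handlers side by side. The variable names are made up for illustration.

```python
# Python 3 sketch: decode the bytes first, then encode with an error handler.
s = b'he Company\xef\xbf\xbds ticker'
text = s.decode('utf-8')  # bytes -> text; the middle character is U+FFFD

# The errors argument only matters on the encode step, when a character
# cannot be represented in the target encoding (ascii here).
as_xml = text.encode('ascii', 'xmlcharrefreplace')
as_ignored = text.encode('ascii', 'ignore')
as_replaced = text.encode('ascii', 'replace')

print(as_xml)
print(as_ignored)
print(as_replaced)
```

Which handler is appropriate depends on what the database column is supposed to hold; 'ignore' and 'replace' both lose information.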

By the way, whether this is the correct fix for your PostgreSQL error is not
clear, since you kept that error message a secret for some reason. There could
be a better solution than transcoding the string in this way, but we won't
know until you show us the actual error you're trying to fix. At the moment,
it's like showing you the best way to inflate a tire with a hammer.

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net

Feb 19 '08 #2

7stud

To clarify a couple of points:

On Feb 18, 11:38 pm, 7stud <bbxx789_0...@yahoo.com> wrote:

> A unicode string looks like this:
>
> s = u'\u0041'
>
> but your string looks like this:
>
> s = 'he Company\xef\xbf\xbds ticker'
>
> Note that there is no 'u' in front of your string.

That means your string is a regular string.

> If a python function requires a unicode string and a unicode string
> isn't provided..

For example: encode().

One last point: you can't display a unicode string. The very act of
trying to print a unicode string causes it to be converted to a
regular string. If you try to display a unicode string without
explicitly encode()'ing it first, i.e. converting it to a regular
string using a specified secret code--a so called 'codec', python will
implicitly attempt to convert the unicode string to a regular string
using the default codec, which is usually set to ascii.
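The conversion described above can be spelled out by hand. This is only a sketch in Python 3 syntax (an assumption; the thread is about Python 2.5), with the ascii step made explicit rather than implicit:

```python
# Python 3 sketch: a text (unicode) string containing a non-ASCII character.
text = 'he Company\ufffds ticker'

# Converting it with the ascii codec fails, because U+FFFD has no
# ASCII representation.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Choosing a codec explicitly that can represent the character works.
encoded = text.encode('utf-8')

print(failed)
print(encoded)
```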

Feb 19 '08 #3

J Peyret

On Feb 18, 10:54 pm, 7stud <bbxx789_0...@yahoo.com> wrote:

> One last point: you can't display a unicode string. The very act of
> trying to print a unicode string causes it to be converted to a
> regular string. If you try to display a unicode string without
> explicitly encode()'ing it first, i.e. converting it to a regular
> string using a specified secret code--a so called 'codec', python will
> implicitly attempt to convert the unicode string to a regular string
> using the default codec, which is usually set to ascii.

Yes, the string above was obtained by printing, which got it into
ASCII format, as you picked up.
Something else to watch out for when posting unicode issues.

The solution I ended up with was

1) Find out the encoding in the data file.

In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
the bottom of the save prompt dialog.

ISO-8859-15 in my case.

2) Look up encoding corresponding to ISO-8859-15 at

http://docs.python.org/lib/standard-encodings.html

3) Applying the decode/encode recipe suggested previously, for which I
do understand the reason now.

#converting rawdescr
#from ISO-8859-15 (from the file)
#to UTF-8 (what postgresql wants)
#no error handler required.
decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')

postgresql insert is done using decodeddescr variable.

Postgresql is happy, I'm happy.
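The recipe above can be wrapped in a small helper. This is a sketch in Python 3 syntax (an assumption; the post uses Python 2), and both the transcode() name and the sample bytes are made up for illustration:

```python
def transcode(raw, source_encoding, target_encoding='utf-8'):
    """Decode raw bytes from source_encoding, then re-encode them."""
    return raw.decode(source_encoding).encode(target_encoding)


# Made-up sample data: byte 0xA4 is the euro sign in ISO-8859-15.
rawdescr = b'price: 10 \xa4'
decodeddescr = transcode(rawdescr, 'iso8859_15')

print(decodeddescr)
```

No error handler is needed here because every byte value is defined in ISO-8859-15 and UTF-8 can represent every resulting character.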

Feb 19 '08 #4

7stud

On Feb 19, 12:15 am, J Peyret <jpey...@gmail.com> wrote:

> #converting rawdescr
> #from ISO-8859-15 (from the file)
> #to UTF-8 (what postgresql wants)
> #no error handler required.
> decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')
>
> postgresql insert is done using decodeddescr variable.
>
> Postgresql is happy, I'm happy.

Or, you can cheat. If you are reading from a file, you can set it up
so that any string you read from the file is automatically converted
from its encoding to another encoding. You don't even have to be aware
of the fact that a regular string has to be converted into a unicode
string before it can be converted to a regular string with a
different encoding. Check out the codecs module and the EncodedFile()
function:

import codecs

s = 'he Company\xef\xbf\xbds ticker'

# Write the raw bytes to a file to stand in for the data file.
f = open('data2.txt', 'w')
f.write(s)
f.close()

f = open('data2.txt')
# Arguments: file, new encoding, file's encoding.
f_special = codecs.EncodedFile(f, 'utf-8', 'iso8859_15')
# If your display device understands utf-8, you will see the
# troublesome character displayed.
# Are you sure that character is legitimate?
print f_special.read()

f.close()
f_special.close()
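For readers on Python 3, a rough equivalent of the EncodedFile() approach might look like the sketch below. It assumes an in-memory io.BytesIO object in place of a real file, and the sample bytes are made up; codecs.EncodedFile() itself is the same standard-library function used above.

```python
import codecs
import io

# Made-up ISO-8859-15 bytes standing in for the data file:
# 0xE9 is e-acute and 0xA4 is the euro sign in that encoding.
raw = io.BytesIO(b'caf\xe9 10 \xa4')

# Arguments: file, encoding of the data handed back, encoding of the file.
recoded = codecs.EncodedFile(raw, 'utf-8', 'iso8859_15')

# read() returns bytes that have been transcoded to UTF-8 on the fly.
data = recoded.read()
recoded.close()

print(data)
```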

Feb 19 '08 #5
