Customizing character set conversions with an error handler

Jukka Aho

When converting Unicode strings to legacy character encodings, it is
possible to register a custom error handler that will catch and process
all code points that do not have a direct equivalent in the target
encoding (as described in PEP 293).

Note that the error handler itself is required to return the
substitutions as Unicode strings, not as byte strings in the target
encoding. Some lower-level machinery will then silently convert these
strings to the target encoding.

That works, of course, only if the substitution _itself_ doesn't
contain code points that are illegal in the target encoding.

Which brings us to the point: if my error handler for some reason
returns illegal substitutions (from the viewpoint of the target
encoding), how can I catch _these_ errors and make things good again?

I thought it would work automatically, by calling the error handler as
many times as necessary, and letting it work out the situation, but it
apparently doesn't. Sample code follows:

--- 8< ---

#!/usr/bin/python

import codecs

# ==================================================================
# Here's our error handler
# ==================================================================

def charset_conversion(error):

    # error.object = The original unicode string we're trying to
    #                process and which has characters for which
    #                there is no mapping in the built-in tables.
    #
    # error.start  = The index position in which the error
    #                occurred in the string
    #
    # (See PEP 293 for more information)

    # Here's our simple conversion table:

    table = {
        u"\u2022": u"\u00b7",  # "BULLET" to "MIDDLE DOT"
        u"\u00b7": u"*"        # "MIDDLE DOT" to "ASTERISK"
    }

    try:

        # If we can find the character in our conversion table,
        # let's make a substitution

        substitution = table[error.object[error.start]]

    except KeyError:

        # Okay, the character wasn't in our substitution table.
        # There's nothing we can do. Better print out its
        # unicode codepoint as a hex string instead:

        substitution = u"[U+%04x]" % ord(error.object[error.start])

    # Return the substituted string and let the built-in codec
    # continue from the next position:

    return (substitution, error.start + 1)

# ==================================================================
# Register the above-defined error handler with the name 'practical'
# ==================================================================

codecs.register_error('practical', charset_conversion)

# ==================================================================
# TEST
# ==================================================================

if __name__ == "__main__":

    print

    # Here's our test string: Three BULLET symbols, a space,
    # the word "TEST", a space again, and three BULLET symbols
    # again.

    test = u"\u2022\u2022\u2022 TEST \u2022\u2022\u2022"

    # Let's see how we can print it out with our new error
    # handler - in various encodings.

    # The following works - it just converts the internal
    # Unicode representation of the above-defined string
    # to UTF-8 without ever hitting the custom error handler:

    print "  UTF-8: " + test.encode('utf-8', 'practical')

    # The next one works, too - it converts the Unicode
    # "BULLET" symbols to Latin 1 "MIDDLE DOTs":

    print "Latin 1: " + test.encode('iso-8859-1', 'practical')

    # This works as well - it converts the Unicode "BULLET"
    # symbols to IBM Codepage 437 "MIDDLE DOTs":

    print " CP 437: " + test.encode('cp437', 'practical')

    # The following doesn't work. It should convert the
    # Unicode "BULLET" symbols to "ASTERISKS" by calling
    # the error handler two times - first time substituting
    # the BULLET with the MIDDLE DOT, then finding out
    # that that doesn't work for ASCII either, and falling
    # back to a yet simpler form (by calling the error
    # handler again, which will this time substitute the
    # MIDDLE DOT with the ASTERISK) - but apparently it
    # doesn't work that way. We'll get a
    # UnicodeEncodeError instead.

    print "  ASCII: " + test.encode('ascii', 'practical')

    # So the question becomes: how can I make this work
    # in a graceful manner?

--- 8< ---
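
For readers on a current interpreter, here is a minimal reproduction of the same failure. It is only a sketch and assumes Python 3 (the sample above is Python 2). The handler name "demo_middle_dot" and the function to_middle_dot are made up for this illustration. The handler always answers with MIDDLE DOT, which Latin 1 can represent and ASCII cannot:

```python
import codecs

def to_middle_dot(error):
    # Hypothetical handler: always substitute MIDDLE DOT (U+00B7)
    # and resume encoding at the next character.
    return ("\u00b7", error.start + 1)

# The handler name is an assumption made for this sketch.
codecs.register_error("demo_middle_dot", to_middle_dot)

text = "\u2022 TEST"

# Latin 1 contains U+00B7, so the substitution can be encoded.
latin1 = text.encode("iso-8859-1", "demo_middle_dot")

# ASCII does not contain U+00B7. The codec does not hand the
# substitution back to the handler; it raises instead.
try:
    text.encode("ascii", "demo_middle_dot")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The substitution returned by a handler is apparently encoded strictly, so an unencodable substitution surfaces as a UnicodeEncodeError rather than as a second call to the handler.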

--
znark

Mar 12 '06 #1

Serge Orlov

Jukka Aho wrote:

# So the question becomes: how can I make this work
# in a graceful manner?

Replace the return statement with this code:

return (substitution.encode(error.encoding, "practical").decode(
    error.encoding), error.start + 1)
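
The round trip above can be sketched as a complete handler. This is an illustration under assumptions, not the poster's exact code: it is written for Python 3, and the names TABLE, recursive_handler and "demo_recursive" are invented here. Encoding the substitution with the same handler makes the handler re-enter itself until the result fits the target encoding, and decoding turns the bytes back into the string that the codec expects:

```python
import codecs

# Same fallback chain as in the original post.
TABLE = {
    "\u2022": "\u00b7",  # BULLET -> MIDDLE DOT
    "\u00b7": "*",       # MIDDLE DOT -> ASTERISK
}

def recursive_handler(error):
    char = error.object[error.start]
    # Table lookup, with the hex notation as a last resort.
    substitution = TABLE.get(char, "[U+%04x]" % ord(char))
    # Encode the substitution with this very handler, so that any
    # unencodable substitute is replaced in turn, then decode the
    # bytes back to a string for the codec.
    encodable = substitution.encode(
        error.encoding, "demo_recursive").decode(error.encoding)
    return (encodable, error.start + 1)

# The handler name is an assumption made for this sketch.
codecs.register_error("demo_recursive", recursive_handler)

test = "\u2022\u2022\u2022 TEST \u2022\u2022\u2022"
ascii_result = test.encode("ascii", "demo_recursive")
latin1_result = test.encode("iso-8859-1", "demo_recursive")
```

Two caveats seem worth keeping in mind. A cycle in the table (A maps to B and B maps back to A) would recurse without end. Also, for table-driven codecs such as cp437, error.encoding appears to report the generic name "charmap" rather than the codec's own name, so the round trip may not be checked against the real target encoding there.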

-- Serge

Mar 12 '06 #2

Jukka Aho

Serge Orlov wrote:

# So the question becomes: how can I make this work
# in a graceful manner?
Replace the return statement with this code:

return (substitution.encode(error.encoding, "practical").decode(
    error.encoding), error.start + 1)

Thanks, that was quite a neat recursive solution. :) I wouldn't have
thought of that.

I ended up doing it without the recursion, by testing the individual
substitute code points with .encode() within the handler, and catching
the possible exceptions:

--- 8< ---

# This is our original problematic code point:
c = error.object[error.start]

while True:

    # Search for a substitute code point in
    # our table:

    c = table.get(c)

    # If a substitute wasn't found, convert the original code
    # point into a hexadecimal string representation of itself
    # and exit the loop.

    if c is None:
        c = u"[U+%04x]" % ord(error.object[error.start])
        break

    # A substitute was found, but we're not sure if it is OK
    # for our target encoding. Let's check:

    try:
        c.encode(error.encoding, 'strict')
        # No exception; everything was OK, we
        # can break off from the loop now
        break

    except UnicodeEncodeError:
        # The mapping that was found in the table was not
        # OK for the target encoding. Let's loop and try
        # again; there might be a better (more generic)
        # substitution in the chain waiting for us.
        pass

--- 8< ---
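
The loop above is only a fragment of the handler. A self-contained version might look like the following sketch. It assumes Python 3, and the names TABLE, iterative_handler and "demo_iterative" are hypothetical. The return statement is carried over from the original handler in the first post:

```python
import codecs

# Same fallback chain as in the original post.
TABLE = {
    "\u2022": "\u00b7",  # BULLET -> MIDDLE DOT
    "\u00b7": "*",       # MIDDLE DOT -> ASTERISK
}

def iterative_handler(error):
    original = error.object[error.start]
    candidate = original
    while True:
        # Follow the chain one step.
        candidate = TABLE.get(candidate)
        if candidate is None:
            # End of the chain: fall back to the hex notation
            # of the original code point.
            candidate = "[U+%04x]" % ord(original)
            break
        try:
            # Test the candidate against the target encoding.
            candidate.encode(error.encoding, "strict")
            break
        except UnicodeEncodeError:
            # Not encodable; try the next substitute in the chain.
            continue
    return (candidate, error.start + 1)

# The handler name is an assumption made for this sketch.
codecs.register_error("demo_iterative", iterative_handler)

test = "\u2022\u2022\u2022 TEST \u2022\u2022\u2022"
ascii_result = test.encode("ascii", "demo_iterative")
latin1_result = test.encode("iso-8859-1", "demo_iterative")
unknown_result = "\u20ac".encode("ascii", "demo_iterative")
```

Unlike the recursive variant, this version never re-enters the codec machinery with the custom handler. Like that variant, it would loop without end if the table contained a cycle whose entries are all unencodable.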

--
znark

Mar 14 '06 #3
