Customizing character set conversions with an error handler

When converting Unicode strings to legacy character encodings, it is
possible to register a custom error handler that will catch and process
all code points that do not have a direct equivalent in the target
encoding (as described in PEP 293).

The thing to note here is that the error handler itself is required to
return the substitutions as Unicode strings - not as the target encoding
bytestrings. Some lower-level gadgetry will silently convert these
strings to the target encoding.

That is, if the substitution _itself_ doesn't contain illegal code
points for the target encoding.
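
To make that contract concrete, here is a minimal sketch written for Python 3 (the sample code in this thread is Python 2). The handler name `to_question_mark` and the registered name `qmark` are made up for illustration; only `codecs.register_error` and the exception attributes come from the standard library.

```python
import codecs

def to_question_mark(error):
    # Hypothetical handler: it only deals with encoding errors.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # Return a (replacement, resume position) tuple. The replacement is a
    # text string; the codec machinery encodes it to the target encoding.
    replacement = "?" * (error.end - error.start)
    return (replacement, error.end)

codecs.register_error("qmark", to_question_mark)

encoded = "a\u2022b".encode("ascii", "qmark")
print(encoded)
```

The handler never produces bytes itself; it only says which text should stand in for the offending span and where encoding should resume.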

Which brings us to the point: if my error handler for some reason
returns illegal substitutions (from the viewpoint of the target
encoding), how can I catch _these_ errors and make things good again?

I thought it would work automatically, by calling the error handler as
many times as necessary, and letting it work out the situation, but it
apparently doesn't. Sample code follows:

--- 8< ---

#!/usr/bin/python

import codecs

# ==================================================================
# Here's our error handler
# ==================================================================

def charset_conversion(error):

    # error.object = The original unicode string we're trying to
    #                process and which has characters for which
    #                there is no mapping in the built-in tables.
    #
    # error.start  = The index position in which the error
    #                occurred in the string
    #
    # (See PEP 293 for more information)

    # Here's our simple conversion table:

    table = {
        u"\u2022": u"\u00b7",  # "BULLET" to "MIDDLE DOT"
        u"\u00b7": u"*"        # "MIDDLE DOT" to "ASTERISK"
    }

    try:

        # If we can find the character in our conversion table,
        # let's make a substitution

        substitution = table[error.object[error.start]]

    except KeyError:

        # Okay, the character wasn't in our substitution table.
        # There's nothing we can do. Better print out its
        # unicode codepoint as a hex string instead:

        substitution = u"[U+%04x]" % ord(error.object[error.start])

    # Return the substituted string and let the built-in codec
    # continue from the next position:

    return (substitution, error.start+1)

# ==================================================================
# Register the above-defined error handler with the name 'practical'
# ==================================================================

codecs.register_error('practical', charset_conversion)

# ==================================================================
# TEST
# ==================================================================

if __name__ == "__main__":

    print

    # Here's our test string: Three BULLET symbols, a space,
    # the word "TEST", a space again, and three BULLET symbols
    # again.

    test = u"\u2022\u2022\u2022 TEST \u2022\u2022\u2022"

    # Let's see how we can print it out with our new error
    # handler - in various encodings.

    # The following works - it just converts the internal
    # Unicode representation of the above-defined string
    # to UTF-8 without ever hitting the custom error handler:

    print "  UTF-8: "+test.encode('utf-8','practical')

    # The next one works, too - it converts the Unicode
    # "BULLET" symbols to Latin 1 "MIDDLE DOTs":

    print "Latin 1: "+test.encode('iso-8859-1','practical')

    # This works as well - it converts the Unicode "BULLET"
    # symbols to IBM Codepage 437 "MIDDLE DOTs":

    print " CP 437: "+test.encode('cp437','practical')

    # The following doesn't work. It should convert the
    # Unicode "BULLET" symbols to "ASTERISKS" by calling
    # the error handler two times - first time substituting
    # the BULLET with the MIDDLE DOT, then finding out
    # that that doesn't work for ASCII either, and falling
    # back to a yet simpler form (by calling the error
    # handler again, which will this time substitute the
    # MIDDLE DOT with the ASTERISK) - but apparently it
    # doesn't work that way. We'll get a
    # UnicodeEncodeError instead.

    print "  ASCII: "+test.encode('ascii','practical')

    # So the question becomes: how can I make this work
    # in a graceful manner?

--- 8< ---
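
For anyone trying this on a current interpreter, the following is a rough Python 3 sketch of the same situation, not the code posted above. The names `TABLE`, `one_step` and `one_step_demo` are invented for the example. It shows that a substitution which is itself unencodable makes the codec raise instead of calling the handler a second time.

```python
import codecs

TABLE = {
    "\u2022": "\u00b7",  # BULLET -> MIDDLE DOT
    "\u00b7": "*",       # MIDDLE DOT -> ASTERISK
}

def one_step(error):
    # Single table lookup, as in the handler above (Python 3 spelling).
    ch = error.object[error.start]
    replacement = TABLE.get(ch, "[U+%04x]" % ord(ch))
    return (replacement, error.start + 1)

codecs.register_error("one_step_demo", one_step)

test = "\u2022 TEST"

# Latin 1 can represent MIDDLE DOT, so one substitution is enough.
latin1 = test.encode("iso-8859-1", "one_step_demo")

try:
    test.encode("ascii", "one_step_demo")
    ascii_failed = False
except UnicodeEncodeError:
    # The MIDDLE DOT returned by the handler is not ASCII, and the
    # handler is not consulted again for it.
    ascii_failed = True

print(latin1, ascii_failed)
```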

--
znark

Mar 12 '06 #1
Jukka Aho wrote:
# So the question becomes: how can I make this work
# in a graceful manner?


Replace the return statement with this code:

return (substitution.encode(error.encoding, "practical").decode(
    error.encoding), error.start+1)
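
As a self-contained illustration of this recursive idea, here is a sketch in Python 3 syntax. The names `TABLE`, `recursive_handler` and `recursive_demo` are placeholders chosen for the example. Note that a table containing a cycle would recurse without end, so this assumes every substitution chain terminates.

```python
import codecs

TABLE = {
    "\u2022": "\u00b7",  # BULLET -> MIDDLE DOT
    "\u00b7": "*",       # MIDDLE DOT -> ASTERISK
}

def recursive_handler(error):
    ch = error.object[error.start]
    replacement = TABLE.get(ch, "[U+%04x]" % ord(ch))
    # Encode the replacement with this same handler. Any character the
    # target encoding still cannot represent triggers a nested call,
    # which walks one step further down the substitution chain.
    safe = replacement.encode(error.encoding, "recursive_demo")
    # Decode back to text, because the codec expects a string here.
    return (safe.decode(error.encoding), error.start + 1)

codecs.register_error("recursive_demo", recursive_handler)

test = "\u2022\u2022\u2022 TEST \u2022\u2022\u2022"
result = test.encode("ascii", "recursive_demo")
print(result)
```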

-- Serge

Mar 12 '06 #2
Serge Orlov wrote:
# So the question becomes: how can I make this work
# in a graceful manner?
Replace the return statement with this code:

return (substitution.encode(error.encoding, "practical").decode(
    error.encoding), error.start+1)


Thanks, that was quite a neat recursive solution. :) I wouldn't have
thought of that.

I ended up doing it without the recursion, by testing the individual
problematic code points with .encode() within the handler, and catching
the possible exceptions:

--- 8< ---

# This is our original problematic code point:
c = error.object[error.start]

while True:

    # Search for a substitute code point in
    # our table:

    c = table.get(c)

    # If a substitute wasn't found, convert the original code
    # point into a hexadecimal string representation of itself
    # and exit the loop.

    if c is None:
        c = u"[U+%04x]" % ord(error.object[error.start])
        break

    # A substitute was found, but we're not sure if it is OK
    # for our target encoding. Let's check:

    try:
        c.encode(error.encoding, 'strict')
        # No exception; everything was OK, we
        # can break off from the loop now
        break

    except UnicodeEncodeError:
        # The mapping that was found in the table was not
        # OK for the target encoding. Let's loop and try
        # again; there might be a better (more generic)
        # substitution in the chain waiting for us.
        pass

--- 8< ---
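
Put together as a complete handler, the loop might look roughly like the sketch below. This is a Python 3 rendering under assumed names (`TABLE`, `chained_handler`, `chained_demo`), not the exact code used in the thread, and like the loop above it assumes the table has no cycles.

```python
import codecs

TABLE = {
    "\u2022": "\u00b7",  # BULLET -> MIDDLE DOT
    "\u00b7": "*",       # MIDDLE DOT -> ASTERISK
}

def chained_handler(error):
    original = error.object[error.start]
    candidate = original
    while True:
        candidate = TABLE.get(candidate)
        if candidate is None:
            # End of the chain: fall back to a hex representation
            # of the original code point.
            candidate = "[U+%04x]" % ord(original)
            break
        try:
            # Probe the candidate against the target encoding.
            candidate.encode(error.encoding, "strict")
            break
        except UnicodeEncodeError:
            # Not representable; try the next substitution in the chain.
            continue
    return (candidate, error.start + 1)

codecs.register_error("chained_demo", chained_handler)

test = "\u2022 TEST \u2603"  # BULLET, then SNOWMAN (not in the table)
ascii_result = test.encode("ascii", "chained_demo")
latin1_result = test.encode("iso-8859-1", "chained_demo")
print(ascii_result, latin1_result)
```

Compared with the recursive version, the probing happens with the 'strict' handler, so the handler never re-enters itself.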

--
znark

Mar 14 '06 #3
