utf-8 encoding issue

The line below looks up the name "öttinger" (with the German umlaut) of
an author using the mysql console:

mysql> select author from records where author like '%&#xD6;ttinger%';

This successfully finds all entries in the records database where
"öttinger" is the author or the co-author.

In a web form, the user enters "öttinger" and wants to search with this
search string. My idea is now to convert the search string (which also
could be e.g. some cyrillic text) into unicode and then to utf-8:

unicode(search_string).encode('utf-8')

This gives me the utf-8 encoded version of the string but not yet in the
correct representation. How can I get the correct one (is this the hex
version? I don't know the correct terminology.)?

In short: how do I e.g. convert a string containing a "ö" into a string
containing a "%&#xD6;"?

Regards,
Marc
Jul 18 '05 #1
Marc Petitmermet wrote:
> In a web form, the user enters "öttinger" and wants to search with this
> search string. My idea is now to convert the search string (which also
> could be e.g. some cyrillic text) into unicode and then to utf-8:
>
> unicode(search_string).encode('utf-8')
>
> This gives me the utf-8 encoded version of the string but not yet in the
> correct representation. How can I get the correct one (is this the hex
> version? I don't know the correct terminology.)?
>
> In short: how do I e.g. convert a string containing a "ö" into a string
> containing a "%&#xD6;"?


that's not UTF-8, that's HTML/XML-style charrefs.

if mysql translates the charrefs to unicode characters, you can simply
use:

s = u.encode("ascii", "xmlcharrefreplace")

where "u" is a unicode string.
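to make the difference between the two representations concrete, here is a minimal sketch. it assumes Python 3 (where every str is already unicode, so the unicode() call from the question is not needed), and the sample name is taken from the question:

```python
# Sample value from the question; \u00d6 is the capital O with umlaut.
name = "\u00d6ttinger"

# UTF-8 is a byte encoding: the umlaut becomes the two bytes C3 96.
utf8_bytes = name.encode("utf-8")

# A charref is ASCII text that spells out the code point (214 decimal).
charref_text = name.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(utf8_bytes)
print(charref_text)
```

the extra .decode("ascii") is only there because encode() returns bytes in Python 3; the result is plain ASCII either way.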

if you've stored charrefs as is in the database, you're in for some
serious trouble. assuming that all charrefs are hexadecimal charrefs,
you can use something like:

import re
def fixup(m): return "&#" + hex(int(m.group(1)))[1:]  # "&#214" -> "&#xd6"
s = re.sub(r"&#(\d+)", fixup, u.encode("ascii", "xmlcharrefreplace"))

to map all non-ASCII characters to charrefs, and then translate all
charrefs to hexadecimal charrefs.
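the same two-step mapping could look roughly like this in Python 3. this is a sketch, not the code from the post: the function name to_hex_charrefs is made up for illustration, and the upper-case hex digits are an assumption about how the charrefs were stored.

```python
import re

def to_hex_charrefs(text):
    """Replace every non-ASCII character with a hexadecimal charref."""
    # Step 1: non-ASCII characters become decimal charrefs such as "&#214;".
    decimal = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: rewrite each decimal charref as a hexadecimal one ("&#xD6;").
    # Assumption: upper-case hex digits; use "%x" if the stored data is lower-case.
    return re.sub(r"&#(\d+);", lambda m: "&#x%X;" % int(m.group(1)), decimal)

# Build a LIKE pattern from a search string entered by the user.
pattern = "%" + to_hex_charrefs("\u00f6ttinger") + "%"
print(pattern)
```

note that a decimal charref already present in the input is rewritten as well, just as in the regular-expression approach above. the pattern should still be passed to the database as a query parameter rather than pasted into the SQL string.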

decoding the charrefs *before* you add the strings to the database
is a better idea, though.
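one possible way to do that decoding, assuming Python 3 and its standard html module (neither is mentioned in the thread), is sketched below. decode_charrefs is a hypothetical helper name, and html.unescape also expands named entities such as &amp;amp;, which may or may not be wanted for a given data set.

```python
import html

def decode_charrefs(text):
    """Turn decimal and hexadecimal charrefs back into real characters."""
    # html.unescape handles "&#214;", "&#xD6;" and named entities alike.
    return html.unescape(text)

# Decode before the value is written to the database, so that a plain
# search for the real character can match it later.
stored_value = decode_charrefs("&#xD6;ttinger")
print(stored_value == "\u00d6ttinger")
```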

</F>


Jul 18 '05 #2
