Dealing with "funny" characters

sophie_newbie

Hi, I want to store python text strings that characters like "Ã©" "ÄŒ"
in a mysql varchar text field. Now my problem is that mysql does not
seem to accept these characters. I'm wondering if there is any way I
can somehow "encode" these characters to appear as normal characters
and then "decode" them when I want to get them out of the database
again?

-Thanks.

Oct 20 '07 #1

Subscribe Reply

2260

Paul Hankin

On Oct 20, 2:13 pm, sophie_newbie <paulgeele...@gmail.comwrote:

Hi, I want to store python text strings that characters like "Ã©" "ÄŒ"
in a mysql varchar text field. Now my problem is that mysql does not
seem to accept these characters. I'm wondering if there is any way I
can somehow "encode" these characters to appear as normal characters
and then "decode" them when I want to get them out of the database
again?

Use unicode strings, and use the 'encode' and 'decode' methods. Mysql
has utf8 support for varchars, so that seems a likely encoding to use.
If you're stuck with ascii, you can use 'unicode_escape' if you're
only accessing it from python.

But first read this excellent article on 'funny' characters:
http://www.joelonsoftware.com/articles/Unicode.html

--
Paul Hankin

Oct 20 '07 #2

John Nagle

Gert-Jan wrote:

sophie_newbie schreef:
>Hi, I want to store python text strings that characters like "Ã©" "ÄŒ"
in a mysql varchar text field. Now my problem is that mysql does not
seem to accept these characters. I'm wondering if there is any way I
can somehow "encode" these characters to appear as normal characters
and then "decode" them when I want to get them out of the database
again?

It seems you'll have to use Unicode in your program rather than 'plain'
strings.

Before storing an unicode textstring in a database or a file, you must
encode it using an appropriate encoding/codepage, for example:

outputstring = unicodeobject.encode('utf-8')

No, no, that's wrong. MySQL and the Python interface to it understand
Unicode. You don't want to convert data to UTF-8 before putting it in a
database; the database indexing won't work.

Here's how to do it right.

First, tell MySQL, before you create your MySQL tables, that the tables are
to be stored in Unicode:

ALTER database yourdatabasename DEFAULT CHARACTER SET utf8;

You can also do this on a table by table basis, or even for single fields,
but you'll probably get confused if you do.

Then, when you connect to the database in Python, use something like this:

db = MySQLdb.connect(host="localhost",
use_unicode = True, charset = "utf8",
user=username, passwd=password, db=database)

That tells MySQLdb to talk to the database in Unicode, and it tells the database
(via "charset") that you're talking Unicode.

Within Python, you can use Unicode as well. If you have a Unicode text
editor, you can create Python source files in Unicode and have Unicode text
constants in quotes. If you do this, you should put

# -*- coding: UTF-8 -*-

as the first line of Python files. Quoted constants should be written
as

s = u'Test'

rather than

s = 'Test'

Instead of "str()", use "unicode()".

Once everything is set up like this, you can pass Unicode in and out of MySQL
databases freely, and all the SQL commands will work properly on Unicode data.

John Nagle

Oct 20 '07 #3

Diez B. Roggisch

No, no, that's wrong. MySQL and the Python interface to it understand

Unicode. You don't want to convert data to UTF-8 before putting it in a
database; the database indexing won't work.

I doubt that indexing has anything to do with it whatsoever.

Here's how to do it right.

First, tell MySQL, before you create your MySQL tables, that the
tables are
to be stored in Unicode:

ALTER database yourdatabasename DEFAULT CHARACTER SET utf8;

You can also do this on a table by table basis, or even for single fields,
but you'll probably get confused if you do.

Then, when you connect to the database in Python, use something like
this:

db = MySQLdb.connect(host="localhost",
use_unicode = True, charset = "utf8",
user=username, passwd=password, db=database)

That tells MySQLdb to talk to the database in Unicode, and it tells the
database
(via "charset") that you're talking Unicode.

You confuse unicode with utf-8 here. And while this appears to be
nitpicking, it is important to write this small program and meditate the
better part of an hour in front of it running:

while True:
print "utf-8 is not unicode"
You continue to make that error below, so I snip that.

The important part is this: unicode is a standard that aims to provide a
codepoint for each and every character that humankind has invented. And
python unicode objects can also represent all characters one can imagine.

However, unicode as such is an abstraction. Harddisks, network sockets,
databases and the like don't deal with abstractions though - the eat
bytes. Which makes it necessary to encode unicode objects to
byte-strings when serializing them. Thus there are the thingies called
encodings: latin1 for most characters used in westen europe for example.
But it is limited to 256 characters (actually, even less), chinese or
russian customers won't get too happy with them.

So some encodings are defined that are capable of encoding _ALL_ unicode
codepoints. Either by being larger than one byte for each character. Or
by providing escape-mechanisms. The former are e.g. UCS4 (4 bytes per
character), the most important member of the latter is utf-8. Which uses
ascii + escapes to encode all codepoints.

Now what does that mean in python?

First of all, the coding:-declaration: it tells python which encoding to
use when dealing with unicode-literals, which are the

u"something"

thingies. If you use a coding of latin1, that means that the text

u"Ã¶"

is expected to be one byte long, with the proper value that depicts the
german umlaut o in latin1. Which is 0xf6.

If coding: is set to utf-8, the same string has to consist not of one,
but of two bytes: 0xc3 0xb6.

So, when editing files that are supposed to contain "funny" characters,
you have to

- set your editor to save the file in an appropriate encoding

- specify the same encoding in the coding:-declaration

Regarding databases: they store bytes. Mostly. Some allow to store
unicode by means of one of the fixed-size-encodings, but you pay a
storage-size penalty for that.

So - you we're right when you said that one can change the encoding a db
uses, on several levels even.

But that's not all that is to it. Another thing is the encoding the
CONNECTION expects byte-strings to be passed, and will use to render
returned strings in. The conversion from and to the used storage
encoding is done automagically.

It is for example perfectly legal (and unfortunately happens
involuntarily) to have a database that internally uses utf-8 as storage,
potentially being able to store all possible codepoints.

But due to e.g. environmental settings, opened connections will deliver
the contents in e.g. latin1. Which of course will lead to problems if
you try to return data from the table with the topmost chines first names.

So you can alter the encoding the connection delivers and expects
byte-strings in. In mysql, this can be done explcit using

cursor.execute("set names <encoding>")

Or - as you said - as part of a connection-string.

db = MySQLdb.connect(host="localhost",
use_unicode = True, charset = "utf8",
user=username, passwd=password, db=database)
But there is more to it. If the DB-API supports it, then the API itself
will decode the returned strings, using the specified encoding, so that
the user will only deal with "real" unicode-objects, greatly reducing
the risk of mixing byte-strings with unicode-objects. That's what the
use_unicod-parameter is for: it makes the API accept and deliver
unicod-objects. But it would do so even if the charset-parameter was
"latin1". Which makes me repeat the lesson from the beginning:

while True:
print "utf-8 is not unicode"
Diez

Oct 20 '07 #4

Similar topics

12213

"smart" quotes in PHP

by: Martin Goldman | last post by:

Hello all, I've been struggling for a few days with the question of how to convert "smart" (curly) quotes into straight quotes. I tried playing with the htmlentities() function, but all that is...

PHP

4795

Unicode: Combining diacritical "dot above" mark with the capital letter P

by: Barry | last post by:

Hi all, I've noticed a strange error on my website. When I print a capital letter P with a dot above, using & #7766; it appears correctly, but when I use P& #0775 it doesn't. The following...

HTML / CSS

3570

Is "a >= b" equivalent to "a - b >= 0"?

by: hjbortol | last post by:

Hi! Is the expression "a >= b" equivalent to "a - b >= 0" in C/C++? Is this equivalence an IEEE/ANSI rule? Or is this machine/compiler dependent? Any references are welcome! Thanks in...

C / C++

2352

"directory order" - K and R 2 exercise 5-16?

by: G Fernandes | last post by:

Can someone explain what is meant by "directory order" in the questoin for K and R 2 exercise 5-16? I can't seem to find a solution for this exercise on the main site where clc goers have posted...

C / C++

3828

multiple versions of "Extended ASCII characters"(No. 128 to 255)

by: wob | last post by:

Many thanks for those who responded to my question of "putting greek char into C string". In searching for an solution, I noticed that there are more than one version of "Extended ASCII...

C / C++

3356

WebRequest.GetResponse gives "Bad Request"

by: Mange | last post by:

I'm using WebRequest class to access a PHP script, but keep getting a 400 error from the server when calling the GetResponse() method. I have added the url to the script in the bypass list in...

ASP.NET

2837

K&R2 section 1.5.1 - "getchar" problem

by: arnuld | last post by:

i have slightly modified the programme from section 1.5.1 which takes the input frm keyboard and then prints that to the terminal. it just does not run and i am unable to understand the error...

C / C++

3850

"Python" is not a good name, should rename to "Athon"

by: ureuffyrtu955 | last post by:

Python is a good programming language, but "Python" is not a good name. First, python also means snake, Monty Python. If we search "python" in google, emule, many results are not programming...

Python

2519

"Access is denied" in IE on node.focus()

by: Jeff Bigham | last post by:

Hi, I'm getting an "Access is denied" error in IE when I try to focus an input box with node.focus() My understanding is that should only happen when the domains are different of the...

Javascript

7136

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

7344

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

7505

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

5060

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

4730

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

3216

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...

Networking - Hardware / Configuration

3203

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

775

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

441

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

General