Understanding Unicode & encodings

Raphael.Benedet

Hello,

For my application, I would like to execute an SQL query like this:
self.dbCursor.execute("INSERT INTO track (name, nbr, idartist, idalbum,
path) VALUES ('%s', %s, %s, %s, '%s')" % (track, nbr, idartist,
idalbum, path))
where the different variables are returned by the libtagedit python
bindings as Unicode. Every time I execute this, I get an exception like
this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position
64: ordinal not in range(128)

I tried to encode the different variables in many different encodings
(latin1), but I always get an exception. Where does this ascii codec
error come from? How can I simply build this query string?

Thanks in advance.
Best Regards,
Raphael

Jul 23 '06 #1

Jim

Ra*************@gmail.com wrote:

Hello,

For my application, I would like to execute an SQL query like this:
self.dbCursor.execute("INSERT INTO track (name, nbr, idartist, idalbum,
path) VALUES ('%s', %s, %s, %s, '%s')" % (track, nbr, idartist,
idalbum, path))

No, I'll bet that you'd like to run something like
self.dcCursor.execute("INSERT INTO track (name, nbr, idartist,
idalbum,path) VALUES (%(track)s, %(nbr)s,
%(idartist)s,%(idalbum)s,'%(path)s')",
{'track':track, 'nbr':nbr,'idartist':idartist,'idalbum':idalbum,'path':path})
(only without my typos). That's an improvement for a number of reasons,
one of which is that the system will quote for you, for instance in
idartist="John's Beer" changing the single quote to two single quotes
to suit SQL.

Every time I execute this, I get an exception like
this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position
64: ordinal not in range(128)

I tried to encode the different variables in many different encodings
(latin1), but I always get an exception. Where does this ascii codec
error come from? How can I simply build this query string?

Some more information may help: is the error returned before or during
the execute call? If before, then the execute() call is a distraction.
If during, then what is your dB, what is its encoding (is the dB
using latin1, or does the dB only accept ascii?), and what are you
using to connect to it?

Jim

Jul 23 '06 #2

clarkcb

Ra*************@gmail.com wrote:

I tried to encode the different variables in many different encodings
(latin1), but I always get an exception. Where does this ascii codec
error come from? How can I simply build this query string?

Raphael,

The 'ascii' encoding is set in the python library file site.py
(/usr/lib/python2.4/site.py on my gentoo machine) as the system default
encoding for python. The solution I used to the problem you're
describing was to create a sitecustomize.py file and redefine the
encoding as 'utf-8'. The entire file contents look like this:

--------
'''
Site customization: change default encoding to UTF-8
'''
import sys
sys.setdefaultencoding('utf-8')
--------

For more info on creating a sitecustomize.py file, read the comments in
the site.py file.

I use UTF-8 because I do a lot of multilingual text manipulation, but
if all you're concerned about is Western European, you could also use
'latin1'.

This gets you halfway there. Beyond that you need to "stringify" the
(potentially Unicode) strings during concatenation, e.g.:

self.dbCursor.execute("""INSERT INTO track (name, nbr, idartist,
idalbum, path)
VALUES ('%s', %s, %s, %s, '%s')""" % \
(str(track), nbr, idartist, idalbum, path))

(Assuming that track is the offending string.) I'm not exactly sure why
this explicit conversion is necessary, as it is supposed to happen
automatically, but I get the same UnicodeDecodeError error without it.

Hope this helps,
Cary

Jul 23 '06 #3

John Machin

Jim wrote:

Ra*************@gmail.com wrote:
Hello,

For my application, I would like to execute an SQL query like this:
self.dbCursor.execute("INSERT INTO track (name, nbr, idartist, idalbum,
path) VALUES ('%s', %s, %s, %s, '%s')" % (track, nbr, idartist,
idalbum, path))
No, I'll bet that you'd like to run something like
self.dcCursor.execute("INSERT INTO track (name, nbr, idartist,
idalbum,path) VALUES (%(track)s, %(nbr)s,
%(idartist)s,%(idalbum)s,'%(path)s')",
{'track':track, 'nbr':nbr,'idartist':idartist,'idalbum':idalbum,'path':path})
(only without my typos). That's an improvement for a number of reasons,
one of which is that the system will quote for you, for instance in
idartist="John's Beer" changing the single quote to two single quotes
to suit SQL.

I see no improvement here.

The OP's code is effectively::

sql = "INSERT INTO track (name, ..., path) VALUES ('%s', ..., '%s')"
value_tuple = (track, ...., path)
self.dcCursor.execute(sql % value_tuple)

Your suggested replacement is effectively:

sql = "INSERT INTO track (name, ...,path) VALUES (%(track)s,
....,'%(path)s')"
str_fmt_dict = {'track':track, ...,'path':path}
self.dcCursor.execute(sql, str_fmt_dict)

Well, that won't run at all. Let's correct the presumed typo:

self.dcCursor.execute(sql % str_fmt_dict)

Now, the only practical difference is that you have REMOVED the OP's
explicit quoting of the first column value. Changing the string
formatting from the %s style to the %(column_name) style achieves
nothing useful. You are presenting the "system" with a constant SQL
string -- it is not going to get any chance to fiddle with the quoting.
However the verbosity index has gone off the scale: each column name is
mentioned 4 times (previously 1).
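
To make the point about the constant SQL string concrete, here is a minimal sketch. It assumes Python 3 and the standard-library sqlite3 module with an in-memory table; the thread never says which database or driver the OP uses, and the two-column table is invented for illustration. The `%` operator finishes building the string before execute() is called, so a value containing a single quote breaks the statement.

```python
import sqlite3

# Hypothetical stand-in for the OP's table; the real schema is not given in the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, nbr INTEGER)")

sql = "INSERT INTO track (name, nbr) VALUES ('%s', %s)"
value_tuple = ("John's Beer", 1)

# The % operator runs first: execute() only ever sees the finished string.
finished_sql = sql % value_tuple

try:
    conn.execute(finished_sql)
    outcome = "inserted"
except sqlite3.OperationalError as exc:
    # The apostrophe ends the string literal early, so SQLite rejects the statement.
    outcome = "failed: %s" % exc
```

The driver has no way to repair the quoting here, because it never receives the values separately from the SQL text.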

I would suggest the standard default approach:

sql = "INSERT INTO track (name, ..., path) VALUES (?, ..., ?)"
value_tuple = (track, ...., path)
self.dcCursor.execute(sql, value_tuple)

The benefits of doing this include that the DBAPI layer gets to
determine the type of each incoming value and the type of the
corresponding DB column, and makes the appropriate adjustments,
including quoting each value properly, if quoting is necessary.
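
As a rough illustration of the qmark approach, the sketch below again assumes Python 3 and the standard-library sqlite3 module (an assumption; the OP's driver is unknown, and the table and sample values are made up). The SQL text and the values are handed to execute() separately, so no manual quoting or encoding appears in the calling code.

```python
import sqlite3

# Hypothetical table standing in for the OP's 'track' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, nbr INTEGER, path TEXT)")

sql = "INSERT INTO track (name, nbr, path) VALUES (?, ?, ?)"
# Sample values with an apostrophe and non-ASCII characters.
value_tuple = ("John's Beer \u00a1Ol\u00e9!", 7, "/music/ol\u00e9.mp3")

# SQL and values travel separately; the driver binds each value itself.
conn.execute(sql, value_tuple)

stored = conn.execute("SELECT name, nbr, path FROM track").fetchone()
```

The row read back matches what was passed in, apostrophe and accented characters included.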

Every time I execute this, I get an exception like
this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position
64: ordinal not in range(128)

I tried to encode the different variables in many different encodings
(latin1), but I always get an exception. Where does this ascii codec
error come from? How can I simply build this query string?

Some more information may help: is the error returned before or during
the execute call? If before, then the execute() call is a distraction.
If during, then what is your dB, what is its encoding (is the dB
using latin1, or does the dB only accept ascii?), and what are you
using to connect to it?

These are very sensible questions. Some more q's for the OP:

(1) What is the schema for the 'track' table?

(2) "I tried to encode the different variables in many different
encodings (latin1)" -- you say "many different encodings" but mention
only one ... please explain and/or show a sample of the actual code of
the "many different" attempts.

(3) You said that your input values (produced by some libblahblah) were
in Unicode -- are you sure? The exception that you got means that it
was trying to convert *from* an 8-bit string *to* Unicode, but used the
default ASCII codec (which couldn't hack it). Try doing this before the
execute() call:

print 'track', type(track), repr(track)
...
print 'path', type(path), repr(path)

and change the execute() call to three statements along the above
lines, so we can see (as Jim asked) where the exception is being
raised.
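
The direction of the failing conversion described in (3) can be reproduced in isolation. This is a sketch in Python 3, not the OP's code: in Python 2 the decode happens implicitly when an 8-bit string is mixed with a unicode object, whereas Python 3 refuses to mix the two, so the decode is written out explicitly here. The sample bytes are invented.

```python
# Latin-1 bytes: an inverted exclamation mark (0xa1) followed by ASCII text.
raw = b"\xa1Hola!"

try:
    raw.decode("ascii")
    error_text = None
except UnicodeDecodeError as exc:
    # Same family of message as in the OP's traceback.
    error_text = str(exc)

# Naming the actual encoding lets the same bytes decode cleanly.
decoded = raw.decode("latin-1")
```

If the type()/repr() prints show an 8-bit string rather than a unicode object for any of the values, that value is the likely source of the implicit ASCII decode.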

HTH,
John

Jul 23 '06 #4

John Machin

cl*****@gmail.com wrote:

Ra*************@gmail.com wrote:
I tried to encode the different variables in many different encodings
(latin1), but I always get an exception. Where does this ascii codec
error come from? How can I simply build this query string?

Raphael,

The 'ascii' encoding is set in the python library file site.py
(/usr/lib/python2.4/site.py on my gentoo machine) as the system default
encoding for python. The solution I used to the problem you're
describing was to create a sitecustomize.py file and redefine the
encoding as 'utf-8'.

Here is the word from on high (effbot, April 2006):
"""
(you're not supposed to change the default encoding. don't
do that; it'll only cause problems in the long run).
"""

That exception is a wake-up call -- it means "you don't have a clue how
your 8-bit strings are encoded". You are intended to obtain a clue
(case by case), and specify the encoding explicitly (case by case).
Sure the current app might dump utf_8 on you. What happens if the next
app dumps latin1 or cp1251 or big5 on you?
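
A small sketch of why the encoding has to be named case by case, written in Python 3 with an invented one-byte sample: the same byte means different things under different codecs, and is not valid UTF-8 at all, so no single default can be right for every source.

```python
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")   # U+00E9, Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")    # U+0439, Cyrillic small letter short i

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xe9 is the start of a multi-byte sequence that never finishes.
    utf8_ok = False
```

Only knowledge of where the bytes came from can say which of these readings is the intended one.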

This gets you halfway there. Beyond that you need to "stringify" the
(potentially Unicode) strings during concatenation, e.g.:

self.dbCursor.execute("""INSERT INTO track (name, nbr, idartist,
idalbum, path)
VALUES ('%s', %s, %s, %s, '%s')""" % \
(str(track), nbr, idartist, idalbum, path))

(Assuming that track is the offending string.) I'm not exactly sure why
this explicit conversion is necessary, as it is supposed to happen
automatically, but I get the same UnicodeDecodeError error without it.

Perhaps if you were to supply info like which DBMS, type of the
offending column in the DB, Python type of the value that *appears* to
need stringification, ... we could help you too.

Cheers,
John

Jul 23 '06 #5

Jim

John Machin wrote:

Jim wrote:
No, I'll bet that you'd like to run something like
self.dcCursor.execute("INSERT INTO track (name, nbr, idartist,
idalbum,path) VALUES (%(track)s, %(nbr)s,
%(idartist)s,%(idalbum)s,'%(path)s')",
{'track':track, 'nbr':nbr,'idartist':idartist,'idalbum':idalbum,'path':path})
(only without my typos). That's an improvement for a number of reasons,
one of which is that the system will quote for you, for instance in
idartist="John's Beer" changing the single quote to two single quotes
to suit SQL.
I see no improvement here.

The OP's code is effectively::

sql = "INSERT INTO track (name, ..., path) VALUES ('%s', ..., '%s')"
value_tuple = (track, ...., path)
self.dcCursor.execute(sql % value_tuple)

Your suggested replacement is effectively:

sql = "INSERT INTO track (name, ...,path) VALUES (%(track)s,
...,'%(path)s')"
str_fmt_dict = {'track':track, ...,'path':path}
self.dcCursor.execute(sql, str_fmt_dict)

Well, that won't run at all. Let's correct the presumed typo:

self.dcCursor.execute(sql % str_fmt_dict)

I'm sorry, that wasn't a typo. I was using what the dBapi 2.0 document
calls 'pyformat' (see the text under "paramstyle" in that document).

Now, the only practical difference is that you have REMOVED the OP's
explicit quoting of the first column value. Changing the string
formatting from the %s style to the %(column_name) style achieves
nothing useful. You are presenting the "system" with a constant SQL
string -- it is not going to get any chance to fiddle with the quoting.
However the verbosity index has gone off the scale: each column name is
mentioned 4 times (previously 1).

Gee, I like the dictionary; it has a lot of advantages.

I would suggest the standard default approach:

sql = "INSERT INTO track (name, ..., path) VALUES (?, ..., ?)"
value_tuple = (track, ...., path)
self.dcCursor.execute(sql, value_tuple)

The benefits of doing this include that the DBAPI layer gets to
determine the type of each incoming value and the type of the
corresponding DB column, and makes the appropriate adjustments,
including quoting each value properly, if quoting is necessary.

I'll note that footnote [2] of the dBapi format indicates some
preference for pyformat over the format above, called there 'qmark'.
But it all depends on what the OP is using to connect to the dB; their
database module may well force them to choose a paramstyle, AIUI.

Anyway, the point is that to get quote escaping right, to prevent SQL
injection, etc., paramstyles are better than direct string %-ing.
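
Which placeholder syntax applies can be checked on the driver module itself. The sketch below uses Python 3's standard-library sqlite3 purely as an example driver (an assumption; the OP's module is unknown, and a pyformat driver would use %(track)s placeholders instead). sqlite3 reports 'qmark' and also accepts named :key placeholders with a dictionary, which keeps the dictionary style without any string %-ing.

```python
import sqlite3

# DB-API 2.0 modules publish their placeholder style as a module attribute.
style = sqlite3.paramstyle

# Hypothetical two-column table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, idartist TEXT)")

# Named placeholders with a mapping; the driver does the binding and quoting.
params = {"track": "Intro", "idartist": "John's Beer"}
conn.execute(
    "INSERT INTO track (name, idartist) VALUES (:track, :idartist)",
    params,
)

row = conn.execute("SELECT name, idartist FROM track").fetchone()
```

The apostrophe in the value needs no special handling because the value never becomes part of the SQL text.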

Jim

Jul 24 '06 #6

John Machin

Jim wrote:

John Machin wrote:
Jim wrote:
No, I'll bet that you'd like to run something like
self.dcCursor.execute("INSERT INTO track (name, nbr, idartist,
idalbum,path) VALUES (%(track)s, %(nbr)s,
%(idartist)s,%(idalbum)s,'%(path)s')",
{'track':track, 'nbr':nbr,'idartist':idartist,'idalbum':idalbum,'path':path})
(only without my typos). That's an improvement for a number of reasons,
one of which is that the system will quote for you, for instance in
idartist="John's Beer" changing the single quote to two single quotes
to suit SQL.
I see no improvement here.

The OP's code is effectively::

sql = "INSERT INTO track (name, ..., path) VALUES ('%s', ..., '%s')"
value_tuple = (track, ...., path)
self.dcCursor.execute(sql % value_tuple)

Your suggested replacement is effectively:

sql = "INSERT INTO track (name, ...,path) VALUES (%(track)s,
...,'%(path)s')"
str_fmt_dict = {'track':track, ...,'path':path}
self.dcCursor.execute(sql, str_fmt_dict)

Well, that won't run at all. Let's correct the presumed typo:

self.dcCursor.execute(sql % str_fmt_dict)
I'm sorry, that wasn't a typo. I was using what the dBapi 2.0 document
calls 'pyformat' (see the text under "paramstyle" in that document).

Oh yeah. My mistake. Noticed 'pyformat' years ago, thought "What a good
idea", found out that ODBC supports only qmark, SQLite supports only
qmark, working on database conversions where the SQL was
programmatically generated anyway: forgot all about it.

Now, the only practical difference is that you have REMOVED the OP's
explicit quoting of the first column value. Changing the string
formatting from the %s style to the %(column_name) style achieves
nothing useful. You are presenting the "system" with a constant SQL
string -- it is not going to get any chance to fiddle with the quoting.
However the verbosity index has gone off the scale: each column name is
mentioned 4 times (previously 1).

Gee, I like the dictionary; it has a lot of advantages.

Like terseness? Like wide availability?

Anyway, the point is that to get quote escaping right, to prevent SQL
injection, etc., paramstyles are better than direct string %-ing.

And possible performance gains (the engine may avoid parsing the SQL
each time).

*NOW* we're on the same page of the same hymnbook, Brother Jim :-)
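
The possible performance gain mentioned above comes from keeping one statement text and varying only the parameters. A minimal sketch, assuming Python 3 and the standard-library sqlite3 module with invented sample rows; whether any parsing is actually saved depends on the driver and the engine, so treat this as an illustration of the calling pattern rather than a benchmark.

```python
import sqlite3

# Hypothetical table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, nbr INTEGER)")

sql = "INSERT INTO track (name, nbr) VALUES (?, ?)"
rows = [("Intro", 1), ("John's Beer", 2), ("Ol\u00e9", 3)]

# One statement text, many parameter sets.
conn.executemany(sql, rows)

count = conn.execute("SELECT COUNT(*) FROM track").fetchone()[0]
```

With string %-ing, each row would produce a different SQL string, which leaves the engine nothing to reuse.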

Cheers,
John

Jul 24 '06 #7
