Bytes | Software Development & Data Engineering Community

handling unicode data

Hi all,

I'm starting to learn python but am having some difficulties with how
it handles the encoding of data I'm reading from a database. I'm using
pymssql to access data stored in a SqlServer database, and the
following is the script I'm using for testing purposes.

-----------------------------------------------------------------------------
import pymssql

mssqlConnection = pymssql.connect(host='localhost', user='sa',
                                  password='password', database='TestDB')
cur = mssqlConnection.cursor()
query = "Select ID, Term from TestTable where ID > 200 and ID < 300;"
cur.execute(query)
row = cur.fetchone()
results = []
while row is not None:
    term = row[1]
    print type(row[1])
    print term
    results.append(term)
    row = cur.fetchone()
cur.close()
mssqlConnection.close()
print results
-----------------------------------------------------------------------------

In the console output, for a record where I expected to see "França"
I'm getting the following:

"<type 'str'>" - When I print the type (print type(row[1]))
"Fran+a" - When I print the "term" variable (print term)
"Fran\xd8a" - When I print all the query results (print results)
The values in "Term" column in "TestTable" are stored as unicode (the
column's datatype is nvarchar), yet, the python data type of the values
I'm reading is not unicode.
It all seems to be an encoding issue, but I can't see what I'm doing
wrong..
Any thoughts?

thanks in advance,
Filipe

Jun 28 '06 #1
Filipe wrote:
In the console output, for a record where I expected to see "França"
I'm getting the following:

"<type 'str'>" - When I print the type (print type(row[1]))
"Fran+a" - When I print the "term" variable (print term)
"Fran\xd8a" - When I print all the query results (print results)

The values in "Term" column in "TestTable" are stored as unicode (the
column's datatype is nvarchar), yet, the python data type of the values
I'm reading is not unicode.
It all seems to be an encoding issue, but I can't see what I'm doing
wrong..


looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
strings. there might be some configuration option for this; see

in worst case, you could do something like

def unicodify(value):
    if isinstance(value, str):
        value = unicode(value, "iso-8859-1")
    return value

term = unicodify(row[1])

but it's definitely better if you can get the DB-API driver to do the right thing.
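
On Python 3 the same idea needs a small change, since `str` is already Unicode there and driver output of this kind would arrive as `bytes`. A minimal sketch, assuming the driver hands back ISO-8859-1 bytes (the `encoding` parameter and its default are illustrative assumptions, not something pymssql provides):

```python
def unicodify(value, encoding="iso-8859-1"):
    """Decode byte strings to text; pass every other value through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# 0xe7 is c-cedilla in ISO-8859-1, so these bytes decode to "Fran\xe7a".
term = unicodify(b"Fran\xe7a")
# ascii() keeps the output printable whatever the console encoding is.
print(ascii(term))
```

Values that are not byte strings (numbers, `None`, text that is already decoded) come back untouched, so the helper can be applied to every column of a row.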

</F>

Jun 28 '06 #2
Fredrik Lundh wrote:
looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
strings. there might be some configuration option for this; see

Where did you want to point the OP here?

in worst case, you could do something like

def unicodify(value):
    if isinstance(value, str):
        value = unicode(value, "iso-8859-1")
    return value

term = unicodify(row[1])

but it's definitely better if you can get the DB-API driver to do the right thing.


It seems pymssql does not support such a thing.

Also, it appears that DB-Library (the API used by pymssql) always
returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
so the "right" encoding to use is "mbcs".

Notice that Microsoft plans to abandon DB-Library, so it might be
best to switch to a different module for SQL Server access.

Regards,
Martin
Jun 28 '06 #3
Hi Fredrik,

Thanks for the reply.
Instead of:
term = row[1]
I tried:
term = unicode(row[1], "iso-8859-1")

but the following error was returned when printing "term":
Traceback (most recent call last):
  File "test.py", line 11, in ?
    print term
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 31: character maps to <undefined>

Is it possible some unicode strings are not printable to the console?
It's odd, because I can manually write in the console the same string
I'm trying to print.
I also tried other encodings, besides iso-8859-1, but got the same
error.

Do you think this has something to do with the DB-API driver? I don't
even know where to start if I have to change something in there :|

Cheers,
Filipe

Jun 28 '06 #4
Filipe wrote:
Thanks for the reply.
Instead of:
term = row[1]
I tried:
term = unicode(row[1], "iso-8859-1")

but the following error was returned when printing "term":
Traceback (most recent call last):
  File "test.py", line 11, in ?
    print term
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 31: character maps to <undefined>


works for me, given your example:
>>> s = "Fran\xd8a"
>>> unicode(s, "iso-8859-1")
u'Fran\xd8a'

what does

print repr(row[1])

print in this case ?

</F>

Jun 28 '06 #5
Hi,

Martin v. Löwis wrote:
Also, it appears that DB-Library (the API used by pymssql) always
returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
so the "right" encoding to use is "mbcs".

Do you mean using something like the following line?
term = unicode(row[1], "mbcs")

What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
quite a newbie to Python)

Notice that Microsoft plans to abandon DB-Library, so it might be
best to switch to a different module for SQL Server access.


I've done some searching and settled for pymssql, but it's not too late
to change yet.
I've found these options to connect to a MSSqlServer database:

Pymssql
http://pymssql.sourceforge.net/

ADODB for Python (windows only)
http://phplens.com/lens/adodb/adodb-py-docs.htm

SQLServer for Python (discontinued?)
http://www.object-craft.com.au/projects/mssql/

mxODBC (commercial license)
http://www.egenix.com/files/python/mxODBC.html

ASPN Recipe
http://aspn.activestate.com/ASPN/Coo.../Recipe/144183

Pymssql seemed like the best choice. The ASPN Recipe I mention doesn't
look bad either, but there don't seem to be as many people using it
as pymssql. I'll look a little further though.

Jun 28 '06 #6
Fredrik Lundh wrote:
works for me, given your example:
>>> s = "Fran\xd8a"
>>> unicode(s, "iso-8859-1")

u'Fran\xd8a'

what does
print repr(row[1])

print in this case ?


It prints:
'Fran\xd8a'

The error I'm getting is being thrown when I print the value to the
console. If I just convert it to unicode all seems OK (except for not
being able to show it in the console, that is... :).

For example, when I try this:
print unicode("Fran\xd8a", "iso-8859-1")

I get the error:
Traceback (most recent call last):
  File "a.py", line 1, in ?
    print unicode("Fran\xd8a", "iso-8859-1")
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 4: character maps to <undefined>

Jun 28 '06 #7
Filipe wrote:
The error I'm getting is beeing thrown when I print the value to the
console. If I just convert it to unicode all seems ok (except for not
beeing able to show it in the console, that is... :).

For example, when I try this:
print unicode("Fran\xd8a", "iso-8859-1")

I get the error:
Traceback (most recent call last):
  File "a.py", line 1, in ?
    print unicode("Fran\xd8a", "iso-8859-1")
  File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in position 4: character maps to <undefined>


It is not the `unicode()` call that fails here but the ``print``, because
printing unicode strings means they have to be encoded into a byte string
again. And whatever encoding the target of the print (your console) uses, it
does not contain the unicode character u'\xd8'. From the traceback it
seems your terminal uses `cp437` as its encoding.

As you can see here: http://www.wordiq.com/definition/CP437 there's no Ø
in that character set.
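
The same failure can be reproduced without a console by encoding by hand. This is a small sketch in Python 3 syntax (an assumption; the thread itself uses Python 2.4), where the encoding that `print` does implicitly is written out as an explicit `str.encode` call:

```python
text = "Fran\u00d8a"  # "Fran" + LATIN CAPITAL LETTER O WITH STROKE + "a"

# cp437 has no slot for U+00D8, so encoding fails just as print did.
try:
    text.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# U+00E7 (c with cedilla) does exist in cp437, at byte 0x87.
franca_cp437 = "Fran\u00e7a".encode("cp437")
print(franca_cp437)

# errors="replace" substitutes "?" for the unencodable character instead of raising.
lossy = text.encode("cp437", errors="replace")
print(lossy)
```

Passing `errors="replace"` is one way to get degraded but non-fatal console output; whether that trade-off is acceptable depends on the application.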

Ciao,
Marc 'BlackJack' Rintsch
Jun 28 '06 #8

Filipe wrote:
Hi,

I've done some searching and settled for pymssql, but it's not too late
to change yet.
I've found these options to connect to a MSSqlServer database:

Pymssql
http://pymssql.sourceforge.net/

ADODB for Python (windows only)
http://phplens.com/lens/adodb/adodb-py-docs.htm

SQLServer for Python (discontinued?)
http://www.object-craft.com.au/projects/mssql/

mxODBC (commercial license)
http://www.egenix.com/files/python/mxODBC.html

ASPN Recipe
http://aspn.activestate.com/ASPN/Coo.../Recipe/144183


You did not mention the odbc module from Mark Hammond's win32
extensions. This is what I use, and it works for me. I believe it is
not 100% DB-API 2.0 compliant, but I have not had any problems.

I have not tried connecting to the database from a Linux box (or from
another Windows box, for that matter). I don't know if there are any
implications there.

Frank Millman

Jun 29 '06 #9
Filipe wrote:
Also, it appears that DB-Library (the API used by pymssql) always
returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
so the "right" encoding to use is "mbcs".
do you mean using something like the following line?
term = unicode(row[1], "mbcs")


Correct.

What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
quite a newbie to python)


It's an SQL Server thing more than a Python thing. See AutoAnsiToOem in

http://support.microsoft.com/default...B;EN-US;199819

Regards,
Martin
Jun 29 '06 #10
Frank Millman wrote:
You did not mention the odbc module from Mark Hammond's win32
extensions. This is what I use, and it works for me. I believe it is
not 100% DB-API 2.0 compliant, but I have not had any problems.

I have not tried connecting to the database from a Linux box (or from
another Windows box, for that matter). I don't know if there are any
implications there.


According to SourceForge's project page
(https://sourceforge.net/projects/pywin32/) it seems to only work on
Windows.

There's also adodbapi (http://adodbapi.sourceforge.net/), which also
depends on PyWin32, but it would be very handy if I could run this code
on a Linux box, and with these libs I wouldn't be able to. Still,
options are always good to have around :)

Jun 30 '06 #11
Marc 'BlackJack' Rintsch wrote:
The `unicode()` call doesn't fail here but the ``print`` because printing
unicode strings means they have to be encoded into a byte string again.
And whatever encoding the target of the print (your console) uses, it
does not contain the unicode character u'\xd8'. From the traceback it
seems your terminal uses `cp437` as encoding.

As you can see here: http://www.wordiq.com/definition/CP437 there's no Ø
in that character set.


Some things are much, much clearer to me now. Thanks!

For future reference, these links may also help:
http://www.jorendorff.com/articles/unicode/python.html
http://www.thescripts.com/forum/thread23314.html

I've changed my Windows console code page to latin1 and the following
prints are now outputting "França", as expected:
print unicode("Fran\x87a", "cp850").encode("iso-8859-1")
print unicode("Fran\xe7a", "iso-8859-1").encode("iso-8859-1")
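
What those two lines do is transcode: decode the bytes with the codec they were written in, then encode the resulting text with the codec the console expects. A sketch of the first line in Python 3 syntax (an assumption, since the thread uses Python 2.4; `transcode` is a hypothetical helper name):

```python
def transcode(data, source, target):
    """Decode bytes from one codec and re-encode the text in another."""
    return data.decode(source).encode(target)


# 0x87 is c-cedilla in cp850; the same character is 0xe7 in ISO-8859-1.
latin1_bytes = transcode(b"Fran\x87a", "cp850", "iso-8859-1")
print(latin1_bytes)
```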

However, I don't yet fully understand what's happening with Pymssql.
The encoding I expected to be receiving from MSSqlServer was cp850 (the
column in question uses the collation SQL_Latin1_General_CP850_CS_AS),
but it doesn't seem to be what the query is returning. I tried
converting to a unicode string from a few different encodings, but none
of them seems to be the right one. For example, for cp850, using a
latin1 console:

--------------------------------------------------------
term = unicode(row[1], "cp850")
print repr(term)
print term

---- output -------------------------------------------
u'Fran\xcfa'
FranÏa
--------------------------------------------------------

And for iso-8859-1 (I also got the same result for mbcs):
--------------------------------------------------------
term = unicode(row[1], "iso-8859-1")
print repr(term)
print term

---- output -------------------------------------------
u'Fran\xd8a'
FranØa
--------------------------------------------------------

What do you think? Might it be pymssql doing something wrong?

Jun 30 '06 #12
Martin v. Löwis wrote:
What do you mean by "ANSI-to-OEM conversion is enabled"?


See AutoAnsiToOem in
http://support.microsoft.com/default...B;EN-US;199819


I checked the registry key
"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\DB-Lib", and
verified AutoAnsiToOem was 'ON'.

I also tried assuming mbcs as the encoding but didn't get very far
(please see my other post).

Jun 30 '06 #13
Filipe wrote:
---- output -------------------------------------------
u'Fran\xd8a'
FranØa
--------------------------------------------------------
What do you think? Might it be Pymssql doing something wrong?


I think the data in your database is already wrong. Are you
sure the value in question is really "França" in the database?

Regards,
Martin
Jun 30 '06 #14
Martin v. Löwis wrote:
Filipe wrote:
---- output -------------------------------------------
u'Fran\xd8a'
FranØa
--------------------------------------------------------

What do you think? Might it be Pymssql doing something wrong?

I think the data in your database is already wrong. Are you
sure the value in question is really "França" in the database?

Yes, I'm pretty sure. There's an application that was built to run on
top of this database and it correctly reads and writes data to the DB. I
also used SQL Server's Query Analyzer to select the data and it
displayed fine.

I've done some more tests and I think I'm very close to finding what
the problem is. The tests I had done before were executed from the
windows command line. I tried printing the following (row[1] is a value
I selected from the database) in two distinct environments, from within
an IDE (Pyscripter) and from the command line:

import sys
import locale
print getattr(sys.stdout,'encoding',None)
print locale.getdefaultlocale()[1]
print sys.getdefaultencoding()
term = "Fran\x87a"
print repr(term)
term = row[1]
print repr(term)

output I got in Pyscripter's interpreter window:
None
cp1252
ascii
'Fran\x87a'
'Fran\x87a'

output I got in the command line:
cp1252
cp1252
ascii
'Fran\x87a'
'Fran\xd8a'

I'd expect "print" to behave differently according to the console's
encoding, but does this mean this happens with repr() too?
In which way?

thanks,
Filipe

Jul 4 '06 #15
Filipe wrote:
term = row[1]
print repr(term)

output I got in Pyscripter's interpreter window:
'Fran\x87a'

output I got in the command line:
'Fran\xd8a'

I'd expect "print" to behave differently according with the console's
encoding, but does this mean this happens with repr() too?

repr always generates ASCII bytes. They are not affected by the
console's encoding. If you get different output, it really means
that the values are different (check ord(row[1][4]) to be sure)
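
A short sketch of that check, in Python 3 syntax (an assumption; the thread uses Python 2.4), using the two byte strings reported above. Indexing a `bytes` object in Python 3 already yields the integer that `ord()` returns in Python 2:

```python
ide_value = b"Fran\x87a"      # what repr() showed in Pyscripter
console_value = b"Fran\xd8a"  # what repr() showed on the command line

# repr() escapes non-ASCII bytes, so its output is independent of any console.
print(repr(ide_value))
print(repr(console_value))

# Comparing the fifth byte shows the values themselves differ.
print(ide_value[4], console_value[4])
```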

What is the precise sequence of statements that you used to
set the "row" variable?

Regards,
Martin
Jul 4 '06 #16
Martin v. Löwis wrote:
Filipe wrote:
term = row[1]
print repr(term)

output I got in Pyscripter's interpreter window:
'Fran\x87a'

output I got in the command line:
'Fran\xd8a'

I'd expect "print" to behave differently according with the console's
encoding, but does this mean this happens with repr() too?

repr always generates ASCII bytes. They are not effected by the
console's encoding. If you get different output, it really means
that the values are different (check ord(row[1][4]) to be sure)

They do, in fact, output different values. The value output by
Pyscripter was "135" (x87) while the value output on the command line
was "216" (xd8). I can't understand why though, because the script
being run is precisely the same in both environments.

What is the precise sequence of statements that you used to
set the "row" variable?

The complete script follows:
-----------------------------------------------------------------------
import sys
import locale
print getattr(sys.stdout, 'encoding', None)
print locale.getdefaultlocale()[1]
print sys.getdefaultencoding()

import pymssql
mssqlConnection = pymssql.connect(host='localhost', user='sa',
                                  password='password', database='TestDB')
cur = mssqlConnection.cursor()
query = "Select ID, Term from TestTable where ID = 204;"
cur.execute(query)
row = cur.fetchone()
results = []
while row is not None:
    term = unicode(row[1], "cp850")
    print ord(row[1][4])
    print ord(term[4])
    print term
    results.append(term)
    row = cur.fetchone()
cur.close()
mssqlConnection.close()
print results
-----------------------------------------------------------------------
The values output were, in Pyscripter:
None
cp1252
ascii
135
231
França
[u'Fran\xe7a']

and in the command line
cp850
cp1252
ascii
216
207
FranÏa
[u'Fran\xcfa']
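
These numbers are consistent with the same cp850 decoding step being applied to two different input bytes. A sketch in Python 3 syntax (an assumption; the script above is Python 2) that reproduces the `ord()` values from both runs:

```python
for raw in (135, 216):
    char = bytes([raw]).decode("cp850")
    # ascii() keeps the output printable on any console encoding.
    print(raw, hex(raw), ord(char), ascii(char))

pyscripter_term = b"Fran\x87a".decode("cp850")
console_term = b"Fran\xd8a".decode("cp850")
```

So the decoding itself behaves the same in both environments; what differs is the byte that arrives in `row[1]`.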

regards,
Filipe

Jul 5 '06 #17
Filipe wrote:
They do, in fact, output different values. The value outputed by
pyscripter was "135" (x87) while the value outputed in the command line
was "216" (xd8). I can't understand why though, because the script
being run is precisely the same on both environments.

That's indeed surprising, and doesn't really increase trust in
pymssql.

If we look at the values of
print ord(row[1][4])
(where row is the actual data read from the database)

we get

The values outputed were, in pyscripter:
135

Here, 135==0x87 really is LATIN SMALL LETTER C WITH CEDILLA in
code page 850.

and in the command line
216

216==0xd8 is not LATIN SMALL LETTER C WITH CEDILLA in any
encoding I know, so it appears that this value is bogus.

One would have to ask the authors of pymssql, or Microsoft,
why that happens; alternatively, you have to run pymssql
in a debugger to find out yourself.

Regards,
Martin
Jul 5 '06 #18
Hi Martin,

One would have to ask the authors of pymssql, or Microsoft,
why that happens; alternatively, you have to run pymssql
in a debugger to find out yourself.

I tried running pymssql in a debugger, but I felt a bit lost. There are
too many things I would need to understand about pymssql first.

Meanwhile, I came to some very interesting conclusions. Remember the
"ANSI-to-OEM conversion" option you mentioned before? I began reading
some docs about it and this description turned up:

"The ANSI to OEM conversion translates data coming back from SQL Server
into the local code page used by your client."

which seemed to be exactly what I don't want, so I turned it "OFF" (by
using the registry key
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\DB-Lib\AutoAnsiToOem)
and everything started working the way I was originally expecting!

I think that the best way to actually solve this will be to override
AutoAnsiToOem (instead of using the registry setting) from within
pymssql, or find a way to specify what the "local code page" should be.
If not, I will have the pain of verifying this setting on every
system where the code I'm developing will be deployed. Which reminds
me that this setting doesn't exist in non-Windows environments (i.e.,
no registry on Linux), so I'm not sure how it will all work there.
Can anyone with experience in using DB-Library confirm how it
works (namely, on Linux)?
(But perhaps this is outside the scope of this newsgroup.)

I got in touch with Andrzej Kukula, the current developer of pymssql,
who has also been very helpful, and knows what we've been discussing
over here.

Thanks for all the help,
Filipe

Jul 6 '06 #19

Dennis Lee Bieber wrote:
The setting most likely has to be made on the machine running the
server -- and M$ SQL Server doesn't exist on Linux either <G>

If the conversion was being done by some client library on Windows,
then again, since that library probably doesn't exist on Linux, the
conversion probably is not done.

Yes, it's something done on the client side. And I think DB-Library
does exist on Linux because pymssql depends on it and pymssql is
cross-platform:

"pymssql 0.7.4 has been tested on FreeBSD 6.0, NetBSD 2.1, 3.0, Linux
with kernel 2.6, and Windows XP. It should also run on other platforms."

Jul 6 '06 #20

Dennis Lee Bieber wrote:
If I interpret a short Google search, DB-Library might date back to
the original Sybase core from which M$ SQL Server was spawned. M$'s site
recommends /not/ using DB-Library but to use ODBC/OLEDB methods instead
-- something about ODBC being extensible. Could be confusing if both
Sybase and M$ SQL Server were on the same machine...

http://www.cs.sfu.ca/CourseCentral/S...B-LIBRARY.html

Technical details reference Sybase, but the wordy stuff is "SQL
Server" and "Transact-SQL".

The only reason I still think pymssql (and therefore DB-Library) might
be the best option is that it is the only one I know of that is both
cross-platform and free - as in beer and as in freedom. (check, in this
thread, a previous message by Tim Golden)

I searched a bit if there are any OLEDB based python libs and found
this one:
http://pyoledb.datadmin.com/

I'm still not sure if it's cross-platform or not, but it does have a
commercial license, so it's not my first choice for now.

Jul 7 '06 #21

Filipe wrote:
The only reason I still think Pymssql (and therefore, DB-Library) might
be the best option is that, it is the only one I know that is both
cross-platform and free - as in beer and as in freedom. (check, in this
thread, a previous message by Tim Golden)

I have bookmarked this post dated October 2004, intending to look into
it one day. I have not done so yet (where does all the time go?). The
subject is "Connecting to SQL Server from Unix".

http://tinyurl.com/zc7so

Try out the suggestions and let us know what happened. I for one will
be very interested.

Frank Millman

Jul 8 '06 #22

Frank Millman wrote:
Try out the suggestions and let us know what happened. I for one will
be very interested.

The last version of ODBTPAPI is 0.1-alpha, last updated 2004-09-25.
Which is a bit scary...
I might try it just the same though.

Jul 8 '06 #23
