Bytes IT Community

Conversion of Unicode to ASCII NIGHTMARE

Hi

I am reading from an oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.

I get the following error:
SQLiteCur.execute(sql, row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)


I have googled for several days now and still can't get it to encode to
ASCII.

I encode the SQL as follows:

sql = "insert into %s values %s" % (SQLiteTable, paramstr)
sql.encode('ascii', 'ignore')

I then code each of the row values returned from Oracle like this:

row = map(encodestr, row)
SQLiteCur.execute(sql, row)

where encodestr is as follows:

def encodestr(item):
    if isinstance(item, types.StringTypes):
        return unicodedata.normalize('NFKD',
            unicode(item, 'utf-8', 'ignore')).encode('ASCII', 'ignore')
    else:
        return item
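For reference, the same NFKD-then-ignore trick works under modern Python 3, where all strings are already Unicode and no intermediate unicode() call is needed. A minimal sketch:

```python
import unicodedata

def encodestr(item):
    """Strip accents: decompose to NFKD, then drop the non-ASCII combining marks."""
    if isinstance(item, str):
        return (unicodedata.normalize("NFKD", item)
                .encode("ascii", "ignore")
                .decode("ascii"))
    return item

print(encodestr("Zürich"))   # Zurich
```

The ignore handler silently discards the combining marks left over after decomposition, so this is lossy by design.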

I have tried a thousand similar functions to the above, and
permutations of the above from various Google searches. But I still get
the above exception on the line:

SQLiteCur.execute(sql, row)

and the exception is related to the data in one field.

In the end I resorted to using Oracle's CONVERT function in the SQL
statement, but I would like to understand why this is happening and why
it's so hard to convert the string in Python. I have read many
complaints about this from other people, some of whom have written
custom stripping routines. I haven't tried a custom routine yet, because I
think it should be possible in Python.

Thanks,

Apr 3 '06 #1
24 Replies


ChaosKCW wrote:
Hi

I am reading from an oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.
And does cx_Oracle return those as Unicode objects or as plain strings
containing UTF-8 byte sequences? It's very important to distinguish
between these two cases, and I don't have any experience with cx_Oracle
to be able to give advice here.
I get the following error:
SQLiteCur.execute(sql, row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter, although
later versions should work with Unicode objects without special
configuration. See here for a thread (in which I seem to have
participated, coincidentally):

http://mail.python.org/pipermail/pyt...ne/107526.html
I have googled for several days now and still can't get it to encode to
ASCII.


This is a tough thing to find out - whilst previous searches did
uncover some discussions about it, I just tried and failed to find the
enlightening documents - and I certainly didn't see many references to
it on the official pysqlite site.

Paul

Apr 3 '06 #2

ChaosKCW wrote:
Hi

I am reading from an oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.
I'm not sure that you are using those terms correctly. From your description
below, it seems that your data is being returned from the Oracle database as
unicode strings rather than regular strings containing UTF-8 encoded data. These
European characters are not "special characters from an ASCII perspective;" they
simply aren't characters in the ASCII character set at all.
I get the following error:
SQLiteCur.execute(sql, row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)
I have googled for several days now and still can't get it to encode to
ASCII.


Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

http://www.sqlite.org/version3.html
I encode the SQL as follows:

sql = "insert into %s values %s" % (SQLiteTable, paramstr)
sql.encode('ascii', 'ignore')


The .encode() method returns a new value; it does not change an object in place.

sql = sql.encode('utf-8')
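The immutability point is easy to demonstrate (modern Python 3 shown, where .encode() returns a bytes object):

```python
sql = "insert into test values (?)"
encoded = sql.encode("utf-8")   # returns a *new* bytes object

assert encoded == b"insert into test values (?)"
assert sql == "insert into test values (?)"   # the original string is untouched
```

Calling sql.encode(...) without assigning the result, as in the original post, therefore does nothing at all.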

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 3 '06 #3

> Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.


That is not entirely correct - one can, if losing information is OK. The OP's
code normalizes the text to NFKD, so an umlaut like ä is transformed into a
two-character sequence basically saying "a with two dots on top". With
'ignore' specified as a parameter to the encoder, this should result in
the plain letter a.
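The decomposition Diez describes can be checked directly in a modern Python 3 session:

```python
import unicodedata

decomposed = unicodedata.normalize("NFKD", "ä")
print(len(decomposed))                       # 2: 'a' plus U+0308 COMBINING DIAERESIS
print(decomposed.encode("ascii", "ignore"))  # b'a' -- the dots are dropped
```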
Regards,

Diez
Apr 3 '06 #4

Oh, and it occurs to me, as I seem to have mentioned a document about
PgSQL rather than pysqlite (although they both have the same principal
developer), that you might need to investigate the client_encoding
parameter when setting up your connection. The following message gives
some information (but not much):

http://groups.google.com/group/comp....7fa9866c9b7b5f

Sadly, I can't find the information about getting result values as
Unicode objects, but I believe it involves some kind of SQL comment
that you send to the database system which actually tells pysqlite to
change its behaviour.

Paul

Apr 3 '06 #5

"Paul Boddie" <pa**@boddie.org.uk> wrote in message news:11**********************@i39g2000cwa.googlegroups.com...
It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter,


He is using apsw. apsw correctly handles unicode. In fact it won't
accept a str with bytes >127 as they will be an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to writing apsw in the first place.

Roger
Apr 4 '06 #6

Hi

Thanks for all the posts. I am still digesting it all but here are my
initial comments.
Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.
http://www.sqlite.org/version3.html
As mentioned by the next poster, there is a way: it's supposed to be encoded
with the 'ignore' option. Thus you lose data, but that's just dandy with
me. As for SQLite supporting Unicode, it probably does, but something
on the Python side (probably in apsw) converts it to ASCII at some
point before it's handed to SQLite.
The .encode() method returns a new value; it does not change an object inplace.
sql = sql.encode('utf-8')
Ah yes, big mistake on my part :-/
He is using apsw. apsw correctly handles unicode. In fact it won't
accept a str with bytes >127 as they will be an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to writing apsw in the first place.


Ok, if SQLite uses Unicode internally, why do you need to ignore
everything greater than 127? The ASCII table (the 256 bit one) fits into
Unicode just fine as far as I recall? Or did I miss the boat here?

Thanks,

Apr 4 '06 #7

"Thus you lose data, but that's just dandy with
me.": Please reconsider this attitude, before you perpetrate nonsense
or even a disaster.

Wrt your last para:
1. Roger didn't say "ignore" -- he said "won't accept" (a major
difference).
2. The ASCII code comprises 128 characters, *NOT* 256.
3. What Roger means is: given a Python 8-bit string and no other
information, you don't have a clue what the encoding is. Most codes of
interest these days have the ASCII code (or a mild perversion thereof)
in the first 128 positions, but it's anyones guess what the writer of
the string had in mind with the next 128.
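The ambiguity of the upper 128 positions is easy to demonstrate in modern Python: the very byte from the OP's traceback, 0xDC, decodes to different characters depending on which legacy encoding you assume:

```python
raw = b"\xdc"                     # the byte from the traceback
print(raw.decode("iso-8859-1"))   # 'Ü' under Latin-1
print(raw.decode("cp1251"))       # 'Ь' under Windows Cyrillic
```

Without out-of-band knowledge of the encoding, a bare byte string above 127 simply has no single meaning.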

Apr 4 '06 #8

Roger Binns wrote:
"Paul Boddie" <pa**@boddie.org.uk> wrote in message news:11**********************@i39g2000cwa.googlegroups.com...
It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter,


He is using apsw. apsw correctly handles unicode. In fact it won't
accept a str with bytes >127 as they will be an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to writing apsw in the first place.


Ah, I misread the OP's traceback.

Okay, the OP is getting regular strings, which are probably encoded in
ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in
to SQLiteCur.execute() which tries to make a unicode string from the input:

In [1]: unicode('\xdc')
---------------------------------------------------------------------------
exceptions.UnicodeDecodeError        Traceback (most recent call last)

/Users/kern/<ipython console>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

*Now*, my advice to the OP is to figure out the encoding of the strings that are
being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.
Then, he would *decode* the string to a unicode string using the encoding. E.g.:

row = row.decode('iso-8859-1')

Then everything should be peachy. I hope.
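Since the DB-API hands back rows as sequences, the decode would apply per string column. A sketch in modern Python 3, assuming (as guessed above) the Oracle data really is ISO-8859-1; the row contents here are made up for illustration:

```python
# hypothetical row, as a driver returning raw byte strings might hand it over
row = (b"W\xdcRZBURG", 10)

decoded = tuple(
    col.decode("iso-8859-1") if isinstance(col, bytes) else col
    for col in row
)
print(decoded)   # ('WÜRZBURG', 10)
```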

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 4 '06 #9

Robert Kern wrote:
Roger Binns wrote:
"Paul Boddie" <pa**@boddie.org.uk> wrote
It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter,

Note that I've since mentioned client_encoding which seems to matter
for pysqlite 1.x.
He is using apsw. apsw correctly handles unicode. In fact it won't
accept a str with bytes >127 as they will be an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to writing apsw in the first place.

For pysqlite 2.x, it appears that Unicode objects can be handed
straight to the API methods, and I'd be interested to hear about your
problems with pysqlite, Unicode and what actually made you write apsw
instead.
Ah, I misread the OP's traceback.

Okay, the OP is getting regular strings, which are probably encoded in
ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in
to SQLiteCur.execute() which tries to make a unicode string from the input:
[...]

There's an Oracle environment variable that appears to make a
difference: NLS_CHARSET, perhaps - it's been a while since I've had to
deal with Oracle, and I'm not looking for another adventure into
Oracle's hideous documentation to find out.
*Now*, my advice to the OP is to figure out the encoding of the strings that are
being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.
Then, he would *decode* the string to a unicode string using the encoding. E.g.:

row = row.decode('iso-8859-1')

Then everything should be peachy. I hope.


Yes, just find out what Oracle wants first, then set it all up, noting
that without looking into the Oracle wrapper being used, I can't
suggest an easier way.

Paul

Apr 4 '06 #10


"ChaosKCW" <da********@gmail.com> wrote in message news:11**********************@v46g2000cwv.googlegroups.com...
me. As for SQLite supporting unicode, it probably does,
No, SQLite *ONLY* supports Unicode. It will *only* accept
strings in Unicode and only produces strings in Unicode.
All the functionality built into SQLite such as comparison
operators operate only on Unicode strings.
but something
on the Python side (probably in apsw) converts it to ASCII at some
point before it's handed to SQLite.
No. APSW converts it *to* Unicode. SQLite only accepts Unicode
so a Unicode string has to be supplied. If you supply a non-Unicode
string then conversion has to happen. APSW asks Python to
supply the string in Unicode. If Python can't do that (eg
it doesn't know the encoding) then you get an error.

I strongly recommend reading this:

The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets

http://www.joelonsoftware.com/articles/Unicode.html
Ok if SQLite uses unicode internally why do you need to ignore
everything greater than 127,
I never said that. I said that a special case is made so that
if the string you supply only contains ASCII characters (ie <=127)
then the ASCII string is converted to Unicode. (In fact it is
valid UTF-8 hence the shortcut).
the ascii table (256 bit one) fits into
unicode just fine as far as I recall?
No, ASCII characters have defined Unicode codepoints. The ASCII
character numbers just happen to be the same as the Unicode
codepoints. But there are only 128 ASCII characters.
Or did I miss the boat here ?


For bytes greater than 127, what character set is used? There
are hundreds of character sets that define those characters.
You have to tell the computer which one to use. See the Unicode
article referenced above.

Roger
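Both of Roger's points - that pure-ASCII strings are already valid UTF-8, and that bytes above 127 are meaningless without a declared character set - can be verified in a couple of lines (modern Python 3 shown):

```python
data = b"plain ascii text"
# every byte is <= 127, so the same bytes are valid ASCII *and* valid UTF-8
assert data.decode("ascii") == data.decode("utf-8") == "plain ascii text"

# a lone byte above 127 is not even valid UTF-8, let alone self-describing
try:
    b"\xdc".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
```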
Apr 5 '06 #11

Roger Binns wrote:
SQLite only accepts Unicode so a Unicode string has to be supplied.


fact or FUD? let's see:

import pysqlite2.dbapi2 as DB

db = DB.connect("test.db")

cur = db.cursor()

cur.execute("create table if not exists test (col text)")
cur.execute("insert into test values (?)", ["this is an ascii string"])
cur.execute("insert into test values (?)", [u"this is a unicode string"])
cur.execute("insert into test values (?)", [u"thïs ïs ö unicöde strïng"])

cur.execute("select * from test")

for row in cur.fetchall():
    print row

prints

(u'this is an ascii string',)
(u'this is a unicode string',)
(u'th\xefs \xefs \xf6 unic\xf6de str\xefng',)

which is correct behaviour under Python's Unicode model.

</F>
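For what it's worth, the sqlite3 module bundled with modern Python (a descendant of pysqlite2) shows the same correct behaviour, round-tripping non-ASCII text with no special configuration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.execute("create table test (col text)")
cur.execute("insert into test values (?)", ["thïs ïs ö unicöde strïng"])
cur.execute("select col from test")
(value,) = cur.fetchone()
print(value)   # thïs ïs ö unicöde strïng
db.close()
```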

Apr 5 '06 #12


"Fredrik Lundh" <fr*****@pythonware.com> wrote in message news:ma***************************************@python.org...
Roger Binns wrote:
SQLite only accepts Unicode so a Unicode string has to be supplied.
fact or FUD? let's see:


Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or two bytes per character
host byte order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API as Python can't quite
make its mind up how to represent Unicode strings at the C api level.
You can have two bytes per char or four, and the handling/production
of byte order markers isn't that clear either.
import pysqlite2.dbapi2 as DB


pysqlite had several unicode problems in the past. It has since
been cleaned up as you saw.

Roger
Apr 5 '06 #13

Roger Binns wrote:
fact or FUD? let's see:


Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or two bytes per character
host byte order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API as Python can't quite
make its mind up how to represent Unicode strings at the C api level.
You can have two bytes per char or four, and the handling/production
of byte order markers isn't that clear either.


sounds like your understanding of Unicode and Python's Unicode system
is a bit unclear.

</F>

Apr 5 '06 #14


"Fredrik Lundh" <fr*****@pythonware.com> wrote in message news:ma***************************************@python.org...
sounds like your understanding of Unicode and Python's Unicode system
is a bit unclear.


Err, no. Relaying unicode data between two disparate
C APIs requires being careful and thorough. That means
paying attention to when conversions happen, byte ordering
(compile time) and boms (run time) and when the API
documentation isn't thorough, verifying the behaviour
yourself. That requires a very clear understanding of
Unicode in order to do the requisite test cases, as well
as reading what the code does.

Roger
Apr 5 '06 #15

Roger Binns wrote:
"Fredrik Lundh" <fr*****@pythonware.com> wrote in message news:ma***************************************@python.org...
Roger Binns wrote:
SQLite only accepts Unicode so a Unicode string has to be supplied.


fact or FUD? let's see:


Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or two bytes per character
host byte order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API as Python can't quite
make its mind up how to represent Unicode strings at the C api level.
You can have two bytes per char or four, and the handling/production
of byte order markers isn't that clear either.


I have an impression that handling/production of byte order marks is
pretty clear: they are produced/consumed only by two codecs: utf-16 and
utf-8-sig. What is not clear?

Serge

Apr 6 '06 #16


"Serge Orlov" <Se*********@gmail.com> wrote in message news:11**********************@i40g2000cwc.googlegroups.com...
I have an impression that handling/production of byte order marks is
pretty clear: they are produced/consumed only by two codecs: utf-16 and
utf-8-sig. What is not clear?


Are you talking about the C APIs in Python/SQLite (that is what I
have been discussing) or the language level?

At the C level, SQLite doesn't accept BOMs. You have to
provide UTF-8 or host-byte-order two-bytes-per-char
UTF-16.

Roger
Apr 6 '06 #17


Roger Binns wrote:
"Serge Orlov" <Se*********@gmail.com> wrote in message news:11**********************@i40g2000cwc.googlegroups.com...
I have an impression that handling/production of byte order marks is
pretty clear: they are produced/consumed only by two codecs: utf-16 and
utf-8-sig. What is not clear?
Are you talking about the C APIs in Python/SQLite (that is what I
have been discussing) or the language level?


Both. Documentation for PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16
is pretty clear about when a BOM is produced/removed. The only problem is that
you have to find out host endianness yourself. In Python it's
sys.byteorder; in C you use a hack like

unsigned long one = 1;
endianess = ((*(char *) &one) == 0) ? 1 : -1;

And then pass endianess to PyUnicode_(De/En)codeUTF16. So I still don't
see what is unclear about BOM production/handling.


At the C level, SQLite doesn't accept BOMs.


It would be surprising if it did. Quote from
<http://www.unicode.org/faq/utf_bom.html>: "Where the data is typed,
such as a field in a database, a BOM is unnecessary"
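Serge's description matches how the Python codecs behave: only the generic utf-16 codec emits a BOM, while the explicit -le/-be variants never do. A quick check in modern Python 3:

```python
import codecs
import sys

with_bom = "ü".encode("utf-16")           # generic codec: BOM comes first
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

no_bom = "ü".encode("utf-16-le")          # explicit byte order: no BOM
assert no_bom == b"\xfc\x00"

print(sys.byteorder)                      # host endianness, e.g. 'little'
```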

Apr 6 '06 #18


Roger Binns wrote:

No. APSW converts it *to* Unicode. SQLite only accepts Unicode
so a Unicode string has to be supplied. If you supply a non-Unicode
string then conversion has to happen. APSW asks Python to
supply the string in Unicode. If Python can't do that (eg
it doesn't know the encoding) then you get an error.


If what you say is true, I have to ask why I get a conversion error
which states it can't convert to ASCII, not that it can't convert to Unicode?

Ok if SQLite uses unicode internally why do you need to ignore
everything greater than 127,


I never said that. I said that a special case is made so that
if the string you supply only contains ASCII characters (ie <=127)
then the ASCII string is converted to Unicode. (In fact it is
valid UTF-8 hence the shortcut).
the ascii table (256 bit one) fits into
unicode just fine as far as I recall?


No, ASCII characters have defined Unicode codepoints. The ASCII
character numbers just happen to be the same as the Unicode
codepoints. But there are only 128 ASCII characters.
Or did I miss the boat here ?


For bytes greater than 127, what character set is used? There
are hundreds of character sets that define those characters.
You have to tell the computer which one to use. See the Unicode
article referenced above.


Yes, I know there are a million "extended" ASCII character sets, which
happen to be the bane of all existence. Most computers deal in bytes
natively, and the 7-bit coding still causes problems to this day. But
since the error I get is a conversion error to ASCII, not from ASCII,
I am willing to accept loss of information. You can't code Unicode into
ASCII without loss of information or two-character codes. In my mind,
somewhere inside the "cursor.execute" function, it converts to ASCII. I
say this because of the error message received. So I am missing how a
function which supposedly converts everything to Unicode ends up doing
an ASCII conversion?

Apr 10 '06 #19

There's an Oracle environment variable that appears to make a
difference: NLS_CHARSET, perhaps - it's been a while since I've had to
deal with Oracle, and I'm not looking for another adventure into
Oracle's hideous documentation to find out.


That is an EVIL setting which should not be used. The NLS_CHARSET
environment variable causes so many headaches it's not worth playing
with it at all.

Apr 10 '06 #20


ChaosKCW wrote:
Roger Binns wrote:

No. APSW converts it *to* Unicode. SQLite only accepts Unicode
so a Unicode string has to be supplied. If you supply a non-Unicode
string then conversion has to happen. APSW asks Python to
supply the string in Unicode. If Python can't do that (eg
it doesn't know the encoding) then you get an error.
If what you say is true, I have to ask why I get a conversion error
which states it can't convert to ASCII, not that it can't convert to Unicode?


You do get an error about conversion to Unicode. Quote from your message:
SQLiteCur.execute(sql, row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)
Notice the name of the error: UnicodeDecodeError or, in other words,
ToUnicode.

So I am missing how a
function which supposedly converts everything to Unicode ends up doing
an ASCII conversion?


When Python tries to concatenate a byte string and a unicode string, it
assumes that the byte string is ASCII-encoded and tries to convert from
encoded ASCII to Unicode. It calls the ASCII decoder to do the decoding. If
decoding fails, you see the message from the ASCII decoder about the error.

Serge
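In modern Python 3 the implicit coercion Serge describes is gone entirely - mixing bytes and str raises TypeError instead - but the ASCII decode that Python 2 attempted behind the scenes can still be reproduced explicitly:

```python
# mixing the two types now fails immediately, with a clearer error
try:
    b"abc" + "def"
except TypeError as e:
    print(e)

# the decode Python 2 attempted implicitly, done by hand
try:
    b"It\xdcM".decode("ascii")
except UnicodeDecodeError as e:
    print(e)   # 'ascii' codec can't decode byte 0xdc in position 2: ...
```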

Apr 10 '06 #21

ChaosKCW wrote:

There's an Oracle environment variable that appears to make a
difference: NLS_CHARSET, perhaps - it's been a while since I've had to
deal with Oracle, and I'm not looking for another adventure into
Oracle's hideous documentation to find out.


That is an EVIL setting which should not be used. The NLS_CHARSET
environment variable causes so many headaches it's not worth playing
with it at all.


Well, at this very point in time I don't remember the preferred way of
getting Oracle, the client libraries and the database adapter to agree
on the character encoding used for communicating data between
applications and the database system. Nevertheless, what you need to do
is to make sure that you know which encoding is used so that if you
either get plain strings (i.e. not Unicode objects) out of the database,
or if you need to write plain strings to the database, you can provide
the encoding to the unicode built-in function or to the decode/encode
methods; this is much better than just stripping out characters that
can't be represented by ASCII.

Anyway, despite my objections to digging through Oracle documentation,
I found the following useful documents: the "Globalization Support"
index [1], an FAQ about NLS_LANG [2], and a white paper about Unicode
support in Oracle [3]. It may well be the case that NLS_LANG might help
you do what you want, but since the database systems I have installed
(PostgreSQL, sqlite3) seem to "do Unicode" without such horsing around,
I'm not really able to offer much more advice on this subject.

Paul

[1]
http://www.oracle.com/technology/tec...ion/index.html
[2]
http://www.oracle.com/technology/tec...tdocs/nls_lang
faq.htm
[3]
http://www.oracle.com/technology/tec...code_10gR2.pdf

Apr 10 '06 #22


When python tries to concatenate a byte string and a unicode string, it
assumes that the byte string is encoded ascii and tries to convert from
encoded ascii to unicode. It calls ascii decoder to do the decoding. If
decoding fails you see message from ascii decoder about the error.

Serge


Ok, I get it now. Sorry for the slowness. I have to say, as a lover of
Python for its simplicity and clarity, the character set thing has been
harder than I would have liked to figure out.

Thanks for all the help.

Apr 10 '06 #23

ChaosKCW wrote:
Ok, I get it now. Sorry for the slowness. I have to say, as a lover of
Python for its simplicity and clarity, the character set thing has been
harder than I would have liked to figure out.


I think there is room for improvement here. In my opinion the message
is too confusing for newbies. It would be easier for them if there were a
mini tutorial available about what's going on, with links to other broader
tutorials (like a Unicode tutorial). Instead of the generic error
----------------------
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 0:
ordinal not in range(128)
----------------------

It can be like this. Notice special url, it is pointing to a
non-existing :) tutorial about why concatenating byte strings with
unicode strings can produce UnicodeDecodeError
----------------------
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa5 in position 0:
ordinal
not in range(128)
For additional information about this exception see:
http://docs.python.org/2.4/exception...ecodeError+str
----------------------

Here is the sample code how it can be done:

---------------------------
extended_help = {
    ("concat", UnicodeDecodeError, str, unicode):
        "http://docs.python.org/2.4/exceptions/concat+UnicodeDecodeError+str",
    ("concat", UnicodeDecodeError, unicode, str):
        "http://docs.python.org/2.4/exceptions/concat+UnicodeDecodeError+str"
}

def get_more_help(error, key):
    if not extended_help.has_key(key):
        return
    error.reason += "\nFor additional information about this exception see:\n    "
    error.reason += extended_help[key]

def concat(s1, s2):
    try:
        return s1 + s2
    except Exception, e:
        key = "concat", e.__class__, type(s1), type(s2)
        get_more_help(e, key)
        raise

concat(chr(0xA5), unichr(0x5432))
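A rough modern Python 3 translation of the same idea: since bytes + str now raises TypeError rather than UnicodeDecodeError, the lookup key changes accordingly, and the link here points at the real Unicode HOWTO rather than Serge's hypothetical tutorial page:

```python
extended_help = {
    (TypeError, bytes, str): "https://docs.python.org/3/howto/unicode.html",
    (TypeError, str, bytes): "https://docs.python.org/3/howto/unicode.html",
}

def concat(s1, s2):
    try:
        return s1 + s2
    except Exception as e:
        key = (type(e), type(s1), type(s2))
        if key in extended_help:
            # append a pointer to further reading onto the exception message
            e.args = (str(e) + "\nFor additional information see: "
                      + extended_help[key],)
        raise

try:
    concat(b"\xa5", "\u5432")
except TypeError as e:
    print(e)
```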

Apr 11 '06 #24


"Serge Orlov" <Se*********@gmail.com> wrote in message news:11**********************@i39g2000cwa.googlegroups.com...
It can be like this. Notice special url, it is pointing to a
non-existing :) tutorial about why concatenating byte strings with
unicode strings can produce UnicodeDecodeError


An alternate is to give an error code that people can use to
look up (and users can report). It also has the advantage
of being language neutral. You can see this kind of approach
on IBM systems (mainframe, AIX etc) and Oracle.

Roger
Apr 11 '06 #25
