Psycopg and queries with UTF-8 data

Another python/psycopg question, for which the solution is probably
quite simple; I just don't know where to look.

I have a query that inserts data originating from an utf-8 encoded XML
file. And guess what, it contains utf-8 encoded characters...
Now my problem is that psycopg will only accept queries of type str, so
how do I get my utf-8 encoded data into the DB?

I can't do query.encode('ascii'), that would be similar to:
>>> x = u'\xc8'
>>> print x.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in position 0: ordinal not in range(128)

I also tried setting PostgreSQL's client-encoding by executing "SET
client_encoding TO 'utf-8'", but psycopg still only accepts str-type
strings (which is not really surprising).
I assume that the solution will result in an ascii encoded query string,
and that I then can use the QuotedString type to escape my strings
(which is in my current situation not possible because that also only
accepts str type strings and it contains utf-8 characters).

Regards,
Alban.
Jul 18 '05 #1
4 Replies


Alban Hertroys wrote:
> I have a query that inserts data originating from an utf-8 encoded XML
> file. And guess what, it contains utf-8 encoded characters...
> Now my problem is that psycopg will only accept queries of type str, so
> how do I get my utf-8 encoded data into the DB?
This sounds like the usual unicode/utf-8 confusion: unicode is an abstract
specification of characters; utf-8, latin1 and ascii are encodings of that
specification that allow certain characters to be used. Namely, ascii covers
only the well-known first 128 characters (code points 0-127), latin1 covers
some major european languages, and utf-8 defines byte sequences for all
possible characters defined in unicode, with the result that some characters
take more than one byte.

So unicode objects encapsulate an abstract sequence of unicode characters; how
they accomplish that is not your concern. Strings, on the other hand, are
pure byte sequences, and common libs work with them, with the exception of
the usually unicode-aware xml libs. So to get a string from a unicode
object, one has to specify an encoding, like utf-8 or latin1. If the unicode
object contains a character that can't be encoded using the specified
encoding, that will produce an error.
Please do read a tutorial on unicode and python - there are several good
ones out there, use google to your advantage.
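
The distinction between abstract characters and their encodings can be sketched in a few lines. This is only an illustration and assumes Python 3, where the thread's `unicode` type is called `str` and the thread's `str` type corresponds to `bytes`; the helper name `try_encode` is made up for this example.

```python
# Python 3 sketch: `str` plays the role of the thread's `unicode` type,
# and `bytes` plays the role of the thread's byte-oriented `str` type.
text = "\xc8"  # one abstract character, U+00C8


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent value."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


utf8_bytes = try_encode(text, "utf-8")      # two bytes in utf-8
latin1_bytes = try_encode(text, "latin-1")  # one byte in latin1
ascii_bytes = try_encode(text, "ascii")     # None: U+00C8 is outside 0-127

print(utf8_bytes, latin1_bytes, ascii_bytes)
```

The same character yields different byte sequences depending on the encoding, and no byte sequence at all when the encoding has no slot for it.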

> I can't do query.encode('ascii'), that would be similar to:
> >>> x = u'\xc8'
> >>> print x.encode('ascii')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in position 0: ordinal not in range(128)

Sure: 0xC8 > 127, so it can't be encoded in ascii. Do this:
>>> x = u'\xc8'
>>> x
u'\xc8'
>>> x.encode('utf-8')
'\xc3\x88'

As you can see, the single character becomes two bytes. The reason is that
this one unicode character is translated to that 2-byte sequence by utf-8.
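
Where the two bytes come from can be worked out by hand. The sketch below assumes Python 3 and covers only the two-byte range of utf-8 (U+0080 to U+07FF); the function name `encode_two_byte_utf8` is invented for illustration and is not part of any library.

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in U+0080..U+07FF as its two utf-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF use two bytes")
    # Lead byte: the bit pattern 110xxxxx carries the upper 5 bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: the bit pattern 10xxxxxx carries the lower 6 bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


manual = encode_two_byte_utf8(0xC8)
builtin = "\xc8".encode("utf-8")
print(manual == builtin)
```

For 0xC8 (binary 11001000) the upper bits 00011 give the lead byte 0xC3 and the lower bits 001000 give the continuation byte 0x88.
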
> I also tried setting PostgreSQL's client-encoding by executing "SET
> client_encoding TO 'utf-8'", but psycopg still only accepts str-type
> strings (which is not really surprising).

Confusion again - please repeat:

unicode is not utf-8!!!
unicode is not utf-8!!!
unicode is not utf-8!!!
unicode is not utf-8!!!

Do encode the unicode object in utf-8, and pass that to psycopg. If you
set client_encoding to latin1, you have to encode the unicode object to
latin1 instead.
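
The rule "encode to whatever client_encoding says" could be sketched as follows. This assumes Python 3 and deliberately leaves out the database call, since that needs a live connection; the mapping `PG_TO_PYTHON_CODEC` and the helper `encode_for_client` are hypothetical names made up for this example, not psycopg API.

```python
# Hypothetical mapping from PostgreSQL client_encoding names to Python codecs.
PG_TO_PYTHON_CODEC = {"UTF8": "utf-8", "LATIN1": "latin-1"}


def encode_for_client(text, client_encoding):
    """Encode text into the byte sequence the server expects from this client."""
    codec = PG_TO_PYTHON_CODEC[client_encoding.upper()]
    return text.encode(codec)


value = "caf\xe9 \xc8"
as_utf8 = encode_for_client(value, "UTF8")
as_latin1 = encode_for_client(value, "LATIN1")

# The byte strings differ, but both decode back to the same characters.
print(as_utf8, as_latin1)
print(as_utf8.decode("utf-8") == as_latin1.decode("latin-1"))
```

A character that the chosen encoding cannot represent, such as the euro sign with latin1, raises UnicodeEncodeError here, which is the error described earlier in this post.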

--
Regards,

Diez B. Roggisch
Jul 18 '05 #2

Alban Hertroys <al***@magproductions.nl> wrote:
> I have a query that inserts data originating from an utf-8 encoded XML
> file. And guess what, it contains utf-8 encoded characters...
> Now my problem is that psycopg will only accept queries of type str, so
> how do I get my utf-8 encoded data into the DB?
>
> I can't do query.encode('ascii'), that would be similar to:
> >>> x = u'\xc8'
> >>> print x.encode('ascii')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in position 0: ordinal not in range(128)

Did you try x.encode('utf-8')?

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #3

Diez B. Roggisch wrote:
> Alban Hertroys wrote:
>> I have a query that inserts data originating from an utf-8 encoded XML
>> file. And guess what, it contains utf-8 encoded characters...
>> Now my problem is that psycopg will only accept queries of type str, so
>> how do I get my utf-8 encoded data into the DB?
>
> This sounds like the usual unicode/utf-8 confusion: unicode is an abstract
> specification of characters, utf-8 as well as latin1 and ascii are
> encodings of that specification that allow for certain characters to be
> used - namely, ascii for only well-known first 127, latin1 for some major
> european languages, and utf-8 defines escapes for all possible characters
> defined in unicode - with the result that some of the characters aren't one
> byte per character anymore.

Ah, I see now. I _thought_ it was odd that unicode('string') resulted in
a unicode object and 'string'.encode('utf-8') did not. I understand now
that 'unicode' is data that is actual unicode data, while 'utf-8'
_encoded_ data is really a string, but with special characters rewritten
to specify utf-8 escape sequences instead of the actual unicode bytes.
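
The difference between decoded text and utf-8 encoded bytes can be seen end to end with data that starts out in an XML document, as in the original question. This is a sketch assuming Python 3 and the standard library's `xml.etree.ElementTree`; the `<name>` element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# utf-8 encoded XML as it would arrive from a file: a sequence of bytes.
xml_bytes = '<?xml version="1.0" encoding="utf-8"?><name>\xc8ve</name>'.encode("utf-8")

# The parser decodes the bytes and hands back text (the thread's "unicode").
root = ET.fromstring(xml_bytes)
name = root.text

# Encoding that text again gives the byte string (the thread's "str").
encoded = name.encode("utf-8")

print(type(name).__name__, len(name))        # text: counted in characters
print(type(encoded).__name__, len(encoded))  # bytes: counted in bytes
```

The text has three characters while its utf-8 form has four bytes, because the first character needs two bytes.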

Thanks for clearing out my confusion.

> Please do read a tutorial on unicode and python - there are several good
> ones out there, use google to your advantage.

I did, though some time ago. Apparently I missed the point being made
(or forgot about it).

> Confusion again - please repeat:
>
> unicode is not utf-8!!!
> unicode is not utf-8!!!
> unicode is not utf-8!!!
> unicode is not utf-8!!!

while confused():
    print "unicode is not utf-8!!!"

> Do encode the unicode object in utf-8, and pass that to the psycopg. If you
> set client_encoding to latin1, you have to encode unicod to that.


I suppose I won't notice much of that until I read from the DB (which is
done in PHP mostly), as the data inserted is already an ascii string by
itself (with escaped utf-8 characters, though). I'll worry about that
later ;)

Many thanks,
Alban.
Jul 18 '05 #4

> Ah, I see now. I _thought_ it was odd that unicode('string') resulted in
> a unicode object and 'string'.encode('utf-8') did not. I understand now
> that 'unicode' is data that is actual unicode data, while 'utf-8'
> _encoded_ data is really a string, but with special characters rewritten
> to specify utf-8 escape sequences instead of the actual unicode bytes.

Exactly.

> Thanks for clearing out my confusion.

You're welcome.

> while confused():
>     print "unicode is not utf-8!!!"


Let's hope confused() is True only for a short time, otherwise you'll end up
with quite a lot of output...
> Do encode the unicode object in utf-8, and pass that to the psycopg. If
> you set client_encoding to latin1, you have to encode unicod to that.
>
> I suppose I won't notice much of that until I read from the DB (which is
> done in PHP mostly), as the data inserted is already an ascii string by
> itself (with escaped utf-8 characters, though). I'll worry about that
> later ;)


Well, AFAIK php doesn't care about unicode - all it knows are strings as
byte sequences, plain old C-style. So if you read from the DB, things should
work if you set your HTTP headers correctly _and_ the other parts of your
html page aren't in a different encoding - so make sure that whatever you
type in your editor of choice is saved as utf-8 as well.
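
Why the declared charset has to match the bytes can be shown without a web server. The sketch below assumes Python 3 and uses Python rather than PHP to stay in the thread's language; `charset_from_header` is a hypothetical, deliberately simple helper and not a full header parser.

```python
def charset_from_header(header_line):
    """Pull the charset parameter out of a Content-Type header line."""
    _, _, params = header_line.partition(";")
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key.lower() == "charset":
            return val.strip()
    return None


# Bytes as stored in the DB and sent unchanged by a byte-oriented script.
body = "\xc8".encode("utf-8")
header = "Content-Type: text/html; charset=utf-8"

declared = charset_from_header(header)
correct = body.decode(declared)      # the reader sees the intended character
mismatched = body.decode("latin-1")  # two wrong characters instead of one

print(repr(correct), repr(mismatched))
```

When the declaration and the bytes agree, the reader gets the original character back; when they disagree, each byte of the utf-8 sequence is shown as a separate latin1 character.
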
--
Regards,

Diez B. Roggisch
Jul 18 '05 #5
