
adodbapi / string encoding problem

P: n/a
Hi,

I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disk via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disk via file(...) due to encoding
problems. How do I have to decode the unicode string to get my original data
back?

regards,
Achim
Jul 18 '05 #1
10 Replies


P: n/a
Achim Domma wrote:
Hi,

I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disk via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disk via file(...) due to
encoding problems. How do I have to decode the unicode string to get my
original data back?


You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*
(which stands for COder-DECoder). I don't know what codec adodbapi is
using. Python's normal default codec is ASCII, which is the "minimum
common denominator" of just about every encoding around; if adodbapi
hadn't surreptitiously inserted a different codec, nothing could have
been decoded that would cause problems when encoding it back ;-).
Alex

Jul 18 '05 #2
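
The "same codec in both directions" rule can be tried out in isolation. The following is only a sketch, written for Python 3 (an assumption; the thread itself uses Python 2), where `bytes` plays the role of the old `str` and `str` that of `unicode`. The sample bytes are made up for illustration and have nothing to do with what adodbapi actually does.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
original = b"Gr\xfc\xdfe"  # hypothetical sample: "Gruesse" with umlaut and sharp s, in ISO-8859-1

text = original.decode("iso-8859-1")   # bytes -> Unicode text
same = text.encode("iso-8859-1")       # same codec: the original bytes come back
other = text.encode("utf-8")           # different codec: different bytes

print(same == original)    # True
print(other == original)   # False

# The ASCII codec rejects anything outside 0..127, which is why a silent
# ASCII default tends to fail loudly rather than corrupt data.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("ascii cannot encode this text")
```

If the codec used for decoding is unknown, the encode step is guesswork, which is the point made above.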

P: n/a
Achim Domma wrote:
I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disk via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disk via file(...) due to
encoding problems. How do I have to decode the unicode string to get my
original data back?


You have to know the encoding of the original file.

Assuming (1) you had western european characters including the euro sign,
(2) they were correctly translated into unicode and (3) you want them back
that way:
>>> s = u"äöüÄÖÜ".encode("iso-8859-15")
>>> s
'\xe4\xf6\xfc\xc4\xd6\xdc'
>>> print s
äöüÄÖÜ
>>> type(s)
<type 'str'>


Or more general:

unicodeFromAccess.encode(targetEncoding)

Peter
Jul 18 '05 #3
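
For the original goal of getting the text back onto disk, one possible approach is sketched below. The names `unicode_from_access` and `target_encoding` are placeholders following the reply above, the sample text is invented, and ISO-8859-15 is only an assumed encoding for the page. Either encode explicitly and write bytes in binary mode, or let `io.open` encode on write.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the value read back from the memo field.
unicode_from_access = u"\xe4\xf6\xfc \u20ac"   # a-umlaut, o-umlaut, u-umlaut, space, euro sign
target_encoding = "iso-8859-15"                # assumed encoding of the original page

path = os.path.join(tempfile.mkdtemp(), "out.html")

# Option 1: encode explicitly, then write the resulting bytes in binary mode.
with open(path, "wb") as f:
    f.write(unicode_from_access.encode(target_encoding))

# Option 2: open the file with an encoding and write the Unicode text directly.
with io.open(path, "w", encoding=target_encoding) as f:
    f.write(unicode_from_access)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    data = f.read()
print(repr(data))
```

Both options produce the same file here; the second is convenient when many pieces of text are written to one file.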

P: n/a
"Alex Martelli" <al***@aleax.it> wrote in message
news:0Z**********************@news1.tin.it...
You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*

[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in win32com.
Don't know if you will ever get your data back, once it's converted to
Variant. ;-)

Achim
Jul 18 '05 #4

P: n/a
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?

Achim
Jul 18 '05 #5

P: n/a
Achim Domma wrote:
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?


str is essentially a sequence of bytes that can store the same content in
different ways:
>>> utf8 = u"ä".encode("utf8")
>>> latin = u"ä".encode("latin1")
>>> latin
'\xe4'
>>> utf8
'\xc3\xa4'
Now imagine you store the latter byte sequence in your database and want to
display it in your windows editor
>>> print utf8
ä
(you should see two strange characters)

I had this problem occasionally when I edited python scripts with idle and,
oddly enough, my old c++ builder 3 ide.

Unicode was introduced to avoid such ambiguities. Now I guess that the first
conversion, when your string data is fed to the db api, is performed
automatically using the default encoding of your environment, which may
differ from the encoding of the downloaded file, thus probably messing up
some characters.

Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.

Peter
Jul 18 '05 #6
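
The "two strange characters" effect described above can be reproduced without a database. This is a small sketch that only uses the standard codecs; it shows what happens when UTF-8 bytes are read with the wrong codec, and that the damage is reversible only when both codecs are known.

```python
text = u"\xe4"                      # the single character a-umlaut

utf8 = text.encode("utf-8")         # two bytes: 0xC3 0xA4
latin = text.encode("latin-1")      # one byte: 0xE4

# Reading the UTF-8 bytes with the wrong codec yields two characters.
garbled = utf8.decode("latin-1")
print(len(garbled))                 # 2

# The wrong guess can be undone, but only by knowing which codecs were involved.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == text)             # True

# The opposite mistake fails outright: a lone 0xE4 byte is not valid UTF-8.
try:
    latin.decode("utf-8")
except UnicodeDecodeError:
    print("latin-1 bytes are not valid UTF-8")
```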

P: n/a
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
str is essentially a sequence of bytes that can store the same content in
different ways:
That's clear so far ...
Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.


... and that's exactly what I was looking for and what I would expect. My
string is a sequence of bytes, which I want to store in the database. And
exactly that sequence is what I want to have back. The encoding of the data
is stored in an extra column and handling of this information takes place in
another part of the application. But there are points where I need the
original data, so I need to save and retrieve the string in
exactly the way I get it from the web.

BTW: How would you save binary data in an Access database? Access knows only
Memo fields or am I wrong?

Achim
Jul 18 '05 #7

P: n/a
Achim Domma wrote:
BTW: How would you save binary data in an Access database? Access knows
only Memo fields or am I wrong?


CREATE TABLE Bogus (TheFile BINARY);

might do to create the "Bogus" table with a binary "TheFile" field.
As of Access 2000, I think the BINARY datatype is not exposed in the table
designer, so you have to type the SQL into the query designer and then
execute the query.

I have never used it, so the above might or might not work.

Peter
Jul 18 '05 #8

P: n/a
Achim Domma fed this fish to the penguins on Thursday 25 September 2003
04:52 am:

Memo field in an Access database, using adodbapi. If I read the text
back I get a unicode string, which cannot be written to disk via
file(...) due to encoding problems. How do I have to decode the
unicode string to get my original data back?
I suspect you are running on an NT-family machine. As I recall, NT
uses unicode internally, whereas the W9x-family still used 8-bit ANSI
code pages. Many of the system calls have variations with an "A" at the
end of the name to mark the ANSI version.

The conversion to unicode is probably being performed by the JET
engine on writes -- by detecting the lack of a unicode prefix, maybe?
However, retrieval is probably using the non-A system calls, leaving
the data in unicode (on unicode OS, on W9x it likely stays 8-bit in
both directions).

Suspect you'll need to determine what unicode encoding is used by
Windows.

--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net | Bestiaria Support Staff
Bestiaria Home Page: http://www.beastie.dm.net/
Home Page: http://www.dm.net/~wulfraed/


Jul 18 '05 #9

P: n/a
Achim Domma wrote:
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?


Because the Access engine (actually known as Microsoft Jet: "Access" is
only, strictly a *FRONT-END* product -- marketing terminology confusion)
stores all text strings as Unicode; and COM (thus ADO) also uses Unicode
exclusively for all text strings (as a rule). If you cannot move to
better engines and interfaces, you're stuck with the ones you have...
(99 times out of 100, moving to better engines and interfaces -- e.g.
SQLite and PySQLite, or Firebird, etc, is preferable from most points
of view -- but 1% of the time one must keep supporting legacy code...).
Alex

Jul 18 '05 #10
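
For comparison, here is a rough sketch of the "other engine" route, using Python 3's standard `sqlite3` module as a stand-in for PySQLite (an assumption, as is Python 3 itself). The table name, column names, sample bytes and URL are all hypothetical. The page body goes into a BLOB column untouched, and the encoding is kept in a separate column, as described earlier in the thread.

```python
import sqlite3

# Arbitrary bytes as they might come back from a read() call; hypothetical sample.
html = b"<p>\xe4\xf6\xfc \xff\x00\x80</p>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, encoding TEXT, body BLOB)")

# Passing a bytes object as a parameter stores it as a BLOB, with no text conversion.
conn.execute(
    "INSERT INTO pages (url, encoding, body) VALUES (?, ?, ?)",
    ("http://example.invalid/", "iso-8859-1", html),
)

(stored,) = conn.execute("SELECT body FROM pages").fetchone()
conn.close()

print(bytes(stored) == html)  # True
```

This says nothing about how Jet or ADO treat binary columns; it only illustrates that a BLOB column sidesteps the text conversion entirely.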

P: n/a
Achim Domma wrote:
"Alex Martelli" <al***@aleax.it> wrote in message
news:0Z**********************@news1.tin.it...
You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*

[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in
win32com. Don't know if you will ever get your data back, once it's
converted to Variant. ;-)


So, take control of your destiny: since you know you're using tools
that can only deal with Unicode (and thus will inevitably convert --
in ways that perhaps you don't know -- if you pass them bytestrings),
preempt their "unknown and unwanted" conversion by doing a Unicode
conversion yourself in ways you DO know and control. UTF-16 sticks
2 bytes into each Unicode character -- you do need to be working with
strings of EVEN length, though. Or else you can use, e.g., ISO-8859-1,
and resign yourself to spending one Unicode character per byte in
your "real" byte-string.

Or else, of course, you can use a "BLOB" field instead of a text
one; I think the keyword for that in the Jet engine's DDL SQL is
BINARY. If you DO need to use Access to manipulate your db, though
(and I can see deucedly few other reasons to use a Jet engine...),
I think that might not work -- at least back when I was having to
work on MS platform, I seem to recall that Access could not truly
support BLOB fields (except perhaps with embedded SQL, but that was
not considered acceptable in most Access-addicted shops, since the
real reason to use Access was NOT having to understand SQL...;-).
Alex

Jul 18 '05 #11
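
The ISO-8859-1 suggestion above can be checked with a short sketch (Python 3 assumed; the thread uses Python 2). It converts every possible byte value to Unicode and back, and also shows the even-length restriction for UTF-16. Whether the database itself preserves every resulting code point unchanged, in particular NUL and other control characters, is not verified here and would need testing against the real memo field.

```python
# Every possible byte value, as a worst case for the round trip.
raw = bytes(range(256))

# ISO-8859-1 maps byte N to code point N, so decoding never fails ...
as_text = raw.decode("iso-8859-1")
assert len(as_text) == len(raw)

# ... and encoding with the same codec restores the exact bytes.
restored = as_text.encode("iso-8859-1")
print(restored == raw)  # True

# UTF-16 packs two bytes into each character, so odd-length data is rejected.
try:
    b"abc".decode("utf-16-le")
except UnicodeDecodeError:
    print("odd-length input is not valid UTF-16")
```

In this scheme the text handed to the database is `as_text`, and the value read back is encoded with "iso-8859-1" to recover the original bytes.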
