adodbapi / string encoding problem

Hi,

I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disc via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disc via file(...) due to encoding
problems. How do I have to decode the unicode string to get my original data
back?

regards,
Achim
Jul 18 '05 #1
Achim Domma wrote:
Hi,

I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disc via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disc via file(...) due to
encoding problems. How do I have to decode the unicode string to get my
original data back?


You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*
(which stands for COder-DECoder). I don't know what codec adodbapi is
using (Python's normal default codec is ASCII, which is the "minimum
common denominator" of just about every encoding around -- if adodbapi
hadn't surreptitiously inserted a different codec, it's impossible that
anything would be decoded that might cause problems in encoding it back;-).
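
To make the same-codec rule concrete, here is a minimal sketch. It is written for Python 3, where `bytes` plays the role of this thread's `str` and `str` the role of `unicode`. The sample bytes and the choice of cp1252 as the decoding codec are assumptions for illustration only, not what adodbapi is known to use.

```python
# Sketch only: cp1252 stands in for whatever codec the database layer
# really applied when it turned the byte string into Unicode.
original = b"caf\xe9 \x80"          # bytes as they might come from the web

text = original.decode("cp1252")    # the *DE*-coding step -> Unicode text

# Encoding with the same codec restores the original bytes exactly.
same_codec = text.encode("cp1252")

# Encoding with a different codec yields different bytes.
other_codec = text.encode("utf-8")

print(same_codec == original)
print(other_codec == original)
```

The first comparison holds and the second does not, which is why the codec used on the way in has to be known before the data can be recovered.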
Alex

Jul 18 '05 #2
Achim Domma wrote:
I read a webpage via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disc via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disc via file(...) due to
encoding problems. How do I have to decode the unicode string to get my
original data back?


You have to know the encoding of the original file.

Assuming (1) you had western european characters including the euro sign,
(2) they were correctly translated into unicode and (3) you want them back
that way:
>>> s = u"äöüÄÖÜ".encode("iso-8859-15")
>>> s
'\xe4\xf6\xfc\xc4\xd6\xdc'
>>> print s
äöüÄÖÜ
>>> type(s)
<type 'str'>


Or more general:

unicodeFromAccess.encode(targetEncoding)
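
The general form above could be wrapped as follows. This is a sketch in Python 3 (where `str` is this thread's `unicode` and `bytes` is this thread's `str`). `unicode_to_bytes` and its arguments are hypothetical names, and the target encoding still has to be known from somewhere else.

```python
def unicode_to_bytes(text, target_encoding):
    """Encode Unicode text read from the database back into bytes.

    Raises UnicodeEncodeError if the text contains a character that
    the target encoding cannot represent.
    """
    return text.encode(target_encoding)


sample = "äöüÄÖÜ€"

# ISO-8859-15 covers the euro sign (as byte 0xA4).
encoded = unicode_to_bytes(sample, "iso-8859-15")
print(encoded)

# ISO-8859-1 has no euro sign, so the same text cannot be encoded.
try:
    unicode_to_bytes(sample, "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

Picking the wrong target encoding therefore either fails loudly, as here, or silently produces bytes that differ from the original file.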

Peter
Jul 18 '05 #3
"Alex Martelli" <al***@aleax.it> wrote in message
news:0Z**********************@news1.tin.it...
You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*

[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in win32com.
Don't know if you will ever get your data back, once it's converted to
Variant. ;-)

Achim
Jul 18 '05 #4
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?

Achim
Jul 18 '05 #5
Achim Domma wrote:
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?


str is essentially a sequence of bytes that can store the same content in
different ways:
>>> utf8 = u"ä".encode("utf8")
>>> latin = u"ä".encode("latin1")
>>> latin
'\xe4'
>>> utf8
'\xc3\xa4'

Now imagine you store the latter byte sequence in your database and want to
display it in your windows editor

>>> print utf8
Ã¤

(you should see two strange characters)

I had this problem occasionally when I edited python scripts with idle and,
oddly enough, my old c++ builder 3 ide.

To avoid such ambiguities, unicode is introduced. Now I guess that the first
conversion, when your string data is fed to the db api, is performed
automatically using the default encoding of your environment, which may
differ from the encoding of the downloaded file, thus probably messing up
some characters.
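
Both effects described above can be reproduced in a few lines. The sketch is in Python 3 (`bytes` corresponds to this thread's `str`). cp1252 is assumed as the Windows display encoding, and the explicit ASCII decode only imitates what an automatic default-encoding conversion would do.

```python
utf8 = "ä".encode("utf-8")       # b'\xc3\xa4'
latin = "ä".encode("latin-1")    # b'\xe4'

# Interpreting UTF-8 bytes with a single-byte Windows codec turns one
# character into two: the "strange characters" effect.
mojibake = utf8.decode("cp1252")
print(len(mojibake), mojibake == "Ã¤")

# A strict ASCII conversion cannot handle bytes above 0x7F at all.
try:
    latin.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```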

Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.

Peter
Jul 18 '05 #6
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
str is essentially a sequence of bytes that can store the same content in
different ways:
That's clear so far ...
Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.


... and that's exactly what I was looking for and what I would expect. My
string is a sequence of bytes, which I want to store in the database. And
exactly that sequence is what I want to have back. The encoding of the data
is stored in an extra column and handling of this information takes place in
another part of the application. But there are points where I need the
original data, so it's required for me to save and retrieve the string in
exactly the way I get it from the web.

BTW: How would you save binary data in an Access database? Access knows only
Memo fields or am I wrong?

Achim
Jul 18 '05 #7
Achim Domma wrote:
BTW: How would you save binary data in an Access database? Access knows
only Memo fields or am I wrong?


CREATE TABLE Bogus (TheFile BINARY);

might do to create the "Bogus" table with a binary "TheFile" field.
As of Access 2000, I think the BINARY datatype is not exposed in the table
designer, so you have to type the SQL into the query designer and then
execute the query.

I have never used it, so the above might or might not work.

Peter
Jul 18 '05 #8
Achim Domma fed this fish to the penguins on Thursday 25 September 2003
04:52 am:

Memo field in an Access database, using adodbapi. If I read the text
back I get a unicode string, which cannot be written to disc via
file(...) due to encoding problems. How do I have to decode the
unicode string to get my original data back?

I suspect you are running on an NT-family machine. As I recall, NT
uses Unicode internally, whereas the W9x family still used 8-bit ANSI
code pages. Many of the system calls have variants with an "A" at the
end of the name to indicate ANSI data.

The conversion to Unicode is probably being performed by the JET
engine on writes -- by detecting the lack of a Unicode prefix, maybe?
However, retrieval is probably using the non-A system calls, leaving
the data in Unicode (on a Unicode OS; on W9x it likely stays 8-bit in
both directions).

I suspect you'll need to determine which Unicode encoding is used by
Windows.

--
Wulfraed (Dennis Lee Bieber), KD6MOG
wl*****@ix.netcom.com | wu******@dm.net
Bestiaria Support Staff
Bestiaria Home Page: http://www.beastie.dm.net/
Home Page: http://www.dm.net/~wulfraed/

Jul 18 '05 #9
Achim Domma wrote:
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
You have to know the encoding of the original file.


Why? It's of type 'str' and I would expect that I could write it to DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?


Because the Access engine (actually known as Microsoft Jet: "Access" is
only, strictly a *FRONT-END* product -- marketing terminology confusion)
stores all text strings as Unicode; and COM (thus ADO) also uses Unicode
exclusively for all text strings (as a rule). If you cannot move to
better engines and interfaces, you're stuck with the ones you have...
(99 times out of 100, moving to better engines and interfaces -- e.g.
SQLite and PySQLite, or Firebird, etc, is preferable from most points
of view -- but 1% of the time one must keep supporting legacy code...).
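
For comparison, here is a minimal sketch of a byte-for-byte round trip with SQLite. It uses the `sqlite3` module from today's Python 3 standard library rather than the PySQLite package of the time. The table and column names are made up for the example, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Arbitrary bytes, including values that are not valid ASCII or UTF-8.
page = b"<html>\xe4\xf6\xfc \xff\x00</html>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, body BLOB)")

# A bytes parameter is stored as a BLOB, with no text conversion.
conn.execute("INSERT INTO pages VALUES (?, ?)", ("http://example.com/", page))

(stored,) = conn.execute("SELECT body FROM pages").fetchone()
conn.close()

print(type(stored).__name__, stored == page)
```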
Alex

Jul 18 '05 #10
Achim Domma wrote:
"Alex Martelli" <al***@aleax.it> wrote in message
news:0Z**********************@news1.tin.it...
You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*

[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in
win32com. Don't know if you will ever get your data back, once it's
converted to Variant. ;-)


So, take control of your destiny: since you know you're using tools
that can only deal with Unicode (and thus will inevitably convert --
in ways that perhaps you don't know -- if you pass them bytestrings),
preempt their "unknown and unwanted" conversion by doing a Unicode
conversion yourself in ways you DO know and control. UTF-16 sticks
2 bytes into each Unicode character -- you do need to be working with
strings of EVEN length, though. Or else you can use, e.g., ISO-8859-1,
and resign yourself to spending one Unicode character per byte in
your "real" byte-string.
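
A sketch of the ISO-8859-1 option, in Python 3 (`bytes` is this thread's `str`, `str` is this thread's `unicode`). `pack` and `unpack` are hypothetical helper names, and the database write and read that would sit between them are left out.

```python
def pack(data):
    """Map each byte to the Unicode character with the same number."""
    return data.decode("iso-8859-1")


def unpack(text):
    """Inverse of pack(): recover the original bytes."""
    return text.encode("iso-8859-1")


# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
round_trip = unpack(pack(all_bytes))
print(round_trip == all_bytes)

# UTF-16 packs two bytes per character, so an odd-length byte string
# cannot be decoded this way.
try:
    b"abc".decode("utf-16-le")
    odd_ok = True
except UnicodeDecodeError:
    odd_ok = False
print(odd_ok)
```

This only helps if the database layer stores and returns every character from U+0000 to U+00FF unchanged, which this thread does not confirm for Jet.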

Or else, of course, you can use a "BLOB" field instead of a text
one; I think the keyword for that in the Jet engine's DDL SQL is
BINARY. If you DO need to use Access to manipulate your db, though
(and I can see deucedly few other reasons to use a Jet engine...),
I think that might not work -- at least back when I was having to
work on MS platform, I seem to recall that Access could not truly
support BLOB fields (except perhaps with embedded SQL, but that was
not considered acceptable in most Access-addicted shops, since the
real reason to use Access was NOT having to understand SQL...;-).
Alex

Jul 18 '05 #11
