adodbapi / string encoding problem

Hi,

I read a web page via urllib2. The result of the 'read' call is of type
'str'. This string can be written to disk via
file('out.html','w').write(html). Then I write the string into a Memo field
in an Access database, using adodbapi. If I read the text back I get a
unicode string, which cannot be written to disk via file(...) due to encoding
problems. How do I have to decode the unicode string to get my original data
back?

regards,
Achim
Jul 18 '05 #1
Achim Domma wrote:
> Hi,
>
> I read a web page via urllib2. The result of the 'read' call is of type
> 'str'. This string can be written to disk via
> file('out.html','w').write(html). Then I write the string into a Memo field
> in an Access database, using adodbapi. If I read the text back I get a
> unicode string, which cannot be written to disk via file(...) due to
> encoding problems. How do I have to decode the unicode string to get my
> original data back?


You have to *EN*-code Unicode into a string, in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*
(which stands for COder-DECoder). I don't know what codec adodbapi is
using (Python's normal default codec is ASCII, which is the "lowest
common denominator" of just about every encoding around -- if adodbapi
hadn't surreptitiously inserted a different codec, nothing could have
been decoded that might cause problems in encoding it back ;-).
Alex
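
A minimal sketch of the codec round trip described above, written in Python 3 syntax (where the thread's unicode type is called str and its str type is called bytes). The sample text and the choice of ISO-8859-15 are illustrative assumptions, not something the thread establishes about adodbapi:

```python
# Sample data (an assumption): bytes as they might arrive from a web page.
original = "Grüße, 10 €".encode("iso-8859-15")

text = original.decode("iso-8859-15")   # DE-code: bytes -> Unicode text
same = text.encode("iso-8859-15")       # EN-code with the same codec
other = text.encode("utf-8")            # EN-code with a different codec

print(same == original)    # True
print(other == original)   # False

# ASCII cannot represent these characters, so encoding with it fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)            # False
```

Only the encode call that uses the same codec as the decode call reproduces the original bytes.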

Jul 18 '05 #2
Achim Domma wrote:
> I read a web page via urllib2. The result of the 'read' call is of type
> 'str'. This string can be written to disk via
> file('out.html','w').write(html). Then I write the string into a Memo field
> in an Access database, using adodbapi. If I read the text back I get a
> unicode string, which cannot be written to disk via file(...) due to
> encoding problems. How do I have to decode the unicode string to get my
> original data back?


You have to know the encoding of the original file.

Assuming (1) you had Western European characters including the euro sign,
(2) they were correctly translated into Unicode and (3) you want them back
that way:

>>> s = u"äöüÄÖÜ".encode("iso-8859-15")
>>> s
'\xe4\xf6\xfc\xc4\xd6\xdc'
>>> print s
äöüÄÖÜ
>>> type(s)
<type 'str'>

Or, more generally:

unicodeFromAccess.encode(targetEncoding)

Peter
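
As a rough illustration of the general form above, here is a Python 3 sketch that encodes explicitly before writing, so that no implicit default codec gets involved. The names text_from_db and target_encoding are hypothetical stand-ins for the value read back through adodbapi and for the encoding of the original page:

```python
import os
import tempfile

text_from_db = "äöüÄÖÜ"          # hypothetical: Unicode text read back from the database
target_encoding = "iso-8859-15"  # assumed: the encoding of the original page

path = os.path.join(tempfile.mkdtemp(), "out.html")

# Encode explicitly and write raw bytes in binary mode.
with open(path, "wb") as f:
    f.write(text_from_db.encode(target_encoding))

with open(path, "rb") as f:
    data = f.read()

print(data)  # b'\xe4\xf6\xfc\xc4\xd6\xdc'
```

This only recovers the original bytes if the text was decoded correctly on the way into the database, which is condition (2) above.
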
Jul 18 '05 #3
"Alex Martelli" <al***@aleax.it> wrote in message
news:0Z**********************@news1.tin.it...
> You have to *EN*-code Unicode into a string, in the same way the string
> had been *DE*-coded to Unicode originally, in order to be sure to get
> the same string back; specifically, you have to use the same *codec*

[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in win32com.
Don't know if you will ever get your data back, once it's converted to
Variant. ;-)

Achim
Jul 18 '05 #4
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
> You have to know the encoding of the original file.

Why? It's of type 'str' and I would expect that I could write it to the DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?

Achim
Jul 18 '05 #5
Achim Domma wrote:
>> You have to know the encoding of the original file.
>
> Why? It's of type 'str' and I would expect that I could write it to the DB and
> get the same 'str' back. That's all I want. Why is it required to know the
> encoding?


str is essentially a sequence of bytes that can store the same content in
different ways:

>>> utf8 = u"ä".encode("utf8")
>>> latin = u"ä".encode("latin1")
>>> latin
'\xe4'
>>> utf8
'\xc3\xa4'

Now imagine you store the latter byte sequence in your database and want to
display it in your Windows editor:

>>> print utf8
Ã¤

(you should see two strange characters)

I had this problem occasionally when I edited Python scripts with IDLE and,
oddly enough, my old C++Builder 3 IDE.

To avoid such ambiguities, Unicode was introduced. Now I guess that the first
conversion, when your string data is fed to the DB API, is performed
automatically using the default encoding of your environment, which may
differ from the encoding of the downloaded file, thus probably messing up
some characters.

Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.

Peter
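
The ambiguity described above can be reproduced in a few lines. This is a sketch in Python 3 syntax (bytes in place of the thread's str), using the same single character as the session above:

```python
utf8 = "ä".encode("utf-8")      # b'\xc3\xa4'
latin = "ä".encode("latin-1")   # b'\xe4'

# Reading UTF-8 bytes as if they were Latin-1 yields two characters instead of one.
misread = utf8.decode("latin-1")
print(len(misread), misread)    # 2 Ã¤

# Reading Latin-1 bytes as UTF-8 fails: 0xE4 starts a multi-byte
# sequence that is never completed.
try:
    latin.decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
print(decoded)                  # False
```

The bytes themselves carry no label saying which interpretation is right, which is why the encoding has to be known from somewhere else.
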
Jul 18 '05 #6
"Peter Otten" <__*******@web.de> wrote in message
news:bk*************@news.t-online.com...
> str is essentially a sequence of bytes that can store the same content in
> different ways:

That's clear so far ...

> Of course you could store the file in binary form (not in a memo field) in
> your db and thus bypass all encoding mechanisms, but if you still think
> that a string is a string is a string, you should reread the above or
> go for more detailed information on the matter.

... and that's exactly what I was looking for and what I would expect. My
string is a sequence of bytes, which I want to store in the database. And
exactly that sequence is what I want to have back. The encoding of the data
is stored in an extra column, and handling of this information takes place in
another part of the application. But there are points where I need the
original data, so I have to save and retrieve the string in
exactly the way I get it from the web.

BTW: How would you save binary data in an Access database? Access knows only
Memo fields or am I wrong?

Achim
Jul 18 '05 #7
Achim Domma wrote:
> BTW: How would you save binary data in an Access database? Access knows
> only Memo fields or am I wrong?


CREATE TABLE Bogus (TheFile BINARY);

might do to create the "Bogus" table with a binary "TheFile" field.
As of Access 2000, I think the BINARY datatype is not exposed in the table
designer, so you have to type the SQL into the query designer and then
execute the query.

I have never used it, so the above might or might not work.

Peter
Jul 18 '05 #8
Achim Domma fed this fish to the penguins on Thursday 25 September 2003
04:52 am:

> Memofield in an Access database, using adodbapi. If I read the text
> back I get a unicode string, which can not written to disc via
> file(...) due to encoding problems. How do I have to decode the
> unicode string to get my original data back?

I suspect you are running on an NT-family machine. As I recall, NT
uses Unicode internally, whereas the W9x family still used 8-bit "ANSI"
code pages. Many of the system calls have variants with an "A" at the end
of the name to indicate ANSI data.

The conversion to Unicode is probably being performed by the JET
engine on writes -- by detecting the lack of a Unicode prefix, maybe?
However, retrieval is probably using the non-A system calls, leaving
the data in Unicode (on a Unicode OS; on W9x it likely stays ANSI in
both directions).

I suspect you'll need to determine what Unicode encoding is used by
Windows.
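
One way to at least inspect the encodings the environment reports is sketched below, in Python 3. Whether Jet or win32com actually applies the locale's preferred encoding when handed a byte string is an assumption; nothing in this thread confirms it.

```python
import locale
import sys

# Python's own default codec for implicit conversions
# (ASCII in the Python 2 of this thread, UTF-8 in Python 3).
python_default = sys.getdefaultencoding()

# The encoding the OS locale suggests for text; on Windows this is
# typically the active ANSI code page.
preferred = locale.getpreferredencoding(False)

print(python_default)
print(preferred)
```

If the round trip through the database changes the data, comparing these values with the encoding of the downloaded page is a reasonable first check.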

--
Wulfraed  Dennis Lee Bieber  KD6MOG
wl*****@ix.netcom.com | wu******@dm.net
Bestiaria Support Staff
Bestiaria Home Page: http://www.beastie.dm.net/
Home Page: http://www.dm.net/~wulfraed/

Jul 18 '05 #9
Achim Domma wrote:
> "Peter Otten" <__*******@web.de> wrote in message
> news:bk*************@news.t-online.com...
>> You have to know the encoding of the original file.
>
> Why? It's of type 'str' and I would expect that I could write it to the DB and
> get the same 'str' back. That's all I want. Why is it required to know the
> encoding?


Because the Access engine (actually known as Microsoft Jet: "Access" is
strictly only a *FRONT-END* product -- marketing terminology confusion)
stores all text strings as Unicode; and COM (thus ADO) also uses Unicode
exclusively for all text strings (as a rule). If you cannot move to
better engines and interfaces, you're stuck with the ones you have...
(99 times out of 100, moving to better engines and interfaces -- e.g.
SQLite and PySQLite, or Firebird, etc. -- is preferable from most points
of view -- but 1% of the time one must keep supporting legacy code...).
Alex
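
For comparison, here is a sketch of the SQLite route mentioned above, using the sqlite3 module from the Python 3 standard library rather than the PySQLite package of the time. The table and column names are hypothetical; the separate encoding column mirrors the design Achim describes earlier in the thread.

```python
import sqlite3

raw = bytes(range(256))   # sample payload covering every byte value

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the page bytes go into a BLOB, the declared
# encoding into its own text column.
conn.execute("CREATE TABLE pages (url TEXT, encoding TEXT, body BLOB)")
conn.execute(
    "INSERT INTO pages (url, encoding, body) VALUES (?, ?, ?)",
    ("http://example.invalid/", "iso-8859-15", raw),
)
(stored,) = conn.execute("SELECT body FROM pages").fetchone()
conn.close()

print(stored == raw)  # True
```

Because the column is binary, no text conversion happens in either direction and the bytes come back unchanged.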

Jul 18 '05 #10
Achim Domma wrote:
> "Alex Martelli" <al***@aleax.it> wrote in message
> news:0Z**********************@news1.tin.it...
>> You have to *EN*-code Unicode into a string, in the same way the string
>> had been *DE*-coded to Unicode originally, in order to be sure to get
>> the same string back; specifically, you have to use the same *codec*
>
> [...]
>
> Thanks Alex,
>
> I understand that, but looking at the adodbapi code I could not find any
> call to encode/decode. The conversion seems to happen somewhere in
> win32com. Don't know if you will ever get your data back, once it's
> converted to Variant. ;-)


So, take control of your destiny: since you know you're using tools
that can only deal with Unicode (and thus will inevitably convert --
in ways that perhaps you don't know -- if you pass them bytestrings),
preempt their "unknown and unwanted" conversion by doing a Unicode
conversion yourself in ways you DO know and control. UTF-16 sticks
2 bytes into each Unicode character -- you do need to be working with
strings of EVEN length, though. Or else you can use, e.g., ISO-8859-1,
and resign yourself to spending one Unicode character per byte in
your "real" byte-string.

Or else, of course, you can use a "BLOB" field instead of a text
one; I think the keyword for that in the Jet engine's DDL SQL is
BINARY. If you DO need to use Access to manipulate your db, though
(and I can see deucedly few other reasons to use a Jet engine...),
I think that might not work -- at least back when I was having to
work on MS platforms, I seem to recall that Access could not truly
support BLOB fields (except perhaps with embedded SQL, but that was
not considered acceptable in most Access-addicted shops, since the
real reason to use Access was NOT having to understand SQL...;-).
Alex
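
A minimal sketch of the two "do the Unicode conversion yourself" options above, in Python 3 syntax. The sample bytes are arbitrary. The last check goes slightly beyond what the post says: besides needing an even length, UTF-16 decoding also rejects byte pairs that form unpaired surrogates, so it cannot carry every byte string.

```python
raw = bytes(range(256))  # sample data: every possible byte value

# ISO-8859-1 maps each byte to the code point with the same number,
# so any byte string survives the trip through Unicode text.
as_text = raw.decode("iso-8859-1")
restored = as_text.encode("iso-8859-1")
print(len(as_text), restored == raw)   # 256 True

# UTF-16 packs two bytes into each character but needs an even length ...
try:
    b"abc".decode("utf-16-le")
    odd_ok = True
except UnicodeDecodeError:
    odd_ok = False

# ... and it rejects a high surrogate (0xD800) not followed by a low one.
try:
    b"\x00\xd8\x41\x00".decode("utf-16-le")
    surrogate_ok = True
except UnicodeDecodeError:
    surrogate_ok = False
print(odd_ok, surrogate_ok)            # False False
```

For arbitrary downloaded bytes, the ISO-8859-1 mapping is the safer of the two, at the cost of one stored character per byte.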

Jul 18 '05 #11
