
Unicode (UTF-8) in dbhash on 2.5

Can you put UTF-8 characters in a dbhash in Python 2.5?
It fails when I try:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import dbhash

db = dbhash.open('dbfile.db', 'w')
db[u'smiley'] = u'☺'
db.close()

Do I need to change the bsd db library, or is there no way to make it work
with Python 2.5?

What about Python 2.6?
Thanks.

--
Yves.
http://www.sollers.ca/blog/2008/no_sound_PulseAudio
http://www.sollers.ca/blog/2008/Puls...pas_de_son/.fr
Oct 20 '08 #1
Yves Dorfsman wrote:
Can you put UTF-8 characters in a dbhash in Python 2.5?
It fails when I try:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import dbhash

db = dbhash.open('dbfile.db', 'w')
db[u'smiley'] = u'☺'
db.close()

Do I need to change the bsd db library, or is there no way to make it work
with Python 2.5?
Please write the following program and meditate at least 30min in front of
it:

while True:
    print "utf-8 is not unicode"

Once this seemingly minor detail has sunken in, you are ready to work with
the below variant that will work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dbhash
db = dbhash.open('dbfile.db', 'w')
db[u'smiley'.encode('utf-8')] = u'☺'.encode('utf-8')
db.close()
What is the difference? The dbhash module can only work with *bytestrings*.
Bytestrings are just that - a sequence of 8-bit-values.

u""-literals are *unicode objects*. These are an abstract sequence of
characters, smileys or others.

Now the real world of databases, network connections and hard drives doesn't
know about unicode. They only know bytes. So before you can write to them,
you need to "encode" the unicode data to a byte-stream-representation.
There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
which has the property that it can render *all* unicode characters,
potentially needing more than one byte per character.

Which is why the code above has those encode-calls on the unicode-objects.

But beware! Once you encoded the data, there is no way to *know* it's
encoding. So when reading the data, you will get *bytestrings*. So you need
to "decode" them, with the proper encoding. In this case, again utf-8.

Which brings us to the second part of the program:

db = dbhash.open('dbfile.db')
smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')

print smiley.encode('utf-8')
The last encode is there to print out the smiley on a terminal - one of
those pesky bytestream-eaters that don't know about unicode.
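One way to avoid that final encode, if you trust the terminal to really be
UTF-8, is to wrap sys.stdout once via the codecs module. A minimal sketch -
not something Python does for you automatically:

import codecs
import sys

# Replace stdout with a writer that encodes unicode objects to UTF-8.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'\u263a'  # no explicit .encode() needed any more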

Diez
Oct 20 '08 #2
Diez B. Roggisch <de***@nospam.web.de> wrote:
Please write the following program and meditate at least 30min in front of
it:
while True:
    print "utf-8 is not unicode"
I hope you will have a better day today than yesterday !
Now, I did this:

while True:
    print "¡ Python knows about encoding, but only sometimes !"

My terminal is set up in UTF-8, and... It did print correctly. I expected
that by setting coding: utf-8, all the I/O functions would do the encoding
for me, because if they don't, then I, and everybody who writes a script, will
need to subclass every single I/O class (ok, except for print!).

Bytestrings are just that - a sequence of 8-bit-values.
It used to be that ints were 8 bits; we did not stay stuck in time, and ints are
now typically longer. I expect a high-level language to let me set the
encoding once and do simple I/O operations... without having to encode/decode.
Now the real world of databases, network connections and hard drives doesn't
know about unicode. They only know bytes. So before you can write to them,
you need to "encode" the unicode data to a byte-stream-representation.
There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
which has the property that it can render *all* unicode characters,
potentially needing more than one byte per character.
Sure, if I write assembly, I'll make sure I get my bits, bytes, and multi-byte
chars right, but that's why I use a high-level language. Here's an example
of an implementation that lets you write Unicode directly to a db hash; I
hoped there would be something similar in Python:
http://www.oracle.com/technology/doc...A/DBEntry.html
db = dbhash.open('dbfile.db')
smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')
print smiley.encode('utf-8')
The last encode is there to print out the smiley on a terminal - one of
those pesky bytestream-eaters that don't know about unicode.
What are you talking about?
I just copied this right from my terminal (LANG=en_CA.utf8):

>>> print unichr(0x020ac)
€
Now, I have read that Python 2.6 has better support for Unicode. Does it allow
writing to files, bsddb, etc... without having to encode/decode every time?
This is a big enough issue for me right now that I will manually install 2.6
if it does.

Thanks.

--
Yves.
http://www.sollers.ca/blog/2008/no_sound_PulseAudio
http://www.sollers.ca/blog/2008/Puls...pas_de_son/.fr

Oct 21 '08 #3
On 20 Okt, 16:04, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
>
What is the difference? The dbhash module can only work with *bytestrings*.
Bytestrings are just that - a sequence of 8-bit-values.
Sounds like a prime candidate for some improvement work. Patches,
anyone? ;-)
u""-literals are *unicode objects*. These are an abstract sequence of
characters, smileys or others.
It's important to point this out, though. However...
Now the real world of databases, network connections and hard drives doesn't
know about unicode. They only know bytes. So before you can write to them,
you need to "encode" the unicode data to a byte-stream-representation.
Although this is true, what the inquirer probably expected was the
interfaces to these things handling such details. In the case of
filesystems, this can be awkward on, say, Linux or UNIX for various
historical reasons. With regard to database systems, some messy
configuration may need to be done for each database, but it would be
nice to see the interface modules doing a bit more of the work.

[...]
print smiley.encode('utf-8')

The last encode is there to print out the smiley on a terminal - one of
those pesky bytestream-eaters that don't know about unicode.
With respect to output encodings, you don't need to perform an encode
operation if the locale is compatible, as discussed recently in
another thread. Encoding manually to UTF-8 may avoid errors, but it
doesn't guarantee that the output will make any sense.

Paul
Oct 21 '08 #4
On Tue, Oct 21, 2008 at 10:16 AM, Yves Dorfsman <yv**@zioup.com> wrote:
My terminal is set up in UTF-8, and... It did print correctly. I expected
that by setting coding: utf-8, all the I/O functions would do the encoding
for me, because if they don't, then I, and everybody who writes a script, will
need to subclass every single I/O class (ok, except for print!).
No, you don't. You just need to use the tools provided for you in the
standard library, like this:

import codecs
in_file = codecs.open('my_utf8_file.txt', 'r', 'utf8')

Now your file full of UTF-8-encoded bytes will be automatically
transformed into unicode strings as you read them in. You can do the
same thing on the output side (obviously, using mode 'w' instead of
'r').
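A minimal write-side sketch (the file name here is made up for
illustration):

import codecs

out_file = codecs.open('my_utf8_file.txt', 'w', 'utf8')
out_file.write(u'\u263a\n')   # unicode in, UTF-8 bytes on disk
out_file.close()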

If you need to wrap things other than files, the codecs module has the
tools to do that too.

--
Jerry
Oct 21 '08 #5
Yves Dorfsman wrote:
Diez B. Roggisch <de***@nospam.web.de> wrote:
>Please write the following program and meditate at least 30min in front
>of it:
>
>while True:
>    print "utf-8 is not unicode"

I hope you will have a better day today than yesterday !
I had a good day yesterday. And today. Thanks for asking.

Part of feeling good stemmed from the fact that I didn't "try to put
UTF-8 characters into a Berkeley DB" and claim it fails, where what I
*really* tried was putting unicode strings into it. Unicode and UTF-8 are
two different things, like it or not.
Now, I did this:

while True:
    print "¡ Python knows about encoding, but only sometimes !"

My terminal is set up in UTF-8, and... It did print correctly. I expected
that by setting coding: utf-8, all the I/O functions would do the encoding
for me, because if they don't, then I, and everybody who writes a script,
will need to subclass every single I/O class (ok, except for print!).
You seriously want all IO to be encoded depending on your terminal setting?
What about the database that works in latin1? The CSV file you write to
your vendor, expecting cp1251? And what happens if your process is not
*started* from a terminal? Or a different user starts the script, and all
of a sudden the exported data is messed up?
>
>Bytestrings are just that - a sequence of 8-bit-values.

It used to be that ints were 8 bits; we did not stay stuck in time, and ints
are now typically longer. I expect a high-level language to let me set the
encoding once and do simple I/O operations... without having to
encode/decode.
Sorry to say so, but you must face the sad truth: IO ops *need* explicit
encoding applied to them, otherwise errors will occur. Ask the Java guys
why they needed to add encoding parameters to all their toBytes/fromBytes
functions in the IO layer.

There is nothing that can be done about this. Which is not to say that
Python couldn't be enhanced at some places wrt unicode-handling, see below.
Sure, if I write assembly, I'll make sure I get my bits, bytes, and multi-byte
chars right, but that's why I use a high-level language. Here's an example
of an implementation that lets you write Unicode directly to a db hash; I
hoped there would be something similar in Python:
http://www.oracle.com/technology/doc...A/DBEntry.html

The inner workings of the DB are still only byte-aware. I agree that you
could enhance the Berkeley-DB interface in Python so that it takes a
default-encoding parameter, transcoding all values to and from it.

OTOH you can help yourself by writing a simple wrapper that does that for
you, untested:

class UnicodeWrapper(object):

    def __init__(self, bdb, encoding="utf-8"):
        self.bdb = bdb
        self.encoding = encoding

    def __setitem__(self, key, value):
        # Encode unicode keys and values to bytestrings before storing.
        if isinstance(key, unicode):
            key = key.encode(self.encoding)
        if isinstance(value, unicode):
            value = value.encode(self.encoding)
        self.bdb[key] = value

    def __getitem__(self, key):
        if isinstance(key, unicode):
            key = key.encode(self.encoding)
        return self.bdb[key]
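Usage would look something like this (equally untested; note that values
still come back as bytestrings, so reads need a decode):

import dbhash

db = UnicodeWrapper(dbhash.open('dbfile.db', 'w'))
db[u'smiley'] = u'\u263a'              # encoded to UTF-8 on the way in
print db[u'smiley'].decode('utf-8')    # values come back as bytestrings
db.bdb.close()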

>db = dbhash.open('dbfile.db')
>smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')
>print smiley.encode('utf-8')
>
>The last encode is there to print out the smiley on a terminal - one of
>those pesky bytestream-eaters that don't know about unicode.

What are you talking about?
I just copied this right from my terminal (LANG=en_CA.utf8):

>>> print unichr(0x020ac)
€
You are right, that works of course - when running inside a terminal. It
will fail though if the encoding can't be guessed, e.g. because the process
is not spawned from a terminal.

Nothing to do with the terminal though.

Diez
Oct 21 '08 #6
On Oct 21, 2008, at 2:39 PM, Martin v. Löwis wrote:
It's not possible to "fix" this - it isn't even broken. The *db modules,
by design, support storing of arbitrary bytes, not just character data.
Many database engines are encoding-aware, and distinguish between
'text' columns and 'blob' columns -- the latter are arbitrary bags of
bytes, but text columns store text, and a good database interface (with a
sensibly designed database) will be aware of this and handle encoding
and decoding of text responsibly.

I can tell you that in REALbasic, if your database is properly
configured to use UTF-8 encoding, the rest is all handled seamlessly
-- you just store and retrieve text, and don't have to worry about
encoding and decoding things all over the place.

So the OP's request is quite valid. Python's handling of encodings is
currently primitive compared to some other environments, and I see
that this extends to the database modules. Fine, fair enough, it is
what it is, but there is no harm in asking about (or even yearning
for) a more intelligent system that does more of the grunt work for us.

Best,
- Joe

Oct 21 '08 #7
>Many database engines are encoding-aware, and distinguish between
>'text' columns and 'blob' columns -- the latter are arbitrary bags
>of bytes, but text columns store text, and a good database interface
>(with a sensibly designed database) will be aware of this and handle
>encoding and decoding of text responsibly.

Ok, by this definition, the dbm interface of Unix is not a good
database. Tough luck.

>I can tell you that in REALbasic, if your database is properly
>configured to use UTF-8 encoding, the rest is all handled
>seamlessly -- you just store and retrieve text, and don't have to
>worry about encoding and decoding things all over the place.
In Python, the database system is independent of the programming
language. Python can deal with many different database systems.
>So the OP's request is quite valid.
Which of the questions specifically?

Q: Can you put UTF-8 characters in a dbhash in Python 2.5?
A: Sure, certainly.

Q: Do I need to change the bsd db library,
or is there no way to make it work with Python 2.5?
A: You don't need to change the bsd db library; it works out
of the box.

Q: What about Python 2.6?
A: It's the same.

He got essentially the answers to the questions he asked.
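That is, along the lines of Diez's corrected example, code like this already
works in 2.5 (and 2.6), as long as you hand the db UTF-8 *bytestrings*:

import dbhash

db = dbhash.open('dbfile.db', 'w')
db['smiley'] = u'\u263a'.encode('utf-8')   # plain UTF-8 bytes: fine as-is
db.close()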
>Python's handling of encodings is currently primitive compared to
>some other environments, and I see that this extends to the
>database modules.

That's *not* a question that he had asked. He asked about UTF-8, but
perhaps meant to ask about Unicode (in particular as his example didn't
demonstrate any problems with UTF-8 encoded strings).
>Fine, fair enough, it is what it is, but there is no harm in asking
>about (or even yearning for) a more intelligent system that does
>more of the grunt work for us.
It *is* important to understand the difference between a "UTF-8
string" and a "Unicode string". If the OP hadn't been confused
about the two, and fully understood the difference, he probably
wouldn't have needed to ask.

Regards,
Martin
Oct 22 '08 #8
On 21 Okt, 22:39, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>
It's not possible to "fix" this - it isn't even broken. The *db modules,
by design, support storing of arbitrary bytes, not just character data.
You can put images into them, or sound files, java byte code files, etc.
So if Python would assume they have to be UTF-8 encoded character
strings, it would severely limit the usability of these modules.
If the inquirer was aware of the Unicode/UTF-8 distinction, then he
apparently wanted a conversion from Unicode to UTF-8 for the purpose
of storing text in the database. I don't really see a problem with a
module like this handling Unicode values in a reasonable fashion
whilst letting the user supply plain/byte strings if they also want to
do so, except perhaps for the issue of whether retrieved values should
be Unicode or something else, how the user gets to override the
default behaviour, and how this fits in with the existing API. Various
DB-API modules support Unicode, so this isn't a completely new
phenomenon, and a connection parameter for alternative encodings would
be adequate if people wanted to use something other than UTF-8 to
represent textual values within the database.
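For instance, here is a hypothetical sketch of what such an interface could
look like - open_unicode and its encoding parameter are made up here, not
an existing API:

import dbhash

class _UnicodeDB(object):
    """Encode keys/values going in, decode values coming out."""

    def __init__(self, bdb, encoding):
        self.bdb = bdb
        self.encoding = encoding

    def __setitem__(self, key, value):
        self.bdb[key.encode(self.encoding)] = value.encode(self.encoding)

    def __getitem__(self, key):
        # Retrieved values come back as unicode objects.
        return self.bdb[key.encode(self.encoding)].decode(self.encoding)

    def close(self):
        self.bdb.close()

def open_unicode(filename, flag='r', encoding='utf-8'):
    return _UnicodeDB(dbhash.open(filename, flag), encoding)

db = open_unicode('dbfile.db', 'w')
db[u'smiley'] = u'\u263a'
print db[u'smiley']   # already a unicode object
db.close()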

Paul
Oct 22 '08 #9
