unicode, bytes redux

(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"

u = buf.decode('UTF-8')

# ... later ...

u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
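
(For comparison, the same count can already be obtained today by re-encoding; a minimal sketch, assuming Python 2 -- the helper name byte_len is made up for illustration:)

# count the bytes a unicode string would occupy in a given encoding
# by actually encoding it and measuring the result
def byte_len(u, encoding):
    return len(u.encode(encoding))

buf = "\xE2\x9C\x8C"            # U+270C as UTF-8
u = buf.decode('UTF-8')
print byte_len(u, 'UTF-8')      # -> 3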
Sep 25 '06 #1
willie <wi****@jamots.com> wrote:
Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.
So what sort of output do you expect from this:
>>> a = '\xc9'.decode('latin1')
>>> b = '\xc3\x89'.decode('utf8')
>>> print (a+b).bytes()
???

And if you say that's an unfair question because you expected all the byte
strings to be using the same encoding, then there's no point storing it on
every unicode object; you might as well store it once globally.
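
(A minimal demonstration of the ambiguity, assuming Python 2: both literals decode to the same code point U+00C9, so a remembered per-object source encoding could not survive concatenation:)

a = '\xc9'.decode('latin1')     # U+00C9, decoded from one latin-1 byte
b = '\xc3\x89'.decode('utf8')   # the same U+00C9, decoded from two UTF-8 bytes
print a == b                    # True -- the resulting unicode objects are identical
# so which byte count should (a + b) report: 2, 3, or 4?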
Sep 25 '06 #2
willie <wi****@jamots.com> writes:
# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"
u = buf.decode('UTF-8')
# ... later ...
u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8') -> 3
u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.
Sep 25 '06 #3
Paul Rubin wrote:
Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8') -> 3
u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.
It requires a fairly large change to code and API for a relatively
uncommon problem. How often do you need to know how many bytes an
encoded Unicode string takes up without needing the encoded string itself?
Sep 25 '06 #4
Leif K-Brooks <eu*****@ecritters.biz> writes:
It requires a fairly large change to code and API for a relatively
uncommon problem. How often do you need to know how many bytes an
encoded Unicode string takes up without needing the encoded string
itself?
Shrug. I don't see a real large change--the code would just check for
an optional arg and process accordingly. I don't know if the issue
comes up often enough to be worth making such accommodations for. I do
know that we had an extensive newsgroup thread about it, from which
this discussion came, but I haven't paid that much attention.
Sep 25 '06 #5
willie wrote:
(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
Where it's been is irrelevant. Where it's going to is what matters.
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"

u = buf.decode('UTF-8')

# ... later ...

u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
Suppose the unicode object was decoded using some encoding other than
the one that's going to be used to store the info in the database:

| >>> sg = '\xc9\xb5\xb9\xcf'
| >>> len(sg)
| 4
| >>> u = sg.decode('gb2312')

later:
u.bytes() -> 4

but

| >>> len(u.encode('utf8'))
| 6

and by the way, what about the memory overhead of storing the name of
the encoding (in the above case 7 (6 + overhead))?

What would u"abcdef".bytes() produce? An exception?

HTH,
John

Sep 25 '06 #6

Paul Rubin wrote:
Leif K-Brooks <eu*****@ecritters.biz> writes:
It requires a fairly large change to code and API for a relatively
uncommon problem. How often do you need to know how many bytes an
encoded Unicode string takes up without needing the encoded string
itself?

Shrug. I don't see a real large change--the code would just check for
an optional arg and process accordingly. I don't know if the issue
comes up often enough to be worth making such accommodations for. I do
know that we had an extensive newsgroup thread about it, from which
this discussion came, but I haven't paid that much attention.
Actually, what Willie was concerned about was some cockamamie DBMS
which required to be fed Unicode, which it encoded as UTF-8, but
silently truncated if it was more than the n in varchar(n) ... or
something like that.

So all he needs is a boolean result: u.willitfit(encoding, width)

This can of course be optimised with simple early-loop-exit tests:
if n_bytes_so_far + n_remaining_uchars > width: return False
elif n_bytes_so_far + n_remaining_uchars * M <= width: return True
# where M is the maximum #bytes per Unicode char for the encoding
# that's being used.
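
(A runnable sketch of such a check, assuming Python 2; the name willitfit and the per-encoding maximum table are illustrative only:)

# rough per-code-point byte maxima for a couple of encodings (illustrative)
MAX_BYTES = {'utf-8': 4, 'latin-1': 1}

def willitfit(u, encoding, width):
    # cheap early exits before falling back to a full encode
    n = len(u)
    if n > width:                          # at least one byte per character
        return False
    m = MAX_BYTES.get(encoding.lower())
    if m is not None and n * m <= width:   # even the worst case fits
        return True
    return len(u.encode(encoding)) <= width

With a varchar(50)-style limit and UTF-8, most short strings would be decided by the two cheap tests without doing any encoding work at all.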

Tell you what, why don't you and Willie get together and write a PEP?

Cheers,
John

Sep 25 '06 #7
"John Machin" <sj******@lexicon.net> writes:
Actually, what Willie was concerned about was some cockamamie DBMS
which required to be fed Unicode, which it encoded as UTF-8,
Yeah, I remember that.
Tell you what, why don't you and Willie get together and write a PEP?
If enough people care about the problem, I'd say just submit a code
patch. I haven't needed it myself, but I haven't (so far) had to deal
with unicode that often. It's a reasonably logical thing to want.
Imagine if the normal length(string) function required copying the
string around.
Sep 25 '06 #8

Paul Rubin wrote:
"John Machin" <sj******@lexicon.net> writes:
Actually, what Willie was concerned about was some cockamamie DBMS
which required to be fed Unicode, which it encoded as UTF-8,

Yeah, I remember that.
Tell you what, why don't you and Willie get together and write a PEP?

If enough people care about the problem, I'd say just submit a code
patch. I haven't needed it myself, but I haven't (so far) had to deal
with unicode that often. It's a reasonably logical thing to want.
Imagine if the normal length(string) function required copying the
string around.
Almost as bad: just imagine a language that had a normal strlen(string)
function that required mucking all the way through the string until you
hit some cockamamie in-band can't-happen-elsewhere sentinel.

Cheers,
John

Sep 25 '06 #9
On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
willie <wi****@jamots.com> writes:
># U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"
u = buf.decode('UTF-8')
# ... later ...
u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)

Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8') -> 3
u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.
Unless I'm misunderstanding something, your bytes code would have to
perform exactly the same algorithmic calculations as producing the
encoded string in the first place, except it doesn't need to store the
newly encoded string, merely the number of bytes of each character.

Here is a bit of pseudo-code that might do what you want:

def bytes(unistring, encoding):
    length = 0
    for c in unistring:
        length += len(c.encode(encoding))
    return length
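
(For instance, with the character from the original post this gives, assuming Python 2:)

print bytes(u"\u270c", 'UTF-8')    # -> 3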

At the cost of some speed, you can avoid storing the entire encoded string
in memory, which might be what you want if you are dealing with truly
enormous unicode strings.

Alternatively, instead of calling encode() on each character, you can
write a function (presumably in C for speed) that does the exact same
thing as encode, but without storing the encoded characters, merely adding
their lengths. Now you have code duplication, which is usually a bad idea.
If for no other reason, some poor schmuck has to maintain them both! (And
I bet it won't be Willie, for all his enthusiasm for the idea.)

This whole question seems to me like an awful example of premature
optimization. Your computer has probably got well in excess of 100MB, and
you're worried about duplicating a few hundred or thousand (or even
hundred thousand) bytes for a few milliseconds (just long enough to grab
the length)?
--
Steven D'Aprano

Sep 25 '06 #10
John Machin wrote:
Actually, what Willie was concerned about was some cockamamie DBMS
which required to be fed Unicode, which it encoded as UTF-8, but
silently truncated if it was more than the n in varchar(n) ... or
something like that.

So all he needs is a boolean result: u.willitfit(encoding, width)
at what point in the program would that method be used ?

how large are the strings, for typical cases ?

</F>

Sep 25 '06 #11

willie wrote:
(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"

u = buf.decode('UTF-8')

# ... later ...

u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
Yup, it's a dead horse. As suggested elsewhere in
the thread, the unicode object is not the proper
place for this functionality. Also, as suggested,
it's not even the desired functionality: what's really
wanted is the ability to tell how long the string
is going to be in various encodings.

That's easy enough to do today - just encode the
darn thing and use len(). I don't see any reason
to expand the language to support a data base
product that goes out of its way to make it difficult
for developers.
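
(In other words, a one-liner today; the fits_in helper below is only an illustrative sketch, assuming Python 2:)

u = u"\u270c"                            # the example character from this thread
print len(u.encode('utf-8'))             # -> 3: the byte length in the target encoding

def fits_in(s, width, encoding='utf-8'):
    # True if the encoded form fits a varchar(width)-style byte limit
    return len(s.encode(encoding)) <= width

print fits_in(u, 50)                     # -> True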

John Roth

Sep 25 '06 #12

Fredrik Lundh wrote:
John Machin wrote:
Actually, what Willie was concerned about was some cockamamie DBMS
which required to be fed Unicode, which it encoded as UTF-8, but
silently truncated if it was more than the n in varchar(n) ... or
something like that.

So all he needs is a boolean result: u.willitfit(encoding, width)

at what point in the program would that method be used ?
Never, I hope. Were you taking that as a serious suggestion? Fredrik,
perhaps your irony detector needs a little preventative maintenance :-)
>how large are the strings, for typical cases ?
He did mention it several posts ago -- I recall varchar(50) or
something like that. IOW as Duncan Booth said in effect at the start of
the 1st thread, the OP's got more problems than just doubling up on
u.encode('utf8') ...

Cheers,
John

Sep 25 '06 #13
John Machin wrote:
So all he needs is a boolean result: u.willitfit(encoding, width)

at what point in the program would that method be used ?

Never, I hope. Were you taking that as a serious suggestion? Fredrik,
perhaps your irony detector needs a little preventative maintenance :-)
well, Willie did ask for something like that, didn't he ? I'm just gathering
requirements...
>how large are the strings, for typical cases ?

He did mention it several posts ago -- I recall varchar(50) or
something like that.
ok. wake me up when he's dealing with columns defined as
varchar(50000000) or so...

</F>

Sep 25 '06 #14
Steven D'Aprano wrote:
On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
>willie <wi****@jamots.com> writes:
>># U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"
u = buf.decode('UTF-8')
# ... later ...
u.bytes() -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8') -> 3
u.bytes('UCS-4') -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.

Unless I'm misunderstanding something, your bytes code would have to
perform exactly the same algorithmic calculations as producing the
encoded string in the first place, except it doesn't need to store the
newly encoded string, merely the number of bytes of each character.

Here is a bit of pseudo-code that might do what you want:

def bytes(unistring, encoding):
    length = 0
    for c in unistring:
        length += len(c.encode(encoding))
    return length
That wouldn't work for stateful encodings:
>>> len(u"abc".encode("utf-16"))
8
>>> bytes(u"abc", "utf-16")
12

Use a stateful encoder instead:

import codecs

def bytes(unistring, encoding):
    length = 0
    enc = codecs.getincrementalencoder(encoding)()
    for c in unistring:
        length += len(enc.encode(c))
    return length
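
(A quick check of the stateful version, assuming Python 2.5+ where codecs.getincrementalencoder is available:)

print bytes(u"abc", "utf-16")   # -> 8, matching len(u"abc".encode("utf-16"))
print bytes(u"abc", "utf-8")    # -> 3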

Servus,
Walter
Sep 25 '06 #15
