why isn't Unicode the default encoding?

Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you need
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1? Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?
Mar 20 '06 #1
John Salerno wrote:
Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you need
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1?
Well, *I* use UTF-8, but that's neither here nor there.
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?


It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.

[~]$ python -U
Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo'
u'foo'


Python tries very hard to remain backwards compatible. Python 3.0 is the
designated "break compatibility so we can remove all of the cruft that's built
up" release. It is still several years away although Guido is starting to work
on it now.

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Mar 20 '06 #2
Robert Kern wrote:
Well, *I* use UTF-8, but that's neither here nor there.


I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?


It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.


I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)
Mar 20 '06 #3
John Salerno <jo******@nospamgmail.com> wrote:
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1?
The point is that, with a regular string, you don't know its encoding
or whether it has an encoding at all - it might as well be just a byte
buffer. The best thing would be to have a byte buffer type and a unicode
string type, but this can't happen as long as you don't want to break
existing code.
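
For example, the single byte 0xE9 is 'é' in Latin-1 but is not a valid
sequence on its own in UTF-8 - a rough Python 2 sketch (tracebacks
abbreviated):

>>> s = '\xe9'                 # one byte; which character is it?
>>> print s.decode('latin-1')  # read as Latin-1: a perfectly good é
é
>>> s.decode('utf-8')          # read as UTF-8: not a complete sequence
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...

The bytes themselves don't tell you which interpretation is the right one.
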
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?


It's proposed for python 3000 (http://www.python.org/doc/peps/pep-3000/)
and I think it will make it into the language.

Cheers,
--Jan Niklas
Mar 20 '06 #4
John Salerno wrote:
Robert Kern wrote:
Well, *I* use UTF-8, but that's neither here nor there.


I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?


I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't either.
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?


It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.


I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)


No, it isn't. You seem to be somewhat confused about Unicode. At least you are
misusing terminology quite a bit. You may want to read the following articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://effbot.org/zone/unicode-objects.htm

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Mar 20 '06 #5
Robert Kern <ro*********@gmail.com> wrote:
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?


I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't either.


I haven't got any numbers, but my guess would be that the many Chinese
users will add their share to the UTF-16 numbers. I don't know about
other Asian languages, though.

Cheers,
--Jan Niklas
Mar 20 '06 #6
Robert Kern wrote:
I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)


No, it isn't. You seem to be somewhat confused about Unicode. At least you are
misusing terminology quite a bit. You may want to read the following articles:


I meant to say 'superset'
Mar 20 '06 #7
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html


That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points in however many bytes they may need? If so, I guess I had been
misled by the '8' in the name, thinking that UTF-8 was another way of
storing characters in one byte (which would make it no different from
Latin-1, I suppose).
Mar 20 '06 #8
John Salerno wrote:
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html


That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points in however many bytes they may need? If so, I guess I had been
misled by the '8' in the name, thinking that UTF-8 was another way of
storing characters in one byte (which would make it no different from
Latin-1, I suppose).


That's all correct, except for the last parenthetical remark: using
a single-byte character set isn't the same as using Latin-1. There
are various single-byte character sets; they have names like Latin-2,
Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.
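
As a concrete check (a short Python 2 session; the byte values shown
are the standard encodings of these code points):

>>> u'\xe9'.encode('latin-1')   # é as Latin-1: one byte
'\xe9'
>>> u'\xe9'.encode('utf-8')     # é as UTF-8: two bytes
'\xc3\xa9'
>>> u'\u20ac'.encode('utf-8')   # the euro sign (U+20AC): three bytes
'\xe2\x82\xac'
>>> u'\u20ac'.encode('latin-1') # and not representable in Latin-1 at all
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' ...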

Regards,
Martin
Mar 20 '06 #9
I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)


The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.

Unicode is a superset of ASCII and Latin-1, but not of byte sequences.
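
A small illustration of the two concepts (a Python 2 sketch):

>>> s = 'caf\xc3\xa9'        # a byte string: five bytes, the UTF-8 form of café
>>> len(s)
5
>>> u = s.decode('utf-8')    # a character string: four characters
>>> len(u)
4
>>> u
u'caf\xe9'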

Regards,
Martin
Mar 20 '06 #10
Martin v. Löwis wrote:
John Salerno wrote:
Robert Kern wrote:
http://www.joelonsoftware.com/articles/Unicode.html


That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points in however many bytes they may need? If so, I guess I had been
misled by the '8' in the name, thinking that UTF-8 was another way of
storing characters in one byte (which would make it no different from
Latin-1, I suppose).


That's all correct, except for the last parenthetical remark: using
a single-byte character set isn't the same as using Latin-1. There
are various single-byte character sets; they have names like Latin-2,
Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.

Regards,
Martin


Oh, I just meant that Latin-1 was an example of a one-byte character
set, right? So UTF-8 would be identical to it if it worked how I used to
think it did.
Mar 20 '06 #11
Martin v. Löwis wrote:
The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.


Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on whether you were using
Unicode or not? As it is now, it seems to equate the bytes with the number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?
Mar 20 '06 #12
John Salerno wrote:
So as it turns out, Unicode and UTF-8 are not the same thing?
Well yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character
encoding lists, to mean UTF-16_LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However
UTF-16_LE is not any more definitively 'Unicode' than UTF-8 is.

Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:

- Unicode characters stored natively in 16-bit units (using two
UTF-16 code units, a surrogate pair, to represent characters
outside of the Basic Multilingual Plane)

- Either of the 8-bit encodings UTF-16_LE and UTF-16_BE, detected
automatically using a Byte Order Mark when loaded, or chosen
arbitrarily when saving

Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the 8-bit
encodings UTF-32_LE and UTF-32_BE. Phew!

To summarise: confusion.
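
Some of this is visible from inside the interpreter (a Python 2 sketch;
the maxunicode value and byte order depend on how and where Python was
built - this shows a narrow build on a little-endian machine):

>>> import sys
>>> sys.maxunicode   # 65535 on narrow builds, 1114111 on wide builds
65535
>>> u'\xe9'.encode('utf-16')     # the 'utf-16' codec writes a Byte Order Mark
'\xff\xfe\xe9\x00'
>>> u'\xe9'.encode('utf-16-le')  # explicit little-endian: no BOM
'\xe9\x00'
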
Am I right to say that UTF-8 stores the first 128 Unicode code points
in a single byte, and then stores higher code points in however many
bytes they may need?


That is correct.

To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However as Unicode text
manipulation becomes a more common event than byte string processing,
it makes sense to change the default kind of string you get when you
type a literal.

Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.

--
And Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/

Mar 20 '06 #13
John Salerno wrote:
Martin v. Löwis wrote:
The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.


Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on whether you were using
Unicode or not? As it is now, it seems to equate the bytes with the number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?


Exactly. read(2) might pull out one character, or only half a character.
It all depends on the encoding of the data you're reading.

If you're reading or writing text to a file (or anywhere, for that
matter) you need to know the unicode encoding of the file's content to
read it correctly.

Fortunately, the codecs module makes the whole process relatively painless:
import codecs

f = open("a_utf8_encoded_file.txt", "rb")  # binary mode: the codec wants raw bytes
stream = codecs.getreader('utf-8')(f)      # wrap the file in a decoding reader
c = stream.read(1)                         # one character, however many bytes that takes


The 'stream' works on unicode characters so 'c' is a unicode instance,
i.e. a whole textual character.

- Matt

--
__
/ \__ Matt Goodall, Pollenation Internet Ltd
\__/ \ w: http://www.pollenation.net
__/ \__/ e: ma**@pollenation.net
/ \__/ \ t: +44 (0)113 2252500
\__/ \__/
/ \ Any views expressed are my own and do not necessarily
\__/ reflect the views of my employer.
Mar 20 '06 #14
John Salerno wrote:
Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on if you were using
Unicode or not?
The read method currently returns a byte string, not a Unicode string.
It's not clear to me how the numeric argument should be interpreted when
it returns characters some day; it might be best to take the number as
counting characters, then. However, not supporting a numeric argument
at all might also be reasonable.
As it is now, it seems to equate the bytes with number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?


Unicode isn't a character encoding (*all* documents in the world are
"written in Unicode", including those encoded with ASCII or
Latin-1).

In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes. How many characters that constitutes
depends on the encoding - but read() doesn't return a character
string.

It might be that these two bytes are only part of a character,
e.g. if you need three bytes to encode a character, or it might
be that they are parts of two characters, e.g. when you get the
second byte of the first character and the first byte of the
second one. In some encodings (e.g. ISO-2022), these bytes
may indicate *no* character, e.g. when the bytes just indicate
an in-stream change of character set.
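
For example (a Python 2 sketch; 'euro.txt' is a hypothetical file holding
a single euro sign encoded as UTF-8, i.e. three bytes):

>>> open('euro.txt', 'wb').write(u'\u20ac'.encode('utf-8'))
>>> f = open('euro.txt', 'rb')
>>> f.read(2)   # two bytes come back, but they are only part of a character
'\xe2\x82'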

Regards,
Martin
Mar 21 '06 #15
In article <44***********************@news.freenet.de>, Martin v. Löwis wrote:
In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes.


It returns *up to* two bytes. Sorry to be picky but I think it's
relevant to the topic because it illustrates how it's difficult
to change the definition of file.read() to return characters
instead of bytes (if the file is ready to read, there will always
be one or more bytes available (or EOF), but there won't always
be one or more characters available).
Mar 21 '06 #16
