Bytes IT Community

More elegant UTF-8 encoder

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with an array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i > 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 10 '07 #1
35 Replies


On Jun 10, 2:42 pm, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with an array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i > 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.

And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.

Jun 10 '07 #2

In article <m9********************************@hive.bjoern.hoehrmann.de>,
Bjoern Hoehrmann <bj****@hoehrmann.de> wrote:
>I am looking for a more elegant solution,
>that is, roughly, faster, shorter, more readable
"Choose any two".

To be honest, I don't see the point. It looks fast enough: after all,
you must be reading the data from somewhere, which is likely to be
much slower. Unless you have profiling data showing that it's a
significant overhead, forget it. As for clearer, it depends where
you're starting from. If you want to match a typical textual
description of UTF-8, I think something like this is much clearer:

unsigned char b[4] = {0, 0, 0, 0};

if(c < 0x80)
    b[0] = c;
else if(c < 0x800)
{
    b[1] = 0xc0 + (c >> 6);
    b[0] = 0x80 + (c & 0x3f);
}
else if(c < 0x10000)
{
    b[2] = 0xe0 + (c >> 12);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}
else
{
    b[3] = 0xf0 + (c >> 18);
    b[2] = 0x80 + ((c >> 12) & 0x3f);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}

return b[0] + (b[1] << 8) + (b[2] << 16) + (b[3] << 24);

That's untested and derived from code intended to output bytes in
sequence. Of course you could replace the array assignments with
returns of expressions composing the parts.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #3

On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with an array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i > 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?

Jun 10 '07 #4

In article <11*********************@g4g2000hsf.googlegroups.com>,
christian.bau <ch***********@cbau.wanadoo.co.uk> wrote:
>And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.
I agree!

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #5

In article <11**********************@o11g2000prd.googlegroups.com>,
J. J. Farrell <jj*@bcs.org.uk> wrote:
>You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
4 bytes is sufficient to cover all values up to 0x10ffff. I don't
think there's any prospect of codes being allocated outside that range
in the foreseeable future.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #6

* christian.bau wrote in comp.lang.c:
>What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.
Well, I have what you can consider a regular expression engine based on
Janusz Brzozowski's notion of derivatives of regular expression, meaning
that, given a regular expression and a character, it computes a regular
expression matching the rest of the string. Currently it stores ranges
of characters using the Unicode scalar value and transcodes from UTF-8
to UTF-32.

For several reasons, I want to avoid transcoding to UTF-32, so I want to
change it so that, given a regex and an octet, it computes a new regex. I
am experimenting with possible solutions, one is to exploit

utf8toint(c1) < utf8toint(c2) <=> c1 < c2

which allows me to store the character ranges in their utf8toint encoded
form. The derivative of a range with respect to an octet can then easily
be computed by computing the intersection of the range and a new range
consisting of the minimal and maximal utf8toint value given the octet(s)
seen up to that point (they consist of the current byte followed by n-1
0x80 and 0xBF octets respectively, where n is the required length).

So a range [ U+0000 - U+00FF ] would be stored as [ 0x0000 - 0xc3bf ]
and if it sees e.g. a 0xc2 it would create a range [ 0xc280 - 0xc2bf ],
compute the intersection which is [ 0xc280 - 0xc2bf ] and drop the seen
byte, resulting in [ 0x80 - 0xbf ]; I can always tell, due to how UTF-8
byte patterns are organized, whether a given range is a partial range
and how many bytes are still needed to make a full character, though I
will be storing the remaining byte count for performance reasons.

Obviously I could do something similar by partially decoding the UTF-8
octets and storing Unicode scalar value ranges in the derivative instead
or mix these approaches in some way, but that seemed more difficult to
me. Similarly, rewriting the regular expression upfront so it matches
on bytes rather than characters would be more difficult. So, while it
might be unusual, I don't think this is particularly bizarre.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #7

* Richard Tobin wrote in comp.lang.c:
>To be honest, I don't see the point.
I am asking because I hope to learn something; I currently do not see a
way to improve the code in some way, if someone else manages to provide
an improved version, I could learn from that. I can't give any hard and
fast rules what would constitute an improvement, but if the alternative
has many more non-whitespace characters, compiles to slower code on my
system, or introduces undefined behavior or platform-specific code, it
is unlikely an improvement, while eliminating a variable without
negatively affecting performance might well be.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #8

On 2007-06-10 15:25:07 -0700, "J. J. Farrell" <jj*@bcs.org.uk> said:
On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
>Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with an array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i > 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,

I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
4 bytes is sufficient to contain any legal Unicode codepoint in UTF-8
representation.

--
Clark S. Cox III
cl*******@gmail.com

Jun 11 '07 #9

"J. J. Farrell" <jj*@bcs.org.uk> wrote in message
news:11**********************@o11g2000prd.googlegroups.com...
what if the UTF-8 encoding requires more bytes than any
available integer type?
That's only a risk in C89. C99 requires "long long", which is at least 64
bits, and the longest valid UTF-8 sequence is 7 octets (56 bits).

Using "int" is just plain broken, since that isn't guaranteed to hold any
more than two octets. "long" is less broken, since it's capable of holding
at least four octets and that's enough for all currently-assigned
codepoints.

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
--
Posted via a free Usenet account from http://www.teranews.com

Jun 11 '07 #10

On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <st*****@sprunk.org> said:
"J. J. Farrell" <jj*@bcs.org.uk> wrote in message
news:11**********************@o11g2000prd.googlegroups.com...
>what if the UTF-8 encoding requires more bytes than any
available integer type?

That's only a risk in C89.
It's not even a risk there, as long must be at least 32 bits.
C99 requires "long long", which is at least 64 bits, and the longest
valid UTF-8 sequence is 7 octets (56 bits).
No. There are no legal UTF-8 sequences that are longer than 4-octets.

Using "int" is just plain broken, since that isn't guaranteed to hold
any more than two octets.
Agreed
"long" is less broken, since it's capable of holding at least four
octets and that's enough for all currently-assigned codepoints.
UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).
--
Clark S. Cox III
cl*******@gmail.com

Jun 11 '07 #11

In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <cl*******@gmail.com> wrote:
>UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).
It's UTF-16 that's broken.

But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 11 '07 #12

On Jun 11, 4:05 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
* Richard Tobin wrote in comp.lang.c:
To be honest, I don't see the point.

I am asking because I hope to learn something; I currently do not see a
way to improve the code in some way, if someone else manages to provide
an improved version, I could learn from that. I can't give any hard and
fast rules what would constitute an improvement, but if the alternative
has many more non-whitespace characters, compiles to slower code on my
system, or introduces undefined behavior or platform-specific code, it
is unlikely an improvement, while eliminating a variable without nega-
tively affecting performance might well be.
--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
if (c < 0x80)
    return c;
else if (c < 0x800)
    return ((c << 2) & 0x1f00) | (c & 0x003f) | 0xc080;
else if (c < 0x10000)
    return ((c << 4) & 0x0f0000) | ((c << 2) & 0x3f00) | (c & 0x003f)
         | 0xe08080;
else
    return ((c << 6) & 0x07000000) | ((c << 4) & 0x3f0000)
         | ((c << 2) & 0x3f00) | (c & 0x003f) | 0xf0808080;

So I assume that you have lots of UTF-8 encoded text, and every time
you extract the next character, you don't extract the Unicode
codepoint, but this strange UTF-8 encoded version of the codepoint,
because it would be faster to calculate from UTF-8?
Jun 11 '07 #13

"Clark Cox" <cl*******@gmail.com> wrote in message
news:2007061109263816807-clarkcox3@gmailcom...
On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <st*****@sprunk.org> said:
> C99 requires "long long", which is at least 64 bits, and the
longest valid UTF-8 sequence is 7 octets (56 bits).

No. There are no legal UTF-8 sequences that are longer than 4-
octets.
....
> "long" is less broken, since it's capable of holding at least four
octets and that's enough for all currently-assigned codepoints.

UTF-8 is officially capped at 4 octets. There is no way to make it longer
without breaking Unicode (consider round-tripping with
UTF-16).
The Unicode folks and IETF agree with you, but the ISO standard doesn't
limit UTF-8 to four octets or codepoints to U+10FFFF.

While I'll grant it's unlikely, it's indeed _possible_ that the limit will
be lifted in the future. Since UTF-8 follows a consistent pattern up to
seven octets, there's no reason not to allow for encoding or decoding it as
long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
out with its surrogates, remember, and it didn't even take that long; I
don't plan on repeating their mistakes. Just like I never thought 640kB RAM
(or 4GB) was enough for everybody and allowed for more if/when it became
possible...

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
--
Posted via a free Usenet account from http://www.teranews.com

Jun 12 '07 #14

Stephen Sprunk wrote:
>
.... snip ...
>
While I'll grant it's unlikely, it's indeed _possible_ that the
limit will be lifted in the future. Since UTF-8 follows a
consistent pattern up to seven octets, there's no reason not to
allow for encoding or decoding it as long as it's well-formed.
The UCS-2 folks all got burned when UTF-16 came out with its
surrogates, remember, and it didn't even take that long; I don't
plan on repeating their mistakes. Just like I never thought
640kB RAM (or 4GB) was enough for everybody and allowed for more
if/when it became possible...
Hell, back in '78 I proposed a system with the outrageous memory
addressing capacity of 24 bits, or 16 Megs. Who could possibly
need (or afford) more? It also provided for 16-bit words.
Published in DDJ.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>
<http://kadaitcha.cx/vista/dogsbreakfast/index.html>
cbfalconer at maineline dot net

--
Posted via a free Usenet account from http://www.teranews.com

Jun 12 '07 #15

"Richard Tobin" <ri*****@cogsci.ed.ac.uk> wrote in message
news:f4**********@pc-news.cogsci.ed.ac.uk...
In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <cl*******@gmail.com> wrote:
>>UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).

It's UTF-16 that's broken.

But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.
Thankfully, the IETF has already made a first step in that direction:
http://www.ietf.org/rfc/rfc4042.txt

Yes, I know the publication date*, but it's still somewhat relevant...

S

* For those that aren't aware, the IETF publishes spoof standards most years
on April Fools' Day (1 Apr). One, RFC 1149, was actually implemented.

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
--
Posted via a free Usenet account from http://www.teranews.com

Jun 13 '07 #16

On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.
UCS-4 or UTF-32 is 31 bits whose valid range is a subset of
[0x0,0x10FFFF]. UTF-8 is a variable length encoding of code points
from 1 to 4 octets. So, in C the output data type you are looking for
is probably an unsigned long, not an unsigned int (though a struct
{ int len; unsigned char v[4]; } seems more appropriate if you don't
want to worry about speed).
unsigned int
utf8toint(unsigned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with an array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i > 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}
On a modern processor you are getting your ass kicked on the control
flow. Let's try this again:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
uint32_t ret, c;
static uint32_t encodingmode[4] = { 0x0, 0xc080, 0xe08080, 0xf0808080 };

/* Spread the bits to their target locations */
ret = (cp & UINT32_C(0x3f)) |
((cp << 2) & UINT32_C(0x3f00)) |
((cp << 4) & UINT32_C(0x3f0000)) |
((cp << 6) & UINT32_C(0x3f000000));

/* Count the length */
c = (-(cp & 0xffff0000)) >> UINT32_C(31);
c += (-(cp & 0xfffff800)) >> UINT32_C(31);
c += (-(cp & 0xffffff80)) >> UINT32_C(31);

/* Merge the spread bits with the mode bits */
return ret | encodingmode[c];
}

I haven't tested this, but it seems ok upon visual inspection.

--
Paul Hsieh
http://bstring.sf.net/
http://www.azillionmonkeys.com/qed/unicode.html

Jun 14 '07 #17

In article <f4**********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <cl*******@gmail.comwrote:
UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).
That is false. It is capped at six octets. There is round-tripping
with UTF-16, but that is a bit elaborate. In UTF-8 the surrogates
should *not* be encoded, but the actual code-point. (Encoding U+D800
to U+DFFF is not permitted in UTF-8.)
It's UTF-16 that's broken.
Indeed, and that becomes visible when we get beyond plane 16.
But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.
With the current rate of increase that would be in 210 years. But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000. But one of these days I should look
at the differences between 4.1 and 5.0.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 14 '07 #18

On Jun 11, 11:42 pm, "Stephen Sprunk" <step...@sprunk.org> wrote:
"Clark Cox" <clarkc...@gmail.com> wrote in message
news:2007061109263816807-clarkcox3@gmailcom...
On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <step...@sprunk.org> said:
C99 requires "long long", which is at least 64 bits, and the
longest valid UTF-8 sequence is 7 octets (56 bits).
No. There are no legal UTF-8 sequences that are longer than 4-
octets.
...
"long" is less broken, since it's capable of holding at least four
octets and that's enough for all currently-assigned codepoints.
UTF-8 is officially capped at 4 octets. There is no way to make it longer
without breaking Unicode (consider round-tripping with
UTF-16).

The Unicode folks and IETF agree with you, but the ISO standard doesn't
limit UTF-8 to four or codepoints to U+10FFFF.
The *OLDER* ISO 10646 standard allowed for larger encodings. However,
ISO 10646 has merged with Unicode (version 3.0 I think) and thus
obsoleted/abandoned its old expanded range.
While I'll grant it's unlikely, it's indeed _possible_ that the limit will
be lifted in the future.
We would probably have to encounter an extra-terrestrial life form
that used sequential symbolic communications like we do, and who
decided that an alphabet 30 times larger than the Chinese one was part
of their communications systems. Its not going to happen here on
earth.
Since UTF-8 follows a consistent pattern up to
seven octets, there's no reason not to allow for encoding or decoding it as
long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
out with its surrogates, remember, and it didn't even take that long; I
don't plan on repeating their mistakes.
You are in charge of the Unicode Standards? The original Unicode
people were idiots and could not properly count the number of Chinese
characters. Perhaps the "offset printing lobby" tricked them into
choosing too few bits to throw a monkey wrench into the system.
[...] Just like I never thought 640kB RAM
(or 4GB) was enough for everybody and allowed for more if/when it became
possible...
Just like huh? RAM requirements are clearly tied to Moore's Law. But
Alphabet sizes? I don't know how old writing is, but if its about
5000 years old, and we assume a constant growth rate, then Unicode
will still be good for about 1000 years in its current form.

However, now that so much of language and human activity is tied up in
the current incumbent communications systems, I would claim that in
fact growth of alphabets will be severely curtailed, except for
certain marginal applications (alphabets for learning disabled people,
Indigenous people's language when/if they decide to convert them to
written form etc.)

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jun 14 '07 #19

On 2007-06-13 18:08:38 -0700, "Dik T. Winter" <Di********@cwi.nl> said:
In article <f4**********@pc-news.cogsci.ed.ac.uk>
ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <cl*******@gmail.com> wrote:
>UTF-8 is officially capped at 4 octets. There is no way to make it
>longer without breaking Unicode (consider round-tripping with UTF-16).

That is false. It is capped at six octets. There is round-tripping
with UTF-16, but that is a bit elaborate.
It's not that elaborate; it's dead simple in fact.

UTF-16 -> Unicode scalar value -> UTF-8
UTF-8 -> Unicode scalar value -> UTF-16

This is not possible if UTF-8 is extended beyond 4 bytes.
In UTF-8 the surrogates should *not* be encoded,
I never claimed that they should.
but the actual code-point. (Encoding U+D800 to U+DFFF is not permitted
in UTF-8.)
It's UTF-16 that's broken.

Indeed, and that becomes visible when we get beyond plane 16.
UTF-16 is perfectly suited to represent all of the possible Unicode
values (as is 4-byte UTF-8)
>
But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

With the current rate of increase that would be in 210 years. But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000. But one of these days I should look
at the differences between 4.1 and 5.0.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amste

--
Clark S. Cox III
cl*******@gmail.com

Jun 14 '07 #20

On Jun 13, 6:08 pm, "Dik T. Winter" <Dik.Win...@cwi.nl> wrote:
In article <f4k7df$d5...@pc-news.cogsci.ed.ac.uk> rich...@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <clarkc...@gmail.com> wrote:
UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with
UTF-16).

That is false. It is capped at six octets.
No, that's the old mechanistic definition from the ISO 10646
standard. The 5 and 6 byte encodings have been retired and are not
considered correct anymore.
[...] There is round-tripping
with UTF-16, but that is a bit elaborate. In UTF-8 the surrogates
should *not* be encoded, but the actual code-point. (Encoding U+D800
to U+DFFF is not permitted in UTF-8.)
Ok, this is a bit of a contorted way of thinking about it. Unicode,
defines text streams as a sequence of code points which are each from
a certain subset of the range [0x0,0x10FFFF]. UTF-16 and UTF-8 can
each encode the entire legal range. The surrogates, the endian
reversed BOM character and various other characters are illegal mostly
because UTF-16 cannot encode them independently.
It's UTF-16 that's broken.
Well its broken in the sense that it is the real limiter for the total
range, and it consumes higher than average storage space for western
text. Otherwise, from the Unicode standard point of view, its fine.
Indeed, and that becomes visible when we get beyond plane 16.
Hm?
But we'll have 10-bit bytes before we need more than 0x10ffff code
points in Unicode.

With the current rate of increase that would be in 210 years.
Care to explain this prediction? (I assumed continuous exponential
growth over 5000 years, leading to a result of 1000 years of
additional use; what's your model?)
[...] But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000.
More likely, some undefined/reserved Unicode space would be used to
make another layer of surrogate pairs. This would allow existing
UTF-16 encoders to continue to be used.
[...] But one of these days I should look at the differences between 4.1
and 5.0.
I highly doubt there are any differences related to encodings.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jun 14 '07 #21

* we******@gmail.com wrote in comp.lang.c:
/* Spread the bits to their target locations */
ret = (cp & UINT32_C(0x3f)) |
((cp << 2) & UINT32_C(0x3f00)) |
((cp << 4) & UINT32_C(0x3f0000)) |
((cp << 6) & UINT32_C(0x3f000000));

/* Count the length */
c = (-(cp & 0xffff0000)) >> UINT32_C(31);
c += (-(cp & 0xfffff800)) >> UINT32_C(31);
c += (-(cp & 0xffffff80)) >> UINT32_C(31);

/* Merge the spread bits with the mode bits */
return ret | encodingmode[c];
}
Thanks, this is quite nice, although it does not work for code points in
the range U+0040 to U+007F; for those the 7th bit should end up in the
least significant byte, while the code above shifts it into the next. So
it seems a little bit of control flow is needed, at least I don't see a
way around that.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 14 '07 #22

* Bjoern Hoehrmann wrote in comp.lang.c:
unsigned int
utf8toint(unsigned int c) {
[...]
}
For the reverse, I came up with this:

uint32_t
utf8dec(uint32_t c) {
/* drop leading bits in 3 byte seqs */
if ((c & 0x00C00000) == 0xC00000)
c &= 0xFF1FFFFF;

return ((c >> 6) & 0x1C0000)
+ ((c >> 4) & 0x03F000)
+ ((c >> 2) & 0x000FC0)
+ ((c >> 0) & 0x00007f);
}
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 14 '07 #23

On Jun 14, 6:34 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
* websn...@gmail.com wrote in comp.lang.c:
/* Spread the bits to their target locations */
ret = (cp & UINT32_C(0x3f)) |
((cp << 2) & UINT32_C(0x3f00)) |
((cp << 4) & UINT32_C(0x3f0000)) |
((cp << 6) & UINT32_C(0x3f000000));
/* Count the length */
c = (-(cp & 0xffff0000)) >> UINT32_C(31);
c += (-(cp & 0xfffff800)) >> UINT32_C(31);
c += (-(cp & 0xffffff80)) >> UINT32_C(31);
/* Merge the spread bits with the mode bits */
return ret | encodingmode[c];
}

Thanks, this is quite nice, although it does not work for code points in
the range U+0040 to U+007F; for those the 7th bit should end up in the
least significant byte, while the code above shifts it into the next. So
it seems a little bit of control flow is needed, at least I don't see a
way around that.
Crap, yeah. That's what I get for trying to eyeball it. So lets give
this another go:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
uint32_t c;
static uint32_t emode[4] = { 0x0, 0xc080, 0xe08080, 0xf0808080 };
static uint32_t mmode[4] = { 0x7f, 0x3f, 0x3f, 0x3f };

/* Count the length */
c = (-(cp & 0xffff0000)) >> UINT32_C(31);
c += (-(cp & 0xfffff800)) >> UINT32_C(31);
c += (-(cp & 0xffffff80)) >> UINT32_C(31);

return (cp & mmode[c]) |
((cp << 2) & UINT32_C(0x3f00)) |
((cp << 4) & UINT32_C(0x3f0000)) |
((cp << 6) & UINT32_C(0x3f000000)) |
emode[c];
}

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jun 14 '07 #24

In article <11*********************@e9g2000prf.googlegroups.com> we******@gmail.com writes:
On Jun 13, 6:08 pm, "Dik T. Winter" <Dik.Win...@cwi.nl> wrote:
....
But we'll have 10-bit bytes before we need more than 0x10ffff code
points in Unicode.
With the current rate of increase that would be in 210 years.

Care to explain this prediction? (I assumed continuous exponential
growth over 5000 years, leading to a result of 1000 years of
additional use; what's your model?)
Well, starting with 29929 code points in 1.0 (in 1991) and raised to
97786 code points in 4.1 (2005), that means an increase by a factor
of 3.27 in 14 years. Increasing it to 10FFFF code points (1114111)
would mean an increase by a factor of 37.23. But we can do better.
Assuming linear growth, I come to:
(1114111 - 29929) / (97786 - 29929) * 14 = 224.
And as 224 - 14 = 210...

Care to explain how you came to 1000 years with exponential growth?
With the figures above, with exponential growth I would expect
about 154 years.

But of course all models are flawed because the change from 2.1 to
3.0 saw an increase by 10307 code points (many more new scripts),
and 3.0 to 3.1 saw an increase by 44978 code points (Chinese).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 15 '07 #25

* we******@gmail.com wrote in comp.lang.c:
>Thanks, this is quite nice, although it does not work for code points in
the range U+0040 to U+007F; for those the 7th bit should end up in the
least significant byte, while the code above shifts it into the next. So
it seems a little bit of control flow is needed, at least I don't see a
way around that.

Crap, yeah. That's what I get for trying to eyeball it. So let's give
this another go:
>static uint32_t mmode[4] = { 0x7f, 0x3f, 0x3f, 0x3f };
return (cp & mmode[c]) |
((cp << 2) & UINT32_C(0x3f00)) |
But the mask needs to be 0x00 if c is zero, otherwise you would put
the bit into two places. So this would need to be:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
uint32_t c;
static uint32_t emode[4] = { 0x00, 0xc080, 0xe08080, 0xf0808080 };
static uint32_t mmode[4] = { 0x7f, 0x003f, 0x00003f, 0x0000003f };
static uint32_t nmode[4] = { 0x00, 0x3f00, 0x003f00, 0x00003f00 };

/* Count the length */
c = (-(cp & 0xffff0000)) >> UINT32_C(31);
c += (-(cp & 0xfffff800)) >> UINT32_C(31);
c += (-(cp & 0xffffff80)) >> UINT32_C(31);

return ((cp << 0) & mmode[c]) |
((cp << 2) & nmode[c]) |
((cp << 4) & UINT32_C(0x003f0000)) |
((cp << 6) & UINT32_C(0x3f000000)) |
emode[c];
}
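For checking any of these table-driven variants, a plain branching encoder of the kind the OP started from makes a convenient reference. A sketch (utf8ref is my name for it, not from the thread), packing the UTF-8 bytes into a uint32_t with the lead byte most significant, matching the OP's example of U+00F6 becoming 0x0000C3B6:

```c
#include <stdint.h>

/* Reference encoder: pack the UTF-8 bytes of one scalar value
   (U+0000..U+10FFFF) into a uint32_t, lead byte most significant,
   using plain control flow */
static uint32_t utf8ref(uint32_t cp) {
    if (cp < 0x80)
        return cp;
    if (cp < 0x800)
        return ((uint32_t)(0xc0 | (cp >> 6)) << 8) |
               (0x80 | (cp & 0x3f));
    if (cp < 0x10000)
        return ((uint32_t)(0xe0 | (cp >> 12)) << 16) |
               ((uint32_t)(0x80 | ((cp >> 6) & 0x3f)) << 8) |
               (0x80 | (cp & 0x3f));
    return ((uint32_t)(0xf0 | (cp >> 18)) << 24) |
           ((uint32_t)(0x80 | ((cp >> 12) & 0x3f)) << 16) |
           ((uint32_t)(0x80 | ((cp >> 6) & 0x3f)) << 8) |
           (0x80 | (cp & 0x3f));
}
```

Looping cp over the whole range 0..0x10FFFF (skipping surrogates if you care about scalar values only) and comparing utf8ref(cp) against utf32ToUtf8(cp) exercises every length boundary at once, which is how the U+0040..U+007F bug above would have been caught.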
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 15 '07 #26

In article <JJ********@cwi.nl>, Dik T. Winter <Di********@cwi.nl> wrote:
>Well, starting with 29929 code points in 1.0 (in 1991) and raised to
97786 code points in 4.1 (2005), that means an increase by a factor
of 3.27 in 14 years. Increasing it to 10FFFF code points (1114111)
would mean an increase by a factor of 37.23.
Very interesting, but completely detached from reality. The increase
in Unicode code points is not some natural process that might be
expected to be linear or exponential. It results from an effort first
to include all the characters in living languages (which is, I
believe, more or less complete), and then to add characters from dead
languages once they are satisfactorily enumerated. Barring some
remarkable discoveries, 0x10ffff should be quite enough.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 15 '07 #27

In article <11**********************@n15g2000prd.googlegroups.com>,
<we******@gmail.com> wrote:
>On a modern processor you are getting your ass kicked on the control
flow.
That depends. You also need to take into account the distribution of
your data. If it consists only of English text, then 99.9% of the
characters will be ASCII, so an immediate test

if (c < 0x80) return c;

is a big win. If you include western European languages, it will
still get about 90% of characters. Obviously if you have a lot of
Chinese texts the situation will be different.

Also, if I understand correctly, modern processors can speculatively
execute multiple branches, so the control flow may not be such a
problem and the branches may even be computed in parallel. There's
no substitute for benchmarking.
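Whether the fast path pays off is easy to estimate for a given corpus: count the bytes below 0x80. A throwaway sketch (ascii_percent is a hypothetical helper, not from any post here):

```c
#include <stddef.h>

/* Percentage of bytes below 0x80 in a buffer -- an estimate of how
   often "if (c < 0x80) return c;" would take the early exit */
static unsigned ascii_percent(const unsigned char *s, size_t n) {
    size_t ascii = 0, i;
    for (i = 0; i < n; i++)
        if (s[i] < 0x80)
            ascii++;
    return n ? (unsigned)(ascii * 100 / n) : 100;
}
```

For the nine UTF-8 bytes of "déjà vu" (four of them lead or continuation bytes of the accented letters) this reports 55, which is roughly the worst case for French prose; running text has far fewer accents per hundred bytes than a phrase picked for them.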

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 15 '07 #28

On Jun 15, 3:26 am, rich...@cogsci.ed.ac.uk (Richard Tobin) wrote:
In article <1181783286.771652.130...@n15g2000prd.googlegroups.com>,
<websn...@gmail.com> wrote:
On a modern processor you are getting your ass kicked on the control
flow.

That depends. You also need to take into account the distribution of
your data. If it consists only of English text, then 99.9% of the
characters will be ASCII, so an immediate test

if (c < 0x80) return c;

is a big win. If you include western European languages, it will
still get about 90% of characters.
Tell that to the Greeks, French or Russians. The above is a good
idea, basically for English, and may be ok for Spanish and German.
[...] Obviously if you have a lot of
Chinese texts the situation will be different.
No, Chinese would at least have a consistent branching pattern and
therefore not pay those sorts of penalties.
Also, if I understand correctly, modern processors can speculatively
execute multiple branches,
I am not aware of any processor which does that. It's a bit harder to
do that than you might think. (If there are multiple outstanding
speculations, do you try to do two-sided branch flow for all of them?)
[...] so the control flow may not be such a
problem and the branches may even be computed in parallel. There's
no substitute for benchmarking.
Knowledge is sometimes useful. That's why you don't need to benchmark
a bubble sort against a heap sort.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jun 16 '07 #29

we******@gmail.com writes:
On Jun 15, 3:26 am, rich...@cogsci.ed.ac.uk (Richard Tobin) wrote:
>In article <1181783286.771652.130...@n15g2000prd.googlegroups.com>,
<websn...@gmail.com> wrote:
>On a modern processor you are getting your ass kicked on the control
flow.

That depends. You also need to take into account the distribution of
your data. If it consists only of English text, then 99.9% of the
characters will be ASCII, so an immediate test

if (c < 0x80) return c;

is a big win. If you include western European languages, it will
still get about 90% of characters.

Tell that to the Greeks, French or Russians. The above is a good
idea, basically for English, and may be ok for Spanish and German.
Greek and Russian are not western European languages. French uses
accented characters, but the majority of typical French text is plain
ASCII, n'est-ce pas?

[...]

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Jun 16 '07 #30

In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
>is a big win. If you include western European languages, it will
still get about 90% of characters.
>Tell that to the Greeks, French or Russians.
Russian is not western European. French has accented characters, but
less than 10% (yes, I checked some examples). Overall, I believe that
about 90% of characters in western European languages will be from
the Unicode range below 0x80.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 16 '07 #31

In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
is a big win. If you include western European languages, it will
still get about 90% of characters.
Tell that to the Greeks, French or Russians.

Russian is not western European.
How do you define western European?
Russian is not western European. French has accented characters, but
less than 10% (yes, I checked some examples). Overall, I believe that
about 90% of characters in western European languages will be from
the Unicode range below 0x80.
That may be the case. But there is only one of the western European
languages that will fit completely in that range, and I do not know
whether it is indeed 10% accented characters in all other languages
(and there are more than you think). I would think that figure is
exceeded in Frisian, one of the official languages of the Netherlands.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 20 '07 #32

"Dik T. Winter" <Di********@cwi.nl> wrote in message
news:JJ********@cwi.nl...
In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk
(Richard Tobin) writes:
In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
>is a big win. If you include western European languages, it will
>still get about 90% of characters.
>Tell that to the Greeks, French or Russians.
Russian is not western European.

How do you define western European?
Does it matter? Russia is eastern Europe and even goes into Asia; there's no
European country that is more eastern, so it can't be western, can it?

Eastern Europe used to be divided from western Europe by the Iron Curtain.
Even now this still is the line to draw with the minor exception of eastern
Germany, maybe 8-)

Also, Russia uses a completely different character set. The Greeks too. While
the French only have a couple of accented characters, the Dutch have one
extra character (ij), etc., in addition to the Latin alphabet.

Bye, Jojo
Jun 20 '07 #33

"Dik T. Winter" <Di********@cwi.nl> wrote:
In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
>
is a big win. If you include western European languages, it will
still get about 90% of characters.
>
Tell that to the Greeks, French or Russians.
>
Russian is not western European.

How do you define western European?
From the West of Europe, obviously. Russia is about as far East as you
can go while still being in Europe.
Russian is not western European. French has accented characters, but
less than 10% (yes, I checked some examples). Overall, I believe that
about 90% of characters in western European languages will be from
the Unicode range below 0x80.

That may be the case. But there is only one of the western European
languages that will fit completely in that range,
Two, if you count dead languages. Since English is completely identical
to Latin in all other regards (hence the ban on split infinitives), this
is but proper.
and I do not know whether it is indeed 10% accented characters in all
other languages (and there are more than you think). I would think
that figure is exceeded in Frisian, one of the official languages of
the Netherlands.
Is not! It's a speech defect. But no, you'd be surprised how few accents
there are in a typical Frisian text. Odd ones, such as a circonflexe on
the 'y', but not that many. Too bloody many unaccented 'y's and 'j's
inserted any which where, but not that many accents.

All this, and nobody has yet mentioned that categorical statements about
what kind of code is more time-efficient are usually the sign of a very
poor programmer? Measure, people, measure! And don't be surprised to
find a difference of only 1% either way.

Richard
Jun 20 '07 #34

In article <f5**********@online.de>, "Joachim Schmitz" <jo**@schmitz-digital.de> writes:
"Dik T. Winter" <Di********@cwi.nl> wrote in message
news:JJ********@cwi.nl...
In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk
(Richard Tobin) writes:
In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
>
is a big win. If you include western European languages, it will
still get about 90% of characters.
>
Tell that to the Greeks, French or Russians.
>
Russian is not western European.
How do you define western European?

Does it matter? Russia is eastern Europe and even goes into Asia; there's no
European country that is more eastern, so it can't be western, can it?
But there are many people who would not call German western European either.
Rather central European.
Eastern Europe used to be divided from western Europe by the Iron Curtain.
Even now this still is the line to draw with the minor exception of eastern
Germany, maybe 8-)
Ah. But because Greece and in fact also Yugoslavia were not behind the
Iron Curtain they belong to Western Europe? And Cyrillic is also used
in some of the former Yugoslavian Republics (it was actually invented
in Croatia, but not used there). And we have now also Bulgaria in the
EU, using a Cyrillic script. It will not be long before the banknotes
are adapted to include that script.
Also, Russia uses a completely different character set. The Greeks too. While
the French only have a couple of accented characters, the Dutch have one
extra character (ij), etc., in addition to the Latin alphabet.
Take care. Linguists do not agree that "ij" is an extra character in
Dutch, it is a bit controversial. And you are missing the accented letters
that are used in Dutch quite a lot (diaeresis amongst others, with a function
different from the Umlaut in German; our neighbouring country is called
België in Dutch).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 25 '07 #35

In article <46****************@news.xs4all.nl> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
"Dik T. Winter" <Di********@cwi.nl> wrote:
In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
....
Russian is not western European.
How do you define western European?

From the West of Europe, obviously. Russia is about as far East as you
can go while still being in Europe.
But that script is used further west than the westernmost part of (former)
Russia.
and I do not know whether it is indeed 10% accented characters in all
other languages (and there are more than you think). I would think
that figure is exceeded in Frisian, one of the official languages of
the Netherlands.

Is not! It's a speech defect.
Perhaps, but in that case it is an official speech defect. And there are
two more such in the Netherlands (called regional languages).
All this, and nobody has yet mentioned that categorical statements about
what kind of code is more time-efficient are usually the sign of a very
poor programmer? Measure, people, measure! And don't be surprised to
find a difference of only 1% either way.
Right. And: make it correct first, bother about optimisation only when
time is a problem.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 25 '07 #36
