More elegant UTF-8 encoder

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    if (c < 0x80) return c;

    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* this could be replaced with an array lookup */
    res = (2u << len) - 1 << (7 - len) << len * 8;

    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

    /* while unusual, the desired result is an int */
    return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 10 '07
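
As an illustration, not part of the original posts: the comment about an array lookup in the code above could be taken literally as in the sketch below. The name `utf8_pack_lut`, the table `lead`, and the choice of `uint32_t` as result type are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: packs the UTF-8 form of c into one 32-bit
   integer, e.g. U+00F6 -> 0x0000C3B6. */
uint32_t utf8_pack_lut(uint32_t c)
{
    /* Lead-octet markers for 0, 1, 2 and 3 continuation octets. */
    static const uint32_t lead[4] = { 0x00, 0xC0, 0xE0, 0xF0 };
    uint32_t res = 0;
    unsigned len, i;

    assert(c <= 0x10FFFF);   /* assumption: caller stays in U+0000..U+10FFFF */
    if (c < 0x80)
        return c;

    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* Emit continuation octets starting at the least significant end. */
    for (i = 0; i < len; ++i, c >>= 6)
        res |= ((c & 0x3F) | 0x80) << (i * 8);

    /* What is left of c is the payload of the lead octet. */
    return res | ((lead[len] | c) << (len * 8));
}
```

The table replaces the `(2 << len) - 1 << (7 - len)` expression; the rest follows the same idea as the loop in the original.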
On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <st*****@sprunk.org> said:
> "J. J. Farrell" <jj*@bcs.org.uk> wrote in message
> news:11**********************@o11g2000prd.googlegroups.com...
>> what if the UTF-8 encoding requires more bytes than any
>> available integer type?
>
> That's only a risk in C89.

It's not even a risk there, as long must be at least 32 bits.

> C99 requires "long long", which is at least 64 bits, and the longest
> valid UTF-8 sequence is 7 octets (56 bits).

No. There are no legal UTF-8 sequences that are longer than 4 octets.

> Using "int" is just plain broken, since that isn't guaranteed to hold
> any more than two octets.

Agreed.

> "long" is less broken, since it's capable of holding at least four
> octets and that's enough for all currently-assigned codepoints.

UTF-8 is officially capped at 4 octets. There is no way to make it
longer without breaking Unicode (consider round-tripping with UTF-16).
--
Clark S. Cox III
cl*******@gmail.com

Jun 11 '07 #11
In article <2007061109263816807-clarkcox3@gmailcom>,
Clark Cox <cl*******@gmail.com> wrote:
> UTF-8 is officially capped at 4 octets. There is no way to make it
> longer without breaking Unicode (consider round-tripping with UTF-16).

It's UTF-16 that's broken.

But we'll have 10-bit bytes before we need more than 0x10ffff code points
in Unicode.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 11 '07 #12
On Jun 11, 4:05 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> * Richard Tobin wrote in comp.lang.c:
>> To be honest, I don't see the point.
>
> I am asking because I hope to learn something; I currently do not see a
> way to improve the code in some way, if someone else manages to provide
> an improved version, I could learn from that. I can't give any hard and
> fast rules what would constitute an improvement, but if the alternative
> has many more non-whitespace characters, compiles to slower code on my
> system, or introduces undefined behavior or platform-specific code, it
> is unlikely an improvement, while eliminating a variable without
> negatively affecting performance might well be.
> --
> Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
> Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
> 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

if (c < 0x80)
    return c;
else if (c < 0x800)
    return ((c << 2) & 0x1f00) | (c & 0x003f) | 0xc080;
else if (c < 0x10000)
    return ((c << 4) & 0x0f0000) | ((c << 2) & 0x3f00) | (c & 0x003f) |
           0xe08080;
else
    return ((c << 6) & 0x07000000) | ((c << 4) & 0x3f0000) |
           ((c << 2) & 0x3f00) | (c & 0x003f) | 0xf0808080;

So I assume that you have lots of UTF-8 encoded text, and every time
you extract the next character, you don't extract the Unicode
codepoint, but this strange UTF-8 encoded version of the codepoint,
because it would be faster to calculate from UTF-8?
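
As an illustration, not part of the original post: the if/else chain above is a fragment, and it could be wrapped into a complete function roughly as follows. The name `utf8_pack_branchy` is made up, and `unsigned long` is chosen here only because the thread points out that it is guaranteed to hold at least 32 bits.

```c
#include <assert.h>

/* Hypothetical wrapper around the if/else chain from the post. */
unsigned long utf8_pack_branchy(unsigned long c)
{
    assert(c <= 0x10FFFFUL);   /* assumption: input is within U+0000..U+10FFFF */

    if (c < 0x80)
        return c;
    else if (c < 0x800)
        return ((c << 2) & 0x1f00UL) | (c & 0x3fUL) | 0xc080UL;
    else if (c < 0x10000)
        return ((c << 4) & 0x0f0000UL) | ((c << 2) & 0x3f00UL) |
               (c & 0x3fUL) | 0xe08080UL;
    else
        return ((c << 6) & 0x07000000UL) | ((c << 4) & 0x3f0000UL) |
               ((c << 2) & 0x3f00UL) | (c & 0x3fUL) | 0xf0808080UL;
}
```

Each branch shifts the payload groups apart by two bits per octet boundary, masks them into place, and then ORs in the marker bits for that length.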
Jun 11 '07 #13
"Clark Cox" <cl*******@gmail.com> wrote in message
news:2007061109263816807-clarkcox3@gmailcom...
> On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <st*****@sprunk.org> said:
>> C99 requires "long long", which is at least 64 bits, and the
>> longest valid UTF-8 sequence is 7 octets (56 bits).
>
> No. There are no legal UTF-8 sequences that are longer than 4
> octets.
....
>> "long" is less broken, since it's capable of holding at least four
>> octets and that's enough for all currently-assigned codepoints.
>
> UTF-8 is officially capped at 4 octets. There is no way to make it longer
> without breaking Unicode (consider round-tripping with
> UTF-16).

The Unicode folks and IETF agree with you, but the ISO standard doesn't
limit UTF-8 to four octets or codepoints to U+10FFFF.

While I'll grant it's unlikely, it's indeed _possible_ that the limit will
be lifted in the future. Since UTF-8 follows a consistent pattern up to
seven octets, there's no reason not to allow for encoding or decoding it as
long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
out with its surrogates, remember, and it didn't even take that long; I
don't plan on repeating their mistakes. Just like I never thought 640kB RAM
(or 4GB) was enough for everybody and allowed for more if/when it became
possible...

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
--
Posted via a free Usenet account from http://www.teranews.com

Jun 12 '07 #14
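
As an illustration, not part of the original posts: the "consistent pattern" mentioned above can be sketched as below. This assumes the older definition of UTF-8 (RFC 2279), which covered 31-bit values in up to six octets; the sketch stops there rather than at seven. The name `utf8_encode_legacy` and the buffer interface are hypothetical, and an encoder following the current Unicode rules would refuse anything above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the legacy (up to six octet) UTF-8 form
   of v into out[] and returns the octet count, or 0 if v needs more
   than 31 bits. */
size_t utf8_encode_legacy(uint32_t v, unsigned char out[6])
{
    size_t len, i;

    assert(out != NULL);

    if (v < 0x80) { out[0] = (unsigned char)v; return 1; }
    else if (v < 0x800)       len = 2;
    else if (v < 0x10000)     len = 3;
    else if (v < 0x200000)    len = 4;
    else if (v < 0x4000000)   len = 5;
    else if (v < 0x80000000u) len = 6;
    else return 0;

    /* Fill continuation octets from the end, six payload bits each. */
    for (i = len - 1; i > 0; --i, v >>= 6)
        out[i] = (unsigned char)(0x80 | (v & 0x3F));

    /* Lead octet: len one-bits, a zero bit, then the remaining payload. */
    out[0] = (unsigned char)(((0xFF00u >> len) | v) & 0xFF);
    return len;
}
```

Every extra octet adds five bits to the reachable range (six payload bits gained, one lost in the lead octet), which is where the thresholds 0x800, 0x10000, 0x200000 and 0x4000000 come from.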
Stephen Sprunk wrote:
>
.... snip ...
>
> While I'll grant it's unlikely, it's indeed _possible_ that the
> limit will be lifted in the future. Since UTF-8 follows a
> consistent pattern up to seven octets, there's no reason not to
> allow for encoding or decoding it as long as it's well-formed.
> The UCS-2 folks all got burned when UTF-16 came out with its
> surrogates, remember, and it didn't even take that long; I don't
> plan on repeating their mistakes. Just like I never thought
> 640kB RAM (or 4GB) was enough for everybody and allowed for more
> if/when it became possible...

Hell, back in '78 I proposed a system with the outrageous memory
addressing capacity of 24 bits, or 16 Megs. Who could possibly
need (or afford) more? It also provided for 16 bit words.
Published in DDJ.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>
<http://kadaitcha.cx/vista/dogsbreakfast/index.html>
cbfalconer at maineline dot net

Jun 12 '07 #15
"Richard Tobin" <ri*****@cogsci.ed.ac.uk> wrote in message
news:f4**********@pc-news.cogsci.ed.ac.uk...
> In article <2007061109263816807-clarkcox3@gmailcom>,
> Clark Cox <cl*******@gmail.com> wrote:
>> UTF-8 is officially capped at 4 octets. There is no way to make it
>> longer without breaking Unicode (consider round-tripping with UTF-16).
>
> It's UTF-16 that's broken.
>
> But we'll have 10-bit bytes before we need more than 0x10ffff code points
> in Unicode.

Thankfully, the IETF has already made a first step in that direction:
http://www.ietf.org/rfc/rfc4042.txt

Yes, I know the publication date*, but it's still somewhat relevant...

S

* For those that aren't aware, the IETF publishes spoof standards most years
on April Fools' Day (1 Apr). One, RFC 1149, was actually implemented.

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov

Jun 13 '07 #16
On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> For a free software project, I had to write a routine that, given a
> Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
> the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
> I came up with the following. I am looking for a more elegant solution,
> that is, roughly, faster, shorter, more readable, ... while producing
> the same output for the cited range.

UCS-4 or UTF-32 values are 31 bits wide, but the valid range is a subset
of [0x0,0x10FFFF]. UTF-8 is a variable-length encoding that uses 1 to 4
octets per code point. So, in C the output data type you are looking for
is probably an unsigned long, not an unsigned int (though a struct
{ int len; unsigned char v[4]; }; seems more appropriate if you don't
want to worry about speed).
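
As an illustration, not part of the original post: the struct alternative mentioned above could look roughly like this. The names `utf8_seq` and `utf8_encode_seq` are hypothetical, and the sketch only checks the upper bound; it does not reject surrogates.

```c
/* Hypothetical type following the struct suggested above. */
struct utf8_seq {
    int len;              /* octets used in v: 1..4, or 0 if out of range */
    unsigned char v[4];   /* octets in stream order */
};

struct utf8_seq utf8_encode_seq(unsigned long cp)
{
    struct utf8_seq s = { 0, { 0, 0, 0, 0 } };

    if (cp < 0x80) {
        s.len = 1;
        s.v[0] = (unsigned char)cp;
    } else if (cp < 0x800) {
        s.len = 2;
        s.v[0] = (unsigned char)(0xC0 | (cp >> 6));
        s.v[1] = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s.len = 3;
        s.v[0] = (unsigned char)(0xE0 | (cp >> 12));
        s.v[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        s.v[2] = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        s.len = 4;
        s.v[0] = (unsigned char)(0xF0 | (cp >> 18));
        s.v[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        s.v[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        s.v[3] = (unsigned char)(0x80 | (cp & 0x3F));
    }
    return s;   /* len stays 0 for values above U+10FFFF */
}
```

Compared with the packed integer, the octets come out in stream order and the length is explicit, so the result does not depend on the width of any integer type.
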
> unsigned int
> utf8toint(unsigned int c) {
>     unsigned int len, res, i;
>
>     if (c < 0x80) return c;
>
>     len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;
>
>     /* this could be replaced with an array lookup */
>     res = (2u << len) - 1 << (7 - len) << len * 8;
>
>     for (i = len; i > 0; --i, c >>= 6)
>         res |= ((c & 0x3f) | 0x80) << (len - i) * 8;
>
>     /* while unusual, the desired result is an int */
>     return res | c << len * 8;
> }

On a modern processor you are getting your ass kicked on the control
flow. Let's try this again:

#include "pstdint.h" /* http://www.pobox.com/~qed/pstdint.h */

uint32_t utf32ToUtf8 (uint32_t cp) {
    uint32_t ret, c;
    static const uint32_t encodingmode[4] = { 0x0, 0xc080, 0xe08080,
                                              0xf0808080 };

    /* Spread the bits to their target locations */
    ret = (cp & UINT32_C(0x3f)) |
          ((cp << 2) & UINT32_C(0x3f00)) |
          ((cp << 4) & UINT32_C(0x3f0000)) |
          ((cp << 6) & UINT32_C(0x3f000000));

    /* Count the length */
    c  = (-(cp & 0xffff0000)) >> UINT32_C(31);
    c += (-(cp & 0xfffff800)) >> UINT32_C(31);
    c += (-(cp & 0xffffff80)) >> UINT32_C(31);

    /* Merge the spread bits with the mode bits. ASCII (c == 0) has to
       pass through as is, because the spread above would move bit 6
       into the second octet. */
    return c ? (ret | encodingmode[c]) : cp;
}

I haven't tested this, but it seems ok upon visual inspection.

--
Paul Hsieh
http://bstring.sf.net/
http://www.azillionmonkeys.com/qed/unicode.html

Jun 14 '07 #17
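
As an illustration, not part of the original post: the length count in the code above relies on unsigned negation. For a 32-bit unsigned x, `0 - x` has its top bit set whenever 0 < x <= 0x80000000, so shifting right by 31 yields 1 if any masked bit was set and 0 otherwise. The sketch below isolates that step and pairs it with a plain comparison version; both function names are made up, and the trick is only claimed here for inputs up to U+10FFFF.

```c
#include <stdint.h>

/* Branch-free count of continuation octets (0..3), valid for
   cp <= 0x10FFFF. Hypothetical name. */
unsigned utf8_extra_branchless(uint32_t cp)
{
    unsigned n;

    n  = (unsigned)((UINT32_C(0) - (cp & UINT32_C(0xffff0000))) >> 31);
    n += (unsigned)((UINT32_C(0) - (cp & UINT32_C(0xfffff800))) >> 31);
    n += (unsigned)((UINT32_C(0) - (cp & UINT32_C(0xffffff80))) >> 31);
    return n;
}

/* Straightforward reference: one comparison per threshold. */
unsigned utf8_extra_compare(uint32_t cp)
{
    return (cp >= 0x80) + (cp >= 0x800) + (cp >= 0x10000);
}
```

Whether the branch-free form is actually faster depends on the compiler and target; the comparison version may well compile to similar code.
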
In article <f4**********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
> In article <2007061109263816807-clarkcox3@gmailcom>,
> Clark Cox <cl*******@gmail.com> wrote:
>> UTF-8 is officially capped at 4 octets. There is no way to make it
>> longer without breaking Unicode (consider round-tripping with UTF-16).

That is false. It is capped at six octets. There is round-tripping
with UTF-16, but that is a bit elaborate. In UTF-8 the surrogates
should *not* be encoded, but the actual code-point. (Encoding U+D800
to U+DFFF is not permitted in UTF-8.)

> It's UTF-16 that's broken.

Indeed, and that becomes visible when we get beyond plane 16.

> But we'll have 10-bit bytes before we need more than 0x10ffff code points
> in Unicode.

With the current rate of increase that would be in 210 years. But in
the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
and the largest *defined* code point was U+10FFFF. For five bytes of
UTF-8 we need at least U+200000. But one of these days I should look
at the differences between 4.1 and 5.0.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 14 '07 #18
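
As an illustration, not part of the original post: the two constraints mentioned above (no surrogates, and nothing beyond four octets until U+200000) can be sketched as a validity check plus a length function. The names `is_scalar_value` and `utf8_length` are hypothetical, and the sketch assumes the current four-octet definition of UTF-8.

```c
#include <stdint.h>

/* Nonzero if cp is a Unicode scalar value: within U+0000..U+10FFFF
   and not a surrogate (U+D800..U+DFFF). */
int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Octets needed in four-octet UTF-8, or 0 if cp is not encodable. */
int utf8_length(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;   /* U+10000..U+10FFFF; a fifth octet would start at 0x200000 */
}
```
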
On Jun 11, 11:42 pm, "Stephen Sprunk" <step...@sprunk.org> wrote:
> "Clark Cox" <clarkc...@gmail.com> wrote in message
> news:2007061109263816807-clarkcox3@gmailcom...
>> On 2007-06-11 09:01:13 -0700, "Stephen Sprunk" <step...@sprunk.org> said:
>>> C99 requires "long long", which is at least 64 bits, and the
>>> longest valid UTF-8 sequence is 7 octets (56 bits).
>> No. There are no legal UTF-8 sequences that are longer than 4
>> octets.
> ...
>>> "long" is less broken, since it's capable of holding at least four
>>> octets and that's enough for all currently-assigned codepoints.
>> UTF-8 is officially capped at 4 octets. There is no way to make it longer
>> without breaking Unicode (consider round-tripping with
>> UTF-16).
>
> The Unicode folks and IETF agree with you, but the ISO standard doesn't
> limit UTF-8 to four octets or codepoints to U+10FFFF.

The *OLDER* ISO 10646 standard allowed for larger encodings. However,
ISO 10646 has merged with Unicode (version 3.0 I think) and thus
obsoleted/abandoned its old expanded range.

> While I'll grant it's unlikely, it's indeed _possible_ that the limit will
> be lifted in the future.

We would probably have to encounter an extra-terrestrial life form
that used sequential symbolic communications like we do, and who
decided that an alphabet 30 times larger than the Chinese one was part
of their communications systems. It's not going to happen here on
earth.

> Since UTF-8 follows a consistent pattern up to
> seven octets, there's no reason not to allow for encoding or decoding it as
> long as it's well-formed. The UCS-2 folks all got burned when UTF-16 came
> out with its surrogates, remember, and it didn't even take that long; I
> don't plan on repeating their mistakes.

You are in charge of the Unicode Standards? The original Unicode
people were idiots and could not properly count the number of Chinese
characters. Perhaps the "offset printing lobby" tricked them into
choosing too few bits to throw a monkey wrench into the system.

> [...] Just like I never thought 640kB RAM
> (or 4GB) was enough for everybody and allowed for more if/when it became
> possible...

Just like huh? RAM requirements are clearly tied to Moore's Law. But
alphabet sizes? I don't know how old writing is, but if it's about
5000 years old, and we assume a constant growth rate, then Unicode
will still be good for about 1000 years in its current form.

However, now that so much of language and human activity is tied up in
the current incumbent communications systems, I would claim that in
fact growth of alphabets will be severely curtailed, except for
certain marginal applications (alphabets for learning disabled people,
indigenous people's languages when/if they decide to convert them to
written form, etc.)

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Jun 14 '07 #19
On 2007-06-13 18:08:38 -0700, "Dik T. Winter" <Di********@cwi.nl> said:
> In article <f4**********@pc-news.cogsci.ed.ac.uk>
> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>> In article <2007061109263816807-clarkcox3@gmailcom>,
>> Clark Cox <cl*******@gmail.com> wrote:
>>> UTF-8 is officially capped at 4 octets. There is no way to make it
>>> longer without breaking Unicode (consider round-tripping with UTF-16).
>
> That is false. It is capped at six octets. There is round-tripping
> with UTF-16, but that is a bit elaborate.

It's not that elaborate; it's dead simple in fact.

UTF-16 -> Unicode scalar value -> UTF-8
UTF-8 -> Unicode scalar value -> UTF-16

This is not possible if UTF-8 is extended beyond 4 bytes.

> In UTF-8 the surrogates should *not* be encoded,

I never claimed that they should.

> but the actual code-point. (Encoding U+D800 to U+DFFF is not permitted
> in UTF-8.)
>
>> It's UTF-16 that's broken.
>
> Indeed, and that becomes visible when we get beyond plane 16.

UTF-16 is perfectly suited to represent all of the possible Unicode
values (as is 4-byte UTF-8).

>> But we'll have 10-bit bytes before we need more than 0x10ffff code points
>> in Unicode.
>
> With the current rate of increase that would be in 210 years. But in
> the current (eh, 4.1) coding the largest serious code point was U+2FA1D,
> and the largest *defined* code point was U+10FFFF. For five bytes of
> UTF-8 we need at least U+200000. But one of these days I should look
> at the differences between 4.1 and 5.0.
> --
> dik t. winter, cwi, kruislaan 413, 1098 sj amste

--
Clark S. Cox III
cl*******@gmail.com

Jun 14 '07 #20
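
As an illustration, not part of the original post: the UTF-16 leg of the round trip described above can be sketched as below. A surrogate pair carries 20 bits on top of an offset of 0x10000, which is why U+10FFFF is the highest value UTF-16 can reach. The function names are hypothetical, and the sketch assumes its inputs are already valid (a supplementary scalar value, or a well-formed high/low pair).

```c
#include <stdint.h>

/* Splits a scalar value in U+10000..U+10FFFF into a surrogate pair. */
void utf16_from_scalar(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    uint32_t v = cp - 0x10000;                 /* 20 bits remain */

    *hi = (uint16_t)(0xD800 | (v >> 10));      /* top 10 bits */
    *lo = (uint16_t)(0xDC00 | (v & 0x3FF));    /* bottom 10 bits */
}

/* Recombines a high/low surrogate pair into the scalar value. */
uint32_t scalar_from_utf16(uint16_t hi, uint16_t lo)
{
    return UINT32_C(0x10000) +
           (((uint32_t)(hi & 0x3FF) << 10) | (uint32_t)(lo & 0x3FF));
}
```

The largest pair, 0xDBFF 0xDFFF, maps to 0x10000 + 0xFFFFF, so a UTF-8 encoding of anything larger would have no UTF-16 counterpart.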
