473,577 Members | 3,273 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

More elegant UTF-8 encoder

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same ouput for the cited range.

unsigned int
utf8toint(unsig ned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with a array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@h oehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 10 '07 #1
35 4309
On Jun 10, 2:42 pm, Bjoern Hoehrmann <bjo...@hoehrma nn.dewrote:
Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same ouput for the cited range.

unsigned int
utf8toint(unsig ned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with a array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bjo...@h oehrmann.de ·http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 ·http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 ·http://www.websitedev.de/
What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.

And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.

Jun 10 '07 #2
In article <m9************ *************** *****@hive.bjoe rn.hoehrmann.de >,
Bjoern Hoehrmann <bj****@hoehrma nn.dewrote:
>I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable
"Choose any two".

To be honest, I don't see the point. It looks fast enough: after all,
you must be reading the data from somewhere, which is likely to be
much slower. Unless you have profiling data showing that it's a
significant overhead, forget it. As for clearer, it depends where
you're starting from. If you want to match a typical textual
description of UTF-8, I think something like this is much clearer:

unsigned char b[4] = {0, 0, 0, 0};

if(c < 0x80)
b[0] = c;
else if(c < 0x800)
{
b[1] = 0xc0 + (c >6);
b[0] = 0x80 + (c & 0x3f);
}
else if(c < 0x10000)
{
b[2] = 0xe0 + (c >12);
b[1] = 0x80 + ((c >6) & 0x3f);
b[0] = 0x80 + (c & 0x3f);
}
else
{
b[3] = 0xf0 + (c >18);
b[2] = 0x80 + ((c >12) & 0x3f);
b[1] = 0x80 + ((c >6) & 0x3f);
b[0] = 0x80 + (c & 0x3f);
}

return b[0] + (b[1] << 8) + (b[2] << 16) + (b[3] << 24);

That's untested and derived from code intended to output bytes in
sequence. Of course you could replace the array assignments with
returns of expressions composing the parts.

-- Richard
--
"Considerat ion shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #3
On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrma nn.dewrote:
Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same ouput for the cited range.

unsigned int
utf8toint(unsig ned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with a array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,
I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?

Jun 10 '07 #4
In article <11************ *********@g4g20 00hsf.googlegro ups.com>,
christian.bau <ch***********@ cbau.wanadoo.co .ukwrote:
>And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.
I agree!

-- Richard
--
"Considerat ion shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #5
In article <11************ **********@o11g 2000prd.googleg roups.com>,
J. J. Farrell <jj*@bcs.org.uk wrote:
>You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
4 bytes is sufficient to cover all values up to 0x10ffff. I don't
think there's any prospect of codes being allocated outside that range
in the foreseeable future.

-- Richard

--
"Considerat ion shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #6
* christian.bau wrote in comp.lang.c:
>What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.
Well, I have what you can consider a regular expression engine based on
Janusz Brzozowski's notion of derivatives of regular expression, meaning
that, given a regular expression and a character, it computes a regular
expression matching the rest of the string. Currently it stores ranges
of characters using the Unicode scalar value and transcodes from UTF-8
to UTF-32.

For several reasons, I want to avoid transcoding to UTF-32, so I want to
change it so that, given a regex and a octet, it computes a new regex. I
am experimenting with possible solutions, one is to exploit

utf8toint(c1) < utf8toint(c2) <=c1 < c2

which allows me to store the character ranges in their utf8toint encoded
form. The derivative of a range with respect to an octet can then easily
be computed by computing the intersection of the range and a new range
consisting of the minimal and maximal utf8toint value given the octet(s)
seen up to that point (they consist of the current byte followed by n-1
0x80 and 0xBF octets respectively, where n is the required length).

So a range [ U+0000 - U+00FF ] would be stored as [ 0x0000 - 0xc3bf ]
and if it sees e.g. a 0xc2 it would create a range [ 0xc280 - 0xc2bf ],
compute the intersection which is [ 0xc280 - 0xc2bf ] and drop the seen
byte, resulting in [ 0x80 - 0xbf ]; I can always tell, due to how UTF-8
byte patterns are organized, whether a given range is a partial range
and how many bytes are still needed to make a full character, though I
will be storing the remaining byte count for performance reasons.

Obviously I could do something similar by partially decoding the UTF-8
octets and storing Unicode scalar value ranges in the derivative instead
or mix these approaches in some way, but that seemed more difficult to
me. Similarily, rewriting the regular expression upfront so it matches
on bytes rather than characters would be more difficult. So, while it
might be unusual, I don't think this is particularily bizarre.
--
Björn Höhrmann · mailto:bj****@h oehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #7
* Richard Tobin wrote in comp.lang.c:
>To be honest, I don't see the point.
I am asking because I hope to learn something; I currently do not see a
way to improve the code in some way, if someone else manages to provide
an improved version, I could learn from that. I can't give any hard and
fast rules what would constitute an improvement, but if the alternative
has many more non-whitespace characters, compiles to slower code on my
system, or introduces undefined behavior or platform-specific code, it
is unlikely an improvement, while eliminating a variable without nega-
tively affecting performance might well be.
--
Björn Höhrmann · mailto:bj****@h oehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #8
On 2007-06-10 15:25:07 -0700, "J. J. Farrell" <jj*@bcs.org.uk said:
On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrma nn.dewrote:
>Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same ouput for the cited range.

unsigned int
utf8toint(unsi gned int c) {
unsigned int len, res, i;

if (c < 0x80) return c;

len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

/* this could be replaced with a array lookup */
res = (2 << len) - 1 << (7 - len) << len * 8;

for (i = len; i 0; --i, c >>= 6)
res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

/* while unusual, the desired result is an int */
return res | c << len * 8;
}

Any ideas? Thanks,

I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
4-bytes is sufficient to contain any legal Unicode codepoint in UTF-8
representation.

--
Clark S. Cox III
cl*******@gmail .com

Jun 11 '07 #9
"J. J. Farrell" <jj*@bcs.org.uk wrote in message
news:11******** **************@ o11g2000prd.goo glegroups.com.. .
what if the UTF-8 encoding requires more bytes than any
available integer type?
That's only a risk in C89. C99 requires "long long", which is at least 64
bits, and the longest valid UTF-8 sequence is 7 octets (56 bits).

Using "int" is just plain broken, since that isn't guaranteed to hold any
more than two octets. "long" is less broken, since it's capable of holding
at least four octets and that's enough for all currently-assigned
codepoints.

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov
--
Posted via a free Usenet account from http://www.teranews.com

Jun 11 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
1347
by: Steve | last post by:
Hey, well I'm really pleased with myself for writing a little script today that changes variables according to a radio button. The thing is, I have this feeling part of my script could be a lot more elegant and compact than it is: if ($radiobutton=="radio1") {$item1 =$item1 * 54 ;}; //each of these blocks is a range each block assigns...
4
1414
by: Kamilche | last post by:
''' Is there a more elegant way of doing this? I would like to have the arguments to pack automatically taken from lst. ''' def pack(self): lst = return struct.pack('<IIIiiiBBBHHH', \ self.id, self.parent, self.number, \
6
5007
by: Kamilche | last post by:
Is there a more elegant way to change the working directory of Python to the directory of the currently executing script, and add a folder called 'Shared' to the Python search path? This is what I have. It seems like it could be shorter, somehow. # Switch Python to the current directory import os, sys pathname, scriptname =...
6
3085
by: Philipp Lenssen | last post by:
Is there any way I can keep those URLs... www.example.com/?q=hello&type= from appearing? Let's say I have a form with a select-options box. Along with a single text-input. Now if "q" is the name of the input box and "type" is the name of the select-box, I also have a default for type (which is empty and does nothing). Do I need to have...
17
7356
by: Fresh Air Rider | last post by:
Hello Could anyone please explain how I can pass more than one arguement/parameter value to a function using <asp:linkbutton> or is this a major shortfall of the language ? Consider the following code fragments in which I want to list through all files on a directory with a link to download each file by passing a filename and whether or...
13
1201
by: Edward W. | last post by:
hello, I have this function below which is simple and easy to understand private function ListHeight (byval UserScreenHeight as int) as int if UserScreenHeight < 1024 return 30 else return 50 end if end function
10
2338
by: p3t3r | last post by:
I have a treeview sourced from a SiteMap. I want to use 2 different CSS styles for the root level nodes. The topmost root node should not have a top border, all the other root nodes should have a top border. Is it possible to have more than 1 style at the same level (parent node) when using a SiteMap? I want it to appear something like...
5
6487
by: Licheng Fang | last post by:
I want to store Chinese in Unicode internally in my program, and give output in UTF-8 or GBK format. After two days of searching and reading, I still cannot find a simple and straightforward way to do the code conversions. In particular, I want portability of the code across platfroms (Windows and Linux), and I don't like having to refer the...
10
32772
by: sherifffruitfly | last post by:
Hi all, This is how I'm currently getting Friday of last week. It strikes me as cumbersome. Is there a slicker/more elegant way? Thanks for any ideas, cdj
3
1569
by: Anonymous | last post by:
I want to be able to restrict the set of classes for which a template class can be instantiated (i.e enforce that all instantiation MUST be for classes taht derive from a base type BaseType). This occured to me immediately, but use of a dummy variable is not elegant - are there other (more elegant) ways of doing this? template <class...
0
7845
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
8121
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
8286
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
0
8143
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
6517
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5664
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
5340
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...
0
3779
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
1107
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.