More elegant UTF-8 encoder

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    if (c < 0x80) return c;

    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* this could be replaced with an array lookup */
    res = (2 << len) - 1 << (7 - len) << len * 8;

    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

    /* while unusual, the desired result is an int */
    return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 10 '07 #1
On Jun 10, 2:42 pm, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> [original post quoted in full; snipped]

What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.

And I would absolutely refuse to review code containing an expression
like "res | c << len * 8" without parentheses.
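
For readers following along, here is how the posted expressions group under C's precedence rules (multiplication binds tighter than subtraction, which binds tighter than shifts, which bind tighter than bitwise OR). This is only a sketch with the logic unchanged; the name utf8toint_paren is made up for illustration, and unsigned long replaces unsigned int as an assumption so that the shifts by 24 are defined even where int has 16 bits.

```c
/* Same logic as the posted utf8toint, with the implicit grouping spelled out.
   utf8toint_paren is a hypothetical name; unsigned long is an assumption. */
unsigned long
utf8toint_paren(unsigned long c) {
    unsigned long len, res, i;

    if (c < 0x80) return c;

    /* number of continuation bytes that follow the lead byte */
    len = (c < 0x800) ? 1 : (c < 0x10000) ? 2 : 3;

    /* lead-byte marker 0xC0, 0xE0 or 0xF0, moved to the highest used byte */
    res = (((2UL << len) - 1) << (7 - len)) << (len * 8);

    /* continuation bytes, filled in from the least significant byte up */
    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << ((len - i) * 8);

    /* what is left of c belongs in the lead byte */
    return res | (c << (len * 8));
}
```

With this grouping, U+00F6 comes out as 0xC3B6, matching the example in the original post.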

Jun 10 '07 #2
In article <m9********************************@hive.bjoern.hoehrmann.de>,
Bjoern Hoehrmann <bj****@hoehrmann.de> wrote:
> I am looking for a more elegant solution,
> that is, roughly, faster, shorter, more readable
"Choose any two".

To be honest, I don't see the point. It looks fast enough: after all,
you must be reading the data from somewhere, which is likely to be
much slower. Unless you have profiling data showing that it's a
significant overhead, forget it. As for clearer, it depends where
you're starting from. If you want to match a typical textual
description of UTF-8, I think something like this is much clearer:

unsigned char b[4] = {0, 0, 0, 0};

if(c < 0x80)
    b[0] = c;
else if(c < 0x800)
{
    b[1] = 0xc0 + (c >> 6);
    b[0] = 0x80 + (c & 0x3f);
}
else if(c < 0x10000)
{
    b[2] = 0xe0 + (c >> 12);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}
else
{
    b[3] = 0xf0 + (c >> 18);
    b[2] = 0x80 + ((c >> 12) & 0x3f);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}

/* the cast keeps the shift of the lead byte into bits 24-31 unsigned */
return b[0] + (b[1] << 8) + (b[2] << 16) + ((unsigned int)b[3] << 24);

That's untested and derived from code intended to output bytes in
sequence. Of course you could replace the array assignments with
returns of expressions composing the parts.
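
As a rough illustration of that last remark, the array can be dropped and each case can return the composed value directly. This is a sketch, not tested against the original project; the function name utf8toint_cases and the use of unsigned long (so that every shift is done in a type with at least 32 bits) are assumptions made here.

```c
/* Case-by-case composition of the packed UTF-8 value.
   utf8toint_cases is a hypothetical name used only for this sketch. */
unsigned long
utf8toint_cases(unsigned long c) {
    if (c < 0x80)                       /* 1 byte: 0xxxxxxx */
        return c;
    if (c < 0x800)                      /* 2 bytes: 110xxxxx 10xxxxxx */
        return ((0xC0 + (c >> 6)) << 8)
             | (0x80 + (c & 0x3f));
    if (c < 0x10000)                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        return ((0xE0 + (c >> 12)) << 16)
             | ((0x80 + ((c >> 6) & 0x3f)) << 8)
             | (0x80 + (c & 0x3f));
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    return ((0xF0 + (c >> 18)) << 24)
         | ((0x80 + ((c >> 12) & 0x3f)) << 16)
         | ((0x80 + ((c >> 6) & 0x3f)) << 8)
         | (0x80 + (c & 0x3f));
}
```

Like the posted routine, this does no range checking; it assumes the argument is already a scalar value no larger than U+10FFFF.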

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #3
On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
> [original post quoted in full; snipped]

I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
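
To make the "array of bytes, then convert" route concrete, here is a minimal sketch. It is not the code from the Unicode web site; utf8_encode and utf8_pack are names invented for this illustration, and returning 0 for surrogates and for values above U+10FFFF is an assumption about the desired error handling.

```c
#include <stddef.h>

/* Writes the UTF-8 form of c into out (room for 4 bytes is assumed) and
   returns the number of bytes written, or 0 if c is not a scalar value. */
size_t
utf8_encode(unsigned long c, unsigned char *out) {
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                   /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                           /* beyond U+10FFFF */
}

/* Packs n bytes (n <= 4) into one integer, first byte most significant. */
unsigned long
utf8_pack(const unsigned char *bytes, size_t n) {
    unsigned long res = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        res = (res << 8) | bytes[i];
    return res;
}
```

Calling utf8_encode and then utf8_pack on the result gives the same packed integer the original poster asked for, at the cost of going through a temporary buffer.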

Jun 10 '07 #4
In article <11*********************@g4g2000hsf.googlegroups.com>,
christian.bau <ch***********@cbau.wanadoo.co.uk> wrote:
> And I would absolutely refuse reviewing code containing an expression
> like "res | c << len * 8" without parentheses.
I agree!

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #5
In article <11**********************@o11g2000prd.googlegroups.com>,
J. J. Farrell <jj*@bcs.org.uk> wrote:
> You seem to
> have a bizarre requirement though - what if the UTF-8 encoding
> requires more bytes than any available integer type?
4 bytes is sufficient to cover all values up to 0x10ffff. I don't
think there's any prospect of codes being allocated outside that range
in the foreseeable future.
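
The arithmetic behind that: a 1-byte sequence carries 7 payload bits, a 2-byte sequence 5 + 6 = 11, a 3-byte sequence 4 + 6 + 6 = 16, and a 4-byte sequence 3 + 6 + 6 + 6 = 21, and 0x10FFFF needs exactly 21 bits. A small sketch of the length rule follows; utf8_len is a name chosen here for illustration, and returning 0 for out-of-range input is an assumption.

```c
/* Number of UTF-8 octets needed for c, or 0 if c is above U+10FFFF. */
int
utf8_len(unsigned long c) {
    if (c < 0x80)      return 1;   /* 7 payload bits */
    if (c < 0x800)     return 2;   /* 5 + 6 = 11 payload bits */
    if (c < 0x10000)   return 3;   /* 4 + 6 + 6 = 16 payload bits */
    if (c <= 0x10FFFF) return 4;   /* 3 + 6 + 6 + 6 = 21 payload bits */
    return 0;
}
```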

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 10 '07 #6
* christian.bau wrote in comp.lang.c:
> What you are trying to do seems rather bizarre. If you want to encode
> Unicode in a 32 bit number, leave it unchanged. If you want to encode
> Unicode as a sequence of bytes, store it into a sequence of bytes.
Well, I have what you can consider a regular expression engine based on
Janusz Brzozowski's notion of derivatives of regular expressions, meaning
that, given a regular expression and a character, it computes a regular
expression matching the rest of the string. Currently it stores ranges
of characters using the Unicode scalar value and transcodes from UTF-8
to UTF-32.

For several reasons, I want to avoid transcoding to UTF-32, so I want to
change it so that, given a regex and an octet, it computes a new regex. I
am experimenting with possible solutions, one is to exploit

utf8toint(c1) < utf8toint(c2) <=> c1 < c2

which allows me to store the character ranges in their utf8toint encoded
form. The derivative of a range with respect to an octet can then easily
be computed by computing the intersection of the range and a new range
consisting of the minimal and maximal utf8toint value given the octet(s)
seen up to that point (they consist of the current byte followed by n-1
0x80 and 0xBF octets respectively, where n is the required length).

So a range [ U+0000 - U+00FF ] would be stored as [ 0x0000 - 0xc3bf ]
and if it sees e.g. a 0xc2 it would create a range [ 0xc280 - 0xc2bf ],
compute the intersection which is [ 0xc280 - 0xc2bf ] and drop the seen
byte, resulting in [ 0x80 - 0xbf ]; I can always tell, due to how UTF-8
byte patterns are organized, whether a given range is a partial range
and how many bytes are still needed to make a full character, though I
will be storing the remaining byte count for performance reasons.
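
A minimal sketch of that range step, to make the example concrete. This is not the engine's actual code: struct range, seq_len, lead_range, intersect and drop_lead are all hypothetical names, ranges are assumed to hold packed utf8toint values in unsigned long, and only the first octet of a character is handled.

```c
/* Inclusive range of packed UTF-8 values; empty when lo > hi. */
struct range { unsigned long lo, hi; };

/* Total sequence length implied by a lead byte, 0 if it cannot start one. */
int
seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* The lead byte followed by n-1 octets of 0x80 (minimum) or 0xBF (maximum). */
struct range
lead_range(unsigned char lead) {
    struct range r;
    int n = seq_len(lead);

    r.lo = r.hi = lead;
    while (n-- > 1) {
        r.lo = (r.lo << 8) | 0x80;
        r.hi = (r.hi << 8) | 0xBF;
    }
    return r;
}

/* Intersection of two ranges. */
struct range
intersect(struct range a, struct range b) {
    struct range r;

    r.lo = a.lo > b.lo ? a.lo : b.lo;
    r.hi = a.hi < b.hi ? a.hi : b.hi;
    return r;
}

/* Drops the seen lead byte from a range whose values are n octets long. */
struct range
drop_lead(struct range r, int n) {
    unsigned long mask = (1UL << ((n - 1) * 8)) - 1;

    r.lo &= mask;
    r.hi &= mask;
    return r;
}
```

For the stored range [ 0x0000 - 0xc3bf ] and the octet 0xc2, intersect with lead_range(0xc2) gives [ 0xc280 - 0xc2bf ], and drop_lead with n = 2 leaves [ 0x80 - 0xbf ], as in the example above.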

Obviously I could do something similar by partially decoding the UTF-8
octets and storing Unicode scalar value ranges in the derivative instead
or mix these approaches in some way, but that seemed more difficult to
me. Similarly, rewriting the regular expression upfront so it matches
on bytes rather than characters would be more difficult. So, while it
might be unusual, I don't think this is particularly bizarre.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #7
* Richard Tobin wrote in comp.lang.c:
>To be honest, I don't see the point.
I am asking because I hope to learn something. I currently do not see a
way to improve the code; if someone else manages to provide an improved
version, I could learn from that. I can't give any hard and fast rules
for what would constitute an improvement, but if the alternative has many
more non-whitespace characters, compiles to slower code on my system, or
introduces undefined behavior or platform-specific code, it is unlikely
to be an improvement, while eliminating a variable without negatively
affecting performance might well be.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 11 '07 #8
On 2007-06-10 15:25:07 -0700, "J. J. Farrell" <jj*@bcs.org.uk> said:
> On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:
>> [original post quoted in full; snipped]
>
> I'd have a look at (or just use) the free code to do such conversions
> which is available on the Unicode web site. That does the obvious
> thing of creating an array of bytes holding the UTF-8 encoding, but
> you could easily convert that result or modify the code. You seem to
> have a bizarre requirement though - what if the UTF-8 encoding
> requires more bytes than any available integer type?

Four bytes are sufficient to contain any legal Unicode code point in UTF-8
representation.

--
Clark S. Cox III
cl*******@gmail.com

Jun 11 '07 #9
"J. J. Farrell" <jj*@bcs.org.uk> wrote in message
news:11**********************@o11g2000prd.googlegroups.com...
> what if the UTF-8 encoding requires more bytes than any
> available integer type?
That's only a risk in C89. C99 requires "long long", which is at least 64
bits, and even the original UTF-8 definition (RFC 2279) never needed more
than 6 octets (48 bits); since RFC 3629 the longest valid sequence is
4 octets.

Using "int" is just plain broken, since that isn't guaranteed to hold any
more than two octets. "long" is less broken, since it's capable of holding
at least four octets and that's enough for every code point up to U+10FFFF.

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov

Jun 11 '07 #10
