More elegant UTF-8 encoder

Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    if (c < 0x80) return c;

    /* number of continuation bytes that follow the lead byte */
    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

    /* lead-byte prefix (0xC0, 0xE0 or 0xF0) shifted into place; 2u keeps
       the shift unsigned. this could be replaced with an array lookup */
    res = ((2u << len) - 1) << (7 - len) << (len * 8);

    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << ((len - i) * 8);

    /* while unusual, the desired result is an int */
    return res | c << (len * 8);
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jun 10 '07
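
One possible answer to the question above, offered as a minimal sketch rather than a definitive solution: there are only four encoded lengths, so each can be written out as one expression of masks and shifts, which removes the loop. The name `utf8_pack` is hypothetical. The code assumes `unsigned int` is at least 32 bits wide and that the input is at most U+10FFFF; it does not reject surrogates.

```c
#include <assert.h>

/* Hypothetical name. Packs the UTF-8 bytes of code point c into one
   integer, lead byte in the most significant used position.
   Assumes unsigned int has at least 32 bits. */
unsigned int
utf8_pack(unsigned int c) {
    assert(c <= 0x10FFFF);

    if (c < 0x80)                 /* 0xxxxxxx */
        return c;
    if (c < 0x800)                /* 110xxxxx 10xxxxxx */
        return 0xC080u
             | (c & 0x7C0) << 2
             | (c & 0x03F);
    if (c < 0x10000)              /* 1110xxxx 10xxxxxx 10xxxxxx */
        return 0xE08080u
             | (c & 0xF000) << 4
             | (c & 0x0FC0) << 2
             | (c & 0x003F);
    return 0xF0808080u            /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
         | (c & 0x1C0000) << 6
         | (c & 0x03F000) << 4
         | (c & 0x000FC0) << 2
         | (c & 0x00003F);
}
```

For example, `utf8_pack(0xF6)` yields `0xC3B6`, matching the example in the post. Whether this is actually faster than the loop would have to be measured on the target compiler and machine.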
In article <11**********************@e9g2000prf.googlegroups.com>,
<we******@gmail.com> wrote:
>> is a big win. If you include western European languages, it will
>> still get about 90% of characters.
> Tell that to the Greeks, French or Russians.

Russian is not western European. French has accented characters, but
less than 10% (yes, I checked some examples). Overall, I believe that
about 90% of characters in western European languages will be from
the Unicode range below 0x80.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Jun 16 '07 #31
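
The 90% estimate above is easy to check against a sample text. A minimal sketch, assuming the input is valid, NUL-terminated UTF-8: it counts code points (not bytes) and reports the fraction below U+0080. The name `ascii_fraction` is hypothetical, and spaces and punctuation are counted along with letters, which pushes the figure up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fraction of code points in s that lie below U+0080.
   Assumes s is valid, NUL-terminated UTF-8. Returns 1.0 for an empty string. */
double
ascii_fraction(const char *s) {
    size_t total = 0, ascii = 0;
    const unsigned char *p;

    assert(s != NULL);

    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) == 0x80)  /* continuation byte, not a new code point */
            continue;
        ++total;
        if (*p < 0x80)            /* single-byte (ASCII range) code point */
            ++ascii;
    }
    return total ? (double)ascii / (double)total : 1.0;
}
```

For the single word "België" (six code points, one of them outside the ASCII range) this gives 5/6; a meaningful figure needs a much larger sample per language.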
In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
> In article <11**********************@e9g2000prf.googlegroups.com>,
> <we******@gmail.com> wrote:
>>> is a big win. If you include western European languages, it will
>>> still get about 90% of characters.
>> Tell that to the Greeks, French or Russians.
>
> Russian is not western European.

How do you define western European?

> Russian is not western European. French has accented characters, but
> less than 10% (yes, I checked some examples). Overall, I believe that
> about 90% of characters in western European languages will be from
> the Unicode range below 0x80.

That may be the case. But there is only one of the western European
languages that will fit completely in that range, and I do not know
whether it is indeed 10% accented characters in all other languages
(and there are more than you think). I would think that figure is
exceeded in Frisian, one of the official languages of the Netherlands.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 20 '07 #32
"Dik T. Winter" <Di********@cwi.nl> wrote in message
news:JJ********@cwi.nl...
> In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk
> (Richard Tobin) writes:
>> In article <11**********************@e9g2000prf.googlegroups.com>,
>> <we******@gmail.com> wrote:
>>>> is a big win. If you include western European languages, it will
>>>> still get about 90% of characters.
>>> Tell that to the Greeks, French or Russians.
>> Russian is not western European.
>
> How do you define western European?

Does it matter? Russia is eastern Europe and even goes into Asia; there's no
European country that is more eastern, so it can't be western, can it?

Eastern Europe used to be divided from western Europe by the Iron Curtain.
Even now this is still the line to draw, with the minor exception of eastern
Germany, maybe 8-)

Also, Russia uses a completely different character set. The Greeks too. While
the French only have a couple of accented characters, the Dutch have one
extra character (ij), etc., in addition to the Latin alphabet.

Bye, Jojo
Jun 20 '07 #33
"Dik T. Winter" <Di********@cwi.nl> wrote:
> In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
>> In article <11**********************@e9g2000prf.googlegroups.com>,
>> <we******@gmail.com> wrote:
>>>> is a big win. If you include western European languages, it will
>>>> still get about 90% of characters.
>>> Tell that to the Greeks, French or Russians.
>> Russian is not western European.
>
> How do you define western European?

From the West of Europe, obviously. Russia is about as far East as you
can go while still being in Europe.

>> Russian is not western European. French has accented characters, but
>> less than 10% (yes, I checked some examples). Overall, I believe that
>> about 90% of characters in western European languages will be from
>> the Unicode range below 0x80.
>
> That may be the case. But there is only one of the western European
> languages that will fit completely in that range,

Two, if you count dead languages. Since English is completely identical
to Latin in all other regards (hence the ban on split infinitives), this
is but proper.

> and I do not know whether it is indeed 10% accented characters in all
> other languages (and there are more than you think). I would think
> that figure is exceeded in Frisian, one of the official languages of
> the Netherlands.

Is not! It's a speech defect. But no, you'd be surprised how few accents
there are in a typical Frisian text. Odd ones, such as a circumflex on
the 'y', but not that many. Too bloody many unaccented 'y's and 'j's
inserted any which where, but not that many accents.

All this, and nobody has yet mentioned that categorical statements about
what kind of code is more time-efficient are usually the sign of a very
poor programmer? Measure, people, measure! And don't be surprised to
find a difference of only 1% either way.

Richard
Jun 20 '07 #34
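
Taking "measure, people, measure" literally, a timing harness might look like the sketch below. It includes the routine from the original post (typos repaired) as the thing being timed; `time_encoder` is a hypothetical name. It assumes `unsigned int` is at least 32 bits wide and, for simplicity, runs over every value up to U+10FFFF including surrogates. A uniform sweep like this says little about real text, where most characters are short, so a realistic benchmark would feed sample documents instead.

```c
#include <assert.h>
#include <time.h>

/* The routine from the original post, with the extraction typos repaired. */
unsigned int
utf8toint(unsigned int c) {
    unsigned int len, res, i;

    if (c < 0x80) return c;
    len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;
    res = ((2u << len) - 1) << (7 - len) << (len * 8);
    for (i = len; i > 0; --i, c >>= 6)
        res |= ((c & 0x3f) | 0x80) << ((len - i) * 8);
    return res | c << (len * 8);
}

/* Hypothetical harness: calls f on 0..0x10FFFF `rounds` times and returns
   the elapsed CPU time in seconds. The XOR of all results is stored in
   *sink so the compiler cannot discard the calls. */
double
time_encoder(unsigned int (*f)(unsigned int), unsigned int rounds,
             unsigned int *sink) {
    unsigned int r, c, acc = 0;
    clock_t start, stop;

    assert(f != NULL && sink != NULL);

    start = clock();
    for (r = 0; r < rounds; ++r)
        for (c = 0; c <= 0x10FFFF; ++c)
            acc ^= f(c);
    stop = clock();

    *sink = acc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Passing two candidate encoders to `time_encoder` with the same `rounds` and comparing both the returned times and the `sink` values gives a speed comparison and a cheap check that the two produce the same output.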
In article <f5**********@online.de> "Joachim Schmitz" <jo**@schmitz-digital.de> writes:
> "Dik T. Winter" <Di********@cwi.nl> wrote in message
> news:JJ********@cwi.nl...
>> In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk
>> (Richard Tobin) writes:
>>> In article <11**********************@e9g2000prf.googlegroups.com>,
>>> <we******@gmail.com> wrote:
>>>>> is a big win. If you include western European languages, it will
>>>>> still get about 90% of characters.
>>>> Tell that to the Greeks, French or Russians.
>>> Russian is not western European.
>> How do you define western European?
>
> Does it matter? Russia is eastern Europe and even goes into Asia; there's no
> European country that is more eastern, so it can't be western, can it?

But there are many people who would not call German western European either.
Rather central European.

> Eastern Europe used to be divided from western Europe by the Iron Curtain.
> Even now this is still the line to draw, with the minor exception of eastern
> Germany, maybe 8-)

Ah. But because Greece and in fact also Yugoslavia were not behind the
Iron Curtain they belong to Western Europe? And Cyrillic is also used
in some of the former Yugoslavian Republics (it was actually invented
in Croatia, but not used there). And we have now also Bulgaria in the
EU, using a Cyrillic script. It will not be long before the banknotes
are adapted to include that script.

> Also, Russia uses a completely different character set. The Greeks too. While
> the French only have a couple of accented characters, the Dutch have one
> extra character (ij), etc., in addition to the Latin alphabet.

Take care. Linguists do not agree that "ij" is an extra character in
Dutch; it is a bit controversial. And you are missing the accented letters
that are used in Dutch quite a lot (diaeresis amongst others, with a function
different from the Umlaut in German; our neighbouring country is called
België in Dutch).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 25 '07 #35
In article <46****************@news.xs4all.nl> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
> "Dik T. Winter" <Di********@cwi.nl> wrote:
>> In article <f5***********@pc-news.cogsci.ed.ac.uk> ri*****@cogsci.ed.ac.uk (Richard Tobin) writes:
> ....
>>> Russian is not western European.
>> How do you define western European?
>
> From the West of Europe, obviously. Russia is about as far East as you
> can go while still being in Europe.

But that script is used more west than the most western part of (former)
Russia.

>> and I do not know whether it is indeed 10% accented characters in all
>> other languages (and there are more than you think). I would think
>> that figure is exceeded in Frisian, one of the official languages of
>> the Netherlands.
>
> Is not! It's a speech defect.

Perhaps, but in that case it is an official speech defect. And there are
two more such in the Netherlands (called regional languages).

> All this, and nobody has yet mentioned that categorical statements about
> what kind of code is more time-efficient are usually the sign of a very
> poor programmer? Measure, people, measure! And don't be surprised to
> find a difference of only 1% either way.

Right. And: make it correct first, bother about optimisation only when
time is a problem.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Jun 25 '07 #36
