How to handle Unicode?

Rui Maciel

I want to support Unicode on a pet project of mine (small markup language
parser). I've read a bit about Unicode (haven't delved beyond the basics)
and I searched for some info on how to support Unicode on C programs.
Unfortunately I wasn't able to find articles that could be considered more
than loose ends, small blog entries and side remarks, never delving too
much into specifics.

From what I gathered, the two main methods (based on standard, not 3rd party
libraries) which are used when working with Unicode are the extensive use
of wchar_t and interpreting UTF-8 from "regular" c strings.

As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
off the table. So that leaves UTF-8 and the "regular" c strings as the
only solution. Yet, as some characters extend beyond a single
byte, I believe a program needs some extra work to handle them.

So, what are your views on this subject? How do you handle Unicode on your C
code and what precautions do you take to be able to handle it?
Thanks in advance
Rui Maciel
--
Running Kubuntu 6.10 with KDE 3.5.6 and proud of it.
jabber:ru********@jabber.org

Feb 27 '07 #1

Richard Tobin

In article <45***********************@news.telepac.pt>,
Rui Maciel <ru********@gmail.com> wrote:

>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>off the table. So that leaves UTF-8 and the "regular" c strings as the
>only solution. Yet, as some characters extend beyond a single
>byte, I believe a program needs some extra work to handle them.
>
>So, what are your views on this subject? How do you handle Unicode on your C
>code and what precautions do you take to be able to handle it?

In my XML parser, I use 16-bit integers internally to represent data
in UTF-16, converting from external encodings "by hand". I chose this
because my applications needed to do things like regular expression
matching, and when I started, wide character support was even less
portable than it is now. (I rejected UTF-32 for its space
inefficiency.)

With hindsight, this was a mistake. It made it harder for others to
write applications based on the library, I had to provide numerous
support functions, and I still had to deal with multi-word characters
(using surrogates) in several places.
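
For readers unfamiliar with the "multi-word characters" mentioned above, here is a minimal sketch of how one code point above U+FFFF becomes a surrogate pair in UTF-16. The function name utf16_encode is hypothetical and is not taken from the parser described in this post; it only illustrates the standard arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written.
   Returns 0 for surrogate values and for anything above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1F600 comes out as the two units 0xD83D 0xDE00, which is why code that indexes a UTF-16 buffer by position cannot assume one unit per character.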

UTF-8 has the advantage that many single-byte string functions will
still work unchanged: strlen() (when interpreted as meaning length in
bytes), strcpy(), all the ones that don't interpret bytes except '\0'.
You can use ordinary string literals. You can even use strchr() etc
when searching for ASCII characters. Writing a regular expression
matcher would of course have been slightly more tedious, but not
enormously so.
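
A small sketch of the point above, assuming the input is already valid UTF-8 (no validation is attempted). The names utf8_count_codepoints and find_ascii are made up for illustration. strlen() still gives the length in bytes, so counting characters means skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   code point. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Searching for an ASCII character works with plain strchr(): every
   byte of a multi-byte UTF-8 sequence has its high bit set, so it can
   never be mistaken for an ASCII byte. */
const char *find_ascii(const char *s, char c)
{
    assert(s != NULL && (unsigned char)c < 0x80);
    return strchr(s, c);
}
```

With the string "h\xC3\xA9llo" (the word "héllo" in UTF-8), strlen() reports 6 bytes, utf8_count_codepoints() reports 5, and find_ascii(s, 'l') points at byte offset 3.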

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Feb 27 '07 #2

Stephen Sprunk

"Rui Maciel" <ru********@gmail.com> wrote in message
news:45***********************@news.telepac.pt...

>I want to support Unicode on a pet project of mine (small markup language
>parser). I've read a bit about Unicode (haven't delved beyond the basics)
>and I searched for some info on how to support Unicode on C programs.
>Unfortunately I wasn't able to find articles that could be considered more
>than loose ends, small blog entries and side remarks, never delving too
>much into specifics.

Sadly, that's my experience too. It's hard to find public examples of code
that properly handles i18n.

>From what I gathered, the two main methods (based on standard, not 3rd party
>libraries) which are used when working with Unicode are the extensive use
>of wchar_t and interpreting UTF-8 from "regular" c strings.

Not quite. Those things actually work together. Internally, your data
should be wchar_t. However, one needs to pick a representation for
communicating with the outside world (via networks, files, etc.), and UTF-8
is a good choice for that since it's reasonably compact and (this is the
important part) it's compatible with programs that are not i18n-aware.

You do _not_ want to write wchar_t's directly to a file in binary mode,
since different systems may have different ideas of what a wchar_t is. This
is not so different from problems where systems disagree on what a char is,
but that's less common these days (but still a problem if you want to write
truly portable code).
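
One way to avoid dumping raw wchar_t values is to serialize each character as UTF-8 bytes before writing. The sketch below is only an illustration: utf8_encode is a hypothetical name, and it takes a uint32_t code point because whether a wchar_t actually holds a Unicode code point is itself implementation-specific, so that conversion is left as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes to out and returns how many were written.
   Returns 0 for surrogate values and for anything above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {                            /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                           /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                         /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18)); /* 11110xxx + 3 continuation bytes */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The resulting bytes can be written with fwrite(out, 1, n, fp), and the file then reads the same on any system regardless of how wide its wchar_t is.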

>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>off the table.

That's not quite correct. Any implementation has to provide you a working
wchar_t and the various functions to manipulate them. Unfortunately, there
is no guarantee it will provide the locales needed to input/output those
wchar_t's in a particular representation (such as UTF-8). This is why
there's not much good portable code floating around, since you can't count
on the implementation being usable in practice; if you want your code to be
portable, you have to rely on a third-party library and hope that _it_ is
portable.
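
To make the locale dependency concrete, here is a rough sketch using only standard functions (setlocale and mbstowcs). The wrapper name to_wide is hypothetical. Which locale names exist, for instance something like "en_US.UTF-8", is entirely up to the implementation, so the caller has to be prepared for the request to fail.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a multibyte string to wide characters
   under the named locale. Returns the number of wide characters
   written, or (size_t)-1 if the locale is not available or the input
   is not valid in that locale's encoding.
   Note: this changes the process-wide LC_CTYPE locale and does not
   restore it. If the result equals cap, out is not NUL-terminated. */
size_t to_wide(const char *locale_name, const char *mb,
               wchar_t *out, size_t cap)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;      /* the implementation lacks this locale */
    return mbstowcs(out, mb, cap);
}
```

Only the "C" locale is guaranteed to exist, and it is not required to understand UTF-8, which is exactly the gap described above: the functions are always there, but the locale that makes them useful may not be.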

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov

--
Posted via a free Usenet account from http://www.teranews.com

Feb 27 '07 #3

user923005

I like ICU:
http://www-306.ibm.com/software/glob.../icu/index.jsp

Feb 27 '07 #4

SM Ryan

Rui Maciel <ru********@gmail.com> wrote:
# I want to support Unicode on a pet project of mine (small markup language
# parser). I've read a bit about Unicode (haven't delved beyond the basics)
# and I searched for some info on how to support Unicode on C programs.
# Unfortunately I wasn't able to find articles that could be considered more
# than loose ends, small blog entries and side remarks, never delving too
# much into specifics.

There are libraries ported to many systems that can do UTF,
Unicode, and other encodings. The Tcl library, for example, can
probably do just about anything you want, and it has been
ported to probably any system you want to run on.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
One of the drawbacks of being a martyr is that you have to die.

Feb 28 '07 #5
