How to handle Unicode?

I want to support Unicode on a pet project of mine (small markup language
parser). I've read a bit about Unicode (haven't delved beyond the basics)
and I searched for some info on how to support Unicode in C programs.
Unfortunately I wasn't able to find articles that could be considered more
than loose ends, small blog entries and side remarks, never delving too
much into specifics.

From what I gathered, the two main methods (based on standard, not 3rd party
libraries) used when working with Unicode are extensive use of wchar_t and
interpreting UTF-8 held in "regular" C strings.

As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
off the table. So that leaves UTF-8 and "regular" C strings as the
only solution. Yet, as some characters extend beyond a single
byte, I believe a program needs to do some extra work to handle those.

So, what are your views on this subject? How do you handle Unicode in your C
code and what precautions do you take to be able to handle it?
Thanks in advance
Rui Maciel
--
Running Kubuntu 6.10 with KDE 3.5.6 and proud of it.
jabber:ru********@jabber.org
Feb 27 '07 #1
In article <45***********************@news.telepac.pt>,
Rui Maciel <ru********@gmail.com> wrote:
>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>out of the table. So that leaves UTF-8 and the "regular" c strings as the
>only solution. Yet, as some characters extend themselves beyond a single
>byte, I believe a program must need some sort of work to handle those.
>
>So, what are your views on this subject? How do you handle Unicode on your C
>code and what precautions do you take to be able to handle it?

In my XML parser, I use 16-bit integers internally to represent data
in UTF-16, converting from external encodings "by hand". I chose this
because my applications needed to do things like regular expression
matching, and when I started, wide character support was even less
portable than it is now. (I rejected UTF-32 for its space
inefficiency.)

With hindsight, this was a mistake. It made it harder for others to
write applications based on the library, I had to provide numerous
support functions, and I still had to deal with multi-word characters
(using surrogates) in several places.
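
To illustrate the "multi-word characters" point: a minimal sketch of UTF-16 surrogate handling, not the code from the XML parser described above. The function names utf16_encode and utf16_count are made up for this example. It shows why 16-bit units alone do not give one unit per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: count code points in n UTF-16 code units.
 * A high surrogate immediately followed by a low surrogate counts once;
 * an unpaired surrogate is counted as one (invalid) unit. */
size_t utf16_count(const uint16_t *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

For instance, U+1D11E (outside the Basic Multilingual Plane) encodes as the two units 0xD834 0xDD1E, so any code that indexes or counts "characters" has to check for pairs like this.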

UTF-8 has the advantage that many single-byte string functions will
still work unchanged: strlen() (when interpreted as meaning length in
bytes), strcpy(), all the ones that don't interpret bytes except '\0'.
You can use ordinary string literals. You can even use strchr() etc
when searching for ASCII characters. Writing a regular expression
matcher would of course have been slightly more tedious, but not
enormously so.
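
A small sketch of what that looks like in practice, assuming well-formed UTF-8 input. The helper names utf8_count and split_at_equals are invented for this example. strlen() still gives the byte length, counting characters only needs a check for continuation bytes, and strchr() is safe for ASCII delimiters because every byte of a multi-byte sequence has its high bit set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
 * assumed to be well-formed UTF-8. Every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical helper: split "key=value" in place at the first '='.
 * Returns a pointer to the value, or NULL if there is no '='.
 * Plain strchr() is enough here: bytes inside multi-byte UTF-8 sequences
 * are all >= 0x80, so they can never be mistaken for the ASCII '='. */
char *split_at_equals(char *s)
{
    char *eq;
    assert(s != NULL);
    eq = strchr(s, '=');
    if (eq == NULL)
        return NULL;
    *eq = '\0';
    return eq + 1;
}
```

With the six-byte string "na\xC3\xAFve" (five characters, the third one taking two bytes), strlen() reports 6 while utf8_count() reports 5.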

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Feb 27 '07 #2
"Rui Maciel" <ru********@gmail.com> wrote in message
news:45***********************@news.telepac.pt...
>I want to support Unicode on a pet project of mine (small markup language
>parser). I've read a bit about Unicode (didn't delved beyond the basics)
>and I searched for some info on how to support Unicode on C programs.
>Unfortunately I wasn't able to find articles that could be considered more
>than loose ends, small blog entries and side remarks, never delving too
>much into specifics.

Sadly, that's my experience too. It's hard to find public examples of code
that properly handles i18n.

>From what I gathered, the two main methods (based on standard, not 3rd party
>libraries) which are used when working with Unicode are the extensive use
>of wchar_t and interpret UTF-8 from "regular" c strings.

Not quite. Those things actually work together. Internally, your data
should be wchar_t. However, one needs to pick a representation for
communicating with the outside world (via networks, files, etc.), and UTF-8
is a good choice for that since it's reasonably compact and (this is the
important part) it's compatible with programs that are not i18n-aware.

You do _not_ want to write wchar_t's directly to a file in binary mode,
since different systems may have different ideas of what a wchar_t is. This
is not so different from problems where systems disagree on what a char is,
but that's less common these days (but still a problem if you want to write
truly portable code).
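
As a rough sketch of the alternative to dumping raw wchar_t values: encode each code point to UTF-8 by hand before writing it out. The function name utf8_encode is made up here, and the sketch assumes the internal values are Unicode code points, which the C standard does not guarantee for wchar_t on every implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes to out and returns the number written,
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The resulting bytes can go to a file with fwrite() and will read back identically on any system, whatever its wchar_t width or byte order. U+20AC, for example, becomes the three bytes E2 82 AC.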

>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>out of the table.

That's not quite correct. Any implementation has to provide you a working
wchar_t and the various functions to manipulate them. Unfortunately, there
is no guarantee it will provide the locales needed to input/output those
wchar_t's in a particular representation (such as UTF-8). This is why
there's not much good portable code floating around, since you can't count
on the implementation being usable in practice; if you want your code to be
portable, you have to rely on a third-party library and hope that _it_ is
portable.
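
The locale dependence can be seen in a short sketch that uses only standard functions (setlocale and mbstowcs). The wrapper name utf8_to_wcs and the list of locale names are assumptions for illustration; locale names are not standardized, so the conversion may simply be unavailable on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a UTF-8 string to wide characters using the
 * C library's own multibyte conversion. Returns the number of wide
 * characters stored (not counting the terminator), or (size_t)-1 if no
 * UTF-8 locale could be selected or the input is not valid in it. */
size_t utf8_to_wcs(const char *src, wchar_t *dst, size_t dst_len)
{
    /* Assumption: these are common spellings, not a portable list. */
    static const char *const candidates[] = {
        "C.UTF-8", "en_US.UTF-8", "en_US.utf8"
    };
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            size_t n = mbstowcs(dst, src, dst_len);
            /* Assumption: the program otherwise runs in the "C" locale. */
            setlocale(LC_CTYPE, "C");
            return n;
        }
    }
    return (size_t)-1;   /* no usable UTF-8 locale on this system */
}
```

A caller has to be prepared for the (size_t)-1 result, which is exactly the situation described above: wchar_t and its functions are present, but the locale needed to feed them UTF-8 may not be.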

S

--
Stephen Sprunk
CCIE #3723
K5SSS

"Those people who think they know everything
are a great annoyance to those of us who do."
--Isaac Asimov

--
Posted via a free Usenet account from http://www.teranews.com

Feb 27 '07 #3
Rui Maciel <ru********@gmail.com> wrote:
# I want to support Unicode on a pet project of mine (small markup language
# parser). I've read a bit about Unicode (didn't delved beyond the basics)
# and I searched for some info on how to support Unicode on C programs.
# Unfortunately I wasn't able to find articles that could be considered more
# than loose ends, small blog entries and side remarks, never delving too
# much into specifics.

There are libraries ported to many systems that can do UTF,
Unicode, and other encodings. The Tcl library, for example, can
probably do just about anything you want, and it has been
ported to probably any system you want to run on.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
One of the drawbacks of being a martyr is that you have to die.
Feb 28 '07 #5
