How to handle Unicode?

I want to support Unicode in a pet project of mine (a small markup language
parser). I've read a bit about Unicode (I haven't delved beyond the basics)
and I searched for some info on how to support Unicode in C programs.
Unfortunately I wasn't able to find articles that could be considered more
than loose ends, small blog entries and side remarks, never delving too
much into specifics.

From what I gathered, the two main methods (based on the standard, not 3rd party
libraries) used when working with Unicode are the extensive use
of wchar_t and interpreting UTF-8 in "regular" C strings.

As far as I know (correct me if I'm wrong) the wchar_t approach is not
portable across platforms or even compilers. Therefore it is automatically
off the table. So that leaves UTF-8 and the "regular" C strings as the
only solution. Yet, as some characters extend beyond a single
byte, I believe a program needs to do some extra work to handle those.

So, what are your views on this subject? How do you handle Unicode in your C
code and what precautions do you take to be able to handle it?
Thanks in advance
Rui Maciel
--
Running Kubuntu 6.10 with KDE 3.5.6 and proud of it.
jabber:ru****** **@jabber.org
Feb 27 '07 #1
In article <45************ ***********@new s.telepac.pt>,
Rui Maciel <ru********@gma il.com> wrote:
>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>off the table. So that leaves UTF-8 and the "regular" C strings as the
>only solution. Yet, as some characters extend beyond a single
>byte, I believe a program needs to do some extra work to handle those.
>So, what are your views on this subject? How do you handle Unicode in your C
>code and what precautions do you take to be able to handle it?
In my XML parser, I use 16-bit integers internally to represent data
in UTF-16, converting from external encodings "by hand". I chose this
because my applications needed to do things like regular expression
matching, and when I started, wide character support was even less
portable than it is now. (I rejected UTF-32 for its space
inefficiency.)

With hindsight, this was a mistake. It made it harder for others to
write applications based on the library, I had to provide numerous
support functions, and I still had to deal with multi-word characters
(using surrogates) in several places.
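
To illustrate the "multi-word characters" point: in UTF-16, a code point above U+FFFF is stored as two 16-bit units (a high surrogate followed by a low surrogate), so code that walks the string cannot assume one unit per character. What follows is only a minimal sketch of that decoding step, not the parser's actual code; the function name utf16_decode and its interface are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from a UTF-16 buffer.
 * Returns the number of 16-bit units consumed (1 or 2),
 * or 0 if the buffer is empty or the sequence is malformed. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    uint16_t hi, lo;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP character */
        *cp = hi;
        return 1;
    }
    if (hi >= 0xDC00)                   /* low surrogate with no high one */
        return 0;
    if (len < 2)                        /* high surrogate at end of input */
        return 0;

    lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by low */
        return 0;

    /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000. */
    *cp = 0x10000 + (((uint32_t)hi - 0xD800) << 10) + ((uint32_t)lo - 0xDC00);
    return 2;
}
```

For example, the two units 0xD834 0xDD1E decode to U+1D11E and consume two units, while 0x0041 decodes to U+0041 and consumes one.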

UTF-8 has the advantage that many single-byte string functions will
still work unchanged: strlen() (when interpreted as meaning length in
bytes), strcpy(), all the ones that don't interpret bytes except '\0'.
You can use ordinary string literals. You can even use strchr() etc
when searching for ASCII characters. Writing a regular expression
matcher would of course have been slightly more tedious, but not
enormously so.
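
As a rough sketch of what this looks like in practice: the byte-oriented functions keep working, and only the places that care about whole characters need to look at the UTF-8 structure. The helpers below, utf8_count and utf8_decode, are hypothetical names for illustration, and both assume NUL-terminated input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count code points in a NUL-terminated string assumed to be valid UTF-8.
 * Every byte that is not a continuation byte (10xxxxxx) starts a character. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Decode one code point starting at s (NUL-terminated).
 * Returns the number of bytes consumed (1 to 4), or 0 if the sequence is
 * malformed, truncated, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (p[0] < 0x80) {                   /* plain ASCII, including NUL */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        len = 2; c = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {
        len = 3; c = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {
        len = 4; c = p[0] & 0x07;
    } else {
        return 0;                        /* stray continuation or bad lead byte */
    }

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)       /* also stops at a premature NUL */
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return len;
}
```

With s = "na\xc3\xafve" (the word "naive" with U+00EF as its third letter), strlen(s) is 6, utf8_count(s) is 5, and strchr(s, 'v') still finds the 'v' at byte offset 4, because bytes of a multi-byte sequence never fall in the ASCII range.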

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.
Feb 27 '07 #2
"Rui Maciel" <ru********@gma il.com> wrote in message
news:45******** *************** @news.telepac.p t...
>I want to support Unicode in a pet project of mine (a small markup language
>parser). I've read a bit about Unicode (I haven't delved beyond the basics)
>and I searched for some info on how to support Unicode in C programs.
>Unfortunately I wasn't able to find articles that could be considered more
>than loose ends, small blog entries and side remarks, never delving too
>much into specifics.
Sadly, that's my experience too. It's hard to find public examples of code
that properly handles i18n.
>From what I gathered, the two main methods (based on the standard, not 3rd party
>libraries) used when working with Unicode are the extensive use
>of wchar_t and interpreting UTF-8 in "regular" C strings.
Not quite. Those things actually work together. Internally, your data
should be wchar_t. However, one needs to pick a representation for
communicating with the outside world (via networks, files, etc.), and UTF-8
is a good choice for that since it's reasonably compact and (this is the
important part) it's compatible with programs that are not i18n-aware.

You do _not_ want to write wchar_t's directly to a file in binary mode,
since different systems may have different ideas of what a wchar_t is. This
is not so different from problems where systems disagree on what a char is,
but that's less common these days (but still a problem if you want to write
truly portable code).
>As far as I know (correct me if I'm wrong) the wchar_t approach is not
>portable across platforms or even compilers. Therefore it is automatically
>off the table.
That's not quite correct. Any implementation has to provide you a working
wchar_t and the various functions to manipulate them. Unfortunately, there
is no guarantee it will provide the locales needed to input/output those
wchar_t's in a particular representation (such as UTF-8). This is why
there's not much good portable code floating around, since you can't count
on the implementation being usable in practice; if you want your code to be
portable, you have to rely on a third-party library and hope that _it_ is
portable.
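
A minimal sketch of the "wchar_t inside, UTF-8 outside" arrangement using only standard C (setlocale and mbstowcs), and of why it is fragile: the conversion only understands UTF-8 if a UTF-8 locale can be selected. The locale names tried below are assumptions that happen to exist on many Unix-like systems and are not guaranteed anywhere; the helper names use_utf8_locale and to_wide are made up for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Try to select a UTF-8 character-type locale.
 * The names are guesses; an implementation need not provide any of them.
 * Returns 1 on success, 0 if none could be selected. */
int use_utf8_locale(void)
{
    static const char *const names[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8"};
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Convert a multibyte string (in the current locale's encoding) to wchar_t.
 * Returns the number of wide characters written, not counting the
 * terminator, or (size_t)-1 on an invalid sequence or if out is too small. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    assert(mb != NULL && out != NULL);
    n = mbstowcs(out, mb, cap);
    if (n == (size_t)-1)
        return n;                 /* invalid sequence for this locale */
    if (n == cap)
        return (size_t)-1;        /* no room left for the terminating L'\0' */
    return n;
}
```

If use_utf8_locale() succeeds, to_wide("na\xc3\xafve", buf, 16) yields 5 wide characters. If it fails, the program is left in whatever locale it had (the "C" locale by default), and the same call may reject the input, which is the portability problem described above.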

S

--
Stephen Sprunk "Those people who think they know everything
CCIE #3723 are a great annoyance to those of us who do."
K5SSS --Isaac Asimov

Feb 27 '07 #3
Rui Maciel <ru********@gma il.com> wrote:
# I want to support Unicode in a pet project of mine (a small markup language
# parser). I've read a bit about Unicode (I haven't delved beyond the basics)
# and I searched for some info on how to support Unicode in C programs.
# Unfortunately I wasn't able to find articles that could be considered more
# than loose ends, small blog entries and side remarks, never delving too
# much into specifics.

There are libraries ported to many systems that can do UTF,
Unicode, and other encodings. The Tcl library, for example, can
probably do just about anything you want, and it has been
ported to probably any system you want to run on.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
One of the drawbacks of being a martyr is that you have to die.
Feb 28 '07 #5
