UTF-8 in char* - C / C++

Jacky Cheung

Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

Jacky

Nov 14 '05 #1


Joona I Palaste

Jacky Cheung <ja*****@yahoo.com> scribbled the following:

Hi, I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

AFAIK UTF-8 does not have NUL characters (most people prefer to spell it
that way to avoid confusion). UTF-8 only includes "normal" ASCII
characters and special characters with bit 7 set. You're in no more
danger of seeing NUL in UTF-8 than you are of seeing it in ASCII.
Note that vCards, by themselves, are completely off-topic here.

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) -------------- Finland --------\
\-- http://www.helsinki.fi/~palaste --------------------- rules! --------/
"He said: 'I'm not Elvis'. Who else but Elvis could have said that?"
- ALF

Nov 14 '05 #2

Hallvard B Furuseth

Jacky Cheung wrote:

I am developing a vCard application which have to support UTF-8. Does
the UTF-8 in char* will crash the strlen, I mean does UTF-8 have some
char which treat as NULL character in strlen?

Well, it has a null control character, but it means more or less the
same as the ASCII null character. So if you just want to handle normal
text, you can use normal C strings, and thus strlen().

BTW, if you have only written programs for ASCII before, you might note
that functions like getchar() return character values in the range of
'unsigned char' or EOF, while 'char' can be negative. So code like

    char buf[] = "<UTF-8 string>";
    int ch, i;
    ...
    while ((ch = getchar()) != EOF) {
        if (ch == buf[i]) ...

is wrong. (That is true even if you don't use UTF-8, though you may not
have noticed before.) You need to convert ch to char or buf[i] to
unsigned char before comparing the two.
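
One way to write that comparison safely is to convert the stored byte to unsigned char before comparing it with the int that getchar() returned. This is only a minimal sketch; the helper name byte_matches is made up for illustration and is not from the post above.

```c
#include <stdio.h>

/* Hypothetical helper: compare a value returned by getchar()/fgetc()
   (an unsigned char value, or EOF) with a byte stored in a char array. */
static int byte_matches(int ch, char stored)
{
    /* The cast maps a possibly negative plain char into the
       0..UCHAR_MAX range that getchar() uses for real characters. */
    return ch != EOF && ch == (unsigned char)stored;
}
```

Inside the loop from the post, the test would then read `if (byte_matches(ch, buf[i])) ...`.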

--
Hallvard

Nov 14 '05 #3

Chris Torek

In article <news:br***********@imsp212.netvigator.com>
Jacky Cheung <ja*****@yahoo.com> writes:

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

UTF-8 is simply an encoding mechanism for taking larger-than-8-bit
values and storing them in 8-bit values. The details of this
mechanism are pretty much off-topic in comp.lang.c, but here we
can say that UTF-8 encoded characters will always fit in objects
of type "unsigned char", as those will have at least 8 bits.

Your actual question above cannot (quite) be answered as asked as
it appears to contain at least one false assumption, i.e., that
the presence of a '\0' character in an array of unsigned char will
"crash" strlen(). In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char, searching forward until it finds a '\0' value, then returning
the number of non-'\0'-"char"s it has skipped. Passing strlen()
the address of an array of "char" that does *not* contain a '\0'
could cause the program to crash (or indeed exhibit any behavior
at all); so I think what you really mean to ask is:

"Given some sequence of values in some wider-than-8-bit
character set (such as 16 or 32 bit Unicode), suppose I have
encoded it in 8-bit bytes using the UTF-8 scheme. Can I
(usefully) apply strlen() to the result?"

The answer to this version of the question is "maybe". In particular,
you must ensure that:

a) none of the 8-bit values is a trap representation in plain
"char" if plain "char" is signed (and the C language proper is
not terribly helpful here, but you could constrain yourself to
two's complement systems or those with wide-enough "plain" chars,
by checking that either CHAR_MAX >= 255 -- i.e., no UTF-8 value
will be negative -- or that CHAR_MIN <= -128);

b) that the "char" array you have used to store the encoded
values is '\0'-terminated;

c) that you did not embed any '\0' values in that array, and

d) that the resulting strlen() value meets any other criteria
you may hide beneath the word "useful".

The conditions in part (a) are met by most C systems today, so you
might simply assume them (and document that assumption somewhere).
The conditions in part (b) and (c) may, or may not, arise naturally
out of the values you are UTF-8 encoding -- this part is up to you.
Part (d) is likewise something only you can answer.
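
Conditions (a) through (c) can be checked mechanically. The following is a rough sketch rather than anything from the post: the #if restates the CHAR_MAX / CHAR_MIN test from (a), and is_clean_c_string is a hypothetical helper that assumes you know how many bytes (terminator included) were written into the array.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Condition (a): refuse to compile where plain char cannot hold
   every 8-bit value that a UTF-8 byte may take. */
#if !(CHAR_MAX >= 255 || CHAR_MIN <= -128)
#error "plain char cannot represent all UTF-8 byte values"
#endif

/* Hypothetical helper for (b) and (c): buf holds n bytes in total.
   Returns 1 if the first '\0' is the very last byte, i.e. the array
   is terminated and has no embedded '\0'; returns 0 otherwise. */
static int is_clean_c_string(const char *buf, size_t n)
{
    const char *nul;

    if (n == 0)
        return 0;
    nul = memchr(buf, '\0', n);
    return nul == buf + (n - 1);
}
```

When the helper returns 1, strlen(buf) is exactly n - 1.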
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Nov 14 '05 #4

Chris

I remember when I did UCS2 for similar vCard application, I used the
following structure:

    typedef struct ucs2_tag {
        unsigned short* str_ptr;
        unsigned int length;
    } ucs2, *ucs2_ptr;

Doing it like this makes it obvious to people who need to work on your
code what you're doing. If you stick to memcpy() for string copying, you
don't need to worry about whether there is a NUL terminator; this comes at
the small cost of managing the memory for each structure that you create.

Also, you might need to write your own myStrlen() to count the number of
characters in an input string, since its length can be unpredictable.

So in your case you can declare a structure for UTF-8 as:

    typedef struct utf8_tag {
        unsigned char* str_ptr;
        unsigned int length;
    } utf8, *utf8_ptr;
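
As an illustration of the memcpy() approach with such a structure, here is a minimal sketch. The helpers utf8_assign and utf8_release are hypothetical names, not part of any library, and the length field is taken to mean bytes rather than characters.

```c
#include <stdlib.h>
#include <string.h>

typedef struct utf8_tag {
    unsigned char *str_ptr;
    unsigned int length;    /* number of bytes, not characters */
} utf8;

/* Hypothetical helper: copy n bytes from src into newly allocated
   storage owned by dst. Returns 0 on success, -1 if malloc fails. */
static int utf8_assign(utf8 *dst, const void *src, unsigned int n)
{
    unsigned char *p = malloc(n ? n : 1);

    if (p == NULL)
        return -1;
    memcpy(p, src, n);      /* copies embedded zero bytes too */
    dst->str_ptr = p;
    dst->length = n;
    return 0;
}

/* Hypothetical helper: free the storage and reset the fields. */
static void utf8_release(utf8 *s)
{
    free(s->str_ptr);
    s->str_ptr = NULL;
    s->length = 0;
}
```

Because the length travels with the pointer, nothing here depends on a terminating '\0'; the price is that every utf8 value must eventually be passed to utf8_release.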

The answer to your 2nd question is that the NUL character is NOT EQUIVALENT to
NULL!!!
There is no such thing as a "NULL character", but there is a "NUL
character", which is the '\0' at the end of a string buffer.

Just out of curiosity, are you a mobile phone software developer?

"Jacky Cheung" <ja*****@yahoo.com> wrote in message
news:br***********@imsp212.netvigator.com...

Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which treat as NULL character in strlen?

Jacky

Nov 14 '05 #5

Simon Biber

"Chris" <ch*******@nospam.hotmail.com> wrote:

I remember when I did UCS2 for similar vCard application, I used the
following structure:

    typedef struct ucs2_tag {
        unsigned short* str_ptr;
        unsigned int length;
    } ucs2, *ucs2_ptr;

That looks good. I would use a value of type size_t to store the
length, and as a style point I wouldn't provide a pointer typedef.
Users can declare a

    ucs2 *my_ptr;

if they like.

By doing it like this, it'll be obvious to people who need to work on
your code what you're doing and stick to memcpy() for string copying
then you dun need to worry about the NUL terminator or not, this is
achieved at a small cost of managing the memory usage for each
structure that you create.

True, and useful in the case of UCS2. Or you could just use C's
wide strings if they are implemented in UCS2 on your platform.

Also, you might need to write your own myStrlen() to count the
number of characters of an input string since it's length can
be unpredictable.

That would be more a file format issue; in the case of a UTF-8
encoded text file, there should not be any embedded zero bytes
and usual I/O or string functions like fgets and strlen should
work fine.

So in your case you can declare a structure for UTF-8 as:

    typedef struct utf8_tag {
        unsigned char* str_ptr;
        unsigned int length;
    } utf8, *utf8_ptr;

This is typically not needed for UTF-8. UTF-8 has the important
property that any code value from 0 to 127 inclusive codes for
the respective ASCII character and cannot occur as part of the
multi-byte representation for a higher character. Therefore, any
zero byte occurring in the UTF-8 string is indeed a real ASCII NUL
character and therefore can be used transparently with the usual
C semantics of string termination.
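
The same property makes byte-wise searches for ASCII characters safe. A small sketch, assuming a NUL-terminated UTF-8 string and an ASCII delimiter such as ';' (count_ascii is a made-up name for illustration):

```c
#include <string.h>

/* Hypothetical helper: count occurrences of an ASCII delimiter in a
   NUL-terminated UTF-8 string. Every byte of a multi-byte sequence is
   0x80 or above, so a byte equal to an ASCII delimiter is always a
   real delimiter and never a fragment of another character. */
static size_t count_ascii(const char *s, char delim)
{
    size_t n = 0;

    if (delim == '\0')
        return 0;           /* the terminator is not a delimiter */
    while ((s = strchr(s, delim)) != NULL) {
        n++;
        s++;                /* step past the match and keep searching */
    }
    return n;
}
```

The same reasoning covers strlen(), strchr(), strtok() and similar functions as long as the bytes being searched for are ASCII.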

--
Simon.

Nov 14 '05 #6

grobbeltje

Chris Torek <no****@torek.net> wrote:

In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char,

Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars. The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

Am I wrong (again)? Does it really matter whether the chars in
this comparison are signed or not?

ps: Sorry for my bad english.
Grobbeltje (just another curious newbee).
--
Next time you think you're perfect, try walking on water.

Nov 14 '05 #7

Kevin Goodsell

grobbeltje wrote:

Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

You're reading in the wrong place. The behavior of char is described
earlier, in section 6.2.5/15:

The three types char, signed char, and unsigned char are
collectively called the character types. The implementation
shall define char to have the same range, representation,
and behavior as either signed char or unsigned char.

So char behaves exactly like either signed char or unsigned char, and
the implementation must decide (and document) which. The three types are
distinct, however.
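
If you want to see which choice your implementation made, <limits.h> tells you. A minimal sketch (the function name is invented for this example):

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char behaves like signed char
   on this implementation, 0 if it behaves like unsigned char.
   CHAR_MIN is 0 exactly when plain char is unsigned. */
static int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Many compilers also provide an option to switch the default, so the answer can differ between builds on the same machine; check your compiler's documentation.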

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars.

This interpretation is wrong. I'm unsure of how you got that from the
passage you quoted.

The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

I'm not totally sure what you mean here. To the best of my knowledge,
strlen should operate correctly on any character type.

Am I wrong (again)? Does it really matter whether the chars in
this comparison are signed or not?

The 'signed-ness' of the objects involved in a comparison is certainly
important, but I'm still not sure what you are getting at.

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Nov 14 '05 #8

David Resnick

"Jacky Cheung" <ja*****@yahoo.com> wrote in message news:<br***********@imsp212.netvigator.com>...

Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

Jacky

A UTF-8 encoded string has no embedded NUL bytes (unless the text itself
contains a NUL character); that is one of its primary virtues. However,
note that strlen() of a UTF-8 string counts bytes, so unless the string
is all ASCII the result is greater than the number of characters the
string represents.
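
To get the number of encoded code points instead of the number of bytes, you can skip the continuation bytes, which all have the bit pattern 10xxxxxx. This is a sketch that assumes well-formed, NUL-terminated UTF-8 and does no validation; utf8_codepoints is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a well-formed UTF-8 string
   by counting every byte that is not a continuation byte. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes have their top two bits set to 10. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the six bytes "\xC3\xA4\xC3\xB6\xC3\xBC" (the letters a, o and u with umlauts) strlen() gives 6 while utf8_codepoints() gives 3. Note that a code point is still not always what a user perceives as one character, for example when combining marks are involved.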

-David

Nov 14 '05 #9

Christian Bau

In article <br**********@news.tue.nl>, grobbeltje <gr*****@hotmail.com>
wrote:

Chris Torek <no****@torek.net> wrote:

In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char,

Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars. The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

The compiler has to make a choice between two possibilities: Either
plain "char" behaves exactly the same way as "unsigned char", or plain
"char" behaves exactly the same as "signed char". The compiler must make
its decision and then stick with it.

Nov 14 '05 #10
