Unicode: ugh! - C / C++

Ben Pfaff

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
--
"There's only one thing that will make them stop hating you.
And that's being so good at what you do that they can't ignore you.
I told them you were the best. Now you damn well better be."
--Orson Scott Card, _Ender's Game_

Mar 13 '06 #1

Subscribe Reply

1645

Artie Gold

Ben Pfaff wrote:

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

....and your C question is...

;-) ;-)

--ag

--
Artie Gold -- Austin, Texas
http://goldsays.blogspot.com
"You can't KISS* unless you MISS**"
[*-Keep it simple, stupid. **-Make it simple, stupid.]

Mar 13 '06 #2

osmium

"Ben Pfaff" writes:

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Mar 13 '06 #3

Ben Pfaff

"osmium" <r1********@com cast.net> writes:

"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally " terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

It's amazing how much they managed to get wrong in a single
sentence.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Mar 13 '06 #4

osmium

"Ben Pfaff" wrote:

"osmium" <r1********@com cast.net> writes:
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally " terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

I glossed over the word "conventionally ", that is not a good basis for a
definition. As far as the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it. But I don't claim to have seen all the UTF's that exist.
It's amazing how much they managed to get wrong in a single
sentence.

I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.

Mar 13 '06 #5

Keith Thompson

"osmium" <r1********@com cast.net> writes:

"Ben Pfaff" wrote:
"osmium" <r1********@com cast.net> writes:
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally " terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

I glossed over the word "conventionally ", that is not a good basis for a
definition. As far as the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it. But I don't claim to have seen all the UTF's that exist.
It's amazing how much they managed to get wrong in a single
sentence.

I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 13 '06 #6

Richard G. Riley

On 2006-03-13, Keith Thompson <ks***@mib.or g> wrote:

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

In restrospect it was a bit silly to have a NULL and a "null"
character,'\0', and then to compound it all with a "null pointer"... 2
seconds with google shows generations of confusion and standards
abuse.

Mar 13 '06 #7

Jordan Abel

On 2006-03-13, Keith Thompson <ks***@mib.or g> wrote:

"osmium" <r1********@com cast.net> writes:
"Ben Pfaff" wrote:
"osmium" <r1********@com cast.net> writes:
"Ben Pfaff" writes:
> The Unicode standard says this in section 3.9:
>
> "For example, a string is defined as a pointer to char in the
> C language, and is conventionally terminated with a NULL
> character."
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally " terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

I glossed over the word "conventionally ", that is not a good basis for a
definition. As far as the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it. But I don't claim to have seen all the UTF's that exist.
It's amazing how much they managed to get wrong in a single
sentence.

I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.

Mar 13 '06 #8

Keith Thompson

Jordan Abel <ra*******@gmai l.com> writes:

On 2006-03-13, Keith Thompson <ks***@mib.or g> wrote:

[...]

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.

Yes, of course; that's the "very little" I was referring to.

--
Keith Thompson (The_Other_Keit h) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Mar 13 '06 #9

Richard Tobin

In article <ln************ @nuthaus.mib.or g>,
Keith Thompson <ks***@mib.or g> wrote:

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

However, the text is in the Unicode standard, and there NULL means the
character with code 0.

-- Richard

Mar 13 '06 #10

Similar topics

17628

Writing UTF-8 string to UNICODE file

by: Michael Weir | last post by:

I'm sure this is a very simple thing to do, once you know how to do it, but I am having no fun at all trying to write utf-8 strings to a unicode file. Does anyone have a couple of lines of code that - opens a file appropriately for output - writes to this file Thanks very much. Michael Weir

Python

5288

Unicode from Web to MySQL

by: Bill Eldridge | last post by:

I'm trying to grab a document off the Web and toss it into a MySQL database, but I keep running into the various encoding problems with Unicode (that aren't a problem for me with GB2312, BIG 5, etc.) What I'd like is something as simple as: CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8)); import MySQLdb, re,urllib

Python

3673

Unicode BOM marks

by: Francis Girard | last post by:

Hi, For the first time in my programmer life, I have to take care of character encoding. I have a question about the BOM marks. If I understand well, into the UTF-8 unicode binary representation, some systems add at the beginning of the file a BOM mark (Windows?), some don't. (Linux?). Therefore, the exact same text encoded in the same UTF-8 will result in two different binary files, and of a slightly different length. Right ?

Python

4675

Adobe GoLive 6 - Nasty feature with UTF-8 encoding

by: Zenobia | last post by:

Recently I was editing a document in GoLive 6. I like GoLive because it has some nice features such as: * rewrite source code * check syntax * global search & replace (through several files at once) * regular expression search & replace. Normally my documents are encoded with the ISO setting. Recently I was writing an XHTML document. After changing the encoding to UTF-8 I used the

HTML / CSS

6080

minidom xml & non ascii / unicode & files

by: webdev | last post by:

lo all, some of the questions i'll ask below have most certainly been discussed already, i just hope someone's kind enough to answer them again to help me out.. so i started a python 2.3 script that grabs some web pages from the web, regex parse the data and stores it localy to xml file for further use.. at first i had no problem using python minidom and everything concerning

Python

2641

Revised PEP 349: Allow str() to return unicode strings

by: Neil Schemenauer | last post by:

python-dev@python.org.] The PEP has been rewritten based on a suggestion by Guido to change str() rather than adding a new built-in function. Based on my testing, I believe the idea is feasible. It would be helpful if people could test the patched Python with their own applications and report any incompatibilities. PEP: 349

Python

2019

Unicode case conversion

by: Matthias Kaeppler | last post by:

Hi, I need to convert a UTF-8 encoded string to all lowercase. I tried using boost::algorithm::to_lower but it couldn't handle the characters in a Glib::ustring (which are of type gunichar). Is there any safe, portable, and *working* way to do case conversion with a unicode string? Regards,

C / C++

1131

UGH, Framework goes just so far AGAIN!

by: Ray Cassick \(Home\) | last post by:

Ok, what is up here. The 2005 framework contains all kinds of cool new structures now that we have Generics and all but they always seem to fall just short of exactly what I need. In 2003 I needed a sets construct and they did not have it so I had to create one. I am not sure yet if they have one in 2005 yet BTW... I was dealing with the Dictionary last night and thought it would be a great

Visual Basic .NET

9089

Convertion of Unicode to ASCII NIGHTMARE

by: ChaosKCW | last post by:

Hi I am reading from an oracle database using cx_Oracle. I am writing to a SQLite database using apsw. The oracle database is returning utf-8 characters for euopean item names, ie special charcaters from an ASCII perspective. I get the following error: > SQLiteCur.execute(sql, row)

Python

10039

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

11359

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10928

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10543

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

9734

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

8102

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6149

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

4779

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

4346

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP