Unicode: ugh!

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
--
"There's only one thing that will make them stop hating you.
And that's being so good at what you do that they can't ignore you.
I told them you were the best. Now you damn well better be."
--Orson Scott Card, _Ender's Game_
Mar 13 '06 #1
Ben Pfaff wrote:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


....and your C question is...

;-) ;-)

--ag

--
Artie Gold -- Austin, Texas
http://goldsays.blogspot.com
"You can't KISS* unless you MISS**"
[*-Keep it simple, stupid. **-Make it simple, stupid.]
Mar 13 '06 #2
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


What's your complaint? That the ASCII null should be spelled NUL?
Mar 13 '06 #3
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


What's your complaint? That the ASCII null should be spelled NUL?


Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

It's amazing how much they managed to get wrong in a single
sentence.
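
A minimal sketch of the distinction drawn above, for readers who want to see it in code. The names greeting, greeting_ptr, not_a_string, string_size and check_string_facts are made up for illustration; nothing here comes from either standard's text beyond the definition already quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The string itself: the characters 'h', 'i', and the terminating
   null character. The array therefore holds three chars. */
const char greeting[] = "hi";

/* A pointer to char is a separate object. It happens to point at a
   string here, but it is not the string. */
const char *greeting_ptr = greeting;

/* An array of char with no null character in it is not a string;
   passing it to strlen() would be undefined behavior. */
const char not_a_string[2] = { 'h', 'i' };

/* Hypothetical helper: number of chars the string occupies,
   counting the terminating null character. */
size_t string_size(const char *s)
{
    return strlen(s) + 1;
}

/* Checks the facts stated in the comments above. */
void check_string_facts(void)
{
    assert(sizeof greeting == 3);           /* 'h', 'i', '\0' */
    assert(greeting[2] == '\0');            /* null character, a char */
    assert(string_size(greeting_ptr) == 3);
    assert(sizeof not_a_string == 2);       /* no room for a terminator */
    assert(greeting_ptr != NULL);           /* NULL is about pointers */
}
```

Note that NULL only ever appears in a comparison with the pointer, never with a character.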
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Mar 13 '06 #4
"Ben Pfaff" wrote:
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
What's your complaint? That the ASCII null should be spelled NUL?


Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.


I glossed over the word "conventionally"; that is not a good basis for a
definition. As for the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it, but I don't claim to have seen all the UTFs that exist.

It's amazing how much they managed to get wrong in a single
sentence.


I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.
Mar 13 '06 #5
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" wrote:
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" writes:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?


Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.


I glossed over the word "conventionally"; that is not a good basis for a
definition. As for the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it, but I don't claim to have seen all the UTFs that exist.

It's amazing how much they managed to get wrong in a single
sentence.


I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.


Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 13 '06 #6
On 2006-03-13, Keith Thompson <ks***@mib.org> wrote:

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.


In retrospect it was a bit silly to have a NULL and a "null"
character, '\0', and then to compound it all with a "null pointer"...
Two seconds with Google shows generations of confusion and standards
abuse.

Mar 13 '06 #7
On 2006-03-13, Keith Thompson <ks***@mib.org> wrote:
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" wrote:
"osmium" <r1********@comcast.net> writes:
"Ben Pfaff" writes:
> The Unicode standard says this in section 3.9:
>
> "For example, a string is defined as a pointer to char in the
> C language, and is conventionally terminated with a NULL
> character."
>
> You'd think folks writing standards would bother to properly read
> and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.


I glossed over the word "conventionally"; that is not a good basis for a
definition. As for the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it, but I don't claim to have seen all the UTFs that exist.

It's amazing how much they managed to get wrong in a single
sentence.


I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.


Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.


Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.
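
A small sketch of why that works, assuming a C (not C++) compiler: in C, '\0' is an integer constant expression of type int with value 0, which is what makes it a null pointer constant. The function name below is hypothetical, and some compilers may warn about the initialization even though it is valid.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a pointer initialized from '\0' compares equal to NULL.
   In C, '\0' has type int and value 0, so it is a null pointer
   constant and the initialization below yields a null pointer. */
int char_zero_is_null_pointer_constant(void)
{
    char *p = '\0';   /* a null pointer, not a pointer to an empty string */
    return p == NULL;
}
```

Legal is not the same as advisable: the initialization reads as if it stored a character, which is exactly the confusion this thread is about.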
Mar 13 '06 #8
Jordan Abel <ra*******@gmail.com> writes:
On 2006-03-13, Keith Thompson <ks***@mib.org> wrote:

[...]
Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.


Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.


Yes, of course; that's the "very little" I was referring to.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 13 '06 #9
In article <ln************@nuthaus.mib.org>,
Keith Thompson <ks***@mib.org> wrote:
Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.


However, the text is in the Unicode standard, and there NULL means the
character with code 0.

-- Richard
Mar 13 '06 #10
Ben Pfaff <bl*@cs.stanford.edu> wrote:

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


Not after you've worked in the standards world for a while, you
wouldn't. :-) Of course, some committees are better than others.

-Larry Jones

Wow, how existential can you get? -- Hobbes
Mar 14 '06 #11
Ben Pfaff <bl*@cs.stanford.edu> wrote:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.

Richard
Mar 14 '06 #12
rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
Ben Pfaff <bl*@cs.stanford.edu> wrote:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.
[NULL is correct for Unicode.]


There's a lot more wrong with it than misspelling "null".
--
"The way I see it, an intelligent person who disagrees with me is
probably the most important person I'll interact with on any given
day."
--Billy Chambless
Mar 14 '06 #13
On 2006-03-14, Richard Bos <rl*@hoekstra-uitgeverij.nl> wrote:
Ben Pfaff <bl*@cs.stanford.edu> wrote:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.


Well, yeah. That's the English word/phrase for which NUL is an
abbreviation, just like we have START OF TEXT for STX, and so on.
Mar 14 '06 #14
In article <44****************@news.xs4all.nl> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
....
The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL.


That name was already present in Unicode 1.1.5 (July 1995) (the earliest
reference that is available online).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Mar 15 '06 #15
rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:
Ben Pfaff <bl*@cs.stanford.edu> wrote:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.


The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.


I took another look at the text I quoted. At a second look, it
is clearly *not* referring to the Unicode character called NULL,
because Unicode character names in the Unicode standard are
expressed in small capital letters. The NULL in the paragraph
above is in full-size capital letters, so it does not refer to a
Unicode character name.
--
"I don't have C&V for that handy, but I've got Dan Pop."
--E. Gibbons
Mar 15 '06 #16
In article <87************@benpfaff.org>,
Ben Pfaff <bl*@cs.stanford.edu> wrote:
I took another look at the text I quoted. At a second look, it
is clearly *not* referring to the Unicode character called NULL,
because Unicode character names in the Unicode standard are
expressed in small capital letters.


They are in small caps when written as (for example) "U+004B LATIN
CAPITAL LETTER K", but they also appear (without the U+XXXX) in plain
capitals (e.g. in the character tables themselves), lower case, and
italics. So I don't think you can deduce that they are not referring
to the Unicode character. Though they might well be using it in a
more generic sense of a null character without reference to Unicode in
particular (which would be more accurate in a sense, because as far as
I can see nothing guarantees that C's string-terminating character
maps to U+0000).

Anyway, I certainly don't think you can assume the author was
confusing it with the C macro NULL.

-- Richard
Mar 15 '06 #17
ri*****@cogsci.ed.ac.uk (Richard Tobin) wrote:
In article <87************@benpfaff.org>,
Ben Pfaff <bl*@cs.stanford.edu> wrote:
I took another look at the text I quoted. At a second look, it
is clearly *not* referring to the Unicode character called NULL,
because Unicode character names in the Unicode standard are
expressed in small capital letters.


They are in small caps when written as (for example) "U+004B LATIN
CAPITAL LETTER K", but they also appear (without the U+XXXX) in plain
capitals (e.g. in the character tables themselves), lower case, and
italics. So I don't think you can deduce that they are not referring
to the Unicode character. Though they might well be using it in a
more generic sense of a null character without reference to Unicode in
particular (which would be more accurate in a sense, because as far as
I can see nothing guarantees that C's string-terminating character
maps to U+0000).


The null character in C must be a character with value zero. U+0000
trivially also has value zero. If an implementation manages not to map
the one onto the other, I would say that that implementation does not
have Unicode as its character set, but at most Unicode-rearranged.
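
The C side of that claim can be checked directly. This is only a sketch with made-up function names; it shows that the null character has value zero and that a string ends at the first one, and it says nothing about how an implementation maps its characters to Unicode.

```c
#include <assert.h>
#include <string.h>

/* The null character always has the value zero in C. */
int null_character_value(void)
{
    return '\0';
}

/* A string ends at the first null character, even when the array
   that contains it is longer. */
size_t length_up_to_first_null(void)
{
    /* Six chars: 'a', 'b', '\0', 'c', 'd', and the implicit '\0'. */
    static const char data[] = "ab\0cd";
    assert(sizeof data == 6);
    return strlen(data);   /* counts only 'a' and 'b' */
}
```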

Richard
Mar 16 '06 #18
