Bytes IT Community
Multi-byte chars

I've been reading the C standard online and I'm puzzled as to what multibyte
chars are. Wide chars I believe would be characters for languages such as
cantonese or Japanese. I know the ASCII character set specifies that each
character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
character?
Also how would you use the function parameter main (char argc, char
**argv) if that's correct?

Bill

Nov 13 '05 #1
43 Replies


Bill Cunningham wrote:
> I've been reading the C standard online and I'm puzzled as to what
> multibyte chars are.

A multibyte character is a "sequence of one or more bytes representing a
member of the extended character set of either the source or the execution
environment", if I have the quote from 3.7.2 right.

> Wide chars I believe would be characters for
> languages such as cantonese or Japanese.

C isn't as specific as that. See 3.7.3.

> I know the ASCII character set
> specifies that each character such as 'b' or 'B' is an 8 bit character.

7 bits, not 8. ASCII is a 7-bit code.

<snip>

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #2

Bill Cunningham <so**@some.net> wrote:

I've been reading the C standard online and I'm puzzled as to what multibyte
chars are. Wide chars I believe would be characters for languages such as
cantonese or Japanese. I know the ASCII character set specifies that each
character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
character?


A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value. Other characters are encoded as multiple bytes,
each of which has the top bit set; the first byte is in the range \xc0
to \xfd and indicates the number of bytes that follow, subsequent bytes
are in the range \x80 to \xbf. UTF-8 encoded characters can be any
length between one and six bytes. So 'A' is encoded as \x41 but '©'
(the copyright sign) is encoded as \xc2\xa9.
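
The byte layout described above can be sketched in C. This is a minimal illustration, not anything from the standard library: the function name utf8_encode is made up for the example, and it handles only the one- to four-byte forms (current UTF-8 definitions stop at four bytes, so the five- and six-byte forms mentioned above are deliberately left out).

```c
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 into out,
 * which must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {                       /* ASCII: one byte, same value */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* lead byte 110xxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* lead byte 1110xxxx */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                        /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {                   /* lead byte 11110xxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* out of range for this sketch */
}
```

Calling utf8_encode(0x41, buf) stores the single byte \x41, and utf8_encode(0xA9, buf) stores the two bytes \xc2 \xa9, matching the two cases given in the text.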

Multibyte encodings can be very space efficient, but they are difficult
to process since different characters have different lengths. Wide
characters, on the other hand, are intended to be efficient for
processing, but not necessarily space efficient. Wide characters are
integers that are large enough so that every logical character can be
represented in just one wide character.
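
A small sketch of that trade-off, assuming nothing beyond the standard mbstowcs function: the multibyte string is converted once into an array of wchar_t, after which the n-th character is simply element n. The wrapper name to_wide is hypothetical, and the result depends on the current locale (the default "C" locale is only guaranteed to handle the basic character set).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: convert a multibyte string to wide characters
 * under the current locale.
 * Returns the number of wide characters stored (not counting the
 * terminator), or (size_t)-1 if the input is invalid or does not fit
 * in cap elements together with its terminator. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;
    n = mbstowcs(out, mb, cap);
    if (n == (size_t)-1 || n == cap)   /* invalid input, or no room for L'\0' */
        return (size_t)-1;
    return n;
}
```

After a successful call, out[i] is the i-th logical character regardless of how many bytes it occupied in the multibyte form.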

-Larry Jones

If I get a bad grade, it'll be YOUR fault for not doing the work for me!
-- Calvin
Nov 13 '05 #3


<la************@eds.com> wrote in message news:nv**********@cvg-65-27-189-87.cinci.rr.com...
Bill Cunningham <so**@some.net> wrote:

I've been reading the C standard online and I'm puzzled as to what multibyte
chars are. Wide chars I believe would be characters for languages such as
cantonese or Japanese. I know the ASCII character set specifies that each
character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
character?


A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value.


My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set. Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where code values of Unicode are used as those
of the extended set?
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #4

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

<la************@eds.com> wrote in message news:nv**********@cvg-65-27-189-87.cinci.rr.com...
Bill Cunningham <so**@some.net> wrote:
>
> I've been reading the C standard online and I'm puzzled as to what multibyte
> chars are. Wide chars I believe would be characters for languages such as
> cantonese or Japanese. I know the ASCII character set specifies that each
> character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
> character?
A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value.


My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set.


Non sequitur. The fact that A belongs to the basic character set has
no relevance on the value of L'A', AFAICT. All the standard has to say
on the issue is:

11 A wide character constant has type wchar_t, an integer type
defined in the <stddef.h> header. The value of a wide character
constant containing a single multibyte character that maps to
a member of the extended execution character set is the wide
character corresponding to that multibyte character, as defined
by the mbtowc function, with an implementation-defined current
locale.

Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where code values of Unicode are used as those
of the extended set?


Nope, he was merely describing what happens on an implementation using
ASCII for normal characters and UCS for wide characters (therefore UTF-8
for multi-byte characters).

There is nothing preventing an implementation from using EBCDIC for
normal characters and UCS for wide characters, in which case it is foolish
to expect 'A' == L'A'.

Furthermore, there is nothing preventing an implementation from using
ASCII for normal characters and EBCDIC for wide characters (or vice
versa). The fact that C99 supports UCNs in source code means nothing WRT
the execution character set (whose extended version need not contain any
additional characters).

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #5


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:
<la************@eds.com> wrote in message news:nv**********@cvg-65-27-189-87.cinci.rr.com...
Bill Cunningham <so**@some.net> wrote:
>
> I've been reading the C standard online and I'm puzzled as to what multibyte
> chars are. Wide chars I believe would be characters for languages such as
> cantonese or Japanese. I know the ASCII character set specifies that each
> character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
> character?

A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value.


My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set.


Non sequitur. The fact that A belongs to the basic character set has
no relevance on the value of L'A', AFAICT. All the standard has to say
on the issue is:

11 A wide character constant has type wchar_t, an integer type
defined in the <stddef.h> header. The value of a wide character
constant containing a single multibyte character that maps to
a member of the extended execution character set is the wide
character corresponding to that multibyte character, as defined
by the mbtowc function, with an implementation-defined current
locale.


And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #6

Jun Woong <my******@hanmail.net> wrote:

My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set. Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where code values of Unicode are used as those
of the extended set?


Yes, but. That requirement is a hold-over from the very earliest days of
extended character set support, before there were functions to convert
between wide and narrow characters. Now that those functions exist,
there is no longer any reason for the requirement, and the committee has
voted to remove it. See the committee's response to DR #279:

<http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/dr_279.htm>
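
For code that has to cope with both situations, a hedged sketch: as far as I know, later revisions of the standard (C99 with TC3, and C11) let an implementation predefine the macro __STDC_MB_MIGHT_NEQ_WC__ to say that a basic character need not have the same value as its wide counterpart. The function name below is made up for illustration, and comparing only 'A' is a spot check, not a proof for the whole basic character set.

```c
/* Hypothetical helper.
 * Returns -1 if the implementation announces that narrow and wide
 * values of basic characters might differ; otherwise returns the
 * result of comparing 'A' with L'A' (1 if equal, 0 if not). */
int narrow_equals_wide(void)
{
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return -1;
#else
    return 'A' == L'A';
#endif
}
```

Code that gets -1 here should go through the conversion functions rather than comparing narrow and wide constants directly.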

-Larry Jones

Somebody's always running my life. I never get to do what I want to do.
-- Calvin
Nov 13 '05 #7

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.


This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #8


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.


This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.


Then, the proper answer to my previous question should be mention of
the DR in process, not citation of an irrelevant wording.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #9


<la************@eds.com> wrote in message news:73***********@cvg-65-27-189-87.cinci.rr.com...
[...]

Yes, but. That requirement is a hold-over from the very earliest days of
extended character set support, before there were functions to convert
between wide and narrow characters. Now that those functions exist,
there is no longer any reason for the requirement,


Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that now we have more complete set of functions to
deal with wide and multibyte characters so don't need the requirement
any more?
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #10

Jun Woong <my******@hanmail.net> wrote:

Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that now we have more complete set of functions to
deal with wide and multibyte characters so don't need the requirement
any more?


There were conversions between wide characters and multibyte *strings*,
but there weren't any conversions dealing with single byte characters
until btowc() and wctob() were added in NA1.
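
A minimal sketch of those two functions, both declared in <wchar.h>: btowc() turns a single byte into the corresponding wide character, and wctob() goes the other way. The helper name round_trip is hypothetical, and the outcome for bytes outside the basic character set depends on the current locale.

```c
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, wctob, wint_t, WEOF */

/* Hypothetical helper: send one single-byte character through its
 * wide form and back.
 * Returns the byte on success, or EOF if either conversion fails. */
int round_trip(unsigned char c)
{
    wint_t w = btowc(c);

    if (w == WEOF)
        return EOF;      /* c is not a valid single-byte character here */
    return wctob(w);     /* EOF if w has no single-byte representation */
}
```

With these available, a program can obtain the wide counterpart of '%' at run time instead of assuming that '%' == L'%'.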

-Larry Jones

Oh yeah? You just wait! -- Calvin
Nov 13 '05 #11

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

>And in 7.17p2:
>
> wchar_t
>
> which is an integer type whose range of values can represent
> distinct codes for all members of the largest extended character
> set specified among the supported locales; the null character
> shall have the code value zero and each member of the basic
> character set shall have a code value equal to its value when used
> as the lone character in an integer character constant.


This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.


Then, the proper answer to my previous question should be mention of
the DR in process, not citation of an irrelevant wording.


I have quoted the *relevant* wording. The library clause has no business
defining the semantics of wide characters, which are a language issue.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #12


<la************@eds.com> wrote in message news:st***********@cvg-65-27-189-87.cinci.rr.com...
Jun Woong <my******@hanmail.net> wrote:

Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that now we have more complete set of functions to
deal with wide and multibyte characters so don't need the requirement
any more?


There were conversions between wide characters and multibyte *strings*,
but there weren't any conversions dealing with single byte characters
until btowc() and wctob() were added in NA1.


Oh, now I see your point, thank you. I was thinking of it from the
viewpoint of an implementer, who has full access to the internal state
of the conversion.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #13


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes: [...]

Then, the proper answer to my previous question should be mention of
the DR in process, not citation of an irrelevant wording.


I have quoted the *relevant* wording. The library clause has no business
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
defining the semantics of wide characters, which are a language issue.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Sorry, but this makes me feel that it's not worth discussing this
problem with you any more. Some implementations of the standard
library depended on that '%' == L'%' with the requirement of C90,
and it was a reliable choice in practice *at that time*.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #14

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:[...]
>
>Then, the proper answer to my previous question should be mention of
>the DR in process, not citation of an irrelevant wording.


I have quoted the *relevant* wording. The library clause has no business
                                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
defining the semantics of wide characters, which are a language issue.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sorry, but this makes me feel that it's not worth discussing this
problem with you any more.


As I've already told you, you're always welcome to ignore my posts.
The text you've underlined makes perfect sense to me (otherwise I
wouldn't have written in the first place).
Some implementations of the standard
library depended on that '%' == L'%' with the requirement of C90,
and it was a reliable choice in practice *at that time*.


The implementor can depend on *anything* he wants, because he has full
control over the implementation, he doesn't need any guarantees from the
standard about the relationship between normal characters and wide
characters because he knows *exactly* what this relationship is on that
particular implementation.

I thought this was obvious to you...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #15

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:[...]
>>
>> I have quoted the *relevant* wording. The library clause has no business
>                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> defining the semantics of wide characters, which are a language issue.
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>[...]

The text you've underlined makes perfect sense to me (otherwise I
wouldn't have written in the first place).


>According to your logic, the following program is not s.c. even in

Don't invoke my logic, since you're obviously unable to understand it.

>C90, which is perfectly incorrect thought. Is this what you are
>saying?

#include <stdio.h>

int main(void)
{
    if ('a' == L'a') puts("okay");

    return 0;
}


Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation? Remove the broken text from the library
clause and C90 becomes more sensible. Ditto about C99, which contains
the same text.
>Some implementations of the standard
>library depended on that '%' == L'%' with the requirement of C90,
>and it was a reliable choice in practice *at that time*.


The implementor can depend on *anything* he wants, because he has full
control over the implementation, he doesn't need any guarantees from the
standard about the relationship between normal characters and wide
characters because he knows *exactly* what this relationship is on that
particular implementation.


The story changes if the implementer wants to make as many parts of
his library conform to the standard as possible.


The standard contains no requirement that the standard library is
implemented in C in the first place. A library implementation conforms
to the standard if it follows the standard specification for the library,
no matter in what language it is written or how portable or non-portable
its code is. Ideally, all the parts of the library should conform to the
library specification, not only "as many parts as possible" ;-)

Assuming that you're talking about implementing the library in portable
C (which is definitely NOT what you wrote above), I fail to see how the
assumption 'a' == L'a' can make the code more portable.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #16

Dan Pop <Da*****@cern.ch> wrote:

Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation?


I wouldn't call it broken, just overly restrictive. Until very
recently, no one with an EBCDIC implementation wanted the wchar_t
encoding to be anything other than IBM's DBCS (Double Byte Character
Set), which has the same relation to EBCDIC that Unicode/ISO 10646 has
to ASCII.

-Larry Jones

He doesn't complain, but his self-righteousness sure gets on my nerves.
-- Calvin
Nov 13 '05 #17


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes: [...]

According to your logic, the following program is not s.c. even in


Don't invoke my logic, since you're obviously unable to understand it.


Sorry, your logic is too foolish for me to understand.
C90, which is perfectly incorrect thought. Is this what you are
saying?

#include <stdio.h>

int main(void)
{
    if ('a' == L'a') puts("okay");

    return 0;
}
Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation? Remove the broken text from the library
clause and C90 becomes more sensible.


This is completely your personal opinion, which is completely
different from what the text of C90 actually says; please don't force
others to follow your poor opinion as you did in the "return; in main()"
discussion.

I've never thought that it was broken, considering that we didn't have
enough support for multibyte and wide characters in C90, it was rather
very restrictive. The only problem I can see about this is that the
committee should have removed it when drafting C99, since we already
had lots of support for the characters then.

[...]

The story changes if the implementer wants to make as many parts of
his library conform to the standard as possible.


The standard contains no requirement that the standard library is
implemented in C in the first place. A library implementation conforms
to the standard if it follows the standard specification for the library,
no matter in what language it is written or how portable or non-portable
its code is. Ideally, all the parts of the library should conform to the
library specification, not only "as many parts as possible" ;-)


Sorry for my poor wording.

Assuming that you're talking about implementing the library in portable
C (which is definitely NOT what you wrote above), I fail to see how the
assumption 'a' == L'a' can make the code more portable.


Try to implement one of the printf() family in C90 (excluding NA1).
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #18


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:
[...]

Rudeness works both ways ;-)

It's fortune that you know it.

This is completely your personal opinion, which is completely
different from the text of C90 exactly says;


Nope, it isn't, because it's my opinion about what C90 says.


Yes, it's just your opinion, not what C90 says, which is what I said.
So what?
I'm not
denying that it says what it says, merely claiming that what it says is
wrong. For reasons I have clearly explained.
I don't think so. It's very restrictive rather than broken at that
time; read Larry's posting on this.
please don't force others
to follow your poor opinion as did in "return; in main()" discussion.
Are you a complete idiot or what? I didn't force anyone to adopt any of
my opinions in any discussion (how could I do that, assuming that I wanted
to?).


You said it's broken. I said it's not broken, just very restrictive.
But what C90 says doesn't change regardless of whatever we think about
it. The standards, C90 and C99 as the current state, explicitly
guarantees that 'a' == L'a'. What's the problem with this? What
justifies you to say:

The fact that A belongs to the basic character set has
no relevance on the value of L'A'

?

If you meant to say that the wording in the standard should be revised
or will be revised, then you should have done so (as Larry did), not
given me the poor explanation above.
I've never thought that it was broken, considering that we didn't have
enough support for multibyte and wide characters in C90,
Why wasn't the support enough? And if it wasn't enough, why didn't the
committee add the missing bits, instead of breaking the standard?


Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
section, IIRC.
it was rather
very restrictive. The only problem I can see about this is that the
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
committee should have removed it when drafting C99, since we already
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
had lots of support for the characters then.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since both standards say the same thing, your argument about not enough
support in C90 is completely unsupported. Try something better.
Read the underlined wording.

Try to implement one of the printf() family in C90 (excluding NA1).


Convert the format string to wide characters and use only wide character
constants in the implementation of printf. Generate the output as wide
characters and convert them to multibyte characters before actually
outputting them. Where is the portability problem? Which of these
conversions isn't supported by C89?

The thing I can't figure out is how to generate a multibyte format string
in C89, as a string literal. The only solution is to start with a wide
string literal and convert it to a multibyte character string.


The multibyte character sequence given to printf() by user can have
redundant shift characters which can make the resulting mb characters
from the wide characters differ from the original. The guarantee that
'%' == L'%' can make it easy to write a code to scan the conversion
specifier from the mb character sequence, despite lack of support for
conversion between characters; of course, there was a more complicated
way to do it not depending on the fact.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #19

In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:[...]

It's fortune that you know it.


Could you please be a little more careful when writing English text?
>This is completely your personal opinion, which is completely
>different from the text of C90 exactly says;


Nope, it isn't, because it's my opinion about what C90 says.


Yes, it's just your opinion, not what C90 says, which is what I said.
So what?


I am perfectly entitled to my opinion. Just like anyone else.
I'm not
denying that it says what it says, merely claiming that what it says is
wrong. For reasons I have clearly explained.


I don't think so. It's very restrictive rather than broken at that
time; read Larry's posting on this.


I have: it didn't sound very convincing to someone inclined to use his
own judgement instead of blindly believing everything said by a committee
member.

A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
characters), for NO good reason, is downright broken in my book. And both
C89 and C99 do that.
>please don't force others
>to follow your poor opinion as did in "return; in main()" discussion.


Are you a complete idiot or what? I didn't force anyone to adopt any of
my opinions in any discussion (how could I do that, assuming that I wanted
to?).


You said it's broken. I said it's not broken, just very restrictive.
But what C90 says doesn't change regardless of whatever we think about
it. The standards, C90 and C99 as the current state, explicitly
guarantees that 'a' == L'a'. What's the problem with this? What
justifies you to say:

The fact that A belongs to the basic character set has
no relevance on the value of L'A'


I have already explained what. And I agree that the standard provides
this guarantee. What's the problem with this? ;-)
>I've never thought that it was broken, considering that we didn't have
>enough support for multibyte and wide characters in C90,


Why wasn't the support enough? And if it wasn't enough, why didn't the
committee add the missing bits, instead of breaking the standard?


Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
section, IIRC.


Quote the relevant paragraphs.
>it was rather
>very restrictive. The only problem I can see about this is that the
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>committee should have removed it when drafting C99, since we already
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>had lots of support for the characters then.
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since both standards say the same thing, your argument about not enough
support in C90 is completely unsupported. Try something better.


Read the underlined wording.


Does it change the fact that both standards say the same thing? If not,
the underlined text doesn't prove anything at all.
>Try to implement one of the printf() family in C90 (excluding NA1).


Convert the format string to wide characters and use only wide character
constants in the implementation of printf. Generate the output as wide
characters and convert them to multibyte characters before actually
outputting them. Where is the portability problem? Which of these
conversions isn't supported by C89?

The thing I can't figure out is how to generate a multibyte format string
in C89, as a string literal. The only solution is to start with a wide
string literal and convert it to a multibyte character string.


The multibyte character sequence given to printf() by user can have
redundant shift characters which can make the resulting mb characters
from the wide characters differ from the original.


Differ in what sense? Are the semantics of the text preserved or not?
The guarantee that
'%' == L'%' can make it easy to write a code to scan the conversion
specifier from the mb character sequence,
Nope, it cannot: you cannot process multibyte characters *before*
converting them to wide characters, because the standard does NOT
specify the encoding mechanism. Keep in mind that characters from the
base character set preserve their single byte values *only* in the initial
shift state (whatever that is):

While in the
initial shift state, all single-byte characters retain their usual
interpretation and do not alter the shift state. The interpretation
                                                 ^^^^^^^^^^^^^^^^^^
for subsequent bytes in the sequence is a function of the current
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
shift state.
^^^^^^^^^^^^

>despite lack of support for
>conversion between characters; of course, there was a more complicated
>way to do it not depending on the fact.


There is no other way, without making assumptions about how mb characters
are encoded (see the quote above). And if you make such assumptions,
your code is no longer portable. There is no easy way to tell whether
a byte you read from the string corresponds to a single byte character
or is a shift state changer or is the first character of a multibyte
character.
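
The approach argued for above (decode first, then compare against wide constants) can be sketched as follows. This is only an illustration under stated assumptions: count_percent is a hypothetical name, it counts raw '%' characters rather than parsing conversion specifications, and it relies on mbtowc() to track the shift state instead of inspecting bytes directly.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the '%' characters in a multibyte string
 * by decoding it one character at a time under the current locale.
 * Returns -1 if the string contains an invalid multibyte sequence. */
long count_percent(const char *s)
{
    size_t left = strlen(s);
    long count = 0;

    (void)mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    while (left > 0) {
        wchar_t wc;
        int n = mbtowc(&wc, s, left);

        if (n < 0)
            return -1;                  /* invalid sequence */
        if (n == 0)
            break;                      /* decoded a null character */
        if (wc == L'%')
            ++count;                    /* compare wide against wide only */
        s += n;
        left -= (size_t)n;
    }
    return count;
}
```

No byte of the input is ever compared with '%' directly, so the code makes no assumption about how the multibyte encoding represents shift sequences.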

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #20

In article <be**********@news.hananet.net>, my******@hanmail.net
says...
"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
Don't invoke my logic, since you're obviously unable to understand it.


Sorry, your logic is too foolish for me to understand.


Can the two of you go off privately somewhere and beat each other to
a pulp? Watching it here doesn't seem very productive.

--
Randy Howard
remove the obvious bits from my address to reply.
Nov 13 '05 #21


"Jun Woong" <my******@hanmail.net> wrote in message news:be**********@news.hananet.net...
[...]

char foo[] = "\x70\x70\x01\x02";
char bar[MB_CUR_MAX];

Assuming that str[] contains a valid multibyte character sequence,
'\x70' is a shift character and redundant shift characters are
allowed,

mbtowc(&wc, str, sizeof(str)-1);


Sorry. Two occurrences of "str" should be replaced with "foo".
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #22

In <kp***********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:
Dan Pop <Da*****@cern.ch> wrote:

I am perfectly entitled to my opinion. Just like anyone else.


Indeed you are, as am I.
A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
characters), for NO good reason, is downright broken in my book. And both
C89 and C99 do that.


My opinion is that your opinion is downright broken. ;-)

There were very good reasons for the restriction in C89.


This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.

AFAICT, there was NO good reason for this restriction in C89. Due to the
shift state issue, it provided no help when dealing with mb character
strings.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #23

Dan Pop <Da*****@cern.ch> wrote [quoting me]:

There were very good reasons for the restriction in C89.


This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.


Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)

-Larry Jones

I stand FIRM in my belief of what's right! I REFUSE to
compromise my principles! -- Calvin
Nov 13 '05 #24
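
For readers on a later implementation, the "ask the locale what % looks like" step can be sketched without touching mbtowc's hidden state. This assumes mbrtowc from Amendment 1 (1995), which did not exist in C89/C90, so it illustrates the idea rather than what was available to the committee. The helper name wide_percent is hypothetical.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the wide character that "%" decodes to in
 * the current locale, or WEOF if the conversion fails. */
wint_t wide_percent(void)
{
    mbstate_t st;
    wchar_t wc;

    memset(&st, 0, sizeof st);      /* a zeroed mbstate_t is the initial conversion state */
    if (mbrtowc(&wc, "%", 1, &st) != 1)
        return WEOF;                /* '%' should be one byte in the initial state */
    return (wint_t)wc;
}
```

Because the conversion state lives in a local object, nothing has to be saved and restored around the call, which is the overhead described above.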

Dan Pop <Da*****@cern.ch> wrote:

The work on Unicode started in 1986, which is a good three years before
the adoption of C89.


But it hadn't gotten very far by the time C89 was finished (which was,
remember, a year before it was published due to procedural snafus). The
16-bit camp and the 32-bit camp were both deeply entrenched and fighting
with each other, leading to the eventual schism between the ISO 10646
folks and the Unicode folks that wasn't reconciled until fairly
recently. There wasn't even consensus among the masses that a universal
character set was practical, achievable, or even desirable.

-Larry Jones

Everything's gotta have rules, rules, rules! -- Calvin
Nov 13 '05 #25

la************@eds.com wrote:
Dan Pop <Da*****@cern.ch> wrote [quoting me]:

There were very good reasons for the restriction in C89.


This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.


Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)


Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.

- Kevin.

Nov 13 '05 #26


"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:ne********************@tomato.pcug.org.au...
la************@eds.com wrote:

Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.
As you've said, it is necessary to convert the format string to a
sequence of wide characters and look for one corresponding to a percent
sign. But what is the wide character code for a percent sign? It's
tempting to say that it's L'%', but remember that the wide character
encoding is allowed to be locale-specific, and the user is allowed to
change the current locale at any time, so that doesn't work without
something like the restriction under discussion. (With the restriction,
of course, you don't even need to use a wide character constant, '%' is
sufficient).

Without it, you'd be forced to call mbtowc on "%" every time to get the
current encoding, but the implementation must behave as if no library
function calls mbtowc, so you'd also have to save and restore its state
around the call. That was considered to be unacceptable overhead to
require, thus the restriction. (Which, as I've said before, was
innocuous at the time since no one was even contemplating an
implementation where it did not hold.)


Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.


One reason I can think of is portability. One easier (but not portable)
way than you said is to take advantage of an internal access to the
state of the conversion.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #27

Jun Woong <my******@hanmail.net> wrote:

"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:ne********************@tomato.pcug.org.au...
la************@eds.com wrote: [ ...implementing _Printf, and '%' == L'%'... ]
> Without it, you'd be forced to call mbtowc on "%" every time to get the
> current encoding, but the implementation must behave as if no library
> function calls mbtowc, so you'd also have to save and restore its state
> around the call. That was considered to be unacceptable overhead to
> require, thus the restriction. (Which, as I've said before, was
> innocuous at the time since no one was even contemplating an
> implementation where it did not hold.)


Why can't the implementation provide, for its own use, a lookup table
of what_percent_looks_like_in_this_locale[] - after all, mbtowc clearly
has this information available.


One reason I can think of is portability. One easier (but not portable)
way than you said is to take advantage of an internal access to the
state of the conversion.


There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.

- Kevin.

Nov 13 '05 #28


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <be**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

When C90 was the current standard, was there UCS?
UCS did exist when C99 was drafted, yet the broken text is still there.


I've already said that I agree with your position that C99 shouldn't
have had the text. I guess it was a mistake.
The work on Unicode started in 1986, which is a good three years before
the adoption of C89.
Its publication was certainly after C90's.

I also agree that C99 (or C90+NA1) should have been revised to
get rid of the wording in question, but I don't agree about C90.


What *exactly* was it buying to C90?


The text in C90 didn't make a major problem in practice at that time.

[...]

PJ Plauger describes the history of NA1 in that section, which is
reasonably long. IIRC when C90 was published, the committee already
knew that C90's support for some features like the wide characters was
not enough. But because the committee promised a later supplement (which
was NA1) to members who objected to approval of the standard, we were able
to have C90 at that time.


This doesn't explain anything at all about the necessity of having
'a' == L'a', does it?


Read in context, please.

char foo[] = "\x70\x70\x01\x02";
char bar[MB_CUR_MAX];

Assuming that str[] contains a valid multibyte character sequence,
'\x70' is a shift character and redundant shift characters are
allowed,

mbtowc(&wc, str, sizeof(str)-1);
wctomb(bar, wc);

the sequence in bar[] can be "\x70\x01\x02". Is this wrong?


I can't see anything wrong with that. Where is the problem?

DP> Convert the format string to wide characters and use only wide character
~~~~~~~~~~~~~~~~~~~
DP> constants in the implementation of printf. Generate the output as wide
~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
DP> characters and convert them to multibyte characters before actually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~
DP> outputting them. [...]
~~~~~~~~~~~~~~~


Correct, but this is not what I said. What I said is,

while(mbtowc(&wc, fmtstr, len) > 0) {
if (wc == '%') /* conversion specifier */

(Sure, the implementation is allowed to use mbtowc for this purpose).
This construct depends on the guarantee that '%' == L'%'.


And what the hell is wrong with

if (wc == L'%') /* conversion specifier */

which does NOT depend on that guarantee and is what I have suggested as
the portable solution to your problem?


Nope, it still depends on the guarantee. If there is no guarantee like
that, wc can have a different value from L'%' depending on locales,
even if wc contains a wide percent character in that locale.

Misunderstanding here. What I had in my mind (and used before) needs
an internal access to the state for the character conversion, which is
non-portable, of course.


Then, why did you invoke *portability* arguments for the usefulness of
the guarantee under discussion?


See above. And the reason I mentioned the other way is to say that an
implementer can rely on the implementation details if he doesn't care
about portability.

Nope, the code was equally easy to write in pure C89, without relying on
the guarantee, as demonstrated above.


In an incorrect way.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #29
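
The mbtowc/wctomb round trip discussed above can be sketched as follows. The helper name roundtrip_first is hypothetical. One assumption differs from the fragment in the post: the output buffer is sized with MB_LEN_MAX, because MB_CUR_MAX need not be a constant expression and so cannot portably size an array in C89.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decodes the first character of the multibyte string s
 * and re-encodes it into out, which must hold at least MB_LEN_MAX bytes.
 * Returns the number of bytes written, or -1 on error. */
int roundtrip_first(const char *s, char *out)
{
    wchar_t wc;

    mbtowc(NULL, NULL, 0);          /* decoder: start in the initial shift state */
    wctomb(NULL, 0);                /* encoder: start in the initial shift state */
    if (mbtowc(&wc, s, strlen(s)) <= 0)
        return -1;
    return wctomb(out, wc);         /* may emit a different, equivalent byte sequence */
}
```

In a state-dependent encoding the bytes written to out need not match the bytes read from s, for example when the input carried redundant shift sequences; only the character they represent is preserved.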


"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:ne********************@tomato.pcug.org.au...
Jun Woong <my******@hanmail.net> wrote:

[...]

One reason I can think of is portability. One easier (but not portable)
way than you said is to take advantage of an internal access to the
state of the conversion.


There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.


The story can change if the committee considered the possibility that
users would want to write similar code in a portable way.
Without such a guarantee, the only way you, as a user of an
implementation who doesn't know the implementation details, can
write similar code is to use a technique that's somewhat complicated
and has overhead.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #30


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
In <kp***********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:

[...]

There were very good reasons for the restriction in C89.


This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.


The reason I didn't ask what they were is not that I'm not immune to
it. It's because I know what they are.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #31

Jun Woong <my******@hanmail.net> wrote:

"Kevin Easton" <kevin@-nospam-pcug.org.au> wrote in message news:ne********************@tomato.pcug.org.au...
Jun Woong <my******@hanmail.net> wrote:

[...]
>
> One reason I can think of is portability. One easier (but not portable)
> way than you said is to take advantage of an internal access to the
> state of the conversion.


There are plenty of library functions that have unacceptable overheads
when implemented in a portable manner, but can usually be efficiently
implemented in a non-portable way. In particular, strcmp() comes to
mind - so I don't think the possibility of a portable implementation
suffering unacceptable overhead when a non-portable implementation
wouldn't is sufficient reason to add the restriction.


The story can change if the committee considered the possibility that
users would want to write similar code in a portable way.
Without such a guarantee, the only way you, as a user of an
implementation who doesn't know the implementation details, can
write similar code is to use a technique that's somewhat complicated
and has overhead.


That's already true - a completely portable implementation of ROT13 is
far more complicated and has more overhead than an implementation that
assumes ASCII.

- Kevin.

Nov 13 '05 #32
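
The ROT13 comparison can be illustrated with a small sketch. Both function names are hypothetical. The first version relies only on what the standard guarantees about letters (that they exist in the basic character set); the second assumes an ASCII-like encoding in which each case occupies 26 consecutive codes.

```c
#include <string.h>

/* Portable: makes no assumption that letter codes are contiguous or ordered. */
char rot13_portable(char c)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;

    if (c == '\0')
        return c;                   /* strchr would otherwise match the terminator */
    if ((p = strchr(lower, c)) != NULL)
        return lower[(p - lower + 13) % 26];
    if ((p = strchr(upper, c)) != NULL)
        return upper[(p - upper + 13) % 26];
    return c;                       /* non-letters pass through unchanged */
}

/* Non-portable: assumes 'a'..'z' and 'A'..'Z' are consecutive, as in ASCII. */
char rot13_ascii(char c)
{
    if (c >= 'a' && c <= 'z')
        return (char)('a' + (c - 'a' + 13) % 26);
    if (c >= 'A' && c <= 'Z')
        return (char)('A' + (c - 'A' + 13) % 26);
    return c;
}
```

The portable version pays for a linear search on every character, which is the kind of overhead being referred to; the ASCII version would misbehave on EBCDIC, where the letters are not contiguous.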

In <0o***********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:
Dan Pop <Da*****@cern.ch> wrote [quoting me]:

There were very good reasons for the restriction in C89.


This statement is worth zilch without an enumeration of the "very good
reasons". Unlike JW, I'm completely immune to the "magister dixit" style
of argumentation.


Bully for you. This isn't my area of expertise, thus the appeal to
authority. P. J. Plauger alludes to the kinds of problems it was
intended to address in his discussion of the _Printf function in "The
Standard C Library".

The fundamental issue is how to recognize a "%" in the format string.


And the trivial solution is btowc(), rather than imposing even *more*
conditions on the encoding of the character sets used by a conforming
implementation.

It doesn't look like the design of btowc() was beyond the capabilities of
the X3J11 committee, and its necessity is obvious, given the restrictions
of use of mbtowc().

But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #33

Dan Pop <Da*****@cern.ch> wrote:

In <0o***********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:

The fundamental issue is how to recognize a "%" in the format string.
And the trivial solution is btowc(), rather than imposing even *more*
conditions on the encoding of the character sets used by a conforming
implementation.


btowc() didn't exist in C90 (it was added in AM1), so it hardly
qualifies as a "trivial solution". (And I'm not sure what you mean by
"imposing even *more* conditions on the encoding", C imposes very few
conditions.)
It doesn't look like the design of btowc() was beyond the capabilities of
the X3J11 committee, and its necessity is obvious, given the restrictions
of use of mbtowc().
No, its necessity was *not* obvious -- the restriction served the
purpose just as well. The committee did not have sufficient expertise
in this area to be comfortable inventing a complete solution. The small
group of experts advising us recommended that we adopt just the minimum
set of basic capabilities and they would then go off and consider a more
complete solution to be adopted as an amendment later. They had no
problem with the restriction, nor did they advise removing it in the
amendment that they ultimately produced. As I've said numerous times
now to no avail, at the time, *NO ONE* had even contemplated an
environment where the restriction was a problem. Much like the
restriction on the encoding of the digits, it was viewed as recognizing
the way the world worked; no one expected anyone to seriously propose an
encoding that would run afoul of it.
But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.


That's true for printf() and friends, but that was just an *example* of
the kinds of problems the restriction was intended to address, it was
not the sole problem. User code (particularly third-party library code)
cannot so easily avoid the state problems.

-Larry Jones

I can feel my brain beginning to atrophy already. -- Calvin
Nov 13 '05 #34


"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
[...]

But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.


How could it be safe without saving and restoring the state
information, if a user interleaves a call to printf() between
two calls to mbtowc(), the latter of which depends on the state
changed by the former?
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #35

In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:be**********@sunnews.cern.ch...
[...]

But even mbtowc() could be safely used by printf for this purpose, right
before calling it on the first character of the format string, which
already assumes the initial shift state: converting % is not going to
cause any change of shift state.


How could it be safe without saving and restoring the state
information, if a user interleaves a call to printf() between
two calls to mbtowc(), the latter of which depends on the state
changed by the former?


We have already agreed that a portable implementation of printf *must*
use mbtowc to parse the format string, haven't we? Wasn't implementing
printf in a portable way *your* argument?

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #36

In <r7**********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:
Dan Pop <Da*****@cern.ch> wrote:

In <0o***********@cvg-65-27-189-87.cinci.rr.com> la************@eds.com writes:

The fundamental issue is how to recognize a "%" in the format string.
And the trivial solution is btowc(), rather than imposing even *more*
conditions on the encoding of the character sets used by a conforming
implementation.


btowc() didn't exist in C90 (it was added in AM1), so it hardly
qualifies as a "trivial solution".


I know that it didn't exist in C90, but this doesn't make it a less
trivial solution, as explained below. Once the problem was identified,
there were two solutions: the wrong one, which the committee chose, and
the correct one: provide the required conversion function.

(And I'm not sure what you mean by
"imposing even *more* conditions on the encoding", C imposes very few
conditions.)


It shouldn't impose *any*, because it claims that the issue is beyond its
scope. Yet, it imposes several conditions:

1. The encoding of any member of the base character set, when stored
in a char, has a non-negative value.

2. The digit characters have contiguous encodings.

3. The members of the base character set have the same value when encoded
as character constants, wide character constants and multibyte
characters in the initial shift state.

4. Whatever I can't remember or I'm not even aware of.
It doesn't look like the design of btowc() was beyond the capabilities of
the X3J11 committee, and its necessity is obvious, given the restrictions
of use of mbtowc().


No, its necessity was *not* obvious -- the restriction served the
purpose just as well.


The problem was well understood and choosing the restriction as its
solution was obviously the WRONG thing. For the reason already explained.

Since mbtowc was already in C89, adding btowc wouldn't have required any
extra amount of expertise or put a significant load on the implementor.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #37

Dan Pop <Da*****@cern.ch> wrote:

I know that it didn't exist in C90, but this doesn't make it a less
trivial solution, as explained below. Once the problem was identified,
there were two solutions: the wrong one, which the committee chose, and
the correct one: provide the required conversion function.
It must be nice to see everything in black and white and not have to
worry about those annoying shades of gray.
1. The encoding of any member of the base character set, when stored
in a char, has a non-negative value.
That's a restriction on the implementation of type char, not a
restriction on the character set.
2. The digit characters have contiguous encodings.
That is a restriction on the character set. It also happens to be a
very desirable characteristic of a coded character set; so desirable
that no one has ever reported meeting one that doesn't have it.
3. The members of the base character set have the same value when encoded
as character constants, wide character constants and multibyte
characters in the initial shift state.


Twenty years ago that appeared to fall into the same category as the
previous restriction. Today, it does not.

-Larry Jones

Geez, I gotta have a REASON for everything? -- Calvin
Nov 13 '05 #38


"Dan Pop" <Da*****@cern.ch> wrote in message news:bf**********@sunnews.cern.ch...
In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

[...]

How could it be safe without saving and restoring the state
information, if a user interleaves a call to printf() between
two calls to mbtowc(), the latter of which depends on the state
changed by the former?


We have already agreed that a portable implementation of printf *must*
use mbtowc to parse the format string, haven't we?


I've already said that an implementation is not allowed to use mbtowc
for that purpose. As said repeatedly C89's support for the extended
character set was not enough.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #39

In <eb***********@dhcp065-029-213-110.cinci.rr.com> la************@eds.com writes:
Dan Pop <Da*****@cern.ch> wrote:

I know that it didn't exist in C90, but this doesn't make it a less
trivial solution, as explained below. Once the problem was identified,
there were two solutions: the wrong one, which the committee chose, and
the correct one: provide the required conversion function.


It must be nice to see everything in black and white and not have to
worry about those annoying shades of gray.


Especially when they don't exist. If the btowc solution had any
drawbacks, you'd have a point. But since it's only the committee
solution that has drawbacks, you don't.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #40

In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:bf**********@sunnews.cern.ch...
In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:[...]
>
>How could it be safe without saving and restoring the state
>information, if a user interleaves a call to printf() between
>two calls to mbtowc(), the latter of which depends on the state
>changed by the former?


We have already agreed that a portable implementation of printf *must*
use mbtowc to parse the format string, haven't we?


I've already said that an implementation is not allowed to use mbtowc
for that purpose.


Then, what *exactly* was your point when you talked about implementing
printf with portable code in C89?
As said repeatedly C89's support for the extended
character set was not enough.


Which is hardly an excuse for the broken requirement that 'a' == L'a'
instead of providing a function converting characters to wide characters.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #41


"Dan Pop" <Da*****@cern.ch> wrote in message news:bf***********@sunnews.cern.ch...
In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes: [...]

I've already said that an implementation is not allowed to use mbtowc
for that purpose.


Then, what *exactly* was your point when you talked about implementing
printf with portable code in C89?


I've already said what I meant in some of my previous postings. To
answer this question makes the same discussion get started again.
As said repeatedly C89's support for the extended
character set was not enough.


Which is hardly an excuse for the broken requirement that 'a' == L'a'
instead of providing a function converting characters to wide characters.


The historical process for C89 can be an excuse, though I don't think
the requirement broken, considering that it didn't make any
*practical* problem at that time and that there was no objection
against it among the committee members.
--
Jun, Woong (my******@hanmail.net)
Dept. of Physics, Univ. of Seoul

Nov 13 '05 #42

In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:

"Dan Pop" <Da*****@cern.ch> wrote in message news:bf***********@sunnews.cern.ch...
In <bf**********@news.hananet.net> "Jun Woong" <my******@hanmail.net> writes:[...]
>
>I've already said that an implementation is not allowed to use mbtowc
>for that purpose.


Then, what *exactly* was your point when you talked about implementing
printf with portable code in C89?


I've already said what I meant in some of my previous postings. To
answer this question makes the same discussion get started again.


Your *attempt* to answer was too incoherent to be comprehensible.
>As said repeatedly C89's support for the extended
>character set was not enough.


Which is hardly an excuse for the broken requirement that 'a' == L'a'
instead of providing a function converting characters to wide characters.


The historical process for C89 can be an excuse,


Nope, it cannot. Either the committee wanted to add *working* support for
wide characters, in which case they should have done the right thing, or
they didn't, in which case they shouldn't have put wide characters at all
in the C standard.
though I don't think
the requirement broken, considering that it didn't make any
*practical* problem at that time
How do you know how many implementors have been inconvenienced by that
requirement?
and that there was no objection against it among the committee members.


Which proves exactly zilch: if none of them was *really* interested in
a proper solution, why would you expect any objections?

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 13 '05 #43

On Tue, 15 Jul 2003 22:13:04 GMT, la************@eds.com wrote:
Dan Pop <Da*****@cern.ch> wrote: <snip>
2. The digit characters have contiguous encodings.

And consecutive, which is actually a stronger requirement.
That is a restriction on the character set. It also happens to be a
very desirable characteristic of a coded character set; so desirable
that no one has ever reported meeting one that doesn't have it.

It's desirable, as are consecutive or at least ascending letters, only
for machine processing of the data represented therein, which was not
the purpose (or application) of many character codes. In fact, until
the rise of electronic digital computers, I believe pretty much the
only code for even limited processing (rather than transmission or
storage) of data was Hollerith card (which survives almost unchanged
as a subset of EBCDIC).

Important examples of nonconsecutive digit codes:

International Alphabet 2 aka "Baudot" code, 5-bits with 2 shift states
(letters and figures); the digits were the FIGS-shift of the top
letter row of the (US standard) keyboard QWERTYUIOP, which wouldn't
have been consecutive even if letter codes had been, which they
weren't: much like the telegraph and later radio "Morse" code, they
were originally designed to use fewest "mark" bits for the commonest
letters to reduce power usage over long wires. Used AIUI by Teletype
models prior to 33 (the first IA5/ASCII model), and in the Telex
public switched network into the '80s at least, even though probably
few if any of the terminals were still Teletypes, and many were
computers and not really terminals at all.

I think I still have some software containing Baudot/ASCII tables --
stored somewhere among a bunch of files in a now-unsupported format
written on 9-track magtape, if you can read that :-)

TeleTypeSetter or TTS, 6-bits, 2-shift but digits had their own codes
i.e. corresponding approximately to the 2 shifts of a 4-row typewriter
keyboard including both upper and lower case letters. Originally used
to operate Linotype machines in duplicate or remotely, used at least
into the '70s in a variety of pre-press equipment. Also chose fewest
'1' bits for commonest codes, but I think by this time the motivation
was more to reduce average punch wear.

I'm pretty sure Friden Flexowriter, a classic typewriter mechanism
(i.e. typebar basket and platen on moving carriage) modified into a
computer terminal, used a code that did not have consecutive digits,
but I didn't spend much time looking at its tapes and don't remember.

None of these was ever used, though, and I don't think would even
have been considered, as the internal code for a computer, so they are
irrelevant to the design of C or any other programming language.

- David.Thompson1 at worldnet.att.net
Nov 13 '05 #44
