tolower and Unicode

Hi all,

I want to know if tolower can handle Unicode characters (codepoints)
correctly. Is it specified by the Standard? If yes, then a reference to
the Standard would be appreciated. Or is it implementation-defined?

Thanks ..

Sep 25 '06 #1

In article <11*********************@b28g2000cwb.googlegroups.com>
Kelvin Moss <km**********@yahoo.com> wrote:
>I want to know if tolower can handle unicode characters (codepoints)
>correctly?

The tolower() function is only required to handle values in
the set {EOF, [0..UCHAR_MAX]} (typically -1 to 255 inclusive).
If the tolower() in a given implementation only *does* handle
those, it is clearly not going to handle Unicode codepoints
above #00ff.
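
As a minimal sketch of what that domain restriction means in practice: when lowercasing an ordinary char string, each char is usually cast to unsigned char first, so the argument stays within 0..UCHAR_MAX even where plain char is signed. The helper name str_tolower is hypothetical, not something from the Standard.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Lowercase a byte string in place.
   str_tolower is a hypothetical helper name used for illustration. */
void str_tolower(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* tolower() is only defined for EOF and values representable as
           unsigned char, so convert before the call; a negative plain
           char other than EOF would be outside that domain. */
        *s = (char)tolower((unsigned char)*s);
    }
}
```

This works one byte at a time, so it cannot see a multibyte encoding such as UTF-8 as whole code points.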

In C99, there is a towupper() function that handles wide
characters. If the implementation happens to use Unicode for
its wide characters (and of course supports C99 well enough),
this will do it; if not, it will not.
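
For the original tolower question, the wide-character counterpart is towlower(), declared in <wctype.h> next to towupper(). The following is only a sketch: the helper name wcs_tolower is hypothetical, and what happens to characters outside the basic set depends on the current LC_CTYPE locale and on whether the implementation's wchar_t values are Unicode code points (an implementation can advertise that by defining __STDC_ISO_10646__).

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase a wide string in place.
   wcs_tolower is a hypothetical helper name used for illustration. */
void wcs_tolower(wchar_t *ws)
{
    assert(ws != NULL);
    for (; *ws != L'\0'; ++ws) {
        /* towlower() returns its argument unchanged when the current
           locale defines no lowercase counterpart for it. */
        *ws = (wchar_t)towlower((wint_t)*ws);
    }
}
```

A program would normally call setlocale(LC_CTYPE, "") from <locale.h> first, so that the environment's locale rather than the default "C" locale decides the mappings.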

There is nothing forcing any given implementation to *not*
handle Unicode with toupper(), but there is nothing forcing it
to do so either.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Sep 25 '06 #2

On Mon, 25 Sep 2006 07:23:26 +0000, Chris Torek wrote:
<snip>
>In C99, there is a towupper() function that handles wide characters. If
>the implementation happens to use Unicode for its wide characters (and
>of course supports C99 well enough), this will do it; if not, it will
>not.
>
>There is nothing forcing any given implementation to *not* handle
>Unicode with toupper(), but there is nothing forcing it to do so either.

How might towupper() handle changing the case of the German eszett (funky
looking B to mine eyes)? In the lowercase variant it can be represented
by a single integer (32-bit, 64-bit, 1024-bit, w'ever). However, the
uppercase must be "SS", necessitating two integers of some type
(regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Point being, the standard C string manipulation interface CANNOT
fully support Unicode, and IMHO the above example is a rather trivial
proof of why ISO C cannot support Unicode at any level of sufficiency,
short of a scenario where "Unicode" is used as a fancy term for 7-bit
ASCII.
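
To make the interface problem concrete, here is a rough sketch of a string-level routine that can grow its output, which a character-in, character-out function like towupper() cannot do. It assumes wchar_t holds Unicode code points, and it special-cases only U+00DF; the function name wcs_toupper_expand and its return convention are hypothetical. A real implementation would be driven by the Unicode special-casing data rather than a single hard-coded case.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-case src into dst, expanding U+00DF (eszett) to "SS".
   Returns the number of wide characters the full result needs, not
   counting the terminator; a return value >= dstsize means the output
   was truncated. Hypothetical helper; assumes wchar_t is Unicode. */
size_t wcs_toupper_expand(wchar_t *dst, size_t dstsize, const wchar_t *src)
{
    size_t n = 0; /* length of the complete, untruncated result */

    assert(src != NULL);
    assert(dst != NULL || dstsize == 0);

    for (; *src != L'\0'; ++src) {
        if (*src == (wchar_t)0x00DF) {
            /* One input character becomes two output characters. */
            if (n + 1 < dstsize) dst[n] = L'S';
            ++n;
            if (n + 1 < dstsize) dst[n] = L'S';
            ++n;
        } else {
            /* Ordinary one-to-one mapping via the standard function. */
            if (n + 1 < dstsize) dst[n] = (wchar_t)towupper((wint_t)*src);
            ++n;
        }
    }
    if (dstsize > 0) {
        /* Always terminate, even when the output was truncated. */
        dst[n < dstsize ? n : dstsize - 1] = L'\0';
    }
    return n;
}
```

The caller has to supply a buffer larger than the input, or call once with dstsize 0 to learn the required length, which is exactly the kind of contract the single-character functions have no room for.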
Sep 26 '06 #3

On Mon, 25 Sep 2006 18:51:02 -0700, William Ahern wrote:
<snip>
>Point being, the standard C string manipulation interface CANNOT fully
>support Unicode, and IMHO the above example is a rather trivial proof of
>why ISO C cannot support Unicode at any level of sufficiency, short of a
>scenario where "Unicode" is used as a fancy term for 7-bit ASCII.

By "cannot support" I meant neither by its historical nor wide-character
interfaces. I did not mean to imply the impossibility of a Unicode string
library written in ISO C.

This isn't a problem of C, per se. Languages like Python and Java
have the same problem (partly inherited from C). The real issue
lies in the now faulty assumptions behind the functional interfaces.

http://www.unicode.org/faq/casemap_charprop.html

- Bill

Sep 26 '06 #4

>On Mon, 25 Sep 2006 07:23:26 +0000, Chris Torek wrote:
>>In C99, there is a towupper() function that handles wide characters. If
>>the implementation happens to use Unicode for its wide characters (and
>>of course supports C99 well enough), this will do it; if not, it will
>>not.

In article <pa****************************@25thandClement.com>
William Ahern <wi*****@25thandClement.com> wrote:
>How might towupper() handle changing the case of the German eszett (funky
>looking B to mine eyes)? In the lowercase variant it can be represented
>by a single integer (32-bit, 64-bit, 1024-bit, w'ever). However, the
>uppercase must be "SS", necessitating two integers of some type
>(regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Indeed. According to the Standard, it will have to leave the
lowercase eszett as a lowercase eszett, there being no uppercase
character equivalent.

>Point being, the standard C string manipulation interface CANNOT
>fully support Unicode ...

Well, in the sense of "completely translating a lowercase string
to an equivalent (but longer) uppercase string", no. But towupper()
(and indeed plain toupper() as well) *could* do the job of "translating
a lowercase Unicode character to its (single) uppercase equivalent,
where that exists". Of course, as I said above, there is not even
a guarantee that wide strings and wchar_t characters use Unicode
at all.
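
A small sketch of that per-character job, for illustration only: it upper-cases a wide string in place and counts the lowercase characters that towupper() handed back unchanged, which is what would happen to the eszett. The name wcs_toupper_inplace is hypothetical, and which characters count as lowercase, or have an uppercase partner, depends entirely on the current locale and on the wchar_t encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-case a wide string in place, one character at a time.
   Returns how many lowercase characters had no single-character
   uppercase equivalent and were therefore left as they were.
   wcs_toupper_inplace is a hypothetical helper name. */
size_t wcs_toupper_inplace(wchar_t *ws)
{
    size_t unchanged = 0;

    assert(ws != NULL);
    for (; *ws != L'\0'; ++ws) {
        wint_t c = (wint_t)*ws;
        wint_t u = towupper(c);
        /* towupper() returns its argument when no mapping exists. */
        if (u == c && iswlower(c)) {
            ++unchanged;
        }
        *ws = (wchar_t)u;
    }
    return unchanged;
}
```

The string never changes length here, so the result can only ever be the character-by-character approximation, not a full uppercase conversion.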
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Oct 6 '06 #5