RFC: the state of charset support in C

I've spent the last two days delving into the state of charset support
in C, and I wrote a blog post summarizing my findings.

http://blog.reverberate.org/2007/04/...ntures-with-c/

I'm new to this stuff, and I would very much appreciate gentle
corrections about any mistakes or misconceptions I've made!

I'd also appreciate hearing about the situation on Windows and more
obscure UNIXes. For example, is iconv() available to Windows
programmers?

Thanks,
Josh

Apr 21 '07 #1
On 21 Apr 2007 14:25:19 -0700, Joshua Haberman <jh*******@gmail.com>
wrote in comp.lang.c:
> I've spent the last two days delving into the state of charset support
> in C, and I wrote a blog post summarizing my findings.
>
> http://blog.reverberate.org/2007/04/...ntures-with-c/
>
> I'm new to this stuff, and I would very much appreciate gentle
> corrections about any mistakes or misconceptions I've made!

To tell you the truth, the biggest misconception you have made is that
this is topical on comp.lang.c, because it is not.

> I'd also appreciate hearing about the situation on Windows and more
> obscure UNIXes. For example, is iconv() available to Windows
> programmers?

A simple check of the C standard would have told you that it contains
no header named iconv.h or function named iconv(). Since it is a
non-standard (from a C point of view) extension, it is not topical
here, and what platforms might have such an extension, and what that
extension might do, is for platform specific groups.

C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html
Apr 22 '07 #2
Joshua Haberman <jh*******@gmail.com> wrote:
> I've spent the last two days delving into the state of charset support
> in C, and I wrote a blog post summarizing my findings.
> http://blog.reverberate.org/2007/04/...ntures-with-c/
> I'm new to this stuff, and I would very much appreciate gentle
> corrections about any mistakes or misconceptions I've made!

Charset support is not comprehensive, and frankly broken IMHO given the
apparent intentions.

The wide-character interface is insufficient, because the routines available
pre-suppose qualities of a character set that many character sets are unable
to abide by. In many cases, you cannot make critical determinations (like
"isalpha") given solely a single wchar_t object (regardless of the width). I
suggest you spend some time over at unicode.org understanding the issues
yourself.

For comprehensive, correct and arguably portable character set manipulation,
I suggest the ICU library. But that's off-topic here. That aside, you can
muddle through using standard C interfaces if you cut back on your
requirements; i.e. increase the level of opacity such that you don't need to
make certain distinctions (like isalpha/iswalpha), and/or employ UTF-8 in
such a way that it works within the confines of the "C" locale. I said
"muddle", but maybe that's an unnecessarily derogatory characterization
since adjusting scope is often the best way to address an issue. I simply
mean to dispossess those who believe standard C really can support
comprehensive I18N text manipulation of that notion.

Apr 22 '07 #3
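
To make the "employ UTF-8 within the C locale" suggestion above a bit more concrete, here is a minimal sketch that treats the text as opaque bytes and only relies on the structure of the encoding. The helper name utf8_count_code_points is made up for illustration, and the input is assumed to be valid UTF-8 already; nothing is validated and no locale-dependent function is called.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a standard function): counts the code points in
   a NUL-terminated string that is ASSUMED to be valid UTF-8.
   No validation is performed and no locale function is used. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Go through unsigned char so bytes >= 0x80 are never negative. */
        unsigned char byte = (unsigned char)*s;

        /* Continuation bytes have the form 10xxxxxx; any other byte
           starts a new code point. */
        if ((byte & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For example, utf8_count_code_points("h\xC3\xA9llo") counts five code points in six bytes. Note what this deliberately does not attempt: classifying characters, case mapping, or counting user-perceived characters. Those are the distinctions for which something like ICU was recommended above.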
Jack Klein wrote:
> [...]
> C guarantees for 8-bit characters having numeric values in the range
> of 0 to 255 inclusive. It allows, but does not require, support for
> wider character types. Everything else is implementation-defined.

<pedant>
Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,
as it doesn't impose a signedness on unadorned "char"?
</pedant>

--
+-------------------------+--------------------+-----------------------+
| Kenneth J. Brody        | www.hvcomputer.com | #include              |
| kenbrody/at\spamcop.net | www.fptech.com     |    <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------+
Don't e-mail me at: <mailto:Th*************@gmail.com>
Apr 23 '07 #4
Kenneth Brody said:
> Jack Klein wrote:
> [...]
>> C guarantees for 8-bit characters having numeric values in the range
>> of 0 to 255 inclusive. It allows, but does not require, support for
>> wider character types. Everything else is implementation-defined.
>
> <pedant>
> Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,

ITYM -127

> as it doesn't impose a signedness on unadorned "char"?

It guarantees the existence of unsigned char, however.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 23 '07 #5
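
The ranges being argued about above can be checked on any given implementation with <limits.h>. What follows is only a sketch: the function names are invented for illustration, and the asserts encode the minimum magnitudes the standard requires (at least -127..127 for signed char, at least 0..255 for unsigned char), not what a particular compiler happens to provide.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Illustrative helper: nonzero if plain char is a signed type here.
   The standard leaves this choice to the implementation. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Illustrative helper: prints the character limits of this implementation. */
void print_char_limits(void)
{
    /* Minimum requirements that hold on every conforming implementation. */
    assert(CHAR_BIT >= 8);
    assert(SCHAR_MIN <= -127 && SCHAR_MAX >= 127);
    assert(UCHAR_MAX >= 255);

    printf("CHAR_BIT      = %d\n", CHAR_BIT);
    printf("char          = %ld .. %ld (%s)\n",
           (long)CHAR_MIN, (long)CHAR_MAX,
           char_is_signed() ? "signed" : "unsigned");
    printf("signed char   = %ld .. %ld\n", (long)SCHAR_MIN, (long)SCHAR_MAX);
    printf("unsigned char = 0 .. %lu\n", (unsigned long)UCHAR_MAX);
}
```

Calling print_char_limits() from main() on a typical two's complement PC compiler would be expected to show SCHAR_MIN as -128, which is more than the standard demands; code that depends on that extra value is relying on the implementation rather than on the standard.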
Richard Heathfield wrote:
> Kenneth Brody said:
>> Jack Klein wrote:
>> [...]
>>> C guarantees for 8-bit characters having numeric values in the range
>>> of 0 to 255 inclusive. It allows, but does not require, support for
>>> wider character types. Everything else is implementation-defined.
>> <pedant>
>> Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,
>
> ITYM -127

So the standard says nothing about what 0x80 means as a signed char?

I suppose that's allowed to be a trap value?

>> as it doesn't impose a signedness on unadorned "char"?
>
> It guarantees the existence of unsigned char, however.

True.

--
+-------------------------+--------------------+-----------------------+
| Kenneth J. Brody        | www.hvcomputer.com | #include              |
| kenbrody/at\spamcop.net | www.fptech.com     |    <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------+
Don't e-mail me at: <mailto:Th*************@gmail.com>
Apr 23 '07 #6
Kenneth Brody said:
> Richard Heathfield wrote:
>> Kenneth Brody said:
>>> Doesn't the Standard guarantee 0 through 255, _or_ -128 through
>>> 127,
>>
>> ITYM -127
>
> So the standard says nothing about what 0x80 means as a signed char?

Well, no, not really. Presumably on typical two's complement "def char
is signed" systems it'll evaluate to -128 (and this does seem to be
what happens in practice), on ones' complement it'll be - um - whatever
it is :-) - and on sign-and-magnitude it'll be -0. In each case, it is
possible for the implementation to ascribe a meaning to 0x80, but it
needn't necessarily be the same meaning on each platform.

<snip>

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 23 '07 #7
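
For anyone who wants to see what the bit pattern discussed above means on their own system, a small sketch follows. It assumes CHAR_BIT is 8 and that reading the resulting signed char is harmless on the implementation in question (which is exactly the point being debated in this thread). The function name is invented for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper: stores the bit pattern 0x80 in a signed char via
   memcpy (so no value conversion takes place) and returns whatever value
   the implementation assigns to that representation.
   ASSUMPTION: the pattern is not a trap representation here. */
int value_of_pattern_0x80(void)
{
    unsigned char bits = 0x80;
    signed char sc;
    int value;

    memcpy(&sc, &bits, 1);   /* copy the representation, not the value */
    value = sc;

    /* Whatever the meaning, it has to be a signed char value. */
    assert(value >= SCHAR_MIN && value <= SCHAR_MAX);
    return value;
}
```

With 8-bit chars the result would be -128 on two's complement, -127 on ones' complement, and zero (from negative zero) on sign-and-magnitude. If char is wider than 8 bits, the pattern is simply +128.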
Kenneth Brody <ke******@spamcop.net> writes:
> Richard Heathfield wrote:
>> Kenneth Brody said:
>>> Jack Klein wrote:
>>> [...]
>>>> C guarantees for 8-bit characters having numeric values in the range
>>>> of 0 to 255 inclusive. It allows, but does not require, support for
>>>> wider character types. Everything else is implementation-defined.
>>>
>>> <pedant>
>>> Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,
>>
>> ITYM -127
>
> So the standard says nothing about what 0x80 means as a signed char?

Correct. The standard allows signed integers to be represented in
two's-complement, one's-complement, or signed-magnitude. (That's
in C99; I think C90 was more vague).

> I suppose that's allowed to be a trap value?

unsigned char is not allowed to have trap values; I *think* the
standard may make a similar statement about signed char. (If so, I'm
sure someone will provide chapter and verse shortly.)

But in general, yes, signed types are allowed to have trap values,
though if 0x8000 isn't simply -32768 it's more likely to be -0.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 23 '07 #8
Kenneth Brody <kenbr...@spamcop.net> wrote:
> Richard Heathfield wrote:
>> Kenneth Brody said:
>>> Jack Klein wrote:
>>>> C guarantees for 8-bit characters having numeric values
>>>> in the range of 0 to 255 inclusive. It allows, but
>>>> does not require, support for wider character types.
>>>> Everything else is implementation-defined.
>>>
>>> Doesn't the Standard guarantee 0 through 255, _or_ -128
>>> through 127,
>>
>> ITYM -127
>
> So the standard says nothing about what 0x80 means as a signed
> char?

No.

> I suppose that's allowed to be a trap value?

That's debatable. 6.2.6.1p5 says...

Certain object representations need not represent a value of
the object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. ...

There are two points of view on how to interpret this:

1) This implies that the value of such representation for
signed character types is merely unspecified. [Non-
trapping trap representations!]

2) Since the standard fails to define the behaviour for
character types, they too invoke undefined behaviour.
[Although you have to question why the standard was so
explicit.]

Popular view seems to be that the standard could be better
written in this regard, that point 2 applies, and that it's
therefore not worth 'fixing' since nothing is actually
'broken' under that view.

--
Peter

Apr 23 '07 #9
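
Whichever reading of 6.2.6.1p5 quoted above is preferred, one thing is not in dispute in this thread: unsigned char has no trap representations, so inspecting an object byte by byte through unsigned char is safe. A minimal sketch, with an invented helper name and assuming 8-bit bytes for the two-digit formatting:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the object representation of `obj` as
   lowercase hex into `out`, which must hold at least 2 * size + 1 chars.
   ASSUMPTION: bytes are 8 bits wide; each byte is masked to 8 bits so the
   output never exceeds two digits per byte. */
void hex_of_object(const void *obj, size_t size, char *out)
{
    /* Reading through unsigned char is defined for any object. */
    const unsigned char *p = obj;
    size_t i;

    assert(obj != NULL && out != NULL);
    for (i = 0; i < size; i++)
        sprintf(out + 2 * i, "%02x", (unsigned)(p[i] & 0xFFu));
    out[2 * size] = '\0';
}
```

Passing the address of any object together with its sizeof dumps the representation without ever reading the object through its own type. The byte order of multi-byte objects depends on the platform.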
On Apr 24, 4:46 am, Kenneth Brody <kenbr...@spamcop.net> wrote:
> So the standard says nothing about what 0x80 means as a signed char?

A signed char cannot have the value of 0x80 (assuming SCHAR_MAX
to be 0x7F).

Assigning the value of 0x80 to a signed char would cause
implementation-defined behaviour.

> I suppose that's allowed to be a trap value?

There are no trap values. There are only trap representations.

I suppose you mean to ask, what happens when you interpret
the representation 0x80 as signed char? As other posters noted,
it could be a value of some sort or it could be a trap representation.

I don't see any text that describes whether the value is
implementation-defined or merely unspecified; nor any text describing
what happens if you evaluate that value. The section about reading trap
reps is quite clear that it only says the behaviour is undefined if the
type is not a character type.

Apr 25 '07 #10
Old Wolf said:
> On Apr 24, 4:46 am, Kenneth Brody <kenbr...@spamcop.net> wrote:
>> So the standard says nothing about what 0x80 means as a signed char?
>
> A signed char cannot have the value of 0x80 (assuming SCHAR_MAX
> to be 0x7F).

Portably speaking, you're right. But non-portably, implementations may
allow a signed char on an 8-bit-char system to be 0x80. Typical PC
implementations set SCHAR_MIN to -128. (For example, Turbo C, Borland C
(IIRC), Microsoft C, gcc...)

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.
Apr 25 '07 #11
Richard Heathfield <rj*@see.sig.invalid> writes:
> Old Wolf said:
>> On Apr 24, 4:46 am, Kenneth Brody <kenbr...@spamcop.net> wrote:
>>> So the standard says nothing about what 0x80 means as a signed char?
>>
>> A signed char cannot have the value of 0x80 (assuming SCHAR_MAX
>> to be 0x7F).
>
> Portably speaking, you're right. But non-portably, implementations may
> allow a signed char on an 8-bit-char system to be 0x80. Typical PC
> implementations set SCHAR_MIN to -128. (For example, Turbo C, Borland C
> (IIRC), Microsoft C, gcc...)

If you assume that "0x80" refers to a bit pattern (a representation),
that's true. But in C, 0x80 is an integer literal with the value
+128. If SCHAR_MAX is 0x7F (+127), then a signed char cannot have the
value of 0x80 (+128).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Apr 25 '07 #12
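
The value-versus-representation distinction drawn above can be sketched in a few lines. The helper names are invented for illustration. The conversion function is well defined only for values that fit; for anything else the result is implementation-defined (C99 also permits an implementation-defined signal), so treat its out-of-range result as something to observe, not something to rely on.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative helper: nonzero if `value` is representable as signed char. */
int fits_in_signed_char(int value)
{
    /* The literal 0x80 has type int and the value +128. */
    assert(sizeof 0x80 == sizeof(int));
    return value >= SCHAR_MIN && value <= SCHAR_MAX;
}

/* Illustrative helper: converts `value` to signed char and back to int.
   If the value does not fit, the result is implementation-defined. */
int convert_to_signed_char(int value)
{
    return (signed char)value;
}
```

With SCHAR_MAX equal to 0x7F, fits_in_signed_char(0x80) is zero, and convert_to_signed_char(0x80) commonly produces -128 on two's complement implementations. That is a property of those implementations and not a guarantee of the language.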
