Bytes IT Community

Any foreseeable disasters?

Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this, but wchar_t isn't guaranteed to be 32-Bit.

Are there any foreseeable disasters in putting this at the
beginning of your translation unit:

#define wchar_t unsigned long

The only one I can think of is function overloading:

void Blah(wchar_t) {}
void Blah(unsigned long) {}
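A minimal sketch of that collision (the names are the OP's; note the macro is defined *after* the standard header, since a translation unit that defines a keyword macro and includes a standard header is not allowed):

```cpp
#include <cassert>  // standard header first: defining a keyword macro in a
                    // TU that includes a standard header is forbidden

#define wchar_t unsigned long  // the OP's macro (don't do this in real code)

// After preprocessing this is Blah(unsigned long); pairing it with an
// explicit Blah(unsigned long) would be a redefinition, not an overload.
unsigned long Blah(wchar_t c) { return c + 1; }
// unsigned long Blah(unsigned long c) { return c; }  // error: redefinition

#undef wchar_t  // confine the damage to this sketch
```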

-JKop
Jul 22 '05 #1
33 Replies



"JKop" <NU**@NULL.NULL> wrote in message
news:EB******************@news.indigo.ie...
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this,
Unicode uses sixteen bits. So since a byte must be at least
eight bits wide, any two byte sequence is large enough (and
might be larger) to represent any Unicode character.
but wchar_t isn't guaranteed to be 32-Bit.
Right.

Are there any foreseeable disasters in putting this at the
beginning of your translation unit:

#define wchar_t unsigned long


Yes. You're not allowed to #define a keyword.

-Mike
Jul 22 '05 #2

JKop wrote:
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this, but wchar_t isn't guaranteed to be 32-Bit.

Are there any foreseeable disasters in putting this at the
beginning of your translation unit:

#define wchar_t unsigned long

The only one I can think of is function overloading:

void Blah(wchar_t) {}
void Blah(unsigned long) {}

-JKop


Unicode has the following encodings

utf-7 - multibyte but no bytes have values > 127
utf-8 - multibyte (1-6 bytes per char)
utf-16 - 16 bit "code" - multi-value code points - see "surrogate pairs"
- utf-16 covers 2^20 + 2^16 code points
ucs-4 - 32 bit codes

On most platforms where sizeof(wchar_t)==4, the wchar_t encoding is
ucs-4 while cases where sizeof(wchar_t)==2, the encoding is utf-16.

It's just so much easier to deal with utf-8 for other reasons as well.

Endianness of utf-16 and ucs-4 means that the encoding is stateful, which
makes for all kinds of issues when reading and writing files.

Consider using utf-8. It might mean that you don't need to do anything!
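As a sketch of why byte order never comes up with utf-8, here is a hypothetical one-code-point encoder (the function name ToUtf8 is mine, not from any library; it assumes a valid scalar value below 0x110000 and does no surrogate checking):

```cpp
#include <string>

// Encode a single Unicode code point as a UTF-8 byte sequence.
// The output is a byte stream, so it reads the same on any endianness.
std::string ToUtf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes, up to U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```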

G
Jul 22 '05 #3

On Saturday, 7 August 2004 at 22:25:55, Mike Wahler wrote in
comp.lang.c++:
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this,


Unicode uses sixteen bits. So since a byte must be at least
eight bits wide, any two byte sequence is large enough (and
might be larger) to represent any Unicode character.


Wrong. There are about 95,000 characters in Unicode today. How do you
fit all of them in 16 bits?

<http://www.unicode.org/versions/Unicode4.0.0/>

#define wchar_t unsigned long


Yes. You're not allowed to #define a keyword.


Wrong again. Never seen the following?

#define for if (0) {} else for

--
___________ 2004-08-07 23:32:29
_/ _ \_`_`_`_) Serge PACCALIN -- sp ad mailclub.net
\ \_L_) Men must therefore first stop
-'(__) being fanatics in order to deserve
_/___(_) tolerance. -- Voltaire, 1763
Jul 22 '05 #4

"Serge Paccalin" <sp@mailclub.no.spam.net.invalid> wrote in message
news:pw***************@canttouchthis-127.0.0.1...
#define wchar_t unsigned long


Yes. You're not allowed to #define a keyword.


Wrong again. Never seen the following?

#define for if (0) {} else for


You can legally define a keyword, but you can't legally define a keyword *and*
include any standard headers (17.4.3.1.1/2 - "A translation unit that includes a
header shall not contain any macros that define names declared or defined in
that header. Nor shall such a translation unit define macros for names
lexically identical to keywords.").

Regards,
Paul Mensonides
Jul 22 '05 #5

Mike Wahler posted:

"JKop" <NU**@NULL.NULL> wrote in message
news:EB******************@news.indigo.ie...
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer
for this,
Unicode uses sixteen bits. So since a byte must be at least eight bits wide, any two byte sequence is large enough (and might be larger) to represent any Unicode character.

Minimum bitness of wchar_t = 8 bits.

Minimum range of wchar_t = 0 to 127.
-JKop

Jul 22 '05 #6

JKop wrote:
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this, but wchar_t isn't guaranteed to be 32-Bit.

Are there any foreseeable disasters in putting this at the
beginning of your translation unit:

#define wchar_t unsigned long


Yes. wchar_t is a built-in type, so the above is asking for trouble. Not
to mention that it is not needed in the first place, since on most
systems wchar_t is sufficient to store Unicode characters. After all, it
was wide character sets it was created for.


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #7

JKop wrote:
Minimum bitness of wchar_t = 8 bits.

Minimum range of wchar_t = 0 to 127.

Actually it is sizeof(char) <= sizeof(wchar_t) <= sizeof(long)

From TC++PL 3 on page 72:
"A type wchar_ t is provided to hold characters of a larger character
set such as Unicode. It is a distinct type. The size of wchar_ t is
implementation-defined and large enough to hold the largest character
set supported by the implementation’s locale (see §21.7, §C.3.3). The
strange name is a leftover from C. In C, wchar_ t is a typedef (§4.9.7)
rather than a built-in type. The suffix _t was added to distinguish
standard typedefs."
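A quick way to see what a particular implementation chose for wchar_t (the helper names here are mine, for illustration only):

```cpp
#include <cstddef>
#include <limits>

// How wide is wchar_t on this implementation?
std::size_t WcharBytes() { return sizeof(wchar_t); }

// Did the implementation pick a signed underlying type?
bool WcharIsSigned() { return std::numeric_limits<wchar_t>::is_signed; }

// Can it hold every Unicode scalar value (up to U+10FFFF)?
bool WcharHoldsAllOfUnicode() {
    return static_cast<unsigned long long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFUL;
}
```

On a platform where sizeof(wchar_t) == 2, WcharHoldsAllOfUnicode() will report false, which is exactly the situation the original question worries about.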


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #8

JKop wrote:
Minimum bitness of wchar_t = 8 bits.

Minimum range of wchar_t = 0 to 127.


Fixed some spaces and added some asterisks:


Actually it is sizeof(char) <= sizeof(wchar_t) <= sizeof(long)

From TC++PL 3 on page 72:
"A type wchar_t is provided to hold characters of a larger character set
such as Unicode. It is a distinct type. The size of wchar_t is
implementation-defined and *large enough* to hold the *largest character
set* supported by the implementation’s locale (see §21.7, §C.3.3). The
strange name is a leftover from C. In C, wchar_t is a typedef (§4.9.7)
rather than a built-in type. The suffix _t was added to distinguish
standard typedefs."


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #9

The smallest integral type with the smallest range is:

char

8-bit
0 to 127

The type wchar_t maps to one of the other integral types. It can map to
char. As such:

wchar_t
8-Bit
0 to 127
signed main()
{
wchar_t c = 127;

++c;

//The above statement is implementation-defined.
//It may cause undefined behaviour.
}
-JKop
Jul 22 '05 #10

JKop wrote:
The smallest integral type with the smallest range is:

char

8-bit
0 to 127

Yes but please pay attention to this particular sentence:

"The size of wchar_t is implementation-defined and *large enough* to
hold the *largest character set* supported by the implementation’s locale".

The type wchar_t maps to one of the other integral types. It can map to
char. As such:

wchar_t
8-Bit
0 to 127
signed main()


Strictly speaking, I am not sure that the above can be considered
well-defined and portable, since main() is a special function.


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #11


Yes. You're not allowed to #define a keyword.


Wrong again. Never seen the following?

#define for if (0) {} else for

More correctly stated, it is undefined behavior to redefine a keyword
if you use any of the standard headers.

Jul 22 '05 #12

Ioannis Vranos posted:
Strictly speaking I am not sure that the above can be considered
well-defined and portable, since main() is a special function.


int main() {}

signed int main() {}

signed main() {}

int main(void) {}

signed int main(void) {}

signed main(void) {}
The above 6 are identical.
-JKop

Jul 22 '05 #13

JKop wrote:
int main() {}

signed int main() {}

signed main() {}

int main(void) {}

signed int main(void) {}

signed main(void) {}
The above 6 are identical.

Actually, int, signed int and signed are the same type. However, since
main() is *not* a normal function and only the int form is mentioned in
the standard, I do not know for sure that they are all valid. Perhaps
someone else can shed more light on this.


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #14


"Ioannis Vranos" <iv*@guesswh.at.grad.com> wrote in message news:cf***********@ulysses.noc.ntua.gr...
Actually, int, signed int, signed are the same type. However since

main() is *not* a normal function and in the standard only the int one
is mentioned, I do not know for sure that they are the same. However
they may be, perhaps someone else can shed more light on this.


The standard says that main must return int (all of the alternatives listed
meet this requirement) and otherwise as long as the two specific signatures
provided in the standard are supported, it is implementation defined.

That is, the standard says what the return type must be, NOT what the
declaration must look like.

Jul 22 '05 #15

Ron Natalie wrote:
The standard says that main must return int (all of the alternatives listed
meet this requirement) and otherwise as long as the two specific signatures
provided in the standard are supported, it is implementation defined.

That is, the standard says what the return type must be, NOT what the
declaration must look like.


Ok, then signed main() is valid. :-)


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #16

"JKop" <NU**@NULL.NULL> wrote in message
news:1L******************@news.indigo.ie...
Ioannis Vranos posted:
Strictly speaking I am not sure that the above can be considered
well-defined and portable, since main() is a special function.


int main() {}

signed int main() {}

signed main() {}

int main(void) {}

signed int main(void) {}

signed main(void) {}
The above 6 are identical.


Well, except when considering the number of keystrokes required to type
each of them and the probability that using each of them will cause
confusion. They all may do the same thing, but int main() or int
main(void) look like the best choices when you take that information
into consideration.

--
David Hilsee
Jul 22 '05 #17

> Well, except when considering the number of keystrokes required to
type each of them and the probability that using each of them will cause
confusion. They all may do the same thing, but int main() or int
main(void) look like the best choices when you take that information
into consideration.


I myself prefer:

signed main()
{

}
-JKop
Jul 22 '05 #18

"JKop" <NU**@NULL.NULL> wrote in message
news:k4******************@news.indigo.ie...
Well, except when considering the number of keystrokes required to type
each of them and the probability that using each of them will cause
confusion. They all may do the same thing, but int main() or int
main(void) look like the best choices when you take that information
into consideration.


I myself prefer:

signed main()
{

}


But why? Do you just like being different from every other C++ programmer
on the planet?

--
David Hilsee
Jul 22 '05 #19

On Sat, 07 Aug 2004 22:43:59 GMT, JKop <NU**@NULL.NULL> wrote in
comp.lang.c++:
Mike Wahler posted:

"JKop" <NU**@NULL.NULL> wrote in message
news:EB******************@news.indigo.ie...
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer
for this,

Unicode uses sixteen bits. So since a byte must be at
least eight bits wide, any two byte sequence is large
enough (and might be larger) to represent any Unicode
character.

Minimum bitness of wchar_t = 8 bits.

Minimum range of wchar_t = 0 to 127.


No, minimum range of wchar_t must be the same as the minimum range of
char. And that must be either -127 to 127, or 0 to 255. There is no
integer type in C++ which may have a range of only 0 to 127.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html
Jul 22 '05 #20

On Sun, 08 Aug 2004 03:05:08 +0300, Ioannis Vranos
<iv*@guesswh.at.grad.com> wrote in comp.lang.c++:
JKop wrote:
Let's say you want to store a character of the Unicode
character system. You want a 32-Bit unsigned integer for
this, but wchar_t isn't guaranteed to be 32-Bit.

Are there any foreseeable disasters in putting this at the
beginning of your translation unit:

#define wchar_t unsigned long


Yes. wchar_t is a built-in type, so the above is asking for trouble. Not
to mention that it is not needed in the first place, since on most
systems wchar_t is sufficient to store Unicode characters. After all, it
was wide character sets it was created for.


Unicode was originally a 16-bit encoding, and quite a few
implementations provide a 16-bit wchar_t. This is most likely the
reason that Java's type 'char' was defined as 16 bits. But Unicode
has grown to more than 64K defined values, and can no longer fit into
individual 16-bit types without state dependent encoding.

Is there some reason why you suddenly feel the need to add so much
superfluous white space between the end of your text and your
signature line? Why don't you just learn to use a proper signature
delimiter, as specified by the appropriate RFCs? It is not hard at
all, I have been doing it for many years.

A proper signature line consists of the four character sequence:

'-', '-', ' ', '\n'

--
Jack Klein
Home: http://JK-Technology.Com
Jul 22 '05 #21

David Hilsee posted:
I myself prefer:

signed main()
{

}


But why? Do you just like being different from every other C++
programmer on the planet?

No, I like to be reminded that it's signed. Maybe my brain's wired a bit
weird, but when I don't see "signed", I tend to think that it's unsigned. I
would have assumed that that was the default, but of course it's not.
-JKop
Jul 22 '05 #22

Jack Klein posted:
Minimum bitness of wchar_t = 8 bits.

Minimum range of wchar_t = 0 to 127.


No, minimum range of wchar_t must be the same as the minimum range of
char. And that must be either -127 to 127, or 0 to 255. There is no
integer type in C++ which may have a range of only 0 to 127.


If you're intelligent enough to post that, you should be able to draw the
conclusion from it that I did.

It's implementation-defined whether or not a char is unsigned or signed.
Looking at the differences:

signed char : -127 to 127

unsigned char : 0 to 255
They overlap at 0 to 127. Consequently:
signed main()
{
char c = -5; //Implementation-defined

c = 130; //Implementation-defined

c = 0; //No problem

c = 127; //No problem

c = 128; //Implementation-defined

c = -1; //Implementation-defined
}
Therefore, the minimum range for char is 0 to 127. As wchar_t may be based
upon *any* of the integral types, it may be based on char, and as such its
minimum range is 0 to 127.
-JKop
Jul 22 '05 #23

"JKop" <NU**@NULL.NULL> wrote in message
news:wP******************@news.indigo.ie...
David Hilsee posted:
I myself prefer:

signed main()
{

}
But why? Do you just like being different from every other C++
programmer on the planet?

No, I like to be reminded that it's signed. Maybe my brain's wired a bit
weird, but when I don't see "signed", I tend to think that it's unsigned.

I would have assumed that that was the default, but of course it's not.


Your brain will probably get re-wired over time. In your example code in
another thread, you wrote "int" instead of "signed", so I bet the
assimilati... er, re-wiring has already begun. :-)

--
David Hilsee
Jul 22 '05 #24

David Hilsee posted:
No, I like to be reminded that it's signed. Maybe my brain's wired a
bit weird, but when I don't see "signed", I tend to think that it's
unsigned. I would have assumed that that was the default, but of course
it's not.


Your brain will probably get re-wired over time. In your example code
in another thread, you wrote "int" instead of "signed", so I bet the
assimilati... er, re-wiring has already begun. :-)

It just seems to me that positive numbers are much more the norm. Negative
numbers are "more special". Think about it, even in school, I didn't learn
about negative numbers until I was about 10 or 11. So from that, positive
numbers come first, then negative numbers. I would've made int unsigned, and
if you wanted a signed integer, then: signed int.

Anyway, looks like resistance is futile! :-D
-JKop

Jul 22 '05 #25

JKop wrote:
If you're intelligent enough to post that, you should be able to draw the
conclusion from it that I did.

Actually the range of char is either that of signed char or that of
unsigned char, and there is numeric_limits<char> to tell you which one is
currently implemented.

So Jack is right.



It's implementation-defined whether or not a char is unsigned or signed.
Looking at the differences:

signed char : -127 to 127

unsigned char : 0 to 255
They overlap at 0 to 127. Concordantly:

Consequently, forget the rest and use either signed char or unsigned char
explicitly if you want to be range-specific, or numeric_limits to make
run-time decisions.
Also wchar_t is not based on any integral type; it is a built-in type of
its own. So in theory it can have a value range different from the rest
of the types.

Also "The size of wchar_t is implementation-defined and *large enough*
to hold the *largest character set* supported by the implementation’s
locale", as mentioned in TC++PL, guarantees that you will never have
problems storing Unicode or any other wide character set supported by a
system.
The rest is nonsense.
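The numeric_limits approach mentioned above might be sketched like this (the helper names are mine, for illustration):

```cpp
#include <limits>

// Which flavour of plain char did the implementation pick?
bool CharIsSigned() { return std::numeric_limits<char>::is_signed; }

// The portable bounds follow from that choice at run time, rather
// than from restricting yourself to an artificial 0..127 overlap.
int CharMin() { return std::numeric_limits<char>::min(); }
int CharMax() { return std::numeric_limits<char>::max(); }
```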


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #26

Jack Klein wrote:
Unicode was originally a 16-bit encoding, and quite a few
implementations provide a 16-bit wchar_t. This is most likely the
reason that Java's type 'char' was defined as 16 bits. But Unicode
has grown to more than 64K defined values, and can no longer fit into
individual 16-bit types without state dependent encoding.

However in TC++PL is mentioned:

"The size of wchar_t is implementation-defined and *large enough* to
hold the *largest character set* supported by the implementation’s locale".

Isn't it valid?
Is there some reason why you suddenly feel the need to add so much
superfluous white space between the end of your text and your
signature line?

Yes, to occupy few more bytes in my messages and make you run out of
memory. :-)
Why don't you just learn to use a proper signature
delimiter, as specified by the appropriate RFCs? It is not hard at
all, I have been doing it for many years.

A proper signature line consists of the four character sequence:

'-', '-', ' ', '\n'

What if I make it "--\t\n"?


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #27

Actually the range of char is either that of signed char either that of
unsigned char. And there is numeric_limits<char> to know which one is
currently implemented.

So Jack is right.


I disagree.

The minimum range of a char is 0 to 127. By this I mean the following:

A) If you find a C++ compiler that cannot store the values from 0 to 127 in
a char, then you haven't got a C++ compiler.

B) If you find a C++ compiler that can store -3 in a char, then that's very
good, but the Standard provides no such assurance. If you find a C++
compiler that can store 130 in a char, then that's very good, but the
Standard provides no such assurance.

0 to 127 are the only values you can reliably store in a char when you're
writing portable code. As such 0 to 127 is the minimum range for a char.
Page 82 of the Standard:

3.9.1 Fundamental types

5 Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.1.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.9) as one of the other
integral types, called its underlying type.
My rationale:

A) char is an integral type

As such, wchar_t can possibly have the same size, signedness and alignment
requirements as char.

As such, the minimum range for wchar_t is 0 to 127.

As such, you cannot reliably store -1 in a wchar_t in portable code, nor can
you store 130 in a wchar_t in portable code.

As regards "the supported locales", the Standard gives no guarantee that
Unicode exists as a supported locale. As such, you cannot reliably use a
wchar_t to store a Unicode character when writing portable code.
-JKop
Jul 22 '05 #28


"Ioannis Vranos" <iv*@guesswh.at.grad.com> wrote in message news:cf**********@ulysses.noc.ntua.gr...
Also wchar_t is not based on any integral type; it is a built-in type of
its own. So in theory it can have a value range different from the rest
of the types.


Incorrect. While wchar_t is a distinct type (i.e., not a typedef, so that
it can participate in overloading distinctly from integers), it has the
same representation as some integral type (called its underlying type).

See 3.9.1/5
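A sketch of what that distinctness buys you: wchar_t overloads separately from its underlying type (the function name Which is mine, for illustration):

```cpp
// wchar_t participates in overload resolution as its own type, even
// though it shares representation with some integral underlying type.
int Which(wchar_t)       { return 1; }
int Which(unsigned long) { return 2; }
int Which(int)           { return 3; }

// A wide-character literal such as L'x' has type wchar_t and selects
// the first overload, regardless of what the underlying type is.
```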
Jul 22 '05 #29

JKop wrote:
As regards "the supported locales", the Standard gives no guarantee that
Unicode exists as a supported local. As such, you cannot reliably use a
wchar_t to store a Unicode character when writing portable code.

I can't understand what you mean by the above. How can you use Unicode
on a system that doesn't support it?


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #30

JKop wrote:

0 to 127 are the only values you can reliably store in a char when you're
writing portable code. As such 0 to 127 is the minimum range for a char.


Well...
CMIIW, but isn't it required that sizeof(char) <= sizeof(short) <=
sizeof(int) <= sizeof(long)?

This implies that the minimum range for _every_ integral numeric type is
0..127 (*). According to this, we don't really need distinct data types
at all: No integral type can be expected to store values outside the
range 0..127, so if we want to write /really/ portable code, we can
never use values outside this range anyway. It would even be impossible
to have a string that is longer than 127 characters... I think you'll
agree that programming under such restrictions is rather unpleasant
(especially if most numbers are >127).

All data types reflect what is supported (maybe "natural" is a better
word) on a machine. If you need to handle 32bit numbers on a 8bit
processor and the compiler does not support 32bit numbers or some
workarounds for that processor, you're out of luck.

IMHO there are no machine independent data types; just pick the data
type that is appropriate, i.e. char for strings, wchar_t for unicode
strings. Or simply make a typedef unsigned long my_wchar_t; if you
really need more than what your current platform offers.

What I'm trying to say: No matter what code you write, you always have
to know the platform that your code is supposed to run on. There is no
point in writing code that might compile or even run on a pocket
calculator if your program will later run only on high-end PCs.
(*) i.e. the overlapping range for signed/unsigned char. I don't know if
it's even smaller than that. What about a 4bit processor?

--
Regards,
Tobias
Jul 22 '05 #31

Tobias Güntner posted:
JKop wrote:

0 to 127 are the only values you can reliably store in a char when you're writing portable code. As such 0 to 127 is the minimum range for a char.

Well...
CMIIW, but isn't it required that sizeof(char) <= sizeof(short) <=
sizeof(int) <= sizeof(long)?

This implies that the minimum range for _every_ integral numeric type is
0..127. According to this, we don't really need distinct data types at
all.
Char: Minimum 8-Bit
Short: Minimum 16-Bit
Int: Minimum 16-Bit
Long: Minimum 32-Bit

The Standard says some bullshit like "the same minimums
as in Standard C, refer to chapter BLAH of the C
Standard". Standard C specifies the above limits.
Bullshit, I know; C++ and C are two separate languages.

-JKop

Jul 22 '05 #32

JKop wrote:
Char: Minimum 8-Bit
Short: Minimum 16-Bit
Int: Minimum 16-Bit
Long: Minimum 32-Bit

Where does the standard mention this?

The Standard says some bullshit like "the same minimums
from the Standard C, refer to chapter BLAH of the C
Standard". Standard C specifies the above limits.

C90 or C99? Because C++ retains C90 as a subset, except for the parts
where things are defined otherwise.


Regards,

Ioannis Vranos

http://www23.brinkster.com/noicys
Jul 22 '05 #33

Ioannis Vranos <iv*@guesswh.at.grad.com> wrote:
JKop wrote:
Char: Minimum 8-Bit
Short: Minimum 16-Bit
Int: Minimum 16-Bit
Long: Minimum 32-Bit


Where does the standard mention this?


It says INT_MIN <= -32767 and INT_MAX >= 32767, ie. there are at
least 65535 distinct values for int, therefore at least 16
bits of storage are required. Similar reasoning applies to
the other types.
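Those C90 minimums can be spelled out with <climits> (the constant names here are mine; the comparisons themselves are guarantees of any conforming implementation):

```cpp
#include <climits>

// Minimum widths implied by the C90 limits that C++ inherits:
// char >= 8 bits, short/int >= 16 bits, long >= 32 bits.
const bool kCharOk  = CHAR_BIT >= 8;
const bool kShortOk = SHRT_MAX >= 32767 && SHRT_MIN <= -32767;
const bool kIntOk   = INT_MAX  >= 32767 && INT_MIN  <= -32767;
const bool kLongOk  = LONG_MAX >= 2147483647L && LONG_MIN <= -2147483647L;
```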
Jul 22 '05 #34
