
How to initialize and print a Unicode character?

P: n/a
I want to try ANSI C99's Unicode functions, so I wrote a test program.
The program is simple, but I cannot compile it with Dev-C++ 4.9.9.2
under Windows XP SP2, because the compiler always reports that the
initialization of the wchar_t string is illegal. Here is my program:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>
#include <string.h>

int main(int argc, char *argv[])
{
    wchar_t *cur_buff=L"X";
    wprintf(cur_buff);
    return 0;
}

In the program, wchar_t *cur_buff is initialized with L"X". If X
is an ASCII character, everything works. But if X is a
non-ASCII character such as a Chinese character, the compiler reports
an illegal byte sequence. The source file is saved as ASCII
code, and the character set is GB2312. Why does this happen?

Jul 4 '06 #1
15 Replies


P: n/a
On 2006-07-04, wizardyhnr <wi********@gmail.com> wrote:
> i want to try ANSI C99's unicode fuctions. so i write a test program.
> the function is simple, but i cannot compile it with dev c++ 4.9.9.2
> under windows xp sp2, since the compiler always think that the
> initialization of the wchar_t string is illegal. here is my function:

Without looking at your actual problem, here's a few tips:
1) It's fairly unlikely that you actually have a C99 compiler.
2) It's very unlikely that something with the word "C++" in it
is even a C compiler, let alone a C99 compiler.

Other than that, we don't care what OS or platform you have. We discuss
standard C here, and that's platform independent.

--
Andrew Poelstra <http://www.wpsoftware.net/blog>
To email me, use "apoelstra" at the above address.
"You people hate mathematics." -- James Harris
Jul 4 '06 #2

P: n/a
* wizardyhnr:
> i want to try ANSI C99's unicode fuctions. so i write a test program.
> the function is simple, but i cannot compile it with dev c++ 4.9.9.2
> under windows xp sp2, since the compiler always think that the
> initialization of the wchar_t string is illegal. here is my function:
> #include <stdio.h>
> #include <stdlib.h>
> #include <wchar.h>
> #include <wctype.h>
> #include <string.h>
>
> int main(int argc, char *argv[])
> {
> wchar_t *cur_buff=L"X";
> wprintf(cur_buff);
> return 0;
> }
>
> in the function, the initialization of wchar_t *cur_buff is L"X", if X
> is an ascii character, then all things function well. But if X is
> non-ascii charater such as a Chinese character, compiler would alert
> that this is a illegal byte sequence. The source file is saved as ascci
> code, and the character set is gb2312. i wonder why this happens?

Don't know about C, but in C++ you'd have to put a 'const' in there,

wchar_t const* curr_buff = L"X";

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Jul 4 '06 #3

P: n/a
"Alf P. Steinbach" <al***@start.no> writes:
> * wizardyhnr:
>> wchar_t *cur_buff=L"X";
>
> Don't know about C, but in C++ you'd have to put a 'const' in there,
> wchar_t const* curr_buff = L"X";

Not in C.
--
"It wouldn't be a new C standard if it didn't give a
new meaning to the word `static'."
--Peter Seebach on C99
Jul 4 '06 #4

P: n/a
* Ben Pfaff:
> "Alf P. Steinbach" <al***@start.no> writes:
>> * wizardyhnr:
>>> wchar_t *cur_buff=L"X";
>> Don't know about C, but in C++ you'd have to put a 'const' in there,
>> wchar_t const* curr_buff = L"X";
>
> Not in C.

On second thought, perhaps not in C++ either (sorry for being a bit
fast). Haven't checked, and since this is a C newsgroup, won't do. The
C++ non-const possibility for char* is just for C compatibility.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
Jul 4 '06 #5

P: n/a
wizardyhnr said:
> i want to try ANSI C99's unicode [functions].

Unicode is not mentioned even once in my copy of the C99 Standard. On the
other hand, wide characters /are/ so mentioned, so let's assume you meant
that.

<snip>
> in the function, the initialization of wchar_t *cur_buff is L"X", if X
> is an ascii character, then all things function well. But if X is
> non-ascii charater such as a Chinese character, compiler would alert
> that this is a illegal byte sequence.

As long as the compiler (or, almost certainly, the preprocessor in this
case) supports the basic source character set, it remains within its rights
to reject any other characters it encounters within the source code.

You can, however, read information into a wchar_t from a file at run-time. I
suggest you explore that option.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Jul 4 '06 #6

P: n/a
Richard Heathfield wrote:
> wizardyhnr said:
>> i want to try ANSI C99's unicode [functions].
>
> Unicode is not mentioned even once in my copy of the C99 Standard. On the
> other hand, wide characters /are/ so mentioned, so let's assume you meant
> that.

Unicode is explicitly mentioned in TC2 in the description for
__STDC_ISO_10646__, and while the wording before TC2 does not mention
"Unicode", the differences between Unicode and ISO 10646 are not
relevant here.

#ifndef __STDC_ISO_10646__
#error "wchar_t values are not guaranteed to be ISO 10646 code points"
#endif
/* Now the assumption that C's wide character functions are Unicode
   functions is valid. */

Also, the \U and \u escape sequences work with Unicode / ISO 10646
character values.

Jul 4 '06 #7

P: n/a
Andrew Poelstra <ap*******@localhost.localdomain> wrote:
> On 2006-07-04, wizardyhnr <wi********@gmail.com> wrote:
>> i want to try ANSI C99's unicode fuctions. so i write a test program.
>> the function is simple, but i cannot compile it with dev c++ 4.9.9.2
>> under windows xp sp2, since the compiler always think that the
>> initialization of the wchar_t string is illegal. here is my function:
>
> Without looking at your actual problem, here's a few tips:

0) It's ISO C99, and has been from the start.

> 1) It's fairly unlikely that you actually have a C99 compiler.

It's actually 100% sure he hasn't.

> 2) It's very unlikely that something with the word "C++" in it
> is even a C compiler, let alone a C99 compiler.

It's actually 100% sure it is an IDE with compiler suite which provides
C++, C89, and a Win32 library (MingW, to be precise), but not C99.

Richard
Jul 4 '06 #8

P: n/a
wizardyhnr wrote:
> i want to try ANSI C99's unicode fuctions. so i write a test program.
> the function is simple, but i cannot compile it with dev c++ 4.9.9.2
> under windows xp sp2, since the compiler always think that the
> initialization of the wchar_t string is illegal. here is my function:
[snip code]

I'm only guessing but I think the source character set of your compiler
doesn't support characters other than the basic C character set.

Maybe the following URLs could shed some light in this obscure area?
<http://evanjones.ca/unicode-in-c.html>
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
<http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF>

Also the authoritative resource:
<http://www.unicode.org/>

Jul 4 '06 #9

P: n/a
wizardyhnr wrote:
> i want to try ANSI C99's unicode fuctions.
[snip]

Although the following leans towards Linux, it's still a good introduction to
Unicode and C's support for it by means of the wchar_t type and
standard library functions.

<http://www-128.ibm.com/developerworks/linux/library/l-linuni.html>

Jul 4 '06 #10

P: n/a
I think I made a mistake about whether the compiler fully supports standard
C99, but I think some people put too much emphasis on that.
I recompiled the program with MinGW 5.0.3, whose gcc version is 3.4.5,
and the problem persists. Though it may not support all features
of C99, I do not think the compiler would lack support for wide character
functions.

Jul 4 '06 #11

P: n/a

santosh wrote:
> wizardyhnr wrote:
>> i want to try ANSI C99's unicode fuctions.
>> [snip]
>
> The following leans towards Linux, it's still a good introduction to
> Unicode and C's support for it by means of the wchar_t type and
> standard library functions.
>
> <http://www-128.ibm.com/developerworks/linux/library/l-linuni.html>
Jul 4 '06 #12

P: n/a
"Alf P. Steinbach" <al***@start.no> writes:
[...]
> Don't know about C, but in C++ you'd have to put a 'const' in there,
>
> wchar_t const* curr_buff = L"X";

No, you don't have to (though at least one C compiler, namely gcc, can
be invoked with an option that would make it necessary), but it's a
good idea anyway.

String literals are not const, but attempting to modify a string
literal invokes undefined behavior. Using const could help you catch
an error that the compiler otherwise wouldn't warn you about.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Jul 4 '06 #13

P: n/a
In article <Nf******************************@bt.com>,
Richard Heathfield <in*****@invalid.invalid> wrote:
> wizardyhnr said:
>> in the function, the initialization of wchar_t *cur_buff is L"X", if X
>> is an ascii character, then all things function well. But if X is
>> non-ascii charater such as a Chinese character, compiler would alert
>> that this is a illegal byte sequence.
>
> As long as the compiler (or, almost certainly, the preprocessor in this
> case) supports the basic source character set, it remains within its rights
> to reject any other characters it encounters within the source code.

That is arguably incorrect, Richard.

C89 2.2.1 Character Sets
[...]
In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of
the source character set or by escape sequences consisting of
the backslash \ followed by one or more characters. A byte with all
bits set to 0, called the null character, shall exist in the basic
execution set; it is used to terminate a character string literal.
[...]
In the execution character set, there shall be control characters
representing alert, backspace, carriage return, and new line. If any
other characters are encountered in a source file (except in
a character constant, a string literal, a header name, a comment,
or a preprocessing token that is never converted to a token), the
behaviour is undefined.

3.1.3.4 Character Constants
[...]
An integer character constant is a sequence of one or more multibyte
characters enclosed in single-quotes, as in 'x' or 'ab'. A wide
character constant is the same, except prefixed by the letter L.
With a few exceptions detailed later, the elements of the sequence
are any members of the source character set; they are mapped in an
implementation-defined manner to members of the execution character set.
Thus, string constants (and string literals) are allowed to contain
multi-byte characters; the value of those is implementation-defined,
and it is true that the implementation might choose to define the
values as being illegal. You are technically correct about that aspect,
though -in a way- misleading, in that the standard explicitly allows
for multi-byte character support, so it is, at least psychologically,
not the same kind of "within its rights" as would be, say, whether
dollar-sign is permitted in identifier names (which would clearly
be an extension).

I would, though, argue that your statement is not exactly correct, in that
the C89 standard defines the source character set, and defines the
execution character set, and defines the allowed characters in
literals to include representations of the execution character set,
*and the basic execution character set is defined to include some characters
that do not appear in the basic source character set*. It is thus not
permitted for the compiler to define the representation of those
additional characters (null, alert, backspace, carriage return, and
new line) as being illegal.

There is the semantic question of whether (e.g.) \a appearing in
a literal is a single character or a pair of characters for the purpose
of "If any other characters are encountered in the source file", but
notice that 3.1.3.4 specifically notes that there are exceptions to
"the elements of the sequence are any members of the source character set".

I'm not entirely clear, reading the whole of 3.1.3.4, as to which
portions are considered by the standard to be the "exceptions" and
which not, but for the purposes of this present nit, it is enough to
point out that the standard -says- there are exceptions, and
thus that within literals, there are permitted values defined as valid
and yet which are not members of the source character set.
--
There are some ideas so wrong that only a very intelligent person
could believe in them. -- George Orwell
Jul 4 '06 #14

P: n/a
# in the function, the initialization of wchar_t *cur_buff is L"X", if X
# is an ascii character, then all things function well. But if X is
# non-ascii charater such as a Chinese character, compiler would alert
# that this is a illegal byte sequence. The source file is saved as ascci
# code, and the character set is gb2312. i wonder why this happens?

Beyond ASCII, there are many different ways to encode Unicode. Unless your
compiler and editor are using the same encoding, the compiler is going
to see garbage. Some encodings exclude certain byte values, and that could
well be the illegal byte sequence.

--
SM Ryan http://www.rawbw.com/~wyrmwif/
I'm not even supposed to be here today.
Jul 5 '06 #15

P: n/a

SM Ryan wrote:
> # in the function, the initialization of wchar_t *cur_buff is L"X", if X
> # is an ascii character, then all things function well. But if X is
> # non-ascii charater such as a Chinese character, compiler would alert
> # that this is a illegal byte sequence. The source file is saved as ascci
> # code, and the character set is gb2312. i wonder why this happens?
>
> Beyond ASCII, there are many different ways encode unicode. Unless your
> compiler and edittor are using the same encoding, the compiler is going
> to see garbage. Some encodings exclude certain byte values. and that could
> well be the illegal byte sequence.
>
> --
> SM Ryan http://www.rawbw.com/~wyrmwif/
> I'm not even supposed to be here today.

I think maybe this is the reason.

Jul 7 '06 #16
