stdin charset

Hi,

I'm new to C/C++ and working on string stuff with Visual Studio 2005.
I'm trying to understand something. For example, when I do this:

wstring st;
wcin >> st;

if the input is pure ASCII, then everything is OK, but if there are
Unicode characters like "ş" (U+015F), what is the encoding of st now?
Everything works when I use this st string, do stuff, write to cout,
etc., but if I want to convert this string to UTF-8, what encoding am I
converting from?

By the way, when I do something like this:

wstring a = L"ş";
wstring b;
wcin >> b;

and type "ş" into the console,

(a == b) is false. I checked a and it's Unicode (UTF-16); b is not
Unicode, and I could not manage to find out what it is.

Thanks.

Apr 29 '07 #1
On Apr 30, 9:27 am, Antimon <anti...@gmail.com> wrote:
> I'm new to C/C++ and working on string stuff with Visual Studio 2005.

NB. I'm no expert on this, but am posting because nobody else
has yet, so perhaps I can help you a little, at least.

> I'm trying to understand something. For example, when I do this:
>
> wstring st;
> wcin >> st;
>
> if the input is pure ASCII, then everything is OK, but if there are
> Unicode characters like "ş" (U+015F), what is the encoding of st now?

It depends on your compiler. From what I know of Microsoft, it's
likely to be UTF-16.

> Everything works when I use this st string, do stuff, write to cout,
> etc., but if I want to convert this string to UTF-8, what encoding am I
> converting from?

C++ includes the C functions for converting between "wide
character" and "multi-byte character sequence". The standard doesn't
specify that the multi-byte encoding has to be UTF-8 (it depends on
the current locale), but if you're lucky then it will turn out to be
that on your compiler. Try using the function wcstombs() on your
wstring and it might spit out UTF-8.
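
A minimal sketch of the conversion, assuming the wstring holds UTF-16 (16-bit wchar_t, as with Visual C++) or UTF-32 (32-bit wchar_t). Because the multi-byte encoding produced by wcstombs() follows the current locale, this version does the encoding by hand instead. The names append_utf8 and to_utf8 are hypothetical helpers, not library functions, and unpaired surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to out, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a wide string (assumed UTF-16 or UTF-32) to UTF-8.
std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For example, to_utf8(L"\u015F") yields the two bytes 0xC5 0x9F, which is the UTF-8 form of s-cedilla.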

> By the way, when I do something like this:
>
> wstring a = L"ş";
> wstring b;
> wcin >> b;
>
> and type "ş" into the console,
>
> (a == b) is false. I checked a and it's Unicode (UTF-16); b is not
> Unicode, and I could not manage to find out what it is.

You can check what you have got by printing it out as a series
of unsigned chars, e.g.:

#include <cstddef>
#include <cstdio>

// Print nbytes bytes starting at ptr as two-digit hex values.
void hex_dump( void const *ptr, std::size_t nbytes )
{
    // C++ needs an explicit cast from void const*.
    unsigned char const *p = static_cast<unsigned char const *>(ptr);
    while (nbytes--)
        std::printf("%02X", *p++);
    std::putchar('\n');
}

and then call it like this:

hex_dump( a.c_str(), a.size() * sizeof(wchar_t) );
hex_dump( b.c_str(), b.size() * sizeof(wchar_t) );

Apr 29 '07 #2
C++ does not know anything about encodings (UTF-8, UTF-16 or whatever).
In C++, a wide char is just meant to be a placeholder for a wider value
(2 bytes with Visual C++; the size depends on the compiler).
You can put whatever you want in it.

If you want to work with encodings, you should use a library that handles this.

J.
Apr 30 '07 #3
On Apr 29, 11:27 pm, Antimon <anti...@gmail.com> wrote:
> I'm new to C/C++ and working on string stuff with Visual Studio 2005.
> I'm trying to understand something. For example, when I do this:
>
> wstring st;
> wcin >> st;
>
> if the input is pure ASCII, then everything is OK, but if there are
> Unicode characters like "?" (U+015F), what is the encoding of st now?

It depends on the system. Windows uses (I think) UTF-16, and
Linux UTF-32. Older systems have different conventions, which
may vary according to the compiler. (G++ and Sun CC behave
differently under Solaris, for example.)
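
A quick way to see which case applies on a given compiler is to look at the width of wchar_t. This is only a sketch: guess_wide_encoding is a hypothetical helper, and the width alone does not prove which encoding is in use, it just narrows it down.

```cpp
#include <climits>
#include <string>

// Guess the wide-character encoding from the width of wchar_t.
// 16 bits suggests UTF-16 (e.g. Visual C++), 32 bits suggests UTF-32.
std::string guess_wide_encoding()
{
    const unsigned bits = sizeof(wchar_t) * CHAR_BIT;
    if (bits == 16) return "UTF-16 (likely)";
    if (bits == 32) return "UTF-32 (likely)";
    return "unknown";
}
```

Printing the result at program start is enough to tell which of the two common conventions a toolchain follows.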

> Everything works when I use this st string, do stuff, write to cout,
> etc., but if I want to convert this string to UTF-8, what encoding am I
> converting from?

It depends on the system, the compiler, and possibly even some
options of the compiler.

> By the way, when I do something like this:
>
> wstring a = L"?";
> wstring b;
> wcin >> b;
>
> and type "?" into the console,
>
> (a == b) is false. I checked a and it's Unicode (UTF-16); b is not
> Unicode, and I could not manage to find out what it is.

When reading from wcin (or any wide string input), how the input
is encoded depends on the locale embedded in the stream. By
default, this should be the "C" locale (although if you change
the global locale in a constructor of a static object, there may
be some issues concerning order of initialization), and
I can't imagine any problems with regard to the "C"
locale. (At least with "?", which is pure ASCII. For
historical reasons, Windows does not use the same default code
page in console windows as it uses elsewhere, so you often do
get surprises.)

FWIW: I'm unable to duplicate what you describe on my Windows
machine (with VC++ 2005). Both a and b, above, contained a
single character with the value 0x003F (which corresponds to the
UTF-16 code for '?').

--
James Kanze (GABI Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Apr 30 '07 #4
> When reading from wcin (or any wide string input), how the input
> is encoded depends on the locale embedded in the stream. By
> default, this should be the "C" locale (although if you change
> the global locale in a constructor of a static object, there may
> be some issues concerning order of initialization), and
> I can't imagine any problems with regard to the "C"
> locale. (At least with "?", which is pure ASCII. For
> historical reasons, Windows does not use the same default code
> page in console windows as it uses elsewhere, so you often do
> get surprises.)
>
> FWIW: I'm unable to duplicate what you describe on my Windows
> machine (with VC++ 2005). Both a and b, above, contained a
> single character with the value 0x003F (which corresponds to the
> UTF-16 code for '?').

I think that's because your newsreader displays that character as "?".
It was an "s with cedilla", Unicode character \u015F. I tried something
else:

wstring a = L"ş";
wstring b;
wcin >> b;

wcout << (unsigned int)a[0] << "\n";
wcout << (unsigned int)b[0] << "\n";

(a is the Unicode character \u015F that I mentioned before.) When I
run this and again type the same character that "a" holds, I get the
output:

351
159

The first one (a) is right: \u015F is 351. But what the hell is 159? :) And
if I add "locale::global(locale(""));" at the top, I get:

351
376

Still, it doesn't read UTF-16 from the console. I've been reading through
MSDN about VS2005 and Unicode stuff, but no luck yet.

Thanks a lot for helping.

Apr 30 '07 #5
On Apr 30, 1:09 pm, Antimon <anti...@gmail.com> wrote:
> > When reading from wcin (or any wide string input), how the input
> > is encoded depends on the locale embedded in the stream. By
> > default, this should be the "C" locale (although if you change
> > the global locale in a constructor of a static object, there may
> > be some issues concerning order of initialization), and
> > I can't imagine any problems with regard to the "C"
> > locale. (At least with "?", which is pure ASCII. For
> > historical reasons, Windows does not use the same default code
> > page in console windows as it uses elsewhere, so you often do
> > get surprises.)
> >
> > FWIW: I'm unable to duplicate what you describe on my Windows
> > machine (with VC++ 2005). Both a and b, above, contained a
> > single character with the value 0x003F (which corresponds to the
> > UTF-16 code for '?').
>
> I think that's because your newsreader displays that character as "?"

My newsreader displays '?' with a '?', yes:-). But you're
right. On the machine on which I read your message, the only
fonts I have installed are ISO 8859-1, and anything which is not
representable in that codeset is displayed as a '?'. I see the
s-cedilla here (although the way I've configured my editor
doesn't allow inputting it---my printer wouldn't understand it,
so there's no point).

And yes, my experiment was with a '?'. (And I did the
experiment because I simply couldn't believe that a normal ASCII
character like '?' could cause problems.)

> It was an "s with cedilla", Unicode character \u015F. I tried something
> else:
>
> wstring a = L"?";
> wstring b;
> wcin >> b;
>
> wcout << (unsigned int)a[0] << "\n";
> wcout << (unsigned int)b[0] << "\n";
>
> (a is the Unicode character \u015F that I mentioned before.) When I
> run this and again type the same character that "a" holds, I get the
> output:
>
> 351
> 159

Weird. At first, I thought that perhaps something was trimming
the upper bits somewhere, but 159 is 0x009F, and just trimming
the bits would give 0x005F.
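
The arithmetic behind that remark, as a small sketch (low_byte is a hypothetical helper): masking U+015F down to its low 8 bits leaves 0x5F, not the 0x9F (159) that was observed, so simple truncation does not explain the result.

```cpp
// Keep only the low 8 bits of a code unit, as a naive
// wide-to-narrow truncation would.
unsigned low_byte(unsigned code_unit)
{
    return code_unit & 0xFFu;
}
```

Here low_byte(0x015F) is 0x5F (the ASCII underscore), while the value read from the console was 0x9F.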

> The first one (a) is right: \u015F is 351. But what the hell is 159? :)

Application Program Command :-). Whatever that means (but it is
a control character).

> And if I add "locale::global(locale(""));" at the top, I get:
>
> 351
> 376

Which is 0x178: LATIN CAPITAL LETTER Y WITH DIAERESIS.

This is curious because normally, the locale for wcin should be
set when the object is constructed, and this is before main(),
so you should always get locale "C" (I don't know if this is
intentional, but that's effectively what the standard says.).
Quite obviously, changing the global locale is changing
something, but I don't know what. (I suspect that this is
occurring because, IIRC, the Microsoft implementation of wcin
goes through the FILE*, and FILE* will reflect all changes to
the global locale.)
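
One reading that fits both numbers, offered as an assumption rather than something confirmed in this thread: the console hands the program the single byte 0x9F, which is s-cedilla in OEM code page 857 (Turkish). The "C" locale widens that byte unchanged (159), while a locale based on Windows code page 1254 maps 0x9F to U+0178 (376). The sketch below only models that one byte; decode_byte and the CodePage enum are hypothetical, and a real table has 256 entries per code page.

```cpp
#include <cstdint>

// Hypothetical, single-entry model of three ways to widen a byte.
enum CodePage { Raw, Cp857, Cp1254 };

std::uint32_t decode_byte(unsigned char byte, CodePage cp)
{
    if (byte < 0x80) return byte;          // ASCII is the same in all three
    if (byte == 0x9F) {
        switch (cp) {
        case Raw:    return 0x009F;        // widened unchanged: 159
        case Cp857:  return 0x015F;        // s with cedilla: 351
        case Cp1254: return 0x0178;        // Y with diaeresis: 376
        }
    }
    // Other high bytes are outside the scope of this sketch.
    return cp == Raw ? byte : 0xFFFD;
}
```

If this assumption holds, the fix is to decode console input with the console's own code page rather than the "C" locale or the system default one.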

At any rate, the fact that changing the locale does have an
effect is good news, in a way, since it probably means that all
you have to do is find the correct locale. And regretfully, I
can't help much there, since all of my experience has been on
Unix platforms (where the available locales are all represented
by sub-directories of a directory locale, usually in /usr/lib).

BTW: when outputting codes, as above, it's usually easier if you
set the hex flag, so that the values are in hex. And there is
an enormous amount of information, including the full code
charts, available on line at the Unicode site
(www.unicode.org)---nothing that will help you with this
particular problem, of course, but probably useful in the long
run.
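
A minimal sketch of the hex-flag suggestion: code_units_hex is a hypothetical helper that formats every code unit of a wide string in hexadecimal, which makes values such as 0x15f and 0x9f easy to compare against the Unicode charts.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Format each code unit of ws in hex, separated by spaces.
std::string code_units_hex(const std::wstring& ws)
{
    std::ostringstream out;
    out << std::hex << std::showbase;   // hex flag plus "0x" prefix
    for (std::size_t i = 0; i < ws.size(); ++i) {
        if (i != 0) out << ' ';
        out << static_cast<unsigned long>(ws[i]);
    }
    return out.str();
}
```

The same two manipulators can be applied directly to wcout before printing (unsigned int)a[0] and (unsigned int)b[0].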

> Still, it doesn't read UTF-16 from the console. I've been reading through
> MSDN about VS2005 and Unicode stuff, but no luck yet.

You might try the Dinkumware site. I don't know if it has
anything useful, but Dinkumware did provide Microsoft with the
libraries, and the head of the company, Plauger, is probably the
best expert in the world concerning the subtleties of handling
different code sets.

As a general rule, however, expect problems anytime you go
beyond basic ASCII.


May 1 '07 #6
