
Unicode and stream

Hello.

I am using Borland C++ Builder 6.0.

Here is an example:

#include <strstrea.h>

int main() {
    wchar_t ff[10] = {L's', L'd', L'f', L'g', L't'};
    istrstream b1(ff);   // error: istrstream has no wchar_t* constructor
    return 0;
}

This example fails to compile.
Error message: Could not find a match for 'istrstream::istrstream(wchar_t *)'.

Questions:

1. Can I have a Unicode stream?
2. If not, can I work with Unicode without OS tools? I want to work with Unicode using only language facilities.
3. Are there other compilers that support Unicode streams?
4. What does the standard say about Unicode streams?
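
For reference, here is a minimal sketch of the same idea using the standard
library's wide string stream, std::wistringstream from <sstream> (this assumes
a conforming standard library; whether BC Builder 6.0 ships one is another
question):

#include <sstream>
#include <string>

int main() {
    // Wide-character input string stream: the standard counterpart
    // of the char-only istrstream above.
    std::wstring src = L"sdfgt";
    std::wistringstream b1(src);

    wchar_t c;
    while (b1.get(c)) {
        // process one wide character at a time
    }
    return 0;
}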

Thanks, Basil
Jul 22 '05 #1


Dietmar Kuehl wrote:

If a user starts using e.g. a 'std::wstring' to hold Unicode characters,
he is probably in for a few surprises, even if 'wchar_t' is large enough
to accommodate UCS-32! For example, the 'size()' function no longer
counts the number of "glyphs" (what is normally considered to be a
character) because e.g. a u-umlaut (the second character of my last
name) is not necessarily represented by one character but is possibly
encoded as the "u" character followed by the umlaut combining character.


Unicode does not deal with glyphs. Just ask 'em! A 32-bit wide character
is large enough to hold all Unicode characters. All implementations of
Unicode have to deal with combining characters. This isn't a C++ issue.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Jul 22 '05 #2

Pete Becker wrote:
Unicode does not deal with glyphs. Just ask 'em!
Effectively, a glyph is what a user wants to see at some point, and in
the description of combining characters (Unicode 4.0, section 2.10) they
definitely talk about glyphs. Also, whether they deal with them or not
is not really that relevant: for example, if you count the "characters"
in my name (correctly written; since enough programs get it wrong, I use
a common transliteration in most electronic conversation), you want to
get four, independently of whether the "u-umlaut" Unicode character or a
"u" character followed by an umlaut combining character is used. If you
used a 'std::wstring' to represent the Unicode characters, you would get
four or five, depending on what some software chose to represent the
"u-umlaut".
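
To make the count mismatch concrete, a minimal sketch (this assumes a
compiler that accepts \u universal character names in wide string
literals):

#include <iostream>
#include <string>

int main() {
    // "Kühl" with the precomposed u-umlaut (U+00FC): four code points
    std::wstring precomposed = L"K\u00FChl";
    // "Kühl" as "u" (U+0075) plus COMBINING DIAERESIS (U+0308): five
    std::wstring decomposed = L"Ku\u0308hl";

    // The same four-letter name, two different size() results.
    std::wcout << precomposed.size() << L" vs. "
               << decomposed.size() << L"\n";   // prints "4 vs. 5"
    return 0;
}
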
A 32-bit wide character is large enough to hold all Unicode characters.

I didn't dispute this. However, some Unicode sequences don't make any
sense if you rip apart certain characters, notably the combination of
a Unicode character and a following combining character (which are two
Unicode characters if I got things right).
All implementations of Unicode have to deal with combining characters.
This isn't a C++ issue.

I didn't claim that it is an issue specific to C++. I just pointed out
that the C and C++ libraries do not provide any help in processing
Unicode. In particular, the view taken by these libraries with respect
to character processing (which does not include the code conversion
facilities, IMO, as these operate on bytes rather than on characters) is
that each character is a fixed-size unit, e.g. of type 'char' or
'wchar_t' (these two character types are directly supported; users might
choose e.g. 'long' if their implementation has chosen to use a 16-bit
entity for 'wchar_t', but this would imply that they provide a whole
bunch of stuff, e.g. suitable facets), and Unicode does not exactly fit
this description, not even UCS-4 (I erroneously labeled UCS-4 "UCS-32"
in an earlier article). ... and I think it *is* a C++ issue that C++ has
no real Unicode support. Of course, this *is* also an issue for various
other languages, despite the claims of some proponents of such other
languages that the language has proper Unicode support.
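
To give an idea how early that "whole bunch of stuff" starts: merely
instantiating std::basic_string for a hypothetical 32-bit character type
already requires a complete traits class (a sketch with made-up names
uchar32/uchar32_traits follows; stream support would additionally
require ctype and codecvt facets, which are not shown):

#include <cstring>   // std::memcpy, std::memmove
#include <cwchar>    // std::mbstate_t
#include <iostream>
#include <string>

// Hypothetical 32-bit character type for implementations whose
// wchar_t is only 16 bits wide.
typedef unsigned int uchar32;

// Everything std::basic_string<uchar32> needs before it can be
// instantiated at all.
struct uchar32_traits {
    typedef uchar32        char_type;
    typedef unsigned long  int_type;
    typedef std::streampos pos_type;
    typedef std::streamoff off_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(const char_type& a, const char_type& b) { return a == b; }
    static bool lt(const char_type& a, const char_type& b) { return a < b; }
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i != n; ++i)
            if (!eq(a[i], b[i])) return lt(a[i], b[i]) ? -1 : 1;
        return 0;
    }
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c) {
        for (std::size_t i = 0; i != n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    static char_type* move(char_type* d, const char_type* s, std::size_t n) {
        std::memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
        std::memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* d, std::size_t n, char_type c) {
        for (std::size_t i = 0; i != n; ++i) d[i] = c;
        return d;
    }
    static int_type to_int_type(const char_type& c) { return c; }
    static char_type to_char_type(const int_type& i) { return char_type(i); }
    static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
    static int_type eof() { return static_cast<int_type>(-1); }
    static int_type not_eof(const int_type& i) { return eq_int_type(i, eof()) ? 0 : i; }
};

typedef std::basic_string<uchar32, uchar32_traits> ustring;

int main() {
    ustring s;
    s += uchar32(0x00FC);           // precomposed u-umlaut
    std::cout << s.size() << '\n';  // 1 -- still code units, not glyphs
    return 0;
}
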
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.contendix.com> - Software Development & Consulting

Jul 22 '05 #3

Dietmar Kuehl wrote:

I didn't dispute this. However, some Unicode sequences don't make any
sense if you rip apart certain characters, notably the combination of
a Unicode character and a following combining character (which are two
Unicode characters if I got things right).


No, that makes perfect sense: it's two Unicode characters, the first
being, say, LATIN SMALL LETTER U (0x0075), and the second being
COMBINING DIAERESIS (0x0308). If you're concerned about keeping those
two Unicode characters together, replace them with the single character
LATIN SMALL LETTER U WITH DIAERESIS (0x00fc).
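
A toy sketch of that replacement (hypothetical helper; a real
normalizer, e.g. Unicode NFC composition, handles far more than this one
pair):

#include <string>

// Replace each "u" + COMBINING DIAERESIS (U+0308) pair with the
// precomposed LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
std::wstring compose_u_umlaut(const std::wstring& in) {
    std::wstring out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == L'u' && i + 1 < in.size() && in[i + 1] == 0x0308) {
            out += wchar_t(0x00FC);  // one precomposed character
            ++i;                     // skip the combining character
        } else {
            out += in[i];
        }
    }
    return out;
}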

The point is that in Unicode every code point (i.e. valid numeric value
in a 32-bit representation) always means the same thing; you don't have
to look at context to figure out what it means. That's the basic
requirement for wchar_t, as well. It's not the case for char, though,
because the meaning of a single code point can depend on what comes
after it (first byte in a multi-byte character) or what came before it
(with shift encodings and with the second or subsequent bytes in a
multi-byte character).
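
A small sketch of that context dependence for char, using the standard
mbrtowc (this assumes the environment's default locale uses a multi-byte
encoding such as UTF-8; in a single-byte locale the result differs):

#include <clocale>
#include <cstdio>
#include <cwchar>

int main() {
    std::setlocale(LC_ALL, "");      // pick up the environment's locale

    // In UTF-8, these two bytes together encode the single character
    // u-umlaut; the second byte means nothing without the first.
    const char bytes[] = "\xC3\xBC";

    wchar_t wc;
    std::mbstate_t state = std::mbstate_t();
    std::size_t n = std::mbrtowc(&wc, bytes, sizeof bytes - 1, &state);

    // n is how many bytes were consumed for one wide character (2 here).
    std::printf("%u bytes -> U+%04lX\n", (unsigned)n, (unsigned long)wc);
    return 0;
}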

As to glyphs, they involve a great deal more than what we might call a
"letter". From the Unicode standard:

The difference between identifying a code value and rendering it
on screen or paper is crucial to understanding the Unicode
Standard's role in text processing. The character identified by
a Unicode value is an abstract entity, such as "LATIN CAPITAL
LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper,
called a glyph, is a visual representation of the character.

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Jul 22 '05 #4

Pete Becker wrote:

As to glyphs, they involve a great deal more than what we might call a
"letter". From the Unicode standard:

The difference between identifying a code value and rendering it
on screen or paper is crucial to understanding the Unicode
Standard's role in text processing. The character identified by
a Unicode value is an abstract entity, such as "LATIN CAPITAL
LETTER A" or "BENGALI DIGIT 5". The mark made on screen or paper,
called a glyph, is a visual representation of the character.


Sorry, thinking too slowly today. I was trying to suggest that we use
different terminology, because "glyph" really isn't what you're talking
about. That's why I said "letter". I think it gets at what we're talking
about: 'u-umlaut', whether it's represented by two Unicode characters or
one, is a single letter, and it's not 'u'. At least, most of the time
it's not. <g>

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
Jul 22 '05 #5
