Bytes IT Community

UTF-16 & wchar_t: the 2nd worst thing about C++

This is one of the first obstacles I encountered when getting started with
C++. I found that everybody had their own idea of what a string is. There
was std::string, QString, xercesc::XMLString, etc. There are also char,
wchar_t, QChar, XMLCh, etc., for character representation. Coming from
Java where a String is a String is a String, that was quite a shock.

Well, I'm back to looking at this, and it still isn't pretty. I've found
what appears to be a way to go between QString and XMLCh. XMLCh is
reported to be UTF-16. QString is documented to be the same. QString
provides very convenient functions for 'converting'[*] between NTBS const
char*, std::string and QString[**]. So, using QString as an intermediary,
I can construct a std::string from a const XMLCh* NTBS, and a const XMLCh*
NTBS from a std::string.

My question is whether I can do this without the QString intermediary. That
is, given a UTF-16 NTBS, can I construct a std::string representing the
same characters? And given a std::string, can I convert it to a UTF-16
NTBS? I have been told that some wchar_t implementations are UTF-16, and
some are not.

My reading of the ISO/IEC 14882:2003 is that implementations must support
the UTF-16 character set[***], but are not required to use UTF-16 encoding.
The proper way to express a member of the UTF-16 character set is to use
the form \UNNNNNNNN[****], where NNNNNNNN is the eight-digit value of the
universal-character-name, or \uNNNN, where NNNN is the character short name
of a universal-character-name whose value is \U0000NNNN. The exceptions are
characters that are members of the basic character set, characters whose
hexadecimal value is less than 0x20, and characters whose hexadecimal value
is in the range 0x7f-0x9f (inclusive). Members of the UTF-16 character set
which are also members of the basic character set are to be expressed using
their literal symbol in an L-prefixed character literal, or an L-prefixed
string literal.

This tells me that the UTF-16 defined by the Xerces XMLCh does not conform
to the definition of the extended character set of a C++ implementation.
http://xml.apache.org/xerces-c/apiDo...pp-source.html

Is my understanding of this situation correct?

UTF-16 seems to be a good candidate for a lingua franca of runtime character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?
[*] here 'converting' means either type conversion or constructing a new
variable to hold a different encoding of the same character sequence.

[**]Leaving aside unanswered questions such as whether knowing that both are
encoded as UTF16 is sufficient to assume the representations are identical.

[***]Here UTF-16 is used as a synonym for UCS-2 described in ISO/IEC 10646
"Universal Multiple-Octet Coded Character Set", though there may be subtle
differences.
[****] Note that the case of the 'u' is significant: \U introduces the
eight-digit form and \u the four-digit short form.

So what is the worst thing about C++? '#'
--
NOUN:1. Money or property bequeathed to another by will. 2. Something handed
down from an ancestor or a predecessor or from the past: a legacy of
religious freedom. ETYMOLOGY: MidE legacie, office of a deputy, from OF,
from ML legatia, from L legare, to depute, bequeath. www.bartleby.com/61/
Mar 9 '06 #1
23 Replies


Steven T. Hatton wrote:
UTF-16 seems to be a good candidate for a lingua franca of runtime
character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?


To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?

Mar 9 '06 #2

Steven T. Hatton wrote:
Steven T. Hatton wrote:
UTF-16 seems to be a good candidate for a lingua franca of runtime
character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?


To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?


Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings than it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?

So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.

Greg

Mar 9 '06 #3

Greg wrote:
Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings than it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?
On a 32 bit system the unit of data processed by each instruction is 32
bits. That means that storing two 16 bit values in one 32 bit word would
require some kind of packing and unpacking. Perhaps I am wrong, but my
understanding is that such processor overhead is typically not expended
when dealing with in-memory data.
So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.


Do you have any benchmark examples to demonstrate this?
Mar 9 '06 #4

Steven T. Hatton <ch********@germania.sup> wrote:
This is one of the first obstacles I encountered when getting started
with C++. I found that everybody had their own idea of what a string
is. There was std::string, QString, xercesc::XMLString, etc. There
A string in C++ is an std::string and nothing else. All the QString
and XMLString stuff you found are just reinventions of the wheel. They
might (and should, imho) have used the Standard C++ std::string type.
are also char, wchar_t, QChar, XMLCh, etc., for character
Same here. According to the Standard a char is "large enough to
store any member of the implementation's basic character set" (3.9.1/1)
and a wchar_t "is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales" (3.9.1/5).
representation. Coming from Java where a String is a String is a
String, that was quite a shock.


You can't keep people from reinventing the wheel .. :|

I am not going into the problems you described with those
reinventions, as I am not familiar with those. But converting strings in
C++ to wide-strings and back could be done this way:

#include <string>

// Note: these copy code units one-for-one; they do not transcode.
// widen() is fine for plain ASCII, but narrow() silently truncates
// any wchar_t value that does not fit in a char.
inline std::wstring widen (std::string const& s)
{
    return std::wstring (s.begin (), s.end ());
}

inline std::string narrow (std::wstring const& s)
{
    return std::string (s.begin (), s.end ());
}

int main ()
{
    std::string s = "hello world";
    std::wstring s1 = widen (s);
    std::string s2 = narrow (s1);
}

Now I am sure you can create similar functions to convert from and
to those other string types.

If you need to use all three of those string types, my advice is to
make the Standard string types your internal string types (those you
work with) and not litter your code with non-standard string classes.
Then convert your std::string/std::wstring objects to whatever string
class is required.

regards
--
jb

(reply address in rot13, unscramble first)
Mar 9 '06 #5


Greg wrote:
Steven T. Hatton wrote:
Steven T. Hatton wrote:
UTF-16 seems to be a good candidate for a lingua franca of runtime
character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?
To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?


Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings than it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?


I cannot see how using a variable-length encoding can be faster
than using a fixed-size one. In my opinion, UTF16 gives you nothing
compared to UTF8 unless you totally ignore the fact that the data might
be encoded. But if you do that, how are you going to react to input
from the surrounding world?

/Peter

So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.

Greg


Mar 9 '06 #6

Steven T. Hatton wrote:

On a 32 bit system the unit of data processed by each instruction is 32
bits. That means that storing two 16 bit values in one 32 bit word would
require some kind of packing and unpacking. Perhaps I am wrong, but my
understanding is that such processor overhead is typically not expended
when dealing with in-memory data.


On most systems, an 8-bit byte is the fundamental addressable storage
unit. Simple char arrays use one byte per char. When wchar_t is 16 bits
it occupies two bytes. When it's 32 bits it occupies four bytes. Try it:

#include <cstdio>

int main()
{
    char cvalues[2];
    std::printf("%p %p\n", (void*)&cvalues[0], (void*)&cvalues[1]);
    short svalues[2]; // assuming 16-bit short
    std::printf("%p %p\n", (void*)&svalues[0], (void*)&svalues[1]);
    return 0;
}

--

Pete Becker
Roundhouse Consulting, Ltd.
Mar 9 '06 #7

Jakob Bieling wrote:
Steven T. Hatton <ch********@germania.sup> wrote:
This is one of the first obstacles I encountered when getting started
with C++. I found that everybody had their own idea of what a string
is. There was std::string, QString, xercesc::XMLString, etc. There
A string in C++ is an std::string and nothing else. All the QString
and XMLString stuff you found are just reinventions of the wheel. They
might (and should, imho) have used the Standard C++ std::string type.
are also char, wchar_t, QChar, XMLCh, etc., for character


Same here. According to the Standard a char is "large enough to
store any member of the implementation's basic character set" (3.9.1/1)
and a wchar_t "is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales" (3.9.1/5).
representation. Coming from Java where a String is a String is a
String, that was quite a shock.


You can't keep people from reinventing the wheel .. :|

I am not going into the problems you described with those
reinventions, as I am not familiar with those.

<quote url="http://xml.apache.org/xerces-c/build-misc.html#XMLChInfo">
What should I define XMLCh to be?

XMLCh should be defined to be a type suitable for holding a utf-16 encoded
(16 bit) value, usually an unsigned short.

All XML data is handled within Xerces-C++ as strings of XMLCh characters.
Regardless of the size of the type chosen, the data stored in variables of
type XMLCh will always be utf-16 encoded values.

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).

Some earlier releases of Xerces-C++ defined XMLCh to be the same type as
wchar_t on most platforms, with the goal of making it possible to pass
XMLCh strings to library or system functions that were expecting wchar_t
parameters. This approach has been abandoned because of

* Portability problems with any code that assumes that the types of XMLCh
and wchar_t are compatible

* Excessive memory usage, especially in the DOM, on platforms with 32 bit
wchar_t.

* utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t
on Solaris and Linux. The problem occurs with Unicode characters with
values greater than 64k; in ucs-4 the value is stored as a single 32 bit
quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still
create the utf-16 encoded surrogate pairs, which are illegal in ucs-4
encoded wchar_t strings.
</quote>

inline std::string narrow (std::wstring const& s)
{
return std::string (s.begin (), s.end ());
}
Can I rely on that to convert all UTF-32 to UTF-8?
Now I am sure you can create similar functions to convert from and
to those other string types.
As I've already indicated, QString provides conversion functions. Xerces
also provides a transcode function, but it is not as easy to use.
Moreover, transcoding is expensive.
If you need to use all three of those string types, my advice is to
make the Standard string types your internal string types (those you
work with) and not litter your code with non-standard string classes.
Then convert your std::string/std::wstring objects to whatever string
class is required.


With Xerces, that really is not an option.
Mar 9 '06 #8

peter koch wrote:
I cannot see how using a variable-length encoding can be faster
than using a fixed-size one. In my opinion, UTF16 gives you nothing
compared to UTF8 unless you totally ignore the fact that the data might
be encoded. But if you do that, how are you going to react to input
from the surrounding world?


I don't understand what you mean here. If my understanding is correct, UTF-8
uses different numbers of bytes for different characters. UTF-16 does that
far less. UTF-32 doesn't do it at all.
Mar 9 '06 #9

Steven T. Hatton <ch********@germania.sup> wrote:
Jakob Bieling wrote:

inline std::string narrow (std::wstring const& s)
{
return std::string (s.begin (), s.end ());
}


Can I rely on that to convert all UTF-32 to UTF-8?


No, the output will only be a string of single-byte characters
(non-UTF). Thus you will lose information.

You should disregard my comment about using
std::wstring/std::string. I was not aware of the complexity of UTF until
a few minutes ago and should not have answered in this extent with my
half-knowledge about it.

regards
--
jb

(reply address in rot13, unscramble first)
Mar 9 '06 #10

Steven T. Hatton wrote:

I don't understand what you mean here. If my understanding is correct, UTF-8
uses different numbers of bytes for different characters. UTF-16 does that
far less. UTF-32 doesn't do it at all.


Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not, so it's far simpler. For example, if you're
moving around in an array of characters, moving N characters in UTF-32
is just a pointer adjustment. Moving N characters in UTF-8 or UTF-16
requires examining every character along the way.

--

Pete Becker
Roundhouse Consulting, Ltd.
Mar 9 '06 #11

Pete Becker wrote:

On most systems, an 8-bit byte is the fundamental addressable storage
unit. Simple char arrays use one byte per char. When wchar_t is 16 bits
it occupies two bytes. When it's 32 bits it occupies four bytes. Try it:

#include <cstdio>

int main()
{
    char cvalues[2];
    std::printf("%p %p\n", (void*)&cvalues[0], (void*)&cvalues[1]);
    short svalues[2]; // assuming 16-bit short
    std::printf("%p %p\n", (void*)&svalues[0], (void*)&svalues[1]);
    return 0;
}


But that doesn't tell me what's going on in terms of physical storage. An
octet may be the smallest addressable unit of storage, but that doesn't
mean it is the smallest retrievable unit of storage. Data is not moved
around in 8-bit chunks. The smallest chunk of data that gets moved between
registers on a 32-bit processor is a 32 bit word.

I've been trying to contrive some kind of a test, but my implementation
seems to be trying to outsmart me by reusing some of the data I'm feeding
it.
Mar 9 '06 #12

Pete Becker wrote:
Steven T. Hatton wrote:

I don't understand what you mean here. If my understanding is correct,
UTF-8 uses different numbers of bytes for different characters. UTF-16
does that far less. UTF-32 doesn't do it at all.


Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not, so it's far simpler. For example, if you're
moving around in an array of characters, moving N characters in UTF-32
is just a pointer adjustment. Moving N characters in UTF-8 or UTF-16
requires examining every character along the way.

Well, if I happen to know the particular subset of UTF-16 I'm dealing with
will not have any second plane (IIRC) characters, then I can ignore the
fact that some UTF-16 is, indeed, multi-unit. At one point I was under the
impression that this is what UCS-2 was about, but I don't believe that is
correct.
Mar 9 '06 #13


Steven T. Hatton wrote:
Jakob Bieling wrote:
[snip]
XMLCh should be defined to be a type suitable for holding a utf-16 encoded
(16 bit) value, usually an unsigned short.

All XML data is handled within Xerces-C++ as strings of XMLCh characters.
Regardless of the size of the type chosen, the data stored in variables of
type XMLCh will always be utf-16 encoded values.

Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is
utf-16 (AIX, Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is
not based on Unicode at all (HP/UX, AS/400, system 390).

Some earlier releases of Xerces-C++ defined XMLCh to be the same type as
wchar_t on most platforms, with the goal of making it possible to pass
XMLCh strings to library or system functions that were expecting wchar_t
parameters. This approach has been abandoned because of

* Portability problems with any code that assumes that the types of XMLCh
and wchar_t are compatible

* Excessive memory usage, especially in the DOM, on platforms with 32 bit
wchar_t.

* utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t
on Solaris and Linux. The problem occurs with Unicode characters with
values greater than 64k; in ucs-4 the value is stored as a single 32 bit
quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still
create the utf-16 encoded surrogate pairs, which are illegal in ucs-4
encoded wchar_t strings.
</quote>
inline std::string narrow (std::wstring const& s)
{
return std::string (s.begin (), s.end ());
}
Can I rely on that to convert all UTF-32 to UTF-8?


You cannot, and I do not believe the code above is correct (it has no
error detection).
Still it is (in my opinion, I am not sure this is required by the
standard) a bad idea to store UTF-8 or UTF-16 data in a
std::basic_string.
I would expect for a standard string s that s[n] gives the character at
position n and that s.size() gives me the number of characters in that
string. For encoded strings this is simply wrong.

/Peter
Now I am sure you can create similar functions to convert from and
to those other string types.
As I've already indicated, QString provides conversion functions. Xerces
also provides a transcode function, but it is not as easy to use.
Moreover, transcoding is expensive.


Is it that bad? I would expect most conversions from one character set
to another to be relatively fast - most likely bounded by the memory
bandwidth available.
If you need to use all three of those string types, my advice is to
make the Standard string types your internal string types (those you
work with) and not litter your code with non-standard string classes.
Then convert your std::string/std::wstring objects to whatever string
class is required.
With Xerces, that really is not an option.


If the Xerces interface is inadequate in that direction you should
provide a wrapper to Xerces - converting (most likely) your UCS-4
characters to the internal Xerces format on the way in to Xerces and
convert the other way when reading Xerces data.

/Peter


Mar 9 '06 #14

Pete Becker wrote:
Yes, but "far less" is not zero, and code that deals with UTF-8 or
UTF-16 has to be aware of the possibility of multi-character encodings.
Code for UTF-32 does not
I thought all UTFs had multi-character rules. I'm aware UTF-32 will only
need them after we add >4 billion characters to Unicode. Maybe after
meeting a couple thousand alien species, each with a diverse culture...

(Furthermore, the task of moving N glyphs raises its ugly head, because
some are composites...)

Steven Hatton wrote:
So what is the worst thing about C++? '#'


Anyone who learns only part of a language, and its styles and idioms, will
have the potential to abuse some feature. You could say the same thing
about 'if' statements. They have a great potential for abuse. Yet you don't
often read posts here bragging "I know better than to abuse 'if'
statements!!"

Except me. ;-)

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Mar 9 '06 #15

Steven T. Hatton wrote:


But that doesn't tell me what's going on in terms of physical storage.
Sure it does. It tells you the addresses where those things are stored.
An
octet may be the smallest addressable unit of storage, but that doesn't
mean it is the smallest retrievable unit of storage. Data is not moved
around in 8-bit chunks.
On many systems it is. But, granted, when you're talking "normal"
desktop systems, you're generally dealing with 32-bit bus widths.
The smallest chunk of data that gets moved between
registers on a 32-bit processor is a 32 bit word.


Even if it's true, it doesn't matter. Register to register moves are
fast. It sounds like you're trying to micro-optimize for cache behavior.
I much prefer to leave that to the compiler writers. They know a great
deal more about it than I do.

--

Pete Becker
Roundhouse Consulting, Ltd.
Mar 9 '06 #16

peter koch wrote:

Still it is (in my opinion, I am not sure this is required by the
standard) a bad idea to store UTF-8 or UTF-16 data in a
std::basic_string.
I would expect for a standard string s that s[n] gives the character at
position n and that s.size() gives me the number of characters in that
string. For encoded strings this is simply wrong.


I agree: basic_string has no knowledge of variable-length encodings. It
won't give the behavior you want when your text is encoded in UTF-8,
UTF-16, shift-JIS, or any other variable-length encoding.

The answer for shift-JIS is to translate to wide characters and use
basic_string<wchar_t>. For UTF-8 or UTF-16, use a 32-bit character type.

--

Pete Becker
Roundhouse Consulting, Ltd.
Mar 9 '06 #17

Pete Becker wrote:
Steven T. Hatton wrote:


But that doesn't tell me what's going on in terms of physical storage.


Sure it does. It tells you the addresses where those things are stored.

Indeed. I did not look closely enough at the example. Now that I think
about it, an array /will/ store data contiguously. I'm not sure what
happens to individual integer values, or characters.
The smallest chunk of data that gets moved between
registers on a 32-bit processor is a 32 bit word.


Even if it's true, it doesn't matter. Register to register moves are
fast. It sounds like you're trying to micro-optimize for cache behavior.
I much prefer to leave that to the compiler writers. They know a great
deal more about it than I do.


I'm trying to understand the rationale for Xerces using their own XMLCh
rather than wchar_t. Their argument is that it requires much less storage
than the 4-byte wchar_t used on Unix/Linux implementations. It would
appear that they are correct. But that puts us back to the question of
whether they need to examine every character while bumping pointers.

Mar 9 '06 #18

Steven T. Hatton wrote:

I'm trying to understand the rationale for Xerces using their own XMLCh
rather than wchar_t.
Okay.
Their argument is that it requires much less storage
than the 4-byte wchar_t used on Unix/Linux implementations. It would
appear that they are correct. But that puts us back to the question of
whether they need to examine every character while bumping pointers.


Yup. That's the tradeoff. They may have decided to ignore that
possibility. That's what Java did, because at the time, all of Unicode
fit in 16 bits. When Unicode grew, they had to hack in support for
variable-width characters.

--

Pete Becker
Roundhouse Consulting, Ltd.
Mar 9 '06 #19

Pete Becker wrote:
Steven T. Hatton wrote:

Their argument is that it requires much less storage
than the 4-byte wchar_t used on Unix/Linux implementations. It would
appear that they are correct. But that puts us back to the question of
whether they need to examine every character while bumping pointers.


Yup. That's the tradeoff. They may have decided to ignore that
possibility. That's what Java did, because at the time, all of Unicode
fit in 16 bits. When Unicode grew, they had to hack in support for
variable-width characters.


I'm pretty sure they bit the bullet and went all the way. That's probably
why transcoding to and from XMLCh is so expensive. Once it's in their
internal form (UTF-16), I suspect there really aren't that many instances
where they need to worry about bumping pointers per character. They surely
don't need it to determine two sequences are equal. If they happen to find
they are not, then they resort to the more expensive operations at the
point of divergence.

Mar 9 '06 #20

Phlip wrote:

Steven Hatton wrote:
So what is the worst thing about C++? '#'


Anyone who learns only part of a language, and its styles and idioms, will
have the potential to abuse some feature. You could say the same thing
about 'if' statements. They have a great potential for abuse. Yet you
don't often read posts here bragging "I know better than to abuse 'if'
statements!!"


Have you ever inadvertently defined a header guard to be the same value as
one used by a 3rd party library? Ever had a macro collision between two
libraries? And then there are, of course, the more straightforward
problems such as changing the name of a source file, answering the phone,
and creating a new source file with the same name as the original.

There are places where I've seen the CPP used effectively to do things in a
better way than any I can come up with. For example:

http://websvn.kde.org/trunk/KDE/kdev...20&view=markup
Mar 11 '06 #21

Steven T. Hatton wrote:
Ever had a macro collision between two
libraries?
Yes, and I have also written the wrong stuff inside an 'if' statement!

All languages have trade-offs between elegant things and icky things.
http://websvn.kde.org/trunk/KDE/kdev...20&view=markup


Now imagine if you wrote that on the job, and solved a tricky problem.
However, your supervisor once wrote a book that advised against many kinds
of code abuse - common in his industry - and then had to review your work.
Maybe he can't see past the macros to the elegant results. Maybe your
staying power at that company goes way down. Just because others abused some
language feature and he wrote a book about it.

Here's an even better example of macro elegance:

http://www.codeproject.com/macro/metamacros.asp

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Mar 11 '06 #22

Phlip wrote:
Steven T. Hatton wrote:
Ever had a macro collision between two
libraries?
Yes, and I have also written the wrong stuff inside an 'if' statement!


If you are experiencing macro-like problems when using them, I suggest
avoiding global variables, and using the scoping facilities of the
language, something which is not an option with the CPP.
All languages have trade-offs between elegant things and icky things.
http://websvn.kde.org/trunk/KDE/kdev...20&view=markup
Now imagine if you wrote that on the job, and solved a tricky problem.
However, your supervisor once wrote a book that advised against many kinds
of code abuse - common in his industry - and then had to review your work.
Maybe he can't see past the macros to the elegant results. Maybe your
staying power at that company goes way down. Just because others abused
some language feature and he wrote a book about it.
Not really a major problem in the case of the r++ code. I've already done a
regexp replacement on all the macros with the resulting code working just
fine.
Here's an even better example of macro elegance:

http://www.codeproject.com/macro/metamacros.asp


Perhaps I've failed to appreciate something, but I don't find the example
overly compelling. I've been working on something similar using templates.
One big difference is that with templates I can use type information to
form hierarchies of "typelets". The biggest shortcoming of templates
versus macros is that macros can use one string for both code generation
and string literal creation. But a bit of `C-M-%' deals with that. To get
hard strings into templates use char[], and wrap them in an anonymous
namespace to avoid ODR indictments. If I really need high-powered string
manipulation, I have tools that make sawdust out of the CPP. In addition
to stringification, another advantage of the CPP, one that other approaches
don't offer, is the ability to distribute the unexpanded macros to anybody
with a standard C++ implementation and know they will be expanded
predictably - barring the possibility of macro collisions. But the avian
flu also has the ability to have its code understood and expanded by most
hosts.

Mar 11 '06 #23

Steven T. Hatton wrote:
If you are experiencing macro-like problems when using them, I will
suggest
avoiding global variables, and to use the scoping facilities of the
language. Something which is not an option with the CPP.


You are having fun arguing with things I'm not saying.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Mar 12 '06 #24
