
why is wcschr so slow???

P: n/a
I thought my program had to be caught in a loop, and cancelled it through
the task manager. It took about one second in Java, but re-implemented in
C, it had already run over one minute.

I set up a debugger to display the current location each loop and let it
run. It did reach completion, but it took 20 minutes.

I replaced the calls to wcschr in my program with calls to this substitute:

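static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
    while (TRUE) {
        if (*s == c)    /* found c; also matches the terminator when c is 0 */
            return s;
        if (*s == 0)    /* reached the end of the string without a match */
            return 0;
        ++s;
    }
}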

Now my program finishes instantly, faster than the Java version, as you
might expect.

What could wcschr be doing that takes so long???

I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded free
from their web site.
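
One way to narrow a problem like this down is to confirm that the library call and the hand-written loop return the same pointer, then time each on the same buffer. The sketch below is a portable approximation, not the original program: it uses wchar_t in place of the Windows WCHAR typedef, and the names alt_wcschr and time_search_ms are made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cwchar>
#include <string>

// Hand-written equivalent of wcschr, mirroring the altchr loop above.
const wchar_t* alt_wcschr(const wchar_t* s, wchar_t c) {
    for (;; ++s) {
        if (*s == c) return s;        // found c (or the terminator, if c == 0)
        if (*s == 0) return nullptr;  // hit the end without finding c
    }
}

// Calls search(text, c) `repeats` times and returns the elapsed milliseconds.
// `search` is any callable taking (const wchar_t*, wchar_t).
template <typename Search>
double time_search_ms(Search search, const std::wstring& text, wchar_t c, int repeats) {
    const wchar_t* volatile sink = nullptr;  // discourages optimizing the calls away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) sink = search(text.c_str(), c);
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Because std::wcschr is overloaded in C++, it has to be wrapped in a lambda before being passed to time_search_ms. If the two searches agree on results but differ wildly in time, the next thing to check is how the surrounding program calls them.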

Apr 4 '06 #1
22 Replies


P: n/a

"Albert Oppenheimer" <sp**@spam.com> wrote in message
news:e0**********@geraldo.cc.utexas.edu...
> I thought my program had to be caught in a loop, and cancelled it through
> the task manager. It took about one second in Java, but re-implemented in
> C, it had already run over one minute.

C? Did you mean C++, or are you in the wrong newsgroup?

> I set up a debugger to display the current location each loop and let it
> run. It did reach completion, but it took 20 minutes.
>
> I replaced the calls to wcschr in my program with calls to this
> substitute:
>
> static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
>     while (TRUE) {
>         if (*s == c)
>             return s;
>         if (*s == 0)
>             return 0;
>         ++s;
>     }
> }
>
> Now my program finishes instantly, faster than the Java version, as you
> might expect.
>
> What could wcschr be doing that takes so long???
>
> I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded
> free from their web site.


I don't see that function anywhere in my reference books here. Is it a
Borland extension to strchr? If so, you could ask them (or on a borland
newsgroup) about any performance issues with their implementation of that
function.

Perhaps their free compiler is worth every penny? :-)

Or perhaps your method of calling the function (or the data you're using)
doesn't work well with the way they designed it? They'd be the ones to ask,
I think.

-Howard


Apr 4 '06 #2

P: n/a
> I don't see that function anywhere in my reference books here. Is it a
> Borland extension to strchr? If so, you could ask them (or on a borland
> newsgroup) about any performance issues with their implementation of that
> function.


wcschr is the unicode version of strchr.
It processes 16-bit characters instead of 8-bit bytes.

God knows what reference books you used. A good place to look for standard
C library functions for XP (yes, I did say this is on XP) is at

http://msdn.microsoft.com/library/de...nipulation.asp

Borland has the same standard C functions as Microsoft. And standard C
functions are basic to C++ just like standard C expressions.

Apr 4 '06 #3

P: n/a
Albert Oppenheimer wrote:
> wcschr is the unicode version of strchr.
> It processes 16-bit characters instead of 8-bit bytes.


Calling such a function "the unicode version" is misleading.

(Not to blame you - much documentation has this problem. But I _do_ blame
you for assuming wchar_t is 16 bits. Sometimes it's more!)

A function that truly handles Unicode will handle an encoding, such as
UTF-16, and it will deal correctly with the various Unicode shenanigans,
such as composite characters.

The wcs functions don't; they just treat each wchar_t element as one
hypothetical glyph. (If any do I'd be glad to know, but I know the simple
ones don't, so wcschr() probably qualifies.)

If this wcschr() did indeed process Unicode correctly, it might be a little
slow.

If all it does is iterate over wchar_t elements, then it has no reason to be
slow, and the Original Poster must look elsewhere for the problem.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 4 '06 #4

P: n/a
It often surprises me how many people in here can't admit to themselves that
they don't know, and compulsively post drivel.
Apr 4 '06 #5

P: n/a
Albert Oppenheimer wrote:
> It often surprises me how many people in here can't admit to themselves
> that they don't know, and compulsively post drivel.


Welcome to my killfile. And good luck with your problem.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #6

P: n/a
Phlip wrote:
> The wcs functions don't; they just treat each wchar_t element as one
> hypothetical glyph.

An object of type wchar_t holds a character, not a glyph. A glyph can be
made up of more than one character. In Unicode, for example, LATIN SMALL
LETTER O followed by COMBINING DIAERESIS is two characters that represent
the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
Both will show up as a single blob of stuff (a glyph) on the display screen.

> (If any do I'd be glad to know, but I know the simple
> ones don't, so wcschr() probably qualifies.)

Conforming ones don't, and that's the point: they traffic in fixed width
characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
can't be used correctly for Unicode. Which has nothing at all to do with
the original problem.

--

Pete Becker
Roundhouse Consulting, Ltd.
Apr 5 '06 #7

P: n/a
Pete Becker wrote:
> Phlip wrote:
> > The wcs functions don't; they just treat each wchar_t element as one
> > hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by COMBINING DIAERESIS is two characters that represent
> the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
> Both will show up as a single blob of stuff (a glyph) on the display screen.


Right. Deep within the "mind" of the lowly wcschr() function, such things
are hypothetical. It will match combining diaereses as if they were
independent glyphs, and won't match those which are precombined. That's why
short posts on such topics are risky, and the alternative is long boring
posts. But feel free to nitpick...
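
The element-by-element behavior described here can be checked directly. The following is a small sketch, not anything from the original posts: the arrays and the helper contains_element are illustrative names, and the characters are spelled as numeric values so the result does not depend on the source file's encoding.

```cpp
#include <cwchar>

// "ö" written two ways.
const wchar_t precomposed[] = { 0x00F6, 0 };        // LATIN SMALL LETTER O WITH DIAERESIS
const wchar_t decomposed[]  = { L'o', 0x0308, 0 };  // 'o' followed by COMBINING DIAERESIS

// True if wcschr finds the element c anywhere in s.
bool contains_element(const wchar_t* s, wchar_t c) {
    return std::wcschr(s, c) != nullptr;
}
```

Searching the decomposed string for the combining mark succeeds, while searching it for the precomposed letter fails, even though both strings would normally render as the same glyph.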

The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.

"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".

[A pause to check my post's encoding. It will go out as Western Europe,
meaning ISO Latin 1. I suspect that's also ISO 8859-1.]

[That's funny, because I thought I had it set to Greek these days for some
strange reason...]

A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.

Different locales required different encodings and character widths for
various reasons. In the beginning, there was ASCII, based on encoding the
Latin alphabet, without accent marks, into a 7-bit protocol. Early systems
reserved the 8th bit for a parity check. Then cultures with short phonetic
alphabets computerized their own glyphs. Each culture claimed the same
"high-ASCII" range of the 8 bits in a byte: the ones with the 8th bit turned
on. User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.

Modern GUIs still use code page numbers, typically defined by the
"International Standards Organization", or its member committees. The ISO
8859-7 encoding, for example, stores Latin characters in their ASCII
locations, and Greek characters in the high-ASCII.
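
To make the code page idea concrete, here is a minimal sketch of how the same high byte decodes differently under two 8-bit encodings. It is only an illustration: the function names are invented, and the Greek table is cut down to the lowercase letters alpha through omega rather than the whole of ISO 8859-7.

```cpp
#include <cstdint>

// ISO 8859-1 (Latin 1): every byte maps to the Unicode code point of the same value.
std::uint32_t latin1_to_code_point(unsigned char byte) {
    return byte;
}

// ISO 8859-7 (Greek), lowercase alpha..omega only: bytes 0xE1..0xF9 map to
// U+03B1..U+03C9. Other high bytes are left out of this sketch and return 0.
std::uint32_t greek_lowercase_to_code_point(unsigned char byte) {
    if (byte < 0x80) return byte;  // the ASCII range is shared by both encodings
    if (byte >= 0xE1 && byte <= 0xF9) return 0x03B1 + (byte - 0xE1);
    return 0;                      // not covered here
}
```

Byte 0xE1 comes out as U+00E1 (á) under Latin 1 and as U+03B1 (Greek alpha) under the Greek page, which is why text shown under the wrong code page turns into plausible-looking garbage.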

<warning topicality="off">

Internationalize a resource file to Greek like this:

LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)

STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "?p?d??? st?? ????da." // <-- imagine Greek there
END

</warning>

The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post [like this one], or in a compiled application. On
WinXP, fix this by opening the Regional and Language Options applet, and
switching the combo box labeled "Select a language to match the language
version of the non-Unicode programs you want to use" to Greek. Unless the
garbage is ? marks, in which case a library function somewhere has replaced
the garbage with placeholders.

That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.

MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.

Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of multiple bytes that
require some math formula to extract their actual char set index. These
"Multiple Byte Character Sets" support locale-specific code pages for
cultures from Arabia to Vietnam. However, you cannot put glyphs from too
many different cultures into the same string. OS support functions cannot
expect strings with mixed code pages.

Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)

Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.

ISO 10646, and the "Unicode Consortium", maintain the complete char set of
all humanity's glyphs. To reduce the total count, Unicode supplies many
shortcuts. For example, many fonts place glyph clusters, such as accented
characters, into one glyph. Unicode usually defines each glyph component
separately, and relies on software to merge glyphs into one letter. That
rule helps Unicode not fill up with all permutations of combinations of
ligating accented modified characters.

Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two glyphs
(L"n\x303"). The C languages introduce wide string literals with an L.

Text handling functions must not assume each data character is one glyph, or
compare strings using naïve character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
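
As a rough illustration of what "merging compositions" means, the sketch below folds a base letter and a following combining mark into the precomposed code point. It is a toy under stated assumptions: the three-row table and the names Composition, kCompositions, and compose_known_pairs are hypothetical, and real normalization relies on the full Unicode data tables, not a hand-written list.

```cpp
#include <cstddef>
#include <string>

// Toy composition table: (base, combining mark) -> precomposed code point.
struct Composition { wchar_t base, mark, composed; };
const Composition kCompositions[] = {
    { L'o', 0x0308, 0x00F6 },  // o + COMBINING DIAERESIS -> ö
    { L'a', 0x0308, 0x00E4 },  // a + COMBINING DIAERESIS -> ä
    { L'n', 0x0303, 0x00F1 },  // n + COMBINING TILDE     -> ñ
};

// Replaces each known base+mark pair with its precomposed form and copies
// every other element through unchanged.
std::wstring compose_known_pairs(const std::wstring& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool merged = false;
        if (i + 1 < in.size()) {
            for (const Composition& c : kCompositions) {
                if (in[i] == c.base && in[i + 1] == c.mark) {
                    out += c.composed;
                    ++i;  // skip the combining mark just consumed
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) out += in[i];
    }
    return out;
}
```

Once both strings have been passed through the same composition (or decomposition) step, a plain element-wise comparison gives the answer a reader would expect.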

The C languages support a wide character type, wchar_t (16 bits on Windows),
and a matching wcs*() function for most str*() functions. The strcmp()
function, to compare 8-bit strings, has a matching wcscmp() function to
compare wide strings. These functions return 0 when their string arguments
match.

Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:

TEST_(TestCase, Hoijarvi)
{
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};

    // Convert from the ANSI code page, decomposing accented letters.
    // The last argument is a count of WCHAR elements, not bytes.
    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof composed[0]
        );
    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

    CPPUNIT_ASSERT_EQUAL
        (
        CSTR_EQUAL,
        CompareStringW
            (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
            )
        );
}

The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.

The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.

The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diæreses.

This assertion...

CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

...reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.

The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.

If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.

Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.

UTF converts raw Unicode to encodings within characters of fixed bit widths.
MS Windows, roughly speaking, represents UTF-8 as a code page among many.
However, roughly speaking again, when an application compiles with the
_UNICODE flag turned on, and executes on a version of Windows derived from
WinNT, it obeys UTF-16 as a code page, regardless of locale.

Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.

Put another way, UTF-8 can store characters of any Unicode code point, but
Win32 programs can only easily make use of UTF-16 characters.

> Which has nothing at all to do with the original problem.


Right: wcschr() can't be slow, so something else was going on.

Get more Greek here:

http://www.greencheese.org/TheFrogs

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #8

P: n/a
On Wed, 05 Apr 2006 06:54:11 +0200, Pete Becker <pe********@acm.org>
wrote:
> Phlip wrote:
> > The wcs functions don't; they just treat each wchar_t element as one
> > hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by COMBINING DIAERESIS is two characters that represent
> the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
> Both will show up as a single blob of stuff (a glyph) on the display screen.

But there is Unicode Normalization:
http://www.unicode.org/reports/tr15/

> > (If any do I'd be glad to know, but I know the simple
> > ones don't, so wcschr() probably qualifies.)
>
> Conforming ones don't, and that's the point: they traffic in fixed width
> characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
> can't be used correctly for Unicode. Which has nothing at all to do with
> the original problem.

For major platforms (Windows, Mac, Java) you can and must assume that
normalized UTF-16 is a fixed-width encoding. Otherwise you could not
use the wcs* functions or std::wstring on those platforms, you could
not even compare wchar_t* strings or wstring objects.

Best regards,
Roland Pibinger
Apr 5 '06 #9

P: n/a
Roland Pibinger wrote:
> For major platforms (Windows, Mac, Java) you can and must assume that
> normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> use the wcs* functions or std::wstring on those platforms, you could
> not even compare wchar_t* strings or wstring objects.


Something in our GNU Linux tool stack at work uses 32-bit wchar_t. Go
figure. Writing code portable to Win32 with 16-bit wchar_t gets even more
interesting...
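
A quick way to see which situation a given compiler puts you in is to ask it. The constants below are a sketch with invented names (kWcharBits, kWcharHoldsAnyCodePoint); only sizeof, CHAR_BIT, and WCHAR_MAX come from the standard headers.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the compiler building this code.
constexpr std::size_t kWcharBits = sizeof(wchar_t) * CHAR_BIT;

// True when a single wchar_t can hold any Unicode code point (up to U+10FFFF).
constexpr bool kWcharHoldsAnyCodePoint = WCHAR_MAX >= 0x10FFFF;
```

On a typical Win32 toolchain the second constant is false, so wide strings there carry UTF-16 code units; with the 32-bit wchar_t common on GNU/Linux it is true. Portable code has to cope with both.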

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #10

P: n/a
Roland Pibinger wrote:
> On Wed, 05 Apr 2006 06:54:11 +0200, Pete Becker <pe********@acm.org>
> wrote:
> > Phlip wrote:
> > > The wcs functions don't; they just treat each wchar_t element as one
> > > hypothetical glyph.
> >
> > An object of type wchar_t holds a character, not a glyph. A glyph can be
> > made up of more than one character. In Unicode, for example, LATIN SMALL
> > LETTER O followed by COMBINING DIAERESIS is two characters that represent
> > the same glyph as the single character LATIN SMALL LETTER O WITH
> > DIAERESIS. Both will show up as a single blob of stuff (a glyph) on the
> > display screen.
>
> But there is Unicode Normalization:
> http://www.unicode.org/reports/tr15/

Yes, canonicalization can give you consistent character representations.
It still doesn't mean that a character is a glyph. There are glyphs that
require more than one character.

> > > (If any do I'd be glad to know, but I know the simple
> > > ones don't, so wcschr() probably qualifies.)
> >
> > Conforming ones don't, and that's the point: they traffic in fixed width
> > characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
> > can't be used correctly for Unicode. Which has nothing at all to do with
> > the original problem.
>
> For major platforms (Windows, Mac, Java) you can and must assume that
> normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> use the wcs* functions or std::wstring on those platforms, you could
> not even compare wchar_t* strings or wstring objects.


Unicode has more than 65536 characters, so no matter how you slice it,
you can't encode all of its characters with 16 bits. You need 21. Or you
need an encoding where for some values you need to also look at one or
more subsequent values to know what character you're dealing with.
That's how UTF-8 and UTF-16 work.
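
For readers who have not seen the mechanism, this is roughly how UTF-16 spreads a code point above U+FFFF across two 16-bit values. The function name encode_utf16 is made up for this sketch, and it does no validation of its input.

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode code point (U+0000..U+10FFFF, excluding the surrogate
// range itself) as one or two UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t code_point) {
    std::vector<std::uint16_t> units;
    if (code_point < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(code_point));
    } else {
        std::uint32_t v = code_point - 0x10000;  // 20 bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

A Latin letter such as U+00F6 stays a single unit, while a Gothic letter such as U+10330 becomes the pair 0xD800, 0xDF30.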

Again: UTF-16 is not a fixed-width encoding. Some code points require
more than one 16-bit value for their UTF-16 representation. If wchar_t
is 16 bits wide then the wcs* functions will do what they do, and that
may or may not actually be what you need to correctly analyze character
strings. For example, if you have an array of wchar_t named buf and you
want to look at the 10th character in buf, you have to count, one
character at a time, from the beginning. You can't just look at buf[10],
because there might be some characters before it that use two UTF-16 values.
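
The counting described above might look like the following sketch. The name index_of_code_point is hypothetical, and the code assumes well-formed UTF-16 in which every high surrogate is followed by a low one.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the array index where the n-th code point (0-based) starts in a
// UTF-16 buffer of `len` code units, or `len` if there are fewer code points.
std::size_t index_of_code_point(const std::uint16_t* buf, std::size_t len, std::size_t n) {
    std::size_t i = 0;
    while (i < len && n > 0) {
        bool high_surrogate = buf[i] >= 0xD800 && buf[i] <= 0xDBFF;
        i += high_surrogate ? 2 : 1;  // a surrogate pair is one code point
        --n;
    }
    return i < len ? i : len;
}
```

With a buffer that begins with one surrogate pair, code point 1 starts at index 2, not index 1, which is exactly the off-by-some error that indexing with buf[n] would introduce.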

That doesn't mean that you can't compare two arrays of wchar_t or two
wstring objects for equality. They don't know anything about character
representations, and equality just means all the bits are the same. In
fact, basic_string doesn't care a whit about what constitutes a
character. It just holds whatever elements you put into it, and it's up
to you to make sense out of them. But the main reason for adding wchar_t
to C was to get rid of the multibyte encodings that were so common with
ordinary char's. Shift-JIS is a pain to keep track of, and moving to a
wider character type eliminated the problems of not knowing where you
were in a character string. UTF-16 brings those problems back, although
to a lesser degree.

--

Pete Becker
Roundhouse Consulting, Ltd.
Apr 5 '06 #11

P: n/a
On Wed, 05 Apr 2006 19:59:48 GMT, Phlip <ph*******@gmail.com> wrote:
> Roland Pibinger wrote:
> > For major platforms (Windows, Mac, Java) you can and must assume that
> > normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> > use the wcs* functions or std::wstring on those platforms, you could
> > not even compare wchar_t* strings or wstring objects.
>
> Something in our GNU Linux tool stack at work uses 32-bit wchar_t. Go
> figure. Writing code portable to Win32 with 16-bit wchar_t gets even more
> interesting...


Well, at least std::basic_string is a template which lets you define a
uniform 16-bit fixed-width string type, even on Linux.
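
As a sketch of that idea using a later language revision than this thread had available: C++11 added char16_t and std::u16string, which is std::basic_string<char16_t>. The alias Utf16String and the helper code_unit_count are names invented here.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// The same 16-bit element type on every platform, whatever wchar_t happens to be.
using Utf16String = std::u16string;

static_assert(sizeof(Utf16String::value_type) * CHAR_BIT >= 16,
              "char16_t holds at least 16 bits");

// Number of 16-bit code units, which is not necessarily the number of characters.
std::size_t code_unit_count(const Utf16String& s) {
    return s.size();
}
```

The element width is uniform, but whether the contents are fixed-width still depends on what the program puts into the string.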

Best wishes,
Roland Pibinger
Apr 5 '06 #12

P: n/a
Pete Becker wrote:
> Unicode has more than 65536 characters, so no matter how you slice it, you
> can't encode all of its characters with 16 bits. You need 21.


And then when we share cultural exchanges with the Zargons of Vorg VII, we
will need 22. ;-)

(Note to those catching up: There are 2 reasons a character is not a glyph.
Composite characters bond two glyphlets into one glyph, and some glyphs use
code points outside the range of your base character type, requiring
multiple character glyphs.)

(Then when you edit them, sometimes glyphlets are glyphs.)

Also note that sometimes we say "character" to mean "the char or wchar_t
element of a string", and sometimes "glyph". That sucks too, and I will try
to stop.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #13

P: n/a
On Wed, 05 Apr 2006 22:03:56 +0200, Pete Becker <pe********@acm.org>
wrote:
> Roland Pibinger wrote:
> > For major platforms (Windows, Mac, Java) you can and must assume that
> > normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> > use the wcs* functions or std::wstring on those platforms, you could
> > not even compare wchar_t* strings or wstring objects.
>
> Unicode has more than 65536 characters, so no matter how you slice it,
> you can't encode all of its characters with 16 bits. You need 21. Or you
> need an encoding where for some values you need to also look at one or
> more subsequent values to know what character you're dealing with.
> That's how UTF-8 and UTF-16 work.

AFAIK, UTF-16 once was a fixed-width encoding. The question now is:
Can we assume a fixed-width UTF-16 "subset" for everyday work? IMO,
the answer is yes. I can do without the ancient Gothic alphabet in my
daily work (http://en.wikipedia.org/wiki/Gothic_alphabet).

> Again: UTF-16 is not a fixed-width encoding. Some code points require
> more than one 16-bit value for their UTF-16 representation.

An int may be larger than INT_MAX. Is this a reason to abandon ints or
to check every int for overflow?

> If wchar_t
> is 16 bits wide then the wcs* functions will do what they do, and that
> may or may not actually be what you need to correctly analyze character
> strings. For example, if you have an array of wchar_t named buf and you
> want to look at the 10th character in buf, you have to count, one
> character at a time, from the beginning. You can't just look at buf[10],
> because there might be some characters before it that use two UTF-16 values.

But that's the question. You can do it if you reasonably restrict
yourself to the canonical fixed-width UTF-16 subset. Otherwise you
had better use a special Unicode library but not C/C++ Standard functions.

> That doesn't mean that you can't compare two arrays of wchar_t or two
> wstring objects for equality. They don't know anything about character
> representations, and equality just means all the bits are the same. In
> fact, basic_string doesn't care a whit about what constitutes a
> character. It just holds whatever elements you put into it, and it's up
> to you to make sense out of them. But the main reason for adding wchar_t
> to C was to get rid of the multibyte encodings that were so common with
> ordinary char's. Shift-JIS is a pain to keep track of, and moving to a
> wider character type eliminated the problems of not knowing where you
> were in a character string. UTF-16 brings those problems back, although
> to a lesser degree.

An application can control which strings it creates and which strings
it lets in. It therefore can avoid borderline problems with UTF-16 and
use Standard functions in an uncomplicated way.
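
A gatekeeper of the kind suggested here could be as small as the sketch below, which accepts a buffer only when it contains no surrogate code units. The name is_fixed_width_subset is invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// True if the buffer contains no surrogate code units, i.e. every 16-bit unit
// is a complete code point from the Basic Multilingual Plane.
bool is_fixed_width_subset(const std::uint16_t* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] >= 0xD800 && buf[i] <= 0xDFFF) return false;
    }
    return true;
}
```

This check only guarantees one code unit per code point. It says nothing about combining sequences, so a string that passes can still contain glyphs built from more than one element.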

Best regards,
Roland Pibinger
Apr 5 '06 #14

P: n/a
Roland Pibinger wrote:
> Can we assume a fixed-width UTF-16 "subset" for everyday work?


No, because some glyphs might be composite characters.

There are two more important questions:

A. can we do text-in-text-out with no glyph awareness?
B. where do we set the envelope for business goals?

If the answer to A. is Yes, then we can freely pass text through the wcs
functions, except when wcschr() and such functions become glyph-hostile.

As soon as you need something as mundane as a regular expression, you need
smart character awareness. (That's why boost's regex opts to bond with ICU,
a character encoding library.)

The answer to B. is you should set technical goals just a little wider than
your business goals. If the business only wants to target the Western
European languages, you should _not_ design for raw Unicode. You should
enable ISO Latin 1, and should write clean code. The cleanest code has its
string literals in resource files for easy replacement, and has only a few
modules that process text. That makes upgrades to more locales easier,
without writing speculative code.

(I once had major fun porting a GUI to Greek. A reputable vendor of
internationalization tools wrote the GUI for Western Europe, and filled it
up with lots of calls to translation functions that did nothing when the
program ran in only one code-page. Activating Greek triggered bugs in every
single one of these speculative calls, because they had been written but
never tested. So, naturally, I got blamed for each bug I encountered.)

If the business side wants to widen their target, say, all the code-page
oriented locales (Greek, Russian, Arabic, etc.) then you _still_ don't
enable for Unicode. You will use it, sometimes, as an intermediate point
between translating encodings between locales.

When the business side wants Traditional Chinese, Inuit, Kannada, etc, only
_then_ do you party with your Unicode!

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #15

P: n/a
On Wed, 05 Apr 2006 20:56:52 GMT, Phlip <ph*******@gmail.com> wrote:
> If the business side wants to widen their target, say, all the code-page
> oriented locales (Greek, Russian, Arabic, etc.) then you _still_ don't
> enable for Unicode. You will use it, sometimes, as an intermediate point
> between translating encodings between locales.


I don't quite understand that argument. E.g the Windows platform
defines a "Unicode" (UTF-16) type. Why make it complicated? Why not
standardize on one character encoding ("subset") for applications
(UTF-16) and one for transport (UTF-8)?

Best wishes,
Roland Pibinger
Apr 5 '06 #16

P: n/a
Roland Pibinger wrote:
> I don't quite understand that argument. E.g the Windows platform
> defines a "Unicode" (UTF-16) type. Why make it complicated? Why not
> standardize on one character encoding ("subset") for applications
> (UTF-16) and one for transport (UTF-8)?


standards-in-standards-out

Suppose you have a chat server that must read various chat protocols. The
following dissertation is entirely made up (this time).

<deep breath>

Some IcyQueue chats send raw Unicode packed into UTF-8, packed into &U
escapes inside RTF (which is itself 7-bit). Next, all XML and HTML assumes
UTF-8 unless it declares something else. But the WaZoo chat server sends
chat as XML-style HTML packets containing undeclared ISO Latin 1, or
declared ISO 8859 variants.

Next, when you port your chat server to a Win95-derived platform, deep
Unicode support goes away (except for TextOutW and similar graphics
primitives). So you must strap-on a system to convert from UTF-x to MBCS, or
sometimes UCS2.

Next, when you chat with India, using one of the 20-something languages they
all know, you might send in ISCII, which is an alternative ASCII that covers
the Devanagari languages, and others. Windows only supports these in
Unicode, but the incoming line might be Brand X. China and Japan have the
same deal with Big5 and ... etc.

I apologize to anyone who Googled in for all these acronyms. Truth is always
stranger than fiction, especially down here in Babylon.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #17

P: n/a
JE

Albert Oppenheimer wrote:
> I thought my program had to be caught in a loop, and cancelled it through
> the task manager. It took about one second in Java, but re-implemented in
> C, it had already run over one minute.
>
> I set up a debugger to display the current location each loop and let it
> run. It did reach completion, but it took 20 minutes.
>
> I replaced the calls to wcschr in my program with calls to this substitute:
>
> static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
>     while (TRUE) {
>         if (*s == c)
>             return s;
>         if (*s == 0)
>             return 0;
>         ++s;
>     }
> }
>
> Now my program finishes instantly, faster than the Java version, as you
> might expect.
>
> What could wcschr be doing that takes so long???
>
> I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded free
> from their web site.


Are you sure your source data is zero-terminated? If you're feeding in
oriental characters, for example, your source might get out of phase
with wcschr.
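
If the length of the data is known, one way to rule out a missing terminator is to search with an explicit bound instead. This is a sketch, with find_in_buffer as an invented wrapper name around the standard wmemchr.

```cpp
#include <cstddef>
#include <cwchar>

// Searches at most `len` elements, so it never reads past the buffer even when
// no terminating zero is present. Unlike wcschr, it does not stop at a zero.
const wchar_t* find_in_buffer(const wchar_t* buf, std::size_t len, wchar_t c) {
    return std::wmemchr(buf, c, len);
}
```

If the bounded search is fast where wcschr was slow, the unbounded call was probably scanning far beyond the intended data.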

Apr 6 '06 #18

P: n/a
Roland Pibinger wrote:

AFAIK, UTF-16 once was a fixed-width encoding.
UTF-16 is not a fixed-width encoding. Unicode used to fit in 16-bit
values. It no longer does.
The question now is:
Can we assume a fixed-width UTF-16 "subset" for everyday work?
Sure. That's different from saying that UTF-16 is fixed-width, and it
runs a risk that you and your customers have to be aware of.
IMO,
the answer is yes. I can do without the ancient Gothic alphabet in my
daily work (http://en.wikipedia.org/wiki/Gothic_alphabet).

Again: UTF-16 is not a fixed-width encoding. Some code points require
more than one 16-bit value for their UTF-16 representation.

An int may be larger than INT_MAX. Is this a reason to abandon ints or
to check every int for overflow?


Of course not. But that's not the point. You have to be aware of the
limitiations that you're getting, and pay attention to them. UTF-16 is
not a fixed-width encoding.
If wchar_t
is 16 bits wide then the wcs* functions will do what they do, and that
may or may not actually be what you need to correctly analyze character
strings. For example, if you have an array of wchar_t named buf and you
want to look at the 10th character in buf, you have to count, one
character at a time, from the beginning. You can't just look at buf[10],
because there might be some characters before it that use two UTF-16 values.

But that's the question. You can do it if you resonably restrict
yourself to the canonical fixed-width UTF-16 subset. Otherwise you
better use a special Unicode library but not C/C++ Standard functions.


Yes, of course. If you don't use anything that doesn't fit in 16 bits
then you don't need to worry about things that don't fit in 16 bits.
That means you're not doing Unicode, because Unicode can't be
represented in 16 bits.

That doesn't mean that you can't compare two arrays of wchar_t or two
wstring objects for equality. They don't know anything about character
representations, and equality just means all the bits are the same. In
fact, basic_string doesn't care a whit about what constitutes a
character. It just holds whatever elements you put into it, and it's up
to you to make sense out of them. But the main reason for adding wchar_t
to C was to get rid of the multibyte encodings that were so common with
ordinary char's. Shift-JIS is a pain to keep track of, and moving to a
wider character type eliminated the problems of not knowing where you
were in a character string. UTF-16 brings those problems back, although
to a lesser degree.
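
A small sketch of what "it just holds whatever elements you put into it" means in practice, assuming C++11 and std::u16string as a stand-in for a wstring with a 16-bit wchar_t. The function name is made up for illustration.

```cpp
#include <cassert>
#include <string>

// Illustration only: the string counts and compares 16-bit elements,
// not characters.
inline void demo_units_not_characters()
{
    // One code point, U+10330 GOTHIC LETTER AHSA.
    const std::u16string ahsa = u"\U00010330";

    // size() reports code units, so one character counts as two.
    assert(ahsa.size() == 2);
    assert(ahsa[0] == 0xD800 && ahsa[1] == 0xDF30);

    // Equality is element by element, so the same two units compare equal.
    const std::u16string same_units{char16_t(0xD800), char16_t(0xDF30)};
    assert(ahsa == same_units);
}
```

Comparison for equality works fine here precisely because it never needs to know where one character ends and the next begins.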

An application can control which strings it creates and which strings
it lets in. It therefore can avoid borderline problems with UTF-16 and
use Standard functions in an uncomplicated way.


Tell that to your customers when they try to type in characters that you
reject.
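
For what it's worth, the "let in only what fits" policy being argued about could be sketched like this, assuming C++11 and char16_t code units. The names is_surrogate and is_bmp_only are hypothetical, not part of any standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// True for any surrogate code unit, high or low (0xD800..0xDFFF).
inline bool is_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDFFF; }

// Hypothetical gatekeeper: accepts a string only if every code unit is a
// complete character on its own, i.e. the text stays inside the
// fixed-width 16-bit subset.
inline bool is_bmp_only(const std::u16string& s)
{
    return std::none_of(s.begin(), s.end(), is_surrogate);
}

// Usage sketch: what such an application lets in and what it turns away.
inline void demo_gatekeeper()
{
    assert(is_bmp_only(u"Gr\u00FC\u00DFe"));   // Latin-1 range: accepted
    assert(!is_bmp_only(u"\U00010330"));       // Gothic letter: rejected
}
```

The check itself is trivial. The cost is the second assert: that is the input the customer typed and the application refused.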

--

Pete Becker
Roundhouse Consulting, Ltd.
Apr 6 '06 #19

P: n/a
Roland Pibinger wrote:
On Wed, 05 Apr 2006 20:56:52 GMT, Phlip <ph*******@gmail.com> wrote:
If the business side wants to widen their target, say, all the code-page
oriented locales (Greek, Russian, Arabic, etc.) then you _still_ don't
enable for Unicode. You will use it, sometimes, as an intermediate point
between translating encodings between locales.

I don't quite understand that argument. E.g., the Windows platform
defines a "Unicode" (UTF-16) type. Why make it complicated? Why not
standardize on one character encoding ("subset") for applications
(UTF-16) and one for transport (UTF-8)?


UTF-16 is not a fixed-width encoding, so handling characters is
complicated. You can't avoid those complications by announcing that
you're using a "Unicode" (UTF-16) type -- they're inherent in
variable-width encodings. Ask the Japanese about the problems they have
with Shift-JIS. Yes, you can avoid those problems by deciding not
to support all of Unicode, but if you do that you can't call that Unicode.

--

Pete Becker
Roundhouse Consulting, Ltd.
Apr 6 '06 #20

P: n/a
In message <oa*****************@newssvr33.news.prodigy.com> , Phlip
<ph*******@gmail.com> writes
Roland Pibinger wrote:
Can we assume a fixed-width UTF-16 "subset" for everyday work?
No, because some glyphs might be composite characters.

There are two more important questions:

A. can we do text-in-text-out with no glyph awareness?
B. where do we set the envelope for business goals?

If the answer to A. is Yes, then we can freely pass text through the wcs
functions, except when wcschr() and such functions become glyph-hostile.
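
One way a plain search turns glyph-hostile, as a minimal sketch: it assumes C++11, uses std::u16string::find as a stand-in for a wcschr()-style code-unit search, and the helper name find_unit is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for wcschr(): finds the first matching code unit
// and knows nothing about glyphs.
inline std::size_t find_unit(const std::u16string& text, char16_t unit)
{
    return text.find(unit);
}

inline void demo_composite_glyph()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single
    // accented glyph, stored as two code points.
    const std::u16string cafe = u"cafe\u0301";
    assert(cafe.size() == 5);

    // Searching for a plain 'e' hits the base letter of the accented glyph.
    assert(find_unit(cafe, u'e') == 3);

    // Searching for the precomposed U+00E9 finds nothing, although the
    // user sees that very character on screen.
    assert(find_unit(cafe, u'\u00E9') == std::u16string::npos);
}
```

Neither result is a bug in the search. It does exactly what it says on code units, and that is the problem once glyphs matter.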

As soon as you need something as mundane as a regular expression, you need
smart character awareness. (That's why boost's regex opts to bond with ICU,
a character encoding library.)

The answer to B. is you should set technical goals just a little wider than
your business goals. If the business only wants to target the Western
European languages, you should _not_ design for raw Unicode. You should
enable ISO Latin 1, and should write clean code. The cleanest code has its
string literals in resource files for easy replacement, and has only a few
modules that process text. That makes upgrades to more locales easier,
without writing speculative code.

(I once had major fun porting a GUI to Greek. A reputable vendor of
internationalization tools wrote the GUI for Western Europe, and filled it
up with lots of calls to translation functions that did nothing when the
program ran in only one code-page. Activating Greek triggered bugs in every
single one of these speculative calls, because they had been written but
never tested. So, naturally, I got blamed for each bug I encountered.)

If the business side wants to widen their target, say, all the code-page
oriented locales (Greek, Russian, Arabic, etc.) then you _still_ don't
enable for Unicode. You will use it, sometimes, as an intermediate point
between translating encodings between locales.


You'd also better ask them whether they really only want to use these
locales one at a time. The "code-page" model doesn't work too well when
you want to display Russian and Arabic simultaneously.

When the business side wants Traditional Chinese, Inuit, Kannada, etc, only
_then_ do you party with your Unicode!

Or when they ask you why your $$$ application isn't as polyglot as their
free web browser.

--
Richard Herring
Apr 6 '06 #21

P: n/a
Richard Herring wrote:
B. where do we set the envelope for business goals?
Or when they ask you why your $$$ application isn't as polyglot as their
free web browser.


Premature localization is yet another form of premature complexity, like
premature optimization, premature threading, etc.

If the business side declares they do not yet need polyglot, and if you
write polyglot code, then you won't get early feedback on your features.
Leave such features out until the business side requests them, so they
become responsible for exercising them.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 6 '06 #22

P: n/a
In message <Sb*******************@newssvr30.news.prodigy.com> , Phlip
<ph*******@gmail.com> writes
Richard Herring wrote:
B. where do we set the envelope for business goals?
Or when they ask you why your $$$ application isn't as polyglot as their
free web browser.


Premature localization is yet another form of premature complexity,


Where do you draw the line between complexity and generality?

like premature optimization, premature threading, etc.

If the business side declares they do not yet need polyglot, and if you
write polyglot code, then you won't get early feedback on your features.
Leave such features out until the business side requests them, so they
become responsible for exercising them.


Even though you then have to go back and start from scratch, because
your code's crammed with language-dependent assumptions?

--
Richard Herring
Apr 6 '06 #23
