why is wcschr so slow???

Albert Oppenheimer

I thought my program had to be caught in a loop, and cancelled it through
the task manager. It took about one second in Java, but re-implemented in
C, it had already run over one minute.

I set up a debugger to display the current location each loop and let it
run. It did reach completion, but it took 20 minutes.

I replaced the calls to wcschr in my program with calls to this substitute:

static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
    while (TRUE) {
        if (*s == c)
            return s;
        if (*s == 0)
            return 0;
        ++s;
    }
}

Now my program finishes instantly, faster than the Java version, as you
might expect.

What could wcschr be doing that takes so long???

I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded free
from their web site.

Apr 4 '06 #1


Howard

"Albert Oppenheimer" <sp**@spam.com> wrote in message
news:e0**********@geraldo.cc.utexas.edu...

> I thought my program had to be caught in a loop, and cancelled it through
> the task manager. It took about one second in Java, but re-implemented in
> C, it had already run over one minute.

C? Did you mean C++, or are you in the wrong newsgroup?

> I set up a debugger to display the current location each loop and let it
> run. It did reach completion, but it took 20 minutes.
>
> I replaced the calls to wcschr in my program with calls to this
> substitute:
>
> static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
>     while (TRUE) {
>         if (*s == c)
>             return s;
>         if (*s == 0)
>             return 0;
>         ++s;
>     }
> }
>
> Now my program finishes instantly, faster than the Java version, as you
> might expect.
>
> What could wcschr be doing that takes so long???
>
> I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded
> free from their web site.

I don't see that function anywhere in my reference books here. Is it a
Borland extension to strchr? If so, you could ask them (or on a borland
newsgroup) about any performance issues with their implementation of that
function.

Perhaps their free compiler is worth every penny? :-)

Or perhaps your method of calling the function (or the data you're using)
doesn't work well with the way they designed it? They'd be the ones to ask,
I think.

-Howard

Apr 4 '06 #2

Albert Oppenheimer

> I don't see that function anywhere in my reference books here. Is it a
> Borland extension to strchr? If so, you could ask them (or on a borland
> newsgroup) about any performance issues with their implementation of that
> function.

wcschr is the unicode version of strchr.
It processes 16-bit characters instead of 8-bit bytes.

God knows what reference books you used. A good place to look for standard
C library functions for XP (yes, I did say this is on XP) is at

http://msdn.microsoft.com/library/de...nipulation.asp

Borland has the same standard C functions as Microsoft. And standard C
functions are basic to C++ just like standard C expressions.

Apr 4 '06 #3

Phlip

Albert Oppenheimer wrote:

> wcschr is the unicode version of strchr.
> It processes 16-bit characters instead of 8-bit bytes.

Calling such a function "the unicode version" is misleading.

(Not to blame you - much documentation has this problem. But I _do_ blame
you for confusing wchar_t for 16-bits. Sometimes it's more!)

A function that truly handles Unicode will handle an encoding, such as
UTF-16, and it will deal correctly with the various Unicode shenanigans,
such as composite characters.

The wcs functions don't; they just treat each wchar_t element as one
hypothetical glyph. (If any do I'd be glad to know, but I know the simple
ones don't, so wcschr() probably qualifies.)

If this wcschr() did indeed process Unicode correctly, it might be a little
slow.

If all it does is iterate over wchar_t elements, then it had no reason to be
slow, and the Original Poster must look elsewhere for the problem.
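
To make that last point concrete, here is a minimal sketch of the single pass a straightforward wcschr() amounts to. The name scan_wchar is made up for illustration, and this is not Borland's or any other vendor's actual implementation. It only shows that the work is one comparison per element, so timing it against the library call over the same buffer is one way to tell whether the library routine or the surrounding program is at fault.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a plain wcschr(): one pass over the string,
   one comparison per element, stopping at the terminator. */
static const wchar_t *scan_wchar(const wchar_t *s, wchar_t c)
{
    assert(s != NULL);
    for (;; ++s) {
        if (*s == c)
            return s;        /* also matches c == L'\0', as wcschr() does */
        if (*s == L'\0')
            return NULL;     /* reached the end without a match */
    }
}
```

If both versions take comparable time on the same input, the slowdown probably lies elsewhere in the program rather than in the search itself.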

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!

Apr 4 '06 #4

Albert Oppenheimer

It often surprises me how many people in here can't admit to themselves that
they don't know, and compulsively post drivel.

Apr 4 '06 #5

Phlip

Albert Oppenheimer wrote:

> It often surprises me how many people in here can't admit to themselves
> that they don't know, and compulsively post drivel.

Welcome to my killfile. And good luck with your problem.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!

Apr 5 '06 #6

Pete Becker

Phlip wrote:

> The wcs functions don't; they just treat each wchar_t element as one
> hypothetical glyph.

An object of type wchar_t holds a character, not a glyph. A glyph can be
made up of more than one character. In Unicode, for example, LATIN SMALL
LETTER O followed by DIAERESIS is two characters that represent the same
glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS. Both
will show up as a single blob of stuff (a glyph) on the display screen.

> (If any do I'd be glad to know, but I know the simple
> ones don't, so wcschr() probably qualifies.)

Conforming ones don't, and that's the point: they traffic in fixed width
characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
can't be used correctly for Unicode. Which has nothing at all to do with
the original problem.

--

Pete Becker
Roundhouse Consulting, Ltd.

Apr 5 '06 #7

Phlip

Pete Becker wrote:

> Phlip wrote:
>> The wcs functions don't; they just treat each wchar_t element as one
>> hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by DIAERESIS is two characters that represent the same
> glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS. Both
> will show up as a single blob of stuff (a glyph) on the display screen.

Right. Deep within the "mind" of the lowly wcschr() function, such things
are hypothetical. It will match combining diaereses as if they were
independent glyphs, and won't match those that are precombined. That's why
short posts on such topics are risky, and the alternative is long boring
posts. But feel free to nitpick...

The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.

"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".

[A pause to check my post's encoding. It will go out as Western Europe,
meaning ISO Latin 1. I suspect that's also ISO 8859-1.

[That's funny, because I thought I had it set to Greek these days for some
strange reason...]

A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.

Different locales required different encodings and character widths for
various reasons. In the beginning, there was ASCII, based on encoding the
Latin alphabet, without accent marks, into a 7-bit protocol. Early systems
reserved the 8th bit for a parity check. Then cultures with short phonetic
alphabets computerized their own glyphs. Each culture claimed the same
"high-ASCII" range of the 8 bits in a byte (the ones with the 8th bit turned
on). User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.

Modern GUIs still use code page numbers, typically defined by the
"International Standards Organization", or its member committees. The ISO
8859-7 encoding, for example, stores Latin characters in their ASCII
locations, and Greek characters in the high-ASCII.

<warning topicality="off">

Internationalize a resource file to Greek like this:

LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)

STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "?p?d??? st?? ????da." // <-- imagine Greek there
END

</warning>

The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post [like this one], or in a compiled application. On
WinXP, fix this by opening the Regional and Language Options applet, and
switching the combo box labeled "Select a language to match the language
version of the non-Unicode programs you want to use" to Greek. Unless the
garbage is ? marks, in which case a library function somewhere has replaced
the garbage with placeholders.

That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.

MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.

Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of multiple bytes that
require some math formula to extract their actual char set index. These
"Multiple Byte Character Sets" support locale-specific code pages for
cultures from Arabia to Vietnam. However, you cannot put glyphs from too
many different cultures into the same string. OS support functions cannot
expect strings with mixed code pages.
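
As an illustration of the kind of "math formula" such multi-byte encodings need, here is a minimal sketch that decodes one character from UTF-8, the encoding this post comes to further down. UTF-8 is used here only as a convenient stand-in; the locale-specific code pages each have their own rules. The function name decode_utf8 is hypothetical, and the sketch does not reject overlong forms or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence starting at s, with n bytes available.
   Stores the code point in *out and returns the number of bytes consumed,
   or 0 if the input is malformed or truncated. */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    unsigned char b = s[0];
    size_t len;
    uint32_t cp;

    if (b < 0x80)                { len = 1; cp = b; }         /* plain ASCII */
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else
        return 0;                /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                /* sequence runs past the buffer */

    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;            /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *out = cp;
    return len;
}
```

A byte with its 8th bit clear comes back unchanged, while the two bytes 0xC3 0xB1 combine into the single index 0xF1.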

Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)

Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.

ISO 10646, and the "Unicode Consortium", maintain the complete char set of
all humanity's glyphs. To reduce the total count, Unicode supplies many
shortcuts. For example, many fonts place glyph clusters, such as accented
characters, into one glyph. Unicode usually defines each glyph component
separately, and relies on software to merge glyphs into one letter. That
rule helps Unicode not fill up with all permutations of combinations of
ligating accented modified characters.

Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two glyphs
(L"n\x303"). The C languages introduce 16-bit string literals with an L.

Text handling functions must not assume each data character is one glyph, or
compare strings using naive character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
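
A small portable sketch of the two spellings of ñ given above, using only standard <wchar.h> calls. The helper name same_elements and the two array names are made up for illustration. It shows that element-by-element functions see two different strings, even though a reader sees the same letter.

```c
#include <assert.h>
#include <wchar.h>

/* Returns nonzero when two wide strings hold identical element sequences,
   which is all wcscmp() can tell us. */
static int same_elements(const wchar_t *a, const wchar_t *b)
{
    assert(a != NULL && b != NULL);
    return wcscmp(a, b) == 0;
}

static const wchar_t precomposed[] = L"\xF1";    /* one element: U+00F1 */
static const wchar_t decomposed[]  = L"n\x303";  /* two: U+006E, U+0303 */
```

With these definitions, wcslen() reports 1 for the first string and 2 for the second, same_elements() reports a mismatch, and wcschr() finds an 'n' only in the decomposed form.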

The C languages support a 16-bit character type, wchar_t, and a matching
wcs*() function for every str*() function. The strcmp() function, to compare
8-bit strings, has a matching wcscmp() function to compare 16-bit strings.
These functions return 0 when their string arguments match.

Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:

// Needs <windows.h> and <string>; the assertions come from CppUnit.
TEST_(TestCase, Hoijarvi)
{
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};

    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof composed[0]  // capacity in WCHARs, not bytes
    );
    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

    CPPUNIT_ASSERT_EQUAL
    (
        CSTR_EQUAL,
        CompareStringW
        (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
        )
    );
}

The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.

The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.

The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diæreses.

This assertion...

CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

...reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.

The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.

If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.

Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.
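
For 16-bit elements, that multi-unit scheme is the UTF-16 surrogate pair. The following is a minimal sketch of how such pairs are recognised and combined. The function names are hypothetical, and uint16_t is used instead of wchar_t so the example behaves the same whatever the platform's wchar_t width happens to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a UTF-16 buffer of n code units. A high surrogate
   (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) counts once;
   an unpaired surrogate is counted as one element. */
static size_t utf16_code_points(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i, ++count) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;  /* skip the low half of the pair */
    }
    return count;
}

/* Combines a valid surrogate pair into the code point it encodes. */
static uint32_t utf16_combine(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) |
                       (uint32_t)(lo - 0xDC00));
}
```

A function that walks the buffer one element at a time, as the wcs*() family does, would treat each half of such a pair as a separate character.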

UTF converts raw Unicode to encodings within characters of fixed bit widths.
MS Windows, roughly speaking, represents UTF-8 as a code page among many.
However, roughly speaking again, when an application compiles with the
_UNICODE flag turned on, and executes on a version of Windows derived from
WinNT, it obeys UTF-16 as a code page, regardless of locale.

Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.

Put another way, UTF-8 can store characters of any UNICODE code point, but
Win32 programs can only easily make use of UTF-16 characters.

> Which has nothing at all to do with the original problem.

Right: wcschr() can't be slow, so something else was going on.

Get more Greek here:

http://www.greencheese.org/TheFrogs

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!

Apr 5 '06 #8

Roland Pibinger

On Wed, 05 Apr 2006 06:54:11 +0200, Pete Becker <pe********@acm.org>
wrote:

> Phlip wrote:
>> The wcs functions don't; they just treat each wchar_t element as one
>> hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by DIAERESIS is two characters that represent the same
> glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS. Both
> will show up as a single blob of stuff (a glyph) on the display screen.

But there is Unicode Normalization:
http://www.unicode.org/reports/tr15/

>> (If any do I'd be glad to know, but I know the simple
>> ones don't, so wcschr() probably qualifies.)
>
> Conforming ones don't, and that's the point: they traffic in fixed width
> characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
> can't be used correctly for Unicode. Which has nothing at all to do with
> the original problem.

For major platforms (Windows, Mac, Java) you can and must assume that
normalized UTF-16 is a fixed-width encoding. Otherwise you could not
use the wcs* functions or std::wstring on those platforms, you could
not even compare wchar_t* strings or wstring objects.

Best regards,
Roland Pibinger

Apr 5 '06 #9

Phlip

Roland Pibinger wrote:

> For major platforms (Windows, Mac, Java) you can and must assume that
> normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> use the wcs* functions or std::wstring on those platforms, you could
> not even compare wchar_t* strings or wstring objects.

Something in our GNU Linux tool stack at work uses 32-bit wchar_t. Go
figure. Writing code portable to Win32 with 16-bit wchar_t gets even more
interesting...
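
One way to keep such code honest is to ask the platform how wide wchar_t is instead of assuming it. A minimal sketch using the standard WCHAR_MAX macro follows; the function names are made up for illustration, and the two-element case assumes the 16-bit platform stores UTF-16.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Nonzero when a single wchar_t can hold any Unicode code point
   (true for a 32-bit wchar_t, false for a 16-bit one). */
static int wchar_is_wide_enough(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Number of wchar_t elements needed to store code point cp on this
   platform: 2 only when wchar_t is too narrow and cp lies beyond U+FFFF. */
static int wide_units_needed(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (wchar_is_wide_enough())
        return 1;
    return cp > 0xFFFF ? 2 : 1;
}
```

Code that sizes buffers or steps through strings with a helper like this at least states its assumption in one place.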

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!

Apr 5 '06 #10
