why is wcschr so slow???

I thought my program had to be caught in a loop, and cancelled it through
the task manager. It took about one second in Java, but re-implemented in
C, it had already run over one minute.

I set up a debugger to display the current location each loop and let it
run. It did reach completion, but it took 20 minutes.

I replaced the calls to wcschr in my program with calls to this substitute:

static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
    while (TRUE) {
        if (*s == c)
            return s;
        if (*s == 0)
            return 0;
        ++s;
    }
}

Now my program finishes instantly, faster than the Java version, as you
might expect.

What could wcschr be doing that takes so long???

I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded free
from their web site.
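
One way to check whether the library call itself is the bottleneck is to time it against a hand-written loop on the same input. The following is only a sketch in standard C, not the program from this post: alt_wcschr, lib_wcschr, search_fn and time_search are names made up for the illustration, wchar_t stands in for the Windows WCHAR typedef, and clock() reports CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <wchar.h>

/* Hand-written search with the same contract as wcschr(). */
static wchar_t *alt_wcschr(const wchar_t *s, wchar_t c)
{
    assert(s != NULL);
    for (;; ++s) {
        if (*s == c)
            return (wchar_t *)s;   /* also matches the terminator when c == 0 */
        if (*s == L'\0')
            return NULL;
    }
}

/* Thin wrapper so both searches share one function-pointer type. */
static wchar_t *lib_wcschr(const wchar_t *s, wchar_t c)
{
    return (wchar_t *)wcschr(s, c);
}

typedef wchar_t *(*search_fn)(const wchar_t *, wchar_t);

/* Calls find() reps times on text and returns the elapsed CPU seconds.
   The last result is stored through last so the caller can compare hits. */
static double time_search(search_fn find, const wchar_t *text, wchar_t needle,
                          long reps, wchar_t **last)
{
    clock_t start = clock();
    wchar_t *hit = NULL;
    long i;

    for (i = 0; i < reps; ++i)
        hit = find(text, needle);
    *last = hit;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_search() once with lib_wcschr and once with alt_wcschr, on the same buffer and repetition count, gives two numbers that can be compared directly. If they are close, the slowdown probably lies in how the search is being called rather than in wcschr itself.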

Apr 4 '06 #1

"Albert Oppenheimer" <sp**@spam.co m> wrote in message
news:e0**********@geraldo.cc.utexas.edu...
> I thought my program had to be caught in a loop, and cancelled it through
> the task manager. It took about one second in Java, but re-implemented in
> C, it had already run over one minute.

C? Did you mean C++, or are you in the wrong newsgroup?

> I set up a debugger to display the current location each loop and let it
> run. It did reach completion, but it took 20 minutes.
>
> I replaced the calls to wcschr in my program with calls to this
> substitute:
>
> static WCHAR* altchr(register WCHAR* s, register WCHAR c) {
>     while (TRUE) {
>         if (*s == c)
>             return s;
>         if (*s == 0)
>             return 0;
>         ++s;
>     }
> }
>
> Now my program finishes instantly, faster than the Java version, as you
> might expect.
>
> What could wcschr be doing that takes so long???
>
> I'm on XP, using the Borland 5.5.1 C++ compiler that can be downloaded
> free from their web site.


I don't see that function anywhere in my reference books here. Is it a
Borland extension to strchr? If so, you could ask them (or on a borland
newsgroup) about any performance issues with their implementation of that
function.

Perhaps their free compiler is worth every penny? :-)

Or perhaps your method of calling the function (or the data you're using)
doesn't work well with the way they designed it? They'd be the ones to ask,
I think.

-Howard


Apr 4 '06 #2
> I don't see that function anywhere in my reference books here. Is it a
> Borland extension to strchr? If so, you could ask them (or on a borland
> newsgroup) about any performance issues with their implementation of that
> function.


wcschr is the unicode version of strchr.
It processes 16-bit characters instead of 8-bit bytes.

God knows what reference books you used. A good place to look for standard
C library functions for XP (yes, I did say this is on XP) is at

http://msdn.microsoft.com/library/de...nipulation.asp

Borland has the same standard C functions as Microsoft. And standard C
functions are basic to C++ just like standard C expressions.

Apr 4 '06 #3
Albert Oppenheimer wrote:
> wcschr is the unicode version of strchr.
> It processes 16-bit characters instead of 8-bit bytes.


Calling such a function "the unicode version" is misleading.

(Not to blame you - much documentation has this problem. But I _do_ blame
you for assuming wchar_t is always 16 bits. Sometimes it's more!)

A function that truly handles Unicode will handle an encoding, such as
UTF-16, and it will deal correctly with the various Unicode shenanigans,
such as composite characters.

The wcs functions don't; they just treat each wchar_t element as one
hypothetical glyph. (If any do I'd be glad to know, but I know the simple
ones don't, so wcschr() probably qualifies.)

If this wcschr() did indeed process Unicode correctly, it might be a little
slow.

If all it does is iterate over wchar_t elements, then it had no reason to be
slow, and the Original Poster must look elsewhere for the problem.
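
The width of wchar_t can be checked rather than assumed. The following is a rough sketch in standard C; wchar_bits, wchar_holds_any_code_point and units_for_code_point are hypothetical helper names, and the last one assumes the wide encoding is UTF-16 when wchar_t is 16 bits wide and UTF-32 otherwise, which the C standard does not guarantee.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on the implementation compiling this file. */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero when one wchar_t can hold any code point up to U+10FFFF. */
static int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}

/* Number of wchar_t elements one code point occupies, under the
   UTF-16 / UTF-32 assumption stated above. */
static unsigned units_for_code_point(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);
    if (wchar_holds_any_code_point())
        return 1;
    return cp > 0xFFFFul ? 2 : 1;
}
```

A function such as wcschr() only ever looks at one element at a time, so on an implementation where units_for_code_point() can return 2 it cannot find a character outside the 16-bit range as a single unit.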

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 4 '06 #4
It often surprises me how many people in here can't admit to themselves that
they don't know, and compulsively post drivel.
Apr 4 '06 #5
Albert Oppenheimer wrote:
> It often surprises me how many people in here can't admit to themselves
> that they don't know, and compulsively post drivel.


Welcome to my killfile. And good luck with your problem.

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #6
Phlip wrote:

> The wcs functions don't; they just treat each wchar_t element as one
> hypothetical glyph.

An object of type wchar_t holds a character, not a glyph. A glyph can be
made up of more than one character. In Unicode, for example, LATIN SMALL
LETTER O followed by COMBINING DIAERESIS is two characters that represent
the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
Both will show up as a single blob of stuff (a glyph) on the display screen.

> (If any do I'd be glad to know, but I know the simple
> ones don't, so wcschr() probably qualifies.)


Conforming ones don't, and that's the point: they traffic in fixed width
characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
can't be used correctly for Unicode. Which has nothing at all to do with
the original problem.
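
To make the "not fixed-width" point concrete, here is a small sketch of how UTF-16 encodes a code point. The function name utf16_encode is made up for this illustration; the arithmetic is the standard surrogate-pair scheme, and uint16_t is used instead of wchar_t so the example does not depend on the platform's wchar_t width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogates are not characters */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* fits in one 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

With this sketch, U+00F6 takes one unit while U+1D11E (MUSICAL SYMBOL G CLEF) takes the two units 0xD834 0xDD1E, so a search that compares single 16-bit elements can never match it as one character.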

--

Pete Becker
Roundhouse Consulting, Ltd.
Apr 5 '06 #7
Pete Becker wrote:
> Phlip wrote:
>> The wcs functions don't; they just treat each wchar_t element as one
>> hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by COMBINING DIAERESIS is two characters that represent
> the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
> Both will show up as a single blob of stuff (a glyph) on the display screen.


Right. Deep within the "mind" of the lowly wcschr() function, such things
are hypothetical. It will match combining diaereses as if they were
independent glyphs, and won't match those which are precombined. That's why
short posts on such topics are risky, and the alternative is long boring
posts. But feel free to nitpick...

The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.

"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".

[A pause to check my post's encoding. It will go out as Western Europe,
meaning ISO Latin 1. I suspect that's also ISO 8859-1.

[That's funny, because I thought I had it set to Greek these days for some
strange reason...]

A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.

Different locales required different encodings and character widths for
various reasons. In the beginning, there was ASCII, based on encoding the
Latin alphabet, without accent marks, into a 7-bit protocol. Early systems
reserved the 8th bit for a parity check. Then cultures with short phonetic
alphabets computerized their own glyphs. Each culture claimed the same
"high-ASCII" range of the 8 bits in a byte: the ones with the 8th bit turned
on. User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.

Modern GUIs still use code page numbers, typically defined by the
"International Standards Organization", or its member committees. The ISO
8859-7 encoding, for example, stores Latin characters in their ASCII
locations, and Greek characters in the high-ASCII.
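
As a rough sketch of what "selecting a code page" means, the function below decodes the same byte under two character sets. The names charset, decode_byte and REPLACEMENT are invented for this example, and the ISO 8859-7 branch is deliberately partial: it only covers the Greek letter ranges, where the byte value plus 0x2D0 gives the Unicode code point, and reports everything else in the high range as U+FFFD.

```c
#include <assert.h>
#include <stdint.h>

enum charset { CS_ISO_8859_1, CS_ISO_8859_7 };

#define REPLACEMENT 0xFFFDu   /* U+FFFD, used for bytes this sketch skips */

/* Maps one byte to a Unicode code point under the chosen character set. */
static uint32_t decode_byte(enum charset cs, unsigned char b)
{
    if (b < 0x80)
        return b;                     /* ASCII range is shared by both */

    switch (cs) {
    case CS_ISO_8859_1:
        return b;                     /* Latin 1 bytes equal their code points */
    case CS_ISO_8859_7:
        /* Greek letters only: 0xC1..0xD1 and 0xD3..0xFE. */
        if ((b >= 0xC1 && b <= 0xD1) || (b >= 0xD3 && b <= 0xFE))
            return (uint32_t)b + 0x2D0;
        return REPLACEMENT;           /* punctuation etc. left out on purpose */
    }
    assert(!"unknown charset");
    return REPLACEMENT;
}
```

The byte 0xE1 comes out as U+00E1 (á) under ISO 8859-1 and as U+03B1 (α) under ISO 8859-7, which is why text read with the wrong code page turns into garbage.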

<warning topicality="off">

Internationalize a resource file to Greek like this:

LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)

STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "?p?d??? st?? ????da." // <-- imagine Greek there
END

</warning>

The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post [like this one], or in a compiled application. On
WinXP, fix this by opening the Regional and Language Options applet, and
switching the combo box labeled "Select a language to match the language
version of the non-Unicode programs you want to use" to Greek. If the garbage
is ? marks, however, a library function somewhere has already replaced the
original characters with placeholders.

That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.

MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.

Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of multiple bytes that
require some math formula to extract their actual char set index. These
"Multiple Byte Character Sets" support locale-specific code pages for
cultures from Arabia to Vietnam. However, you cannot put glyphs from too
many different cultures into the same string. OS support functions cannot
expect strings with mixed code pages.

Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)

Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.

ISO 10646, and the "Unicode Consortium", maintain the complete char set of
all humanity's glyphs. To reduce the total count, Unicode supplies many
shortcuts. For example, many fonts place glyph clusters, such as accented
characters, into one glyph. Unicode usually defines each glyph component
separately, and relies on software to merge glyphs into one letter. That
rule helps Unicode not fill up with all permutations of combinations of
ligating accented modified characters.

Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two glyphs
(L"n\x303"). The C languages introduce 16-bit string literals with an L.

Text handling functions must not assume each data character is one glyph, or
compare strings using naïve character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
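
To illustrate what "merging compositions" involves, here is a toy sketch in standard C. It is not Unicode normalization: the table pairs and the function compose_known_pairs are made up for this example and know only three combinations, whereas a real implementation works from the full Unicode data tables.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* A few (base, combining mark) -> precomposed mappings, for illustration. */
static const struct { wchar_t base, mark, composed; } pairs[] = {
    { L'o', 0x0308, 0x00F6 },   /* o + COMBINING DIAERESIS -> ö */
    { L'a', 0x0308, 0x00E4 },   /* a + COMBINING DIAERESIS -> ä */
    { L'n', 0x0303, 0x00F1 },   /* n + COMBINING TILDE     -> ñ */
};

/* Rewrites s in place, merging the pairs listed above.
   Returns the new length in wchar_t elements. */
static size_t compose_known_pairs(wchar_t *s)
{
    size_t in = 0, out = 0, k;

    assert(s != NULL);
    while (s[in] != L'\0') {
        wchar_t c = s[in++];
        for (k = 0; k < sizeof pairs / sizeof pairs[0]; ++k) {
            if (c == pairs[k].base && s[in] == pairs[k].mark) {
                c = pairs[k].composed;
                ++in;               /* consume the combining mark as well */
                break;
            }
        }
        s[out++] = c;               /* out never overtakes in */
    }
    s[out] = L'\0';
    return out;
}
```

Before composition, wcscmp() reports L"n\x303" and L"\xF1" as different strings; after running the decomposed one through compose_known_pairs() they compare equal, which is the step a plain element-by-element comparison leaves out.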

The C languages support a 16-bit character type, wchar_t, and a matching
wcs*() function for every str*() function. The strcmp() function, to compare
8-bit strings, has a matching wcscmp() function to compare 16-bit strings.
These functions return 0 when their string arguments match.

Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:

TEST_(TestCase, Hoijarvi)
{
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};

    // Convert from the ANSI code page to UTF-16, decomposing accented letters.
    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof composed[0]  // buffer size in WCHARs, not bytes
    );
    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

    CPPUNIT_ASSERT_EQUAL
    (
        CSTR_EQUAL,
        CompareStringW
        (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
        )
    );
}

The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.

The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.

The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diæreses.

This assertion...

CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

...reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.

The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.

If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.

Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.

UTF converts raw Unicode to encodings within characters of fixed bit widths.
MS Windows, roughly speaking, represents UTF-8 as a code page among many.
However, roughly speaking again, when an application compiles with the
_UNICODE flag turned on, and executes on a version of Windows derived from
WinNT, it obeys UTF-16 as a code page, regardless of locale.

Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.

Put another way, UTF-8 can store characters of any Unicode code point, but
Win32 programs can only easily make use of UTF-16 characters.
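
For comparison with UTF-16, here is a sketch of how UTF-8 packs a code point into one to four bytes. The function name utf8_encode is invented for this example; the bit layout is the standard UTF-8 scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8.
   Writes 1 to 4 bytes to out and returns how many were written. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);   /* surrogates are not characters */

    if (cp < 0x80) {                      /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: 1110xxxx + 2 trail bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 21 bits: 11110xxx + 3 */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Bytes below 0x80 stay ASCII, and every byte of a longer sequence has its 8th bit set, which matches the multi-byte pattern described earlier in this post. U+00F1 (ñ), for example, becomes the two bytes C3 B1.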

> Which has nothing at all to do with the original problem.


Right: wcschr() can't be slow, so something else was going on.

Get more Greek here:

http://www.greencheese.org/TheFrogs

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #8
On Wed, 05 Apr 2006 06:54:11 +0200, Pete Becker <pe********@acm.org>
wrote:
> Phlip wrote:
>> The wcs functions don't; they just treat each wchar_t element as one
>> hypothetical glyph.
>
> An object of type wchar_t holds a character, not a glyph. A glyph can be
> made up of more than one character. In Unicode, for example, LATIN SMALL
> LETTER O followed by COMBINING DIAERESIS is two characters that represent
> the same glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS.
> Both will show up as a single blob of stuff (a glyph) on the display screen.

But there is Unicode Normalization:
http://www.unicode.org/reports/tr15/

>> (If any do I'd be glad to know, but I know the simple
>> ones don't, so wcschr() probably qualifies.)
>
> Conforming ones don't, and that's the point: they traffic in fixed width
> characters. UTF-16 is not a fixed-width encoding, so a 16-bit wchar_t
> can't be used correctly for Unicode. Which has nothing at all to do with
> the original problem.


For major platforms (Windows, Mac, Java) you can and must assume that
normalized UTF-16 is a fixed-width encoding. Otherwise you could not
use the wcs* functions or std::wstring on those platforms, you could
not even compare wchar_t* strings or wstring objects.
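
Whether that assumption holds for a particular string can be checked. The sketch below scans a buffer of UTF-16 code units for surrogates; is_fixed_width_utf16 is a name made up for this illustration, and it says nothing about combining sequences, only about whether every code point fits in one 16-bit unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when no unit in units[0..len) is a surrogate, i.e. when every
   code point in the buffer occupies exactly one 16-bit unit; 0 otherwise. */
static int is_fixed_width_utf16(const uint16_t *units, size_t len)
{
    size_t i;

    assert(units != NULL || len == 0);
    for (i = 0; i < len; ++i) {
        if (units[i] >= 0xD800 && units[i] <= 0xDFFF)
            return 0;   /* half of a surrogate pair: two units, one character */
    }
    return 1;
}
```

For text that passes this check, element-wise functions such as wcschr() do see one code point per element; for text that fails it, they see the halves of a pair separately.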

Best regards,
Roland Pibinger
Apr 5 '06 #9
Roland Pibinger wrote:
> For major platforms (Windows, Mac, Java) you can and must assume that
> normalized UTF-16 is a fixed-width encoding. Otherwise you could not
> use the wcs* functions or std::wstring on those platforms, you could
> not even compare wchar_t* strings or wstring objects.


Something in our GNU Linux tool stack at work uses 32-bit wchar_t. Go
figure. Writing code portable to Win32 with 16-bit wchar_t gets even more
interesting...

--
Phlip
http://www.greencheese.org/ZeekLand <-- NOT a blog!!!
Apr 5 '06 #10
