mbstowcs and wcsstr problems

Kelvin Moss

Hi all,

I am trying to search within wide strings (unicode characters) using
wcsstr (on Unix). My problem is that my src or dest strings may or may
not be wide strings. The code I have written seems to fail if I apply
mbstowcs to a wide string. It works correctly if both strings are
non-wide or if I don't apply mbstowcs to a wide string.

So my questions are:
1) What's the behavior of applying mbstowcs to a wide string? I was
expecting it to leave the string unaffected.
2) Is there an API to find out whether a given string is a unicode
string and doesn't require mbstowcs?

I haven't worked with wide strings, so please correct me.

Thanks ..

Sep 5 '06 #1

Thomas Lumley

Kelvin Moss wrote:

> I am trying to search within wide strings (unicode characters) using
> wcsstr (on Unix). My problem is that my src or dest strings may or may
> not be wide strings. The code I have written seems to fail if I apply
> mbstowcs to a wide string.

Don't Do That. The definition of mbstowcs specifies that the input is
a (possibly multibyte) character string. If you pass it an argument
that is a wide string, or an array of doubles, or a picture of an
orang-utan, it won't be able to cope.

Sometimes it may be able to tell that you have lied to it (because the
argument contains something that isn't a valid multibyte character
sequence) and it will return (size_t)-1. Otherwise, if your wide
character type stores many characters with zero bytes in them (as a
16-bit or 32-bit wchar_t does for every ASCII character), mbstowcs is
likely to see one of these and think it is a terminating \0. Or worse
things may happen.
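
A minimal sketch of checking for that failure, assuming a hypothetical helper named to_wide (it is not part of any library). mbstowcs reports an invalid multibyte sequence by returning (size_t)-1, so the result should be checked before the output is used:

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string src into dst,
 * which has room for dst_len wide characters (terminator included).
 * Returns 0 on success, -1 if src is not a valid multibyte string in
 * the current locale or the result does not fit. */
int to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    if (dst_len == 0)
        return -1;

    size_t n = mbstowcs(dst, src, dst_len);

    if (n == (size_t)-1)   /* invalid multibyte sequence */
        return -1;
    if (n == dst_len)      /* buffer filled up, no room for L'\0' */
        return -1;
    return 0;
}
```

Note that this only catches input that happens to be invalid. A wide string passed in by mistake may well look like a short, valid narrow string, and then no error is reported.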

> It works correctly if both strings are non-wide
> or if I don't apply mbstowcs to a wide string.

Good.

> So my questions are:
> 1) What's the behavior of applying mbstowcs to a wide string? I was
> expecting it to leave the string unaffected.

See above

> 2) Is there an API to find out whether a given string is a unicode
> string and doesn't require mbstowcs?

No.

First note that "unicode string" is not sufficient identification.
Unicode represents any character you are likely to encounter as a
number. In addition you need to specify an "encoding" that says how
those numbers are stored.

Even if you know (as mbstowcs assumes it does) the encoding that you
use, you can't reliably tell from the contents of a piece of memory
whether it contains a multibyte character string or a wide character
string [or an array of doubles, or a picture of an orang-utan]

For example, if your multibyte encoding is UTF-8 and your wchar_t is a
32-bit unsigned int then the sequence
0x48 0x49 0x00 0x00
could be a multibyte character string "HI" followed by a terminating
\0, followed coincidentally by another \0, or (on a little-endian
machine) it could be the single wide character U+4948, which is a
Chinese character. The computer can't tell, so you have to keep track.
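
To make the ambiguity concrete, here is a small sketch that reads the same four bytes both ways. The function names are made up for illustration, and the 32-bit reading assumes least-significant-byte-first storage. The bytes 0x48 and 0x49 are the ASCII letters H and I.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The four bytes from the example above. */
static const unsigned char ambiguous[4] = { 0x48, 0x49, 0x00, 0x00 };

/* Read as a narrow string: two characters, then the terminating '\0'. */
size_t length_as_narrow(void)
{
    return strlen((const char *)ambiguous);
}

/* Read as one 32-bit unit stored least-significant byte first
 * (an assumption; a big-endian machine would see 0x48490000). */
uint32_t value_as_le32(void)
{
    return (uint32_t)ambiguous[0]
         | (uint32_t)ambiguous[1] << 8
         | (uint32_t)ambiguous[2] << 16
         | (uint32_t)ambiguous[3] << 24;
}
```

length_as_narrow() gives 2 and value_as_le32() gives 0x4948. Both readings are valid, and nothing in the bytes themselves says which one was intended.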

String manipulation was certainly easier in the old days, at least for
English-speaking people with dollars as their currency unit.

-thomas

Sep 5 '06 #2

Stephen Sprunk

"Kelvin Moss" <km**********@yahoo.com> wrote in message
news:11**********************@74g2000cwt.googlegroups.com...

> I am trying to search within wide strings (unicode characters) using
> wcsstr (on Unix). My problem is that my src or dest strings may or may
> not be wide strings. The code I have written seems to fail if I apply
> mbstowcs to a wide string. It works correctly if both strings are
> non-wide or if I don't apply mbstowcs to a wide string.

That's what one should expect.

> So my questions are:
> 1) What's the behavior of applying mbstowcs to a wide string? I was
> expecting it to leave the string unaffected.

Passing a wchar_t* to a multibyte function is invalid, since those
functions are defined to take char* parameters. They are unlikely to
work correctly, let alone pass your data through unchanged, because
you're lying to them about the type you're giving them. Your compiler
should issue a diagnostic when you do that; are you ignoring it? Or are
you using the wrong types so pervasively that the compiler can't figure
it out?

> 2) Is there an API to find out whether a given string is a unicode
> string and doesn't require mbstowcs?

It's simple deduction on your part. If you have a char*, it's a narrow
string, possibly multi-byte. If you have a wchar_t*, it's a wide
string. By definition.
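
Since the distinction lives in the static type, the compiler can make it for you. A minimal sketch, assuming a C11 compiler and a hypothetical macro name IS_WIDE_STRING:

```c
#include <wchar.h>

/* Hypothetical macro: evaluates to 1 for wide strings and 0 for
 * narrow ones, decided by the compiler from the static type.
 * Any other argument type is a compile-time error. Requires C11. */
#define IS_WIDE_STRING(s) _Generic((s),  \
    wchar_t *:       1,                  \
    const wchar_t *: 1,                  \
    char *:          0,                  \
    const char *:    0)
```

This inspects the type only, never the bytes. A wide string that has been cast to char* is reported as narrow, which is exactly why the casts are the thing to avoid.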

Note that there's no such thing as a "unicode string". There are
various encodings of characters, some narrow, some multibyte, and some
wide. "Unicode" may mean you have UCS-4, UCS-2, UTF-16, UTF-8, UTF-7,
etc. encoding; you need to think about the encoding your data is in, not
just whether it's "unicode".

> I haven't worked with wide strings, so please correct me.

Don't pass wchar_t*'s to mb functions, and don't pass char*'s to wcs
functions.

Keep your strings in the appropriate type, and convert when you need the
other type. That's all there is to it.
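
For the search described in the original question, that could look like the sketch below: the haystack stays a wchar_t*, the needle stays a char*, and the needle is converted once just before calling wcsstr. The helper name find_narrow_in_wide is made up for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: search the wide string haystack for the
 * multibyte string needle. Returns a pointer into haystack, or NULL
 * if the needle is absent, is not valid in the current locale, or
 * memory runs out. */
const wchar_t *find_narrow_in_wide(const wchar_t *haystack, const char *needle)
{
    /* A multibyte string never yields more wide characters than it
     * has bytes, so strlen(needle) + 1 elements are always enough. */
    size_t cap = strlen(needle) + 1;
    wchar_t *wide_needle = malloc(cap * sizeof *wide_needle);
    if (wide_needle == NULL)
        return NULL;

    const wchar_t *hit = NULL;
    if (mbstowcs(wide_needle, needle, cap) != (size_t)-1)
        hit = wcsstr(haystack, wide_needle);

    free(wide_needle);
    return hit;
}
```

A caller that needs to tell "not found" from "conversion failed" would want a separate status return; the sketch folds both into NULL to stay short.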

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

--
Posted via a free Usenet account from http://www.teranews.com

Sep 5 '06 #3

Kelvin Moss

Thomas Lumley wrote:

> Even if you know (as mbstowcs assumes it does) the encoding that you
> use, you can't reliably tell from the contents of a piece of memory
> whether it contains a multibyte character string or a wide character
> string [or an array of doubles, or a picture of an orang-utan]

Does mbstowcs assume it knows the encoding of the string?
Or does it try to find the encoding of the characters on its own?

I think the latter.

Thanks ..

Sep 6 '06 #4

Stephen Sprunk

"Kelvin Moss" <km**********@yahoo.com> wrote in message
news:11**********************@i42g2000cwa.googlegroups.com...

> Thomas Lumley wrote:
>> Even if you know (as mbstowcs assumes it does) the encoding that you
>> use, you can't reliably tell from the contents of a piece of memory
>> whether it contains a multibyte character string or a wide character
>> string [or an array of doubles, or a picture of an orang-utan]
>
> Does mbstowcs assume it knows the encoding of the string?
> Or does it try to find the encoding of the characters on its own?
>
> I think the latter.

It takes the encoding from the locale you set (a program starts in the
"C" locale until it calls setlocale).

Unfortunately, getting the locale set correctly is
implementation-dependent, though setlocale(LC_ALL, "") generally works
if the user's environment is set up correctly. If you want to use a
locale other than the user's default, you're on your own to figure out
how to do that.
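
A minimal sketch of that setup, with made-up helper names. The only library calls are setlocale and mbstowcs; whether the environment names an installed locale is outside the program's control, so the result of setlocale is reported, not assumed.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the locale named by the
 * environment was adopted, 0 if it is unavailable (the program then
 * stays in whatever locale it was already using). */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: convert src using whatever locale is
 * currently in effect. Returns the number of wide characters
 * written, excluding the terminator, or (size_t)-1 on failure. */
size_t to_wide_current_locale(const char *src, wchar_t *dst, size_t dst_len)
{
    return mbstowcs(dst, src, dst_len);
}
```

Call adopt_user_locale() once near the start of the program, before any conversion. Plain ASCII input converts the same way in practically every locale, so a wrong locale only shows up once non-ASCII text arrives.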

S

Sep 6 '06 #5
