Kelvin Moss wrote:
I am trying to search within wide strings (unicode characters) using
wcsstr (on Unix). My problem is that my src or dest strings may or may
not be wide strings. The code I have written seems to fail if I apply
mbstowcs to a wide string.
Don't Do That. The definition of mbstowcs specifies that the input is
a (possibly multibyte) character string. If you pass it an argument
that is a wide string, or an array of doubles, or a picture of an
orang-utan, it won't be able to cope.
Sometimes it may be able to tell that you have lied to it (because the
argument contains something that isn't a valid multibyte character
sequence) and it will return (size_t)-1. Otherwise, since most wide
characters contain zero bytes in a typical wchar_t representation, it
is likely to see one of these and think it is a terminating \0. Or
worse things may happen.
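To see the early-termination problem concretely, here is a minimal sketch. The helper name length_seen_as_narrow is made up for illustration, and it assumes wchar_t is wider than one byte (true on every common Unix).

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: how many bytes a char-based routine would read
   before hitting a zero byte, if handed a wide string by mistake. */
size_t length_seen_as_narrow(const wchar_t *ws)
{
    /* Inspecting any object through a char pointer is permitted. */
    return strlen((const char *)ws);
}
```

With a 2- or 4-byte wchar_t, length_seen_as_narrow(L"HI") comes out as 1 on a little-endian machine and 0 on a big-endian one, although wcslen(L"HI") is 2. A multibyte routine such as mbstowcs stops at the same zero byte.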
It works correctly if both strings are non
wide or if I don't apply mbstowcs on a wide string.
Good.
So my questions are
1) What's the behavior of applying mbstowcs on a wide string. I was
expecting it would have left it unaffected.
See above.
2) Is there an API to find if a given string is a unicode string and
doesn't require mbstowcs?
No.
First note that "unicode string" is not sufficient identification.
Unicode represents any character you are likely to encounter as a
number. In addition you need to specify an "encoding" that says how
those numbers are stored.
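As an illustration of the difference between the number and its encoding, here is a rough sketch of how UTF-8 stores a code point. The function name utf8_encode is made up for this example; it is not a standard library call.

```c
#include <stddef.h>

/* Hypothetical helper: write the UTF-8 encoding of code point cp into
   out and return the number of bytes used, or 0 if cp is not a valid
   Unicode scalar value. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                 /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

The same number, say U+00E9, is the two bytes 0xC3 0xA9 in UTF-8 but a single 32-bit value 0x000000E9 in a typical wchar_t.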
Even if you know (as mbstowcs assumes it does) the encoding that you
use, you can't reliably tell from the contents of a piece of memory
whether it contains a multibyte character string or a wide character
string [or an array of doubles, or a picture of an orang-utan].
For example, if your multibyte encoding is UTF-8 and your wchar_t is a
32-bit unsigned int then the sequence
0x48 0x49 0x00 0x00
could be a multibyte character string "HI" followed by a terminating
\0, followed coincidentally by another \0, or (on a little-endian
machine) it could be the single Chinese character U+4948 in a wide
string. The computer can't tell, so you have to keep track.
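One way to keep track is to carry a tag alongside each pointer and convert only the strings tagged as multibyte. What follows is just a sketch: the names text_kind, struct text, to_wide and text_contains are all invented here, and the conversion uses whatever locale is currently in effect.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical tag: the caller records what each buffer really holds. */
enum text_kind { TEXT_MULTIBYTE, TEXT_WIDE };

struct text {
    enum text_kind kind;
    const void *data;  /* char * if TEXT_MULTIBYTE, wchar_t * if TEXT_WIDE */
};

/* Return a newly allocated wide copy of t, or NULL on failure.
   The caller frees the result. */
wchar_t *to_wide(const struct text *t)
{
    if (t->kind == TEXT_WIDE) {
        /* Already wide: copy it, and never hand it to mbstowcs. */
        const wchar_t *ws = t->data;
        size_t n = wcslen(ws) + 1;
        wchar_t *copy = malloc(n * sizeof *copy);
        if (copy != NULL)
            wmemcpy(copy, ws, n);
        return copy;
    }

    /* Multibyte: there are never more wide characters than bytes,
       so strlen + 1 elements always leaves room for the terminator. */
    const char *mbs = t->data;
    size_t n = strlen(mbs) + 1;
    wchar_t *out = malloc(n * sizeof *out);
    if (out != NULL && mbstowcs(out, mbs, n) == (size_t)-1) {
        free(out);     /* invalid multibyte sequence */
        return NULL;
    }
    return out;
}

/* 1 if needle occurs in haystack, 0 if not, -1 if conversion failed. */
int text_contains(const struct text *haystack, const struct text *needle)
{
    wchar_t *h = to_wide(haystack);
    wchar_t *n = to_wide(needle);
    int result = -1;

    if (h != NULL && n != NULL)
        result = (wcsstr(h, n) != NULL);
    free(h);
    free(n);
    return result;
}
```

A call might look like text_contains(&haystack, &needle) with haystack tagged TEXT_MULTIBYTE and needle tagged TEXT_WIDE. The tag has to be set wherever the string is created, since nothing in the bytes themselves can supply it later.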
String manipulation was certainly easier in the old days, at least for
English-speaking people with dollars as their currency unit.
-thomas