in comp.lang.c i read:
> Does anyone have a reference to _how to actually use_ the multi-byte /
> wide functions in a real program?
the trouble is that doing so is something of a portability nightmare, at
least without resorting to facilities beyond those in the c standard.
> Specifically, I'm looking for a way to read from a text file that is
> in one multibyte encoding, manipulate the contents as wide chars, then
> write to a text file that is in a _different_ multibyte encoding.
the main issue is setting the locales properly. since there are few
standards for the meaning of locale names, and what few exist don't tend to
be strict, this means much guessing and potential failure. sometimes this is
a non-issue, as when a single known (and working) locale is used for both
input and output.
the secondary issue is library conformance; specifically whether it supports
amd1 or c99, vs plain old c89. without amd1 or later you need to read a
string then use mbstowcs to convert it to a wide string, at which point you
can manipulate the individual wchar_t values. character by character input
is not possible using just c89 facilities (unless you want to go into the
business of decoding character encodings yourself).
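
as a rough sketch of that c89 route (read a string, then mbstowcs), something
like the following could do the conversion step. line_to_wide is a name made
up for illustration; the caller is assumed to have already read the line
(with fgets, say) and to have set LC_CTYPE to match the input encoding.

```c
#include <stdlib.h>   /* mbstowcs, wchar_t, size_t */

/* hypothetical helper, not part of any library: convert one multibyte
   line to wide characters using only c89 facilities.  returns the
   number of wide characters stored (not counting the terminator), or
   (size_t)-1 if the line is invalid in the current locale's encoding
   or does not fit in out[] together with its terminator. */
size_t line_to_wide(const char *line, wchar_t *out, size_t outlen)
{
    size_t n;

    if (0 == outlen)
        return (size_t)-1;

    n = mbstowcs(out, line, outlen);
    if ((size_t)-1 == n)
        return n;               /* invalid multibyte sequence */
    if (n == outlen)
        return (size_t)-1;      /* filled up: no room for the terminator */
    return n;
}
```

once the line is in out[], each element is one character, whatever its
length was in the multibyte form.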
a program that counts upper-case characters looks nearly the same when
insensitive to locale:
#include <stdio.h>
#include <ctype.h>

int main(void)
{
    unsigned long upper = 0;
    int c;

    while (EOF != (c = getc(stdin)))
        if (isupper(c))
            upper++;
    printf("There were %lu upper-case characters.\n", upper);
    return 0;
}
as when sensitive (w/amd1 or c99 conformance):
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    unsigned long upper = 0;
    wint_t c;

    if (0 ==
        setlocale(LC_CTYPE, "")) /* environment specified locale */
    {
        fputs("your locale is invalid, the world ends\n", stderr);
        abort();
    }
    while (WEOF != (c = getwc(stdin)))
        if (iswupper(c))
            upper++;
    wprintf(L"There were %lu upper-case characters.\n", upper);
    return 0;
}
but your desire for a different locale on output makes it tricky. worse,
switching between locales can have issues, so best to get everything done
with one locale before moving to the next. you might let the user specify
each, and pray they supply valid names:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(int argc, char *argv[])
{
    unsigned long upper = 0;
    wint_t c;

    if (3 != argc)
    {
        fputs("incorrect number of arguments\n", stderr);
        fputs("supply input and output locale names\n", stderr);
        abort();
    }
    if (0 ==
        setlocale(LC_CTYPE, argv[1])) /* user specified input locale */
    {
        fputs("input locale is invalid, the world ends\n", stderr);
        abort();
    }
    /* all input is consumed before the locale is changed */
    while (WEOF != (c = getwc(stdin)))
        if (iswupper(c))
            upper++;
    if (0 ==
        setlocale(LC_ALL, argv[2])) /* user specified output locale */
    {
        fputs("output locale is invalid, the world ends\n", stderr);
        abort();
    }
    wprintf(L"There were %lu upper-case characters.\n", upper);
    return 0;
}
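
the same one-locale-at-a-time discipline can be applied to the text itself,
not just to a count. what follows is only a sketch: reencode and WIDE_MAX are
names invented here, the buffer size is arbitrary, and it assumes that a
given wchar_t value means the same character in both locales, which the
standard does not promise.

```c
#include <locale.h>   /* setlocale, LC_CTYPE */
#include <stdlib.h>   /* mbstowcs, wcstombs, wchar_t, size_t */

#define WIDE_MAX 256  /* arbitrary limit for this sketch */

/* hypothetical helper: re-encode src from in_locale's multibyte
   encoding to out_locale's.  returns 0 on success, -1 on failure.
   LC_CTYPE is left set to whichever locale was selected last. */
int reencode(const char *in_locale, const char *out_locale,
             const char *src, char *dst, size_t dstlen)
{
    wchar_t wide[WIDE_MAX];
    size_t n;

    /* everything that depends on the input encoding happens first */
    if (0 == setlocale(LC_CTYPE, in_locale))
        return -1;
    n = mbstowcs(wide, src, WIDE_MAX);
    if ((size_t)-1 == n || WIDE_MAX == n)
        return -1;      /* invalid input, or no room for the terminator */

    /* only now move on to the output locale */
    if (0 == setlocale(LC_CTYPE, out_locale))
        return -1;
    n = wcstombs(dst, wide, dstlen);
    if ((size_t)-1 == n || dstlen == n)
        return -1;      /* unrepresentable character, or truncated */
    return 0;
}
```

any manipulation of the wide characters would go between the two setlocale
calls, or before the first conversion is undone, as long as it doesn't rely
on the locale that is no longer in effect.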
though i've used wide string literals, and the associated output functions,
i haven't actually shown anything that would make them useful, because the
representation of characters in them is implementation defined, so anything
outside the basic character set may not be portable. wonderful, huh? that
isn't to say there is no way to handle it; most people would use a
localization (l10n) mechanism like catgets or gettext so that the strings
are fetched from an external resource which is aligned with the
implementation's requirements.
c99 provides a (somewhat clumsy) way to use iso-10646 characters in wide
string literals, which increases source portability -- i could have used
them here, though that would just make the "c99 isn't real" people come out
of the woodwork.
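
for the curious, the c99 mechanism in question is the universal character
name, written \uXXXX or \UXXXXXXXX. a minimal sketch with a made-up string
(cafe and cafe_length are illustrative names only); the value stored in the
wchar_t is only guaranteed to be the iso-10646 code point when the
implementation defines __STDC_ISO_10646__.

```c
#include <wchar.h>    /* wcslen, wchar_t, size_t */

/* illustrative string: "cafe" with the final e written as U+00E9
   (latin small letter e with acute) via a universal character name,
   so the source file itself stays within the basic character set. */
static const wchar_t cafe[] = L"caf\u00e9";

/* number of wide characters in the literal, not counting the terminator */
size_t cafe_length(void)
{
    return wcslen(cafe);
}
```

whether wprintf can then emit that character depends, once again, on the
locale in effect for output.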
--
a signature