Wide character to multi-byte

PEK
I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly in Windows and know how to
solve it there, but I would like to have some platform-independent
code too.

I have tried mbstowcs/wcstombs but I'm not satisfied with the
result. If wcstombs finds a character that can't be converted, it returns
(size_t)-1 and stops. I would like to replace such characters with some
special character and convert as much as possible.
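
To illustrate the all-or-nothing behaviour described above, here is a minimal sketch. The helper name narrow_all_or_nothing is hypothetical, and the result depends on the C locale in effect (the "C" locale at program startup unless setlocale has been called).

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts the whole string with std::wcstombs and
// reports failure, to show that one bad character aborts the conversion.
bool narrow_all_or_nothing(const std::wstring& src, std::string& dest)
{
    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    std::vector<char> buf(src.size() * MB_CUR_MAX + 1);

    std::size_t written = std::wcstombs(&buf[0], src.c_str(), buf.size());
    if (written == static_cast<std::size_t>(-1))
        return false;               // some character had no multibyte representation

    assert(written < buf.size());   // the terminator always fits in this buffer
    dest.assign(&buf[0], written);
    return true;
}
```

In the default "C" locale a string such as L"a\u20ACb" is expected to make this function return false, with no indication of which character failed.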

So I have written my own functions, based on mbtowc and wctomb. I have
successfully converted text from and to different codepages (I have
tried 437, 1252 and 949 [Korean, with some characters that take two
bytes]). So I think the code is OK, but I would appreciate it if someone
else looked at it (so I have someone to blame ;-).

The code:

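void ConvertCharToWstring(const char* from, wstring &to)
{
    to = L"";

    //Reset mbtowc's internal shift state before starting
    mbtowc(NULL, NULL, 0);

    size_t pos = 0;
    wchar_t temp;

    while(true)
    {
        //mbtowc returns an int: bytes consumed, 0 at the
        //terminating null, or -1 on an invalid sequence
        int len = mbtowc(&temp, from+pos, MB_CUR_MAX);

        //Found end
        if(len == 0)
            return;
        else if(len < 0)
        {
            //Unknown character: reset the state and skip one byte
            mbtowc(NULL, NULL, 0);
            pos++;
        }
        else
        {
            to += temp;
            pos += len;
        }
    }
}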

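void ConvertWcharToString
    (const wchar_t* from, string &to,
     bool* datalost, char unknownchar)
{
    to = "";

    //Start from a known value so the caller need not initialise the flag
    if(datalost != NULL)
        *datalost = false;

    //MB_LEN_MAX (from <climits>) is a compile-time upper bound for
    //MB_CUR_MAX, so no heap allocation is needed
    char temp[MB_LEN_MAX];

    //Reset wctomb's internal shift state before starting
    wctomb(NULL, 0);

    while(*from != L'\0')
    {
        //wctomb returns an int: bytes written, or -1 on error
        int len = wctomb(temp, *from);

        if(len < 0)
        {
            //Replace with unknown character
            to += unknownchar;

            if(datalost != NULL)
                *datalost = true;
        }
        else
        {
            //Copy all bytes of the converted character
            to.append(temp, len);
        }

        from++;
    }
}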

/PEK

Jul 22 '05 #1
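
A related sketch, not part of the original post: the restartable functions from <cwchar> (wcrtomb and mbrtowc) keep the conversion state in a caller-supplied mbstate_t instead of hidden static state, and they return size_t, so the comparison with (size_t)-1 needs no cast tricks. The function name narrow_with_replacement is hypothetical, and the output depends on the C locale in effect.

```cpp
#include <cassert>
#include <climits>
#include <cwchar>
#include <string>

// Hypothetical helper: wide -> multibyte in the current C locale, replacing
// every unconvertible character with `unknown` and reporting it via `lost`.
std::string narrow_with_replacement(const wchar_t* from, char unknown, bool* lost)
{
    assert(from != NULL);

    std::string to;
    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    char buf[MB_LEN_MAX];                      // large enough for any one character
    if (lost) *lost = false;

    for (; *from != L'\0'; ++from) {
        std::size_t len = std::wcrtomb(buf, *from, &state);
        if (len == static_cast<std::size_t>(-1)) {
            to += unknown;                     // no representation in this locale
            if (lost) *lost = true;
            state = std::mbstate_t();          // state is unspecified after an error
        } else {
            to.append(buf, len);
        }
    }
    return to;
}
```

Called as narrow_with_replacement(L"a\u20ACb", '?', &lost) in the default "C" locale, this is expected to yield "a?b" with lost set to true. For stateful encodings a complete version would also emit the closing shift sequence at the end; that step is left out here.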
"PEK" <pe*****@home.se> wrote in message
news:41***************@news.individual.net...
> I need some code that converts a multi-byte string to a Unicode string,
> and Unicode to multi-byte. I work mostly in Windows and know how to
> solve it there, but I would like to have some platform independent
> code too.
> /PEK


// wide-char to multibyte:
wstring source = L"something";
typedef ctype<wchar_t> CT;
size_t length = source.length();
char *result = new char[length];
CT const& ct = use_facet<CT>(locale());
// narrow() writes exactly one char per wchar_t;
// characters it cannot convert become 'X'
ct.narrow(source.data(), source.data() + source.size(), 'X', result);
string dest(result, length);
delete[] result;
return dest;

For the reverse, use ct.widen instead (and make source a string and dest a
wstring, of course).
This uses the global C++ locale, which at program startup is the classic "C"
locale, *not* the system locale. To set a specific locale, use:
locale::global(locale("Dutch_Netherlands"));
(Locale names are platform-specific; this is the Windows spelling.)
At least on Windows with VC, this sets the global locale to the system
locale:
locale::global(locale(""));

Note that this won't handle actual multi-byte character sets, i.e. character
sets with more than 256 characters (e.g. JIS); those characters will not get
converted properly. I know of no standard way to handle those, just the
Windows WideCharToMultiByte function.

--
Unforgiven

Jul 22 '05 #2
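
The ctype approach above can be sketched as a pair of self-contained functions. This is an illustration only: the names narrow_with_default and widen_string are hypothetical, and a std::vector replaces the manual new[]/delete[] so the buffer is released even if an exception is thrown.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: wide -> narrow via ctype<wchar_t>::narrow.
// Characters with no single-char equivalent become `dflt`.
std::string narrow_with_default(const std::wstring& source, char dflt,
                                const std::locale& loc = std::locale())
{
    if (source.empty()) return std::string();
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);

    std::vector<char> result(source.size());
    assert(result.size() == source.size());   // one output char per input wchar_t
    ct.narrow(source.data(), source.data() + source.size(), dflt, &result[0]);
    return std::string(result.begin(), result.end());
}

// Hypothetical helper: narrow -> wide via ctype<wchar_t>::widen.
std::wstring widen_string(const std::string& source,
                          const std::locale& loc = std::locale())
{
    if (source.empty()) return std::wstring();
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);

    std::vector<wchar_t> result(source.size());
    ct.widen(source.data(), source.data() + source.size(), &result[0]);
    return std::wstring(result.begin(), result.end());
}
```

Because narrow and widen map element to element, the output always has the same length as the input, which is exactly why this route cannot produce or consume multi-byte sequences.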
PEK wrote:
> I need some code that converts a multi-byte string to a Unicode string,
> and Unicode to multi-byte. I work mostly in Windows and know how to
> solve it there, but I would like to have some platform independent
> code too.


The standard C++ solution is to use codecvt facets. Currently these are a bit
hard to use, but there is a proposal to add several components which would make
it easier. See

http://www.open-std.org/jtc1/sc22/wg...004/n1683.html.

In the meantime, both the Boost Serialization library and the soon-to-be-released
Boost Iostreams

http://home.comcast.net/~jturkanis/i.../doc/?path=5.6

library contain code conversion components. (The documentation for the iostreams
code conversion component is temporarily out-of-sync with the source.)

You can also use the Dinkumware CoreX library, which is reasonably priced and is
the basis for n1683.

Jonathan

Jul 22 '05 #3
Unforgiven wrote:
> "PEK" <pe*****@home.se> wrote in message
> news:41***************@news.individual.net...
>> I need some code that converts a multi-byte string to a Unicode
>> string, and Unicode to multi-byte. I work mostly in Windows and know
>> how to solve it there, but I would like to have some platform
>> independent code too.
>> /PEK
>
> Note that this won't handle actual multi-byte character sets, i.e.
> character sets with characters > 256 (e.g. JIS), those characters
> will not get converted properly. I know of no standard way to handle
> those, just the WideCharToMultiByte windows method.


Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters. The
preferred C++ solution is to use a codecvt facet instead of a ctype facet.

Jonathan
Jul 22 '05 #4
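
As a rough sketch of the codecvt route mentioned above (an illustration, not code from the thread): the facet codecvt<wchar_t, char, mbstate_t> reports an error result at the first unconvertible character, so a loop can substitute a replacement and carry on. The function name narrow_with_codecvt and the 64-byte chunk size are arbitrary choices made for this example.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;

// Hypothetical helper: wide -> multibyte using the locale's codecvt facet,
// replacing unconvertible characters with `unknown`.
std::string narrow_with_codecvt(const std::wstring& src, char unknown, bool* lost,
                                const std::locale& loc = std::locale())
{
    const Converter& cvt = std::use_facet<Converter>(loc);
    std::mbstate_t state = std::mbstate_t();
    std::string out;
    if (lost) *lost = false;

    const wchar_t* from = src.data();
    const wchar_t* const from_end = from + src.size();
    char buf[64];                                   // output is produced in chunks

    while (from != from_end) {
        const wchar_t* from_next = from;
        char* to_next = buf;
        const std::codecvt_base::result r =
            cvt.out(state, from, from_end, from_next, buf, buf + sizeof buf, to_next);
        assert(from_next >= from && to_next >= buf);

        out.append(buf, to_next - buf);             // keep whatever was converted
        const bool progressed = (from_next != from) || (to_next != buf);
        from = from_next;

        if (r == std::codecvt_base::error) {
            out += unknown;                         // from now points at the bad character
            if (lost) *lost = true;
            if (from != from_end) ++from;           // skip it and continue
            state = std::mbstate_t();               // do not reuse the state after an error
        } else if (r == std::codecvt_base::noconv || !progressed) {
            break;                                  // nothing more this facet can do
        }
        // partial: the chunk buffer filled up, so loop again; ok: from == from_end
    }
    return out;
}
```

With std::locale::classic() the string L"a\u20ACb" is expected to come out as "a?b" when '?' is the replacement. A complete version for stateful encodings would also call the facet's unshift member at the end; that is omitted to keep the sketch short.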
"Jonathan Turkanis" <te******@kangaroologic.com> wrote in message
news:34*************@individual.net...
> Unforgiven wrote:
>> "PEK" <pe*****@home.se> wrote in message
>> news:41***************@news.individual.net...
>>> I need some code that converts a multi-byte string to a Unicode
>>> string, and Unicode to multi-byte. I work mostly in Windows and know
>>> how to solve it there, but I would like to have some platform
>>> independent code too.
>>> /PEK
>>
>> Note that this won't handle actual multi-byte character sets, i.e.
>> character sets with characters > 256 (e.g. JIS), those characters
>> will not get converted properly. I know of no standard way to handle
>> those, just the WideCharToMultiByte windows method.
>
> Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters.

That I knew, but it has the drawback of stopping at unrecognized characters
instead of replacing them with some predetermined character (like '?'), as
the OP mentioned.

> The preferred C++ solution is to use a codecvt facet instead of a ctype facet.


That I didn't know.

--
Unforgiven

Jul 22 '05 #5
PEK
On Wed, 5 Jan 2005 23:38:00 +0100, "Unforgiven"
<ja*******@hotmail.com> wrote:
> "Jonathan Turkanis" <te******@kangaroologic.com> wrote in message
> news:34*************@individual.net...
>> Unforgiven wrote:
>>> "PEK" <pe*****@home.se> wrote in message
>>> news:41***************@news.individual.net...
>>>> I need some code that converts a multi-byte string to a Unicode
>>>> string, and Unicode to multi-byte. I work mostly in Windows and know
>>>> how to solve it there, but I would like to have some platform
>>>> independent code too.
>>>> /PEK
>>>
>>> Note that this won't handle actual multi-byte character sets, i.e.
>>> character sets with characters > 256 (e.g. JIS), those characters
>>> will not get converted properly. I know of no standard way to handle
>>> those, just the WideCharToMultiByte windows method.
>>
>> Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters.
>
> That I knew, but it has the drawback of stopping at unrecognized characters
> instead of replacing them with some predetermined character (like '?'), as
> the OP mentioned.

A workaround for this is to use mbtowc/wctomb instead and convert the
characters in a loop. This was my solution and it seems to work, or are
there some problems with it?

>> The preferred C++ solution is to use a codecvt facet instead of a ctype facet.
>
> That I didn't know.


The code Unforgiven posted is a bit obscure, but I think I understand most
of it. But I also want to detect whether an unrecognized character was
replaced (I guess I didn't mention that in my earlier post). Another
problem with the code is that I suppose it's hard to calculate the
length of the result when multibyte characters are used.
/PEK

Jul 22 '05 #6
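
One possible way to detect replacement with the ctype facet, offered as a sketch rather than an established idiom: narrow the string twice with two different default characters and compare the results. Any position where the outputs differ must have received the default, so a difference means data was lost. The name narrow_detect_loss is hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: narrows with `dflt` and sets `lost` if any character
// had to be replaced, by comparing against a second pass that uses a
// different default character.
std::string narrow_detect_loss(const std::wstring& source, char dflt, bool& lost,
                               const std::locale& loc = std::locale())
{
    lost = false;
    if (source.empty()) return std::string();
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(loc);

    // Any character different from dflt will do for the second pass.
    const char other = (dflt == '?') ? '!' : '?';
    assert(other != dflt);

    const wchar_t* begin = source.data();
    const wchar_t* end = begin + source.size();
    std::vector<char> first(source.size()), second(source.size());
    ct.narrow(begin, end, dflt, &first[0]);
    ct.narrow(begin, end, other, &second[0]);

    // Convertible characters give the same output in both passes;
    // replaced ones give dflt in one and `other` in the other.
    lost = (first != second);
    return std::string(first.begin(), first.end());
}
```

On the length question: ctype narrowing always yields exactly one char per wchar_t, so the result length equals the source length. For true multi-byte output a codecvt facet is needed, and its max_length() member gives an upper bound on the bytes produced per wide character.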
