Converting from UTF-16 to UTF-32

Hi everybody,

Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?

If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?

Thanks!

Jul 31 '06 #1
Jimmy Shaw wrote:
> Hi everybody,
>
> Is there any SIMPLE way to convert from UTF-16 to UTF-32?

Not in standard C++.

Jul 31 '06 #2
On 2006-07-31 10:06:35 -0400, Rolf Magnus <ra******@t-online.de> said:
> Jimmy Shaw wrote:
>> Hi everybody,
>>
>> Is there any SIMPLE way to convert from UTF-16 to UTF-32?
>
> Not in standard C++.

That's certainly not true.

Jul 31 '06 #3
On 2006-07-31 08:44:53 -0400, "Jimmy Shaw" <si*******@gmail.com> said:
> Hi everybody,
>
> Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
> mixed up, but is it possible that all UTF-16 "code points" that are 16
> bits long appear just the same in UTF-32, but with zero padding and
> hence no real conversion is necessary?

First, your question is off-topic here, as it isn't really a C++ question.

[offtopic]But there is indeed a conversion that is needed (otherwise,
UTF-32 would be a pointless waste of space or UTF-16 would be
incomplete)[/offtopic]

> If I am completely wrong and some intricate conversion operation needs
> to take place, can anyone give me some primer on the subject?
>
> Thanks!

[offtopic]The conversion isn't really intricate at all. See
http://www.zvon.org/tmRFC/RFC2781/Ou...ter2.html#sub2 for a
description of the algorithms used to convert to/from UTF-16.[/offtopic]
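
For readers who want the decoding steps from RFC 2781 in code form, here is a minimal sketch. It is not taken from any poster in this thread. It assumes the C++11 string types std::u16string and std::u32string (which did not exist when the thread was written), and the function name utf16_to_utf32 is hypothetical. It also answers the original question: a 16-bit unit outside the surrogate range is simply widened, and only surrogate pairs need arithmetic.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): convert a whole UTF-16 string
// to UTF-32 using the decoding steps described in RFC 2781.
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t w1 = in[i];
        if (w1 < 0xD800 || w1 > 0xDFFF) {
            // Not a surrogate: the code point is the unit itself, zero-extended.
            out.push_back(w1);
        } else if (w1 <= 0xDBFF) {
            // Leading (high) surrogate: a trailing (low) surrogate must follow.
            if (i + 1 >= in.size())
                throw std::invalid_argument("truncated surrogate pair");
            const char32_t w2 = in[i + 1];
            if (w2 < 0xDC00 || w2 > 0xDFFF)
                throw std::invalid_argument("missing trailing surrogate");
            // Each surrogate carries 10 bits of the value above 0x10000.
            out.push_back(static_cast<char32_t>(
                0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00)));
            ++i;  // the pair consumed two units
        } else {
            // Trailing surrogate with no leading surrogate before it.
            throw std::invalid_argument("unpaired trailing surrogate");
        }
    }
    return out;
}
```

For example, the unit 0x0041 becomes 0x00000041 unchanged, while the pair 0xD83D 0xDE00 becomes the single value 0x1F600.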

Jul 31 '06 #4
Jimmy Shaw wrote:
> Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
> mixed up, but is it possible that all UTF-16 "code points" that are 16
> bits long appear just the same in UTF-32, but with zero padding and
> hence no real conversion is necessary?
> If I am completely wrong and some intricate conversion operation needs
> to take place, can anyone give me some primer on the subject?

http://www.unicode.org/

--
Salu2
Jul 31 '06 #5

Jimmy Shaw wrote:
> Hi everybody,
>
> Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
> mixed up, but is it possible that all UTF-16 "code points" that are 16
> bits long appear just the same in UTF-32, but with zero padding and
> hence no real conversion is necessary?
>
> If I am completely wrong and some intricate conversion operation needs
> to take place, can anyone give me some primer on the subject?
>
> Thanks!
These are the important bits from the functions that I use. utf32 is a
typedef for a signed 32-bit integer (__int32 on MSVC). utf16 is the
same as wchar_t on Windows, but on platforms where wchar_t is wider it
needs to be a typedef for a sixteen-bit type. The UTF-16 sequence is
assumed to be in little-endian (Intel) byte order, the same as Windows
uses.

You do need to use this sort of belt-and-braces approach, though, as
this is a prime vector for security exploits. The checks are even more
important for UTF-8 sequences. I think there are a lot of improvements
that could be made, but the code does work.
std::size_t FSLib::utf::utf16length( const utf32 ch ) {
    // Code points below 0x10000 take one UTF-16 unit; all others take a surrogate pair.
    if ( ch < 0x10000 ) return 1;
    else return 2;
}

// Returns ch unchanged if it is an acceptable code point; throws otherwise.
utf32 FSLib::utf::assertValid( const utf32 ch ) {
    try {
        if ( ch >= 0xD800 && ch <= 0xDBFF )
            throw FSLib::Exceptions::UnicodeEncoding(
                L"UTF-32 character is in the UTF-16 leading surrogate pair range." );
        if ( ch >= 0xDC00 && ch <= 0xDFFF )
            throw FSLib::Exceptions::UnicodeEncoding(
                L"UTF-32 character is in the UTF-16 trailing surrogate pair range." );
        if ( ch == 0xFFFE || ch == 0xFFFF )
            throw FSLib::Exceptions::UnicodeEncoding(
                L"UTF-32 character is disallowed (0xFFFE/0xFFFF)" );
        // utf32 is signed, so negative values are rejected as well.
        if ( ch < 0 || ch > 0x10FFFF )
            throw FSLib::Exceptions::UnicodeEncoding(
                L"UTF-32 character is beyond the allowable range." );
        return ch;
    } catch ( FSLib::Exceptions::UnicodeEncoding &e ) {
        e.info() << L"Character value is: " << ch << std::endl;
        throw;
    }
}

// Decodes the code point that starts at seq[ 0 ] in a zero-terminated UTF-16 sequence.
utf32 FSLib::utf::decode( const utf16 *seq ) {
    try {
        utf32 ch = *seq;
        if ( ch >= 0xD800 && ch <= 0xDBFF ) {
            // Leading surrogate: the next unit must be a trailing surrogate.
            if ( seq[ 1 ] == 0 )
                throw FSLib::Exceptions::UnicodeEncoding(
                    L"Trailing surrogate missing from UTF-16 sequence (it is ZERO)" );
            if ( seq[ 1 ] < 0xDC00 || seq[ 1 ] > 0xDFFF )
                throw FSLib::Exceptions::UnicodeEncoding(
                    L"Trailing character in a UTF-16 surrogate pair is missing (outside correct range)" );
            // Same as 0x10000 + ( ( ch - 0xD800 ) << 10 ) + ( seq[ 1 ] - 0xDC00 ).
            return assertValid( ( ch << 10 ) + seq[ 1 ] + 0x10000 - ( 0xD800 << 10 ) - 0xDC00 );
        }
        return assertValid( ch );
    } catch ( FSLib::Exceptions::Exception &e ) {
        // Caution: seq[ -1 ] is only readable when seq is not the first unit of
        // the buffer, and seq[ 1 ] only when seq[ 0 ] is not the terminator.
        e.info() << L"Decoding UTF-16 number: "
                 << toString( static_cast< unsigned int >( seq[ 0 ] ) ) << std::endl;
        e.info() << L"Preceding UTF-16 number: "
                 << toString( static_cast< unsigned int >( seq[ -1 ] ) ) << std::endl;
        e.info() << L"Following UTF-16 number: "
                 << toString( static_cast< unsigned int >( seq[ 1 ] ) ) << std::endl;
        throw;
    }
}
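
The utf16length function above reports how many UTF-16 units a code point needs, which is the quantity the opposite (encoding) direction depends on. As a rough, standalone illustration of that direction, here is a sketch that does not use FSLib. The name append_utf16 is hypothetical, and it assumes the C++11 types char32_t and std::u16string rather than the utf32/utf16 typedefs used in the post.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical counterpart to decode(): append one code point to a UTF-16
// string, using one unit below 0x10000 and a surrogate pair otherwise.
void append_utf16(std::u16string& out, char32_t cp) {
    // Surrogate values and anything above 0x10FFFF cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not an encodable code point");
    if (cp < 0x10000) {
        // One unit: the value is stored as-is.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Two units: split the 20 bits above 0x10000 into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}
```

Unlike assertValid in the post, this sketch does not reject 0xFFFE and 0xFFFF; whether to do so is a policy choice for the caller.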

Aug 1 '06 #6
Clark S. Cox III wrote:
> On 2006-07-31 10:06:35 -0400, Rolf Magnus <ra******@t-online.de> said:
>> Jimmy Shaw wrote:
>>> Hi everybody,
>>>
>>> Is there any SIMPLE way to convert from UTF-16 to UTF-32?
>>
>> Not in standard C++.
>
> That's certainly not true.

Concur. Most non-Windows platforms use a full integer for
wchar_t. Using locales and <codecvt>, your iostreams probably
already provide the capability. Check your platform's wchar_t
to see if it already is in UTF32.

Dinkumware (http://www.dinkumware.com/) sells an extension
library that includes <codecvt> converters for UTF16.
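
One possible way to perform the wchar_t check suggested above is sketched here. The constant names are hypothetical. The macro __STDC_ISO_10646__ is one that some implementations define to state that wchar_t values are ISO 10646 (Unicode) code points; other implementations leave it undefined even when wchar_t is 32 bits wide, so treat its absence as "unknown" rather than "no".

```cpp
#include <climits>

// Width of wchar_t in bits on this platform (16 on Windows, commonly 32 elsewhere).
constexpr int wchar_bits = static_cast<int>(sizeof(wchar_t) * CHAR_BIT);

// Code points go up to 0x10FFFF, which needs 21 bits.
constexpr bool wchar_can_hold_utf32 = wchar_bits >= 21;

// True only if the implementation explicitly promises ISO 10646 values in wchar_t.
#ifdef __STDC_ISO_10646__
constexpr bool wchar_is_iso10646 = true;
#else
constexpr bool wchar_is_iso10646 = false;
#endif
```

If wchar_can_hold_utf32 is false, a wide string on that platform is at best UTF-16 and still needs a surrogate-aware conversion.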
Aug 1 '06 #7
"dayton" <mv************@yahoo.com> wrote in message
news:3Y**************@newssvr29.news.prodigy.net.. .
> Clark S. Cox III wrote:
>> On 2006-07-31 10:06:35 -0400, Rolf Magnus <ra******@t-online.de> said:
>>> Jimmy Shaw wrote:
>>>> Hi everybody,
>>>>
>>>> Is there any SIMPLE way to convert from UTF-16 to UTF-32?
>>>
>>> Not in standard C++.
>>
>> That's certainly not true.
>
> Concur. Most non-Windows platforms use a full integer for wchar_t. Using
> locales and <codecvt>, your iostreams probably already provide the
> capability. Check your platform's wchar_t to see if it already is in
> UTF32.
>
> Dinkumware (http://www.dinkumware.com/) sells an extension library that
> includes <codecvt> converters for UTF16.

Yep, except that they're now included as part of our standard
(Compleat) library product.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Aug 1 '06 #8
