Jimmy Shaw wrote:
Hi everybody,
Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?
If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?
Thanks!
These are the important bits from the functions that I use. utf32 is a
typedef for a signed 32-bit integer (__int32 on MSVC). utf16 is the
same as wchar_t where wchar_t is sixteen bits wide (as on Windows); on
platforms where it isn't, utf16 needs to be a separate sixteen-bit
type. The UTF-16 sequence is assumed to be little-endian (Intel byte
order), the same as Windows uses.
You do need this sort of belt-and-braces approach, though, as decoding
untrusted Unicode is a prime vector for security exploits. The checks
are even more important for UTF-8 sequences. I think there are a lot of
improvements that could be made, but the code does work.
// Number of UTF-16 code units needed to encode the code point ch.
std::size_t FSLib::utf::utf16length( const utf32 ch ) {
    if ( ch < 0x10000 ) return 1;
    else return 2;
}
// Returns ch unchanged if it is an acceptable code point, otherwise throws.
utf32 FSLib::utf::assertValid( const utf32 ch ) {
    try {
        // Surrogate values are reserved for UTF-16 and are never valid on their own.
        if ( ch >= 0xD800 && ch <= 0xDBFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is in the UTF-16 leading surrogate pair range." );
        if ( ch >= 0xDC00 && ch <= 0xDFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is in the UTF-16 trailing surrogate pair range." );
        if ( ch == 0xFFFE || ch == 0xFFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is disallowed (0xFFFE/0xFFFF)" );
        // 0x10FFFF is the highest code point Unicode defines.
        if ( ch > 0x10FFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is beyond the allowable range." );
        return ch;
    } catch ( FSLib::Exceptions::UnicodeEncoding &e ) {
        e.info() << L"Character value is: " << ch << std::endl;
        throw;
    }
}
// Decodes the code point that starts at seq (one or two UTF-16 code units).
utf32 FSLib::utf::decode( const utf16 *seq ) {
    try {
        utf32 ch = *seq;
        if ( ch >= 0xD800 && ch <= 0xDBFF ) {
            // Leading surrogate: a trailing surrogate must follow it.
            if ( seq[ 1 ] == 0 )
                throw FSLib::Exceptions::UnicodeEncoding( L"Trailing surrogate missing from UTF-16 sequence (it is ZERO)" );
            if ( seq[ 1 ] < 0xDC00 || seq[ 1 ] > 0xDFFF )
                throw FSLib::Exceptions::UnicodeEncoding( L"Trailing character in a UTF-16 surrogate pair is missing (outside correct range)" );
            // Same as 0x10000 + ( ( ch - 0xD800 ) << 10 ) + ( seq[ 1 ] - 0xDC00 ).
            return assertValid( ( ch << 10 ) + seq[ 1 ] + 0x10000 - ( 0xD800 << 10 ) - 0xDC00 );
        }
        return assertValid( ch );
    } catch ( FSLib::Exceptions::Exception &e ) {
        // Note: seq[ -1 ] assumes seq does not point at the very start of the buffer.
        e.info() << L"Decoding UTF-16 number: " << toString( static_cast< unsigned int >( seq[ 0 ] ) ) << std::endl;
        e.info() << L"Preceding UTF-16 number: " << toString( static_cast< unsigned int >( seq[ -1 ] ) ) << std::endl;
        e.info() << L"Following UTF-16 number: " << toString( static_cast< unsigned int >( seq[ 1 ] ) ) << std::endl;
        throw;
    }
}
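
To answer the original question directly: a UTF-16 code unit outside the
range 0xD800 to 0xDFFF has the same value as its UTF-32 code point, so it
only needs zero-extending. Only surrogate pairs need real arithmetic. The
sketch below shows the same idea applied to a whole string using nothing
but the standard library. The function name utf16_to_utf32 is made up for
this example and is not part of FSLib. It assumes the input is already in
native byte order, throws std::runtime_error instead of the FSLib
exceptions, and leaves out the 0xFFFE/0xFFFF check.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of FSLib): converts a whole UTF-16 string,
// already in native byte order, to UTF-32.
std::u32string utf16_to_utf32( const std::u16string &in ) {
    std::u32string out;
    out.reserve( in.size() );
    for ( std::size_t i = 0; i < in.size(); ++i ) {
        const char32_t unit = in[ i ];
        if ( unit >= 0xD800 && unit <= 0xDBFF ) {
            // Leading surrogate: the next unit must be a trailing surrogate.
            if ( i + 1 >= in.size() )
                throw std::runtime_error( "Truncated surrogate pair" );
            const char32_t trail = in[ i + 1 ];
            if ( trail < 0xDC00 || trail > 0xDFFF )
                throw std::runtime_error( "Leading surrogate without a trailing surrogate" );
            // Ten bits come from each half, offset by 0x10000.
            out.push_back( 0x10000 + ( ( unit - 0xD800 ) << 10 ) + ( trail - 0xDC00 ) );
            ++i; // the trailing surrogate has been consumed
        } else if ( unit >= 0xDC00 && unit <= 0xDFFF ) {
            throw std::runtime_error( "Unpaired trailing surrogate" );
        } else {
            // Everything else maps straight across with zero padding.
            out.push_back( unit );
        }
    }
    return out;
}
```

Calling utf16_to_utf32( u"A" ) gives back the single code point 0x41,
while the pair 0xD83D 0xDE00 collapses to the single code point 0x1F600.
Because the length comes from the std::u16string, the sketch never reads
past the end of the buffer looking for a trailing surrogate.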