
Upgrade from Windows-1252 to UCS-2

I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.

As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends
on the implementation of the C++ standard library whether and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?

After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?

Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?

Boris
Jun 20 '07 #1


Boris wrote:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows
"ANSI" code page) to UCS-2. Currently the program reads and writes files
encoded in Windows-1252 but should be able to read files encoded in
UCS-2, too.

As I don't want to deal with two character representations in the
program I plan to use UCS-2 internally. I should be able to simply use
std::wstring then?
Yes.

When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends
now on the implementation of the C++ standard library if and what kind
of conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to
convert between character sets?
AFAIK a third party library (or writing your own code) is the only way
to go. For Windows-1252 to UCS-2 why not write your own? It can't be
that hard.
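John's suggestion is indeed feasible: Windows-1252 and Unicode agree everywhere except the bytes 0x80-0x9F, so the whole converter is one lookup table. A minimal sketch (`cp1252_to_ucs2` is a name invented here; the table values come from the published Windows-1252 definition):

```cpp
#include <cassert>
#include <string>

// Windows-1252 maps bytes 0x00-0x7F and 0xA0-0xFF directly to the
// Unicode code point of the same value; only 0x80-0x9F differ.
// 0 marks the five positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) that
// Windows-1252 leaves undefined.
static const wchar_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

std::wstring cp1252_to_ucs2(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size());               // one code unit per input byte
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 0x80 && c <= 0x9F) {
            wchar_t w = cp1252_high[c - 0x80];
            // Map the undefined slots to U+FFFD (replacement character).
            out += (w != 0) ? w : wchar_t(0xFFFD);
        } else {
            out += static_cast<wchar_t>(c);
        }
    }
    return out;
}
```

Mapping the five undefined bytes to U+FFFD is just one policy; rejecting such input with an error would be equally valid.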
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct
that I'm safe to use member functions of std::wstring as long as the
character set used is not multibyte?
That's correct for UCS-2.
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?
Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

But yes, to convert UCS-2 to UTF-8 is another step for which you could
either get a third party library or write your own code.
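The UCS-2 to UTF-8 direction is also only a few lines, since every BMP code point encodes to at most three UTF-8 bytes. A sketch (`ucs2_to_utf8` is an invented name; it assumes the input really is UCS-2, i.e. contains no surrogate code units):

```cpp
#include <cassert>
#include <string>

// Encode a UCS-2 string as UTF-8. BMP code points need at most three
// bytes: 1 byte below U+0080, 2 below U+0800, 3 otherwise.
std::string ucs2_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that code points below 0x80 pass through as single unchanged bytes, which is what makes the UTF-8 output readable by ASCII-only programs.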
Anything I might have missed?

Boris
john
Jun 20 '07 #2

Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.
I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.

john
Jun 20 '07 #3

On Wed, 20 Jun 2007 15:35:25 +0900, John Harrison
<jo*************@hotmail.com> wrote:
> Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.
Thanks, John! I should have clarified it better: the idea is that files
containing only characters from the ASCII-compatible subset look like normal
ASCII files when encoded in UTF-8 (so other programs can simply assume they
are ASCII files).

Boris
Jun 20 '07 #4

On Jun 20, 12:36 pm, Boris <b...@gtemail.net> wrote:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.
I think you mean UCS-4 and UTF-16. Old documents talk about UCS-2, but
current Windows (and I assume Linux etc.) is UCS-4. This causes no end
of confusion especially as for most purposes there isn't much
difference. Check that your software manages to handle the treble
clef character properly. Let's see how it works here :)
As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends now
on the implementation of the C++ standard library if and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?
Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?
If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?
To convert from UTF-16 to UTF-8 is fairly simple, but don't forget you
HAVE to go through UTF-32.

It's not directly about your situation, but you may find this
interesting as it does discuss some of the issues about encodings and
Unicode.

http://www.kirit.com/Getting%20the%2...ISAPI%20filter
The way you're going about it is a good way to start this sort of
conversion. In the end for our systems we made our own
std::basic_string like class that knows it is UTF-16 and alters parts
of the interface accordingly.

Once you start working with Unicode you won't want to go back.
K

Jun 20 '07 #5

On Wed, 20 Jun 2007 17:05:19 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
On Jun 20, 12:36 pm, Boris <b...@gtemail.netwrote:
[...]
>conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to convert
between character sets?

Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.
What's so special about the first twenty wchar_ts? It's the first time
I've heard of it.
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.
My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?

Boris
Jun 20 '07 #6

On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
On Wed, 20 Jun 2007 17:05:19 +0900, Kirit Sælensminde
<kirit.saelensmi...@gmail.com> wrote:
On Jun 20, 12:36 pm, Boris <b...@gtemail.net> wrote:
[...]
conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to convert
between character sets?
Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.

What's so special about the first twenty wchar_ts? It's the first time
I've heard of it.
Nothing. It was just an example. You can't take any internal range
between positions n and m in a UTF-16 sequence without checking that
you don't cut surrogate pairs in half.

You will generally be OK so long as you use a string instead of a
wchar_t for single character operations - i.e. every place you would
get user input as one character, handle it internally as a string. You
also need to make sure that you never use functions like substr at any
boundary that has not been found by searching within the string.
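That rule can be wrapped in a small helper so callers never cut a surrogate pair in half. A sketch (both names are invented here), assuming the wstring holds UTF-16 code units: before cutting at position n, check whether the unit at n is a low (trailing) surrogate and, if so, back up one unit:

```cpp
#include <cassert>
#include <string>

// A UTF-16 low (trailing) surrogate lies in 0xDC00-0xDFFF.
inline bool is_low_surrogate(wchar_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

// Return the first n code units of a UTF-16 string, moving the cut
// one unit to the left if it would split a surrogate pair in half.
std::wstring safe_truncate(const std::wstring& s, std::wstring::size_type n)
{
    if (n >= s.size())
        return s;
    if (n > 0 && is_low_surrogate(s[n]))  // s[n-1] is the high half
        --n;
    return s.substr(0, n);
}
```

For example, truncating "a" followed by U+1D11E (the pair 0xD834 0xDD1E) after two code units backs up to one, keeping the pair intact.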
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.

My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?
Then you need to strip the surrogate pairs from your code, but I'm not
sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is
pretty simple. You should be able to write a simple iterator based
algorithm that does it in a short amount of code.
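Such a pass might look like this, sketched under stated assumptions (`utf16_to_utf8` is an invented name): decode each code point first, combining surrogate pairs into the UTF-32 value as described above, then emit one to four UTF-8 bytes per code point:

```cpp
#include <cassert>
#include <string>

// Convert UTF-16 to UTF-8 by first decoding each code point
// (combining surrogate pairs into values above U+FFFF), then
// emitting one to four UTF-8 bytes per code point.
std::string utf16_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]) & 0xFFFF;
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long lo = static_cast<unsigned long>(in[i + 1]) & 0xFFFF;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {      // valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A production version would also decide what to do with unpaired surrogates; this sketch simply encodes them as three-byte sequences.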

If you are interested in licensing the implementations that we have
then you can contact me via email or my web site.
K

Jun 20 '07 #7

On Wed, 20 Jun 2007 23:31:01 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
[...]Nothing. It was just an example. You can't take any internal range
Ah, okay, then I understand. :)
[...]Then you need to strip the surrogate pairs from your code, but I'm
not sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is
The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from the
C++ standard. I expect that it's rather simple to replace std::string with
std::wstring. But I try to avoid having to make a complete code review to
figure out if the strings are used everywhere correctly. If I simply use
UCS-2 in std::wstring I should be more or less done? Or is there any trick
to make a std::wstring aware of UTF-16 - can't possibly work?

Boris
Jun 20 '07 #8

Boris wrote:
On Wed, 20 Jun 2007 23:31:01 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
>On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
[...]Nothing. It was just an example. You can't take any internal range

Ah, okay, then I understand. :)
>[...]Then you need to strip the surrogate pairs from your code, but
I'm not sure that I'd recommend it. So long as you are careful with the
string operations you'll be fine with UTF-16 and converting to UTF-8 is

The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from
the C++ standard. I expect that it's rather simple to replace
std::string with std::wstring. But I try to avoid having to make a
complete code review to figure out if the strings are used everywhere
correctly. If I simply use UCS-2 in std::wstring I should be more or
less done? Or is there any trick to make a std::wstring aware of UTF-16
- can't possibly work?

Boris
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs is an issue. Frankly though I'd stick with UCS-2.

john
Jun 20 '07 #9

Boris wrote:
On Wed, 20 Jun 2007 15:35:25 +0900, John Harrison
<jo*************@hotmail.com> wrote:
>> Some confusion here I think: UTF-8 and Windows-1252 are not the
same. The first is a character encoding, the second is a character set.

I want to take that back: Windows-1252 is an encoding too, but it's
still the case that it's not the same as UTF-8.

Thanks, John! I should have clarified it better: The idea is that files
with an ASCII-compatible subset of UTF-8 look like normal ASCII files
when encoded in UTF-8 (so other programs can simply assume they are
ASCII files).

Boris
That is true, but again ASCII is not the same as Windows-1252. You need
to be precise about your terminology.

john
Jun 20 '07 #10

On Thu, 21 Jun 2007 03:04:41 +0900, John Harrison
<jo*************@hotmail.com> wrote:
[...]You need to be careful about terminology. If you really mean UCS-2
then the surrogate pairs problem that Kirit mentioned is not a problem
for the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs is an issue. Frankly though I'd stick with UCS-2.
Yes, I really do mean UCS-2. Kirit started to talk about UTF-16, not me. :)

Boris
Jun 21 '07 #11

On Jun 21, 1:04 am, John Harrison <john_androni...@hotmail.com> wrote:
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.
Windows now uses UTF-16, not UCS-2. Early versions of Windows used
UCS-2 and there is still a lot of documentation from that era on the
web. I don't remember which version changed from UCS-2 to UTF-16.

If all of the strings are generated internally then you can probably
get away with assuming UCS-2 so long as you reject the surrogate
pairs. At every location that strings enter the program they will need
to be checked.
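A boundary check of that kind is short. A sketch (`is_plain_ucs2` is an invented name): reject any incoming string that contains a code unit in the surrogate range, so the rest of the program can treat std::wstring as fixed-width UCS-2:

```cpp
#include <cassert>
#include <string>

// True if every code unit is a plain BMP character, i.e. the string
// contains no surrogate code units (0xD800-0xDFFF) and can therefore
// be treated as fixed-width UCS-2.
bool is_plain_ucs2(const std::wstring& s)
{
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        wchar_t c = s[i];
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;
    }
    return true;
}
```

Called at every input boundary (file reads, API calls, user input), this lets the rest of the code assume one wchar_t per character.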
K
Jun 21 '07 #12

On Jun 20, 8:04 pm, John Harrison <john_androni...@hotmail.com> wrote:
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.
If you want to be really careful: it may use both, since they
refer to different things. UCS-2 specifies a character set and
its abstract encoding: the mapping between a specific character
and a numeric value. UTF-16 specifies an encoding format: the
way the numeric value is represented in a particular context
(memory, media, etc.). UTF-16 (like UTF-8) can be used to
represent both UCS-2 and UCS-4.

I think modern Windows uses UCS-4 in UTF-16 format, at least on
disk and at the API level.
On the other hand if you really mean the more modern UTF-16
then surrogate pairs is an issue. Frankly though I'd stick
with UCS-2.
The choice of the code set depends on the characters you need.
If I were writing a compiler for K&R C, I'd stick with US ASCII;
it's a lot simpler than either, and has all the necessary
characters. If I have to handle text in a Far Eastern
language, on the other hand, I probably need UCS-4, regardless
of what I want, because it is the only encoding which has all of
the characters I need.

Depending on what I'm doing, internally, I'll use UTF-32 or
UTF-8. Probably... I've never worked in an environment which
had any native support for UTF-16, and perhaps in some cases,
the presence of native support would win out.

--
James Kanze (GABI Software, from CAI) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 21 '07 #13
