
Upgrade from Windows-1252 to UCS-2

I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.

As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends
on the implementation of the C++ standard library whether and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?

After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?

Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?

Boris
Jun 20 '07 #1


Boris wrote:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows
"ANSI" code page) to UCS-2. Currently the program reads and writes files
encoded in Windows-1252 but should be able to read files encoded in
UCS-2, too.

As I don't want to deal with two character representations in the
program I plan to use UCS-2 internally. I should be able to simply use
std::wstring then?
Yes.

When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends
now on the implementation of the C++ standard library if and what kind
of conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to
convert between character sets?
AFAIK a third party library (or writing your own code) is the only way
to go. For Windows-1252 to UCS-2 why not write your own? It can't be
that hard.
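John's suggestion is indeed feasible: Windows-1252 and Unicode agree everywhere except the bytes 0x80-0x9F, so the whole converter is one lookup table. A minimal sketch (`cp1252_to_ucs2` is a name invented here; the table values come from the published Windows-1252 definition):

```cpp
#include <cassert>
#include <string>

// Windows-1252 maps bytes 0x00-0x7F and 0xA0-0xFF directly to the
// Unicode code point of the same value; only 0x80-0x9F differ.
// 0 marks the five positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) that
// Windows-1252 leaves undefined.
static const wchar_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

std::wstring cp1252_to_ucs2(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size());               // one code unit per input byte
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 0x80 && c <= 0x9F) {
            wchar_t w = cp1252_high[c - 0x80];
            // Map the undefined slots to U+FFFD (replacement character).
            out += (w != 0) ? w : wchar_t(0xFFFD);
        } else {
            out += static_cast<wchar_t>(c);
        }
    }
    return out;
}
```

Mapping the five undefined bytes to U+FFFD is just one policy; rejecting such input with an error would be equally valid.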
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct
that I'm safe to use member functions of std::wstring as long as the
character set used is not multibyte?
That's correct for UCS-2.
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?
Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

But yes, to convert UCS-2 to UTF-8 is another step for which you could
either get a third party library or write your own code.
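The UCS-2 to UTF-8 direction is also only a few lines, since every BMP code point encodes to at most three UTF-8 bytes. A sketch (`ucs2_to_utf8` is an invented name; it assumes the input really is UCS-2, i.e. contains no surrogate code units):

```cpp
#include <cassert>
#include <string>

// Encode a UCS-2 string as UTF-8. BMP code points need at most three
// bytes: 1 byte below U+0080, 2 below U+0800, 3 otherwise.
std::string ucs2_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that code points below 0x80 pass through as single unchanged bytes, which is what makes the UTF-8 output readable by ASCII-only programs.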
Anything I might have missed?

Boris
john
Jun 20 '07 #2

Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.
I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.

john
Jun 20 '07 #3

On Wed, 20 Jun 2007 15:35:25 +0900, John Harrison
<jo*************@hotmail.com> wrote:
> Some confusion here I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.
Thanks, John! I should have clarified it better: the idea is that files
containing only characters from the ASCII-compatible subset look like normal
ASCII files when encoded in UTF-8 (so other programs can simply assume they
are ASCII files).

Boris
Jun 20 '07 #4

On Jun 20, 12:36 pm, Boris <b...@gtemail.net> wrote:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.
I think you mean UCS-4 and UTF-16. Old documents talk about UCS-2, but
current Windows (and I assume Linux etc.) is UCS-4. This causes no end
of confusion especially as for most purposes there isn't much
difference. Check that your software manages to handle the treble
clef character properly. Let's see how it works here :)
As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends now
on the implementation of the C++ standard library if and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?
Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?
If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?
To convert from UTF-16 to UTF-8 is fairly simple, but don't forget you
HAVE to go through UTF-32.

It's not directly about your situation, but you may find this
interesting as it does discuss some of the issues about encodings and
Unicode.

http://www.kirit.com/Getting%20the%2...ISAPI%20filter
The way you're going about it is a good way to start this sort of
conversion. In the end for our systems we made our own
std::basic_string like class that knows it is UTF-16 and alters parts
of the interface accordingly.

Once you start working with Unicode you won't want to go back.
K

Jun 20 '07 #5

On Wed, 20 Jun 2007 17:05:19 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
On Jun 20, 12:36 pm, Boris <b...@gtemail.netwrote:
[...]
>conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to convert
between character sets?

Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.
What's so special about the first twenty wchar_ts? It's the first time
I've heard of it.
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.
My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?

Boris
Jun 20 '07 #6

On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
On Wed, 20 Jun 2007 17:05:19 +0900, Kirit Sælensminde
<kirit.saelensmi...@gmail.com> wrote:
On Jun 20, 12:36 pm, Boris <b...@gtemail.net> wrote:
[...]
conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to convert
between character sets?
Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.

What's so special about the first twenty wchar_ts? It's the first time
I've heard of it.
Nothing. It was just an example. You can't take any internal range
between positions n and m in a UTF-16 sequence without checking that
you don't cut surrogate pairs in half.

You will generally be OK so long as you use a string instead of a
wchar_t for single character operations - i.e. every place you would
get user input as one character, handle it internally as a string. You
also need to make sure that you never use functions like substr at any
boundary that has not been found by searching within the string.
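That rule can be wrapped in a small helper so callers never cut a surrogate pair in half. A sketch (both names are invented here), assuming the wstring holds UTF-16 code units: before cutting at position n, check whether the unit at n is a low (trailing) surrogate and, if so, back up one unit:

```cpp
#include <cassert>
#include <string>

// A UTF-16 low (trailing) surrogate lies in 0xDC00-0xDFFF.
inline bool is_low_surrogate(wchar_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

// Return the first n code units of a UTF-16 string, moving the cut
// one unit to the left if it would split a surrogate pair in half.
std::wstring safe_truncate(const std::wstring& s, std::wstring::size_type n)
{
    if (n >= s.size())
        return s;
    if (n > 0 && is_low_surrogate(s[n]))  // s[n-1] is the high half
        --n;
    return s.substr(0, n);
}
```

For example, truncating "a" followed by U+1D11E (the pair 0xD834 0xDD1E) after two code units backs up to one, keeping the pair intact.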
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.

My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?
Then you need to strip the surrogate pairs from your code, but I'm not
sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is
pretty simple. You should be able to write a simple iterator based
algorithm that does it in a short amount of code.
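Such a pass might look like this, sketched under stated assumptions (`utf16_to_utf8` is an invented name): decode each code point first, combining surrogate pairs into the UTF-32 value as described above, then emit one to four UTF-8 bytes per code point:

```cpp
#include <cassert>
#include <string>

// Convert UTF-16 to UTF-8 by first decoding each code point
// (combining surrogate pairs into values above U+FFFF), then
// emitting one to four UTF-8 bytes per code point.
std::string utf16_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]) & 0xFFFF;
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long lo = static_cast<unsigned long>(in[i + 1]) & 0xFFFF;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {      // valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A production version would also decide what to do with unpaired surrogates; this sketch simply encodes them as three-byte sequences.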

If you are interested in licensing the implementations that we have
then you can contact me via email or my web site.
K

Jun 20 '07 #7

On Wed, 20 Jun 2007 23:31:01 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
[...]Nothing. It was just an example. You can't take any internal range
Ah, okay, then I understand. :)
[...]Then you need to strip the surrogate pairs from your code, but I'm
not sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is
The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from the
C++ standard. I expect that it's rather simple to replace std::string with
std::wstring. But I try to avoid having to make a complete code review to
figure out if the strings are used everywhere correctly. If I simply use
UCS-2 in std::wstring I should be more or less done? Or is there any trick
to make a std::wstring aware of UTF-16 - can't possibly work?

Boris
Jun 20 '07 #8

Boris wrote:
On Wed, 20 Jun 2007 23:31:01 +0900, Kirit Sælensminde
<ki****************@gmail.com> wrote:
>On Jun 20, 4:41 pm, Boris <b...@gtemail.net> wrote:
[...]Nothing. It was just an example. You can't take any internal range

Ah, okay, then I understand. :)
>[...]Then you need to strip the surrogate pairs from your code, but
I'm not sure that I'd recommend it. So long as you are careful with the
string operations you'll be fine with UTF-16 and converting to UTF-8 is

The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from
the C++ standard. I expect that it's rather simple to replace
std::string with std::wstring. But I try to avoid having to make a
complete code review to figure out if the strings are used everywhere
correctly. If I simply use UCS-2 in std::wstring I should be more or
less done? Or is there any trick to make a std::wstring aware of UTF-16
- can't possibly work?

Boris
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs is an issue. Frankly though I'd stick with UCS-2.

john
Jun 20 '07 #9

Boris wrote:
On Wed, 20 Jun 2007 15:35:25 +0900, John Harrison
<jo*************@hotmail.com> wrote:
>> Some confusion here I think: UTF-8 and Windows-1252 are not the
same. The first is a character encoding, the second is a character set.

I want to take that back: Windows-1252 is an encoding too, but it's
still the case that it's not the same as UTF-8.

Thanks, John! I should have clarified it better: The idea is that files
with an ASCII-compatible subset of UTF-8 look like normal ASCII files
when encoded in UTF-8 (so other programs can simply assume they are
ASCII files).

Boris
That is true, but again ASCII is not the same as Windows-1252. You need
to be precise about your terminology.

john
Jun 20 '07 #10

On Thu, 21 Jun 2007 03:04:41 +0900, John Harrison
<jo*************@hotmail.com> wrote:
[...]You need to be careful about terminology. If you really mean UCS-2
then the surrogate pairs problem that Kirit mentioned is not a problem
for the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs is an issue. Frankly though I'd stick with UCS-2.
Yes, I really do mean UCS-2. Kirit started to talk about UTF-16, not me. :)

Boris
Jun 21 '07 #11

On Jun 21, 1:04 am, John Harrison <john_androni...@hotmail.com> wrote:
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.
Windows now uses UTF-16, not UCS-2. Early versions of Windows used
UCS-2 and there is still a lot of documentation from that era on the
web. I don't remember which version changed from UCS-2 to UTF-16.

If all of the strings are generated internally then you can probably
get away with assuming UCS-2 so long as you reject the surrogate
pairs. At every location that strings enter the program they will need
to be checked.
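A boundary check of that kind is short. A sketch (`is_plain_ucs2` is an invented name): reject any incoming string that contains a code unit in the surrogate range, so the rest of the program can treat std::wstring as fixed-width UCS-2:

```cpp
#include <cassert>
#include <string>

// True if every code unit is a plain BMP character, i.e. the string
// contains no surrogate code units (0xD800-0xDFFF) and can therefore
// be treated as fixed-width UCS-2.
bool is_plain_ucs2(const std::wstring& s)
{
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        wchar_t c = s[i];
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;
    }
    return true;
}
```

Called at every input boundary (file reads, API calls, user input), this lets the rest of the code assume one wchar_t per character.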
K
Jun 21 '07 #12

On Jun 20, 8:04 pm, John Harrison <john_androni...@hotmail.com> wrote:
You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.
If you want to be really careful: it may use both, since they
refer to different things. UCS-2 specifies a character set and
its abstract encoding: the mapping between a specific character
and a numeric value. UTF-16 specifies an encoding format: the
way the numeric value is represented in a particular context
(memory, media, etc.). UTF-16 (like UTF-8) can be used to
represent both UCS-2 and UCS-4.

I think modern Windows uses UCS-4 in UTF-16 format, at least on
disk and at the API level.
On the other hand if you really mean the more modern UTF-16
then surrogate pairs is an issue. Frankly though I'd stick
with UCS-2.
The choice of the code set depends on the characters you need.
If I were writing a compiler for K&R C, I'd stick with US ASCII;
it's a lot simpler than either, and has all the necessary
characters. If I have to handle text in a Far Eastern
language, on the other hand, I probably need UCS-4, regardless
of what I want, because it is the only encoding which has all of
the characters I need.

Depending on what I'm doing, internally, I'll use UTF-32 or
UTF-8. Probably... I've never worked in an environment which
had any native support for UTF-16, and perhaps in some cases,
the presence of native support would win out.

--
James Kanze (GABI Software, from CAI) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 21 '07 #13
