Any convenient and elegant way to do encoding conversion in C++?

Licheng Fang

I want to store Chinese in Unicode internally in my program, and give
output in UTF-8 or GBK format. After two days of searching and reading,
I still cannot find a simple and straightforward way to do the code
conversions. In particular, I want portability of the code across
platforms (Windows and Linux), and I don't like having to refer the
user of my code to some third-party libraries for compiling.

Some STL references point to the class "codecvt<>" for this task, but
it seems that I must rely on non-standard, third-party specializations
of this class. The STL itself doesn't implement the code conversions.
Another option I've read about is using GNU's "iconv", which is
implemented in C, and Glib provides a C++ wrapper of "iconv". Again,
recompiling my source code can be troublesome if I rely heavily on
these libraries. Boost also seems to have some tools for code
conversion. Considering the huge size of the Boost libraries, I would
have to pass on that option.

These are the only possible ways I know of so far. I have to say that
my idea of how this task should be done is somewhat influenced by the
Python way, which is simple and elegant:

If 's' is a string in GBK:

unicode_s = s.decode('gbk')

and when I need to output in GBK I simply convert it back by

output = unicode_s.encode('gbk')

Or I can tell the file object what the external encoding is:

import codecs
f = codecs.open('somefile', 'r', 'gbk')

I know it's not fair to expect the same things from two different
languages. I wonder, however, how such a seemingly trivial task can be
so infuriatingly complicated in C++.

Sep 23 '06 #1


Julián Albo

Licheng Fang wrote:

I want to store Chinese in Unicode internally in my program, and give
output in UTF-8 or GBK format. After two days of searching and reading,
I still cannot find a simple and straightforward way to do the code
conversions. In particular, I want portability of the code across
platforms (Windows and Linux), and I don't like having to refer the
user of my code to some third-party libraries for compiling.

Then you are in trouble. The Windows APIs, and the usual Windows compilers,
use a 16-bit wchar_t with UTF-16 encoding, while gcc and its libraries
on Linux use a 32-bit wchar_t. So if you want the internals of the
program to be the same on both platforms, and don't want to use
third-party libraries, you must define your own wide character type and
conversions to and from UTF-8, plus some platform-dependent code to see
which characters are available in the fonts used.

The UTF-8 conversions are not hard; http://www.unicode.org has plenty of
information.
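
A minimal sketch of the "own wide type plus UTF-8 conversion" idea, assuming a C++11 compiler so that char32_t and std::u32string can serve as the fixed-width type on every platform (these types postdate this thread). The function names append_utf8 and to_utf8 are hypothetical, not part of any library.

```cpp
#include <stdexcept>
#include <string>

// Append the UTF-8 encoding of one Unicode scalar value to `out`.
void append_utf8(std::string& out, char32_t cp)
{
    // Surrogates and values above U+10FFFF are not encodable.
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        throw std::invalid_argument("not a Unicode scalar value");

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole string of code points as UTF-8.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        append_utf8(out, cp);
    return out;
}
```

With this, the internal representation is identical on Windows and Linux because it never touches wchar_t. GBK output is a different matter, since it needs a mapping table rather than arithmetic.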

--
Salu2

Sep 23 '06 #2

loufoque

Licheng Fang wrote :

Some STL references point to the class "codecvt<>" for this task, but
it seems that I must rely on non-standard, third-party specializations
of this class. The STL itself doesn't implement the code conversions.

Indeed, it's not in the standard library.
That's why you need to use a third party library, like libiconv, unless
you want to write it yourself of course.

Basically, you just have to define mappings between one encoding and
Unicode. This is a very boring task.
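
To illustrate what "defining mappings" means in practice, here is a rough sketch of a table-driven GBK decoder. It assumes C++11, and its table holds only two entries for demonstration; the full GBK table has over twenty thousand entries, which is where the boredom comes from. The names kGbkExcerpt and decode_gbk_excerpt are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Two-entry excerpt of a GBK-to-Unicode table (illustration only).
static const std::map<unsigned, char32_t> kGbkExcerpt = {
    {0xD6D0, 0x4E2D},  // 中
    {0xCEC4, 0x6587},  // 文
};

// Decode GBK bytes into code points using the excerpt table above.
std::u32string decode_gbk_excerpt(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {                 // ASCII maps to itself
            out += static_cast<char32_t>(lead);
            continue;
        }
        if (i + 1 >= bytes.size())
            throw std::runtime_error("truncated double-byte sequence");
        unsigned char trail = static_cast<unsigned char>(bytes[++i]);
        auto it = kGbkExcerpt.find((static_cast<unsigned>(lead) << 8) | trail);
        if (it == kGbkExcerpt.end())
            throw std::runtime_error("byte pair not in the excerpt table");
        out += it->second;
    }
    return out;
}
```

A real implementation would generate the table from a published mapping file instead of typing it in, and would need the reverse table for encoding.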

Sep 23 '06 #3

Licheng Fang

Julián Albo wrote:

Licheng Fang wrote:

I want to store Chinese in Unicode internally in my program, and give
output in UTF-8 or GBK format. After two days of searching and reading,
I still cannot find a simple and straightforward way to do the code
conversions. In particular, I want portability of the code across
platforms (Windows and Linux), and I don't like having to refer the
user of my code to some third-party libraries for compiling.

Then you are in trouble. The Windows APIs, and the usual Windows compilers,
use a 16-bit wchar_t with UTF-16 encoding, while gcc and its libraries
on Linux use a 32-bit wchar_t. So if you want the internals of the
program to be the same on both platforms, and don't want to use
third-party libraries, you must define your own wide character type and
conversions to and from UTF-8, plus some platform-dependent code to see
which characters are available in the fonts used.

The UTF-8 conversions are not hard; http://www.unicode.org has plenty of
information.

Thanks very much.

I know it's simple to convert Unicode to UTF-8, but the input of my
code is mostly in GBK, which is a popular Chinese encoding. I have to
deal with that.

It seems I have to accept that there's no standard way to convert
encodings in C++. Let me re-state my goals:

1) use Unicode internally in my program, to facilitate my coding task
2) make it as convenient as possible for the users of my code to
compile it

And let me forget about Windows for now, and think about how I can make
it simple to recompile my code on Linux. Given that there's no
standard way to do encoding conversions, my question is:

What is the most widely used encoding conversion approach on Linux? Is
it the "iconv" library? Is this library included by default on most
Linux platforms? How about the Glib wrappers for this library? Should I
use them?

Sep 23 '06 #4

AnonMail2005

Licheng Fang wrote:

It seems I have to accept that there's no standard way to convert
encodings in C++. Let me re-state my goals:

1) use Unicode internally in my program, to facilitate my coding task
2) make it as convenient as possible for the users of my code to
compile it

And let me forget about Windows for now, and think about how I can make
it simple to recompile my code on Linux. Given that there's no
standard way to do encoding conversions, my question is:

What is the most widely used encoding conversion approach on Linux? Is
it the "iconv" library? Is this library included by default on most
Linux platforms? How about the Glib wrappers for this library? Should I
use them?

iconv is a very standard way to do this. It's a small C interface
(iconv_open, iconv, and iconv_close) which, given proper inputs, will
do everything you need. Forget about C++ wrappers. It's just C. Learn
how to declare it correctly for a C++ program and you're done (check
the FAQ). Hopefully, the Glib wrapper you are speaking of
just does that.
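
As a sketch of what calling iconv from C++ might look like: the function name convert_encoding is hypothetical, and the code assumes a POSIX-style <iconv.h> whose iconv() takes char** for the input pointer (as glibc does; some older platforms declare it const char**). On glibc the functions live in the C library itself; elsewhere you may need to link a separate libiconv. Which encoding names are accepted (for example "GBK") depends on the installation.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `from` to encoding `to`.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);     // note the order: target first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: conversion not supported");

    std::string output;
    char buffer[256];
    char* in = const_cast<char*>(input.data());   // iconv does not modify the bytes
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;                   // save before anything can overwrite it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the buffer filled up; loop again for the rest.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence (matters for stateful encodings).
    char* out = buffer;
    std::size_t out_left = sizeof buffer;
    iconv(cd, nullptr, nullptr, &out, &out_left);
    output.append(buffer, sizeof buffer - out_left);

    iconv_close(cd);
    return output;
}
```

A call such as convert_encoding(gbk_bytes, "GBK", "UTF-8") would then cover the original question, provided the local iconv knows GBK.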

I use iconv for encoding internal strings when creating XML
messages which are sent externally. The iconv call is
contained in a wrapper we created to interface with an XML
library. It is cross platform - Linux and Windows.

Notice the word "call" above is not plural. The key point is
that the system is designed so that it doesn't need to keep
track of how strings are encoded. The process of creating
something for external consumption encapsulates the
conversion.

You should carefully design your system to do the same,
otherwise your code will be riddled with conversion calls
and anything that contains data will need to keep track
of how that data is encoded.
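
One way to picture that design, as a sketch only: the program passes around one internal string type, and a single writer object owns the conversion. All names here (Encoder, EncodedWriter, encode_ascii_lossy) are hypothetical, C++11 is assumed, and the lossy ASCII encoder is just a stand-in for a real iconv-based one.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Any function that turns internal text into bytes in some external encoding.
using Encoder = std::function<std::string(const std::u32string&)>;

// The only place where text leaves its internal form.
class EncodedWriter {
public:
    EncodedWriter(std::ostream& out, Encoder encode)
        : out_(out), encode_(std::move(encode)) {}

    void write(const std::u32string& text) { out_ << encode_(text); }

private:
    std::ostream& out_;
    Encoder encode_;
};

// Stand-in encoder: ASCII passes through, everything else becomes '?'.
std::string encode_ascii_lossy(const std::u32string& text)
{
    std::string bytes;
    for (char32_t cp : text)
        bytes += cp < 0x80 ? static_cast<char>(cp) : '?';
    return bytes;
}
```

The rest of the code never asks how a string is encoded; switching the output from UTF-8 to GBK means handing the writer a different Encoder and nothing else.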

Good luck.

Sep 23 '06 #5

Julián Albo

Licheng Fang wrote:

It seems I have to accept that there's no standard way to convert
encodings in C++. Let me re-state my goals:
1) use Unicode internally in my program, to facilitate my coding task
2) make it as convenient as possible for the users of my code to
compile it

You have, for example, the mbtowc (multibyte to wide char) function and its
family in the C library, which I suppose will support your encoding if you
have a locale that uses it. You can handle the locale with the C-style
functions in <locale.h> or the C++ <locale> ones.

The availability of locales and libraries on Linux is off-topic in this
group; you can ask in a Linux-related newsgroup.
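
A rough sketch of the locale-based route, using mbsrtowcs from the same function family. The name multibyte_to_wide is hypothetical. Whether a GBK locale such as "zh_CN.GBK" exists is an assumption about the target system, and the resulting wchar_t values are only as portable as wchar_t itself.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Convert bytes in the encoding of `locale_name` to a wide string.
// Caution: setlocale changes global state for the whole program.
std::wstring multibyte_to_wide(const std::string& bytes, const char* locale_name)
{
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr)
        throw std::runtime_error("locale not available on this system");

    // First pass: a null destination only counts the wide characters.
    std::mbstate_t state = std::mbstate_t();
    const char* src = bytes.c_str();
    std::size_t length = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (length == static_cast<std::size_t>(-1))
        throw std::runtime_error("input is not valid in this locale's encoding");

    // Second pass: convert into a buffer of exactly that size.
    std::wstring wide(length, L'\0');
    state = std::mbstate_t();
    src = bytes.c_str();
    std::mbsrtowcs(&wide[0], &src, length, &state);
    return wide;
}
```

Conversion stops at the first NUL byte, so this suits C-style strings, not arbitrary binary buffers.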

--
Salu2

Sep 23 '06 #6
