Alexander Adam wrote:
Hi,
I am a bit lost in encoding-related stuff. Let me explain what I am
doing (yes, it's C++ :)):
I am getting some input content via the Expat XML parser. I've set up
Expat to use wchar_t.
First question is this -- what is the difference between unsigned
short, wchar_t and char?
On my compiler, they all have different sizes....
Okay, wchar_t is a built-in type of C++ and it's two bytes in size
I believe that that's the case on Windows. On Linux, wchar_t is 4
bytes. You should not rely on it having any particular size.
whereas char is always one byte.
But what's the real difference when storing Text into those types i.e.
ASCII, UTF-8, UTF-16 or UTF-32 encoded text?
Afaik, UTF-8 is 2 bytes,
No. It's a variable length encoding. Look it up, e.g. on the Unicode
web site. I bet Wikipedia has a good description too.
UTF-16 is 2 bytes
No. It's a variable length encoding. For the vast majority of cases,
it will use two bytes per character, but you shouldn't rely on that.
Look it up.
and UTF-32 is up to four
bytes?
It's always exactly four bytes per character. Look it up.
Well anyway, my issue is how to correctly work with those
types. Internally I am using wchar_t for all my representations but
depending on the encoding I need to shift a current char value
bitwise, right?
Err, I'm not sure what you mean, but no I don't think that's the right
thing to do. What do you mean by "work with" these types? What are you
actually trying to do?
Okay next one -- I am storing everything of my wchar_t array into a
stream of type char,
Why?
doing so by a simple memcpy. Now how could I read
it back in? Say I have char* buffer where my wchar_t string is saved
in. I could surely do a simple memcpy(myWcharVar, buffer,
sizeof(wchar_t)) to get two bytes but this doesn't seem to be very
efficient as I'd like to read it char by char (like wchar_t nx =
buffer.next(), know what I mean?).
Beware that there are endianness issues to worry about here.
Your compiler will possibly optimise a memcpy() into efficient inline code.
But if you kept it in a wchar_t buffer, you wouldn't need to worry
about this.
And then after having read such a char, I must be able to correctly
encode it. I know the encoding, whether it's ASCII, UTF-8, UTF-16 or
anything but how would I go about it *without* using any big
libraries?
Why the prohibition on libraries? POSIX systems have iconv(), which
will do it all for you. I think Windows has something similar.
If you want to write the code yourself, you should find enough
description in the definitions of the encodings.
I have done some work on strings tagged with their character sets which
I may propose for Boost at some point in the future. You'll find my
first attempt if you look for my name in the Boost list archives from
last September and October. I'm currently revising it, and my first
step has been to define char8_t, char16_t and char32_t. These types are
guaranteed to have exactly the indicated number of bits, and to be char
or wchar_t when that type is the right size. Here's the code:
template <int bits>
struct char_t {
    typedef typename boost::uint_t<bits>::least type;
};

template <>
struct char_t<8*sizeof(char)> {
    typedef char type;
};

template <>
struct char_t<8*sizeof(wchar_t)> {
    typedef wchar_t type;
};

typedef char_t<8>::type  char8_t;
typedef char_t<16>::type char16_t;
typedef char_t<32>::type char32_t;
I suggest using something like char16_t, rather than wchar_t, as the
basis for a UTF-16 string, for portability. I'm currently not sure how
this can work with string literals, though.
Regards, Phil.