
Displaying Non-ASCII Characters in C++

This post is a follow-up to the post at
http://groups.google.com/group/comp....aa6fab5622424e
as my original question was answered there, but I have some additional
problems now.

Basically what I want to do is: given an input UTF-8 encoded file
containing HTML sequences such as "&amp;", I want to be able to
replace these sequences with their UTF-8 representations (i.e. "&").

What I have so far: looking at some of the source code of the Mozilla
Firefox project, I have a small class that can convert the HTML
sequences into a number representing the Unicode value of that
character, i.e. "&amp;" is represented by a Unicode value of 38
(source: http://www.ascii.cl/htmlcodes.htm).

My question: how can I use this Unicode value to convert it into the
character "&" and write it to a file or display it on the terminal? I tried
using something along the lines of printf("\u0012"), but that produces
the following compilation error: "\u0012 is not a valid universal
character".

Thx!

Tushar
Dec 5 '07 #1
5 Replies


On Dec 5, 6:23 am, tushar.sax...@gmail.com wrote:
> This post is a follow-up to the post at
> http://groups.google.com/group/comp....ead/thread/83a...
> as my original question was answered there, but I have some additional
> problems now.
>
> Basically what I want to do is: given an input UTF-8 encoded file
> containing HTML sequences such as "&amp;", I want to be able to
> replace these sequences with their UTF-8 representations (i.e. "&").
>
> What I have so far: looking at some of the source code of the Mozilla
> Firefox project, I have a small class that can convert the HTML
> sequences into a number representing the Unicode value of that
> character, i.e. "&amp;" is represented by a Unicode value of 38
> (source: http://www.ascii.cl/htmlcodes.htm).
>
> My question: how can I use this Unicode value to convert it into the
> character "&" and write it to a file or display it on the terminal? I tried
> using something along the lines of printf("\u0012"), but that produces
> the following compilation error: "\u0012 is not a valid universal
> character".
>
> Thx!
>
> Tushar
You could refer to the unicode.org FAQs; that might help.
By the way, are you trying to display non-ASCII characters at the
prompt? AFAIK, DOS at least doesn't support non-ASCII characters...
Which platform are you working on?
Dec 5 '07 #2

On Dec 5, 2:23 am, tushar.sax...@gmail.com wrote:
> This post is a follow-up to the post at
> http://groups.google.com/group/comp....ead/thread/83a...
> as my original question was answered there, but I have some additional
> problems now.
>
> Basically what I want to do is: given an input UTF-8 encoded file
> containing HTML sequences such as "&amp;", I want to be able to
> replace these sequences with their UTF-8 representations (i.e. "&").
The UTF-8 representation of "&" is a single byte, with the
value 0x26. Formally, that might be a '&', or it might not.
(In practice, it usually is:-). Even the IBM mainframe version
of C that I've seen mapped the native EBCDIC to ASCII, so that
within C programs, '&' was 0x26. I'm not sure how this would
have been written to a text file; the more common variants of
EBCDIC don't have a & character.)
> What I have so far: looking at some of the source code of the Mozilla
> Firefox project, I have a small class that can convert the HTML
> sequences into a number representing the Unicode value of that
> character, i.e. "&amp;" is represented by a Unicode value of 38
> (source: http://www.ascii.cl/htmlcodes.htm).
>
> My question: how can I use this Unicode value to convert it into the
> character "&" and write it to a file or display it on the terminal? I tried
> using something along the lines of printf("\u0012"), but that produces
> the following compilation error: "\u0012 is not a valid universal
> character".
You're not allowed to use universal character names for
characters in the basic character set. A simple "&" will work
in this case, giving you the encoding of an ampersand in
whatever the compiler uses as its default narrow character
encoding (which will be compatible with ASCII/UTF-8/ISO 8859-n
99.9% of the time).
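A minimal illustration of that point (just a sketch, for this thread's context): the universal character name \u0026 is rejected precisely because '&' is in the basic character set, while the plain character or a hex escape produces the 0x26 byte directly.

    #include <cstdio>

    int main()
    {
        // printf("\u0026") would be ill-formed: universal character names
        // may not designate members of the basic character set.
        std::printf("%c\n", '&');   // plain character literal, byte 0x26
        std::printf("\x26\n");      // the same byte written as a hex escape
        return 0;
    }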

You really have two separate problems. One is converting the
sequence "&" to whatever internal encoding you are using
(e.g. UTF-8). The second is converting this internal encoding
to whatever the display device (or file) expects. If the
display device can handle UTF-8, you're home free. If it can't
you'll have to convert the UTF-8 encodings into something it can
handle. In the case of "&", there's a 99.9% chance that the
display device will handle the UTF-8 encoding correctly, since
in this particular case, it is also the ASCII encoding. (And
thus, the encoding in all of the ISO 8859-n character sets as
well. Of course, if you fall into the 0.1% chance, and your
display device uses EBCDIC, then you might not be able to
display it at all.) For other characters, it's far from
obvious, however; something like "&mdash;" maps to Unicode
'\u2014' -- the sequence 0xE2, 0x80, 0x94 in UTF-8. Depending
on the encoding used by the display device, you may be able to
map this directly; otherwise (in this case---there isn't always
a good general solution for this), you might map it to a 0x2D
(hyphen-minus in ASCII), or maybe a sequence of two of them. In
some cases, there really isn't any good solution---the input
specifies some Chinese ideograph, and the display device doesn't
have any Chinese ideographs in its fonts. A lot depends on just
what characters you want to support, and how much effort you
want to invest.
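To make the first of those two problems concrete, here is a minimal sketch of an entity-name lookup (the function name and the tiny table are only illustrative; a real table would cover the full list of named HTML entities, and numeric references such as "&#8212;" would be parsed separately):

    #include <map>
    #include <string>

    // Map an HTML entity name (the part between '&' and ';') to its
    // Unicode code point.  Only a handful of entries are shown here.
    unsigned long entityToCodePoint(const std::string& name)
    {
        static std::map<std::string, unsigned long> table;
        if (table.empty()) {
            table["amp"]    = 0x0026;   // '&'
            table["lt"]     = 0x003C;   // '<'
            table["gt"]     = 0x003E;   // '>'
            table["mdash"]  = 0x2014;   // em dash
            table["eacute"] = 0x00E9;   // 'e' with acute accent
        }
        std::map<std::string, unsigned long>::const_iterator it = table.find(name);
        return it != table.end() ? it->second : 0xFFFD;   // U+FFFD if unknown
    }

The second problem is then turning that code point into whatever the output device or file expects, e.g. the UTF-8 byte sequence.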

Note that it's not always simple to know what the display device
actually supports, either. Under X, it is the font which
determines the encoding. If you're managing the windows
yourself, you select the font, and you can probably know what
the encoding is. (The X font specification string has fields
for the encoding.) (I'm not too familiar with Windows; I think
the Window manager will always handle UTF-16, mapping it itself
if necessary for the font. But you still have the problem that
not all fonts have all Unicode characters.) If you're outputting
to std::cout, in an xterm, however, you have absolutely no means
of knowing. And if you're outputting to a file, with the idea
that the user will later do a cat, you have the problem that
different windows can use different fonts with different
encodings; the problem is unsolvable. You just have to
establish a convention, tell the user about it, and leave it up
to him. (In the Unix world, or anything networked, I'd use
UTF-8, unless there were some constraints involving legacy
files; in a purely Windows environment, I'd probably use
UTF-16LE.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Dec 5 '07 #3

On Dec 5, 7:37 pm, Thomas Dickey <dic...@saltmine.radix.net> wrote:
> James Kanze <james.ka...@gmail.com> wrote:
> [...]
>> if necessary for the font. But you still have the problem that
>> not all fonts have all Unicode characters.) If you're outputting
>> to std::cout, in an xterm, however, you have absolutely no means
>> of knowing. And if you're outputting to a file, with the idea
> ...most people would rely on the locale settings to give a
> hint here.
It depends. At least under Unix with X, locale and the font
encoding are completely independent. And neither can really
solve the most basic problem: if I write to a file, what should
I write if the file will later be copied to two different
devices, using two different encodings?

The problems are far from simple. On the whole, I'd say when in
doubt, use UTF-8, and I'd certainly opt for UTF-8 for most new
uses. But legacy code and legacy environments won't go away
like that: where I work, for some reason, there are no UTF-8
fonts installed (for X); at home, I still have an old printer
which only understands ISO 8859-1, etc.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Dec 6 '07 #4

On Dec 5, 6:31 pm, tushar.sax...@gmail.com wrote:
> Thanks for the replies everyone. The input file where I am reading the
> data from is encoded in UTF-8, and so is the output file where I have
> to write the modified data. The OS used is Linux. The terminal I use
> is UTF-8 enabled, as I can correctly see characters beyond the normal
> ASCII range. In any case, I am not so much worried about the actual
> display of the characters as much as writing the correct data into the
> file.
So what is the problem? It seems obvious in that case that you
should use UTF-8. If you're under Linux, too, you can be sure
that the basic execution character set is something ASCII based,
so that all characters in the basic execution set will have the
same encodings as in ASCII (and thus, as in UTF-8). I wouldn't
take the risk of using character and string constants for
anything outside the basic character set, however; I'd use
something like "\xC3\xA9" for "&eacute;".

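As a small sketch of that suggestion (the file name is only an example): the bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9, so writing them through a narrow stream does not depend on the compiler's extended character handling at all.

    #include <fstream>
    #include <iostream>

    int main()
    {
        // "\xC3\xA9" spells out the UTF-8 encoding of U+00E9 byte by byte,
        // so the compiler's source character set doesn't matter.
        std::ofstream out("out.txt", std::ios::binary);
        out << "caf\xC3\xA9\n";        // the UTF-8 bytes go to the file as-is
        std::cout << "caf\xC3\xA9\n";  // shows up as "café" on a UTF-8 terminal
        return 0;
    }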
--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Dec 6 '07 #5

Well, that was part of the problem that I faced, James. I wasn't quite
sure how to write the Unicode sequences to the file. It has been resolved
now, though: I wrote a small function to encode the Unicode characters
to UTF-8, and I'm writing that to the file.
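A function of the kind described might look roughly like the sketch below (an assumption about the approach, not the poster's actual code); it appends the UTF-8 bytes for one code point to a string that can then be written to the output file.

    #include <string>

    // Encode one Unicode code point as UTF-8 and append the bytes to 'out'.
    // Covers U+0000 .. U+10FFFF; no check for surrogate code points is shown.
    void appendUtf8(unsigned long cp, std::string& out)
    {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Example: appendUtf8(0x2014, line) appends 0xE2 0x80 0x94 (an em dash).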

Thanks again everyone for all your help.
Dec 6 '07 #6
