Displaying Non-ASCII Characters in C++

This post is a follow up to the post at :
http://groups.google.com/group/comp....aa6fab5622424e
as my original question was answered there, but I have some additional
problems now.

Basically what I want to do is : Given an input UTF-8 encoded file
containing HTML sequences such as "&amp;", I want to be able to
replace these sequences with their UTF-8 representations (i.e. "&")

What I have so far: Looking at some of the source code of the Mozilla
Firefox project, I have a small class that can convert the HTML
sequences into a number representing the Unicode value of that
character. i.e. "&amp;" is represented by a Unicode value of 38
(source : http://www.ascii.cl/htmlcodes.htm)

My question: How can I convert this Unicode value into the
character "&" and write it to a file or display it on the terminal?
I tried using something along the lines of printf("\u0012"), but that
gives the following compilation error : "\u0012 is not a valid
universal character"

Thx!

Tushar
Dec 5 '07 #1
You could refer to the unicode.org FAQs; they might help.
By the way, are you trying to display non-ASCII characters on the
prompt?
AFAIK, DOS at least doesn't support non-ASCII characters...
Which platform are you working on?
Dec 5 '07 #2
On Dec 5, 2:23 am, tushar.sax...@gmail.com wrote:
> This post is a follow up to the post at
> :http://groups.google.com/group/comp....ead/thread/83a...
> as my original question was answered there, but I have some additional
> problems now.
> Basically what I want to do is : Given an input UTF-8 encoded file
> containing HTML sequences such as "&amp;", I want to be able to
> replace these sequences with their UTF-8 representations (i.e. "&")

The UTF-8 representation of "&" is a single byte, with the
value 0x26. Formally, that might be a '&', or it might not.
(In practice, it usually is :-). Even the IBM mainframe version
of C that I've seen mapped the native EBCDIC to ASCII, so that
within C programs, '&' was 0x26. I'm not sure how this would
have been written to a text file; the more common variants of
EBCDIC don't have a & character.)

> What I have so far: Looking at some of the source code of the Mozilla
> Firefox project, I have a small class that can convert the HTML
> sequences into a number representing the Unicode value of that
> character. i.e. "&amp;" is represented by a Unicode value of 38
> (source :http://www.ascii.cl/htmlcodes.htm)
> My question: How can I use this unicode value to convert it into the
> character "&" and write it to a file/display on the terminal? I tried
> using something along the lines of printf("\u0012"), but that returns
> the following compilation error : "\u0012 is not a valid universal
> character"

You're not allowed to use universal character names for
characters in the basic character set. A simple "&" will work
in this case, giving you the encoding of an ampersand in
whatever the compiler uses as its default narrow character
encoding (which will be compatible with ASCII/UTF-8/ISO 8859-n
99.9% of the time).
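
A minimal sketch of that advice, assuming an ASCII-compatible execution character set (as on Linux): for code points below 128, converting the number to `char` yields the character directly, with no universal character name involved. The helper name `ascii_from_code_point` is invented for this illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: turn a code point in the ASCII range into a
// one-character string. Assumes the execution character set is
// ASCII-compatible, so code point 38 is stored as the byte 0x26.
std::string ascii_from_code_point(unsigned long cp)
{
    if (cp > 0x7F) {
        // Anything larger needs a real encoding step (e.g. UTF-8).
        throw std::out_of_range("not an ASCII code point");
    }
    return std::string(1, static_cast<char>(cp));
}
```

With this, `ascii_from_code_point(38)` yields the one-byte string "&", which can be written to a file or to std::cout like any other string.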

You really have two separate problems. One is converting the
sequence "&amp;" to whatever internal encoding you are using
(e.g. UTF-8). The second is converting this internal encoding
to whatever the display device (or file) expects. If the
display device can handle UTF-8, you're home free. If it can't
you'll have to convert the UTF-8 encodings into something it can
handle. In the case of "&", there's a 99.9% chance that the
display device will handle the UTF-8 encoding correctly, since
in this particular case, it is also the ASCII encoding. (And
thus, the encoding in all of the ISO 8859-n character sets as
well. Of course, if you fall into the 0.1% chance, and your
display device uses EBCDIC, then you might not be able to
display it at all.) For other characters, it's far from
obvious, however; something like "&mdash;" maps to Unicode
'\u2014' -- the sequence 0xE2, 0x80, 0x94 in UTF-8. Depending
on the encoding used by the display device, you may be able to
map this directly; otherwise (in this case---there isn't always
a good general solution for this), you might map it to a 0x2D
(hyphen-minus in ASCII), or maybe a sequence of two of them. In
some cases, there really isn't any good solution---the input
specifies some Chinese ideograph, and the display device doesn't
have any Chinese ideographs in its fonts. A lot depends on just
what characters you want to support, and how much effort you
want to invest.
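
The fallback idea can be sketched as a small lookup, assuming the output device accepts only ASCII. The function name and the choice of substitutes are illustrative assumptions, not an established mapping; a real table depends on which characters you need to support.

```cpp
#include <string>

// Hypothetical fallback for an ASCII-only display device: returns a
// plain-ASCII stand-in for a Unicode code point.
std::string ascii_fallback(char32_t cp)
{
    if (cp < 0x80) {
        // ASCII passes through unchanged.
        return std::string(1, static_cast<char>(cp));
    }
    switch (cp) {
    case 0x2013: return "-";   // EN DASH -> one hyphen-minus
    case 0x2014: return "--";  // EM DASH -> two hyphen-minus
    case 0x00E9: return "e";   // e with acute -> unaccented e
    default:     return "?";   // no sensible substitute in this table
    }
}
```

An ideograph such as U+4E2D falls through to "?", which is the "no good solution" case described above.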

Note that it's not always simple to know what the display device
actually supports, either. Under X, it is the font which
determines the encoding. If you're managing the windows
yourself, you select the font, and you can probably know what
the encoding is. (The X font specification string has fields
for the encoding.) (I'm not too familiar with Windows; I think
the Window manager will always handle UTF-16, mapping it itself
if necessary for the font. But you still have the problem that
not all fonts have all Unicode characters.) If you're outputting
to std::cout, in an xterm, however, you have absolutely no means
of knowing. And if you're outputting to a file, with the idea
that the user will later do a cat, you have the problem that
different windows can use different fonts with different
encodings; the problem is unsolvable. You just have to
establish a convention, tell the user about it, and leave it up
to him. (In the Unix world, or anything networked, I'd use
UTF-8, unless there were some constraints involving legacy
files; in a purely Windows environment, I'd probably use
UTF-16LE.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
Dec 5 '07 #3
On Dec 5, 7:37 pm, Thomas Dickey <dic...@saltmine.radix.net> wrote:
> James Kanze <james.ka...@gmail.com> wrote:
> [...]
>> if necessary for the font. But you still have the problem that
>> not all fonts have all Unicode characters.) If you're outputting
>> to std::cout, in an xterm, however, you have absolutely no means
>> of knowing. And if you're outputting to a file, with the idea
> ...most people would rely on the locale settings to give a
> hint here.

It depends. At least under Unix with X, locale and the font
encoding are completely independent. And neither can really
solve the most basic problem: if I write to a file, what should
I write if the file will later be copied to two different
devices, using two different encodings?

The problems are far from simple. On the whole, I'd say when in
doubt, use UTF-8, and I'd certainly opt for UTF-8 for most new
uses. But legacy code and legacy environments won't go away
like that: where I work, for some reason, there are no UTF-8
fonts installed (for X); at home, I still have an old printer
which only understands ISO 8859-1, etc.
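
For reference, this is roughly how a program can ask for the locale's hint on POSIX systems. `nl_langinfo(CODESET)` is a POSIX call, not standard C++, and the wrapper name `locale_codeset` is made up here. As noted above, the result reflects only the user's locale settings and says nothing certain about the font, or about where a file will eventually be displayed.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX header, not part of standard C++
#include <string>

// Returns the encoding name advertised by the user's locale, for
// example "UTF-8" or "ISO-8859-1". Treat it as a hint only.
std::string locale_codeset()
{
    // Adopt the locale from the environment (LANG, LC_ALL, ...);
    // without this call the program stays in the default "C" locale.
    std::setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```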
Dec 6 '07 #4
On Dec 5, 6:31 pm, tushar.sax...@gmail.com wrote:
> Thanks for the replies everyone. The input file where I am reading the
> data from is encoded in UTF-8, and so is the output file where I have
> to write the modified data. The OS used is Linux. The terminal I use
> is UTF-8 enabled, as I can correctly see characters beyond the normal
> ASCII range. In any case, I am not so much worried about the actual
> display of the characters as much as writing the correct data into the
> file.

So what is the problem? It seems obvious in that case that you
should use UTF-8. If you're under Linux, too, you can be sure
that the basic execution character set is something ASCII based,
so that all characters in the basic execution set will have the
same encodings as in ASCII (and thus, as in UTF-8). I wouldn't
rely on literal characters outside that set in character or
string constants, however; I'd avoid character constants for them
entirely, and in strings I'd spell out the bytes, using something
like "\xC3\xA9" for "&eacute;".
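
To illustrate the hex-escape approach, here is a small sketch; the names `e_acute` and `ecole` are invented for the example. One thing to watch: a hex escape absorbs every hex digit that follows it, so a literal has to be split when the next character happens to be one of 0-9, a-f or A-F.

```cpp
#include <string>

// U+00E9 (e with acute accent) spelled as its two UTF-8 bytes, so the
// result does not depend on the encoding of the source file.
const std::string e_acute = "\xC3\xA9";

// The word "Ecole" with an accented capital E (U+00C9, UTF-8 C3 89).
// The literal is split in two because "\x89cole" would be read as the
// single escape \x89c followed by "ole".
std::string ecole()
{
    return "\xC3\x89" "cole";
}
```

Adjacent string literals are concatenated by the compiler, so the split costs nothing at run time.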
Dec 6 '07 #5
Well, that was part of the problem I faced, James. I wasn't quite
sure how to write the Unicode sequences to file. It has been resolved
now, though: I wrote a small function to encode the Unicode characters
to UTF-8 and I'm writing that to file.
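
The function itself was not posted. For readers who land here, a minimal sketch of such an encoder might look like the following; it assumes a C++11 compiler for `char32_t`, and the name `encode_utf8` and the choice to return an empty string for invalid input are assumptions of this sketch.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. Returns an empty string for
// values that are not valid scalar values (surrogates, or anything
// above U+10FFFF).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate halves are not characters
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(38)` gives "&" and `encode_utf8(0x2014)` gives the three bytes 0xE2, 0x80, 0x94; the resulting string can be written to the output file unchanged.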

Thanks again everyone for all your help.
Dec 6 '07 #6