Extended Characters in XML

Hello,

My company collects data from non-US sources. We are starting projects
where this data will be output in an XML document and passed around to
our applications and third party tools.

The data includes some of the extended characters. We get strange
accent marks, italics and the like. These characters have decimal
value in the 200+ range.

So how do you handle these in XML with the assurance that you won't
lose content and the off-the-shelf XML technologies will interpret them
correctly and not simply reject the document as flawed?

We know about the special escape sequences for the reserved XML
characters like '>' and '<'. Are there standard escape sequences for
the extended characters?

Thanks ahead of time for any help.

Bart
ba*******@comcast.net

Jul 20 '05 #1
On 18 Mar 2005 ba*******@comcast.net wrote:

> The data includes some of the extended characters. We get strange
> accent marks, italics

Italics??

> and the like. These characters have decimal
> value in the 200+ range.
> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?

One possibility is to write all of them in the form &#number;
where number is the decimal code position in Unicode.

> Are there standard escape sequences for the extended characters?


&#number; , which is the same as in SGML/HTML. See
http://www.unics.uni-hannover.de/nht...ilingual2.html
for examples in various scripts.
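
A minimal sketch of the &#number; approach in Python, in case it helps. The function name and the sample string are made up for illustration; the only library feature relied on is the standard "xmlcharrefreplace" codec error handler, which replaces each character that does not fit in ASCII with its decimal reference.

```python
def to_decimal_refs(text: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with &#number;."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "Caf\u00e9 Z\u00fcrich \u20ac5"   # "Café Zürich €5"
escaped = to_decimal_refs(sample)
print(escaped)  # Caf&#233; Z&#252;rich &#8364;5
```

Note that this only deals with the non-ASCII characters; "<" and "&" in the data still have to be escaped separately.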

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 20 '05 #2


ba*******@comcast.net wrote:

> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.


Any XML parser is required to support the UTF-8 encoding, so you can
encode your XML documents as UTF-8 and then use all the characters
Unicode supports directly in your document. You only need to make sure
you use an editor that can create UTF-8 encoded documents. Or you
could, as already suggested, escape characters with a numeric reference
to the Unicode code point, e.g. &#8364; for the Euro sign €.
<http://www.unicode.org/>
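
If the documents are generated by a program rather than typed in an editor, the same idea can be sketched as follows. This assumes Python 3.8 or later and its standard xml.etree.ElementTree module; the element names and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Build a small document containing non-ASCII characters directly.
root = ET.Element("record")
ET.SubElement(root, "name").text = "J\u00fcrgen M\u00fcller"   # Jürgen Müller
ET.SubElement(root, "price").text = "\u20ac 12,50"             # € 12,50

# Serialize to UTF-8 bytes, with an XML declaration naming the encoding.
utf8_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# Round trip: a conforming parser gives back the same characters.
parsed = ET.fromstring(utf8_bytes)
print(parsed.find("name").text)
```

No character references are needed here; the characters are simply stored as UTF-8 byte sequences.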
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #3
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:

> On 18 Mar 2005 ba*******@comcast.net wrote:
>
>> The data includes some of the extended characters. We get strange
>> accent marks, italics
>
> Italics??


That sounds somewhat strange indeed, since normally the font style is
expressed at a level other than character level, e.g. in markup.
(Contrary to populistic propaganda, XML markup is not inherently
"logical"; nothing prevents you from using XML markup for purely
presentational purposes. If you need to store information in a manner
that preserves formatting information, that might be a good idea.
Using <i> for italics as in HTML would be natural then.)

But there _are_ characters in Unicode that are italicized variants of
other characters. Many of them are compatibility characters that have
been included just because they exist as characters in other standards.
There are other cases as well. If this topic is relevant, then the
document "Unicode in XML and other Markup Languages"
http://www.w3.org/TR/unicode-xml/ should be studied.

>> and the like. These characters have decimal
>> value in the 200+ range.
>> So how do you handle these in XML with the assurance that you
>> won't lose content and the off-the-shelf XML technologies will
>> interpret them correctly and not simply reject the document as
>> flawed?
>
> One possibility is to write all of them in the form &#number;
> where number is the decimal code position in Unicode.


That's certainly a way to represent them in XML, and this might be useful
to protect against problems with encodings (and transcoding). However
it normally wins nothing and loses a lot in readability of the text in
XML source. (In XML it might be better to use &#xhhhh; where hhhh is
the code in hexadecimal, since character code standards and references
generally use hex.)
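
To illustrate the hexadecimal form, here is a rough Python sketch. The helper name is hypothetical; the parsing uses the standard xml.etree.ElementTree module. It also shows that the decimal reference, the hex reference and the literal character all come out of the parser as the same character.

```python
import xml.etree.ElementTree as ET


def to_hex_refs(text: str) -> str:
    """Hypothetical helper: write each non-ASCII character as &#xhhhh;."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


hex_form = to_hex_refs("\u20ac")   # the Euro sign, U+20AC
print(hex_form)                    # &#x20AC;

# Decimal reference, hex reference and the literal character are equivalent.
docs = ["<p>&#8364;</p>", "<p>&#x20AC;</p>", "<p>\u20ac</p>"]
values = [ET.fromstring(doc).text for doc in docs]
print(values)
```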

If the data needs to be processed using old software too, then all
kinds of problems may arise. If you need to be prepared for _anything_,
then only the invariant subset of ASCII is safe, or mostly safe. But it
would be a mistake to convert data to ASCII using some simplifications
and transmogrifications, unless you _know_ there will be serious and
unsolvable problems otherwise.

Anything that you can call XML technology even in the feeblest sense
_must_ be able to accept data in UTF-8 encoding and at least store and
forward it unmodified, even if it is incapable of rendering all the
characters or recognizing them in a useful way. So the first step
should be to convert the arriving data into UTF-8 in a safe way.
Normally you should get information about the encoding of the data and
do the conversions automatically, but at early phases you might wish to
do some occasional checks to verify the sensibility of the data. It is
not uncommon for text data to be sent incorrectly labelled (as regards
its encoding), or unlabelled (so that the recipient must guess or
deduce what encoding has been used).
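
A sketch of such a conversion step in Python, under the assumption that the sender tells you the encoding. The function name and sample data are invented. Strict decoding is used so that some mislabelled input fails loudly instead of being silently mangled.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Hypothetical helper: transcode labelled input bytes to UTF-8."""
    # errors="strict" raises UnicodeDecodeError on bytes that are invalid
    # in the declared encoding, rather than substituting replacements.
    text = raw.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")


latin1_bytes = "M\u00e1laga".encode("iso-8859-1")   # "Málaga" as ISO-8859-1
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# The same bytes wrongly labelled as UTF-8 are rejected.
try:
    to_utf8(latin1_bytes, "utf-8")
    mislabelled_detected = False
except UnicodeDecodeError:
    mislabelled_detected = True
print(mislabelled_detected)
```

The check only works in one direction: ISO-8859-1 accepts every byte value, so UTF-8 data mislabelled as ISO-8859-1 decodes without error and has to be caught by inspecting the result.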

Quite apart from this, we cannot realistically expect that all Unicode
characters will be adequately processed and rendered. So it's very
relevant what characters there will be in the input data and how it
should be processed. For example, we can probably expect that if some
software is advertised as reading XML data and storing it into a
database and supporting some searching and retrieval, then it will
accept and store any Unicode data in UTF-8 format. But it might fail to
display the data when retrieved, its sorting routines might not work by
Unicode rules, its case-insensitive search might be something rather
trivial that really works for basic Latin letters only, and it might
even fail to display characters properly right to left according to
their directionality.
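
As an illustration of the case-insensitive search problem, here is a small Python sketch comparing naive lowercasing with Unicode case folding plus normalization. The helper name is hypothetical, and this is only one possible way to build a comparison key, not a full Unicode collation.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: NFC normalization followed by full case folding."""
    return unicodedata.normalize("NFC", text).casefold()


# Simple lowercasing does not equate "SS" with the German sharp s.
naive_match = "STRASSE".lower() == "stra\u00dfe".lower()               # False
unicode_match = search_key("STRASSE") == search_key("stra\u00dfe")     # True

# Precomposed e-acute versus "e" followed by a combining acute accent.
composed_match = search_key("caf\u00e9") == search_key("cafe\u0301")   # True
print(naive_match, unicode_match, composed_match)
```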

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Jul 20 '05 #4
In <11**********************@l41g2000cwc.googlegroups.com>, on
03/18/2005
at 08:17 AM, ba*******@comcast.net said:

> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret
> them correctly and not simply reject the document as flawed?


You can't really guarantee anything, but your best bet is probably to
use UTF-8, which is a transform of Unicode into 8-bit bytes. Note that
there are standard entity names for many Unicode characters.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 20 '05 #5
"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid>
wrote:

> You can't really guarantee anything, but your best bet is probably
> to use UTF-8, which is a transform of Unicode into 8-bit bytes.

Indeed.

> Note that there are standard entity names for many Unicode
> characters.


No, there aren't - in XML. In XML, the only predefined entity names
are &lt;, &gt;, &amp;, &quot;, and &apos;.

There are "standard entity names" in the sense that the SGML standard
contains a large number of entity declarations as samples, and some of
them have been copied to HTML. But from the XML viewpoint, there is
nothing standard about them; XML is logically independent of the SGML
standard. One might argue that if you declare entities that denote
Unicode characters, it would be advisable to use the same names as in
the SGML standard if possible. But even this is far from clear; the
SGML names are partly ridiculously and obscurely truncated (quickly,
guess what the "mnemonic" &lang; means!). Besides, you don't _need_ the
entities (except &lt; and &amp;) when you use UTF-8.
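
This is easy to check with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the sample documents are invented. It shows the five predefined entities being accepted, an HTML-style name being rejected, and the same name working once it is declared in the document's internal DTD subset.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&quot;&apos;</p>").text
print(predefined)   # <>&"'

# An HTML-style name such as &eacute; is undefined in plain XML.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    undefined_rejected = False
except ET.ParseError:
    undefined_rejected = True
print(undefined_rejected)

# Declaring the entity in the internal subset makes it usable.
declared_doc = '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
declared_text = ET.fromstring(declared_doc).text
print(declared_text)
```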

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Jul 20 '05 #6
ba*******@comcast.net wrote:

> Hello,
>
> My company collects data from non-US sources. We are starting projects
> where this data will be output in an XML document and passed around to
> our applications and third party tools.
>
> The data includes some of the extended characters. We get strange
> accent marks, italics and the like. These characters have decimal
> value in the 200+ range.

Accents are normal in many non-English languages, so they probably
aren't "strange" to the originators. As Jukka has pointed out, what
look like italics are probably variant characters which happen to
be sloping.

> So how do you handle these in XML with the assurance that you won't
> lose content and the off-the-shelf XML technologies will interpret them
> correctly and not simply reject the document as flawed?

If you use XML software which conforms to the standards then it will handle
all the characters correctly (provided you also conform to the same
standards). If you need to be able to accept pretty much any character
from any source, use the UTF-8 encoding.

> We know about the special escape sequences for the reserved XML
> characters like '>' and '<'. Are there standard escape sequences for
> the extended characters?

">" is not a reserved character, it's just a character. It only has a
special meaning when it's used to close a start-tag or end-tag. The
only two reserved characters are "<" and "&". The latter is the one you
want for the named or numeric codes for non-ASCII characters, but if you
use UTF-8 then you won't need it at all except for escaping "<" and "&",
as has already been pointed out.
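
For what it's worth, a short Python sketch of that escaping step, using the escape function from the standard xml.sax.saxutils module. The sample text is invented. Non-ASCII characters pass through untouched and only the markup characters are replaced.

```python
from xml.sax.saxutils import escape

text = "M\u00fcller & S\u00f6hne <GmbH>"   # "Müller & Söhne <GmbH>"
escaped = escape(text)
print(escaped)   # Müller &amp; Söhne &lt;GmbH&gt;
```

This function also turns ">" into &gt;, which is harmless even though it is not strictly required in ordinary character data.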

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"

Jul 20 '05 #7
