Umlaut characters in Unicode

Jürgen Kahrs

Hello,

do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

<?xml version="1.0" encoding="UTF-8"?>
...
<resource id="1" name="Andreas Plüschke" function="10" contacts=""/>

I am asking because of the ü Umlaut character.
I am guessing that the author used an ISO-8859-1
environment but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.

Jul 20 '05 #1


Martin Honnen

Jürgen Kahrs wrote:

do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

<?xml version="1.0" encoding="UTF-8"?>
...
<resource id="1" name="Andreas Plüschke" function="10" contacts=""/>

I am asking because of the ü Umlaut character.

Why is an umlaut a problem? Unicode certainly contains/allows umlaut
characters.
--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #2

Jürgen Kahrs

Martin Honnen wrote:

Why is an umlaut a problem? Unicode certainly contains/allows umlaut
characters.

Umlauts are not a problem for Unicode.
They become a problem if you write a text
with an editor in ISO-8859-1 mode and
view the text with an editor in UTF-8
mode.

For example, while writing this posting,
I use ISO-8859-1 mode and this is a u-umlaut: ü
Now, switch your news reader to UTF-8 and you
will find that the character does not look like
a u-umlaut anymore.

Jul 20 '05 #3
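
The mismatch described in the post above can be reproduced outside a news reader. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and do not come from the thread.

```python
# Illustrative only: the same character, written and read with mismatched encodings.
text = "\u00fc"  # U+00FC, the u-umlaut

latin1_bytes = text.encode("iso-8859-1")  # one byte: FC
utf8_bytes = text.encode("utf-8")         # two bytes: C3 BC

# Latin-1 bytes read as UTF-8: FC is not a valid UTF-8 sequence,
# so a lenient reader substitutes U+FFFD (the replacement character).
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as Latin-1: every byte is a valid Latin-1 character,
# so there is no error, just two wrong characters (U+00C3 and U+00BC).
utf8_read_as_latin1 = utf8_bytes.decode("iso-8859-1")

print(latin1_bytes.hex(), utf8_bytes.hex())
print(repr(latin1_read_as_utf8), repr(utf8_read_as_latin1))
```

Note the asymmetry: reading Latin-1 bytes as UTF-8 is detectable, while reading UTF-8 bytes as Latin-1 silently produces the wrong characters.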

Steve W. Jackson

In article <2v*************@uni-berlin.de>,
Jürgen Kahrs <Ju*********************@vr-web.de> wrote:

:Martin Honnen wrote:
:
:> Why is an umlaut a problem? Unicode certainly contains/allows umlaut
:> characters.
:
:Umlauts are not a problem for Unicode.
:They become a problem if you write a text
:with an editor in ISO-8859-1 mode and
:view the text with an editor in UTF-8
:mode.
:
:For example, while writing this posting,
:I use ISO-8859-1 mode and this is a u-umlaut: ü
:Now, switch your news reader to UTF-8 and you
:will find that the character does not look like
:a u-umlaut anymore.

That's precisely the problem we've encountered with our application,
which stores its data in UTF-8 encoded XML documents.

We maintain everything internally in our Java application as part of a
DOM, and it's saved to an external file on request. But we failed to
force the byte stream written to the file to be encoded to UTF-8, so it
used the default ISO-8859-1 on our American systems. When the next
attempt was made to read the file (only if such characters appeared),
errors occurred because there were non-UTF-8 characters present.

The solution we found was to serialize the DOM with UTF-8 encoding
specified (which we were already doing) and then also specify UTF-8
encoding on the output file stream when writing. When this was done,
opening such an XML file in an editor clearly showed something that did
not resemble the letter with umlaut, or accent, or other special feature.

= Steve =
--
Steve W. Jackson
Montgomery, Alabama

Jul 20 '05 #4

Jürgen Kahrs

Steve W. Jackson wrote:

We maintain everything internally in our Java application as part of a
DOM, and it's saved to an external file on request. But we failed to
force the byte stream written to the file to be encoded to UTF-8, so it
used the default ISO-8859-1 on our American systems. When the next
attempt was made to read the file (only if such characters appeared),
errors occurred because there were non-UTF-8 characters present.

Yes, this is the situation I was thinking of.
Now, with your unpleasant experience in mind,
would you say that the following document was
also encoded in an inadequate way?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

As I said in my original posting, I am guessing
that the author used an ISO-8859-1 environment
(just like you) but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.

Thanks for answering!

Jul 20 '05 #5

Steve W. Jackson

In article <2v*************@uni-berlin.de>,
Jürgen Kahrs <Ju*********************@vr-web.de> wrote:

:Steve W. Jackson wrote:
:
:> We maintain everything internally in our Java application as part of a
:> DOM, and it's saved to an external file on request. But we failed to
:> force the byte stream written to the file to be encoded to UTF-8, so it
:> used the default ISO-8859-1 on our American systems. When the next
:> attempt was made to read the file (only if such characters appeared),
:> errors occurred because there were non-UTF-8 characters present.
:
:Yes, this is the situation I was thinking of.
:Now, with your unpleasant experience in mind,
:would you say that the following document was
:also encoded in an inadequate way?
:
: http://belnet.dl.sourceforge.net/sou...tproject-example3.xml
:
:As I said in my original posting, I am guessing
:that the author used an ISO-8859-1 environment
:(just like you) but forgot to change the encoding
:declaration from UTF-8 to ISO-8859-1.
:
:Thanks for answering!

It looks to me as if it's not encoded properly, based on the visual
appearance of the <resource> element near the end.

Just to make clear what I said earlier, the problem we encountered did
not stem from using an ISO-8859-1 encoding in the XML itself. All of
our files already included <?xml version="1.0" encoding="UTF-8"?> at the
top when serialized, since we told the XML serializer to use UTF-8.

Instead, we also write the file using Java's OutputStreamWriter, in
which we specify the stream being written (in this case, Java's
FileOutputStream class designating the file) and the encoding to use
when writing the stream. Only if *both* of these things were done would
non-ASCII characters get correctly written and then parse without error
next time around. We got a separate report of this same problem from a
German user who used a directory name containing an umlaut-o (as in ö)
and from a French user with an accented e (as in é).

= Steve =
--
Steve W. Jackson
Montgomery, Alabama

Jul 20 '05 #6
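
The application in the post above was written in Java, where the byte encoding is passed to `OutputStreamWriter`. The same two-part requirement (declare UTF-8 in the XML, and also encode the written bytes as UTF-8) can be sketched in Python. This is an analogue under stated assumptions, not the poster's code: the element is borrowed from the example file in the thread, and `write_xml` and `parses_cleanly` are hypothetical helpers.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The declaration claims UTF-8; whether that is true depends on how the bytes are written.
XML_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<resource id="1" name="Andreas Pl\u00fcschke"/>\n'
)


def write_xml(path, byte_encoding):
    """Hypothetical helper: write XML_TEXT to path using the given byte encoding."""
    # Passing the encoding explicitly avoids falling back to the platform default.
    with open(path, "w", encoding=byte_encoding, newline="") as handle:
        handle.write(XML_TEXT)


def parses_cleanly(path):
    """Return True if the file at path parses as XML, False on a parse error."""
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False


with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "good.xml")
    bad = os.path.join(workdir, "bad.xml")
    write_xml(good, "utf-8")       # bytes match the declaration
    write_xml(bad, "iso-8859-1")   # bytes contradict the declaration
    good_ok = parses_cleanly(good)
    bad_ok = parses_cleanly(bad)

print(good_ok, bad_ok)
```

Only the file whose bytes agree with its declaration parses. The other one fails at the first non-ASCII character, which matches the symptom described above.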

Richard Tobin

In article <2v*************@uni-berlin.de>,
Jürgen Kahrs <Ju*********************@vr-web.de> wrote:

do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

The file at that URL appears to be well-formed, and contains a
correctly encoded UTF-8 u-with-umlaut. I don't see any problem with it.

Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error because the result will almost always be an illegal UTF-8 byte
sequence. An XML parser should detect the error.

-- Richard

Jul 20 '05 #7
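
The "almost always" in the post above can be made concrete. The sketch below is illustrative: `is_valid_utf8` is a hypothetical helper, and the rare case is a constructed example rather than something taken from the thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Typical case: a lone Latin-1 u-umlaut (byte FC) between ASCII letters
# is not a valid UTF-8 sequence, so the mislabelling is detected.
typical = "Pl\u00fcschke".encode("iso-8859-1")

# Rare case: Latin-1 text that happens to look like UTF-8.
# U+00C3 followed by U+00BC is the byte pair C3 BC in Latin-1,
# which is also the valid UTF-8 encoding of U+00FC.
rare = "\u00c3\u00bc".encode("iso-8859-1")

print(is_valid_utf8(typical))
print(is_valid_utf8(rare), repr(rare.decode("utf-8")))
```

Undetected cases need a Latin-1 character in the range C2 to F4 followed by the right number of characters in the range 80 to BF, a combination that seldom occurs in ordinary Western European text.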

Alan J. Flavell

On Fri, 12 Nov 2004, Richard Tobin wrote:

Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error

Indeed...

because the result will almost always be an illegal UTF-8 byte
sequence. An XML parser should detect the error.

In fact, anything which is supposed to handle utf-8 should give up at
that point, if only for security reasons. XML is a higher layer in
the protocol layer-cake: I'm not sure that it really should be allowed
to have any say in these lower-level problems. That way lie dragons,
from a security analysis point of view.

Jul 20 '05 #8
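
The post above does not say which security problems it has in mind. One commonly cited example, offered here as an assumption rather than as the poster's point, is the overlong encoding: a character encoded in more bytes than necessary, which a lenient decoder might accept after a filter has already inspected the raw bytes. A strict decoder rejects it, as this small sketch shows; `strict_decode` is a hypothetical helper.

```python
def strict_decode(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# C0 AF is an overlong (non-shortest-form) encoding of "/" (U+002F).
overlong_slash = bytes([0xC0, 0xAF])

print(strict_decode(b"/"))
print(strict_decode(overlong_slash))
```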

Martin Honnen

Jürgen Kahrs wrote:

Now, with your unpleasant experience in mind,
would you say that the following document was
also encoded in an inadequate way?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

As I said in my original posting, I am guessing
that the author used an ISO-8859-1 environment
(just like you) but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.

I have no problems viewing that file with Netscape 7 or IE 6, I don't
see anything displayed incorrectly that suggests the encoding has not
been declared correctly.
--

Martin Honnen
http://JavaScript.FAQTs.com/

Jul 20 '05 #9

Jürgen Kahrs

Richard Tobin wrote:

Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error because the result will almost always be an illegal UTF-8 byte
sequence.

I should have looked into the hexdump immediately:

00002250 20 6e 61 6d 65 3d 22 41 6e 64 72 65 61 73 20 50 | name="Andreas P|
00002260 6c c3 bc 73 63 68 6b 65 22 20 66 75 6e 63 74 69 |l..schke" functi|

The UTF-8 byte sequence C3 BC decodes to code point U+00FC, as described here:

http://www.pemberley.com/janeinfo/latin1.html#utf8

And U+00FC is indeed the position of the ü, as shown
on page 2 of this chart:

http://www.unicode.org/charts/PDF/U0080.pdf

This mixture of bitwise encoding and character sets
is a pain if you only work with it rarely.

An XML parser should detect the error.

The problem was that I did not trust my parser.
I think I should put the Unicode 4.0 book onto my book shelf.

Thanks to all who answered.

Jul 20 '05 #10
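
The conversion from the hexdump bytes to the code point in the post above can be checked by hand. A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy, and the code point is the eleven payload bits xxxxxyyyyyy. The following is a minimal sketch of that arithmetic; `decode_two_byte_utf8` is a hypothetical helper that handles only the two-byte case.

```python
def decode_two_byte_utf8(lead: int, trail: int) -> int:
    """Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) to a code point."""
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the low 5 bits of the lead byte and the low 6 bits of the trail byte.
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    if code_point < 0x80:
        raise ValueError("overlong encoding")
    return code_point


# Bytes c3 bc from the hexdump:
#   0xC3 & 0x1F = 0x03, shifted left by 6 = 0xC0
#   0xBC & 0x3F = 0x3C
#   0xC0 | 0x3C = 0xFC
code_point = decode_two_byte_utf8(0xC3, 0xBC)
print(f"U+{code_point:04X}", chr(code_point))
```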

Richard Tobin

In article <2v*************@uni-berlin.de>,
Jürgen Kahrs <Ju*********************@vr-web.de> wrote:

I think I should put the Unicode 4.0 book onto my book shelf.

You might find this page useful:

http://www.cogsci.ed.ac.uk/~richard/utf-8.html

-- Richard

Jul 20 '05 #11

Jürgen Kahrs

Richard Tobin wrote:

You might find this page useful:

http://www.cogsci.ed.ac.uk/~richard/utf-8.html

Yes, this helps a lot.
I have appended this link to my bookmarks.

Jul 20 '05 #12
