Umlaut characters in Unicode

Hello,

do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

<?xml version="1.0" encoding="UTF-8"?>
...
<resource id="1" name="Andreas Plüschke" function="10" contacts=""/>

I am asking because of the ü Umlaut character.
I am guessing that the author used an ISO-8859-1
environment but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.
Jul 20 '05 #1


Jürgen Kahrs wrote:
do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml
<?xml version="1.0" encoding="UTF-8"?>
...
<resource id="1" name="Andreas Plüschke" function="10" contacts=""/>

I am asking because of the ü Umlaut character.


Why is an umlaut a problem? Unicode certainly contains/allows umlaut
characters.
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #2
Martin Honnen wrote:
Why is an umlaut a problem? Unicode certainly contains/allows umlaut
characters.


An umlaut is not a problem for Unicode.
It becomes a problem if you write a text
with an editor in ISO-8859-1 mode and
view it with an editor in UTF-8
mode.

For example, while writing this posting,
I am using ISO-8859-1 mode, and this is a u-umlaut: ü
Now switch your news reader to UTF-8 and you
will find that the character no longer looks like
a u-umlaut.
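
The mismatch described in this posting can be reproduced in a few lines. The following is only a minimal sketch in Java; the class and method names are hypothetical and do not come from the thread. It encodes a ü in one charset and decodes the bytes in the other, in both directions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original posting.
class UmlautMismatch {
    // Encode as ISO-8859-1, then (wrongly) decode the bytes as UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        // The String constructor is lenient: malformed input is replaced
        // with U+FFFD instead of raising an error.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // ü, written as an escape so the source file encoding does not matter
        System.out.println(latin1ReadAsUtf8(umlaut)); // a replacement character, not ü
        System.out.println(utf8ReadAsLatin1(umlaut)); // two characters: U+00C3 U+00BC
    }
}
```

The second direction is the familiar case of a single accented letter showing up as two odd-looking characters.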
Jul 20 '05 #3
In article <2v************ *@uni-berlin.de>,
Jürgen Kahrs <Ju************ *********@vr-web.de> wrote:
:Martin Honnen wrote:
:
:> Why is an umlaut a problem? Unicode certainly contains/allows umlaut
:> characters.
:
:An umlaut is not a problem for Unicode.
:It becomes a problem if you write a text
:with an editor in ISO-8859-1 mode and
:view it with an editor in UTF-8
:mode.
:
:For example, while writing this posting,
:I am using ISO-8859-1 mode, and this is a u-umlaut: ü
:Now switch your news reader to UTF-8 and you
:will find that the character no longer looks like
:a u-umlaut.


That's precisely the problem we've encountered with our application,
which stores its data in UTF-8 encoded XML documents.

We maintain everything internally in our Java application as part of a
DOM, and it's saved to an external file on request. But we failed to
force the byte stream written to the file to be encoded to UTF-8, so it
used the default ISO-8859-1 on our American systems. When the next
attempt was made to read the file (only if such characters appeared),
errors occurred because there were non-UTF-8 characters present.

The solution we found was to serialize the DOM with UTF-8 encoding
specified (which we were already doing) and then also specify UTF-8
encoding on the output file stream when writing. When this was done,
opening such an XML file in an editor clearly showed something that did
not resemble the letter with umlaut, or accent, or other special feature.

= Steve =
--
Steve W. Jackson
Montgomery, Alabama
Jul 20 '05 #4
Steve W. Jackson wrote:
We maintain everything internally in our Java application as part of a
DOM, and it's saved to an external file on request. But we failed to
force the byte stream written to the file to be encoded to UTF-8, so it
used the default ISO-8859-1 on our American systems. When the next
attempt was made to read the file (only if such characters appeared),
errors occurred because there were non-UTF-8 characters present.


Yes, this is the situation I was thinking of.
Now, with your unpleasant experience in mind,
would you say that the following document was
also encoded in an inadequate way ?

http://belnet.dl.sourceforge.net/sou...t-example3.xml

As I said in my original posting, I am guessing
that the author used an ISO-8859-1 environment
(just like you) but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.

Thanks for answering !
Jul 20 '05 #5
In article <2v************ *@uni-berlin.de>,
Jürgen Kahrs <Ju************ *********@vr-web.de> wrote:
:Steve W. Jackson wrote:
:
:> We maintain everything internally in our Java application as part of a
:> DOM, and it's saved to an external file on request. But we failed to
:> force the byte stream written to the file to be encoded to UTF-8, so it
:> used the default ISO-8859-1 on our American systems. When the next
:> attempt was made to read the file (only if such characters appeared),
:> errors occurred because there were non-UTF-8 characters present.
:
:Yes, this is the situation I was thinking of.
:Now, with your unpleasant experience in mind,
:would you say that the following document was
:also encoded in an inadequate way ?
:
: http://belnet.dl.sourceforge.net/sou...tproject-examp
: le3.xml
:
:As I said in my original posting, I am guessing
:that the author used an ISO-8859-1 environment
:(just like you) but forgot to change the encoding
:declaration from UTF-8 to ISO-8859-1.
:
:Thanks for answering !


It looks to me as if it's not encoded properly, based on the visual
appearance of the <resource> element near the end.

Just to make clear what I said earlier, the problem we encountered did
not stem from using an ISO-8859-1 encoding in the XML itself. All of
our files already included <?xml version="1.0" encoding="UTF-8"?> at the
top when serialized, since we told the XML serializer to use UTF-8.

Instead, we also write the file using Java's OutputStreamWriter, in
which we specify the stream being written (in this case, Java's
FileOutputStream class designating the file) and the encoding to use
when writing the stream. Only if *both* of these things were done would
non-ASCII characters get correctly written and then parse without error
next time around. We got a separate report of this same problem from a
German user who used a directory name containing an umlaut-o (as in ö)
and from a French user with an accented e (as in é).
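
The approach described above might look roughly like the following. This is a sketch under assumptions, not the code from Steve's application: the class name, the method names, and the temporary file are made up for illustration, and `StandardCharsets` requires Java 7 or later (older code would pass the charset name "UTF-8" as a string instead).

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not taken from the application discussed here.
class Utf8FileWriter {
    // Write the serialized XML, forcing UTF-8 instead of the platform default.
    static void writeUtf8(Path path, String xml) {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.UTF_8)) {
            out.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write to a temporary file and return the raw bytes that ended up on disk.
    static byte[] writeAndReadBack(String xml) {
        try {
            Path file = Files.createTempFile("resource", ".xml");
            try {
                writeUtf8(file, xml);
                return Files.readAllBytes(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                   + "<resource id=\"1\" name=\"Andreas Pl\u00FCschke\"/>\n";
        byte[] bytes = writeAndReadBack(xml);
        // The bytes on disk decode back to the original text when read as UTF-8.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(xml));
    }
}
```

The point of the sketch is that the encoding named in the XML declaration and the encoding used by the writer are set in two separate places, and both have to agree.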

= Steve =
--
Steve W. Jackson
Montgomery, Alabama
Jul 20 '05 #6
In article <2v************ *@uni-berlin.de>,
Jürgen Kahrs <Ju************ *********@vr-web.de> wrote:
do you think that this file is a proper Unicode file?

http://belnet.dl.sourceforge.net/sou...t-example3.xml


The file at that URL appears to be well-formed, and contains a
correctly encoded UTF-8 u-with-umlaut. I don't see any problem with it.

Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error because the result will almost always be an illegal UTF-8 byte
sequence. An XML parser should detect the error.
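
As an illustration of why the error is detectable, here is a minimal Java sketch of a strict UTF-8 check. The class and method names are hypothetical; an XML parser performs an equivalent check internally, and this is not meant to show how any particular parser does it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class Utf8Check {
    // Return true only if the bytes form a legal UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "Pl\u00FCschke"; // Plüschke
        // Real UTF-8 bytes pass the check.
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.UTF_8)));
        // Latin-1 bytes contain a lone 0xFC, which is illegal in UTF-8.
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.ISO_8859_1)));
        // The rare exception: the Latin-1 text U+00C3 U+00BC is the byte pair
        // C3 BC, which happens to be legal UTF-8 (it decodes to ü).
        System.out.println(isValidUtf8("\u00C3\u00BC".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The last case is why the claim is "almost always" rather than "always": a Latin-1 file can, by coincidence, contain only byte pairs that are also well-formed UTF-8.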

-- Richard
Jul 20 '05 #7
On Fri, 12 Nov 2004, Richard Tobin wrote:
Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error
Indeed...
because the result will almost always be an illegal UTF-8 byte
sequence. An XML parser should detect the error.


In fact, anything which is supposed to handle utf-8 should give up at
that point, if only for security reasons. XML is a higher layer in
the protocol layer-cake: I'm not sure that it really should be allowed
to have any say in these lower-level problems. That way lie dragons,
from a security analysis point of view.
Jul 20 '05 #8


Jürgen Kahrs wrote:

Now, with your unpleasant experience in mind,
would you say that the following document was
also encoded in an inadequate way ?

http://belnet.dl.sourceforge.net/sou...t-example3.xml
As I said in my original posting, I am guessing
that the author used an ISO-8859-1 environment
(just like you) but forgot to change the encoding
declaration from UTF-8 to ISO-8859-1.


I have no problems viewing that file with Netscape 7 or IE 6, I don't
see anything displayed incorrectly that suggests the encoding has not
been declared correctly.
--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 20 '05 #9
Richard Tobin wrote:
Putting a UTF-8 declaration on a file that is really Latin-1 (and which
contains non-ascii characters) will almost always result in a detectable
error because the result will almost always be an illegal UTF-8 byte
I should have looked into the hexdump immediately:

00002250 20 6e 61 6d 65 3d 22 41 6e 64 72 65 61 73 20 50 | name="Andreas P|
00002260 6c c3 bc 73 63 68 6b 65 22 20 66 75 6e 63 74 69 |l..schke" functi|

The UTF-8 byte sequence C3 BC decodes to code point U+00FC, as described here:

http://www.pemberley.com/janeinfo/latin1.html#utf8

And U+00FC is indeed the position of the ü, as shown
on page 2 of this chart:

http://www.unicode.org/charts/PDF/U0080.pdf

This mixture of bitwise encoding and character sets
is a pain if you work with it rarely.
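
For readers who want to check the arithmetic, the conversion from the two bytes C3 BC to U+00FC can be sketched in Java as follows. The class and method names are hypothetical, and the sketch handles only the two-byte form (110xxxxx 10yyyyyy); it does not reject overlong encodings.

```java
// Hypothetical helper class for illustration only.
class TwoByteUtf8 {
    // Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a code point.
    static int decodeTwoBytes(int lead, int trail) {
        if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // Keep the low 5 bits of the lead byte and the low 6 bits of the
        // trail byte, then join them: xxxxxyyyyyy.
        return ((lead & 0x1F) << 6) | (trail & 0x3F);
    }

    public static void main(String[] args) {
        // C3 = 110 00011, BC = 10 111100  ->  00011 111100 = 0xFC
        int codePoint = decodeTwoBytes(0xC3, 0xBC);
        System.out.printf("U+%04X%n", codePoint); // prints U+00FC
    }
}
```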
sequence. An XML parser should detect the error.


The problem was that I did not trust my parser.
I think I should put the Unicode 4.0 book onto my book shelf.

Thanks to all who answered.
Jul 20 '05 #10
