Bytes IT Community

UTF-8 & Unicode

P: n/a
Do web pages have to be created in unicode in order to use UTF-8 encoding?
If so, can anyone name a free application which I can use under Windows 98
to create web pages?
Jul 20 '05 #1
27 Replies


P: n/a
EU citizen wrote:
Do web pages have to be created in unicode in order to use UTF-8 encoding?


Yes, but that doesn't mean you need a special text editor: any plain
US-ASCII (but not ISO 8859-1) file is automatically correct in UTF-8.
Jul 20 '05 #2
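The claim in reply #2 can be checked mechanically. The following is a minimal Python sketch, not anything from the thread itself; the sample strings and the helper name is_valid_utf8 are made up for illustration. It shows that pure US-ASCII bytes decode unchanged as UTF-8, while ISO 8859-1 bytes containing an accented letter do not.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Pure US-ASCII text: the very same bytes are also valid UTF-8.
ascii_bytes = "Hello, world".encode("ascii")

# ISO 8859-1 text with an accented letter: the accented e is the single
# byte 0xE9, which is not a complete UTF-8 sequence on its own.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

print(is_valid_utf8(ascii_bytes))   # True
print(is_valid_utf8(latin1_bytes))  # False
```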

P: n/a
[Follow-up to comp.infosystems.www.authoring.html]

EU citizen wrote:
Do web pages have to be created in unicode in order to use UTF-8 encoding?


UTF-8 is one of the encoding schemes used for Unicode.
You should read these carefully:
http://www.unicode.org/faq/
http://www.cs.tut.fi/~jkorpela/unicode/guide.html
http://ppewww.ph.gla.ac.uk/~flavell/.../internat.html
Jul 20 '05 #3

P: n/a
EU citizen wrote:
Do web pages have to be created in unicode in order to use UTF-8 encoding?


That's kind of a silly question because UTF-8 is a unicode encoding.
See my 3 part guide to unicode for an in-depth tutorial on creating
unicode files.

http://lachy.id.au/blogs/log/2004/12...unicode-part-1
http://lachy.id.au/blogs/log/2004/12...unicode-part-2
http://lachy.id.au/blogs/log/2005/01...unicode-part-3

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://SpreadFirefox.com/ Igniting the Web
Jul 20 '05 #4

P: n/a
"Lachlan Hunt" <sp***********@gmail.com> wrote in message
news:42**********************@per-qv1-newsreader-01.iinet.net.au...
EU citizen wrote:
Do web pages have to be created in unicode in order to use UTF-8
encoding?
That's kind of a silly question because UTF-8 is a unicode encoding.
See my 3 part guide to unicode for an in-depth tutorial on creating
unicode files.

http://lachy.id.au/blogs/log/2004/12...unicode-part-1
http://lachy.id.au/blogs/log/2004/12...unicode-part-2
http://lachy.id.au/blogs/log/2005/01...unicode-part-3


I wish people would give simple answers to simple questions.
This is not a silly question; See
http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding. Slightly
edited, this says:

XML documents can contain foreign characters like Norwegian , or French
.
To let your XML parser understand these characters, you should save your XML
documents as Unicode.
Windows 95/98 Notepad cannot save files in Unicode format.
You can use Notepad to edit and save XML documents that contain foreign
characters (like Norwegian or French and ),
But if you save the file and open it with IE 5.0, you will get an ERROR
MESSAGE.

Windows 95/98 Notepad files must be saved with an encoding attribute.
To avoid this error you can add an encoding attribute to your XML
declaration, but you cannot use Unicode.
The encoding below (open it with IE 5.0), will NOT give an error message:
<?xml version="1.0" encoding="UTF-8"?>
Jul 20 '05 #5

P: n/a
In article <4Z*************@newsfe5-win.ntli.net>,
EU citizen <no*******@forme.com> wrote:
> Do web pages have to be created in unicode in order to use UTF-8 encoding?
That's kind of a silly question because UTF-8 is a unicode encoding.

I wish people would give simple answers to simple questions.
It may be a simple question for you, because you know what you mean,
but for the rest of us it's a hard-to-understand question, because
if you use UTF-8, you are inevitably using Unicode, since it's a
way of writing Unicode.

But from what you say now, it looks as if your question is really
about some Windows software.
To let your XML parser understand these characters, you should save your XML
documents as Unicode.
Windows 95/98 Notepad cannot save files in Unicode format.
You can use Notepad to edit and save XML documents that contain foreign
characters (like Norwegian or French and ),
But if you save the file and open it with IE 5.0, you will get an ERROR
MESSAGE.
Presumably this means that Notepad saves documents containing those
characters in some non-Unicode encoding, in which case you must put
an appropriate encoding declaration at the top of the document. But
you will need to know the name of the encoding that Notepad uses.

<?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>
Windows 95/98 Notepad files must be saved with an encoding attribute.
This is mysterious. What does it mean? That Notepad won't save
them without one? Or that you have to add one to make it work
in the web browser?
To avoid this error you can add an encoding attribute to your XML
declaration, but you cannot use Unicode.
The encoding below (open it with IE 5.0), will NOT give an error message:
<?xml version="1.0" encoding="UTF-8"?>


It only makes sense to say that you're using UTF-8 if you are. If Notepad
really doesn't know about Unicode, this will only be true if you
restrict yourself to ASCII characters, because they're the same
in UTF-8 as they are in ASCII and most other common encodings.

-- Richard
Jul 20 '05 #6
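Reply #6's point, that the encoding declaration only helps if it names the encoding the bytes are really in, can be illustrated with Python's standard xml.etree.ElementTree parser. This is a sketch under assumptions: the sample document and the helper name parses are invented here, and the bytes are produced in ISO 8859-1 to stand in for what an editor without Unicode support might save.

```python
import xml.etree.ElementTree as ET

# The same text, saved as ISO 8859-1 bytes (one byte per accented letter).
body = "<p>caf\u00e9</p>".encode("iso-8859-1")

# Two declarations in front of identical bytes: one wrong, one right.
wrong = b'<?xml version="1.0" encoding="UTF-8"?>' + body
right = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses(wrong))  # False: byte 0xE9 followed by '<' is not valid UTF-8
print(parses(right))  # True: the label matches the bytes
```

With the correct label, the parser hands back the accented letter as an ordinary Unicode character, whatever encoding the file happened to use.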

P: n/a
On Wed, 2 Feb 2005, EU citizen wrote:
Do web pages have to be created in unicode in order to use UTF-8
encoding?


[...]
I wish people would give simple answers to simple questions.
I don't think you've understood the problem. If the questioner was in
a position to understand the "simple answer" which you say you want, I
can't imagine how they would have asked the question in that form in
the first place.
This is not a silly question;
The original questioner should not feel offended or dispirited by what
I'm going to say: but, in the form in which it was asked, the question
is incoherent.

This is not unusual: many people are confused both by the theory and
by the terminology of character representation, especially if they
gained an initial understanding in a simpler situation (typically,
character repertoires of 256 characters or less, represented by an
8-bit character encoding such as iso-8859-anything; and fonts that
were laid out accordingly).
See
http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding.
How very strange. This claims to be XHTML, but, as far as I can see,
it has no character encoding specified on its HTTP Content-type header
*nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
thingy).

In the absence of a BOM, XML is entitled to deduce that it's utf-8:
but since it's invalid utf-8, it *ought* to refuse to process it.
Unless someone can show me what I'm missing.

By looking at it, it is evidently encoded in iso-8859-1.
It purports to declare that via a "meta http-equiv", but for XML this
is meaningless - and anyway comes far too late.

I don't know why the W3C validator doesn't reject it out of hand?

(Of course the popular browsers will be slurping it as slightly
xhtml-flavoured tag soup, so we can't expect to deduce very much from
the fact that they calmly display what the author intended.)
Slightly
edited, this says:

XML documents can contain foreign characters like Norwegian , or French
.
And those characters are presented encoded in iso-8859-1 ...
To let your XML parser understand these characters, you should save
your XML documents as Unicode.
Two things wrong here. What do they suppose they mean by "save ... as
Unicode"? The XML Document Character Set is *by definition* Unicode,
there's nothing that an author can do to change that (unlike SGML).

Characters can be represented in at least two different ways in XML:
by /numerical character references/ (&#number;), or as /encoded
characters/ using some /character encoding scheme/. (In some contexts
there may also be named character entities, but they introduce no new
principles for the present purpose so we won't need to discuss them
here).

The only coherent interpretation I can put on their "should save as
Unicode" statement is "should save in one of the character encoding
schemes of Unicode". But /should/ we? Do they? No, they don't: they
are using iso-8859-1 (they *could* even do it correctly); and they
also discuss the use of windows-1252, although without giving much
detail about the implications of deploying a proprietary character
encoding on the WWW.

The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.

But the reader still hasn't really learned anything about the
underlying principles yet. And the page hasn't told them anything
useful about *which* encoding to choose for deploying their documents
on the WWW.
Windows 95/98 Notepad cannot save files in Unicode format.


Then it's unfit for composing the kind of document that we are
discussing here. No matter - there are plenty of competent editors
which can work on that platform.

My own tutorial pages weren't really aimed at XML, so I won't suggest
them as an appropriate answer here. Actually, the relevant chapter of
the Unicode specification is not unreasonable as an introduction to
the principles of character representation and encoding, even if it
might be a bit indigestible at a first reading.
Jul 20 '05 #7

P: n/a
/EU citizen/:
XML documents can contain foreign characters like Norwegian , or French
. [...] You can use Notepad to edit and save XML documents that contain foreign
characters (like Norwegian or French and ),


Hm, I don't see any Norwegian or French characters but some Cyrillic
instead... could it be you forgot to label the encoding of your
message? ;-)

--
Stanimir
Jul 20 '05 #8

P: n/a
"Richard Tobin" <ri*****@cogsci.ed.ac.uk> wrote in message
news:ct**********@pc-news.cogsci.ed.ac.uk...
In article <4Z*************@newsfe5-win.ntli.net>,
EU citizen <no*******@forme.com> wrote:
> Do web pages have to be created in unicode in order to use UTF-8 encoding?

That's kind of a silly question because UTF-8 is a unicode encoding.
I wish people would give simple answers to simple questions.


It may be a simple question for you, because you know what you mean,
but for the rest of us it's a hard-to-understand question, because
if you use UTF-8, you are inevitably using Unicode, since it's a
way of writing Unicode.

But from what you say now, it looks as if your question is really
about some Windows software.


No. I am using a version of Windows (like most computer users on this
planet). However, my question isn't specific to Windows. For all I knew,
declaring utf-8 encoding might've caused the file to be transformed into
utf-8 regardless of the original file format.

To let your XML parser understand these characters, you should save your
XML documents as Unicode.
Windows 95/98 Notepad cannot save files in Unicode format.
You can use Notepad to edit and save XML documents that contain foreign
characters (like Norwegian or French and ),
But if you save the file and open it with IE 5.0, you will get an ERROR
MESSAGE.


Presumably this means that Notepad saves documents containing those
characters in some non-Unicode encoding, in which case you must put
an appropriate encoding declaration at the top of the document. But
you will need to know the name of the encoding that Notepad uses.

<?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>


Based on what I know now, I agree. I always assumed that Notepad, being a
simple text editor, saved files in ASCII format. Nothing in Notepad's Help,
Windows' Help or Microsoft's website says anything about the format used by
Notepad. Through experimentation with the W3C HTML validator, I've worked
out that iso-8859-1 will work for Notepad files with standard English text
plus acute accented vowels.
Windows 95/98 Notepad files must be saved with an encoding attribute.
This is mysterious. What does it mean? That Notepad won't save
them without one? Or that you have to add one to make it work
in the web browser?


I can't make head or tail of it.
To avoid this error you can add an encoding attribute to your XML
declaration, but you cannot use Unicode.
The encoding below (open it with IE 5.0), will NOT give an error message:
<?xml version="1.0" encoding="UTF-8"?>


It only makes sense to say that you're using UTF-8 if you are. If Notepad
really doesn't know about Unicode, this will only be true if you
restrict yourself to ASCII characters, because they're the same
in UTF-8 as they are in ASCII and most other common encodings.


The need for the XML encoding statement to match the original file format
was not mentioned in any of the (many) articles I've read on XML/XHTML over
the last *four* years.
Jul 20 '05 #9

P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi******************************@ppepc56.ph.g la.ac.uk...
On Wed, 2 Feb 2005, EU citizen wrote:
> Do web pages have to be created in unicode in order to use UTF-8

encoding?


[...]
I wish people would give simple answers to simple questions.


I don't think you've understood the problem. If the questioner was in
a position to understand the "simple answer" which you say you want, I
can't imagine how they would have asked the question in that form in
the first place.
This is not a silly question;


The original questioner should not feel offended or dispirited by what
I'm going to say: but, in the form in which it was asked, the question
is incoherent.


I think there's a lot of miscommunication going on; I don't entirely
understand what you're posting.
See
http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding.


How very strange. This claims to be XHTML, but, as far as I can see,
it has no character encoding specified on its HTTP Content-type header
*nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
thingy).


<snip>

You make a number of valid criticisms about the w3schools article, but they
turned up near the top of my Google search for information on this subject.
It just shows how difficult it is to get reliable information.
Windows 95/98 Notepad cannot save files in Unicode format.


Then it's unfit for composing the kind of document that we are
discussing here. No matter - there are plenty of competent editors
which can work on that platform.


My original question asked for suggestions about suitable applications, and
yet no one has named one.
Jul 20 '05 #10

P: n/a
In article <Pt***************@newsfe1-win.ntli.net>,
EU citizen <no*******@forme.com> wrote:
Through experimentation with the W3C HTML validator, I've worked
out that iso-8859-1 will work for Notepad files with standard English text
plus acute accented vowels.
Beware that Microsoft uses some proprietary encodings that are ISO-8859-1
for characters A0-FF, but use the C1 controls (80-9F) for other purposes.
If you don't use any of those (and the Euro symbol is quite likely one
of them) you should be OK.
The need for the XML encoding statement to match the original file format
was not mentioned in any of the (many) articles I've read on XML/XHTML over
the last *four* years.


In most circumstances UTF-8 is the default encoding for XML if there
is no encoding declaration. In theory for text/* served by HTTP,
8859-1 is (or was - they may have changed it) the default. But if you
stick to ascii, it won't matter. And remember that you *can* stick to
ASCII and use character references (such as &#xa3;) or entity
references (if you declare them in your DTD) for all non-ascii
characters.

-- Richard
Jul 20 '05 #11
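Reply #11's suggestion of sticking to ASCII and writing everything else as character references can be sketched in Python as follows. The sample text is invented for illustration. Python's built-in xmlcharrefreplace error handler emits decimal references (&#163;) rather than the hexadecimal form (&#xa3;) used in the reply; the two are equivalent.

```python
import xml.etree.ElementTree as ET

text = "Price: \u00a35, caf\u00e9"  # contains a pound sign and an accented e

# Replace every non-ASCII character with a numeric character reference.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_only)  # b'Price: &#163;5, caf&#233;'

# The result is pure ASCII, so the same bytes are valid under a UTF-8 or
# an ISO 8859-1 label, and an XML parser turns the references back into
# the original characters.
document = b"<p>" + ascii_only + b"</p>"
print(ET.fromstring(document).text == text)  # True
```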

P: n/a
On Wed, 2 Feb 2005, EU citizen wrote:
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437

XML documents can contain foreign characters like Norwegian ???, or French
???.


You need to set up your newsreader^W Outlook Express correctly
in order to transmit special, non-ASCII characters:

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #12

P: n/a
On Wed, 2 Feb 2005, Richard Tobin wrote:
In most circumstances UTF-8 is the default encoding for XML if there
is no encoding declaration.
It's a bit more complicated than that.

http://www.w3.org/TR/2000/REC-xml-20001006#charencoding

The /default/ is to look for a BOM - failing which, utf-8 is assumed.

On the other hand, it seems you've caught me out with the next bit:
In theory for text/* served by HTTP,
8859-1 is (or was - they may have changed it) the default.
HTTP hasn't changed. RFC2616 section 3.7.1, last paragraph. Thanks!

So I suppose /that/ was the explanation for the W3C validator not
failing the cited page from w3schools. Thanks.
But if you stick to ascii, it won't matter.


True - although that's hardly a very efficient way to write, say,
Cyrillic, or Arabic, or Japanese.
Jul 20 '05 #13
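The default described in reply #13 (look for a BOM, otherwise assume utf-8) might be sketched like this in Python. It is a deliberately simplified illustration: the function name guess_xml_encoding is hypothetical, and the sketch ignores both the encoding declaration and any HTTP charset parameter, which a real XML processor must also take into account.

```python
import codecs

def guess_xml_encoding(data: bytes) -> str:
    """Simplified default: check for a byte order mark, else assume UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: in the absence of any other information, UTF-8 is assumed.
    return "utf-8"

print(guess_xml_encoding(b"<a/>"))                    # utf-8
print(guess_xml_encoding("<a/>".encode("utf-16")))    # utf-16 (codec adds a BOM)
print(guess_xml_encoding("<a/>".encode("utf-8-sig"))) # utf-8
```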

P: n/a
EU citizen wrote:
My original question asked for suggestions about suitable applications, and
yet no one has named one.


If you cared to take the time to read the guide to unicode I linked to
earlier, you would have found editors mentioned in part 2. Within it, I
mentioned two windows editors that support Unicode: SuperEdi [1] and
Macromedia Dreamweaver. A simple search for "Unicode Editor" also
reveals many other editors that may be capable of doing the job.

[1] http://www.wolosoft.com/en/superedi/

--
Lachlan Hunt
Jul 20 '05 #14

P: n/a
EU citizen wrote:
"Richard Tobin" <ri*****@cogsci.ed.ac.uk> wrote in message
news:ct**********@pc-news.cogsci.ed.ac.uk...
<?xml version="1.0" encoding="whatever-the-notepad-encoding-is"?>
Based on what I know now, I agree. I always assumed that Notepad, being a
simple text editor, saved files in Ascii format.


By default, Notepad saves files as Windows-1252. The characters from 0
to 127 (0x7F) are identical to US-ASCII, ISO-8859-1, UTF-8 and many
other character sets that make use of the same subset. Thus, any file
saved using Windows-1252 that only makes use of those characters is
compatible with all those other encodings.

The characters from 160 (0xA0) to 255 (0xFF) match those contained in
ISO-8859-1. Thus, any file saved using Windows-1252 that only makes use
of the aforementioned US-ASCII subset and that range of characters is
compatible with ISO-8859-1.

The characters from 128 (0x80) to 159 (0x9F), however, do not match
those in any other encoding, making any Windows-1252 file using these
characters incompatible with any other encoding. For XML, this must be
declared appropriately in the XML declaration. The characters in this
range contain the infamous "smart quotes" (Left and Right, single and
double quotation marks: ‘ ’ “ ”) that cause so many problems for the
uneducated. Use of this range while declaring ISO-8859-1, UTF-8 or any
other encoding, will cause errors because they are control characters in
the character repertoires used by those encodings.
Nothing in Notepad's Help, Windows' Help or Microsoft's website says anything
about the format used by Notepad.
It is actually mentioned in a few places on the web, though it's not
easy to find. Microsoft tend to incorrectly refer to it as ANSI, even
though it is not.

Through experimentation with the W3C HTML validator, I've worked out that
iso-8859-1 will work for Notepad files with standard English text plus acute
accented vowels.


That's because Windows-1252 is compatible with ISO-8859-1 when that
subset is used.
Windows 95/98 Notepad files must be saved with an encoding attribute.


This is mysterious. What does it mean? That Notepad won't save
them without one? Or that you have to add one to make it work
in the web browser?


I can't make head or tail of it.


It actually means that version of Notepad will only save as
Windows-1252, so it needs to be declared in the XML declaration. That
is because an XML parser will assume UTF-8 without it and that
assumption is acceptable only when the US-ASCII subset is used.

--
Lachlan Hunt
Jul 20 '05 #15
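The three byte ranges described in reply #15 can be compared directly with Python's built-in codecs. This is only an illustrative sketch; it uses Python's cp1252 codec as a stand-in for what Notepad writes, which is an assumption based on the reply rather than something verified against Windows 98.

```python
# Bytes 0xA0-0xFF decode to the same characters in both encodings.
for value in range(0xA0, 0x100):
    byte = bytes([value])
    assert byte.decode("cp1252") == byte.decode("iso-8859-1")

# Byte 0x93 is a left double quotation mark (U+201C) in Windows-1252 ...
smart_quote = b"\x93"
print(smart_quote.decode("cp1252") == "\u201c")      # True

# ... but a C1 control character when read as ISO 8859-1 ...
print(smart_quote.decode("iso-8859-1") == "\x93")    # True

# ... and not valid UTF-8 at all.
try:
    smart_quote.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```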

P: n/a
On Wed, 2 Feb 2005, EU citizen wrote:
The need for the XML encoding statement to match the original file
format was not mentioned in any of the (many) articles I've read on
XML/XHTML over the last *four* years.


The XML coding has to comply with the relevant bit of the XML
specification. Whether you read it "over the last four years" or not.
http://www.w3.org/TR/REC-xml/#charencoding

Talking about the "original file format" could be misleading, bearing
in mind that some HTTP servers are set up to transcode the
internally-stored file format into one that's more appropriate for use
on the web. For XML-based markups, that may call for appropriate
rewriting of the document's XML encoding specification. And if you're
using XHTML/1.0 Appendix C then the transcoded document would need to
conform to its constraints too.

Jul 20 '05 #16

P: n/a
Alan J. Flavell wrote:
On Wed, 2 Feb 2005, EU citizen wrote:

The need for the XML encoding statement to match the original file
format was not mentioned in any of the (many) articles I've read on
XML/XHTML over the last *four* years.

The XML coding has to comply with the relevant bit of the XML
specification. Whether you read it "over the last four years" or not.
http://www.w3.org/TR/REC-xml/#charencoding

Talking about the "original file format" could be misleading, bearing
in mind that some HTTP servers are set up to transcode the
internally-stored file format into one that's more appropriate for use
on the web. For XML-based markups, that may call for appropriate
rewriting of the document's XML encoding specification. And if you're
using XHTML/1.0 Appendix C then the transcoded document would need to
conform to its constraints too.


RFC 3023 covers XML media types.

My takeaway is that text/xml (and text/and-others-related-to-xml) should be
avoided in favour of application/xml (and
application/and-others-related-to-xml).

Here we get utf-8:
Content-type: text/xml; charset="utf-8"
<?xml version="1.0" encoding="utf-8"?>

!?!?! Here we get US-ASCII, despite the encoding specified:
Content-type: text/xml
<?xml version="1.0" encoding="utf-8"?>

Here we get utf-16:
Content-type: application/xml; charset="utf-16"
{BOM}<?xml version="1.0" encoding="utf-16"?>

Here we get the right encoding-known-by-your-parser:
Content-type: application/xml
<?xml version="1.0" encoding="encoding-known-by-your-parser"?>

--
Cordialement,

///
(. .)
-----ooO--(_)--Ooo-----
| Philippe Poulard |
-----------------------
Jul 20 '05 #17
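The four cases listed in reply #17 can be condensed into a small decision function. This Python sketch only encodes the rules as that reply states them; the function name effective_encoding is hypothetical, and falling back to utf-8 when application/xml has neither a charset nor a declaration is a simplification (a real processor would also check for a BOM).

```python
from email.message import Message
from typing import Optional

def effective_encoding(content_type: str, declared: Optional[str]) -> str:
    """Pick the encoding from the Content-type header and XML declaration."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset()  # lower-cased, or None if absent
    if charset:
        return charset                      # an explicit charset wins
    if header.get_content_maintype() == "text":
        return "us-ascii"                   # text/xml without a charset
    # application/xml without a charset: trust the document itself.
    return (declared or "utf-8").lower()

print(effective_encoding('text/xml; charset="utf-8"', "utf-8"))          # utf-8
print(effective_encoding("text/xml", "utf-8"))                           # us-ascii
print(effective_encoding('application/xml; charset="utf-16"', "utf-16")) # utf-16
print(effective_encoding("application/xml", "ISO-8859-1"))               # iso-8859-1
```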

P: n/a
In article <Pi******************************@ppepc56.ph.gla.a c.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.


That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe. If communication fails, because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.

This simplifies to a rule of thumb:
When producing XML, always use UTF-8 (and Unicode Normalization Form C).
Those who absolutely insist on using UTF-16 can use UTF-16 instead of
UTF-8.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #18
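The rule of thumb in reply #18 (UTF-8 plus Unicode Normalization Form C) can be shown with Python's standard unicodedata module. The sample string is made up for illustration: an "e" followed by a combining acute accent is composed by NFC into the single precomposed character before being encoded.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
print(composed.encode("utf-8"))        # b'\xc3\xa9'
```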

P: n/a
Henri Sivonen wrote:
In article <Pi******************************@ppepc56.ph.gla.a c.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.

That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe. If communication fails, because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.

This simplifies to a rule of thumb:
When producing XML, always use UTF-8 (and Unicode Normalization Form C).
Those who absolutely insist on using UTF-16 can use UTF-16 instead of
UTF-8.


This is theory.

Does anybody know of a parser that doesn't handle iso-8859-1
correctly? I don't think so; if yours doesn't, you should switch parsers, and
communication becomes safe :)

--
Cordialement,

///
(. .)
-----ooO--(_)--Ooo-----
| Philippe Poulard |
-----------------------
Jul 20 '05 #19

P: n/a
On Fri, 4 Feb 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
The /conclusions/ are fine, in their way:

* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.
That is not a safe conclusion.


I guess that was one of the penalties of responding to a cross-posted
article.
XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature.
But that's OK, since any plausible encoding produced by the editor can
be transformed by rote into utf-8 prior to subsequent XML processing
(that's the XML relevance). And pretty much any plausible encoding
produced by an editor that's meant for WWW use, is going to be
supported by the available web browsers (that's the c.i.w.a.h
relevance).
It follows that using any encoding other than UTF-8 or UTF-16 is
unsafe.


I take your point, but again: as long as the document is correctly
labelled, it can be transformed by rote into utf-8, it needs no
special heuristics, nor does it run risks of being damaged in the
process.

all the best

Jul 20 '05 #20
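The "transformed by rote into utf-8" step from reply #20 might look like the following Python sketch. The function name transcode_to_utf8 and the sample document are hypothetical. It assumes the source encoding is already known from a correct label, decodes strictly, rewrites the encoding pseudo-attribute in the XML declaration, and encodes the result as UTF-8.

```python
import re

def transcode_to_utf8(document: bytes, source_encoding: str) -> bytes:
    """Decode strictly from the labelled encoding, re-label, encode as UTF-8."""
    text = document.decode(source_encoding)  # strict: raises on bad bytes
    # Rewrite the encoding label in the XML declaration, if there is one.
    text = re.sub(
        r"""^(<\?xml[^>]*?encoding=)(["'])[^"']*\2""",
        r"\g<1>\g<2>UTF-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")

source = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode(
    "iso-8859-1"
)
result = transcode_to_utf8(source, "iso-8859-1")
print(result.decode("utf-8"))
```

No heuristics are involved: the label says which decoder to use, and a wrong label fails loudly at the decode step instead of producing damaged text.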

P: n/a
In article <cu**********@news-sop.inria.fr>,
Philippe Poulard <Ph****************@SPAMsophia.inria.fr> wrote:
this is theory

is there anybody who knows a parser that doesn't handle iso-8859-1
correctly?


I don't. I do, however, know a parser that does not support (by default
without extra work) ISO-8859-15, Windows-1252 or Shift-JIS: expat.

AFAIK, in *practice* the set of safe encodings is US-ASCII, ISO-8859-1
and UTF-8. In theory, it is UTF-8 and UTF-16. The intersection of
reality and theory is UTF-8.

--
Henri Sivonen
Jul 20 '05 #21

P: n/a
In article <hs****************************@news.dnainternet.n et>,
Henri Sivonen <hs******@iki.fi> wrote:
That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe.
This is an exaggeration. You might as well say: XML processors are
not required to support any particular URI scheme, so referring to
a DTD at an HTTP URI is unsafe.
If communication fails, because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.


So phone them up and ask them to change it. Not every XML document has
to be instantly useful to everyone.

-- Richard
Jul 20 '05 #22

P: n/a
In article <cu***********@pc-news.cogsci.ed.ac.uk>,
ri*****@cogsci.ed.ac.uk (Richard Tobin) wrote:
You might as well say: XML processors are
not required to support any particular URI scheme, so referring to
a DTD at an HTTP URI is unsafe.


I consider external subsets on the Web harmful. Not because of HTTP URIs
but because non-validating processors are not required to process the
DTD and the usefulness of DTDs relative to their usual size is low.
Mozilla, for one, never dereferences an HTTP URI to retrieve an external
entity.

--
Henri Sivonen
Jul 20 '05 #23

P: n/a
In article <Pi******************************@ppepc56.ph.gla.a c.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 4 Feb 2005, Henri Sivonen wrote:

XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature.


But that's OK, since any plausible encoding produced by the editor can
be transformed by rote into utf-8 prior to subsequent XML processing
(that's the XML relevance).


Such conversion leads to bugs like this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=174351

--
Henri Sivonen
Jul 20 '05 #24

P: n/a
On Fri, 4 Feb 2005, Henri Sivonen wrote:
Newsgroups: comp.infosystems.www.authoring.html,comp.text.xml
You should set a F'up-To. I've done this and remark only
what's relevant to c.i.w.a.html.
AFAIK, in *practice* the set of safe encodings is US-ASCII, ISO-8859-1
and UTF-8. In theory, it is UTF-8 and UTF-16. The intersection of
reality and theory is UTF-8.


Google still doesn't support UTF-16 as can be seen from
http://www.google.com/search?q=U.T.F.1-6
Hence the recommendation in
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
to use only UTF-8 as Unicode encoding on the WWW.

Jul 20 '05 #25

P: n/a
On Fri, 4 Feb 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
But that's OK, since any plausible encoding produced by the editor can
be transformed by rote into utf-8 prior to subsequent XML processing
(that's the XML relevance).


Such conversion leads to bugs like this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=174351


Does it? I'll have to ask you to explain that in more detail, please.
As far as I can see, the bug relates to a byte stream which is not
valid utf-8 - which by definition is therefore not utf-8 at all.

What I'm talking about is taking a properly-labelled and
properly-formed character stream in some known encoding, and
transcoding that into properly-formed utf-8 (with appropriate
re-labelling, of course).
Jul 20 '05 #26

P: n/a
In article <Pi******************************@ppepc56.ph.gla.a c.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 4 Feb 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
But that's OK, since any plausible encoding produced by the editor can
be transformed by rote into utf-8 prior to subsequent XML processing
(that's the XML relevance).


Such conversion leads to bugs like this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=174351


Does it? I'll have to ask you to explain that in more detail, please.
As far as I can see, the bug relates to a byte stream which is not
valid utf-8 - which by definition is therefore not utf-8 at all.

What I'm talking about is taking a properly-labelled and
properly-formed character stream in some known encoding, and
transcoding that into properly-formed utf-8 (with appropriate
re-labelling, of course).


The problem is that the XML spec is not only concerned with proper UTF-8
streams but also says what to do in improper cases. If the character
encoding conversion is decoupled from the XML processor, but this is
viewed as an implementation detail so that the combination of the
converter and actual XML processor is subjected to the conformance
requirements placed on XML processors, non-conformance ensues if the
converter is lenient, which they usually are.

--
Henri Sivonen
Jul 20 '05 #27
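The concern in reply #27 about lenient converters can be made concrete with Python's codec error handlers. This is an illustration only, using an invented byte string; it is not a model of any particular browser's or converter's behaviour. A lenient decode hides the malformed byte from whatever runs next, whereas a strict decode lets the error surface.

```python
broken = b"caf\xe9"  # ISO 8859-1 bytes that have been mislabelled as UTF-8

# Lenient: the bad byte is silently replaced with U+FFFD, so a parser
# downstream receives well-formed text and never sees the error.
lenient = broken.decode("utf-8", errors="replace")
print(lenient == "caf\ufffd")  # True

# Strict: the malformed input is rejected instead of being repaired.
try:
    broken.decode("utf-8", errors="strict")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```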

P: n/a
On Sat, 5 Feb 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

> But that's OK, since any plausible encoding produced by the
> editor can be transformed by rote into utf-8 prior to
> subsequent XML processing (that's the XML relevance).

[...]
The problem is that the XML spec is not only concerned with proper
UTF-8 streams but also says what to do in improper cases. If the
character encoding conversion is decoupled from the XML processor,
but this is viewed as an implementation detail so that the
combination of the converter and actual XML processor is subjected
to the conformance requirements placed on XML processors,
non-conformance ensues if the converter is lenient, which they
usually are.


Thanks. I understand your point now.

I have this feeling that there's a lot of scope for practical utility
without running the risk of falling foul of this particular problem;
but I won't drag the argument out.

all the best
Jul 20 '05 #28
