browsers don't follow the charset specification

I added the following line to the header of my html file

<meta http-equiv="content-type" content="text/html; charset=utf-8">

hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.

What did I do wrong?

Thanks,

Xiaotian
Jul 23 '05 #1
On Thu, 10 Mar 2005, Xiaotian Sun wrote:
I added the following line to the header of my html file
<meta http-equiv="content-type" content="text/html; charset=utf-8">
hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.


You need to set up your *web server* correctly to send the right
charset parameter:
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html
Please refer to a comp.infosystems.www.servers... group for details.
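
For illustration only: a correctly configured server would send a
response header along these lines (assuming the document really is
UTF-8):

  Content-Type: text/html; charset=utf-8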

The <meta charset> thingy is superfluous; get rid of it.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #2


Xiaotian Sun wrote:
I added the following line to the header of my html file

<meta http-equiv="content-type" content="text/html; charset=utf-8">

hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.


Post a URL we can check; perhaps the server sends an HTTP response header
that overrides the meta.

--

Martin Honnen
http://JavaScript.FAQTs.com/
Jul 23 '05 #3
Xiaotian Sun <su**@berkeley.edu> wrote:
I added the following line to the header of my html file

<meta http-equiv="content-type" content="text/html; charset=utf-8">

hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.
But if your server is sending out a real content-type header the meta
tag will be ignored.
What did I do wrong?


You tried to fake something that should have been done properly.

Steve

--
"My theories appal you, my heresies outrage you,
I never answer letters and you don't like my tie." - The Doctor

Steve Pugh <st***@pugh.net> <http://steve.pugh.net/>
Jul 23 '05 #4
Andreas Prilop wrote:
On Thu, 10 Mar 2005, Xiaotian Sun wrote:

I added the following line to the header of my html file
<meta http-equiv="content-type" content="text/html; charset=utf-8">
hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.

You need to set up your *web server* correctly to send the right
charset parameter:
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html
Please refer to a comp.infosystems.www.servers... group for details.

The <meta charset> thingy is superfluous; get rid of it.


I'm not an expert on HTML or web servers, but that was actually what I
suspected. The thing is that none of the W3C pages talking about this
<meta> thing mention the web server.

Anyway, the problem is that I don't have control over the web server.
It's our department server. Maybe I should talk to our IT staff.
Jul 23 '05 #5
On Thu, 10 Mar 2005, Xiaotian Sun wrote:
Anyway, the problem is that I don't have control over the web server.
It's our department server. Maybe I should talk to our IT staff.


You do not need to control the whole web server but only your own
directory. For example with Apache, you can define the encoding
via the .htaccess file. Please post your question to
<news:comp.infosystems.www.servers.unix>
or whatever type your server is.
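
As a rough sketch (assuming Apache and that .htaccess overrides are
permitted for your directory), a .htaccess file could contain e.g.

  AddDefaultCharset UTF-8

or, restricted to particular file extensions,

  AddCharset UTF-8 .html .htm

Your server admins can tell you which variant applies in your setup.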

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 23 '05 #6
Andreas Prilop wrote:
On Thu, 10 Mar 2005, Xiaotian Sun wrote:

Anyway, the problem is that I don't have control over the web server.
It's our department server. Maybe I should talk to our IT staff.

You do not need to control the whole web server but only your own
directory. For example with Apache, you can define the encoding
via the .htaccess file. Please post your question to
<news:comp.infosystems.www.servers.unix>
or whatever type your server is.


Thanks. Problem solved.
Jul 23 '05 #7
On Thu, 10 Mar 2005 18:24:06 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 10 Mar 2005, Xiaotian Sun wrote:
I added the following line to the header of my html file
<meta http-equiv="content-type" content="text/html; charset=utf-8">
hoping browsers will use UTF-8 encoding. But all browsers I tried still
use ISO-8859-1.


You need to set up your *web server* correctly to send the right
charset parameter:
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html
Please refer to a comp.infosystems.www.servers... group for details.

The <meta charset> thingy is superfluous; get rid of it.


Why do you regard it as superfluous?

BB
--
www.kruse.co.uk/ SE*@kruse.demon.co.uk
Affordable SEO!
--
Jul 23 '05 #8
Xiaotian Sun wrote:
I'm not an expert on HTML or web servers, but that was actually what I
suspected. The thing is that none of the W3C pages talking about this
<meta> thing mention the web server.


You must have missed it.
Have a look here:
http://www.w3.org/TR/html401/charset.html#h-5.2.2
"To sum up, conforming user agents must observe the following priorities
when determining a document's character encoding (from highest priority
to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a
value set for "charset".
3. The charset attribute set on an element that designates an
external resource."

And you can also go to the excellent "international" section of the W3C
website:
http://www.w3.org/International/reso...x.html#charset
Jul 23 '05 #9
Pierre Goiffon <pg******@invalid.fr> wrote:
Have a look here :
http://www.w3.org/TR/html401/charset.html#h-5.2.2
"To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from
highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type"
and a
value set for "charset".
3. The charset attribute set on an element that designates an
external resource."


It's worth remembering that clause 3 there is pure theory: no
browser has been observed to pay attention to such information. I would
be happy to be corrected here.

Actually there's a fourth way too, in pure theory: the type attribute,
which takes an Internet (MIME) media type as its value, and the
media type designation can have a charset parameter, so that
<a href="..." type="text/html;charset=utf-8">
would specify the encoding if no other information about encoding is
available. But browsers ignore the type attribute.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #10
Big Bill wrote:
On Thu, 10 Mar 2005 18:24:06 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
The <meta charset> thingy is superfluous; get rid of it.


Why do you regard it as superfluous?


[All IIRC, so I may have messed something up.]

The specs require that a browser use the information from a real
'content-type' header in preference to that from a <META
HTTP-EQUIV="content-type" ...> element.

The specs require that every server always send a real 'content-type'
header with a 'charset' parameter.

Therefore, in the context of the WWW, the info from the <META
HTTP-EQUIV="content-type" ...> element should never be used.

Dave

Jul 23 '05 #11
On Fri, 11 Mar 2005, Dave Anderson wrote:
Big Bill wrote:
On Thu, 10 Mar 2005 18:24:06 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
The <meta charset> thingy is superfluous; get rid of it.
Why do you regard it as superfluous?


It has to be said that opinions differ on this topic. Ideally,
Andreas would be right: specifying the character coding should be a
matter for the web server, on the HTTP header, external to the
document itself. There are a number of disadvantages in stashing it
inside the document.

On the other hand, these disadvantages only become apparent nowadays
in rather unusual situations (transcoding proxies etc.) and there are
some practical advantages (e.g. offline copies) in maintaining
character coding information with the document itself. So it's a
close call IMHO.

This might be helpful in reviewing some of the issues
http://www.w3.org/International/tuto...enc/#declaring
, even if some readers will reach a different conclusion after having
followed the arguments.

CERT CA-2000-02 also has a word or two to say on the topic: the issues
are complex, but its conclusion on the HTTP charset issue is clear.
[All IIRC, so I may have messed something up.]
There's no extra charge for reviewing the authoritative
specifications...
The specs require that a browser use the information from a real
'content-type' header in preference to that from a <META
HTTP-EQUIV="content-type" ...> element.
Correct, and current browser versions do follow that specification, at
least when initially browsing a page (some of them permit a subsequent
user override, to compensate for mistakes).
The specs require that every server always send a real
'content-type' header with a 'charset' parameter.
There are recommendations, but nothing makes it a mandatory
requirement of the protocol.
Therefore, in the context of the WWW, the info from the <META
HTTP-EQUIV="content-type" ...> element should never be used.


Even if the HTTP charset is present, there *can* be practical benefits
in storing page copies locally with their charset indication. How
that is actually achieved, would be a topic for discussion. But,
given that file systems rarely keep this kind of metadata, there's an
argument for using meta http-equiv for it. Then it's a question of
whether that's supposed to be coming from the server, or whether the
HTTP charset should be transferred into a meta http-equiv at the point
where the document is saved by an HTTP client agent. It's a complex
topic. And XHTML adds further complications.

What is absolutely clear is that an HTTP header carrying a charset
which is incompatible with the charset specified in meta http-equiv is
utterly useless. One of the benefits of not having one in meta
http-equiv is that, by definition, it cannot be wrong! Then the only
problem that the /author/ has to solve is how to get the correct /HTTP
charset/ sent.
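
(Quite possibly that mismatch is exactly what the original poster ran
into: the server announcing something like

  Content-Type: text/html; charset=ISO-8859-1

while the document carried a meta element claiming utf-8 - and the
browsers, quite correctly, believed the HTTP header.)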

But there's still the issue of what a recipient should do when they
save the document locally.

On the whole, I concur with Andreas's conclusion; but I'm not sure
that the issues are quite as definite as he tends to imply.
Jul 23 '05 #12
Alan J. Flavell wrote:
On Fri, 11 Mar 2005, Dave Anderson wrote:
[All IIRC, so I may have messed something up.]


There's no extra charge for reviewing the authoritative
specifications...


...except the time needed to find the appropriate section of the
appropriate spec and verify that it says what one thought it did (and
that no other spec says something contradictory), and to verify the
extent to which actual browsers (and IE) conform to the spec...

That checking is essential before claiming to provide an authoritative
response, but (for me) would take a significant amount of effort -- I
only maintain a few simple pages, and don't have this all memorized /
organized such that I can instantly find what I need. So, since no-one
more qualified had responded to the OP, I opted for a disclaimer and a
reasonably-accurate quick answer rather than doing nothing.

I think that my response was close enough to correct that this was a
good decision on my part, but I'm glad to see your (much) higher-quality
response superseding mine.

Dave

Jul 23 '05 #13
In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On the other hand, these disadvantages only become apparent nowadays
in rather unusual situations (transcoding proxies etc.)


How real is the threat of transcoding proxies? I occasionally see
experts based in the traditional ISO-8859-1 region invoke the
transcoding proxy bogey. However, I don't see Russians themselves
mentioning that their HTTP transfers were subject to man-in-the-middle
tampering. Is there really a substantial installed base of transcoding
proxies that would interfere with end-to-end UTF-8 transfers in Russia
or Japan? When both the server and the user agents support UTF-8, it
would be evil and wrong for a proxy to transcode.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #14
Tim
On Tue, 15 Mar 2005 14:00:46 +0200,
Henri Sivonen <hs******@iki.fi> posted:
When both the server and the user agents support UTF-8, it
would be evil and wrong for a proxy to transcode.


I wouldn't be that surprised at it happening. Various mail systems will
automatically transcode incoming data simply presuming that it's in the
best interest of the recipient. I wouldn't put it past web proxy coders to
make the same presumption (it's simpler to transcode everything going
through, rather than check if it's necessary, and assume that it'll cause
no problems for anybody).

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.
Jul 23 '05 #15
Henri Sivonen wrote:
In article <Pi******************************@ppepc56.ph.gla.a c.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

On the other hand, these disadvantages only become apparent nowadays
in rather unusual situations (transcoding proxies etc.)

How real is the threat of transcoding proxies?


Well, FWIW my most popular piece of software is mod_proxy_html, which is
an essential component of a nontrivial reverse proxy in Apache. It's
one of several markup-parsing modules I've written that can be deployed
in a proxy.

mod_proxy_html parses documents using libxml2. libxml2 uses utf-8
internally, and generates all output as utf-8 by default. Since
doing otherwise would be an overhead, mod_proxy_html also generates
utf-8 output.

Now, mod_proxy_html is very careful about transcoding. It detects the
charset on input using HTTP headers if available, followed by XML or
HTML rules (BOM, xmldecl, <meta ...>). It is also careful about
output, setting the charset in the Content-Type header, and removing
or fixing any conflicting xmldecl or <meta ...> in the input.
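
In outline, that input-side detection order amounts to something like
the following - a rough illustrative sketch in Python, not the actual
C/libxml2 code, and with the usual caveats about edge cases:

  import codecs, re

  def detect_charset(content_type, body):
      # 1. charset parameter in the HTTP Content-Type header, if any
      m = re.search(r'charset=([-\w]+)', content_type or '', re.I)
      if m:
          return m.group(1)
      # 2. byte-order mark
      if body.startswith(codecs.BOM_UTF8):
          return 'utf-8'
      if body.startswith(codecs.BOM_UTF16_LE) or body.startswith(codecs.BOM_UTF16_BE):
          return 'utf-16'
      # 3. XML declaration, e.g. <?xml version="1.0" encoding="koi8-r"?>
      m = re.match(rb'<\?xml[^>]*encoding=["\']([-\w]+)["\']', body)
      if m:
          return m.group(1).decode('ascii')
      # 4. HTML <meta http-equiv="Content-Type" ...> near the top
      m = re.search(rb'charset=([-\w]+)', body[:1024], re.I)
      if m:
          return m.group(1).decode('ascii')
      return None  # fall back to the proxy's configured default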

The experience of mod_proxy_html (and some of my other work) shows
there's certainly a demand for transcoding proxies. I don't know
how many products there are out there, but it's entirely possible
there are some that take less care over it.
Is there really a substantial installed base of transcoding
proxies that would interfere with end-to-end UTF-8 transfers in Russia
or Japan? When both the server and the user agents support UTF-8, it
would be evil and wrong for a proxy to transcode.


That seems much less likely. It's non-utf8 that's more likely to be
changed. Unless perhaps somewhere in 16-bit-land.

--
Nick Kew
Jul 23 '05 #16
On Wed, 16 Mar 2005, Nick Kew wrote:
The experience of mod_proxy_html (and some of my other work) shows
there's certainly a demand for transcoding proxies. I don't know
how many products there are out there, but it's entirely possible
there are some that take less care over it.
The transcoding function of Russian Apache was an active topic of
discussion for quite a number of years, because (I'm speaking just as
a bystander here - it's not my field!) Russian traditionally used a
bunch of incompatible 8-bit codings, and a lot of browsers supported
only a subset of them. See http://apache.lexa.ru/english/

Early versions of that transcoder worked only at the character stream
level: they made -no- changes to the document content. Basically if
and when they spotted a meta http-equiv which fixed the document's
encoding, they simply did not transcode it.

Later versions parsed the source (at least to some extent - again I
don't know the internals) and could transcode anything, and they
stripped out (or maybe modified, I'm not certain) any meta http-equiv
which they found.

Now, looking at the documentation for that nowadays, it gives the
impression that nothing much changed since around 2001-2002. We need
someone who reads Russian to tell us if there's any explanation for
that, since the English doesn't seem to clarify the question. Maybe
it's now working so well that no more updates are needed; maybe in the
meantime cp-866 has died out, and browser support for all of koi8-r,
Windows-125x, iso-8859-5 and (of course) utf-8 are now so universal
that there's no further need to transcode. I simply don't know.
Is there really a substantial installed base of transcoding
proxies that would interfere with end-to-end UTF-8 transfers in Russia
or Japan? When both the server and the user agents support UTF-8, it
would be evil and wrong for a proxy to transcode.


Oh, if the source was utf-8 and the client offered utf-8 capability
then there would be no need to transcode.
That seems much less likely. It's non-utf8 that's more likely to be
changed.


Agreed.

But what do we conclude from this?

In an ideal world, the knowledge of a document's character encoding
would be kept with it (in some sense) and adjusted whenever the
document itself was transcoded. Any kind of text/* document (and some
subset of application/* documents) need to have their character
encoding specified - just as much as does text/html itself. Once
you've worked out how to handle that for text/plain, then you need
nothing more for text/html.

MIME (to take an example) defines a way of handling this: you package
the *unmodified* data and you specify its character encoding in the
MIME wrapper. There's no call to go interfering with the content of
the document itself to try to stash its own encoding inside it.

But what happened? Someone short-sightedly devised a method (meta
http-equiv) that works *only* for text/html. Then they devised an
incompatible one that works only for text/css. Then they devised yet
another one for XML. Then came Appendix C, and here we are...
confused...
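
(For the record, the three in-band mechanisms being contrasted look
roughly like this:

  HTML:  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  CSS:   @charset "utf-8";
  XML:   <?xml version="1.0" encoding="utf-8"?>

each valid only inside its own document type.)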
Jul 23 '05 #17
Alan J. Flavell wrote:
The transcoding function of Russian Apache was an active topic of
discussion for quite a number of years,
The Russian Apache project is of course where the full mod_charset
comes from, FWIW.
Early versions of that transcoder worked only at the character stream
level: they made -no- changes to the document content. Basically if
and when they spotted a meta http-equiv which fixed the document's
encoding, they simply did not transcode it.

Later versions parsed the source (at least to some extent - again I
don't know the internals) and could transcode anything, and they
stripped out (or maybe modified, I'm not certain) any meta http-equiv
which they found.
That sounds familiar. Early versions of my modules were more
simpleminded and prone to failure on unfamiliar inputs. It's only
really since last June that the full regime I described has been
the norm for those of my modules where it's relevant.
But what happened? Someone short-sightedly devised a method (meta
http-equiv) that works *only* for text/html. Then they devised an
incompatible one that works only for text/css. Then they devised yet
another one for XML. Then came Appendix C, and here we are...
confused...


Heh, heh.

And then the browser vendors don't bother to implement any of them.
See for example the plethora of "it doesn't work" reports on the
W3 validator mailinglist since WinXP-SP2 brought new brokenness
to MSIE.

--
Nick Kew
Jul 23 '05 #18
In article <ms************@asgard.webthing.com>,
Nick Kew <ni**@asgard.webthing.com> wrote:
Now, mod_proxy_html is very careful about transcoding. It detects the
charset on input using HTTP headers if available, followed by XML or
HTML rules (BOM, xmldecl, <meta ...>). It is also careful about
output, setting the charset in the Content-Type header, and removing
or fixing any conflicting xmldecl or <meta ...> in the input.


What about form submissions?

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #19
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Wed, 16 Mar 2005, Nick Kew wrote:
The experience of mod_proxy_html (and some of my other work) shows
there's certainly a demand for transcoding proxies. I don't know
how many products there are out there, but it's entirely possible
there are some that take less care over it.
The transcoding function of Russian Apache


Conneg based on Accept-Charset on the origin server is substantially
different from man-in-the-middle transcoding.
Oh, if the source was utf-8 and the client offered utf-8 capability
then there would be no need to transcode.
Right. I'm interested in the possible installed base of software that
does unnecessary things. That is: Is it accurate to state that there is
no substantial installed base of proxies that transcode when the origin
server sends UTF-8?
In an ideal world, the knowledge of a document's character encoding
would be kept with it (in some sense) and adjusted whenever the
document itself was transcoded.
In the ideal world, there'd be no transcoding at all and all text on top
HTTP would be UTF-8 without a need to say so. :-)
But what happened? Someone short-sightedly devised a method (meta
http-equiv) that works *only* for text/html. Then they devised an
incompatible one that works only for text/css. Then they devised yet
another one for XML.


Ruby's postulate explains why.

http://intertwingly.net/slides/2005/etcon/74.html

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #20
On Thu, 17 Mar 2005, Henri Sivonen wrote:
The transcoding function of Russian Apache
Conneg based on Accept-Charset on the origin server is substantially
different than man-in-the-middle transcoding.


To some extent, but I'd say that fundamentally (if we assume that the
origin server itself does not control the documents which its authors
supply to it) the same principles apply: you get presented
with a document, with some information on its character encoding, and
the document may or may not contain some internal declaration of what
that character encoding is.
In the ideal world, there'd be no transcoding at all and all text on
top HTTP would be UTF-8 without a need to say so. :-)


You have to understand that my early upbringing was on computers which
dealt with a number of different external character encodings - some
of which, indeed, stored their data in encodings which were certain to
differ from any external one[1]. So I reckon that I have a
well-deveopled sense of this kind of architecture.

It's been my experience that whenever developers leave out one of the
transcoding layers in the interests of simplicity, things always go
wrong somewhere.

If you were to dismantle the relevant networking layer wherever it
occurs, sooner or later you're going to want something different than
utf-8 (whatever it may turn out to be), and then you'll find yourself
painted into a corner.

Whereas if you retain this layer, even though - for the time being -
it's an identity transformation, you'll be set. But you need to
retain the knowledge and understanding of why that layer is there,
despite it seeming (for the time being) to be redundant.
But what happened? Someone short-sightedly devised a method (meta
http-equiv) that works *only* for text/html. Then they devised an
incompatible one that works only for text/css. Then they devised
yet another one for XML.


Ruby's postulate explains why.


I have some sympathy with it. But this is the fundamentally wrong
implementation of "nearness". Suppose you found a historical artifact
which was neatly labelled "historical artifact". Would that guarantee
that it was a genuine historical artifact? No. To my mind it's
similar with internal claims of character encoding.

For theoretical consistency, the character encoding needs to be on a
wrapper, not inside the layer-cake. The MIME specifications
understood this already. HTTP tried to get away without that, and you
see what the consequences are.

all the best

[1] Titan internal code, for those who care. OK, some would say that
the same statement was true of EBCDIC ;-)
Jul 23 '05 #21
Henri Sivonen wrote:
What about form submissions?

Incoming POST data is not transcoded by any proxy software I'm
aware of.

The reply, if (X)HTML, is treated the same as any other (X)HTML.

--
Nick Kew
Jul 23 '05 #22
In article <d0************@asgard.webthing.com>,
Nick Kew <ni**@asgard.webthing.com> wrote:
Henri Sivonen wrote:
What about form submissions?

Incoming POST data is not transcribed by any proxy software I'm
aware of.


Since browsers implement a "same as incoming" policy for form submission
encoding, the origin server receives the submissions in an unexpected
encoding in that case if the page containing the form has been
transcoded.
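
(A partial defence sometimes suggested is to pin the submission
encoding in the form itself - the action URL below is just a made-up
example -

  <form action="/cgi-bin/search" method="post" accept-charset="utf-8">

though how faithfully browsers of that era honour accept-charset is
another question.)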

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #23
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
For theoretical consistency, the character encoding needs to be on a
wrapper, not inside the layer-cake.


In practice, however, sniffing the encoding from an XML document works
great and RFC 3023 is only trouble. The main problem is that file
systems do not work as the theory would require.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #24
On Sat, 19 Mar 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
For theoretical consistency, the character encoding needs to be on a
wrapper, not inside the layer-cake.
In practice, however, sniffing the encoding from an XML document works
great


You mean inside the <?xml...> thingy? Yes, for XML and only for
XML(-based) content (and even then, it upsets some HTML-slurping
application which many folks reckon to be a web browser).

Meantime we *still* need something that could work for text/plain, and
comma-separated-values, and text/css (OK, that one already has its
own solution, that /only/ works for CSS); and, and, and...

As for the procedure described in XHTML 1.0 Appendix C, it's
enough to make strong men wake in the night gibbering.
The main problem is that file systems do not work as the theory
would require.


I understand you to mean that there's no place in the filesystem
directory (or equivalent) where such metadata can be associated with
the file? That is indeed true on most OSes. But given an appropriate
determination to solve the problem, I'm sure a solution could be
agreed upon. (And I don't mean tossing out every document that isn't
encoded in utf-8, sorry).

all the best
Jul 23 '05 #25
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Sat, 19 Mar 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
For theoretical consistency, the character encoding needs to be on a
wrapper, not inside the layer-cake.


In practice, however, sniffing the encoding from an XML document works
great


You mean inside the <?xml...> thingy?


Yes.
The main problem is that file systems do not work as the theory
would require.


I understand you to mean that there's no place in the filesystem
directory (or equivalent) where such metadata can be associated with
the file? That is indeed true on most OSes. But given an appropriate
determination to solve the problem, I'm sure a solution could be
agreed upon. (And I don't mean tossing out every document that isn't
encoded in utf-8, sorry).


I don't believe that. With OS X, even Apple caved in to file name
extensions where file system metadata had been previously used. Sniffing
for the BOM or heuristically recognizing UTF-8 has a better chance of
working across systems and networks than file system metadata.
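
As a rough illustrative sketch (Python, purely for the idea):

  import codecs

  def sniff_utf8(data):
      # An explicit BOM settles it immediately.
      if data.startswith(codecs.BOM_UTF8):
          return True
      # Otherwise accept the stream if it decodes cleanly as UTF-8;
      # multi-byte sequences that happen to be valid UTF-8 are
      # vanishingly rare in the common 8-bit legacy encodings.
      try:
          data.decode('utf-8')
          return True
      except UnicodeDecodeError:
          return False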

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #26
On Mon, 21 Mar 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
I understand you to mean that there's no place in the filesystem
directory (or equivalent) where such metadata can be associated with
the file? That is indeed true on most OSes. But given an appropriate
determination to solve the problem, I'm sure a solution could be
agreed upon.
[...]
I don't believe that.
I don't believe that the "appropriate determination to solve the
problem" is given - so we reach the same conclusion, for different
reasons.
Sniffing for the BOM or heuristically recognizing UTF-8 has a better
chance of working across systems and networks than file system
metadata.


That's still not a solution for maintaining character encoding
information for the existing 8-bit encodings. It works in a few
special cases - and, yes, some of those special cases are now of
sufficient importance to be interesting - and that then means that
they distract from a need to solve the underlying problem, as your
argument is demonstrating. So I think it's even less likely in
practice that a theoretically satisfying solution will come into use.
(Meantime it's working well enough for mail, thanks to the MIME
standards.)

cheers
Jul 23 '05 #27
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Mon, 21 Mar 2005, Henri Sivonen wrote:

Sniffing for the BOM or heuristically recognizing UTF-8 has a better
chance of working across systems and networks than file system
metadata.


That's still not a solution for maintaining character encoding
information for the existing 8-bit encodings.


The legacy data can be converted to UTF-8.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #28
On Mon, 21 Mar 2005, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
That's still not a solution for maintaining character encoding
information for the existing 8-bit encodings.


The legacy data can be converted to UTF-8.


I had an ominous feeling that was going to be your answer.

:-}
Jul 23 '05 #29
