#1
July 20th, 2005, 05:25 PM
Nicolai Pedersen
Validation of XHTML with Danish characters

I have a problem validating a simple piece of XHTML containing Danish
characters. Trying to validate the following piece of XHTML gives the error
mentioned beneath. If I remove the first line (the XML part) the document
validates fine. Does anyone have an idea how to solve this problem without
changing the characters to &#xxx; or &oslash;? I've tried with UTF-8.
**************************************************
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="da" lang="da">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Document title</title>
</head>

<body>
<p>This is a danish document with the Danish letters æ ø and å</p>
</body>

</html>

The error:
"Sorry, I am unable to validate this document because on line 11 it
contained one or more bytes that I cannot interpret as us-ascii (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication."


#2
July 20th, 2005, 05:25 PM
Jukka K. Korpela
Re: Validation of XHTML with Danish characters

"Nicolai Pedersen" <np@dynamicsystems.dk> wrote:
> I have a problem validating a simple piece of XHTML containing Danish
> characters.

This is a long and sad story, and you would be confused after the
explanation. The short advice is simple: stop playing with XHTML;
upgrade to HTML 4.01. After all, it's just a matter of syntactic
trivialities, but playing by XHTML rules gives you a headache
if you don't know them well (and maybe even if you do).

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

#3
July 20th, 2005, 05:25 PM
Nicolai Pedersen
Re: Validation of XHTML with Danish characters

I'm currently upgrading from 4.01 to XHTML. Not for the fun of it - but
because I really need to do it.

"Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote in message
news:Xns9421846B0C562jkorpelacstutfi@193.229.0.31...
> "Nicolai Pedersen" <np@dynamicsystems.dk> wrote:
>
> > I have a problem validating a simple piece of XHTML containing Danish
> > characters.
>
> This is a long and sad story, and you would be confused after the
> explanation. The short advice is simple: stop playing with XHTML;
> upgrade to HTML 4.01. After all, it's just a matter of syntactic
> trivialities, but playing by XHTML rules gives you a headache
> if you don't know them well (and maybe even if you do).
>
> --
> Yucca, http://www.cs.tut.fi/~jkorpela/
> Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
>


#4
July 20th, 2005, 05:25 PM
Jim Ley
Re: Validation of XHTML with Danish characters

On Mon, 27 Oct 2003 12:13:33 +0100, "Nicolai Pedersen"
<np@dynamicsystems.dk> wrote:
>I'm currently upgrading from 4.01 to XHTML. Not for the fun of it - but
>because I really need to do it.

Why?

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

#5
July 20th, 2005, 05:25 PM
Alan J. Flavell
Re: Validation of XHTML with Danish characters

On Mon, 27 Oct 2003, Nicolai Pedersen wrote:

[in a usenet posting which appeared to be lacking its MIME header and
content-type, and therefore *ought* to have contained only us-ascii
characters...]

> I have a problem validating a simple piece of XHTML containing Danish
> characters. Trying to validate the following piece of XHTML gives the error
> mentioned beneath.

Then the important detail is likely to be something not included in
your report. Better quote a URL where we can investigate this for
ourselves.

> If I remove the first line (the XML part) the document
> validates fine.

That has me puzzled, but I'm confident that if you gave a URL then
you'd get a prompt explanation, if not from me then from one of the
other contributors.

> Does anyone have an idea how to solve this problem

We don't really know what the "problem" is yet - you've reported some
of the symptoms, but IMHO some important detail is missing.

> without changing the characters to &#xxx; or &oslash;

There should be no necessity for that, even if it does bring some
benefits in terms of document (mis)handling.

> I've tried with UTF-8.

You don't make iso-8859-1-encoded characters magically change to utf-8
merely by declaring them so. If they genuinely were utf-8-encoded,
then that would be different.

> The error:
> "Sorry, I am unable to validate this document because on line 11 it
> contained one or more bytes that I cannot interpret as us-ascii (in other
> words, the bytes found are not valid values in the specified Character
> Encoding).

Something seems to have convinced the processor that your document is
us-ascii-encoded? Maybe the web server?
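
The symptom can be reproduced outside the validator. What follows is a minimal sketch in Python, not anything the validator itself runs. It assumes the file was saved as iso-8859-1, and the helper name `try_decode` is made up for illustration. It checks whether the bytes of the body line from the sample document decode cleanly under each of the three encodings mentioned so far.

```python
# Body line from the sample document; the Danish letters (ae, o-slash, a-ring)
# are written as escapes so this source file itself stays pure ASCII.
line = "<p>This is a danish document with the Danish letters \u00e6 \u00f8 and \u00e5</p>"

# Assumption: the author's editor saved the file as iso-8859-1.
raw = line.encode("iso-8859-1")


def try_decode(data: bytes, encoding: str) -> str:
    """Return 'ok' if data decodes cleanly in the given encoding, else 'error'."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError:
        return "error"


results = {enc: try_decode(raw, enc) for enc in ("us-ascii", "utf-8", "iso-8859-1")}
print(results)
```

Only the iso-8859-1 interpretation succeeds. The three Danish letters are stored as single bytes above 0x7F, which us-ascii cannot represent and which do not form valid utf-8 sequences in this line, so merely relabelling the file does not help.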

good luck

#6
July 20th, 2005, 05:25 PM
Nicolai Pedersen
Re: Validation of XHTML with Danish characters

Thank you for your answer - it gave me the hint for the source of the error:

I was uploading the script as a file to the validator service - when
uploading it to my webserver and revalidating using the URL, everything
works fine.

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0310271059150.27527@ppepc56.ph.gla.ac.uk...
> On Mon, 27 Oct 2003, Nicolai Pedersen wrote:
>
> Something seems to have convinced the processor that your document is
> us-ascii-encoded? Maybe the web server?
>
> good luck


#7
July 20th, 2005, 05:25 PM
Alan J. Flavell
Re: Validation of XHTML with Danish characters

On Mon, 27 Oct 2003, Jukka K. Korpela wrote:
> This is a long and sad story, and you would be confused after the
> explanation.

If you've understood the problem based on what the hon Usenaut posted,
then I'm interested to know what it is. Maybe it's already been
posted or FAQed in some form, but if it was, I confess to not being
aware of it.

#8
July 20th, 2005, 05:25 PM
Andreas Prilop
Re: Validation of XHTML with Danish characters

On Mon, 27 Oct 2003, Nicolai Pedersen wrote:
> X-Newsreader: Microsoft Outlook Express 6.00.2800.1158
>
> I have a problem validating a simple piece of XHTML containing Danish
> characters.
>
> <p>This is a danish document with the Danish letters ? ? and ?</p>

Start with repairing your simulation of a newsreader. Here you go:

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

#9
July 20th, 2005, 05:25 PM
Jukka K. Korpela
Re: Validation of XHTML with Danish characters

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
> If you've understood the problem based on what the hon Usenaut
> posted, then I'm interested to know what it is.

I have tried to actively forget all the confusion that XHTML causes
since, as I wrote, the simple answer is to stay away from it. (Some
day, there might be some actual use for XHTML, but I hope that then the
worst oddities have been fixed.)

If you compose a document containing the OP's sample HTML, in
ISO-8859-1 encoding, and submit it to validation via the file upload
facility at http://validator.w3.org/ , then the problem reported
will appear. It's strange that the validator refuses to look at
the document content, which twice specifies ISO-8859-1, but what can we
do? Yes, we _could_ use the extended interface, which lets us specify
the encoding the third time, and then we get

Note: The HTTP Content-Type header sent by your web browser (unknown)
did not contain a "charset" parameter, but the Content-Type was one of
the XML text/* sub-types (text/xml). The relevant specification (RFC
3023) specifies a strong default of "us-ascii" for such documents so we
will use this value regardless of any encoding you may have indicated
elsewhere. If you would like to use a different encoding, you should
arrange to have your browser send this new encoding information.

which looks pretty strange after a _file upload_ submission.

But it's less strange than experiences that I have seen when
the "CSS Validator" has been used on an XHTML document containing 8-bit
characters in ISO-8859-1 encoding and the "CSS Validator" called
the W3C "markup validator", which choked on it. As I wrote, I'm
actively trying to forget the mess.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

#10
July 20th, 2005, 05:25 PM
Alan J. Flavell
Re: Validation of XHTML with Danish characters

On Mon, 27 Oct 2003, Jukka K. Korpela wrote:
> I have tried to actively forget all the confusion that XHTML causes
> since, as I wrote, the simple answer is to stay away from it.

;-}

But there's an interesting point here, nevertheless, which has nothing
directly to do with XML/XHTML and a lot to do with i18n form
submission, and, I now realise, relating tangentially to my form-i18n
page. Oh, and by chance Google just presented me with a discussion
thread which is also relevant to the underlying principle, in a way...
http://lists.w3.org/Archives/Public/...3Sep/0025.html etc.

> Yes, we _could_ use the extended interface, which lets us specify
> the encoding the third time, and then we get
>
> Note: The HTTP Content-Type header sent by your web browser (unknown)
> did not contain a "charset" parameter, but the Content-Type was one of
> the XML text/* sub-types (text/xml).

This assertion presumably relates to the file upload "control" of the
multipart/form-data submission, yes? I'm not in the least surprised
by the absence of a "charset" specification, but I'm puzzled by the
fact that it's saying it was content-type "text/xml". Would this have
been sent by your client agent, or are they spoofing it in order to
make their validator accept it?

[quotation continues...]
> The relevant specification (RFC
> 3023) specifies a strong default of "us-ascii" for such documents so we
> will use this value regardless of any encoding you may have indicated
> elsewhere. If you would like to use a different encoding, you should
> arrange to have your browser send this new encoding information.

Hmmm, yes, they have a point, despite its unfriendliness.

> which looks pretty strange after a _file upload_ submission.

Oh, I don't know: the client agent is in a far better position to know
what encoding to assign to this portion of the multipart/form-data
submission, than is any other participant in the proceedings.

What it basically means is: because implementers have been avoiding
implementing the necessary features of the i18n specifications (in
some cases alleging that they couldn't do it because it would upset
other incomplete implementations), this kind of file upload can't do
the job that is needed at this point.

If the validator folk were to start applying heuristics at this point
then they'd defeat their own purpose, presumably. It's a shame about
the users who are caught out by this, though.

As you may recall, my thesis has always been that no text file is
complete without external information about its character encoding,
and that it's an architectural error to smuggle that information into
content of the file itself. But I've long since lost that battle,
what with the http meta thingy, the <?xml...encoding thingy. I could
almost live with the BOM, but of course the BOM doesn't solve anything
for non-Unicode encodings.

And Mark C made dire threats about the dangers of going anywhere near
ISO-2022 (which I hadn't even mentioned!) when I got involved in a
discussion about character code support in PINE recently.

I think the bottom line here is that the file upload feature of the
validator is of very limited usefulness, given the shortcomings which
have been raised here, and needs Some Big Text to warn users of the
pitfalls, relative to putting the content onto a server and pointing
the validator/checker at its URL.

thanks for the explanation!

all the best

#11
July 20th, 2005, 05:26 PM
Jukka K. Korpela
Re: Validation of XHTML with Danish characters

"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
> This assertion presumably relates to the file upload "control" of
> the multipart/form-data submission, yes?

That would be the natural explanation. It could also relate to whatever
the validator was doing along its normal path in making guesses - maybe
there's not much special code for dealing with data that arrives via
file submission, apart from the necessary evil of recognizing it and
extracting the file.

> I'm not in the least
> surprised by the absence of a "charset" specification, but I'm
> puzzled by the fact that it's saying it was content-type
> "text/xml". Would this have been sent by your client agent, or are
> they spoofing it in order to make their validator accept it?

I was puzzled too. The client was a normal IE 6 on normal Windows 98
with rather normal settings, and the filename extension was .html, so I
would say that anything else than text/html would be unnatural.
I first put the blame on the validator, which apparently wants to see
the world as XML, but it seems that they are in an alliance with
Microsoft: provided my simple "echo back" script works OK, what IE 6 really
sends (in the form data set) looks like this:

Content-Disposition: form-data; name="..."; filename="..."
Content-Type: text/xml

And if I do the same with an HTML 4.01 document, in a file with
the .html extension, then IE 6 sends
Content-Type: text/html
so it apparently looks at the actual data before deciding on the
content type!

> [quotation continues...]
>> The relevant specification (RFC
>> 3023) specifies a strong default of "us-ascii" for such documents
>> so we will use this value regardless of any encoding you may have
>> indicated elsewhere. If you would like to use a different
>> encoding, you should arrange to have your browser send this new
>> encoding information.
>
> Hmmm, yes, they have a point, despite its unfriendliness.

Admitted. Actually RFC 3023 is stronger than what the above might
suggest:
if a text/xml entity is received with
the charset parameter omitted, MIME processors and XML processors
MUST use the default charset value of "us-ascii"[ASCII].

> this kind of file upload can't
> do the job that is needed at this point.

So it seems, if RFC 3023 is taken seriously (and they can hardly avoid
that if they play the XML game) and there's virtually nothing we can do
about browsers. Well, they have a menu for setting the encoding
manually, but isn't that a violation of RFC 3023 too? :-) After all, it
does not change the Content-Type header sent by the browser.

> If the validator folk were to start applying heuristics at this
> point then they'd defeat their own purpose, presumably.

So the user is forced (if he plays the XHTML game), when using file
submission to the validator, to select the encoding manually. Then we
just close our eyes and pretend that this was more OK.

> I think the bottom line here is that the file upload feature of the
> validator is of very limited usefulness,

So it seems. Except if we upgrade from XHTML to HTML 4.01, which _does_
allow (and prescribes) the recipient to deduce the encoding from
a meta tag if no charset parameter is present in actual headers.
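
The difference between the two rule sets can be sketched as follows. This is only an illustration in Python, not the validator's code. The names `effective_charset` and `META_CHARSET` are hypothetical, the meta scan is a crude regular expression rather than a real HTML parser, and media types other than text/xml and text/html are simply left undetermined.

```python
import re
from typing import Optional

# Crude pattern for <meta ... charset=...>; a real implementation would parse the HTML.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE)


def effective_charset(media_type: str, charset_param: Optional[str], body: bytes) -> Optional[str]:
    """Pick the charset a recipient would use, or None if these rules do not settle it."""
    # An explicit charset parameter in the Content-Type header always wins.
    if charset_param:
        return charset_param.lower()
    # RFC 3023: text/xml without a charset parameter MUST be treated as us-ascii,
    # whatever the XML declaration or a meta element inside the document says.
    if media_type == "text/xml":
        return "us-ascii"
    # HTML 4.01: with no charset parameter, fall back to a meta declaration.
    if media_type == "text/html":
        match = META_CHARSET.search(body)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


sample = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1" /></head></html>'
)
print(effective_charset("text/xml", None, sample))
print(effective_charset("text/html", None, sample))
```

With identical bytes and no charset parameter, the text/xml label yields us-ascii while the text/html label yields iso-8859-1, which matches the behaviour reported for the file upload.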

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

#12
July 20th, 2005, 05:26 PM
Nick Kew
Re: Validation of XHTML with Danish characters

In article <Pine.LNX.4.53.0310271909340.27527@ppepc56.ph.gla.ac.uk>, one of infinite monkeys
at the keyboard of "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
>> Note: The HTTP Content-Type header sent by your web browser (unknown)
>> did not contain a "charset" parameter, but the Content-Type was one of
>> the XML text/* sub-types (text/xml).

That'll be why someone noted that the <?xml ...?> line made a difference -
a browser was using that to set a content-type header.

> but I'm puzzled by the
> fact that it's saying it was content-type "text/xml". Would this have
> been sent by your client agent,

Yes.

> or are they spoofing it in order to
> make their validator accept it?

Are you thinking of the way validators used to "spoof" a DOCTYPE? There's
no analogous situation at all here.

> Hmmm, yes, they have a point, despite its unfriendliness.
>
>> which looks pretty strange after a _file upload_ submission.

That's a view I tend to favour. It's been discussed on #validator/etc.

> Oh, I don't know: the client agent is in a far better position to know
> what encoding to assign to this portion of the multipart/form-data
> submission, than is any other participant in the proceedings.

Well, sort-of. But if all the agent sees is data on a non-i18n-aware
filesystem, it's not well-placed either.

> If the validator folk were to start applying heuristics at this point
> then they'd defeat their own purpose, presumably. It's a shame about
> the users who are caught out by this, though.

For Page Valet I've taken a different approach which I think is more
helpful. It'll issue a stern warning, but then it will go ahead and
apply XML or HTML rules as appropriate to the content-type.

> As you may recall, my thesis has always been that no text file is
> complete without external information about its character encoding,
> and that it's an architectural error to smuggle that information into
> content of the file itself.

That's fair comment re file exchange, but a bit tough when it comes
to file *storage*. After all, we don't have to require every desktop
box to support i18n at all.

Corollary: HTTP file upload is basically a dumb storage operation.

> But I've long since lost that battle,
> what with the http meta thingy, the <?xml...encoding thingy. I could
> almost live with the BOM, but of course the BOM doesn't solve anything
> for non-Unicode encodings.

No BOM => <?xml encoding can be parsed as ASCII. I think that makes sense,
as it means the information is always preserved, even when an XML file
lives on some dumb storage medium.
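
A rough sketch of that idea in Python follows. It is an illustration only: `sniff_declared_encoding` and `XML_DECL_ENCODING` are made-up names, the pattern is far looser than the grammar in the XML specification, and it assumes an ASCII-compatible encoding with no BOM, so that the declaration itself can be matched byte for byte as ASCII.

```python
import re
from typing import Optional

# Loose pattern for the encoding pseudo-attribute of an XML declaration.
XML_DECL_ENCODING = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, or None if absent."""
    # Only the first bytes matter, and they are matched as plain ASCII.
    match = XML_DECL_ENCODING.match(data[:200])
    return match.group(1).decode("ascii") if match else None


sample = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<p>\xf8</p>'
declared = sniff_declared_encoding(sample)
print(declared)
if declared is not None:
    # Once the label is known, the rest of the document can be decoded with it.
    text = sample.decode(declared)
```

The label travels with the file, so it survives storage on a medium that keeps no encoding metadata. It is only trustworthy as long as nothing re-encodes the bytes without also rewriting the declaration.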

The villain of the piece seems as so often to be in the specs, with
the conflict between RFC3023 and the XML parsing rules.

--
Nick Kew

#13
July 20th, 2005, 05:26 PM
Alan J. Flavell
Re: Validation of XHTML with Danish characters

On Tue, 28 Oct 2003, Nick Kew wrote:
> >> Note: The HTTP Content-Type header sent by your web browser (unknown)
> >> did not contain a "charset" parameter, but the Content-Type was one of
> >> the XML text/* sub-types (text/xml).
>
> That'll be why someone noted that the <?xml ...?> line made a difference -
> a browser was using that to set a content-type header.

OK, I've grasped that point now - thanks.

If I get a calm moment, I might try to summarise the key points of
this in that form-i18n page of mine.

> > Oh, I don't know: the client agent is in a far better position to know
> > what encoding to assign to this portion of the multipart/form-data
> > submission, than is any other participant in the proceedings.
>
> Well, sort-of. But if all the agent sees is data on a non-i18n-aware
> filesystem, it's not well-placed either.

No disagreement, but I don't know any other participant in the
exchange who is better placed to determine this, than the client agent
with, possibly, a bit of co-operation with its user.

> > If the validator folk were to start applying heuristics at this point
> > then they'd defeat their own purpose, presumably. It's a shame about
> > the users who are caught out by this, though.
>
> For Page Valet I've taken a different approach which I think is more
> helpful. It'll issue a stern warning, but then it will go ahead and
> apply XML or HTML rules as appropriate to the content-type.

Sounds good to me.

> Corollary: HTTP file upload is basically a dumb storage operation.

In a practical sense that's probably fair comment - but the parts of
the multipart submission have provision for including such meta-data,
so it's a bit of a twilight zone.

> No BOM => <?xml encoding can be parsed as ASCII. I think that makes sense,
> as it means the information is always preserved, even when an XML file
> lives on some dumb storage medium.

Just don't try FTP-ing the file as text to, for instance, an
EBCDIC-based platform, or you'll have an EBCDIC-coded document whose
contents claim it to be in an ASCII-based coding.

> The villain of the piece seems as so often to be in the specs, with
> the conflict between RFC3023 and the XML parsing rules.

Well, IMHO the *true* villain of this particular piece is the idea
that in order to correctly handle a text file - even for a dumb
transfer of the file contents between platforms - it's necessary to
parse and modify its contents. That's architecturally unsound, since
it would mean that a transfer process (e.g. FTP) intended to transfer
content of type text/* would need to be able to parse and to correctly
modify the contents of every different kind of text/* content it was
trying to support.

If not that, then you need to be able to handle platform-foreign text
file formats in your applications. It's a mess.

#14
July 20th, 2005, 05:26 PM
Henri Sivonen
Re: Validation of XHTML with Danish characters

In article <Xns9422AED84822jkorpelacstutfi@193.229.0.31>,
"Jukka K. Korpela" <jkorpela@cs.tut.fi> wrote:
> So it seems, if RFC 3023 is taken seriously (and they can hardly avoid
> that if they play the XML game)

In my opinion, playing the XML game with text/* content types is a bad
idea, considering the US-ASCII default that overrides the XML
declaration. The sensible way to play the XML game is to use
application/* content types without the charset parameter, use UTF-8 or
UTF-16 and leave detecting UTF-8 vs. UTF-16 to the XML processor.
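
As a hedged sketch of what "leaving it to the XML processor" amounts to, the Python below separates UTF-8 from UTF-16 by looking for a byte order mark. The function name `detect_utf` is hypothetical. The sketch assumes the document really is one of those two encodings, ignores UTF-32 and any encoding declaration, and falls back to UTF-8 when no BOM is found.

```python
import codecs


def detect_utf(data: bytes) -> str:
    """Guess 'utf-8' or 'utf-16' from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Either byte order; Python's "utf-16" codec reads the BOM to pick the right one.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM: assume UTF-8 in this sketch.
    return "utf-8"


doc = '<?xml version="1.0"?><p>\u00f8</p>'
as_utf16 = doc.encode("utf-16")  # Python writes a BOM for this codec
as_utf8 = doc.encode("utf-8")    # no BOM
print(detect_utf(as_utf16), detect_utf(as_utf8))
```

No charset parameter is involved at any point, so the protocol-level label cannot contradict what the bytes themselves say.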

(BTW, why is the delivery protocol referred to as a "higher-level"
protocol in the XML spec? If TCP is higher up in the protocol stack than
IP and HTTP is higher up than TCP, shouldn't XML be higher up than HTTP?)

--
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

#15
July 20th, 2005, 05:26 PM
Alan J. Flavell
Re: Validation of XHTML with Danish characters

On Tue, 28 Oct 2003, Henri Sivonen wrote:
> (BTW, why is the delivery protocol referred to as a "higher-level"
> protocol in the XML spec?

HTTP is at a higher protocol level than the character encoding layer,
is how I interpret its meaning.

Recognising a BOM is at a lower protocol layer than HTTP protocol.
Picking out a few characters that happen to read "<?xml ...>" is also
at a rather low level (there is no requirement at that stage of the
processing - we're talking about the processing which determines the
operative character encoding, right? - to actually parse the XML).

#16
July 20th, 2005, 05:29 PM
Henri Sivonen
Re: Validation of XHTML with Danish characters

In article <Pine.LNX.4.53.0310281851130.28979@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
> On Tue, 28 Oct 2003, Henri Sivonen wrote:
>
> > (BTW, why is the delivery protocol referred to as a "higher-level"
> > protocol in the XML spec?
>
> HTTP is at a higher protocol level than the character encoding layer,
> is how I interpret its meaning.

But HTTP deals with bytes and XML deals with Unicode characters. Doesn't
that put the byte-stream-to-character mapping layer on top of HTTP?

XHTML
Namespaces in XML
XML 1.0
byte stream to Unicode character stream mapping
HTTP
TCP
IP
link layer

--
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
 
