Bytes IT Community

Adobe GoLive 6 - Nasty feature with UTF-8 encoding

Recently I was editing a document in GoLive 6. I like GoLive because it has some nice
features such as:
* rewrite source code
* check syntax
* global search & replace (through several files at once)
* regular expression search & replace.

Normally my documents are encoded with the ISO setting.

Recently I was writing an XHTML document. After changing the encoding to UTF-8 I used the
GoLive 'rewrite source code' feature. Big mistake. It changed all my funny characters to
non-SGML compliant characters (e.g. &eacute; was converted to é) and I didn't notice until
after I'd saved the document. Nasty. It doesn't do that with ISO encoded documents.
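What GoLive did can be reproduced outside GoLive (a Python sketch, not GoLive's actual code): the entity &eacute; names one character, UTF-8 stores that character as two octets, and any tool that then reads the file as ISO-8859-1 shows those two octets as two garbage characters.

```python
# Sketch of the round-trip that bites here (illustrative, not GoLive code).
import html

source = "caf&eacute;"
text = html.unescape(source)         # entity -> real character: 'café'
utf8_bytes = text.encode("utf-8")    # 'é' becomes the two octets 0xC3 0xA9

assert utf8_bytes == b"caf\xc3\xa9"
# Re-reading those octets as ISO-8859-1 produces the classic mojibake:
assert utf8_bytes.decode("iso-8859-1") == "cafÃ©"
```

The file itself is fine as UTF-8; the mess only appears when something later interprets it with a single-byte encoding.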

Jul 20 '05 #1
48 Replies


Zenobia <5.**********@spamgourmet.com> wrote:
Recently I was writing an XHTML document. After changing the encoding to UTF-8 I used the
GoLive 'rewrite source code' feature. Big mistake. It changed all my funny characters to
non-SGML compliant characters (e.g. &eacute; was converted to é)


There's nothing non-compliant about é. It exists in UTF-8 therefore if
UTF-8 is the declared encoding it is perfectly okay to use it. This
would also be the case for any encoding that contained é, e.g.
ISO-8859-1 etc. This is true whether your document is HTML or XHTML.

However, an XHTML document that is "SGML compliant"? Surely that's an
oxymoron. ;-)

Steve

--
"My theories appal you, my heresies outrage you,
I never answer letters and you don't like my tie." - The Doctor

Steve Pugh <st***@pugh.net> <http://steve.pugh.net/>
Jul 20 '05 #2

On Sat, 10 Apr 2004 12:11:54 +0100, Steve Pugh <st***@pugh.net> wrote:
Zenobia <5.**********@spamgourmet.com> wrote:
Recently I was writing an XHTML document. After changing the encoding to UTF-8 I used the
GoLive 'rewrite source code' feature. Big mistake. It changed all my funny characters to
non-SGML compliant characters (e.g. &eacute; was converted to )


There's nothing non-compliant about . It exists in UTF-8 therefore if
UTF-8 is the declared encoding it is perfectly okay to use it. This
would also be the case for any encoding that contained , e.g.
ISO-8859-1 etc. This is true whether your document is HTML or XHTML.


I entered these characters into an XHTML document:

&plusmn; &deg; &auml;

It validated OK with the W3C XHTML validator.

This is what GoLive 6 displays after 'rewrite source code'

± ° ä

These are the characters, as rendered, by IE6:

± ° ä

This is the error message I get when I validate the (modified) document with the W3c XHTML
validator:

Sorry, I am unable to validate this document because on line 11 it contained
one or more bytes that I cannot interpret as us-ascii (in other words, the bytes
found are not valid values in the specified Character Encoding). Please check
both the content of the file and the character encoding indication.

Well, the corresponding numeric codes are:

&#177; &#176; &#228;

(I suppose the W3C validator would also accept these).

How can these characters (± ° ä, or the 2-char versions) be valid UTF-8 characters if the
W3C validator doesn't accept them? Or does the W3C validator not work correctly?

I'm also lost as to why ± is shown in the GoLive editor as ±, etc. That character is
± - surely this is just a one byte character. GoLive should display it as ±.
How would you go about writing valid XHTML code using GoLive - (a) with the document set
to use UTF-8 encoding (but without the benefit of the 'rewrite source code' feature), or
(b) using ISO-8859-1, so that you are able to use the 'rewrite source code' feature?

Jul 20 '05 #4

Zenobia wrote:
On Sat, 10 Apr 2004 12:11:54 +0100, Steve Pugh <st***@pugh.net> wrote:
Zenobia <5.**********@spamgourmet.com> wrote:
Recently I was writing an XHTML document. After changing the encoding to
UTF-8 I used the GoLive 'rewrite source code' feature. Big mistake. It
changed all my funny characters to non-SGML compliant characters (e.g.
&eacute; was converted to é)
There's nothing non-compliant about é. It exists in UTF-8 therefore if
UTF-8 is the declared encoding it is perfectly okay to use it. This
would also be the case for any encoding that contained é, e.g.
ISO-8859-1 etc. This is true whether your document is HTML or XHTML.


I entered these character into an XHTML document:

&plusmn; &deg; &auml;

It validated OK by the W3c XHTML validator.


Okay, but there's nothing depending on UTF-8 there, those characters are all
present in US-ASCII (the characters '&', 'p', 'l', etc). Those are the
characters that are actually present in the file.

This is what GoLive 6 displays after 'rewrite source code'

± ° ä
What a (possibly flawed) program displays isn't very relevant when trying to
determine where a bug lies. Could you provide a URL to a representative
example or two?

These are the characters, as rendered, by IE6:

± ° ä
This is of little value, as Internet Explorer violates multiple
specifications to try and guess at the behaviour that the author intended.

This is the error message I get when I validate the (modified) document
with the W3c XHTML validator:

Sorry, I am unable to validate this document because on line 11 it
contained one or more bytes that I cannot interpret as us-ascii (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication.
From that error message, I would *guess* that there was an incorrect or
missing HTTP header and/or <meta> element in your document. The characters
you are talking about are not present in US-ASCII; if the validator thinks
the document is encoded in US-ASCII, it's probably because you have told it
so (which you shouldn't).

Well, the corresponding numeric codes are:

&#177; &#176; &#228;

(I suppose the w3c validator would also accept these).
Yes, on account of those actual characters being present in the US-ASCII
character encoding ('&', '#', '1', etc).

How can these characters (± ° ä, or the 2-char versions) be valid UTF-8
characters if the W3C validator doesn't accept them?
If, when a user-agent requests a document, you are telling it that it is
encoded in US-ASCII, most user-agents will believe you, including
validators. Try supplying an appropriate HTTP header:

Content-Type: text/html; charset=UTF-8

Or does the W3C validator not work correctly?
If I had to guess between Internet Explorer working correctly, and something
else working correctly, I'd put money on the something else.

I'm also lost as to why ± is shown in the GoLive editor as ±, etc. That
character is ± - surely this is just a one byte character.
No it isn't. How many bytes depends on the character encoding, and UTF-8
sometimes splits single characters up into multiple bytes. I'm pretty sure
that the byte sequence for ± in UTF-8 is the same as the byte sequence for
± in US-ASCII.
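The multi-byte point is easy to verify (an illustrative Python sketch, not from the thread): ± is one character but two octets in UTF-8, and an editor that assumes a single-byte encoding shows those two octets as two characters.

```python
# '±' is one character, but two octets once encoded as UTF-8.
plus_minus = "\u00b1"                 # ± (U+00B1)
encoded = plus_minus.encode("utf-8")

assert encoded == b"\xc2\xb1"         # two octets, not one
# A viewer that assumes ISO-8859-1 renders those octets as 'Â±':
assert encoded.decode("iso-8859-1") == "Â±"
```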

GoLive should display it as ±
Only if your document is correctly advertised as being UTF-8, which it
probably isn't.

How would you go about writing valid XHTML code using GoLive - (a) with
the document set to use UTF-8 encoding (but without the benefit of the
'rewrite source code' feature), or (b) using ISO-8859-1, so that you are
able to use the 'rewrite source code' feature?


Ensure the server is sending the correct HTTP headers, and place a matching
<meta> element in each document.
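A matching header/meta pair might look like the following (an illustrative fragment; the exact server directive depends on your setup, and the self-closing meta form shown is the XHTML variant):

```
# HTTP header the server should send with each page:
#   Content-Type: text/html; charset=UTF-8

<!-- Matching declaration inside each document's head: -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
```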
--
Jim Dabell

Jul 20 '05 #6

Zenobia <5.**********@spamgourmet.com> wrote in message news:<0f********************************@4ax.com>...
Sorry, I am unable to validate this document because on line 11 it contained
one or more bytes that I cannot interpret as us-ascii (in other words, the bytes
found are not valid values in the specified Character Encoding). Please check
both the content of the file and the character encoding indication.


Sounds like your server is sending the document with the character
encoding specified as us-ascii rather than utf-8, even though the
actual encoding is utf-8.

--
Dan
Jul 20 '05 #8

"Steve Pugh" <st***@pugh.net> wrote in
comp.infosystems.www.authoring.html:
There's nothing non-compliant about é. It exists in UTF-8 therefore if
UTF-8 is the declared encoding it is perfectly okay to use it.


But Alan Flavell advises against using any of 128-255 directly in
UTF-8, if I understand his "Checklist" page correctly.[1] Instead he
says characters above 127 should be expressed in &-notation.

There's another problem with UTF-8: when I "Save As" a UTF-8 page,
Mozilla 1.4 scrogs up the high-order characters so that the local
copy contains garbage sequences instead of e.g. –. I reported
this months ago; anybody know if it's been fixed in later versions?

[1] http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html
See scenario 6.

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #10

On Sat, 10 Apr 2004, Stan Brown wrote:
UTF-8 is the declared encoding it is perfectly okay to use it.

Right - as long as the operator in question knows how to handle it.
But Alan Flavell advises against using any of 128-255 directly in
UTF-8, if I understand his "Checklist" page correctly.[1]
That checklist presents a number of options, for different situations.
Scenario 7 is the use of utf-8 encoding -
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s7
- in which I comment:

this is an entirely valid form to send out documents, and is
acceptable to any RFC2070-conforming browser as well as to Netscape
4.* versions. Browser coverage for the two forms seems rather
similar. The expected difficulties are not in the browsers, but in
authors (mis)handling this unfamiliar data format.

which I think is fair comment. Of course, on the other hand, if a
document contains large numbers of non-ASCII characters, then the use
of &-notations, as described under scenario 6, will produce an
unnecessarily bulky source document. But - given a requirement for a
wide character repertoire, and an author whose competence in handling
coded characters is as yet unproven, I would tend to err on the
conservative side and recommend scenario 6, as you say, on purely
practical grounds. Technically, the use of utf-8 encoding is
impeccable, and will become increasingly the norm in future, I
presume.

Hope that clarifies the issue..
Instead he says characters above 127 should be expressed in
&-notation.


I recommend that the choice depends on circumstances ;-)
There's no single ideal representation which fits all requirements.

But don't confuse this with the issue of trying to represent the
"Windows characters" by &#number; values in the range 128-159 decimal:
those characters have their proper places in the Unicode character
set. As being discussed on a parallel thread: &#number; notations in
the range 128-159 decimal in HTML are technically "undefined", and in
XHTML illegal, and should not be used in either case (no matter that
they are emitted in droves by software from the Empire).
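The point about 128-159 can be made concrete (a Python sketch, not from the thread): the *byte* 0x93 in Windows-1252 is a curly quote, but the Unicode *code point* U+0093 is a C1 control character, so the reference &#147; is undefined/illegal while &#8220; is the correct one.

```python
# Windows-1252 byte 0x93 maps to the left double quotation mark,
# whose Unicode code point is U+201C (decimal 8220) - not 147.
curly = b"\x93".decode("windows-1252")

assert curly == "\u201c"      # the character 0x93 actually stands for
assert ord(curly) == 8220     # so the valid reference is &#8220;, not &#147;
```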

cheers
Jul 20 '05 #12

Preface: While I'm asking some detailed questions below, I realized
that I have a fundamental gap in my knowledge: What is a UTF-8
document? What does it look like physically? Is each character 16
bits, or is there some variable-length encoding? Or is UTF-8 not a
storage scheme at all but simply a scheme for transmission from
server to browser?

I don't expect anyone to write me a lengthy custom answer to that
question. A short summary would be greatly appreciated, but a
reference to something on line would be helpful too. I'm not sure
even how to frame a search query that's specific enough to be
useful.

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in
comp.infosystems.www.authoring.html:
On Sat, 10 Apr 2004, Stan Brown wrote:
But Alan Flavell advises against using any of 128-255 directly in
UTF-8, if I understand his "Checklist" page correctly.[1]
That checklist presents a number of options, for different situations.
Scenario 7 is the use of utf-8 encoding -
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s7
- in which I comment:
this is an entirely valid form to send out documents, and is
acceptable to any RFC2070-conforming browser as well as to Netscape
4.* versions. Browser coverage for the two forms seems rather
similar. The expected difficulties are not in the browsers, but in
authors (mis)handling this unfamiliar data format.

Technically, the use of utf-8 encoding is
impeccable, and will become increasingly the norm in future, I
presume.
As I understand Scenario 7, it would have me create documents in a
Unicode editor. Does that mean characters are no longer physically 8
bits of storage? Can I create a Unicode document on a Windows
98 US system, without installing and learning a whole lot of new
software?

Right now I edit source code in Vim (or GVim), then run GNU MAKE and
GNU AWK to do macro substitutions, add standard prologues and
epilogues, and the like. I don't suppose MAKE much cares, but the
AWK documentation doesn't mention Unicode. Am I missing something
here?
which I think is fair comment. Of course, on the other hand, if a
document contains large numbers of non-ASCII characters, then the use
of &-notations, as described under scenario 6, will produce an
unnecessarily bulky source document.


Tell me about it! But still the &-notations are a lot less bulky
than image tags for symbols like pi and a right-pointing arrow.

I reviewed your checklist, but if this question was answered I
missed the answer:

If I always use &#nnn; or &#nnnn; for characters 160 and up,(*) I
know that Netscape 4 is a bit more likely to render the characters
if the document advertises itself as UTF-8 than if it advertises
itself as ISO-8859-1. But (as I mentioned) Mozilla seems to do a bad
job with saving UTF-8 documents.

Is there any other basis for choosing between ISO-8859-1 and UTF-8,
when the document could legitimately advertise itself as either?
(*) You mentioned _not_ using &-notation for 128-159. I agree
wholeheartedly, as my contribution yesterday to the "long hyphen
151" thread attests.
--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #14

Stan Brown <th************@fastmail.fm> wrote:
What is a UTF-8 document?
A document that contains text in the UTF-8 encoding, which is a method of
representing Unicode characters as strings of octets (8-bit bytes).
What does it look like physically?
As such, it is a sequence of octets, and any program that presents it in
visible form may choose how to do that. But the natural method is to
display the characters using glyphs from some font(s), as lines (obeying
the various characters that indicate line breaks in Unicode).
Is each character 16
bits, or is there some variable-length encoding?
UTF-8 is a variable-length encoding. All Ascii characters are represented
"as such", one character as one octet, whereas for other characters, two
or more octets are used. A little more about this:
http://www.cs.tut.fi/~jkorpela/chars.html#utf
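The variable length is easy to see in practice (an illustrative Python sketch): ASCII stays at one octet, while other characters take two, three, or four.

```python
# UTF-8 octet counts per character: 1 for ASCII, up to 4 elsewhere.
samples = {
    "A": 1,             # ASCII
    "é": 2,             # Latin-1 range
    "€": 3,             # Basic Multilingual Plane
    "\U0001D11E": 4,    # beyond the BMP (musical G clef)
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```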
Or is UTF-8 not a
storage scheme at all but simply a scheme for transmission from
server to browser?
It was designed to be a transmission format (that's where the name comes
from, with U standing for Unicode), but naturally it can be used as a
storage format too. Normally if you wish to create an HTML document for
the WWW in UTF-8 encoding, you use a Unicode-capable text editor and
select UTF-8 as the format in "Save As" (could be the default of course),
then upload it (using binary mode in FTP!). And you make sure that the
server sends the header Content-Type: text/html; charset=utf-8
(if possible). Normally you don't need to know anything about the
technicalities of UTF-8 or see anything in any encoded format.
But naturally if you open a UTF-8 encoded document in a simple text
editor that works with 8-bit characters only (e.g., assuming the whole
world is ISO Latin 1 or Windows Latin 1), you may see some mess. The
Ascii characters are OK; all the rest is confusing.
As I understand Scenario 7, it would have me create documents in a
Unicode editor.
Basically yes. In principle, you could also use some other editor and a
separate converter program.
Does that mean characters are no longer physically 8
bits of storage?
What an editor uses internally depends on its implementation, but
certainly it cannot represent all characters as 8-bit quantities.
Using UTF-8 internally creates some problems, since there is no simple
way to e.g. pick up the 42nd character from a string (when the length of
a character in octets is varying). A Unicode editor could internally work
with a different encoding (maybe even an encoding where each character is
represented as a sequence of four octets - this is structurally simplest,
though surely wastes space) and convert between it and UTF-8 on output
(and on input), when desired.
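The indexing problem described above can be sketched directly (Python, illustrative): in a UTF-8 byte string the nth octet is not the nth character, so an editor must decode first or keep a fixed-width internal form.

```python
# Byte offsets and character offsets diverge as soon as a
# multi-octet sequence appears in the UTF-8 data.
text = "ab±cd"
raw = text.encode("utf-8")            # b'ab\xc2\xb1cd' - '±' takes two octets

assert text[2] == "±"                 # character index on the decoded string
assert raw[2:3] == b"\xc2"            # octet 2 is only *half* of '±'
assert raw.decode("utf-8")[2] == "±"  # decode first, then index by character
```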
Can I create a Unicode document on a Windows
98 US system, without installing and learning a whole lot of new
software?
I'm afraid not. But the whole lot of new software might be something
fairly simple, if you just need a text editor. If you need utilities like
macro expansion, file inclusion, etc., then it gets more complicated.
Right now I edit source code in Vim (or GVim), then run GNU MAKE and
GNU AWK to do macro substitutions, add standard prologues and
epilogues, and the like. I don't suppose MAKE much cares, but the
AWK documentation doesn't mention Unicode. Am I missing something
here?


I'm afraid you would need to check each program's documentation
separately. If it does not mention Unicode, the odds are that it does not
support Unicode. Some operations might still work, to the extent that
they operate on strings of octets irrespectively of their interpretation
as characters or "parts of character".

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #16

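[The octet counts are easy to check; a small Python sketch, with sample characters chosen arbitrarily:

```python
# Octets used by UTF-8 for a few sample characters: Ascii stays
# one octet, everything else takes two to four.
for ch in ("A", "é", "→", "𐍈"):
    octets = ch.encode("utf-8")
    print(ch, len(octets), octets.hex())
```
]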

P: n/a
Stan Brown wrote:
There's another problem with UTF-8: when I "Save As" a UTF-8 page,
Mozilla 1.4 scrogs up the high-order characters so that the local
copy contains garbage sequences instead of e.g. –. I reported
this months ago; anybody know if it's been fixed in later versions?


They're not "garbage sequences"; they're properly-coded UTF-8 byte
sequences representing the characters in question. You're just
apparently viewing the resulting file in a viewer or editor that expects
us-ascii, iso-8859-1, windows-1252, or some other 7- or 8-bit encoding,
so they look like garbage. When loaded into a UTF-8-based viewer,
they'd look fine.
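[The effect is easy to reproduce; a Python sketch, where windows-1252 is just a guess at the kind of 8-bit viewer involved:

```python
# The UTF-8 octets for an en dash, mis-read as windows-1252,
# turn into the familiar three-character "garbage".
octets = "–".encode("utf-8")            # b'\xe2\x80\x93'
misread = octets.decode("windows-1252")
print(octets.hex(), misread)
```
]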

--
== Dan ==
Dan's Mail Format Site: http://mailformat.dan.info/
Dan's Web Tips: http://webtips.dan.info/
Dan's Domain Site: http://domains.dan.info/
Jul 20 '05 #18


P: n/a
On Sun, 11 Apr 2004, Jukka K. Korpela wrote:
Or is UTF-8 not a
storage scheme at all but simply a scheme for transmission from
server to browser?
It was designed to be a transmission format (that's where the name
comes from, with U standing for Unicode),


....and the T standing for 'transformation' (not for 'transmission' as
you might seem to be implying here)...
As I understand Scenario 7, it would have me create documents in a
Unicode editor.


Basically yes. In principle, you could also use some other editor
and a separate converter program.


Agreed on both counts. But the 'some other editor' will need to store
the Unicode characters in some format - so, either it's likely to
be (X)HTML-aware, and convert to/from &#number; formats, or it's
likely to work internally in some fixed-width (perhaps ucs-4) format of
Unicode character representation. Or do what Windows seems to do
(which seems to be based on utf-16).
Using UTF-8 internally creates some problems, since there is no
simple way to e.g. pick up the 42nd character from a string (when
the length of a character in octets is varying).
Nevertheless, this is how recent Perl versions work, when their
Unicode support is triggered. I'm not sure if this is more likely to
enlighten or to confuse, for someone who wants to work with HTML
rather than with Perl, but I found
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html and
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html to be useful
when working with Perl ;-)
A Unicode editor could internally work with a different encoding
(maybe even an encoding where each character is represented as a
sequence of four octets - this is structurally simplest, though
surely wastes space) and convert between it and UTF-8 on output (and
on input), when desired.
Indeed. As I understand it, some versions of Windows are capable of
working internally with utf-16. There's also a *partial* list of
unicode-capable software at

http://www.unicode.org/onlinedat/products.html

which incidentally includes the NT-based Windows series (NT, 2000, XP)
Can I create a Unicode document on a Windows
98 US system, without installing and learning a whole lot of new
software?


I'm afraid not. But the whole lot of new software might be something
fairly simple, if you just need a text editor. If you need utilities
like macro expansion, file inclusion, etc., then it gets more
complicated.


It is of course possible for applications to support Unicode even if
the OS doesn't.

And keep in mind that applications can be written to process
utf-8-encoded data even though the program code itself contains
nothing more than us-ascii, so you don't necessarily have to get a
Unicode-enabled editor in order to program Unicode-capable
applications ;-)
I'm afraid you would need to check each program's documentation
separately. If it does not mention Unicode, the odds are that it does not
support Unicode.
Well, I haven't seen Windows (NT) bragging explicitly about its
Unicode support, but it's definitely there - as the Unicode web site
indeed says. And Perl can definitely exploit it (by setting the
WIDE_SYSTEM_CALLS flag).
Some operations might still work, to the extent that they operate on
strings of octets irrespectively of their interpretation as
characters or "parts of character".


Well, yes, but that kind of behaviour can easily produce invalid utf-8
sequences. And, for security reasons, the relevant specifications
mandate that invalid utf-8 sequences MUST produce an error - not some
kind of fixup.

Properly-behaved utf-8-aware software would never produce invalid
utf-8 sequences.
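[Strict decoders behave exactly this way; a Python sketch of the mandated error, with a deliberately truncated sequence contrived for the example:

```python
# A multi-octet sequence cut short (the arrow U+2192 minus its final
# octet) is invalid UTF-8; a strict decoder raises an error instead
# of attempting a fixup.
valid = "→".encode("utf-8")     # b'\xe2\x86\x92'
broken = valid[:-1]             # b'\xe2\x86' - no longer valid on its own
try:
    broken.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```
]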

cheers
Jul 20 '05 #20


P: n/a
On Sun, 11 Apr 2004, Stan Brown wrote:
But (as I mentioned) Mozilla seems to do a bad
job with saving UTF-8 documents.


I'd be interested to know more about this problem. I was under the
impression that Mozilla Composer behaved correctly. Could the problem
be Win/9x-specific?

Jul 20 '05 #22


P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
It was designed to be a transmission format (that's where the name
comes from, with U standing for Unicode),
...and the T standing for 'transformation' (not for 'transmission' as
you might seem to be implying here)...


Thanks for the correction; I should rely less on my memory.
As I understand it, some versions of Windows are capable of
working internally with utf-16. There's also a *partial* list of
unicode-capable software at - -
which incidentally includes the NT-based Windows series (NT, 2000,
XP)


I know NT very little, but I have got the impression from other people's
descriptions that its Unicode support is rather flaky. This might be one
of the reasons why the support wasn't advertized much. On XP, it seems
that Unicode support is fairly good, though there's still a long way to
go before we can _conveniently_ use Unicode.
Some operations might still work, to the extent that they operate on
strings of octets irrespectively of their interpretation as
characters or "parts of character".


Well, yes, but that kind of behaviour can easily produce invalid
utf-8 sequences.


I was mainly thinking about operations like string replacements that
operate on Ascii characters only, e.g. modifying HTML markup in documents
that are UTF-8 encoded. Unless I'm missing something, you could replace
any string of Ascii characters by another string of Ascii characters,
operating on 8-bit bytes only, without disturbing the integrity of
a UTF-8 encoded document.
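[That holds because UTF-8 continuation octets all have the high bit set, so they can never collide with an Ascii search string; a quick Python sketch, with the markup invented for the example:

```python
# Replacing Ascii-only markup in the raw octets of a UTF-8 document
# cannot disturb the multi-octet sequences: continuation octets are
# never in the Ascii range.
doc = "<b>café → naïve</b>".encode("utf-8")
edited = doc.replace(b"<b>", b"<strong>").replace(b"</b>", b"</strong>")
print(edited.decode("utf-8"))   # still valid UTF-8 after the edit
```
]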

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #24


P: n/a
On Mon, 12 Apr 2004, Jukka K. Korpela wrote:
I know NT very little, but I have got the impression from other people's
descriptions that its Unicode support is rather flaky.
For NT4 that might well be true, but Win/2000 seems OK. I've done
Unicode-based work in Activestate Perl on Win/2K and not had any
problems attributable to the OS and display support.

When it comes to the browser-like component (YKWIM), it seems to be a
good idea to install a few of the various "language" options,
irrespective of whether one has any intention of using those languages
"as such", e.g. USA users should add Pan-European for starters, and
then we all could add at least Japanese and possibly Arabic, in my
limited experience - I really can't tell you all the detailed
differences which this brought, but none of them were harmful, even
though I can't read Japanese nor Arabic. A whole sackfull of useful
symbols, which bore no obvious relationship to Japanese, suddenly
started to work in IE (just as they had already been working in
Mozilla) when I added Japanese support - very odd.

We had a bit of a standoff when someone wanted to process Microsoft
Train Simulator datasets with Perl. There were some horrible
interactions between the utf-16 format of the data, and Perl's support
for DOS-format newlines (CR+LF). But my conclusion was that the blame
lay primarily with the Perl (5.8.0) implementation, and not with the
Windows OS.
Well, yes, but that kind of behaviour can easily produce invalid
utf-8 sequences.


I was mainly thinking about operations like string replacements that
operate on Ascii characters only, e.g. modifying HTML markup in documents
that are UTF-8 encoded.


Well, OK, as long as the software passes "high-bit octets" completely
transparently, you'd be OK if you only made changes to ASCII
characters. (Some time back, I made quite a mess of a utf-8 file by
trying to edit its ASCII characters with an old version of "vi").
Unless I'm missing something, you could replace any string of Ascii
characters by another string of Ascii characters, operating on 8-bit
bytes only, without disturbing the integrity of a UTF-8 encoded
document.


Right, I don't see any reason to disagree with that statement -
subject to the caveat that I just made, that is.
Jul 20 '05 #26


P: n/a
"Daniel R. Tobias" <da*@tobias.name> wrote in
comp.infosystems.www.authoring.html:
Stan Brown wrote:
There's another problem with UTF-8: when I "Save As" a UTF-8 page,
Mozilla 1.4 scrogs up the high-order characters so that the local
copy contains garbage sequences instead of e.g. –. I reported
this months ago; anybody know if it's been fixed in later versions?


They're not "garbage sequences"; they're properly-coded UTF-8 byte
sequences representing the characters in question. You're just
apparently viewing the resulting file in a viewer or editor that expects
us-ascii, iso-8859-1, windows-1252, or some other 7- or 8-bit encoding,
so they look like garbage. When loaded into a UTF-8-based viewer,
they'd look fine.


That makes sense; thanks. It doesn't actually help me since I don't
have any Unicode capable tools, but at least I understand it's
probably not a bug. (I could wish that Mozilla gave the option to
store in the machine's own character set, however.)
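[For what it's worth, the conversion wished for here is a one-liner in any Unicode-capable scripting language; a Python sketch, where the sample text is made up and windows-1252 stands in for "the machine's own character set":

```python
# Re-encode Unicode text into an 8-bit charset, replacing anything
# unrepresentable with numeric character references.
text = "π → café"              # stand-in for a UTF-8 file's contents
octets = text.encode("windows-1252", errors="xmlcharrefreplace")
print(octets.decode("windows-1252"))   # &#960; &#8594; café
```
]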

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #28


P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in
comp.infosystems.www.authoring.html:
On Sun, 11 Apr 2004, Stan Brown wrote:
But (as I mentioned) Mozilla seems to do a bad
job with saving UTF-8 documents.


I'd be interested to know more about this problem. I was under the
impression that Mozilla Composer behaved correctly. Could the problem
be Win/9x-specific?


It could very well be. (Mozilla 1.4, Win 98. I'm using GVim to view
the files, and I get the same thing in ASCII or binary mode.)

Here's a sample:
http://www.acad.sunytccc.edu/instruct/sbrown/stats/
Note the right-pointing arrows in the breadcrumbs at the upper left
of the page. They are → in my source code.

"Save As" and "Web Page Complete", they are saved as a circumflex a
(character 226 decimal, E2 hex), followed by a dagger (134 or 0x86),
followed by a curly right apostrophe (146 or 0x92).

"Save As" and "Web Page HTML Only", they are saved as → as I
would expect.

The above page doesn't have a server-supplied charset (IIS server,
and me with no access to administrative functions), but supplies
"UTF-8" in a META tag.

If you want a page that the _server_ labels as UTF-8, try
<http://oakroadsystems.com/twt/special.htm>. Don't look at the
breadcrumb arrows on this page, because they're images. Look instead
at the Summary paragraph near the beginning. The pi characters
(π in source) are π in "Web page HTML only" and 207+128
(0xCF+0x80) in "Web page Complete". Interestingly, "Web page
Complete" changes &deg; from my actual source code to 194+176
(0xC2+0xB0). The 176 (0xB0) is a degree mark.
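[As a cross-check, the three octet sequences reported above are exactly the UTF-8 encodings of those characters; a Python sketch, for verification only:

```python
# The reported octets match the UTF-8 encodings exactly:
# 226+134+146 for the arrow, 207+128 for pi, 194+176 for the degree sign.
for ch, expected in (("→", bytes([226, 134, 146])),
                     ("π", bytes([207, 128])),
                     ("°", bytes([194, 176]))):
    assert ch.encode("utf-8") == expected
    print(ch, expected.hex())
```
]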

I'd be interested whether others verify these differences.

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #30


P: n/a
On Mon, 12 Apr 2004, Stan Brown wrote:
That makes sense; thanks. It doesn't actually help me since I don't
have any Unicode capable tools,
What kind of capability are you looking for, over and above Mozilla
Composer and your available range of browsers? Activestate Perl
(5.8.0 or later) can be very handy, if you're into that sort of thing.
I don't see why it wouldn't work under Win9x, although I haven't used
Win9x myself for some years now so YMMV.
(I could wish that Mozilla gave the option to
store in the machine's own character set, however.)


What's this Composer menu item: File -> "Save as Charset", hmmm?

(Win32 Mozilla 1.6, if that's relevant) (well, I don't think so, as I
get the same menu with Moz 1.4.1 under Fedora Linux)

Uh-uh, were you perhaps just looking at the _browser's_ own save-As
menu? To get what you want, it seems you'd need to first take the
Mozilla browser's File -> "Edit page", to open a Composer window, and
then "save as charset" from there.

I think that also addresses your other posting. If you browse the
resulting file with Mozilla and then view source, it should be
displayed correctly. And that _should_ be the case with other
browsers' view source, if their source viewer is properly capable.

I'm not sure whether I have the nerve to try firing up this antique
Win95 system behind me, it's been gathering dust for several years
now.
Jul 20 '05 #32


P: n/a
Stan Brown wrote:
http://www.acad.sunytccc.edu/instruct/sbrown/stats/
404 not found
<http://oakroadsystems.com/twt/special.htm>. Look instead at
the Summary paragraph near the beginning. The pi characters
(π in source) are π in "Web page HTML only" and 207+128
(0xCF+0x80) in "Web page Complete". Interestingly, "Web page
Complete" changes &deg; from my actual source code to 194+176
(0xC2+0xB0). The 176 (0xB0) is a degree mark.


Confirmed. I tried saving web page complete, saw the same results. I
also tried going to web editor window, then saving. Same results.
Finally, I followed Alan Flavell's advice: web page edit, save as
charset, where UTF-8 was selected, and saved. No difference, characters
were changed as Stan described above. Win 2k. I can't find a version
number anywhere in Mozilla (it's no longer in "about" page). I think
it's Moz 1.6.

--
Brian (remove "invalid" from my address to email me)
http://www.tsmchughs.com/
Jul 20 '05 #34


P: n/a
On Mon, 12 Apr 2004, Brian wrote:
Confirmed. I tried saving web page complete, saw the same results. I
also tried going to web editor window, then saving. Same results.
Finally, I followed Alan Flavell's advice: web page edit, save as
charset, where UTF-8 was selected, and saved. No difference, characters
were changed as Stan described above. Win 2k.


I'm having to guess a bit about the circumstances of what you're
describing, so bear with me (or "bare with me" as the naturists might
say ;-)

The HTML source is evidently saved by Moz Composer without an explicit
charset indication within it (which would have to be a meta http-equiv
if there was one at all).

So if you view this via file://... locally, you'll need to set the
browser's default (in Mozilla: View->Character Coding->utf-8), whereas
if you put the document on a web server, you want to configure the web
server to send the right HTTP charset (doing that is always best
practice IMHO, even if it seems like extra work compared to plonking a
meta http-equiv into the file itself).

If you want to take the meta http-equiv route despite my advice, on
the other hand, then you'll only be doing what plenty of others do.
And once you've put that meta in place, it appears that Composer will
keep it updated if you save with different charsets. What it doesn't
do is to create one in the first place.

Does that help at all?

By the way, take care if you ever get a yearning to include non-ascii
characters into your CSS stylesheet. Some browsers assume that the
stylesheet is in the same character encoding as the HTML, unless the
stylesheet specifies its own @charset. Examples have been seen where
even a CSS /*comment*/ containing a single iso-8859-1 octet caused the
entire stylesheet to be ruled invalid, because the browser was
treating the stylesheet as utf-8.
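For concreteness, this is roughly what the two in-document declarations mentioned above look like (a minimal sketch; the surrounding markup and file layout are up to you):

```html
<!-- In the HTML document's <head>, if you take the meta http-equiv route: -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
```

```css
/* The very first thing in an external stylesheet, before any rule or comment: */
@charset "utf-8";
```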
Jul 20 '05 #36


P: n/a
"Brian" <us*****@julietremblay.com.invalid> wrote in
comp.infosystems.www.authoring.html:
Stan Brown wrote:
http://www.acad.sunytccc.edu/instruct/sbrown/stats/
404 not found


I'm sorry -- it should have been
http://www.acad.sunytccc.edu/instruct/sbrown/stat/
I did check before posting; honest. Maybe in the spell-check phase I
clicked "Change" instead of "Ignore".

But it's a blessing in disguise, since the other page
<http://oakroadsystems.com/twt/special.htm>.

shows the same problems (as you confirmed) without raising the side
issue of charset in <meta>.
Win 2k. I can't find a version
number anywhere in Mozilla (it's no longer in "about" page). I think
it's Moz 1.6.


Annoying, isn't it? I complained about this a couple of months ago
on the Mozilla group. After many back-and-forths of "It's on the
page" "No it isn't" someone finally admitted that the version number
is (a) NOT compiled in but (b) read from the preferences file but in
any event (c) won't display in the About page unless Javascript is
enabled.

The developers who responded saw nothing wrong in this. I was
stunned.

I use Mozilla because it's the best of the alternatives. But its
developers seem every bit as arrogant as Microsoft: they know what's
best and any user input that disagrees with their preconceptions is
simply dismissed.

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #38


P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in
comp.infosystems.www.authoring.html:
I'm having to guess a bit about the circumstances of what you're
describing, so bear with me (or "bare with me" as the naturists might
say ;-)


Did you miss my article and see only Brian's follow-up?

I was pretty explicit, or at least I tried to be.

Message-ID: <MP************************@news.odyssey.net>

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #40


P: n/a
Stan Brown wrote:
> Brian wrote:
>> I can't find a version number anywhere in Mozilla (it's no longer
>> in the "about" page).
> Annoying, isn't it?

In fact, yes, it is very annoying.

> the version number is (a) NOT compiled in but (b) read from the
> preferences file but in any event (c) won't display in the About page
> unless Javascript is enabled.

Well that's just plain silly.

> The developers who responded saw nothing wrong in this. I was
> stunned.

As am I.

> I use Mozilla because it's the best of the alternatives. But its
> developers seem every bit as arrogant as Microsoft: they know what's
> best and any user input that disagrees with their preconceptions is
> simply dismissed.

I don't know if I'd be that hard on them, but the version number thing
*is* troubling.

>> I think it's Moz 1.6.

Enabled js; confirmed, I'm using 1.6.

--
Brian (remove "invalid" from my address to email me)
http://www.tsmchughs.com/
Jul 20 '05 #42


P: n/a
On Sun, 11 Apr 2004, Stan Brown wrote:
What is a UTF-8
document? What does it look like physically? Is each character 16
bits, or is there some variable-length encoding? Or is UTF-8 not a
storage scheme at all but simply a scheme for transmission from
server to browser?
Typically, MS Windows uses UTF-16, whereas Unix/Linux use UTF-8
for file encoding. When a Windows program offers you the option
"Save in Unicode" this is likely to be UTF-16.
A short summary would be greatly appreciated, but a
reference to something on line would be helpful too.


<http://www.unicode.org/unicode/faq/utf_bom.html>
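The FAQ answers Stan's question, but it is also easy to check empirically that UTF-8 is a variable-length encoding (one to four bytes per character), while UTF-16 uses two bytes per character, or four for characters outside the Basic Multilingual Plane; a short Python sketch:

```python
# Byte counts for a few sample characters in UTF-8 vs. UTF-16
# (little-endian, no BOM): ASCII "A", e-acute, Greek pi, and a
# character beyond U+FFFF (musical G clef, U+1D11E).
for ch in ["A", "\u00e9", "\u03c0", "\U0001d11e"]:
    utf8 = len(ch.encode("utf-8"))       # 1, 2, 2, 4 bytes
    utf16 = len(ch.encode("utf-16-le"))  # 2, 2, 2, 4 bytes (surrogate pair)
    print(repr(ch), utf8, "bytes in UTF-8,", utf16, "bytes in UTF-16")
```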

Jul 20 '05 #44


P: n/a
Many thanks to all who responded. From the link supplied by Alan, it
seems Win98 is not really Unicode ready in any meaningful way. While
changing operating systems (probably _not_ to another Windoze) is
probably the right thing to do in the long term for many reasons,
I've got too much on my plate even to think about it for months
ahead.

Looks as though I should stick with Scenario 6.

Thanks also to Jukka and Andreas for the links to UTF format
answers. My curiosity is satisfied about what's going on in the
bits.

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #46


P: n/a
On Thu, 15 Apr 2004, Stan Brown wrote:
Thanks also to Jukka and Andreas for the links to UTF format
answers. My curiosity is satisfied about what's going on in the
bits.


Nevertheless, I have a postscript:
This test page
<http://www.unics.uni-hannover.de/nhtcapri/multilingual1.htm>
is sent without encoding (charset) specified. You can now
switch to any encoding in your browser and view the results.
But only "charset=UTF-8" gives you the correct result.
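That behaviour can be reproduced offline; a small Python sketch decodes the same two bytes under two different charset assumptions:

```python
# The UTF-8 bytes for the Greek small letter pi are 0xCF 0x80.
# Read as UTF-8 they form one character; read as windows-1252 they
# become the familiar two-character mojibake.
raw = "\u03c0".encode("utf-8")     # b'\xcf\x80'
print(raw.decode("utf-8"))         # π  (correct)
print(raw.decode("windows-1252"))  # Ï€ (wrong charset assumed)
```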

Jul 20 '05 #48

