How do I display character 151 (long hyphen) in XHTML (utf-8) ?

Zenobia <5.**********@spamgourmet.com> wrote:

How do I display character 151 (long hyphen) in XHTML (utf-8) ?

Is there another character that will substitute? The W3C validation parser,
http://validator.w3.org, tells me that this character and the ones around it are illegal
- then, after resubmission it flags no errors.

So, are there any illegal characters between 0 and 255 in the UTF-8 character set or is it
just my imagination that the W3C validation parser thinks there are - say between 129-151,
or thereabouts; then later it changes its mind?

The characters between 128 and 159 are not valid in HTML--they are
Windows extensions to the character set. The "long hyphen" (em dash)
should be coded as &#8212;.

See Jukka Korpela's page at

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

for information on the proper code to use for most of these
characters. For character 128, the Windows euro symbol, see

http://www.cs.tut.fi/~jkorpela/html/euro.html

Windows doesn't have characters for 129, 141, 143, 144, or 157. Jukka
left out the lower- and upper-case z-hacek at positions 158 and 142--I
don't know why!
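
The mapping described above can be checked mechanically. The following is only a sketch, assuming Python 3 and its built-in "cp1252" codec (which leaves positions 129, 141, 143, 144 and 157 unassigned); the helper name cp1252_to_reference is hypothetical and not something from Jukka's page.

```python
from typing import Optional


def cp1252_to_reference(position: int) -> Optional[str]:
    """Map a Windows-1252 byte position to a numeric character reference.

    Returns None when Python's cp1252 codec has no character at that position.
    """
    try:
        char = bytes([position]).decode("cp1252")
    except UnicodeDecodeError:
        return None
    return f"&#{ord(char)};"


if __name__ == "__main__":
    # 128 = euro sign, 142/158 = Z/z with caron, 151 = em dash
    for position in (128, 129, 142, 151, 158):
        print(position, cp1252_to_reference(position))
```

The point of going through the codec rather than a hand-typed table is that the Unicode code point, not the Windows position, is what ends up in the reference.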

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.

Jul 20 '05 #4

On Sat, 10 Apr 2004, Zenobia wrote:

How do I display character 151 (long hyphen) in XHTML (utf-8) ?

The characters between 128 and 159 decimal in the XHTML Document
Character Set (Unicode) are control characters (see e.g.
http://www.unicode.org/charts/PDF/U0080.pdf ) and are excluded from
use in XHTML.

Don't confuse them with the displayable characters in some other 8-bit
character encodings.

Is there another character that will substitute?

You might find
http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT
to be useful, but basically you need to understand the (X)HTML
character representation model first. http://www.w3.org/TR/charmod/

The W3C validation parser, http://validator.w3.org, tells me that
this character and the ones around it are illegal - then, after
resubmission it flags no errors.

There's something of significance that you're not telling us.

So, are there any illegal characters between 0 and 255 in the UTF-8
character set

There is no "UTF-8 character set". UTF-8 is an encoding scheme of the
Unicode "character set".

Certainly the control characters x80-x9F (128-150 decimal), as well as
most of the control characters x00-x1F (0-31 decimal) , of the
Document Character Set (Unicode), are excluded from use in XHTML.

In the case of other encodings, you need to refer to the cross-mapping
tables (below http://www.unicode.org/Public/MAPPINGS/ ) to find the
equivalences.
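
To make the distinction between a character set and an encoding concrete, here is a small sketch, assuming Python 3; the helper names utf8_hex and is_valid_utf8 are made up for illustration. It shows that the em dash (U+2014) becomes three bytes in UTF-8, that the control character U+0097 becomes two, and that a lone byte 151 (0x97) is not UTF-8 at all, although it is an em dash in Windows-1252.

```python
EM_DASH = "\u2014"      # the character the original poster actually wants
C1_CONTROL = "\u0097"   # Unicode code point 151, a C1 control character


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether raw can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(utf8_hex(EM_DASH))
    print(utf8_hex(C1_CONTROL))
    print(is_valid_utf8(b"\x97"))
    print(b"\x97".decode("cp1252") == EM_DASH)
```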

Jul 20 '05 #6

Zenobia

On Sat, 10 Apr 2004 10:43:17 +0100, Steve Pugh <st***@pugh.net> wrote:

Zenobia <5.**********@spamgourmet.com> wrote:
How do I display character 151 (long hyphen) in XHTML (utf-8) ?

Position 151 is a control character in UTF-8, it is not a long hyphen.
There is no character called long hyphen in Unicode, maybe you're
thinking of the em dash, which is decimal 8212 in Unicode and 151 in
Windows-1252?

How are you currently trying to display the character? Are you
entering it directly or are you using &#151;? The former can be okay
under some circumstances (i.e. if you're advertising your character
encoding as being Windows-1252 or similar and if all your audience can
cope with that encoding). The latter is just dangerous and will only
'work' because some browsers break (or at least severely bend) the
specification.

Thanks, Steve. I'm using &#151; but the encoding has been changed to UTF-8 to make it
XHTML compliant. I see I'll have to change that. It seems to me that XHTML (UTF-8)
compliant code is too restrictive for my needs. I've just found out that the same
character is called &mdash; as well.

I prefer to use the named entity convention. I can't stand the idea of having to write
a acute as a number rather than &aacute;. Numbers are meaningless, especially when your
editor's WYSIWYG feature doesn't display characters correctly. It becomes impossible to
understand what you've written. I expect the browsers I'm writing for to understand things
like &aacute; and &beta; - I can. I've been using the Mathematical, Greek and Symbolic
characters for HTML shown here:
www.intuitive.com/coolweb/entities.html and here
http://www.htmlhelp.com/reference/ht...s/symbols.html

Which particular encoding is that? What spec did it come from? Are all these named
entities specified in ISO-10646?

These Math, Greek and Symbolic characters have been around for years. I find it
astonishing that some modern browsers still can't support them. But I have to admit I
don't care. There's no way I'm going to memorize a bunch of numbers just so that I can
read my source code. However I would like to specify the correct encoding in my documents
in future.

Are the named entities ISO 10646 or ISO 8859-1?
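
For what it's worth, the entity names are not tied to any byte encoding; each name stands for a code point in the document character set. A quick way to see this, sketched here under the assumption that Python 3 is available (its html.entities module carries the HTML 4 entity names in name2codepoint); the function entity_to_numeric is a hypothetical helper:

```python
from html.entities import name2codepoint


def entity_to_numeric(name: str) -> str:
    """Translate an HTML 4 entity name (without & and ;) to a numeric reference."""
    return f"&#{name2codepoint[name]};"


if __name__ == "__main__":
    for name in ("mdash", "aacute", "beta"):
        print(f"&{name};", "->", entity_to_numeric(name))
```

So a named entity and the corresponding numeric reference mean the same character, whichever encoding the file itself is saved in.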

Jul 20 '05 #8

Zenobia

On Sat, 10 Apr 2004 10:43:56 +0100, "Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

On Sat, 10 Apr 2004, Zenobia wrote:
The W3C validation parser, http://validator.w3.org, tells me that
this character and the ones around it are illegal - then, after
resubmission it flags no errors.

There's something of significance that you're not telling us.

Thanks for your answer too - I shall be looking at all the links you've given me here.
I've snipped your reply so that my answer stands out.

Yes I think the validator flagged them as illegal when I had 5 errors in my document. I
removed one of these errors, before the </body> tag, and suddenly there were no
fatal errors (just warnings). That, at least, is how I remember it.

I've given up with XHTML compliance now for my own pages because I have no intention of
using UTF-8 as I want to use the named entity convention for funny characters. [I'm
writing scientific articles]

Jul 20 '05 #10

On Sat, 10 Apr 2004, Zenobia wrote:

I've given up with XHTML compliance now for my own pages because I
have no intention of using UTF-8 as I want to use the named entity
convention for funny characters. [I'm writing scientific articles]

This will not solve your problem. It is likely to store up more problems
in the future. I can only recommend that you take the relatively
small step towards understanding this aspect of (X)HTML, as it's quite
fundamental to anything that involves unusual characters - as you
evidently aim to achieve.

Of course, in the end the choice is yours. I'm only offering what
seems to me to be appropriate advice.

Jul 20 '05 #12

Zenobia <5.**********@spamgourmet.com> wrote:

I've given up with XHTML compliance now for my own pages because I
have no intention of using UTF-8

XHTML does not require the use of UTF-8. On the other hand, XHTML
compliance is not relevant in practical HTML authoring.

as I want to use the named entity
convention for funny characters.

You can use the funny entity names if you like, whether you use HTML or
XHTML. On the other hand, character references work more reliably.

To include an em dash in your document, you can simply write
&#8212;
irrespective of HTML 4 / XHTML distinctions, character encoding,
language, phase of the moon, etc. Some browsers may still fail to render
it properly, and using hyphens is more robust, but if you are willing to
take this risk, use &#8212;.
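
If the text is typed with real characters in a capable editor, the references can be generated rather than memorized. A minimal sketch, assuming Python 3; to_numeric_references is a hypothetical helper name, and it relies on the standard "xmlcharrefreplace" error handler, which produces decimal references for everything outside ASCII:

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_numeric_references("I am\u2014honestly\u2014disappointed"))
    print(to_numeric_references("\u00e1 and \u03b2"))
```

The output is plain ASCII, so it survives whatever encoding the server happens to announce.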

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #14

Harlan Messinger <hm*******************@comcast.net> writes:

The characters between 128 and 159 are not valid in HTML

It depends; the OP has already stated that he uses &#151;, and using
numeric references *is valid* in this context (specialised HTML
validation services might be modified to conform to room temperature IQs
instead of ISO8879, because nobody who understands the implications
would use a remote service for validation purposes, but I digress).

The inherent problem still applies, of course, but it is passed to the
application; in terms of validation there's no error to be reported.
--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aisha

Jul 20 '05 #16

Zenobia <5.**********@spamgourmet.com> writes:

I've given up with XHTML

Well, that's fine, but you still have some serious misunderstandings:

compliance now for my own pages because I have no intention of
using UTF-8 as I want to use the named entity convention for funny
characters.

You don't *need* to use utf-8 encoding for XHTML, just as you *could*
use utf-8 for HTML. Using utf-8, you could just insert every character
literally (given you use a utf-8 capable editor in the first place);
the problem is compatibility (regarding limitations of user agents,
operating systems or even third party services that process your pages
one way or the other).
--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aisha

Jul 20 '05 #18

On Sat, 10 Apr 2004, Alan J. Flavell wrote:

The characters between 128 and 159 decimal in the XHTML Document
Character Set (Unicode) are control characters
[...]
Certainly the control characters x80-x9F (128-150 decimal), as well as
                                          ^^^^^^^^^^^^^^^

Apologies for the typo. That was meant to be "128-159 decimal", of
course.

Jul 20 '05 #20

On Sat, 10 Apr 2004, Eric B. Bednarz wrote:

It depends; the OP has already stated that he uses &#151;, and using
numeric references *is valid* in this context

A statement which is at least as misleading as it is technically
accurate, unfortunately. In HTML (as opposed to XHTML) the validity
rules are those of SGML, and, as far as SGML is concerned, this
construct is not actually "invalid" - merely "undefined".

In XHTML (which is the subject line of the posting), my understanding
is that &#number; references between 128 and 159 inclusive are
illegal, and will be correctly rejected by an XML validator. Do you
know different? Then you'd better complain to the authors of some XML
validators.

Jul 20 '05 #22

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:

On Sat, 10 Apr 2004, Eric B. Bednarz wrote:
It depends; the OP has already stated that he uses &#151;, and using
numeric references *is valid* in this context

A statement which is at least as misleading as it is technically
accurate, unfortunately. In HTML (as opposed to XHTML) the validity
rules are those of SGML, and, as far as SGML is concerned, this
construct is not actually "invalid" - merely "undefined".

The *construct* actually is defined; you have the possibility to enter
non-SGML characters by numeric reference. They certainly *are invalid*
if they appear as markup or data characters.

In XHTML (which is the subject line of the posting)

No argument here at all, I was directly replying to an inline statement
of "in HTML". What would be misleading is to say that something is
invalid and a validator (correctly) doesn't report it as an error.
--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aisha

Jul 20 '05 #24

On Sat, 10 Apr 2004, Eric B. Bednarz wrote:

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:
A statement which is at least as misleading as it is technically
accurate, unfortunately. In HTML (as opposed to XHTML) the validity
rules are those of SGML, and, as far as SGML is concerned, this
construct is not actually "invalid" - merely "undefined".
The *construct* actually is defined; you have the possibility to enter
non-SGML characters by numeric reference.

Yes, but -what- character is this? The interworking specifications do
not define what it is, and in that sense I continue to assert that it
is "undefined".

Sure - the syntax of the construct itself (ampersand, hash, decimal
number, semicolon) is the same regardless of the value of the number,
but some numbers give the construct a defined meaning (they reference
a character in the HTML Document Character Set), whereas some do not.
This is one of the latter kind.

But which character does it then represent? The Document Character
Set of HTML is iso-10646/Unicode, in which character 151 decimal
(U+0097) - if it were used - would be the control character END OF
GUARDED AREA. But the SGML declaration for HTML excludes this
character range, does it not?
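
The identity of code point 151 can be looked up in the Unicode character database. This is only a sketch, assuming Python 3 and its unicodedata module; describe is a hypothetical helper. Note that name() has no formal name to return for control characters (names such as END OF GUARDED AREA come from ISO 6429 and appear in Unicode only as aliases), so a placeholder is used.

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (U+XXXX label, general category, name or placeholder) for a code point."""
    char = chr(code_point)
    return (
        f"U+{code_point:04X}",
        unicodedata.category(char),           # "Cc" means control character
        unicodedata.name(char, "<no name>"),  # controls have no formal name
    )


if __name__ == "__main__":
    print(describe(151))    # what &#151; literally refers to
    print(describe(8212))   # what the author presumably meant
```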

As I understand it, the meaning of such undefined references is open
to mutual agreement between the parties. But as the WWW readership is
in no position to enter into any mutual agreement on this matter, I
don't think it's unfair to rate these references as "undefined". At
least, that is how it was put to me by SGML specialists, and I've seen
no reason to doubt it since.

Others might suppose that unilateral imposition by Bill G represents
"mutual agreement", and that therefore all these characters "mean"
whatever they mean in Windows 8-bit encodings. But the HTML document
character set -has- a proper place for all of those characters, so
there is no necessity to adopt such proprietary conventions over and
above those established in HTML4.
They certainly *are invalid* if they appear as markup or data
characters.

I'm not sure what you're saying here. You had already stated
(in reference specifically to HTML as opposed to XHTML):

| in terms of validation there's no error to be reported.

I thought your whole point was that a reference such as &#151; in HTML
(as opposed to XHTML) is *not* invalid, in the technical SGML sense
(even though, as I'm saying, its meaning is undefined by any of the
interworking specifications).

In XHTML (which is the subject line of the posting)

No argument here at all, I was directly replying to an inline statement
of "in HTML".

OK, yes, I see that Harlan M. did indeed (mistakenly) say that.

At least we don't need these complex prevarications in XHTML, so I
suppose XHTML does have *some* benefits, after all ;-)

Jul 20 '05 #26

(I seem to be quite unclear, sorry; I'll snip a lot to avoid even more
misunderstandings ;-)

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:

On Sat, 10 Apr 2004, Eric B. Bednarz wrote:
[non-SGML characters]

They certainly *are invalid* if they appear as markup or data
characters.

I thought your whole point was that a reference such as &#151; in HTML
(as opposed to XHTML) is *not* invalid, in the technical SGML sense

Right; and that's not what I wrote above.

'&#151;' is neither a *markup character* (it contains some, though) nor
a *data character*, it's markup that can be parsed, and *the result* is
data. At this point, it doesn't matter to the parser that the resulting
data is a non-SGML character that itself cannot be parsed.

To illustrate:

<!DOCTYPE em [
<!ELEMENT EM O O ANY>
<!ENTITY dash "&#151;" -- Markup parsed to data; no problem -->
]>&dash; 
At least we don't need these complex prevarications in XHTML, so I
suppose XHTML does have *some* benefits, after all ;-)

Sure; futile improvements of environments that are essentially
malfunctional conveniently discard the uncomfortable need to track back
to the point where it went wrong and work on a real solution from
scratch (I do the same for pretty much all aspects of real life, so I
should be more tolerant, I suppose).
--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aisha

Jul 20 '05 #28

Harlan Messinger <hm*******************@comcast.net> wrote:

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
- -
Windows doesn't have characters for 129, 141, 143, 144, or 157. Jukka
left out the lower- and upper-case z-hacek at positions 158 and
142--I don't know why!

My page explains near the start, after giving the conservative
recommendation to avoid certain characters: "The same applies to euro
sign, as well as to Z and z with caron, with the additional note that
since they are additions to the original MS Windows character set, they
cause even more problems than the others."

The recommendation may sound _very_ conservative these days, but as I
have mentioned elsewhere, it seems that Google translator cannot even
cope with the right single quotation mark, i.e. treats a word like
"don't" as untranslatable if a typographically correct character is
used instead of the ASCII apostrophe. I'm afraid we _still_ have
software lurking around that doesn't understand even the "Windows extra
characters" right.

I just tested how
http://translate.google.com/translate_t
translates
I am--honestly--disappointed at Google.
into French when I use the em dash instead of "--":
J'am?honestly?disappointed chez Google.
(I guess everyone sees that something went fundamentally wrong here,
even if you don't know French.)

Using the typographically inferior surrogate notation with "--" I get
Je -- honnêtement -- suis déçu chez Google.
(which makes much more sense)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #30

Eric B. Bednarz <be*****@fahr-zur-hoelle.org> wrote:

Harlan Messinger <hm*******************@comcast.net> writes:
The characters between 128 and 159 are not valid in HTML
It depends; the OP has already stated that he uses —, and using
numeric references *is valid* in this context (specialised HTML
validation services might be modified to conform to room temperature IQs
instead of ISO8879, because nobody who understands the implications
would use a remote service for validation purposes, but I digress).

From the SGML specification for HTML:

CHARSET
    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET      0       9  UNUSED
                 9       2  9
                11       2  UNUSED
                13       1  13
                14      18  UNUSED
                32      95  32
               127       1  UNUSED
               128      32  UNUSED
               160   55136  160
             55296    2048  UNUSED -- SURROGATES --
             57344 1056768  57344

The document character set is copied from ISO 10646-1 without
renumbering for ranges 9-10, 13-13, 32-126, 160-55295, and 57344 on
up. The ranges 0-8, 11-12, 14-31, 127-159, and 55296-57343 are
unused--they are not part of the DCS for (X)HTML. So why is it wrong
for them to be flagged as invalid when someone includes them in the
document?
The inherent problem still applies, of course, but it is passed to the
application; in terms of validation there's no error to be reported.
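
The DESCSET table quoted above reads as (first position, count, mapping) rows, so membership in the document character set is a simple range test. A sketch, assuming Python 3; the tuple layout and the name in_document_character_set are choices made for this illustration, with None standing in for UNUSED:

```python
# (first position, number of positions, base-set position or None for UNUSED)
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),      # surrogates
    (57344, 1056768, 57344),
]


def in_document_character_set(code_point: int) -> bool:
    """True if the DESCSET rows assign a character to this position."""
    for start, count, base in DESCSET:
        if start <= code_point < start + count:
            return base is not None
    return False


if __name__ == "__main__":
    for code_point in (9, 151, 160, 8212, 55296):
        print(code_point, in_document_character_set(code_point))
```

This only models the declaration itself; whether a parser reports a reference to an unused position as an error is the separate question being argued in this thread.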

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.

Jul 20 '05 #32

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:

Harlan Messinger <hm*******************@comcast.net> wrote:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html- -
Windows doesn't have characters for 129, 141, 143, 144, or 157. Jukka
left out the lower- and upper-case z-hacek at positions 158 and
142--I don't know why!

My page explains near the start, after giving the conservative
recommendation to avoid certain characters: "The same applies to euro
sign, as well as to Z and z with caron,

There's the problem. I was searching for "hacek". Sorry.

with the additional note that
since they are additions to the original MS Windows character set, they
cause even more problems than the others."

The recommendation may sound _very_ conservative these days, but as I
have mentioned elsewhere, it seems that Google translator cannot even
cope with the right single quotation mark, i.e. treats a word like
"don't" as untranslatable if a typographically correct character is
used instead of the ASCII apostrophe. I'm afraid we _still_ have
software lurking around that doesn't understand even the "Windows extra
characters" right.

I just tested how
http://translate.google.com/translate_t
translates
I am--honestly--disappointed at Google.
into French when I use the em dash instead of "--":
J'am?honestly?disappointed chez Google.
(I guess everyone sees that something went fundamentally wrong here,
even if you don't know French.)

Using the typographically inferior surrogate notation with "--" I get
Je -- honnêtement -- suis déçu chez Google.
(which makes much more sense)

Everything's relative.

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.

Jul 20 '05 #34

Neal

On Sat, 10 Apr 2004 21:55:43 +0000 (UTC), Jukka K. Korpela
<jk******@cs.tut.fi> wrote:

I just tested how
http://translate.google.com/translate_t
translates
I am--honestly--disappointed at Google.
into French when I use the em dash instead of "--":
J'am?honestly?disappointed chez Google.
(I guess everyone sees that something went fundamentally wrong here,
even if you don't know French.)
I think what's fundamentally wrong is that no one uses an em dash when
highlighting a word in English. There's typographic ways to express the
stress you intend, but none of those are going to be able to be parsed by
a translator at this still early time.

Honestly.
Using the typographically inferior surrogate notation with "--" I get
Je -- honnêtement -- suis déçu chez Google.
(which makes much more sense)

And, coincidentally, doesn't.

I am, honestly, disappointed in Google. But I don't blame it on anything.

Jul 20 '05 #36

Stan Brown

"Zenobia" <5.**********@spamgourmet.com> wrote in
comp.infosystems.www.authoring.html:

How do I display character 151 (long hyphen) in XHTML (utf-8) ?
Character 151 may be an em dash (not "long hyphen") in your Windows
character set, but it's undefined in HTML. You want &mdash; (or, numerically, &#8212;).
So, are there any illegal characters between 0 and 255 in the UTF-8 character set

Yes, don't use 128-159. Windows has put bullets, dashes, and curly
quotes in that range, but for HTML you need to code them correctly.
A good reference is Alan Wood's
<http://www.hclrss.demon.co.uk/demos/wgl4.html>.
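
To make the 128-159 advice concrete, here is a minimal Python sketch (not from the thread) that looks up what a Windows-1252 byte value stands for and prints the numeric character reference one could write in HTML instead. The helper name cp1252_to_ncr is made up for illustration, and the sketch assumes Python's built-in "cp1252" codec, which leaves 129, 141, 143, 144 and 157 unassigned.

```python
from typing import Optional


def cp1252_to_ncr(code: int) -> Optional[str]:
    """Map a Windows-1252 byte value to an HTML numeric character reference.

    Returns None when Windows-1252 assigns no character to that byte.
    (Hypothetical helper, for illustration only.)
    """
    try:
        # Decode the single byte using the Windows-1252 code page.
        ch = bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec treats 129, 141, 143, 144, 157 as undefined.
        return None
    # ord() gives the Unicode code point, which is what &#...; refers to.
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    for code in (128, 142, 151, 158, 129):
        print(code, cp1252_to_ncr(code))
```

Under those assumptions, 151 comes out as &#8212; and 128 (the euro sign) as &#8364;, while 129 yields None because nothing is assigned there.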

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/

Jul 20 '05 #38

Neal <ne*****@spamrcn.com> wrote:

I think what's fundamentally wrong is that no one uses an em dash
when highlighting a word in English.
What's wrong with that?
There's typographic ways to
express the stress you intend, but none of those are going to be able
to be parsed by a translator at this still early time.

What I meant with the use of em dashes is relatively irrelevant here
(but I used them in the normal sense of separating an
extra remark--like this--from the normal flow of text).

The point is that Google translator does not understand the em dash
(or the right single quote) as a punctuation character at all.

For further demonstration, consider the sentence:
He said: "Yes, I know."
When asked to translate into German, Google translator produces
Er sagte: "ja, weiß ich."
which has an odd word order (not that surprising) but is otherwise
correct. But when I change the quotation marks in the original, replacing
the Ascii characters by orthographically and typographically correct
marks ("smart quotes"), the translator does not translate the quoted
string at all, effectively treating it as a string literal.
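
As a rough illustration of the workaround discussed here (falling back to ASCII surrogates before handing text to software that does not understand typographic punctuation), a small Python sketch follows. The table ASCII_FALLBACKS and the function to_ascii_punctuation are hypothetical names, and the choice of replacements is an assumption, not something any poster prescribed.

```python
# Hypothetical fallback table: Unicode code point -> ASCII surrogate.
ASCII_FALLBACKS = {
    0x2014: "--",  # em dash
    0x2013: "-",   # en dash
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark (typographic apostrophe)
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def to_ascii_punctuation(text: str) -> str:
    """Replace typographic dashes and quotes with ASCII surrogates."""
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(ASCII_FALLBACKS)


if __name__ == "__main__":
    print(to_ascii_punctuation("I am\u2014honestly\u2014disappointed at Google."))
    print(to_ascii_punctuation("He said: \u201cYes, I know.\u201d"))
```

This only degrades the text for the benefit of a limited tool; the page itself can still carry the typographically correct characters.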

Regarding _emphasis_, in HTML we should naturally express it using markup
such as <em> or <strong>. And Google translator _tries_ to transfer it to
the translation but often fails miserably. If I write
I am <em>honestly</em> disappointed at Google.
then Google produces
Je suis honnêtement déçu chez Google.
And in many cases, it manages to put the markup around some word in
the translation, but not the right one. Sometimes it puts the markup
there correctly, e.g.
I will do this <em>my</em> way =>
Je ferai de cette <em>ma</em> façon.

And, in fact, if I use clumsy style like
I will do this _my_ way
then Google translator gets even this right in a sense:
Je ferai de cette _ ma _ façon.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #40

On Sat, 10 Apr 2004 21:01:26 -0400, Neal <ne*****@spamrcn.com> wrote:

On Sat, 10 Apr 2004 21:55:43 +0000 (UTC), Jukka K. Korpela
<jk******@cs.tut.fi> wrote:
I just tested how
http://translate.google.com/translate_t
translates
I am--honestly--disappointed at Google.
into French when I use the em dash instead of "--":
J'am?honestly?disappointed chez Google.
(I guess everyone sees that something went fundamentally wrong here,
even if you don't know French.)
I think what's fundamentally wrong is that no one uses an em dash when
highlighting a word in English. There's typographic ways to express the
stress you intend, but none of those are going to be able to be parsed by
a translator at this still early time.

Why do you think that stress is intended? That is a parenthetical
expression, as are:

I am, honestly, disappointed at Google.
I am (honestly) disappointed at Google.

And in Europe the following form is also common, although I understand
it is considered incorrect in the US: "I am - honestly - disappointed
at Google", with the hyphen replaced by an en-dash when possible.

--
Stephen Poley

http://www.xs4all.nl/~sbpoley/webmatters/

Jul 20 '05 #42

On Sat, 10 Apr 2004 15:43:32 +0200, Eric B. Bednarz
<be*****@fahr-zur-hoelle.org> wrote:

because nobody who understands the implications
would use a remote service for validation purposes, but I digress).

OK, you've got me here. What implications are you referring to?

--
Stephen Poley

http://www.xs4all.nl/~sbpoley/webmatters/

Jul 20 '05 #44