How do I display character 151 (long hyphen) in XHTML (UTF-8)?

Zenobia

How do I display character 151 (long hyphen) in XHTML (UTF-8)?

Is there another character that will substitute? The W3C validation parser,
http://validator.w3.org, tells me that this character and the ones around it are illegal
- then, after resubmission it flags no errors.

So, are there any illegal characters between 0 and 255 in the UTF-8 character set or is it
just my imagination that the W3C validation parser thinks there are - say between 129-151,
or thereabouts; then later it changes its mind?

Jul 20 '05

Zenobia

On Sat, 10 Apr 2004 05:53:37 -0400, Harlan Messinger
<hm*******************@comcast.net> wrote:

Zenobia <5.**********@spamgourmet.com> wrote:
How do I display character 151 (long hyphen) in XHTML (utf-8) ?

Is there another character that will substitute? The W3C validation parser,
http://validator.w3.org, tells me that this character and the ones around it are illegal
- then, after resubmission it flags no errors.

So, are there any illegal characters between 0 and 255 in the UTF-8 character set or is it
just my imagination that the W3C validation parser thinks there are - say between 129-151,
or thereabouts; then later it changes its mind?
The characters between 128 and 159 are not valid in HTML--they are
Windows extensions to the character set. The "long hyphen" (em dash)
should be coded as &#8212;.

Cheers. I was confused (even more than you may think I was) by a
reference at the W3C site pointing to a page listing extended
characters. That page only had a small percentage of the
characters I needed and all those characters were specified
using the numeric convention. I was, therefore, under the
impression that UTF-8 encoding greatly restricted the
number of extended characters available. I shall tell the W3C
web-master about that.
See Jukka Korpela's page at

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

for information on the proper code to use for most of these
characters. For character 128, the Windows euro symbol, see

http://www.cs.tut.fi/~jkorpela/html/euro.html

Windows doesn't have characters for 129, 141, 143, 144, or 157. Jukka
left out the lower- and upper-case z-hacek at positions 158 and 142--I
don't know why!
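
As a rough illustration of the mapping those pages describe, the sketch below uses Python's built-in cp1252 codec to look up which Unicode character a Windows byte in the range 128-159 stands for, and the numeric character reference that would be written in HTML instead. Python is an assumption of this example (nobody in the thread uses it), and the helper name windows_byte_to_reference is hypothetical.

```python
def windows_byte_to_reference(byte_value: int):
    """Return (character, numeric reference) for a Windows-1252 byte.

    Returns None when the codec leaves that position unassigned.
    """
    try:
        char = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None  # unassigned position such as 129 or 157
    return char, f"&#{ord(char)};"


if __name__ == "__main__":
    # Print only ASCII so the output is safe on any console encoding.
    for value in range(128, 160):
        result = windows_byte_to_reference(value)
        print(value, "unassigned" if result is None else result[1])
```

For byte 151 this yields the reference &#8212;, which is the em dash discussed above.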

I must disagree with your recommendation on those pages. I think
it must be preferable to use the named entity convention
wherever possible. One can't start by bringing oneself down to
the level of a machine just because so much software has so many
bugs.

Jul 20 '05 #51

Alan J. Flavell

On Sun, 11 Apr 2004, Zenobia wrote:

Cheers. I was confused (even more than you may think I was) by a
reference at the W3C site pointing to a page listing extended
characters.
WHAT page? If you're going to talk in riddles, then you're not going
to get best help here. You seem very keen to criticise others, but
it's clear from your postings that you're still not up to speed on
understanding the issues yourself.
That page only had a small percentage of the characters I needed and
all those characters were specified using the numeric convention.
It's an entirely legal way to represent them.
I was, therefore, under the impression that UTF-8 encoding
greatly restricted the number of extended characters available.
You still seem to be "at sea" regarding the HTML character model.
Until you tackle that seriously, the time that you waste isn't only
your own.
I shall tell the W3C web-master about that.
You haven't even told -us- yet. If there really -is- anything wrong
at the W3C, then the other contributors here may be able to help you
formulate an appropriate report.

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

I must disagree with your recommendation on those pages.
You'll be in a position to do that when you understand the issues as
well as Jukka does.
I think it must be preferable to use the named entity convention
wherever possible. One can't start by bringing oneself down to the
level of a machine just because so much software has so many bugs.

Then use software which can produce an appropriate representation for
you, based on what you're prepared to type into it. HTML is defined
as an interworking format (i.e. what's sent out from server software to
client software, normally over HTTP). You don't necessarily have to
code the HTML by hand.

Jul 20 '05 #52

Stephen Poley

On Sun, 11 Apr 2004 09:33:40 +0100, Zenobia
<5.**********@spamgourmet.com> wrote:

The corresponding named entity is &mdash; - is it not. I will
use that naming convention so that I will be able to read my
source code. I've not come across an HTML editor that correctly
displays all extended characters in view mode. The primary
purpose of a character encoding scheme should be to make the
meaning of the source document transparent to the author. It is
a form of language.

Unless I can read the language it's useless to me. Nearly every
number I read is as meaningful as any other (although, over
time, I've come to remember the meaning of 65, 32 and other
landmark characters) - I have no intention of remembering what
200 or so extended characters code for. I can't see why anyone
should use the numeric convention when the named entity
convention is available too.

The one reason I know of is to support Netscape 4 - though that reason
is rapidly disappearing. Apparently there are still a few organisations
around where that is still considered important.

--
Stephen Poley

http://www.xs4all.nl/~sbpoley/webmatters/

Jul 20 '05 #54

Jukka K. Korpela

Zenobia <5.**********@spamgourmet.com> wrote:

The corresponding named entity is &mdash; - is it not.
It is. And it is less widely supported, partly because some old browsers
(like Netscape 4) don't get it right, partly because XHTML rules do not
require predefined entities to be supported (to put it simply) and some
browsers don't support them in XHTML mode.
I will
use that naming convention so that I will be able to read my
source code.
As you wish, but I gave you some points you might consider.

Besides, if readability of code is important, why don't you use a Unicode
capable editor where you can type the real em dash and see it as an em
dash?
I've not come across an HTML editor that correctly
displays all extended characters in view mode.
You will naturally need an editor and a suitable font, or fonts. But em
dashes shouldn't be a problem. For example, Unipad most probably shows
all the characters you need (but it's a text editor with nothing special
for HTML).
The primary
purpose of a character encoding scheme should be to make the
meaning of the source document transparent to the author.
The notations &#8212; and &mdash; are _not_ about character encoding
scheme - on the contrary, they work independently of encoding.
It is a form of language.
Calling notations "languages" tends to confuse and obscure more than it
could possibly illuminate. Even calling HTML a language is misleading.
I can't see why anyone
should use the numeric convention when the named entity
convention is available too.

Because the character reference works more robustly. (And the entity
names aren't really that mnemonic. They have been procrustinated to be at
most six characters long, and not particularly well designed. You would
still need to remember it's &mdash; and not the more natural &emdash; for
example.)
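
To make the relationship between the two notations concrete, here is a small sketch using Python's standard html module (my choice of tool, not something from the thread). It checks that the named entity and the numeric reference denote the same character, and that "emdash" is simply not a defined name in the HTML 4 entity table. The function name describe is hypothetical.

```python
import html
from html.entities import name2codepoint

EM_DASH = "\u2014"


def describe(name: str) -> str:
    """Report the numeric reference for an HTML 4 entity name, if one exists."""
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        return f"&{name}; is not a defined HTML 4 entity"
    return f"&{name}; is the same character as &#{codepoint};"


# Both notations resolve to the same Unicode character.
assert html.unescape("&mdash;") == html.unescape("&#8212;") == EM_DASH

print(describe("mdash"))
print(describe("emdash"))
```

The numeric form needs no entity table at all on the receiving side, which is one way to read the "more robust" argument above.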

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #56

Owen Jacobson

On Sun, 11 Apr 2004 07:17:52 +0000, Jukka K. Korpela wrote:

For further demonstration, consider the sentence:
He said: "Yes, I know."
When asked to translate into Spanish, Google translator produces
^^^^^^^
Er sagte: "ja, weiß ich."

Really?

--
Some say the Wired doesn't have political borders like the real world,
but there are far too many nonsense-spouting anarchists or idiots who
think that pranks are a revolution.

Jul 20 '05 #58

Daniel R. Tobias

Jukka K. Korpela wrote:

The recommendation may sound _very_ conservative these days, but as I
have mentioned elsewhere, it seems that Google translator cannot even
cope with the right single quotation mark, i.e. treats a word like
"don't" as untranslatable if a typographically correct character is
used instead of the ASCII apostrophe. I'm afraid we _still_ have
software lurking around that doesn't understand even the "Windows extra
characters" right.

How, exactly, do you transmit "typographically correct characters" in
Web forms? If you just paste them in from Microsoft bogusware, you'll
wind up with code points that are really control characters, if
ISO-8859-1 is used as the encoding (as it seems to be on the translation
page you cited).
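
The effect described here can be sketched in a few lines of Python (an illustrative choice; the variable names are hypothetical). The same byte that Windows-1252 uses for the right single quotation mark is a C1 control character when the receiver reads it as ISO-8859-1.

```python
import unicodedata

# The byte Windows software stores for the right single quotation mark.
right_quote_byte = "\u2019".encode("cp1252")

as_windows = right_quote_byte.decode("cp1252")      # read as Windows-1252
as_latin1 = right_quote_byte.decode("iso-8859-1")   # read as ISO-8859-1


def is_control(char: str) -> bool:
    """True when Unicode classifies the character as a control (category Cc)."""
    return unicodedata.category(char) == "Cc"


print(right_quote_byte.hex(), is_control(as_windows), is_control(as_latin1))
```

Only the interpretation changes; the byte on the wire is identical in both cases.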

--
== Dan ==
Dan's Mail Format Site: http://mailformat.dan.info/
Dan's Web Tips: http://webtips.dan.info/
Dan's Domain Site: http://domains.dan.info/

Jul 20 '05 #60

Tim

On Sun, 11 Apr 2004 07:17:52 +0000 (UTC),
"Jukka K. Korpela" <jk******@cs.tut.fi> posted:

The point is that Google translator does not understand the em dash
(or the right single quote) as a punctuation character at all.

I thought search engines generally ignore punctuation, completely? That
allows people to search for things like "don't" and it finds "don't" and
"dont", treating them as the same thing.

Some, allegedly, even go as far as translating words into another scheme
(rather like phonetics, though I've forgotten the proper name for the scheme), so
that spelling is almost irrelevant (it's used by some systems for name or
title searches, so a machine could correlate badly written terms supposedly
the same as a human could match a John Smith against a John Smythe, because
they haven't seen it written down).

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.

Jul 20 '05 #62

Jukka K. Korpela

Owen Jacobson <an******@lionsanctuary.net> wrote:

He said: "Yes, I know."
When asked to translate into Spanish, Google translator produces
^^^^^^^
Er sagte: "ja, weiß ich."

Really?

Yeah, it got so confused with the tests with punctuation that it was
completely Babelized. :-) Seriously, I tried different tests with
different languages, and copied & pasted the wrong text here, or typed
Spanish instead of German.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #64

Jukka K. Korpela

Tim <ti*@mail.localhost.invalid> wrote:

I thought search engines generally ignore punctuation, completely?
They don't.
That allows people to search for things like "don't" and it finds
"don't" and "dont", treating them as the same thing.
If you try those searches on Google, you will notice that the results are
quite different. Google treats
don't
apparently the same way as
don+t
or
"don t"
treating it as a sequence of two words "don" and "t" in succession and
coupled together.

But _translation_ is different. It is crucial, of course, to recognize
"don't" as a contracted form of "do not" when translating from English.
Some, allegedly, even go as far as translating words into another
scheme (rather like phonetics, though I've forgotten the proper name for the
scheme), so that spelling is almost irrelevant (it's used by some
systems for name or title searches, so a machine could correlate
badly written terms supposedly the same as a human could match a John
Smith against a John Smythe, because they haven't seen it written
down).

That's a different issue, and language-dependent. Google seems to perform
some fuzzy searches, even suggesting "Did you mean ...?". I don't know
the specifics, but I guess the techniques are primitive and brute force
(which is often effective) rather than phonetic matching, for example,
since Google really hasn't got information about phonetics.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #66

Jukka K. Korpela

"Daniel R. Tobias" <da*@tobias.name> wrote:

How, exactly, do you transmit "typographically correct characters" in
Web forms? If you just paste them in from Microsoft bogusware,
you'll wind up with code points that are really control characters,
if ISO-8859-1 is used as the encoding (as it seems to be on the
translation page you cited).

The page http://www.google.com/language_tools appears as UTF-8 encoded to
my IE 6, ISO-8859-1 encoded to my Lynx. Presumably the server applies
content negotiation.

What happens in cut & paste is more or less obscure, but there's the
option of composing a Web page and submitting it to the validator.

This made me carry out a very simple test, despite the fact that I
virtually knew the results would make me crazy:

I don't know.
I don&#8217;t know.
I don&#146;t know.

translated from English to German gives

Ich weiß nicht.
Dont I wissen.
Ich weiß nicht.

That is, the undefined and hence incorrect reference &#146; produces
better results than the correct reference &#8217;.

And using the right single quotation mark in UTF-8 encoding fails too,
giving the same mistranslation as the use of &#8217;.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #68

Andreas Prilop

On Mon, 12 Apr 2004, Jukka K. Korpela wrote:

The page http://www.google.com/language_tools appears as UTF-8 encoded to
my IE 6, ISO-8859-1 encoded to my Lynx. Presumably the server applies
content negotiation.

Use
<http://www.google.com/language_tools?oe=UTF-8>
<http://www.google.com/language_tools?oe=ISO-8859-1>
<http://www.google.com/language_tools?oe=Windows-1252>
etc.

Jul 20 '05 #70

Pierre Goiffon

"Zenobia" <5.**********@spamgourmet.com> wrote in message
news:4v********************************@4ax.com

the encoding has been changed to
UTF-8 to make it XHTML compliant.

?!!
You can use any encoding you want in XHTML; you just have to encode your
document correctly and send the client the right encoding information.
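
A minimal sketch of that idea in Python (again an assumed tool; the function name serialize and the sample fragment are hypothetical). The same markup is written out in two encodings, each declared in the XML declaration, and any character the chosen encoding cannot represent falls back to a numeric character reference.

```python
BODY = "<p>Long hyphen: \u2014</p>"


def serialize(body: str, encoding: str) -> bytes:
    """Encode an XHTML fragment and declare the encoding in the XML declaration.

    Characters the target encoding cannot represent are written as
    numeric character references instead of raising an error.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    return (declaration + body).encode(encoding, errors="xmlcharrefreplace")


utf8_bytes = serialize(BODY, "utf-8")         # em dash stored as three bytes
latin1_bytes = serialize(BODY, "iso-8859-1")  # em dash written as &#8212;
print(latin1_bytes.decode("ascii"))
```

This only covers the declaration inside the document. When the page is served over HTTP, the charset sent in the Content-Type header has to agree with it as well.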

Jul 20 '05 #72

Zenobia

On Sun, 11 Apr 2004 10:19:02 +0100, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:

On Sun, 11 Apr 2004, Zenobia wrote:
Cheers. I was confused (even more than you may think I was) by a
reference at the W3C site pointing to a page listing extended
characters.

WHAT page? If you're going to talk in riddles, then you're not going
to get best help here. You seem very keen to criticise others, but
it's clear from your postings that you're still not up to speed on
understanding the issues yourself.

I think it must have been this page:

http://www.w3.org/MarkUp/html3/latin1.html

After submitting to the validator. Getting a character error
(http://validator.w3.org/check) and clicking on the explain link
(http://validator.w3.org/docs/errors.html#bad-char) there is a
link to that page. Note that the list of entities does not
include entities such as &alpha;.

That page only had a small percentage of the characters I needed and
all those characters were specified using the numeric convention.

It's an entirely legal way to represent them.

Ha. But entirely useless because it's meaningless. Not for
nothing did computer programmers move from machine code to
assembler and then to 3rd generation languages.

I think it must be preferable to use the named entity convention
wherever possible. One can't start by bringing oneself down to the
level of a machine just because so much software has so many bugs.

Then use software which can produce an appropriate representation for
you, based on what you're prepared to type into it.

Not Adobe GoLive.

Jul 20 '05 #74

Alan J. Flavell

On Sat, 17 Apr 2004, Zenobia wrote:

On Sun, 11 Apr 2004 10:19:02 +0100, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
WHAT page? If you're going to talk in riddles, then you're not going
to get best help here.
I think it must have been this page:

http://www.w3.org/MarkUp/html3/latin1.html

You're looking here at a piece of history - a never-finished working
draft for a never-completed "HTML/3.0": the document was last-changed
in 1995, and - despite its considerable interest to historians of HTML
- has no direct relevance to current HTML versions.

See http://www.w3.org/MarkUp/html3/ for further details.
After submitting to the validator. Getting a character error
(http://validator.w3.org/check) and clicking on the explain link
(http://validator.w3.org/docs/errors.html#bad-char) there is a
link to that page. Note that the list of entities does not
include entities such as &alpha;.

The page cited above fails validation for some quite different
reason(s). I suspect you tried to validate some other page than the
one which you cited above. Never mind.

That page only had a small percentage of the characters I needed and
all those characters were specified using the numeric convention.

It's an entirely legal way to represent them.

Ha. But entirely useless because it's meaningless.

You have a point, but markup isn't designed to be read by the end
user, but to be rendered (or otherwise processed) by the client agent
(browser, indexer etc.). If the browsers implement numeric character
references (which most did) and some omitted to implement character
entities, then it's more important to use what the browsers
implemented, than to fuss about whether they have mnemonic value.

Then use software which can produce an appropriate representation for
you, based on what you're prepared to type into it.

Not Adobe GoLive.

Why not? The results can be converted "by rote" into any valid
character representation form; it's not hard. The conversion could be
bundled with your other routine QA tests (validation etc.) that are
performed as part of your final publish-to-server procedure (these can
be defined in a makefile, for example).
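
One way such a rote conversion step might look, as a hedged sketch rather than a recommendation of any particular tool: the Python function below (the name named_to_numeric is hypothetical) rewrites HTML 4 named entities as numeric character references, so the author can keep typing readable names while the published file carries the numeric form. It assumes plain markup and makes no attempt to skip comments, CDATA sections, or script content.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Replace known HTML 4 named entities with numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or XML-predefined names alone
        return f"&#{name2codepoint[name]};"

    return ENTITY_PATTERN.sub(replace, markup)


print(named_to_numeric("Wait &mdash; what about &alpha; &amp; &emdash;?"))
```

A script like this could be invoked from the same makefile target that runs validation before upload.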

In saying that, I'm not specifically recommending any particular
HTML-extruding software, neither GoLive nor anything else. I'm just
addressing your apparent belief that if you are using a particular
piece of software already, then it somehow disqualifies you from doing
other kinds of routine processing to your documents.

have fun

Jul 20 '05 #76
