Standard character attributes for Hebrew?

Shmuel (Seymour J.) Metz

I'd like to include some Hebrew names in a web page. HTML 4 doesn't
appear to include character attributes for ISO-8859-8. I'd prefer
avoiding numeric references, e.g.,
"שמואל". Is there currently a
standardized set of character attributes for Hebrew? If so, is there a
downloadable set of definitions for those attributes?

Thanks.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Mar 12 '06 #1

Alan J. Flavell

On Sun, 12 Mar 2006, Shmuel (Seymour J.) Metz wrote:

I'd like to include some Hebrew names in a web page. HTML 4 doesn't
appear to include character attributes for ISO-8859-8.

I'm sorry to say you seem to be extensively confuddled about character
representation in HTML. Really, the character *encoding* of
iso-8859-8 has nothing to do with this, unless you are providing
characters which are actually encoded in 8859-8.

In general you have three ways to represent characters in HTML:

1. the character itself, in the character encoding which you are using

2. numeric character reference (&#number;), in either decimal or hex,

3. if available (but in this case they are not available), named
character entities.
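
To make the three options concrete, here is a minimal Python sketch (not from the thread) that shows the same letter, alef, in each form. Using Python's standard `html` module as a stand-in for an HTML parser is an assumption made purely for illustration; no browser is involved.

```python
import html

ALEF = "\u05d0"  # HEBREW LETTER ALEF

# 1. The character itself, as bytes in a chosen document encoding.
as_utf8 = ALEF.encode("utf-8")
as_8859_8 = ALEF.encode("iso-8859-8")

# 2. Numeric character references, decimal and hexadecimal.
decimal_ncr = f"&#{ord(ALEF)};"
hex_ncr = f"&#x{ord(ALEF):X};"

# Both references decode back to the same character.
assert html.unescape(decimal_ncr) == ALEF
assert html.unescape(hex_ncr) == ALEF

# 3. Named entities: HTML 4 defines none for the Hebrew letters.
# &alefsym; exists, but it is a letterlike symbol, not the letter itself.
alefsym = html.unescape("&alefsym;")

print(as_utf8, as_8859_8, decimal_ncr, hex_ncr, hex(ord(alefsym)))
```

Whichever form is used in the source, a conforming parser ends up with the same character; the forms differ only in what the author has to type and what the file's encoding must be able to carry.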

I don't know what you think you mean by "character attributes".

I'd prefer avoiding numeric references, e.g.,
"שמואל".

Choose one of the above options. There aren't any other.

good luck

Mar 12 '06 #2

Jukka K. Korpela

"Alan J. Flavell" <fl*****@physics.gla.ac.uk> wrote:

I don't know what you think you mean by "character attributes".

My guess is "character references", the W3C-endorsed misnomer for entities
with definitions that expand an entity reference to a character reference.
Things like &alpha;. They do not exist for Hebrew characters in HTML or in
SGML. (&alefsym; is not an exception: it denotes U+2135 (decimal 8501),
which is a letterlike symbol, not a letter.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Mar 12 '06 #3

Andreas Prilop

On Sun, 12 Mar 2006, Shmuel (Seymour J.) Metz wrote:

I'd like to include some Hebrew names in a web page. HTML 4 doesn't
appear to include character attributes for ISO-8859-8.

"Character attributes"? What's that?

I'd prefer avoiding numeric references, e.g.,
"שמואל".

Why?
Well, actually you should prefer decimal character references
http://www.unics.uni-hannover.de/nht...l2.html#hebrew
because of wider support in browsers and other programs (e.g.
StarOffice 7).

Is there currently a
standardized set of character attributes for Hebrew? If so, is there a
downloadable set of definitions for those attributes?

Sorry, I don't understand what you mean.
To encode Hebrew characters as in
http://www.unics.uni-hannover.de/nhtcapri/hebrew.html8
http://www.unics.uni-hannover.de/nhtcapri/hebrew.win
is only half the job. Read
http://ppewww.ph.gla.ac.uk/~flavell/...direction.html
http://www.unics.uni-hannover.de/nht...onal-text.html
to learn how to mark up your right-to-left text.

--
All free men, wherever they may live, are citizens of Denmark.
And therefore, as a free man, I take pride in the words "Jeg er dansker!"

Mar 13 '06 #4

Shmuel (Seymour J.) Metz

In <Pi*******************************@ppepc62.ph.gla. ac.uk>, on
03/12/2006
at 08:18 PM, "Alan J. Flavell" <fl*****@physics.gla.ac.uk> said:

I'm sorry to say you seem to be extensively confuddled about
character representation in HTML.

No, just the nomenclature.

Really, the character *encoding* of iso-8859-8 has nothing to do
with this,

Nor did I say anything about the character encoding. Some character
set standards define standard names as well as standard encodings.

In general you have three ways to represent characters in HTML:

I already knew that; I was just confused about the nomenclature.

1. the character itself, in the character encoding which you are
using

The character encoding that I am using is ISO 8859-1; it doesn't have
the Hebrew characters. Unfortunately, my preferred editor doesn't
support UTF-8.

2. numeric character reference (&#number;), in either decimal or
hex,

That's what I am currently using and what I wish to avoid.

3. if available (but in this case they are not available), named
character entities.

*That* is what I was asking about.

I don't know what you think you mean by "character attributes".

Named character entities.

Choose one of the above options.

I already did, your option three. What I was hoping was that there had
been an additional standard since HTML 4 to add named character
entities for the Hebrew characters, old enough to be picked up by the
major browsers. If there is no such standard then I'll just have to
code entity declarations using my own names.

Mar 14 '06 #5

Shmuel (Seymour J.) Metz

In <Xn*****************************@193.229.4.246>, on 03/12/2006
at 08:26 PM, "Jukka K. Korpela" <jk******@cs.tut.fi> said:

They do not exist for Hebrew characters in HTML or in
SGML.

What about in XML and in XHTML?

Mar 14 '06 #6

Shmuel (Seymour J.) Metz

In
<Pi**************************************@s5b004.r rzn.uni-hannover.de>,
on 03/13/2006
at 02:19 PM, Andreas Prilop <nh******@rrzn-user.uni-hannover.de>
said:

"Character attributes"? What's that?

Named character entities.

Why?

Readability of the HTML source.

Well, actually you should prefer decimal character references

De gustibus non disputandum est. I regard that "should" the same way
that I would regard "Well, actually you should prefer machine
language"; it forces me to do work that the computer is better suited
to do.

Sorry, I don't understand what you mean.

Is there a more recent standard than HTML or 8859-8 that specifies
named character entities for the Hebrew letters?

To encode Hebrew characters as in
http://www.unics.uni-hannover.de/nhtcapri/hebrew.html8
http://www.unics.uni-hannover.de/nhtcapri/hebrew.win
is only half the job.

Partially true. At least one browser renders them properly without any
bidi markup other than the usual <meta http-equiv="content-type"
content="text/html; charset=UTF-8">. Are you saying that some browsers
require additional bidi markup even when using Unicode?

Read
http://ppewww.ph.gla.ac.uk/~flavell/...direction.html
http://www.unics.uni-hannover.de/nht...onal-text.html
how to mark-up your right-to-left text.

Will do.

Mar 14 '06 #7

Alan J. Flavell

On Tue, 14 Mar 2006, Shmuel (Seymour J.) Metz wrote, quoting me:

Really, the character *encoding* of iso-8859-8 has nothing to do
with this,

Nor did I say anything about the character encoding.

With respect: you mentioned iso-8859-8, and, as far as HTML is
concerned, that *is* a character encoding.

The character encoding that I am using is ISO 8859-1; it doesn't have
the Hebrew characters. Unfortunately, my preferred editor doesn't
support UTF-8.

Editing mixed-direction content isn't a bundle of fun in any editor,
by the way.

What I was hoping was that there had been an additional standard
since HTML 4 to add named character entities for the Hebrew
characters,

Well, only in theory. It wouldn't be feasible on the web today.

old enough to be picked up by the major browsers.

No. Whatever you do with your internal editing procedures (e.g. your
own private character entity names, maybe derived from some naming
convention that you got from a different context) would need to be
programmatically converted into one of the above-mentioned forms
(either encoded characters, or numerical character references) for
actually publishing to the web. While you're doing that, you might be
advised to convert the output encoding to utf-8, especially if you
aim to write <form...>s that can be used for submitting i18n matter.

It's certainly do-able, possibly with XML-based tools or with whatever
programmatic process you're comfortable with. But some kind of
*process* in between your authored document, and what you publish to
the web, is inevitable, assuming you want the result to work in a
current web context, rather than only under specialised conditions.

best answer I can offer, sorry

Mar 14 '06 #8

Andreas Prilop

On Tue, 14 Mar 2006, Shmuel (Seymour J.) Metz wrote:

Well, actually you should prefer decimal character references

De gustibus non disputandum est.

I wrote about support in browsers and other programs -
not about your personal taste.

Is there a more recent standard than HTML or 8859-8 that specifies
named character entities for the Hebrew letters?

Suppose there is: which browser(s) would know them?
(No, there isn't.)

other than the usual <meta http-equiv="content-type"
content="text/html; charset=UTF-8">.

This <meta> voodoo is not usual but superfluous.
Specify the encoding (charset) in the HTTP header:
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html

Are you saying that some browsers
require additional bidi markup even when using Unicode?

Yes, examples given here:

http://ppewww.ph.gla.ac.uk/~flavell/...direction.html
http://www.unics.uni-hannover.de/nht...onal-text.html
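
On the earlier point in this post about declaring the charset in the HTTP header rather than in <meta>: the following Python sketch illustrates the idea. It only builds and parses a header value with the standard `email.message` module; the sample body and the variable names are assumptions, and no real server or browser is involved.

```python
from email.message import Message

# A page body containing Hebrew letters, encoded as UTF-8 for transmission.
body = "<p>\u05d0 \u05d1 \u05d2</p>".encode("utf-8")

# The encoding is declared in the HTTP Content-Type header, not in <meta>.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

# A client reads the charset parameter and uses it to decode the bytes.
charset = headers.get_content_charset()
text = body.decode(charset)

print(charset, text)
```

The point of the sketch is only that the receiver learns the encoding before it reads a single byte of the document, which is what the header-based approach offers over a declaration inside the document.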

Mar 14 '06 #9

Andreas Prilop

On Tue, 14 Mar 2006, Shmuel (Seymour J.) Metz wrote:

The character encoding that I am using is ISO 8859-1; it doesn't have
the Hebrew characters.

Then use ISO-8859-8
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s4
or numeric character references.
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

Unfortunately, my preferred editor doesn't support UTF-8.

"My preferred tool is the hammer. How do I turn screws
with my hammer?"

Mar 14 '06 #10

Harlan Messinger

Shmuel (Seymour J.) Metz wrote:

Partially true. At least one browser renders them properly without any
bidi markup other than the usual <meta http-equiv="content-type"
content="text/html; charset=UTF-8">. Are you saying that some browsers
require additional bidi markup even when using Unicode?

I would hope so. Just because I've thrown in a few Hebrew characters
doesn't mean the browser should assume a change in direction. If I write,

The first three letters of the Hebrew alphabet are [alef] [bet] [gimel].

with the actual letters substituted for the bracketed words, I don't
want the browser placing the gimel to the left and the alef to the
right. And this is a pain, because I was going to produce this
illustration with the actual letters, but as soon as I copied in an alef
from the Windows Character Map application, Thunderbird went into RTL
mode, leaving me with no idea how to ensure the intended appearance on
your end.

Mar 14 '06 #11

Alan J. Flavell

On Tue, 14 Mar 2006, Harlan Messinger wrote:

I would hope so. Just because I've thrown in a few Hebrew characters
doesn't mean the browser should assume a change in direction.

You seem to have walked right into a problem. The behaviour of HTML
in relation to RTL writing systems is not only fairly well defined, it
is also fairly well implemented in browsers. If you don't want RTL
behaviour in HTML, when the characters themselves are RTL, then you
damned-well have to demand it - in HTML, that's with <bdo> markup.

[...] And this is a pain, because I was going to produce this illustration
with the actual letters, but as soon as I copied in an alef from the
Windows Character Map application, Thunderbird went into RTL mode,
leaving me with no idea how to ensure the intended appearance on
your end.

If you're composing plain-text, then the HTML rules will not help you.
But that would be off-topic for the current group.

regards

Mar 14 '06 #12

Jukka K. Korpela

"Shmuel (Seymour J.) Metz" <sp******@library.lspace.org.invalid> wrote:

They do not exist for Hebrew characters in HTML or in
SGML.

What about in XML and in XHTML?

Even less. XML is a strongly simplified version of SGML, and XHTML is an
XMLized variant of HTML. In fact, an XHTML processor is not even required to
process external entity declarations, which means that it need not even
interpret "predefined" entity references like &alpha;.

Mar 14 '06 #13

Andreas Prilop

On Tue, 14 Mar 2006, Harlan Messinger wrote:

The first three letters of the Hebrew alphabet are [alef] [bet] [gimel].

with the actual letters substituted for the bracketed words, I don't
want the browser placing the gimel to the left and the alef to the
right.

You need to write

The first three letters of the Hebrew alphabet are
<bdo dir=ltr>א ב ג</bdo>

Be sure to read
http://ppewww.ph.gla.ac.uk/~flavell/...direction.html
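
Why the browser reorders the letters in the first place can be seen from the Unicode character properties. The following Python sketch (an illustration, not part of the original posts) looks up the bidirectional class of a few characters with the standard `unicodedata` module; the helper name `bidi_classes` is made up here.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, bidi class) pairs for each character."""
    return [(unicodedata.name(ch), unicodedata.bidirectional(ch)) for ch in text]

# Hebrew letters carry the strong right-to-left class "R";
# Latin letters carry the strong left-to-right class "L".
for name, cls in bidi_classes("\u05d0\u05d1\u05d2ab"):
    print(f"{cls:>2}  {name}")
```

Because the letters themselves are class R, a run of them is laid out right to left unless markup such as <bdo dir=ltr> overrides it.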

Mar 15 '06 #14

Harlan Messinger

Alan J. Flavell wrote:

On Tue, 14 Mar 2006, Harlan Messinger wrote:

I would hope so. Just because I've thrown in a few Hebrew characters
doesn't mean the browser should assume a change in direction.

You seem to have walked right into a problem. The behaviour of HTML
in relation to RTL writing systems is not only fairly well defined, it
is also fairly well implemented in browsers. If you don't want RTL
behaviour in HTML, when the characters themselves are RTL, then you
damned-well have to demand it - in HTML, that's with <bdo> markup.

Well, I wasn't aware of that, but I'm also surprised because I remember
an Israeli web site, in Hebrew, that was fine in IE and backwards in
Firefox. I wonder how they managed that--unless they did something
funny, given the discrepancy I would think Firefox would have been the
one to get it right.

Mar 15 '06 #15

Alan J. Flavell

On Wed, 15 Mar 2006, Harlan Messinger wrote:

Alan J. Flavell wrote:
fairly well implemented in browsers. If you don't want RTL
behaviour in HTML, when the characters themselves are RTL, then
you damned-well have to demand it - in HTML, that's with <bdo>
markup.

Well, I wasn't aware of that, but I'm also surprised because I
remember an Israeli web site, in Hebrew, that was fine in IE and
backwards in Firefox.

You don't need me to tell you that the prima facie assumption is
that Firefox/Mozilla do what the spec tells them to do, whereas
IE tends to do what it guesses the misguided author might have
intended.

I wonder how they managed that

Without a URL, I'm not going to start to make any serious guess; but
A. Prilop reminded me recently that earlier versions of Opera
(7.something) treated charset=iso-8859-8 as meaning "visual Hebrew",
although the specs (and informed commentaries such as
http://www.nirdagan.com/hebrew/standards ) don't support that
assumption. Also with IE there'd be the question of whether RTL
support had been enabled in the OS.

--unless they did something funny, given the discrepancy I would
think Firefox would have been the one to get it right.

Oh, right, you said the same thing as I did, really.

Mar 15 '06 #16

Andreas Prilop

On Wed, 15 Mar 2006, Harlan Messinger wrote:

I remember an Israeli web site,

URL?

in Hebrew, that was fine in IE and backwards in Firefox.

You can safely assume, whenever such a thing happens, that Mozilla/
Firefox is right and Internet Exploder is wrong. Here it means
that the author/webmaster has actually produced backwards text -
maybe intentionally so that IE displays it as intended.

I wonder how they managed that

URL?

Mar 15 '06 #17

Harlan Messinger

Andreas Prilop wrote:

On Wed, 15 Mar 2006, Harlan Messinger wrote:
I remember an Israeli web site,

URL?

Their own site's Hebrew pages seem to have disappeared, but here's an
example of a page generated by the same content management tool:

http://www.gal-dogs.co.il/Dev2Go.web?Anchor=Page7

EVERYTHING's backwards in Firefox.

Mar 15 '06 #18

Andreas Prilop

On Wed, 15 Mar 2006, Harlan Messinger wrote:

http://www.gal-dogs.co.il/Dev2Go.web?Anchor=Page7

| <META NAME="GENERATOR" CONTENT="iDune Dev2Go">

Have you looked at the source?
I surrender! Sorry!

Mar 15 '06 #19

Alan J. Flavell

On Wed, 15 Mar 2006, Andreas Prilop wrote:

On Wed, 15 Mar 2006, Harlan Messinger wrote:
http://www.gal-dogs.co.il/Dev2Go.web?Anchor=Page7

| <META NAME="GENERATOR" CONTENT="iDune Dev2Go">

Have you looked at the source?

What a sorry mess!

In the time that I'm prepared to devote to looking at this, I honestly
cannot work out just how this misguided author has contrived to
confuddle MSIE to the extent that it displays in the way that he
evidently intended it - this despite the fact that displaying RTL text
"to specification" is one of the subset of web-like activities that
MSIE /does/ seem to be capable of, when given a reasonable
opportunity.

"il.co.link.www", indeed :-{

Mar 15 '06 #20

Henri Sivonen

In article <Xn*****************************@193.229.4.246>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:

In fact, an XHTML processor is not even required to
process external entity declarations, which means that it need not even
interpret "predefined" entity references like &alpha;.

The predefined entities are: &lt;, &gt;, &amp;, &quot; and &apos;.

&alpha; is not predefined. It is just defined in the XHTML DTDs.
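
The distinction can be checked with any XML parser that does not read the DTD. The sketch below uses Python's standard `xml.etree.ElementTree` as one such parser; that choice, and the sample markup, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities parse without any DTD.
ok = ET.fromstring("<p>&lt; &gt; &amp; &quot; &apos;</p>")
print(ok.text)

# &alpha; is declared only in the XHTML DTDs, so a parser that does not
# read them reports an undefined entity.
try:
    ET.fromstring("<p>&alpha;</p>")
    alpha_parsed = True
except ET.ParseError as err:
    alpha_parsed = False
    print("parse error:", err)
```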

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Mar 16 '06 #21

Andreas Prilop

On Wed, 15 Mar 2006, Alan J. Flavell wrote:

| <META NAME="GENERATOR" CONTENT="iDune Dev2Go">
Have you looked at the source?

What a sorry mess!

This mess comes from http://www.idune.com/ .

Mar 16 '06 #22

Alan J. Flavell

On Thu, 16 Mar 2006, Andreas Prilop wrote:

This mess comes from http://www.idune.com/ .

Hmmm, see that broken image icon[1]. If I ask the browser to view the
image (N.B. *image*) then it attempts to retrieve the URL

http://www.idune.com/iDuneDownload.d...chor=&ext=.gif

The server HTTP response appears to be:

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 16 Mar 2006 16:12:39 GMT

i.e. it claims that the transaction was conducted to HTTP/1.1 protocol
and was successful - *but* there is no HTTP Content-type header.

The response body appears to be quasi-HTML - not an image at all,
which says (modulo linebreaks):

<HTML><BODY>The reason can be one of the following: * File not
found. * Access is denied. * Session was lost (in this case,
return to the web site and then click on the file again). *
Other... </BODY></HTML>

Just how many web standards did that break, in total?

And then there's their white text displayed on my default background
colour (which *could* well have been white).

And that's before we even *start* on the previously reported real
problem :-(

I'll bet they make *lots* of money.

[1] Oh, the reason for the broken images is doubtless that I routinely
reject cookies when I'm offered no explanation of what benefit I'm
going to get from them.

Mar 16 '06 #23

Alan J. Flavell

On Thu, 16 Mar 2006, Henri Sivonen wrote:

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
In fact, an XHTML processor is not even required to process
external entity declarations, which means that it need not even
interpret "predefined" entity references like &alpha;.

The predefined entities are: &lt;, &gt;, &amp;, &quot; and &apos;.

&alpha; is not predefined. It is just defined in the XHTML DTDs.

OK; but none of this addresses the plaintiff's wishes.

As I understood it (and I might even say "understandably"), he wanted
to edit the source using some kind of mnemonics for the Hebrew
letters, instead of either the actual characters (which, as I remarked,
are hard to edit in a mixed-direction text) or the numerical character
references (which aren't easy to remember).

We all agree that it wouldn't be feasible, on the web as we find it,
to send out documents with arbitrary character entities such as
&aleph; , &beth; etc. in them, *even* if we included the definitions
in the document.

I haven't done this myself, but I don't really see why one couldn't
*code* them that way in an editor, and then use some process which
could turn them optionally into the coded character /or/ the numerical
character reference, according to one's current preference. It could
be XML-based, couldn't it? And produce (X)HTML as its output for
publishing on the web?
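
A minimal sketch of such a process, in Python and without any XML tooling, might look like the following. The mnemonic names in `PRIVATE_ENTITIES` and the function name `expand` are hypothetical: they are private authoring conventions invented for this example, not standard HTML entities.

```python
import re

# Hypothetical private mnemonics; these names are NOT standard HTML entities.
PRIVATE_ENTITIES = {
    "aleph": "\u05d0",
    "beth": "\u05d1",
    "gimel": "\u05d2",
}

ENTITY_RE = re.compile(r"&([A-Za-z]+);")

def expand(source, as_ncr=True):
    """Replace private entity references before publishing.

    With as_ncr=True the output uses decimal numeric character references;
    otherwise it contains the characters themselves. Names that are not in
    the private table, such as &amp;, are left untouched.
    """
    def replace(match):
        char = PRIVATE_ENTITIES.get(match.group(1))
        if char is None:
            return match.group(0)
        return f"&#{ord(char)};" if as_ncr else char
    return ENTITY_RE.sub(replace, source)

authored = "<p>&aleph; &beth; &amp; &gimel;</p>"
print(expand(authored))
print(expand(authored, as_ncr=False))
```

The `as_ncr` switch corresponds to the "according to one's current preference" choice: references if the published file stays in ISO-8859-1, real characters if it is published as UTF-8.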

Mar 16 '06 #24

Ian Rastall

Alan J. Flavell wrote:

I don't really see why one couldn't
*code* them that way in an editor, and then use some process which
could turn them optionally into the coded character /or/ the numerical
character reference

Hey Alan. I'm in over my head in this discussion, but I will say that
one way I worked with Chinese and Russian etext was to copy the text
into MS Word, then save as HTML, and then extract the text part of the
document, which was then in the correct numerical character references
(UTF-8).

Ian
--
http://sundry.ws

Mar 16 '06 #25

Henri Sivonen

In article <Pi*******************************@ppepc55.ph.gla. ac.uk>,
"Alan J. Flavell" <fl*****@physics.gla.ac.uk> wrote:

I haven't done this myself, but I don't really see why one couldn't
*code* them that way in an editor, and then use some process which
could turn them optionally into the coded character /or/ the numerical
character reference, according to one's current preference. It could
be XML-based, couldn't it? And produce (X)HTML as its output for
publishing on the web?

Sure. All it takes is an XML serializer connected to an XML parser that
resolves external entities.
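
As a rough illustration of that parser-plus-serializer idea, here is a Python sketch using the standard `xml.etree.ElementTree` module. Note the assumption: the entities are declared in the document's internal DTD subset rather than in an external file, because this particular parser does not fetch external entities by default. The entity names are again hypothetical.

```python
import xml.etree.ElementTree as ET

# Authored source: the private entities are declared in the internal DTD subset.
authored = """<!DOCTYPE p [
  <!ENTITY aleph "&#x5D0;">
  <!ENTITY beth "&#x5D1;">
]>
<p>&aleph; &beth;</p>"""

# The parser resolves the entity references to real characters.
root = ET.fromstring(authored)

# Serializing as US-ASCII forces numeric character references;
# serializing as "unicode" keeps the characters themselves.
as_ncr = ET.tostring(root, encoding="us-ascii")
as_chars = ET.tostring(root, encoding="unicode")

print(as_ncr)
print(as_chars)
```

Either output contains no private entity names any more, so it can be published without relying on browsers to read the declarations.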

Mar 16 '06 #26

Andreas Prilop

On Thu, 16 Mar 2006, Ian Rastall wrote:

Hey Alan. I'm in over my head in this discussion, but I will say that
one way I worked with Chinese and Russian etext was to copy the text
into MS Word, then save as HTML,

Ouch! I would rather use Mozilla Composer for such a task. It will
give you UTF-8-encoded text if you choose UTF-8 or it will give you
numeric character references if you choose ISO-8859-1.

and then extract the text part of the
document, which was then in the correct numerical character references
(UTF-8).

You seem to confuse two things:
- numeric character references
http://www.unics.uni-hannover.de/nht...ilingual2.html

- UTF-8-encoded characters
http://www.unics.uni-hannover.de/nht...ilingual1.html

Mar 17 '06 #27

Ian Rastall

Andreas Prilop wrote:

Ouch! I would rather use Mozilla Composer for such a task. It will
give you UTF-8-encoded text if you choose UTF-8 or it will give you
numeric character references if you choose ISO-8859-1.

Not a bad idea. I'll look into it.

You seem to confuse two things:
- numeric character references
- UTF-8-encoded characters

I can never understand encoding issues. Here's what I meant. Say, like
me, you wanted to put some Tolstoy up in Russian. You find the public
domain etext with the Russian characters, highlight and copy the
section. Paste that into a text editor, and you get gobbledy-gook.
Paste it into Word, and you get the actual Russian characters. Now
just ask Word to save as HTML with UTF-8 encoding, and your HTML now
looks like:

&#12345;&#12348;&#12847;

etc.

I haven't tried it with a WYSIWYG editor, like Composer, or
Dreamweaver. I imagine it would work that way, too. Keep in mind, I
believe in hand-coding, so using Word is just a preliminary step in
order to get the right numerical character references.

BTW, I will definitely check out those links. I really wish I could
conquer the character code beast.

Ian

Mar 17 '06 #28

Andreas Prilop

On Fri, 17 Mar 2006, Ian Rastall wrote:

Paste it into Word, and you get the actual Russian characters. Now
just ask Word to save as HTML with UTF-8 encoding, and your HTML now
looks like:
&#12345;&#12348;&#12847;

These are numeric character references. Although the numbers
refer to Unicode,
http://ppewww.ph.gla.ac.uk/~flavell/...unidata04.html
they have nothing to do with the encoding UTF-8.

I don't know what MS Word does (only that its HTML code is
a perversion of logical markup); but Mozilla Composer will give
these numeric references when you choose West European ISO-8859-1
as encoding. When you choose Unicode UTF-8 as encoding, it will
give you the actual UTF-8-encoded characters.
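
The same distinction can be reproduced outside any editor. In this Python sketch (an illustration only; it says nothing about how Composer or Word work internally) the same two Cyrillic letters are written out once for an ISO-8859-1 file and once for a UTF-8 file.

```python
text = "\u0414\u0430"  # two Cyrillic letters, "Да"

# Saved as ISO-8859-1: the letters do not exist in that encoding, so they
# have to be written as numeric character references.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Saved as UTF-8: the letters are encoded directly, two bytes each.
utf8_bytes = text.encode("utf-8")

print(latin1_bytes)
print(utf8_bytes)
```

The numbers inside the references are Unicode code points in decimal; the UTF-8 bytes are a different thing entirely, which is why the two should not be called by the same name.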

See the source texts of
http://www.unics.uni-hannover.de/nht...ilingual1.html
http://www.unics.uni-hannover.de/nht...ilingual2.html
for the difference.

Mar 17 '06 #29

Harlan Messinger

Ian Rastall wrote:

Andreas Prilop wrote:
Ouch! I would rather use Mozilla Composer for such a task. It will
give you UTF-8-encoded text if you choose UTF-8 or it will give you
numeric character references if you choose ISO-8859-1.

Not a bad idea. I'll look into it.

You seem to confuse two things:
- numeric character references
- UTF-8-encoded characters

I can never understand encoding issues. Here's what I meant. Say, like
me, you wanted to put some Tolstoy up in Russian. You find the public
domain etext with the Russian characters, highlight and copy the
section. Paste that into a text editor, and you get gobbledy-gook. Paste
it into Word, and you get the actual Russian characters. Now just ask
Word to save as HTML with UTF-8 encoding, and your HTML now looks like:

&#12345;&#12348;&#12847;

Then the characters that are in the source document--the &, #, 1, 2, 3,
etc.-- are all encoded using UTF-8. The document contains no Russian
characters at all, just their HTML numeric representations. The point of
the &#uuuu; representation is that it allows you to include Unicode
characters in the HTML parser's output without having to encode them,
without even having to be able to encode them, under the encoding used
for the source document. The digits and the symbols &, #, and ; are
available in every encoding transmissible via HTML.

If you *have* to specify UTF-8 to Word in order to get it to use the
numeric codes, then that means that Word is missing the point.
Specifying UTF-8 instead of ASCII or Western European (8859-1) is what
would *allow* the Russian characters to be encoded directly instead of
having to use the numeric codes.

etc.

I haven't tried it with a WYSIWYG editor, like Composer, or
Dreamweaver. I imagine it would work that way, too. Keep in mind, I
believe in hand-coding, so using Word is just a preliminary step in
order to get the right numerical character references.

BTW, I will definitely check out those links. I really wish I could
conquer the character code beast.

Ian

Mar 17 '06 #30

Andreas Prilop

On Fri, 17 Mar 2006, Harlan Messinger wrote:

&#12345;&#12348;&#12847;

Then the characters that are in the source document--the &, #, 1, 2, 3,
etc.-- are all encoded using UTF-8.

That is correct but somewhat misleading here. & # 1 2 3 are encoded
in US-ASCII.

The document contains no Russian
characters at all, just their HTML numeric representations.

Yes, the *source text* contains only US-ASCII characters.

Mar 17 '06 #31

Harlan Messinger

Andreas Prilop wrote:

On Fri, 17 Mar 2006, Harlan Messinger wrote:

&#12345;&#12348;&#12847;

Then the characters that are in the source document--the &, #, 1, 2, 3,
etc.-- are all encoded using UTF-8.

That is correct but somewhat misleading here. & # 1 2 3 are encoded
in US-ASCII.

Eh? Aren't they encoded in whatever encoding is being used for the file
containing them?

Mar 17 '06 #32

Ian Rastall

Andreas Prilop wrote:

On Fri, 17 Mar 2006, Ian Rastall wrote:
&#12345;&#12348;&#12847;

These are numeric character references.

Okay, I knew we were talking at cross-purposes, and it's my own fault.
I have a habit of putting everything into numeric character references
... although now that I think about it, I can't come up with the
reason. Just too many websites I've been to where the characters come
up as boxes, until I can find the right encoding to switch to. I
figure if I put everything into NCRs, then I'm directly telling the
browser, "This is UTF-8." Maybe I'm wasting my time, since that test
site you have, with the actual UTF-8 characters, rendered just fine.

I will download Composer and see what I can do with it. Thanks for the
suggestion.

Ian

Mar 17 '06 #33

Ian Rastall

Harlan Messinger wrote:

Specifying UTF-8 instead of ASCII or Western European (8859-1) is what would *allow* the Russian characters to be encoded directly instead of
having to use the numeric codes.

Thanks, Harlan. As I just mentioned, I'd forgotten about my habit of
putting everything into numeric codes, or rather forgotten that it
wasn't standard practice. I don't trust putting characters in
directly, but that may be a prejudice on my part, and I would love to
find out that it's unfounded. Would cut down on a lot of work!

Ian

Mar 17 '06 #34

Andreas Prilop

On Fri, 17 Mar 2006, Harlan Messinger wrote:

That is correct but somewhat misleading here. & # 1 2 3 are encoded
in US-ASCII.

Eh? Aren't they encoded in whatever encoding is being used for the file
containing them?

Yes. But they are only ASCII characters; so they are already encoded
in US-ASCII. I wrote it would be /misleading/ to take any superset
such as UTF-8 or Cyrillic Windows-1251. You /could/ say they are
encoded in Cyrillic Windows-1251 - but there are no genuine
characters from Windows-1251 other than US-ASCII here.

Mar 17 '06 #35

Alan J. Flavell

On Fri, 17 Mar 2006, Andreas Prilop wrote:

Yes. But they are only ASCII characters; so they are already encoded
in US-ASCII. I wrote it would be /misleading/ to take any superset
such as UTF-8 or Cyrillic Windows-1251. You /could/ say they are
encoded in Cyrillic Windows-1251

Indeed. You *could* say they are encoded in iso-8859-anything, or
windows-125x for various values of x, or in koi8-r... or indeed in
utf-8, but all of those would be misleading, seeing that the
characters in question (&, #, some digits, and a semi-colon) are
nothing more exciting than us-ascii.
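
This is easy to confirm. The Python sketch below (an illustration; the list of encodings is just a sample chosen for this example) encodes one numeric character reference under several ASCII-compatible encodings and, for contrast, under one EBCDIC code page.

```python
reference = "&#1044;"

# Encodings that agree with US-ASCII on its 128 characters.
ascii_compatible = ["ascii", "utf-8", "iso-8859-1", "iso-8859-8", "cp1251", "koi8_r"]
encoded = {name: reference.encode(name) for name in ascii_compatible}

# Every one of them yields the same bytes for this reference.
same_everywhere = len(set(encoded.values())) == 1

# EBCDIC (code page 037 here) is not ASCII-compatible:
# the same characters are stored as different bytes.
ebcdic = reference.encode("cp037")

print(same_everywhere, encoded["ascii"], ebcdic)
```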

Which was of course your point.

Mar 17 '06 #36

Harlan Messinger

Andreas Prilop wrote:

On Fri, 17 Mar 2006, Harlan Messinger wrote:

That is correct but somewhat misleading here. & # 1 2 3 are encoded
in US-ASCII.

Eh? Aren't they encoded in whatever encoding is being used for the file
containing them?

Yes. But they are only ASCII characters; so they are already encoded
in US-ASCII. I wrote it would be /misleading/ to take any superset
such as UTF-8 or Cyrillic Windows-1251. You /could/ say they are
encoded in Cyrillic Windows-1251 - but there are no genuine
characters from Windows-1251 other than US-ASCII here.

I think I understand what you're saying, but to me it's like describing
the paths of the planets in terms of revolving around the earth with
epicycles instead of in terms of revolving around the sun. Whatever
encoding is being used to store character data in a file, that's the
encoding being used for all the characters. For a subset of those
characters the encoding might, by design, be the same as under US-ASCII,
but it doesn't serve any useful purpose to say that the characters are
being encoded using US-ASCII instead of the other encoding. (There's no
reason to assume that the source text as a whole contains only these
characters, just because these are the characters I was discussing.)

After all, the whole analysis would still hold if the source text were
encoded in EBCDIC.
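
A rough illustration of the EBCDIC remark, assuming Python 3 and using `cp037` as one EBCDIC code page among several (the choice of code page and of the sample reference is an assumption made here, not something stated in the thread):

```python
reference = "&#1513;"  # illustrative numeric character reference

ascii_bytes = reference.encode("ascii")
ebcdic_bytes = reference.encode("cp037")  # an EBCDIC code page shipped with Python

# The stored bytes differ between the two encodings...
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# ...but both decode to the same sequence of characters.
same_text = ascii_bytes.decode("ascii") == ebcdic_bytes.decode("cp037")
print(same_text)
```

The reference is a sequence of characters; whichever encoding the file uses decides the bytes, which is why describing it as "encoded in US-ASCII" only fits ASCII-compatible files.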

Mar 17 '06 #37

Alan J. Flavell

On Fri, 17 Mar 2006, Ian Rastall wrote:

Thanks, Harlan. As I just mentioned, I'd forgotten about my habit of putting
everything into numeric codes, or rather forgotten that it wasn't standard
practice. I don't trust putting characters in directly,

If you're talking about Usenet postings, then that's good advice.

If you're talking about putting "real" characters into your HTML
source, then that really depends on the editor that you're using (as
well as on its operator's expertise in using it, of course ;-)

but that may be a prejudice on my part, and I would love to find out
that it's unfounded. Would cut down on a lot of work!

If you put it onto the web properly, then it *will* work, nowadays.
Even NN4.* was (with limitations) capable of rendering utf-8 (its
forms submission under those circumstances was, however, hopeless).

As already noted, Mozilla Composer, and its sibling Nvu, is happy to
convert between numerical character references on the one hand, and
characters properly encoded in your choice of encoding, depending on
which encoding you choose when you save the composed document.
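
The same kind of conversion can be sketched outside an editor. This is a minimal example assuming Python 3's standard library; it is not a description of how Composer or Nvu work internally, and the sample word is arbitrary.

```python
import html

russian = "Привет"  # arbitrary sample text

# Characters -> numeric character references: anything outside ASCII is
# replaced by a decimal &#NNNN; reference.
as_references = russian.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(as_references)

# Numeric character references -> characters.
back = html.unescape(as_references)

# The real characters can then be saved in whatever encoding is chosen,
# here UTF-8.
utf8_bytes = back.encode("utf-8")
print(len(utf8_bytes))
```

Either form represents the same text in the document; the reference form survives any ASCII-compatible encoding, while the direct form requires the file to be served with the encoding it was saved in.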

But to go back to the original topic of this thread: editing
mixed-direction HTML source is *painful* in pretty much any editor.

Mar 17 '06 #38

Ian Rastall

Alan J. Flavell wrote:

If you put it onto the web properly, then it *will* work, nowadays.

Thanks, Alan. That was what I was wondering.

But to go back to the original topic of this thread: editing
mixed-direction HTML source is *painful* in pretty much any editor.

I guess my point was small ... that if you solved the problem by using
numerical character references, it would be easy to plug the text
into something that could handle it, and then use it to generate the
code. The code would need to be cleaned up, worked on, etc., but I
thought it might be a helpful suggestion.

Ian
--
http://sundry.ws

Mar 17 '06 #39

Henri Sivonen

In article <xu******************@fe04.news.easynews.com>,
Ian Rastall <id*******@gmail.com> wrote:

You find the public
domain etext with the Russian characters, highlight and copy the
section. Paste that into a text editor, and you get gobbledy-gook.

That suggests your text editor is broken. Russian text can be pasted
into contemporary UTF-8-capable editors just fine.
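
The thread does not say what actually went wrong in the editor, but one common source of such garbage is text stored in one encoding and read back as another. A hypothetical sketch in Python 3, with Latin-1 standing in for whatever single-byte encoding an older editor might assume:

```python
original = "Привет"  # arbitrary sample text

utf8_bytes = original.encode("utf-8")

# Misreading the UTF-8 bytes as a single-byte encoding turns each
# Cyrillic letter into two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

An editor that handles UTF-8 correctly never performs the middle step, which is consistent with the observation that the text pastes cleanly there.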

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Mar 17 '06 #40

Ian Rastall

Henri Sivonen wrote:

That suggests your text editor is broken. Russian text can be pasted
into contemporary UTF-8-capable editors just fine.

Ah well, I'll have to wait for NoteTab 5.0 ... as I've become
accustomed to it. Maybe if I work on the Russian text, as I plan on, I
can use UltraEdit or something like that.

Sorry to get off topic.

Ian
--
http://sundry.ws

Mar 18 '06 #41
