Bytes IT Community

French "No" character entity

I'm having trouble finding the character entity for the French
abbreviation for "number" (capital N followed by a small superscript
o, period).

My references don't list it. Where would I find an answer to this
question (I don't find it in the W3C_char_entities document)?

--
Haines Brown
br****@hartford-hwp.com
kb****@arrl.net
www.hartford-hwp.com

Jul 20 '05 #1
38 Replies


In article <m2************@hartford-hwp.com>, br****@hartford-hwp.com
says...
I'm having trouble finding the character entity for the French
abbreviation for "number" (capital N followed by a small superscript
o, period).

Would N&deg; be any good?
Jul 20 '05 #2

Haines Brown wrote:

[snip]
My references don't list it. Where would I find an answer to this
question (I don't find it in the W3C_char_entities document)?


I see Alan's already answered your question, but for future reference, you
can look for characters at <URL:http://www.unicode.org/charts/>.

--
Jim Dabell

Jul 20 '05 #3

In article <m2************@hartford-hwp.com> in
comp.infosystems.www.authoring.html, Haines Brown <brownh@hartford-
hwp.com> wrote:
I'm having trouble finding the character entity for the French
abbreviation for "number" (capital N followed by a small superscript
o, period).

My references don't list it. Where would I find an answer to this
question (I don't find it in the W3C_char_entities document)?


You want the "masculine ordinal indicator", º or &ordm;. I
don't know what "references" you tried, but I found it at the W3C's
list of entities,
<http://www.w3.org/TR/html401/sgml/entities.html#h-24.2>.

Note that this is an ISO-8859-1 character.

(Somebody suggested a degree mark. That's too small, at least in the
first font I checked; and I think it may be at the wrong height
above the baseline too.)
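[Editor's aside: as a quick way to compare the candidates discussed in this
thread, here is a small Python 3 sketch, not part of the original posts, that
prints the official Unicode names of the three characters.]

```python
import unicodedata

# The degree sign, the masculine ordinal indicator, and the numero sign:
# three visually similar but distinct characters.
for ch in ("\u00b0", "\u00ba", "\u2116"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+00B0  DEGREE SIGN
# U+00BA  MASCULINE ORDINAL INDICATOR
# U+2116  NUMERO SIGN
```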

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #4

In article <Xn****************************@193.229.0.31> in
comp.infosystems.www.authoring.html, Jukka K. Korpela
<jk******@cs.tut.fi> wrote:
Stan Brown <th************@fastmail.fm> wrote:
You want the "masculine ordinal indicator", º or &ordm;.
I would say no.

would be illogical to use the latter, since the symbol does not stand
for any ordinal number. I think the masculine ordinal indicator should
_only_ be used when writing ordinal numbers in Spanish.


Correction accepted. I thought it was logically right, but I see the
distinction you're making.

(As for degree mark, I knew it was logically wrong but for a one-
shot Usenaut I thought the presentational argument would carry more
weight.)

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #5

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Alan's suggestion of using numero sign U+2116 is certainly correct in
some sense. It's apparently the kind of symbol used in French. But
there's the practical consideration that many fonts don't contain that
character, and browsers may fail to render it at all.


U+2116 is essentially a Cyrillic character, which you can find on every
Russian mechanical (!) typewriter. It's included in ISO-8859-5, cp866,
Windows-1251, MacCyrillic, etc. etc.
<http://www.unics.uni-hannover.de/nhtcapri/cyrillic.html5>

This symbol is as firmly established in Russian as "#" is in US English.
However, in many fonts the glyph for this symbol doesn't look right
for use with French; it rather has a certain "Russian touch".
<http://www.adobe.com/type/browser/pdfs/BSCQ/BaskervilleCyrLTStd-Upright.pdf>

--
But thats what FP puts in to the page, so i asume thats correct
Harry H. Arends in microsoft.public.frontpage.client
Jul 20 '05 #6

I wrote:
This symbol is as firmly established in Russian as "#" is in US English.
However, in many fonts the glyph for this symbol doesn't look right
for use with French; it rather has a certain "Russian touch".


See also
<http://www.microsoft.com/typography/developers/fdsspec/symbol.htm>

--
But thats what FP puts in to the page, so i asume thats correct
Harry H. Arends in microsoft.public.frontpage.client
Jul 20 '05 #7

I appreciate the interesting comments on how to display the numero
character, but the other initial question remains: How do I display
any character referenced by its unicode index? For example, how would
I reference U+2116 in an HTML doc (besides the issue of whether a font
actually contains it)?

I assume I have to define:

@font-face {
unicode-range: U+2116;
}

but how do I then call that character in the document body?

--
Haines Brown
br****@hartford-hwp.com
kb****@arrl.net
www.hartford-hwp.com

Jul 20 '05 #8

Haines Brown <br****@hartford-hwp.com> wrote:
I appreciate the interesting comments on how to display the numero
character, but the other initial question remains: How do I display
any character referenced by its unicode index?
That was _not_ what you asked. You specifically wrote:
"I'm having trouble finding the character entity for the French
abbreviation for "number""
The short answer is that there is no entity for it in HTML. What we
have tried to do is to guess that you actually wanted to know how to
render the numero character, instead of wanting to specifically use an
entity, which is impossible.

The question that you ask now is both very simple and very complicated.
It's very simple in the sense that by HTML specifications, any
character can be denoted by using the character reference &#n; where n
is its Unicode code position (index) in decimal. It's very complicated
in the sense that browsers fail to implement this properly, partly for
very understandable reasons, and there are many things to consider in
practice. I would suggest some tutorialish treatise like
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html
http://www.cs.tut.fi/~jkorpela/html/chars.var
For example, how would
I reference U+2116 in an HTML doc (besides the issue of whether a font
actually contains it)?
For example, as &#8470;. Or "as such", as a character, in UTF-8 encoding,
in a document advertised to use that encoding; you would probably want
to use a Unicode-capable text editor for this.
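[Editor's aside: as a sketch of the rule just stated (a character reference
is &#n; with n the Unicode code position), this Python 3 fragment, not from
the thread, derives both the decimal and the hexadecimal reference for the
numero sign.]

```python
# Build the HTML character references for U+2116 (NUMERO SIGN).
ch = "\u2116"
print(f"&#{ord(ch)};")     # decimal reference:     &#8470;
print(f"&#x{ord(ch):X};")  # hexadecimal reference: &#x2116;
```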
I assume I have to define:

@font-face {
unicode-range: U+2116;
}


No, that would be CSS and poorly supported at that. If it worked, in a
proper context, it would just help a browser "to avoid checking or
downloading a font that does not have sufficient glyphs to render a
particular character" (to quote the CSS2 specification).

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #9

"Jukka K. Korpela" <jk******@cs.tut.fi> writes:
What we have tried to do is to guess that you actually wanted to
know how to render the numero character, instead of wanting to
specifically use an entity, which is impossible.
That is correct. I misspoke. Unfortunately, I have in hand a doc that
speaks of "decimal entities," which I realize is incorrect, but it
contributed to my sloppiness.
The question that you ask now is both very simple and very
complicated. I would suggest some tutorialish treatise like
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html
http://www.cs.tut.fi/~jkorpela/html/chars.var


Yes, I'm currently ploughing through some of this interesting
material. Thanks.
I assume I have to define:

@font-face {
unicode-range: U+2116;
}


No, that would be CSS and poorly supported at that. If it worked, in a
proper context, it would just help a browser "to avoid checking or
downloading a font that does not have sufficient glyphs to render a
particular character" (to quote the CSS2 specification).


Thanks for the clarification. Is it correct to conclude, then, that
the unicode-range definition only serves to speed rendering by
preventing the downloading of unneeded glyphs?

--
Haines Brown
br****@hartford-hwp.com
kb****@arrl.net
www.hartford-hwp.com

Jul 20 '05 #10

Haines Brown <br****@hartford-hwp.com> wrote:
Unfortunately, I have in hand a doc
that speaks of "decimal entities," which I realize is incorrect,
but it contributed to my sloppiness.
That's understandable, especially since even the W3C materials mess
things up. The terminological confusion is unfortunate, but perhaps the
most serious confusion in practice is the common idea that one cannot
use a character if there is no entity for it. Entities are like macros
or named constants, which denote some strings, namely (in the case of
HTML) character references.
Is it correct to conclude, then, that
the unicode-range definition only serves to speed rendering by
preventing the downloading of unneeded glyphs?


Yes, something like that is the idea behind it. And unicode-range
definitions are poorly supported by browsers, and have been removed
from the CSS 2.1 draft due to lack of implementations.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #11

In article <m2************@hartford-hwp.com> in
comp.infosystems.www.authoring.html, Haines Brown <brownh@hartford-
hwp.com> wrote:
For example, how would
I reference U+2116 in an HTML doc (besides the issue of whether a font
actually contains it)?
This is answered in the HTML spec (URL below): &#x2116; will display
character 2116hex if it's available to the browser.
I assume I have to define:

@font-face {
unicode-range: U+2116;
}


Good gracious no, how horrid!

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #12

Haines Brown <br****@hartford-hwp.com> writes:
First, I might be inclined to be adventurous and set charset=UTF-16.
You almost certainly don't want to do this. You can't make your
documents UTF-16 just by changing the XML declaration (unless you have
an advanced, XML-aware editor that does this -- I don't know of such
an editor); each character in your document has to be expanded from
eight bits to sixteen. If your documents are in Latin script, UTF-16
is wasteful compared to UTF-8; UTF-16 is best reserved for Arabic,
East Asian, South Asian, and other non-Latin scripts.
Second, I also have this in my documents' preface:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

My impression is that the encoding= statement is optional (documents
validate OK without it).


It's optional if and only if the document is UTF-8 or UTF-16. Since
all US-ASCII documents are also UTF-8, you can also omit it if you
have no non-ASCII characters. Note this means literal characters; you
can always use entities and character references for non-ASCII
characters -- they do not affect the encoding declaration.
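[Editor's aside: the "US-ASCII documents are also UTF-8" point can be checked
directly; a minimal Python 3 illustration, mine rather than the poster's.]

```python
# A pure-ASCII string encodes to identical bytes under US-ASCII and UTF-8,
# which is why an all-ASCII document may omit the encoding declaration.
s = "plain ASCII text"
assert s.encode("ascii") == s.encode("utf-8")

# UTF-16, by contrast, doubles the size (plus a two-byte byte-order mark).
print(len(s.encode("utf-8")))   # 16
print(len(s.encode("utf-16")))  # 34
```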

--
Dean Tiegs, NE¼-20-52-25-W4
“Confortare et esto robustus”
http://telusplanet.net/public/dctiegs/
Jul 20 '05 #13

Stan Brown <th************@fastmail.fm> wrote:
This is answered in the HTML spec (URL below): &#x2116; will display
character 2116hex if it's available to the browser.


It _should_ display but it will not in many versions of Netscape 4.x
and perhaps other browsers. You better write &#8470;.

--
But thats what FP puts in to the page, so i asume thats correct
Harry H. Arends in microsoft.public.frontpage.client
Jul 20 '05 #14

Haines Brown <br****@hartford-hwp.com> wrote:
This I corrected, and the accented characters now display
without reference to their index:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
Don't do that! Rather specify the encoding (charset) in the HTTP header.
http://www.w3.org/International/O-HTTP-charset
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html
First, I might be inclined to be adventurous and set charset=UTF-16. What
would be the disadvantages?


What would be the advantages?

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #15

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> writes:
What would be the advantage of using UTF-16 (instead of UTF-8) for
Arabic?


In pre-computer days, I was a student of Chinese philatelic history,
and am considering getting back to some re-study of the written
language. For this, the UTF-16 could be handy.

--
Haines Brown
br****@hartford-hwp.com
kb****@arrl.net
www.hartford-hwp.com

Jul 20 '05 #16

Haines Brown <br****@hartford-hwp.com> wrote:
What would be the advantage of using UTF-16 (instead of UTF-8) for
Arabic?
In pre-computer days, I was a student of Chinese philatelic history,


Fine - but I don't see any connection with Arabic.
and am considering getting back to some re-study of the written
language. For this, the UTF-16 could be handy.


I still do not see the advantage of UTF-16 over UTF-8 in webpages,
especially not for Arabic. Please explain!

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #17

In article <news:MP************************@news.odyssey.net> ,
Stan Brown <th************@fastmail.fm> wrote:
Once again, this is in the spec. U+2116 makes use of a hex number,
and &#x2116; should display the character (depending on your browser
and the available font). The corresponding decimal number is 8470,
and therefore &#8470; refers to the same character.
I don't know if any browser can handle the decimal numbers but not
the hex numbers.


Netscape 4.0 and Netscape 4.5 wouldn't display hexadecimal character
references &#xnumber; but Netscape 4.8 does.
So, which Netscape 4 version was the first to recognize &#xnumber; ?
Does anybody know?

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #18

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> writes:
What would be the advantage of using UTF-16 (instead of UTF-8) for
Arabic?


I assumed that most Arabic letters would be two octets in UTF-16 but
three octets in UTF-8. Now that I have checked, I see I was
mistaken. Arabic letters are two octets in both. So no size
advantage either way. Sorry about that.

--
Dean Tiegs, NE¼-20-52-25-W4
“Confortare et esto robustus”
http://telusplanet.net/public/dctiegs/
Jul 20 '05 #19

"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote:

I still do not see the advantage of UTF-16 over UTF-8 in webpages,
especially not for Arabic. Please explain!


The advantage to using UTF-16 for Arabic is in terms of file size.

Since the main Arabic characters are in the range U+0600 to U+06FF, in UTF-8
these characters require three bytes to code while UTF-16 uses only two
bytes. Hence UTF-16 would save 1/3 the size. This holds true for any
document that consists mostly of characters in the range U+0400 to U+FFFF,
since all of these characters take three bytes in UTF-8 and two in UTF-16.
Basically, UTF-8 is the best choice only if the text is mostly Latin
characters (which will usually contain quite a few of the characters in the
range U+0000 to U+007F that UTF-8 can represent in one byte).

Jul 20 '05 #20

"Ernest Cline" <er*********@mindspring.communism> wrote in message
news:bf**********@slb6.atl.mindspring.net...

"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote:

I still do not see the advantage of UTF-16 over UTF-8 in webpages,
especially not for Arabic. Please explain!
The advantage to using UTF-16 for Arabic is in terms of file size.

Since the main Arabic characters are in the range U+0600 to U+06FF, in UTF-8
these characters require three bytes to code while UTF-16 uses only two
bytes. Hence UTF-16 would save 1/3 the size. This holds true for any
document that consists mostly of characters in the range U+0400 to U+FFFF,
since all of these characters take three bytes in UTF-8 and two in UTF-16.
Basically, UTF-8 is the best choice only if the text is mostly Latin
characters (which will usually contain quite a few of the characters in the range U+0000 to U+007F that UTF-8 can represent in one byte).


Please amend the above: the breakoff point for the three-byte UTF-8 range is
U+0800, not U+0400. There is still a small advantage to using UTF-16 for
Arabic, but it has nothing to do with file size for Arabic characters. The
advantage is that since Javascript uses 16-bit characters for its routines,
Gecko-based browsers (and most likely IE and Opera as well, altho I can't
say for sure) use 16-bit characters to store character data. Using UTF-16
as opposed to UTF-8 will therefore save some time on the client-side
converting the UTF-8 whereas the UTF-16 does not need conversion.
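[Editor's aside: the corrected byte counts are easy to verify; a small
Python 3 sketch, my addition, covering the three ranges mentioned.]

```python
# Bytes per character in UTF-8 vs UTF-16 for the ranges discussed above.
samples = [
    ("A",      "U+0041, Latin"),       # below U+0080: 1 byte in UTF-8
    ("\u0639", "U+0639, Arabic ain"),  # U+0080..U+07FF: 2 bytes in UTF-8
    ("\u2116", "U+2116, numero sign"), # U+0800..U+FFFF: 3 bytes in UTF-8
]
for ch, label in samples:
    print(label,
          len(ch.encode("utf-8")),      # UTF-8 bytes
          len(ch.encode("utf-16-be")))  # UTF-16 bytes (no BOM)
```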

Jul 20 '05 #21

On Mon, Jul 21, Haines Brown inscribed on the eternal scroll:
The header in my documents indicated they were ISO-8859-1 rather than
UTC-8.
(this'll be a typo for UTF-8, I'm assuming)
This I corrected, and the accented characters now display
without reference to their index:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
meta http-equiv is an HTML-only ersatz for a real HTTP header. Any
kind of text file can and should have an appropriate content-type
header (text/whatever) which also specifies its character coding
(there's a wide-ranging security alert CA-2000-02, in which this is
one of the recommendations), and some service providers have followed
this advice by configuring their server to supply a default charset
(often, iso-8859-1): if they do that, then you could specify meta
http-equiv till you go blue in the face, it'll do no good for
specification-conforming browsers since the real HTTP header takes
precedence.
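[Editor's aside: concretely, the real HTTP header in question is a single
Content-Type line; the fragment below is an illustrative sketch, and where to
configure it depends on your server.]

```
Content-Type: text/html; charset=UTF-8
```

With Apache, for example, the directive "AddDefaultCharset UTF-8" in the
server configuration or an .htaccess file makes the server send that charset
parameter by default.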

But note also the situation in XML and XML-based markups such as
XHTML, where meta http-equiv is either (a) only there for
compatibility with HTML or (b) just extra noise serving no productive
purpose, depending on circumstances.
First, I might be inclined to be adventurous and set charset=UTF-16. What
would be the disadvantages? Would it be likely to slow down the user's
display of documents? Are there other dangers?
I defer to the other postings on this topic. Must admit I've never
attempted it myself, so don't know the practical implications.

Btw, utf-8 has some real benefits since us-ascii is a proper subset of
the character coding. If you're worried that utf-8 might be too
voluminous, have you looked at the typical HTML-like extrusion from MS
Office products? But, ridicule apart: most browsers recognise
gzip-compressed HTML nowadays, if properly advertised from the server,
which may be worth a look if you have voluminous HTML files.
Second, I also have this in my documents' preface:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

My impression is that the encoding= statement is optional (documents
validate OK without it). Is the statement a good idea or not (I hear
that it can be troublesome)?


There are several ways of communicating character coding to an
XML-based processor and/or a tag-soup browser.

Since XML defaults to utf-8 - or to whatever coding of unicode is
implied by the BOM - there is no need to use the <?xml... thingy to
declare utf-8 to the XML-based processor. However, HTML does not
default to utf-8, so you certainly need to specify charset=utf-8
somewhere for the use of HTML-supporting user agents (browsers and
such). My recommendation, as I already noted, would be to specify it
in the real HTTP protocol Content-type header.

Supplying the <?xml thingy can upset some browser-like objects,
putting them into their quirks mode when you might have wanted them in
specification-conforming (so-called "standards") mode. MSIE does a
muddled line in overruling mandatory requirements of the IETF
specification, and trying (sometimes badly) to guess what the author
might have meant rather than doing what they actually asked for.
Jul 20 '05 #22

"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote:
"Ernest Cline" <er*********@mindspring.communism> wrote:
The
advantage is that since Javascript uses 16-bit characters for its routines, Gecko-based browsers (and most likely IE and Opera as well, altho I can't say for sure) use 16-bit characters to store character data. Using UTF-16 as opposed to UTF-8 will therefore save some time on the client-side
converting the UTF-8 whereas the UTF-16 does not need conversion.
I do not think that justifies the use of UTF-16 instead of UTF-8.

- A plethora of browsers and search engines will recognize UTF-8 but
not UTF-16.


Search engines, I can believe, altho I think all the major ones can handle
UTF-16. After all, if they are going to handle East Asian texts, they have
to be able to handle UTF-16 as well as several other 16-bit character sets.
However, I can't think of any major modern browser that can't handle UTF-16
(provided that the server correctly identifies the file as UTF-16 that is).
Netscape 4 might be a problem, but since I write in Latin-1, I've never
really worried too much about charset problems. If handling an ancient
browser or search engine is important then that would be a reason to use
UTF-8 or ISO-8859-1 with numerical character references instead of UTF-16,
but for many uses supporting those antiquated programs is not a concern.
- UTF-8 is a superset of ASCII.
So are the various ISO-8859-X character sets. In general, in so far as file
size is concerned, often one of the ISO-8859-X character sets will be the
best choice for Latin, Greek, Cyrillic, or Hebrew texts. However, from what
I've heard, ISO-8859 Arabic is woefully deficient.
- HTML-generating software usually blows up files by inserting spaces
for layout. Two thirds of such files may consist of spaces - making
the difference between UTF-8 and UTF-16 irrelevant.
Well, that's a defect of the generating software then. Actually in such a
case, then for Arabic, UTF-8 would have the advantage, because of using only
one byte for spaces as opposed to the two that UTF-16 does. Transmission
time, especially for dialup accounts, is far more important than the time it
takes to convert from UTF-8 into the browser's internal character set.
- It's not just the file size that counts: Nesting tables, for example,
is a heavier crime than big files.


True, but nesting tables isn't something that can be affected by one's
choice of character set. :)

Jul 20 '05 #23

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
This symbol is as firmly established in Russian as "#" is in US
English. However, in many fonts the glyph for this symbol doesn't
look right for use with French; it rather has a certain "Russian
touch".


See also
<http://www.microsoft.com/typography/developers/fdsspec/symbol.htm>


It shows "common French usage" as a sans-serif symbol where "N" is
followed by an "o" which is raised but not considerably smaller than we
expect a normal lower case "o" to be, and there is no underlining.

It seems that N<sup>o</sup> would be the best markup for it. But it
results in typographically inferior appearance, especially if the font
size is big - the "o" appears (on IE 6 for example) as if it were a
superscript of "N".

So we might use
<span class="no">N<sup>o</sup></span>
(optionally with title="num&eacute;ro" in the <span> tag, but I think
that would be overboard)
and then consider ways to style it. I would like to use <abbr>, but
then IE would fail to recognize the markup, and the styling would fail.

The following style sheet seems to give roughly the "French usage"
(disclaimer: I don't really know French):

span.no sup { vertical-align: 1ex;
              font-size: 80%; }
span.no     { letter-spacing: 0.05em; }

Of course, if you later decide that you really want the "o" underlined,
this can easily be changed by tuning the style sheet. (And it works the
other way too: even if you use <u> in the markup, you can remove the
underlining, in most browsing situations, using a style sheet, _if_ you
have put the stuff inside a container like <span> [or if you have not
used <u> for anything else on your page]).

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #24

On Wed, Jul 23, Ernest Cline inscribed on the eternal scroll:
- UTF-8 is a superset of ASCII.
So are the various ISO-8859-X character sets.


Neither of these are "character sets" in the meaning of HTML4, they
are external character codings.
In general, in so far as file
size is concerned, often one of the ISO-8859-X character sets will be the
best choice for Latin, Greek, Cyrillic, or Hebrew texts.
Not "in general", no: but at least in the specific cases where the
document repertoire matches the repertoire of the chosen character
coding. Occasional characters outside of that repertoire can then be
represented by &-notations (NN4.* doesn't understand this approach,
but it's entirely legal per HTML4/RFC2070 and supported by any
conforming browser).
However, from what
I've heard, ISO-8859 Arabic is woefully deficient.


I've no idea what you mean by this.
Jul 20 '05 #25

On Wed, 23 Jul 2003, Ernest Cline wrote:
Search engines, I can believe, altho I think all the major ones can handle
UTF-16.
Can you prove it?
After all, if they are going to handle East Asian texts, they have
to be able to handle UTF-16 as well as several other 16-bit character sets.
They don't "have to be able". Remember, even Netscape 2.0 for Macintosh
could display CJK text but was not able to decipher UTF-16.
Netscape 4 might be a problem, but since I write in Latin-1, I've never
really worried too much about charset problems.
Oh, you do not have any practical experience with non-Latin-1 text. Hmmm...
ancient browser
antiquated programs
I can assure you that Netscape 4.x versions are installed on many computers
around here (uni-hannover.de) and the user can do nothing about it.
In general, in so far as file
size is concerned, often one of the ISO-8859-X character sets will be the
best choice for Latin, Greek, Cyrillic, or Hebrew texts.
File size should be the _last_ argument in selecting a suitable encoding
for certain non-Roman text. And, I recall, you said you have no experience
with non-Latin-1 text.
However, from what
I've heard, ISO-8859 Arabic is woefully deficient.
Please explain!
(I miss the mere four Persian letters, which might well fit into the many
empty spaces in ISO-8859-6. Windows-1256 and MacArabic even cover Urdu.)
Actually in such a
case, then for Arabic, UTF-8 would have the advantage, because of using only
one byte for spaces as opposed to the two that UTF-16 does.


This really doesn't matter at all. And a simple image may be bigger than your
whole text.

You should carefully select the best encoding (if any) for your purpose
http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html
where file size doesn't really count - as long as it doesn't increase
by a factor 4 or so.

Can you measure any difference in display time between
http://www.unics.uni-hannover.de/nht...ilingual1.html and
http://www.unics.uni-hannover.de/nht...ilingual2.html ?

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #26

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
It seems that N<sup>o</sup> would be the best markup for it.


That's an ASCII-HTML approximation of №.

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #27

"Ernest Cline" <er*********@mindspring.communism> wrote:
Can
you provide an example of a web program released after 1 January 2000 that
does not support UTF-16 but does support the use of one or more of the other
16-bit East Asian character sets?


Netscape 4.80 was released in 2002. People using this browser mostly
cannot upgrade to Netscape 7 or Mozilla because of their hardware/OS
limitations.

But all this doesn't matter since Google cannot decode UTF-16:
<http://www.google.com/search?q=%22a+thai+charactors%22&oe=UTF-8&filter=0>

AllTheWeb and AltaVista probably also cannot decode UTF-16:
<http://www.alltheweb.com/search?cs=utf-8&q=%22a+thai+charactors%22&_sb_lang=any>
<http://altavista.com/web/results?enc=utf8&q=%22a+thai+charactors%22&kgs=0&k ls=0>
They won't find the UTF-16-encoded page. There might be a small chance
that they missed the URL altogether but I suspect they just cannot
decode UTF-16.

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #28

On Wed, Jul 23, Ernest Cline inscribed on the eternal scroll:
Besides, if IIRC, Netscape 4's character set problems
occur when trying to use characters outside the domain of an 8-bit character
set that the document is encoded in.
NN4 indeed has such a problem, but the topic here was support for
utf-16, so let's stay focused on that.
Where we appear to disagree is over how much effort should be made
to support antique programs. You appear to favor trying to support
a priori antiquated programs
Andreas shall, as ever, speak for himself, but my reading was that he
was trying to support _users_, rather than browsers: where there is a
choice of techniques, both of which are valid but one is supported
over a wider range of browsers, then it seems appropriate to choose
the one with wider range.
whereas I favor worrying about that only if you know that your
target audience must use them.


On this discussion group, the "target audience" is ipso facto the
world wide web. That's why the group has "www" in its name.

There _are_ some techniques where there simply is no choice in the
matter: there is for example no respectable way at all in HTML of
getting Latin, Greek and Russian displayed on the same web page with
non-supporting browsers. An author who needs to do that, therefore,
has no option in the matter - no matter how much they might want to
support users of, say, WebTV, they cannot do it.

But I repeat, where there _is_ a choice, it seems opportune to make
that choice in the interests of wider browser coverage.

If you're worried about download time, don't forget that most browsers
support gzip-compressed HTML. It may be worth noting that Andreas's
pair of sample pages, although very different in uncompressed size,
were quite close in size (about 4.5K) when compressed.
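[Editor's aside: the compression point is easy to reproduce; a Python 3
sketch with illustrative markup, not Andreas's actual pages.]

```python
import gzip

# Repetitive markup, which HTML usually is, compresses extremely well.
html = ("<table><tr><td>&#8470;</td></tr></table>\n" * 200).encode("utf-8")
packed = gzip.compress(html)
print(len(html), len(packed))  # the compressed copy is a small fraction
```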

With server negotiation, one could bring this benefit to the majority
of readers, while not shutting-out those whose browser can't do it.
(And then it doesn't matter whether Google etc. support it or not:
they'd get the version which they could handle, depending on whatever
Accept-encoding header they might send with their request.)

(disclaimer: I don't actually do this in practice myself, except for a
small number of very large HTML pages that are for internal use only.)
Jul 20 '05 #29

I wrote:
But all this doesn't matter since Google cannot decode UTF-16:


Here are some more examples of UTF-16-encoded pages, which Google cannot
index correctly: http://www.google.com/search?q=%22U+T+F+1+6%22

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #30

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
IMHO, these Unicode designations ("ordinal indicator") are
designations only for identifying the characters.
I don't think so. But I wonder whether we should take this discussion
elsewhere, since it's about character usage rather than HTML (although
things like <sup> form a connection, and of course character usage is
important in HTML authoring too).

My understanding is that Unicode defines the meanings of characters
too, though often very broadly and implicitly. And quite often the
_name_ assigned to the character is the most important hint to its
meaning. The Unicode standard does not say these things very clearly,
but it does assign meaning to a name, apparently with the implicit idea
that we are to understand that meaning from common knowledge of what
words usually mean. It then says:
"A character may have a broader range of use than the most literal
interpretation of its name might indicate; coded representation, name,
and representative glyph need to be taken in context when establishing
the semantics of a character."
There's room for interpretation here, of course. But the message is
clear: the name means something.
But they do not
imply the specific use of these characters.
The Unicode standard specifically mentions, in the code chart
explanations, "Spanish", i.e. that the feminine and masculine ordinal
indicators are used in Spanish. Such descriptive notes need not be
exclusive, though often they apparently describe _the_ meaning and use
of character. In this case, it's really the names that matter. Why else
would they be called by such names? Note that there are e.g.
"superscript 2" and "superscript 3" in the ISO 8859-1 range, and
_their_ names indicate general superscript use. Why name the characters
in cryptic ways (I remember how puzzling they were to me when I made my
first acquaintance with ISO 8859-1, and I do know some Spanish),
instead of just "superscript small a" and "superscript small o" or
something, unless the characters were really meant to have specific use
in conjunction with numerals?
In fact, Spanish,
Portuguese, Catalan (and others?) use ª and º for lots of
abbreviations, such as Srª = Senhora.


Perhaps, but there's a lot of wrong use of characters on the Web
anyway. The sharp s character is often used to denote the letter beta;
would this make such practice acceptable?

Finally, and not just as an ObHTML, I cite the joint W3C and Unicode
document "Unicode in XML and other Markup Languages",
http://www.w3.org/TR/unicode-xml/
which recommends (if I read it correctly - this is a relatively new
version and I have not studied it in detail) the use of <sup> markup
for compatibility characters with a "compatibility tag" (not to be
confused with HTML tags) of <super>. And the ordinal indicators are
defined as compatibility characters <super> a and <super> o.
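The compatibility decompositions being referred to can be inspected directly, for instance with Python's unicodedata module (my own check, not part of the original post):

```python
import unicodedata

# U+00AA and U+00BA decompose to a <super> compatibility tag
# plus the plain letters 'a' and 'o'.
for ch in ("\u00aa", "\u00ba"):
    print(hex(ord(ch)), unicodedata.name(ch), "->", unicodedata.decomposition(ch))
```

This prints `0xaa FEMININE ORDINAL INDICATOR -> <super> 0061` and `0xba MASCULINE ORDINAL INDICATOR -> <super> 006F`.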

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #31


"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in message
news:230720032129548898%nh******@rrzn-user.uni-hannover.de...
"Ernest Cline" <er*********@mindspring.communism> wrote:
Can you provide an example of a web program released after 1 January 2000 that does not support UTF-16 but does support the use of one or more of the other 16-bit East Asian character sets?
Netscape 4.80 was released in 2002. People using this browser mostly
cannot upgrade to Netscape 7 or Mozilla because of their hardware/OS
limitations.


I was under the impression that Netscape 4.06+ could handle UTF-16 (At least
for characters that are only in Plane 0.) Indeed, since JavaScript 1.3
(first included in Netscape 4.06) uses 16-bit Unicode character codes for
its internals, I would be extremely surprised if it can't handle UTF-16. Can
you provide documentation of its inability?

But all this doesn't matter since Google cannot decode UTF-16:
<http://www.google.com/search?q=%22a+thai+charactors%22&oe=UTF-8&filter=0>
The ie and oe parameters refer to how to access the data that Google holds
(which indeed must be in UTF-8 and only UTF-8) but they do not limit the
character set of the documents that Google will index which is what is
important in this case. Indeed since one of the three links given by this
search link is http://movie.lemononline.com/webdemo...ctor/utf16.htm
which most certainly is in UTF-16, it can be shown that Google *CAN* handle
UTF-16.

AllTheWeb and AltaVista probably also cannot decode UTF-16:
<http://www.alltheweb.com/search?cs=u...rs%22&_sb_lang=any>
<http://altavista.com/web/results?enc...ors%22&kgs=0&kls=0>
They won't find the UTF-16-encoded page. There might be a small chance
that they missed the URL altogether, but I suspect they just cannot
decode UTF-16.


Now since these two sites don't list the above mentioned link to the UTF-16
page, I can believe that they are deficient in this regard, but both Google
and Yahoo! do show links to the UTF-16 page. Thus I would say that UTF-16
support for search engines is available but not sufficiently widespread in
all major search engines to warrant using UTF-16 for pages intended for
access by search engines. Hopefully, that support will soon be available,
since adding UTF-16 support is trivial for any search engine or browser that
handles UTF-8. (Worst case scenario involves having the program in question
translate the UTF-16 into UTF-8.) In my original reply I stated that I was
uncertain of search engine support of UTF-16, so while I am saddened by
these results, I am not surprised.
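The "worst case" transcoding step mentioned above is a one-liner wherever both codecs are available; a sketch in Python (the sample string is mine, purely for illustration):

```python
# A UTF-16 page serialized with a byte-order mark ...
utf16_doc = "a thai charactors".encode("utf-16")
# ... is decoded (the 'utf-16' codec consumes the BOM) and re-encoded as UTF-8.
utf8_doc = utf16_doc.decode("utf-16").encode("utf-8")
```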

Jul 20 '05 #32

On Thu, Jul 24, Ernest Cline inscribed on the eternal scroll:
I was under the impression that Netscape 4.06+ could handle UTF-16
Hmmm, I find this rather curious (I'm trying this on Win NN4.79): in
the menus (e.g View->Character Set) there's no evidence of any support
for utf-16, which I must admit had led me to believe that NN4 didn't
support it; and yet I now discover that if I view the cited document
http://movie.lemononline.com/webdemo...ctor/utf16.htm then the
browser gives a remarkable impression of actually knowing what to do.
But all this doesn't matter since Google cannot decode UTF-16:
<http://www.google.com/search?q=%22a+thai+charactors%22&oe=UTF-8&filter=0>


The ie and oe parameters refer to how to access the data that Google holds


Indeed.
(which indeed must be in UTF-8 and only UTF-8) but they do not limit the
character set of the documents that Google will index which is what is
important in this case. Indeed since one of the three links given by this
search link is http://movie.lemononline.com/webdemo...ctor/utf16.htm
Is this the acid test? A page might be indexed on account that it's
linked from somewhere else, irrespective that Google failed to index
the contents of the linked page itself.

Here's one test that seemed reasonable to me. This page contains a
link http://movie.lemononline.com/webdemo/ , and therefore I reckon it
ought to be listed amongst the pages which Google returns in response
to the query link:http://movie.lemononline.com/webdemo/

But there is only one page in that query response, and this one
isn't it.
which most certainly is in UTF-16, it can be shown that Google *CAN* handle
UTF-16.
Here's another test. Go to
http://www.google.com/search?hl=en&i...=Google+Search
(i.e search for the URL of the page itself).

The response to this is already suspect: it's displayed to me as

< ! DOCTYPE html PUBLIC " - / / W 3 C / / DTD XHTML Basic 1 . 0 /
/ EN " " http : / / www . w 3 . org / TR / xhtml - basic / xhtml

(by the way, that DOCTYPE seems to be utterly inappropriate to the
actual page, but that's almost certainly irrelevant to the present
discussion).

Then take the link called "Show Google's cache of ..." to see the full
horror of Google's interpretation. How do you explain that
observation, if Google was interpreting utf-16 correctly?
Thus I would say that UTF-16
support for search engines is available but not sufficiently widespread in
all major search engines to warrant using UTF-16 for pages intended for
access by search engines.


Looks to me as if your practical conclusions are in agreement with
what Andreas already said, even if there are differences in detail.

all the best
Jul 20 '05 #33

"Alan J. Flavell" <fl*****@mail.cern.ch> wrote:
Hmmm, I find this rather curious (I'm trying this on Win NN4.79): in
the menus (e.g View->Character Set) there's no evidence of any support
for utf-16, which I must admit had led me to believe that NN4 didn't
support it;


Me too. And you cannot _generate_ UTF-16-encoded pages with Composer
even in version 7. (At least I couldn't find how.) But it seems, you
and Ernest Cline are right: Netscape 4.x _can_ display UTF-16.
I stand corrected.

Most probably the limitations in Netscape 4.x are the same as with
UTF-8. No need to discuss them again here.

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
Jul 20 '05 #34

"Ernest Cline" <er*********@mindspring.communism> wrote:
but both Google and Yahoo! do show links to the UTF-16 page.


Here is another one from Yahoo:
<http://search.yahoo.com/bin/search?p=%22U+T+F+1+6%22>
(Isn't Yahoo search the same as Google's?)

--
But thats what FP puts in to the page, so i asume thats correct
Harry H. Arends in microsoft.public.frontpage.client
Jul 20 '05 #35

On Thu, Jul 24, Andreas Prilop inscribed on the eternal scroll:
But it seems, you and Ernest Cline are right:
To be accurate: it seems that Ernest Cline was right, and I was wrong,
until I actually investigated the issue and discovered my mistake!
Netscape 4.x _can_ display UTF-16. I stand corrected.


Btw, "view source" seems to work, or not work, depending on which
encoding one has selected on the View->"Character Set" menu pulldown
prior to opening the view-source window. Bizarre.

cheers
Jul 20 '05 #36

On Thu, Jul 24, Alan J. Flavell inscribed on the eternal scroll:
But all this doesn't matter since Google cannot decode UTF-16:
<http://www.google.com/search?q=%22a+thai+charactors%22&oe=UTF-8&filter=0>


The ie and oe parameters refer to how to access the data that Google holds


Indeed.


I see that Andreas answered this point in the negative, so, in the
interests of "the record", I beg to correct myself. Of course, ie and
oe refer to input and output encoding (of a query).
(which indeed must be in UTF-8 and only UTF-8) but they do not limit the
character set of the documents that Google will index which is what is
important in this case.


I now realise that I was mistaken to quote that part without comment,
and I reckon Andreas was right to dispute it.

As you see, I hate to leave misleading remarks uncorrected on the
record. No offence meant.
Jul 20 '05 #37

On Wed, Jul 23, Ernest Cline inscribed on the eternal scroll:
I remember that in the early 1990's that several of my fellow graduate
students who came from the Middle East had complaints about the ISO-8859
Arabic set. Exactly what those complaints were and how serious they were, I
don't recall, but I do know that they were not satisfied with it.


Well, then as you don't seem to have the technical basis for the
problem, you might want to avoid being suspected of propagating an
"urban legend". But I'll chance a theory of what the problem might
be.

In Arabic script, letters can have up to four different forms:
initial, medial, final and isolated. The Unicode standard recommends
the use of the same character code for all forms, and puts the
responsibility on the software to choose the appropriate glyph
according to the context in which the character appears.

Putting the clock back a decade or two, it was more usual for a font
to be implemented to a specific character code (indeed, some people
who learned fonts back in those times find it difficult to understand
the distinction between a font arrangement and a character coding, as
the two concepts were indistinguishable in their mental model).

Consequently, back then you'd find simplistic fonts that did no more
than shovelled the isolated-form glyphs into the corresponding font
positions, and just let words be displayed from those, instead of
making any attempt to select the appropriate initial, medial or final
forms according to context. My hunch would be that they were
complaining about that.
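For what it's worth, those four contextual forms do exist in Unicode as separate "presentation form" code points, retained only for compatibility with such legacy font-based practice. Python's unicodedata shows this for one letter (my own illustration):

```python
import unicodedata

# The single logical character is U+0628 ARABIC LETTER BEH; its four
# contextual glyph variants live in the Arabic Presentation Forms-B block.
for cp in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
    print(hex(cp), unicodedata.name(chr(cp)))
```

This lists the isolated, final, initial and medial forms of BEH.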

Btw, http://www.unicode.org/Public/MAPPIN...859/8859-6.TXT

But there's no reason in principle that they couldn't treat text
encoded in iso-8859-6 in the same way that the Unicode standard wants
them to do anyway with Unicode-encoded text. If you check RFC2070
you'll see the black-box model of HTML whereby the "reference browser"
would map all external codings into Unicode internally anyway.
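That mapping is a plain table lookup. As a sketch (my example, not from the post; the byte value for BEH is taken from mapping tables of this kind):

```python
import unicodedata

# In ISO-8859-6, byte 0xC8 maps to U+0628 ARABIC LETTER BEH.
beh = b"\xc8".decode("iso-8859-6")
print(hex(ord(beh)), unicodedata.name(beh))
```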

So much for that issue. There are other languages than Arabic which
get written with basically Arabic characters, for example Farsi, Urdu.
They need a few extra characters not present in Arabic itself, and so
there might be complaints about 8859-6 on that basis too, if your
informants were not actually using Arabic (language).

disclaimer: I don't actually read Arabic, my interest is basically in
the technology of character representation (coding etc.).
Jul 20 '05 #38

"Alan J. Flavell" <fl*****@mail.cern.ch> wrote:
Putting the clock back a decade or two,
[ ... ]
My hunch would be that they were complaining about that.


Apparently they did not use a Macintosh :-/
System 7 was released in 1990 and it's still available:
<http://download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Arabic/Macintosh/System/System_7.0.1/>
The MacArabic character set is indeed a superset of ISO-8859-6.

--
Meanwhile, at the Google Ranch ...
"I can't read this bloody site; it's all Falsh and JavaScrap."
"Forget it and move on! Still 2*718*281*828 pages to crawl."
Jul 20 '05 #39
