Lang attribute values


Been searching around, and found
http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
looking for a guide to what codes are acceptable.

I see stuff like lang="en-us" - that extension, where is that from? Is
there a codification somewhere?
Jul 20 '05 #1
80 Replies


Neal <ne*****@spamrcn.com> wrote:

Been searching around, and found
http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
looking for a guide to what codes are acceptable.

I see stuff like lang="en-us" - that extension, where is that from? Is
there a codification somewhere?


RFC 1766: http://www.ietf.org/rfc/rfc1766.txt

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.
Jul 20 '05 #2

Harlan Messinger <hm*******************@comcast.net> wrote:
I see stuff like lang="en-us" - that extension, where is that from?
Is there a codification somewhere?


RFC 1766: http://www.ietf.org/rfc/rfc1766.txt


RFC 1766 has been superseded by RFC 3066 and RFC 3282.

For more info on language codes see
http://webtips.dan.info/language.html
http://xml.coverpages.org/languageIdentifiers.html

I'm afraid there's only one detailed survey of language codes in HTML,
and it's in Finnish ( http://www.cs.tut.fi/~jkorpela/kielimerkkaus/ )
and I have no plans for translating it. But don't worry. For the most
part, language markup is an exercise in writing theoretically
correct markup, and even the W3C doesn't take that job seriously on
their own pages (including the pages on language markup).

In particular, it's best to use lang="en". The country specifier hardly
helps anyone in the present world.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #3

On Wed, 21 Jan 2004 07:13:34 -0500, Harlan Messinger
<hm*******************@comcast.net> wrote:
Neal <ne*****@spamrcn.com> wrote:

Been searching around, and found
http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
looking for a guide to what codes are acceptable.

I see stuff like lang="en-us" - that extension, where is that from? Is
there a codification somewhere?


RFC 1766: http://www.ietf.org/rfc/rfc1766.txt


Thanks. The subtag - what I'm getting from all this is that we basically
make up something. I'm not confident that's accurate. Are there set limits
on what the subtag can actually be, aside from the broad types listed in
the document you linked? I'm imagining there must be a list of those
floating around... but I'm not finding them.
Jul 20 '05 #4

Neal <ne*****@spamrcn.com> wrote:
The subtag - what I'm getting from all this is that we basically
make up something.

Well, all the language codes have been made up by some people. The
correct way to define a new subcode for a language code is to register
it at IANA. But two-letter subcodes are reserved for use as country
codes.

I'm not confident that's accurate.

If you register a subcode, it's mostly up to you how accurate your
definition is.

Are there set limits on what the subtag can actually be, aside from
the broad types listed in the document you linked?


See RFC 3066.

On the other hand, why would you use a subcode? Given the fact that
most software that _could_ make use of language markup (such as
browsers, search engines, and page editing tools) make almost no use of
it, and make _wrong_ use at times, even for the most basic and common
language codes like "en" or "de", is there any reason to play with
anything that isn't even registered yet? (I don't expect registration
to do much good per se.)
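
The subtag rules in RFC 3066 are mechanical enough to sketch in a few lines of Python. What follows is only a rough illustration, assuming the RFC's syntax (a primary subtag of 1 to 8 letters, then hyphen-separated subtags of 1 to 8 letters or digits) and its rules for the second subtag. The function name describe_language_tag and the wording of the labels are invented here and come from no standard.

```python
import re

# RFC 3066 syntax: 1-8 letters, then any number of "-" plus 1-8 letters/digits.
TAG_RE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def describe_language_tag(tag):
    """Return (subtag, meaning) pairs for a tag; raise ValueError if malformed.

    The labels are informal summaries written for this sketch.
    """
    if not TAG_RE.fullmatch(tag):
        raise ValueError(f"not a well-formed RFC 3066 tag: {tag!r}")

    subtags = tag.lower().split("-")  # tags are case-insensitive
    primary = subtags[0]
    if primary == "x":
        meaning = "private use"
    elif primary == "i":
        meaning = "IANA-registered tag"
    elif len(primary) == 2:
        meaning = "ISO 639 two-letter language code"
    elif len(primary) == 3:
        meaning = "ISO 639-2 three-letter language code"
    else:
        meaning = "not assigned by RFC 3066"
    described = [(primary, meaning)]

    for position, subtag in enumerate(subtags[1:], start=2):
        if primary in ("x", "i"):
            meaning = "part of a private or registered tag"
        elif position > 2:
            meaning = "no rule in RFC 3066"
        elif len(subtag) == 2 and subtag.isalpha():
            meaning = "ISO 3166 country code"
        elif len(subtag) >= 3:
            meaning = "needs IANA registration"
        else:
            meaning = "not assigned by RFC 3066"
        described.append((subtag, meaning))
    return described
```

With this sketch, describe_language_tag("en-us") labels "us" as a country code, while a made-up tag such as "en-cockney" would be flagged as needing registration. Note that the check is purely syntactic: it does not look up whether a code actually exists in ISO 639 or ISO 3166.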

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #5

Tim
Neal <ne*****@spamrcn.com> wrote:
I see stuff like lang="en-us" - that extension, where is that from? Is
there a codification somewhere?


Harlan Messinger <hm*******************@comcast.net> wrote:
RFC 1766: http://www.ietf.org/rfc/rfc1766.txt


Neal <ne*****@spamrcn.com> wrote:
Thanks. The subtag - what I'm getting from all this is that we basically
make up something. I'm not confident that's accurate. Are there set limits
on what the subtag can actually be, aside from the broad types listed in
the document you linked? I'm imagining there must be a list of those
floating around... but I'm not finding them.


Well, unless you're inventing something new, they're a country code
(e.g. en-us for U.S.A. English, en-au for Australian English, etc.).

--
My "from" address is totally fake. The reply-to address is real, but
may be only temporary. Reply to usenet postings in the same place as
you read the message you're replying to.

This message was sent without a virus, please delete some files yourself.
Jul 20 '05 #6

On Wed, 21 Jan 2004, Neal wrote:
Thanks. The subtag - what I'm getting from all this is that we basically
make up something.

Oh no we don't!!!

I'm not confident that's accurate. Are there set limits
on what the subtag can actually be,


They're country codes per the appropriate ISO specification.

But as usual, the major vendor's dirty tricks department have made
sure that the specified interworking protocol will fail. For example,
someone who has installed their operating system component in Austria
will be presenting "Accept-language: de-AT", as I've seen in our
server logs (no, we don't have any Austrian German pages on our
server, sorry), which is supposed to mean that they accept only
Austrian German. So, even generic German documents appear to be
unacceptable to them, unless they know enough about it to override the
installation defaults.
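
The mismatch can be shown with a small Python sketch. It assumes the HTTP/1.1 matching rule (a language-range matches a tag if it equals the tag, or equals a prefix of the tag that is followed by "-"), so a range of "de-AT" does not match a document tagged plain "de". The function names and the "lenient" fallback are inventions for this illustration; the fallback is not part of the HTTP rule and is not claimed to be what any particular server does.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (language-range, q) pairs, best first."""
    ranges = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a range without a q parameter counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((name.strip().lower(), q))
    # sorted() is stable, so ranges with equal q keep their header order
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def range_matches(language_range, tag):
    """HTTP/1.1 rule: equal, or a prefix of the tag followed by '-'."""
    return (
        language_range == "*"
        or tag == language_range
        or tag.startswith(language_range + "-")
    )


def choose_variant(header, available, lenient=True):
    """Pick one of the available language tags, or None if nothing matches."""
    available = [tag.lower() for tag in available]
    prefs = [(r, q) for r, q in parse_accept_language(header) if q > 0]
    for language_range, _ in prefs:
        for tag in available:
            if range_matches(language_range, tag):
                return tag
    if lenient:
        # Assumption of this sketch, not in the HTTP rule: retry with only
        # the primary subtag, so "de-at" falls back to "de".
        for language_range, _ in prefs:
            primary = language_range.split("-")[0]
            for tag in available:
                if range_matches(primary, tag):
                    return tag
    return None
```

Under the strict rule, choose_variant("de-AT", ["en", "de"], lenient=False) finds nothing, which is the failure described above; with the fallback enabled the same call selects "de".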

I'm sure that's a part of why Jukka advised you that the mechanism
isn't practical for use (no, I can't read Finnish and I don't trust
the babelfish, so I can only guess what's on his page). He's entitled
to his view, but with a bit of pragmatism (all multilingual web pages
should offer _some_[1] way to access alternative languages explicitly)
I'd say it's usable, with a bit of care.

As I've recently discovered: at least it isn't as hopelessly broken as
that same operating system component's implementation of content-type
negotiation. I'd say: aim at the users of any protocol-conforming WWW
browser, while making appropriate provision to pander tolerably to the
operating system component. In that sense, accept-language
negotiation is workable, given a bit of care and attention.

Would this page be of any use? The Apache supporters were kind enough
to cite it, so I suppose it's not too bad, at least as a starting
point: http://ppewww.ph.gla.ac.uk/~flavell/www/lang-neg.html

have fun

[1] No "flags of nations" as markers of language, please. Only
recently I landed on a French web page that insisted on me clicking
the Stars and Stripes to get English. Wibble.
Jul 20 '05 #7

On Wed, 21 Jan 2004, Jukka K. Korpela wrote:
Given the fact that
most software that _could_ make use of language markup (such as
browsers, search engines, and page editing tools) make almost no use of
it, and make _wrong_ use at times, even for the most basic and common
language codes like "en" or "de", [ ... ]


Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.
<http://ppewww.ph.gla.ac.uk/~flavell/charset/browsers-fonts.html>

Jul 20 '05 #8

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.


That's an example of what I meant by _wrong_ use.

If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses. That's why I recommend that lang markup be not used for
transliterated texts. (This violates WAI requirements, since the
language of the text is surely not changed in transliteration. But WAI
pages themselves violate the rule of marking up _all_ language changes,
which they present as a Priority 1 requirement.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #9

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

[ regarding poor defaults in browsers, making a particular dialect
the only alternative declared in Accept-Language: ]
I'm sure that's a part of why Jukka advised you that the mechanism
isn't practical for use (no, I can't read Finnish and I don't trust
the babelfish, so I can only guess what's on his page).

No, my Finnish-only page on language markup doesn't really discuss
content negotiation - which is discussed at
http://www.cs.tut.fi/~jkorpela/multi/
which is available in English too, via content negotiation.

He's entitled to his view, but with a bit of pragmatism (all
multilingual web pages should offer _some_[1] way to access
alternative languages explicitly) I'd say it's usable, with a bit
of care.


That's my view too, actually. But content negotiation, based on
language preferences, is independent of language markup. Content
negotiation works for all media types, not just HTML, and if used for
HTML, it does not make any use of lang markup.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #10

In article <Xn*****************************@193.229.0.31>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.
That's an example of what I meant by _wrong_ use.


Tim Bray mentions "Things you can't do properly in a language-oblivious
way include: Render it on a screen or on paper [...]" as one reason for
including xml:lang in XML.
(http://www.xml.com/axml/notes/WhyLangs.html)
If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses.


In the absence of *script* identification, is Mozilla's behavior really
that foolish? How do you suggest the font heuristics should work with
UTF-8 (that is, when the dominant script can't be guessed from the
encoding)? O(N) character counting over the entire document is not a
good solution as it would interfere with incremental display.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #11

On Thu, 22 Jan 2004, Jukka K. Korpela wrote:
He's entitled to his view, but with a bit of pragmatism (all
multilingual web pages should offer _some_[1] way to access
alternative languages explicitly) I'd say it's usable, with a bit
of care.


That's my view too, actually. But content negotiation, based on
language preferences, is independent of language markup. Content
negotiation works for all media types, not just HTML, and if used for
HTML, it does not make any use of lang markup.


Fully agreed; and I can see now that I was getting the two issues
somewhat tangled. Apologies for any confusion caused.

Jul 20 '05 #12

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.
That's an example of what I meant by _wrong_ use.


It may be unintuitive but I don't consider it wrong. If you have a
document with "charset=UTF-8", both Mozilla/Netscape and Internet
Explorer would display it in the typeface you chose for West European
Latin. (Silly idea BTW.) However, if you have some text with "LANG=ar"
this will be displayed in your preferred Arabic typeface in Mozilla,
which will probably give better results. If you have a document with
"charset=ISO-8859-6", text marked with "LANG=en" will nevertheless
be displayed in your preferred Latin typeface.

The difference is most notable on Mac OS 9, where Arabic and Hebrew
typefaces do *not* contain glyphs for ASCII characters. These are
taken from other [West European] typefaces.

You might inspect these two identical (!) documents
<http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html>
<http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html6>
I haven't used LANG markup for characters of the Arabic script
in order to see the difference between "charset=UTF-8" (*.html)
and "charset=ISO-8859-6" (*.html6). Text marked with "LANG=en" is
always displayed the same in Mozilla/Netscape.
If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses.
On the other hand, you might welcome that "Достоевский" written in
Cyrillic letters is displayed in your preferred Cyrillic typeface.
Anyway, it doesn't make "foolish guesses" but uses *your* preferred
typeface for Cyrillic and ASCII Latin.
That's why I recommend that lang markup be not used for
transliterated texts.


Good idea! I second that.
Jul 20 '05 #13

On Thu, 22 Jan 2004, Henri Sivonen wrote:
If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses.
In the absence of *script* identification,


Well, the writing system is determined by which Unicode characters are
used (unless you're interested in disambiguating the Han unification
for CJK languages, about which I know rather little...). What did you
mean by "in the absence..."? That string "Dostoyevsky" consists
unambiguously of Latin characters! There's no ambiguity about the
"script".

If those characters were Arabic, then it would be useful to choose,
say, a Persian font if it were known that the language is Farsi.

I don't know whether there's a similar desire, if the characters were
Cyrillic, of choosing a Russian font as opposed to any other "Cyrillic
language" font. But as they aren't Cyrillic characters, that
consideration doesn't matter anyway.
is Mozilla's behavior really that foolish?
Yes, in this detail I would have to say it is. Those characters are
clearly Latin characters, per the HTML character model; it makes no
particular sense to display the Latin letters with a Russian flavour -
unless you thought it cosmetically appropriate to do so, but
then you'd suggest font(s) via CSS if that's what you wanted, surely?
How do you suggest the font heuristics should work with UTF-8


What's wrong with displaying Latin characters using the selected Latin
font? And so on.

OK, if the browser had been configured in a perverse way, maybe the
Latin and the Cyrillic fonts would look so massively different that
mixed texts would look silly. But that's a configuration option IMHO.

Remember that in principle in HTML, language and writing system are
meant to be separate attributes. Japanese is still Japanese
(language) when transliterated into Roman characters; conversely
English is still English (language) when transliterated into Japanese
(characters). AFAICT the only exception to this comes indirectly via
Unicode and its Han unification (but I'll stop there).
Jul 20 '05 #14

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
is Mozilla's behavior really that foolish?
Yes, in this detail I would have to say it is.


I do not regard Mozilla's behaviour as foolish. And I think it's
a lot better than IE's behaviour.
Those characters are
clearly Latin characters, per the HTML character model; it makes no
particular sense to display the Latin letters with a Russian flavour -


What do you mean by "Russian flavour"? Is, e.g., Verdana "Russian
flavoured"? Even if some typeface has specific Russian-looking
Cyrillic characters, the ASCII characters can still look quite
ordinary.

<span lang="el">Andreas</span> oops :-)
Jul 20 '05 #15

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
If those characters were Arabic, then it would be useful to choose,
say, a Persian font if it were known that the language is Farsi.


It might be a good idea to extend the euro-centric list
serif, sans-serif, cursive, fantasy
by
naskhi, nastaliq, thuluth
etc.
<http://images.google.com/images?q=naskhi>
<http://images.google.com/images?q=nastaliq>
<http://images.google.com/images?q=thuluth>

Serif and sans-serif have no meaning with the Arabic script.
BTW: Have you ever noticed that the Arabic glyphs in Arial and
Times New Roman are identical?
Jul 20 '05 #16

On Thu, 22 Jan 2004 17:38:17 +0000 (UTC), Jukka K. Korpela
<jk******@cs.tut.fi> wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.


That's an example of what I meant by _wrong_ use.

If I write about <span lang="ru">Dostoyevsky</span>, I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses.


Are you guys saying that if I set a transliterated name with a language
markup, it might change the characters? That'd be so so wrong. Simply
comparing the number of characters in the Latin-transliterated Tchaikovsky
to the Russian Cyrillic spelling - that would become gibberish!

Please tell me I have it wrong.
Jul 20 '05 #17

Neal <ne*****@spamrcn.com> wrote:
Are you guys saying that if I set a transliterated name with a
language markup, it might change the characters?


No, we are saying that it actually changes the _glyphs_ on some
browsers. That is, if you have a letter "D", it will appear in some
visual form, as some glyph, but it may be of a typeface/font different
from the surrounding text. For example, you might see an ordinary word,
written in Latin letters, in the midst of normal text written in Latin
letters, in e.g. Arial font while the text around it is Times New
Roman. Just because you marked it up as what it is, such as a Russian
word.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #18

Henri Sivonen <hs******@iki.fi> wrote:
Tim Bray mentions "Things you can't do properly in a
language-oblivious way include: Render it on a screen or on paper
Oh, what have Web browsers been doing then? They surely have problems
in presenting text _well_, but you seem to be saying that the selection
of a font is among the worst problems. I disagree.
If I write about <span lang="ru">Dostoyevsky</span>, I don't want
the name to appear in a fancy font just because a browser makes
foolish guesses.


In the absence of *script* identification, is Mozilla's behavior
really that foolish?


Yes. It should see immediately that Latin script is used. But in
addition to this, what's the big idea in selecting fonts according to
language? It might make sense for some scripts, like CJK, but only in
cases where the language actually affects the generally preferred
choice of fonts.
How do you suggest the font heuristics should
work with UTF-8 (that is, when the dominant script can't be guessed
from the encoding)?


I don't suggest any font heuristics. There's enough confusion in the
current font settings in browsers, which hopelessly mix up languages,
countries, scripts, character repertoires, fonts and whatever into a
dessert for tag soup. _Documenting_ the behavior would be the best
move. Well, next to making things simple: specify some coherent
sequence of fonts to be tried in succession when trying to display a
character, and let the user change it. And naturally the author can
make his own suggestions. There's no need for a browser to play in that
game with its guesswork (aka heuristics).

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #19

Neal <ne*****@spamrcn.com> wrote:
Are you guys saying that if I set a transliterated name with a language
markup, it might change the characters?


No.
Let's say you've defined Futura as your preferred typeface for West
European Latin and Verdana for Cyrillic. Then Mozilla will display
your document with "charset=ISO-8859-1" or "charset=UTF-8" in Futura
but will display <span lang="ru">Dostoevskij</span> in Verdana.

If you have "charset=ISO-8859-5", everything is displayed in
Verdana - except of course <span lang="en">Bront&euml;</span> ,
which is in Futura.
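
Put as a toy model in Python, the behaviour described above looks roughly like this. It is a sketch of the description in this post only, not of Mozilla's actual code: the tables, the group names "western" and "cyrillic", and the function pick_font are all made up for the illustration, and the font names are simply the example preferences from above.

```python
# Hypothetical user preferences: one typeface per script group.
PREFERRED_FONTS = {"western": "Futura", "cyrillic": "Verdana"}

# Tiny illustrative lookup tables; a real browser would know far more entries.
LANG_TO_GROUP = {"en": "western", "ru": "cyrillic"}
CHARSET_TO_GROUP = {
    "iso-8859-1": "western",
    "utf-8": "western",  # UTF-8 says nothing about script, so "western" is the default
    "iso-8859-5": "cyrillic",
}


def pick_font(charset, lang=None):
    """Choose a typeface: an explicit lang attribute wins over the charset."""
    if lang is not None:
        primary = lang.lower().split("-")[0]
        group = LANG_TO_GROUP.get(primary)
        if group is not None:
            return PREFERRED_FONTS[group]
    # No (known) lang attribute: fall back to what the charset suggests.
    return PREFERRED_FONTS[CHARSET_TO_GROUP.get(charset.lower(), "western")]
```

So pick_font("utf-8", "ru") gives the Cyrillic preference even for text written in Latin letters, and pick_font("iso-8859-5", "en") gives the Latin preference inside an otherwise Cyrillic document.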
Jul 20 '05 #20

On Thu, 22 Jan 2004 09:41:18 +1030, Tim <Ti*@mail.localhost> wrote:
Well, unless you're inventing something new, they're a country code
(e.g. en-us for U.S.A. English, en-au for Australian English, etc.).

Apologies if this has been answered elsewhere, but is there a list of
these codes anywhere? And how necessary are they?

My specific application is a website for an orchestra using many foreign
titles and names. I'm imagining a speech reader will need the language
code to be able to pronounce the word correctly, but perhaps I am off here
as well. At any rate, a country subtag appears to be unimportant, as our
primary market is our US-based audience.

I guess the question distills down to this - what's the proper markup for
the French title "L'arlessienne" or the Czech name "Dvorák" in an
otherwise English document?
Jul 20 '05 #21

Neal <ne*****@spamrcn.com> wrote:
I guess the question distills down to this - what's the proper markup for
the French title "L'arlessienne" or the Czech name "Dvorák" in an
otherwise English document?


<span lang="cs">Dvo&#345;&aacute;k</span>
<span lang="cs">Dvořák</span>
Jul 20 '05 #22

On Thu, 22 Jan 2004 22:53:07 +0000 (UTC), Jukka K. Korpela
<jk******@cs.tut.fi> wrote:
Neal <ne*****@spamrcn.com> wrote:
Are you guys saying that if I set a transliterated name with a
language markup, it might change the characters?


No, we are saying that it actually changes the _glyphs_ on some
browsers. That is, if you have a letter "D", it will appear in some
visual form, as some glyph, but it may be of a typeface/font different
from the surrounding text. For example, you might see an ordinary word,
written in Latin letters, in the midst of normal text written in Latin
letters, in e.g. Arial font while the text around it is Times New
Roman. Just because you marked it up as what it is, such as a Russian
word.

Ok, just so long as it doesn't make it illegible, I can deal with that!
Jul 20 '05 #23

On Fri, 23 Jan 2004 00:35:25 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
Neal <ne*****@spamrcn.com> wrote:
I guess the question distills down to this - what's the proper markup for
the French title "L'arlessienne" or the Czech name "Dvorák" in an
otherwise English document?


<span lang="cs">Dvo&#345;&aacute;k</span>
<span lang="cs">Dvořák</span>

Holy crap, I have been looking for YEARS for &#345...

Where do I find a COMPLETE list of such characters that aren't just the
typical set?
Jul 20 '05 #24

Neal <ne*****@spamrcn.com> wrote:
Holy crap, I have been looking for YEARS for &#345...
You'll have fun in future - there are literally myriads of other
character references to be found.
Where do I find a COMPLETE list of such characters that aren't just
the typical set?


The Unicode standard, or the equivalent ISO 10646 standard. The tricky
part is to find the information you need, and make sure you have
understood it correctly, but see
http://www.cs.tut.fi/~jkorpela/html/unicode.html
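
If a scripting language is at hand, there is no need to look each number up by hand: a decimal character reference is just the character's Unicode code point. Here is a minimal Python sketch; the helper names to_numeric_references and from_references are invented for the example, while html.unescape is the standard library function for decoding references.

```python
import html


def to_numeric_references(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def from_references(text):
    """Decode numeric and named character references back into characters."""
    return html.unescape(text)
```

For the name discussed in this thread, to_numeric_references turns the r with caron (U+0159) into &#345; and the a with acute (U+00E1) into &#225;, and from_references reverses both, named forms such as &aacute; included.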

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #25

Neal <ne*****@spamrcn.com> wrote in message news:<op**************@news.rcn.com>...
I guess the question distills down to this - what's the proper markup for
the French title "L'arlessienne" or the Czech name "Dvorák" in an
otherwise English document?


<span lang="cs">Dvo&#345;&aacute;k</span>
<span lang="cs">Dvořák</span>


Where do I find a COMPLETE list of such characters that aren't just the
typical set?


Try the Unicode site (http://www.unicode.org) or buy their book.

Or look for Unicode with a search engine, which will find lots of
useful sites, including mine.

--
Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)
Jul 20 '05 #26

In article <Xn****************************@193.229.0.31>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Henri Sivonen <hs******@iki.fi> wrote:
Tim Bray mentions "Things you can't do properly in a
language-oblivious way include: Render it on a screen or on paper
Oh, what have Web browsers been doing then? They surely have problems
in presenting text _well_, but you seem to be saying that the selection
of a font is among the worst problems. I disagree.


Choosing a font is only one problem. There are others, including
line breaking. (And I don't just mean "complex" line breaking for
languages such as Thai, but also dynamic hyphenation for European
languages.)
If I write about <span lang="ru">Dostoyevsky</span>, I don't want
the name to appear in a fancy font just because a browser makes
foolish guesses.


In the absence of *script* identification, is Mozilla's behavior
really that foolish?


Yes. It should see immediately that Latin script is used. But in
addition to this, what's the big idea in selecting fonts according to
language?


I can't find a politically correct way of saying this, but there are
pecking orders of language groups within scripts in terms of font
availability and quality. It's unfortunate.

For example Polish looks ugly if some glyphs come from a "Western" font
and others come from a "Central European" font.
It might make sense for some scripts, like CJK, but only in
cases where the language actually affects the generally preferred
choice of fonts.
Chinese text looks ugly if the ideographs that are also used for Japanese
come from a Kanji font while the rest come from a Chinese font.

When you write <span lang="ru">Dostoyevsky</span>, what would you want
recipients to do with the language data? That is, is it actually useful
for transliterated text to come with language data in any existing or
realistic client implementation for any of the purposes you list in
http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ? Is it there just
in case the user is curious and invokes "Properties" in Mozilla in order
to find out that Dostoyevsky is a Russian name?
How do you suggest the font heuristics should
work with UTF-8 (that is, when the dominant script can't be guessed
from the encoding)?


I don't suggest any font heuristics.

[...] And naturally the author can
make his own suggestions. There's no need for a browser to play in that
game with its guesswork (aka heuristics).


Let's suppose I'm writing a content management system and I choose to
use UTF-8 for all output because
1) Prior to serialization the data is in UTF-16 anyway, because
I use Java, so producing UTF-8 or UTF-16 is easier than producing
something else.
2) I want every character that a user might enter in a form arrive
to the server intact (be representable in the encoding used).
Therefore, I have to use UTF-*.

What advice should I provide authors who want to use the system for
publishing Polish or Chinese text? How should they make their
suggestions?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #27

On Fri, 23 Jan 2004, Henri Sivonen wrote:

[addressing Jukka, but I shall offer an answer anyway ;-) ]
When you write <span lang="ru">Dostoyevsky</span>, what would you want
recipients to do with the language data?
If they are browsers, my answer would be "probably nothing". If they
are indexers, summarisers etc. then the answer would be different.
That is, is it actually useful
for transliterated text to come with language data in any existing or
realistic client implementation [...]


In theory, the /markup/ depends on the structure and attributes of the
content - it isn't *supposed* to be done with the intention of
producing a particular result on a particular client agent (that job
is delegated to stylesheet/s).

In theory, of course, theory and practice are the same, but in
practice....

So when you are raising issues of this kind, it might be useful if you
would make clear whether you have in mind the theoretical ideal, or
rather some particular practical issue related to current browsers and
other kinds of client agent.

Remark: IBM HPR will use different pronunciations depending on the
language markup, to take just one example (which is actually
irrelevant here, since it didn't offer Russian as an option, and I've
no idea what it would do with Russian-transliterated-into-Roman-
letters even if it did). But nevertheless, it's an interesting
what-if question, isn't it?
Jul 20 '05 #28

"Jukka K. Korpela" <jk******@cs.tut.fi> wrote in message news:<Xn*****************************@193.229.0.31 >...
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla/Netscape uses the value of the LANG attribute to determine
the typeface in which the corresponding text is displayed.
That's an example of what I meant by _wrong_ use.

If I write about <span lang="ru">Dostoyevsky</span>,


I don't mean to sound ignorant, but what's the logic behind using
language mark-up for proper nouns?
I don't want the
name to appear in a fancy font just because a browser makes foolish
guesses. That's why I recommend that lang markup be not used for
transliterated texts.


Presumably in an ideal mark-up language, language and script would be
independent attributes (and that way I'd have some sort of mark-up to
put around my IPA sections...)?

--- Safalra (Stephen Morley) ---
http://www.safalra.com/hypertext
Jul 20 '05 #29

On Thu, 22 Jan 2004, Neal wrote:
Holy crap, I have been looking for YEARS for &#345...
Where do I find a COMPLETE list of such characters that aren't just the
typical set?


You probably don't need a complete list. For a start, look at
<http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html>

Set the encoding to "charset=UTF-8".
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
to suit Netscape 4 and perhaps other older browsers.

Jul 20 '05 #30

Safalra:
If I write about <span lang="ru">Dostoyevsky</span>,
I don't mean to sound ignorant, but what's the logic behind using
language mark-up for proper nouns?
In this case the only need for the markup is the need to indicate the
language of that proper noun. That's why the otherwise meaningless
element "span" has been used. It's just there in order to make it
possible to add the attribute "lang" that conveys the information that
the language in question is Russian.

If at the same time that name would have constituted a citation (to some
work by Dostoyevsky) then the following would have been appropriate:

<cite lang="ru">Dostoyevsky</cite>
Presumably in an ideal mark-up language, language and script would be
independent attributes (and that way I'd have some sort of mark-up to
put around my IPA sections...)?


Indicating the script of a piece of text would be just as meaningful as
the following (using the fictitious attribute "text"):

<span text="book">book</span>

The text is already there as content, so there is of course absolutely
no need to indicate it with an attribute as well.

This

<span script="latin">book</span>

would be just as stupid. The text string "book" can't be anything else
but Latin script. If it wasn't Latin script, then it wouldn't consist of
the four Latin script characters "b", "o", "o" and "k", would it?

In the same way "Dostoyevsky" (written exactly like that) is written in
Latin script. There is no need (or should be no need) to tell the
browser what it already knows.
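
That the script can be read off the characters themselves is easy to check. The Python sketch below guesses the script from the first word of each letter's Unicode character name. This is a shortcut assumed for the example (the standard unicodedata module does not expose the real Unicode Script property), and the function name scripts_used is made up here.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names guessed from Unicode character names.

    Heuristic: the first word of a letter's name, e.g. "LATIN" from
    "LATIN SMALL LETTER B". Non-letters and unnamed characters are skipped.
    """
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts
```

Run on "Dostoyevsky" this reports only LATIN, and on the same name spelled in Cyrillic letters only CYRILLIC, with no lang attribute involved in either case.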

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #31

P: n/a
On Fri, 23 Jan 2004, Henri Sivonen wrote:
For example Polish looks ugly if some glyphs come from a "Western" font
and others come from a "Central European" font.


This is especially true for Macintosh and Unix.
MS Windows users probably never encounter this problem - don't even know
that it exists.

I remind you of
<http://www.unics.uni-hannover.de/nhtcapri/temp/face-arial.gif>
It just comes into my mind that
<p lang="en"> ... <span lang="zh">Mao Zedong</span> ...
may give funny-looking results in Mozilla/Netscape.
So you better use LANG markup only with the original script.

Jul 20 '05 #32

Andreas Prilop:
It just comes into my mind that
<p lang="en"> ... <span lang="zh">Mao Zedong</span> ...
may give funny-looking results in Mozilla/Netscape.
So you better use LANG markup only with the original script.


Funny-looking results are the least of your problems if you use such
mark-up.

Windows users (Explorer or Mozilla) might get a prompt to download a
Chinese language pack in order to read that text - although there are no
Chinese characters in it. Some will probably suppose that the computer
has a virus (maybe from your web page).

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #33

On Fri, 23 Jan 2004, Safalra wrote:
If I write about <span lang="ru">Dostoyevsky</span>,
I don't mean to sound ignorant, but what's the logic behind using
language mark-up for proper nouns?


It's a fair question! Would you care to debate the topic as if
the example had been e.g. <span lang="ru">glasnost</span> instead?

Presumably in an ideal mark-up language, language and script would be
independent attributes


Well, they are defined to be independent in HTML (begging the question
whether HTML is an "ideal" mark-up language ;-)

(and that way I'd have some sort of mark-up to
put around my IPA sections...)?


In what sense do you not have? Such a markup would be entirely proper
in HTML.

Any language dependence re-enters only indirectly via Unicode, but as
far as HTML is concerned, writing system (script) and language are
independent properties.

Some browsers, as we've discussed, use language as a hint for font
selection, but that's an issue of cosmetics, it is NOT allowed to
cause any change in the actual characters displayed: the notorious
<font face="Dingbats"> etc. is a bogosity of the first water, as far
as HTML4 is concerned (exceeded only by the corresponding bogosity in
CSS), and I'm glad to see Mozilla resisting misguided demands to "make
it work" (i.e. to break it so that it appears to do what the misguided
author intended).
Jul 20 '05 #34

On Fri, 23 Jan 2004, Andreas Prilop wrote:
Set the encoding to "charset=UTF-8".
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
to suit Netscape 4 and perhaps other older browsers.


Perhaps we should say "old-ish browsers".

There have been browsers which would understand e.g. iso-8859-7 Greek
mixed with Latin-1 entities such as &uuml; , but would not understand
utf-8 - that was true of 16-bit IE3.01 if my memory serves me right.
They would need the approach described in #s5 in order to display such
material correctly.

Then, as you say, there would be NN4.* browsers, which in general
don't understand #s5, but do understand #s6

Browsers which are even older, might not understand either. Indeed
there's one "browser" in use today that doesn't seem to understand
either: WebTV treats all encodings as a somewhat crippled form of
Windows-1252, if its developer simulation is accurate!

Since none of the affected browsers sends a meaningful Accept-charset,
I would rule out the idea of using content negotiation to choose the
right option. Since I'm fundamentally opposed to negotiating on the
basis of client agent strings, that leaves only a manual selection, if
you really have such challenging content -and- you care about such
elderly browsers.

My recommendation at the present time would be to use utf-8 (as per
#s6 or #s7 whichever is convenient to the author) for such material
(thus covering not only any RFC2070-conforming browser but also the
remaining NN4.* stragglers), and forget the remaining antique browser
versions. They're just too old to lose sleep over, by now.

Not that I would deliberately repel them if the material was
accessible to them; but sometimes the material by its very nature
requires a rich character repertoire, and then I think such an action
is justifiable.
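
To make the two utf-8 options concrete, here is a minimal Python sketch. It assumes #s6 refers to a page declared as utf-8 whose non-ASCII characters are written as &#number; references, and #s7 to a page carrying the characters as real utf-8 bytes; that reading of the checklist is an assumption, and the sample name and both function names are purely illustrative.

```python
# Illustrative sample: "Dvořák". U+0159 (r with caron) is decimal 345,
# the same character as the &#345; mentioned earlier in the thread.
SAMPLE = "Dvo\u0159\u00e1k"


def as_char_refs(text: str) -> str:
    """ASCII-only form: each non-ASCII character becomes a &#number; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def as_utf8_bytes(text: str) -> bytes:
    """Real utf-8 form: each character is sent as its utf-8 byte sequence."""
    return text.encode("utf-8")


if __name__ == "__main__":
    print(as_char_refs(SAMPLE))
    print(as_utf8_bytes(SAMPLE))
```

Both forms describe the same characters; the first simply survives transports and editors that can only handle ASCII.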
Jul 20 '05 #35

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
Would you care to debate the topic as if
the example had been e.g <span lang="ru">glasnost</span> instead ?


Hmm, let's take <span lang="ru">vodka</span>, da?
And that makes me ponder whether 'tis nobler in the mind to write
<span lang="en-SC">whisky</span>
<span lang="en-IE">whiskey</span>
;-)
Jul 20 '05 #36

On Thu, 22 Jan 2004 19:45:53 +0000, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
If those characters were Arabic, then it would be useful to choose,
say, a Persian font if it were known that the language is Farsi.


Or, for a possibly better example, to choose a nastaliq font (the kind
that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #37

Philip Newton <pn*************@newton.digitalspace.net> wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
If those characters were Arabic, then it would be useful to choose,
say, a Persian font if it were known that the language is Farsi.


Or, for a possibly better example, to choose a nastaliq font (the kind
that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.


That ain't a better example - it's the same example. Both Persian and
Urdu would prefer a nast'aliq typeface.
Jul 20 '05 #38

In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 23 Jan 2004, Henri Sivonen wrote:

[addressing Jukka, but I shall offer an answer anyway ;-) ]
When you write <span lang="ru">Dostoyevsky</span>, what would you want
recipients to do with the language data?

If they are browsers, my answer would be "probably nothing". If they
are indexers, summarisers etc. then the answer would be different.


What would your answer be in that case?

That is, is it actually useful
for transliterated text to come with language data in any existing or
realistic client implementation [...]


In theory, the /markup/ depends on the structure and attributes of the
content - it isn't *supposed* to be done with the intention of
producing a particular result on a particular client agent (that job
is delegated to stylesheet/s).

So when you are raising issues of this kind, it might be useful if you
would make clear whether you have in mind the theoretical ideal, or
rather some particular practical issue related to current browsers and
other kinds of client agent.


I'm interested in realistic and practical use cases (for which software
support exists or realistically could exist in a useful way).

Having been involved in a couple of metadata-related projects myself,
I've observed that there's a tendency towards developing metadata fields
that seem nice to have but would require either more labor to fill
than the supposed benefit is worth or would require the processing
software to pass the Turing test as a side effect. That's why I like to
call for realistic use cases when metadata is discussed.

Remark: IBM HPR will use different pronunciations depending on the
language markup, to take just one example (which is actually
irrelevant here, since it didn't offer Russian as an option, and I've
no idea what it would do with Russian-transliterated-into-Roman-
letters even if it did). But nevertheless, it's an interesting
what-if question, isn't it?


The question gets even more interesting if the surrounding language
causes the foreign name to look different due to flexion. Does it get so
interesting that we are sliding towards the Turing test?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #39

Tim
Tim <Ti*@mail.localhost> wrote:
Well, unless you're inventing something new, they're a country code
(e.g. en-us for U.S.A. English, en-au for Australian English, etc.).


Neal <ne*****@spamrcn.com> wrote:
Apologies if this has been answered elsewhere, but is there a list of
these codes anywhere?


Yes.

I don't know it off hand, or I'd mention it. Try searching for "country
codes."

And how necessary are they?


Generally, they're not (e.g. it doesn't make any difference to
understanding this text whether it's Australian, British, or American
English, though it can help with a spell checker). And the RFC that's
previously been mentioned in this thread goes as far as to comment that
sometimes they may cause more problems.

My specific application is a website for an orchestra using many foreign
titles and names. I'm imagining a speech reader will need the language
code to be able to pronounce the word correctly, but perhaps I am off here
as well. At any rate, a country subtag appears to be unimportant, as our
primary market is our US-based audience.


I'd make a hazardous guess that the speech synthesiser will still get
things wrong. English ones certainly do; although many other languages
do play by the rules a lot better than English does, you're never quite
sure how to pronounce someone's name.
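
For anyone who wants to separate the language from the optional country part of a tag like "en-us", a minimal Python sketch follows. It assumes the RFC 1766 / RFC 3066 conventions cited earlier in the thread: tags are case-insensitive, and a two-letter second subtag is read as a country code. The function name is hypothetical, and the sketch does not validate either part against any code list.

```python
from typing import Optional, Tuple


def split_lang_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a language tag into (primary language, country or None).

    Only a two-letter alphabetic second subtag is treated as a country
    code; any other second subtag is ignored by this sketch.
    """
    parts = tag.strip().lower().split("-")
    primary = parts[0]
    country = None
    if len(parts) > 1 and len(parts[1]) == 2 and parts[1].isalpha():
        country = parts[1].upper()
    return primary, country


if __name__ == "__main__":
    for example in ("en-us", "en-AU", "ru", "zh-Hant"):
        print(example, split_lang_tag(example))
```

For a site like the orchestra one described above, keeping only the primary part ("en", "ru", and so on) matches the advice that the country subtag is usually unnecessary.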

--
My "from" address is totally fake. The reply-to address is real, but
may be only temporary. Reply to usenet postings in the same place as
you read the message you're replying to.

This message was sent without a virus, please delete some files yourself.
Jul 20 '05 #40

Bertilo Wennergren <be******@gmx.net> wrote in message news:<bu*************@news.t-online.com>...
Safalra:
If I write about <span lang="ru">Dostoyevsky</span>,
I don't mean to sound ignorant, but what's the logic behind using
language mark-up for proper nouns?


In this case the only need for the markup is the need to indicate the
language of that proper noun. That's why the otherwise meaningless
element "span" has been used. It's just there in order to make it
possible to add the attribute "lang" that conveys the information that
the language in question is Russian.


But what if the proper noun had been 'Natasha'? That's a Russian name,
but should I mark it up as such if the Natasha in question is not
Russian?

Presumably in an ideal mark-up language, language and script would be
independent attributes (and that way I'd have some sort of mark-up to
put around my IPA sections...)?


[snip]
<span script="latin">book</span>
would be just as stupid. The text string "book" can't be anything else
but Latin script. If it wasn't Latin script, then it wouldn't consist of
the four Latin script characters "b", "o", "o" and "k", would it?


What if it's IPA? Most Latin characters are present in IPA, but many
(vowels in particular) represent different sounds from what they would
in English, for example. A speech browser would need to know to
pronounce the word using IPA phonemes rather than English. Given some
time, I'm sure I could find an example of an English word that when
written in IPA uses the same characters as another English word. In
that case, script would need to be indicated.

--- Safalra (Stephen Morley) ---
http://www.safalra.com/hypertext
Jul 20 '05 #41

Safalra:
Bertilo Wennergren
In this case the only need for the markup is the need to indicate the
language of that proper noun. That's why the otherwise meaningless
element "span" has been used. It's just there in order to make it
possible to add the attribute "lang" that conveys the information that
the language in question is Russian.

But what if the proper noun had been 'Natasha'? That's a Russian name,
but should I mark it up as such if the Natasha in question is not
Russian?


You decide what language the text is in. There are difficult cases. You
as the author have to make a decision.

<span script="latin">book</span>
would be just as stupid. The text string "book" can't be anything else
but Latin script. If it wasn't Latin script, then it wouldn't consist of
the four Latin script characters "b", "o", "o" and "k", would it?

What if it's IPA? Most Latin characters are present in IPA, but many
(vowels in particular) represent different sounds from what they would
in English, for example. A speech browser would need to know to
pronounce the word using IPA phonemes rather than English. Given some
time, I'm sure I could find an example of an English word that when
written in IPA uses the same characters as another English word. In
that case, script would need to be indicated.


True. There are exceptions.

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #42

On 24 Jan 2004 03:06:06 -0800, Safalra <us****@safalra.com> wrote:
Given some
time, I'm sure I could find an example of an English word that when
written in IPA uses the same characters as another English word. In
that case, script would need to be indicated.

IPA \bit\ is pronounced "beet." \robot\ is "rowboat," though with a
European r. The unadorned IPA vowels are pronounced in a Latin fashion,
unlike common English pronunciation where many such vowels are short.

I recall something from the recommendations saying that authors should in
some cases provide pronunciation help to a speech reader. Apologies for
not remembering the exact context, perhaps someone else recalls it as
well. Has W3C adopted any manner to do this?
Jul 20 '05 #43

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Let's say you've defined Futura as your preferred typeface for West
European Latin and Verdana for Cyrillic. Then Mozilla will display
your document with "charset=ISO-8859-1" or "charset=UTF-8" in Futura
but will display <span lang="ru">Dostoevskij</span> in Verdana.


I just realized that there's similar absurdity in IE, though at a
different level. Maybe it could be described just as documentation
error: If you go to Internet settings and select Fonts, IE lets you
specify the font used for various "character sets". These sets are
named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
you realize that it's the _encoding_ that matters.

That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
page content as "Cyrillic", no matter what characters and what language
it actually contains. Similarly, if I specify a particular font for
"Cyrillic character set" and access a UTF-8 encoded page, IE does _not_
use that font for Cyrillic letters on the page. It seems to treat the
page content as "Latin based".

It's an interesting guessing game. It indirectly affects authoring in
the sense that the choice of an encoding has implications on fonts,
though only on pages that do not set font family (except when the user
overrides such settings), and in a rather unpredictable situation - the
defaults for the font settings in browsers for different "character
sets" presumably vary, and if users change them, they probably do so in
the dark, more or less, since few people know what's going on in those
settings.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #44

Henri Sivonen <hs******@iki.fi> wrote:
Choosing a font is only one problem. There are others including
line breaking.


Of course the _quality_ of rendering on screen or paper can be affected
by such processes. My point was that browsers have been able to present
documents without knowing the language, and they keep doing so (even
now, when they could in principle get the language information from
some pages, and they always had the option of recognizing language from
actual content - something that Google does with rather good rate of
success, no matter what we think about the idea in principle).

(Line breaking makes my head ache. The Unicode line breaking rules are
very complex and largely absurd, and browsers are now competing in
implementing some of the worst parts in a wrong way. But I digress.)

When you write <span lang="ru">Dostoyevsky</span>, what would you
want recipients to do with the language data?


Nothing particular. I'm just giving (meta)information. In a sense, here
I'm intentionally more papal than the pope - I am applying an
unconditional Priority 1 WAI guideline that the WAI itself violates.

And as I wrote, I don't recommend doing that in practice - but not
because the idea would be wrong. It's the Mozilla misbehavior that
makes it currently impractical.

That is, is it
actually useful for transliterated text to come with language data
in any existing or realistic client implementation for any of the
purposes you list in
http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ?


(What I list there is basically the reasons given in HTML 4
specification and in WCAG 1.0, with some explanations of mine.)

In any existing implementation, most probably not. As we know, there
are very few existing implementations that make use of lang attributes,
and there are implementations that draw wrong conclusions from them.

In a realistic implementation, why not? Of course they would need to
know or guess the transliteration method, but there's nothing that
prevents them from making educated guesses, except that it means quite
some work. And the metainformation about transliteration could even be
transmitted in an HTTP header. Of course this is hypothetical, but so
is most talk about the utilization of lang attributes.

Is it there
just in case the user is curious and invokes "Properties" in
Mozilla in order to find out that Dostoyevsky is a Russian name?


Well, that's one actual usage of the information. And nothing to be
frowned upon, since when users find the right-click info features,
they will start using them. If you don't use lang markup for a name, it
will naturally report the language according to the lang attribute of
the enclosing element, i.e. give wrong information. In fact, on such
grounds, an extremist (?) could say that if lang markup is used at all,
it should be comprehensive. If you say nothing about language, you are
not giving wrong information. But if you say e.g. <html lang="en">,
then you _are_ claiming that each and every word in the document is in
English, unless stated otherwise in lang attributes for inner elements.
(Quite a job, isn't it? Often you don't even know the language of a
name. I guess we should use lang="und" then.)

Let's suppose I'm writing a content management system and I choose
to use UTF-8 for all output - -
What advice should I provide authors who want to use the system for
publishing Polish or Chinese text? How should they make their
suggestions?


You mean for fonts? By using font properties in CSS. As far as I can
see, this would be sufficient for defeating Mozilla's misbehavior.

I don't see how lang attributes would help in practice, though it would
be OK to declare the language as a preparation for the future.
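
The inheritance point made above (text is reported in the language of the enclosing element unless an inner element says otherwise) can be sketched in Python with the standard html.parser module. This is only an illustration under stated assumptions: the class and function names are hypothetical, the markup is assumed to be well nested with explicit end tags, and the second name in the sample is added just to show a name that inherits "en" because it was left unmarked.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Collect (effective_lang, text) pairs, inheriting lang from ancestors."""

    # Elements that never have an end tag, so they must not touch the stack.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = [None]  # effective language per open element; None = undeclared
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        # Use the element's own lang if present, else inherit from the parent.
        lang = dict(attrs).get("lang") or self.stack[-1]
        self.stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in self.VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))


def effective_languages(markup):
    """Return the list of (language, text) runs found in the markup."""
    tracker = LangTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.runs


if __name__ == "__main__":
    sample = ('<p lang="en">I read <span lang="ru">Dostoyevsky</span> '
              'and <span>Tolstoy</span>.</p>')
    for lang, text in effective_languages(sample):
        print(lang, repr(text))
```

The unmarked name comes out as "en", which is exactly the "wrong information" described above; text outside any element with a lang attribute comes out as None, i.e. nothing is claimed.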

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #45

Bertilo Wennergren <be******@gmx.net> wrote:
In the same way "Dostoyevsky" (written exactly like that) is
written in Latin script. There is no need (or should be no need)
telling the browser what it already knows.


It is written in Latin letters, but the word "script" is somewhat
confusing here. There are many different systems of transliterating
Russian names, even in one country, and this is a constant source of
confusion. So the information needed for correct analysis of the word
would include information about the particular transliteration method.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #46

Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Hmm, let's take <span lang="ru">vodka</span>, da?


An interesting proposal. :-) In fact, the word "vodka" could be
regarded as a Russian word, or as a loanword of Russian origin used in
English or some other language. Thus, the markup above could be
construed as an author's expression for the intent of reading it as a
genuinely Russian word, pronounced the Russian way (reading its "d" as
unvoiced, "t", etc.), as far as possible. Needless to say, it is
overoptimistic to expect user agents to understand such finer points
very soon.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #47

On Sat, 24 Jan 2004, Jukka K. Korpela wrote:
I just realized that there's similar absurdity in IE, though at a
different level. Maybe it could be described just as documentation
error: If you go to Internet settings and select Fonts, IE lets you
specify the font used for various "character sets". These sets are
named as Latin, Greek, Cyrillic, etc. This seems to make sense, until
you realize that it's the _encoding_ that matters.


It seems you may have observed part of the problem, and I've observed
a different part of the problem. Could I persuade you to take a look
at my observations in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html , in
the part that relates to Win IE, and see how well it fits your own
observations?

That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
page content as "Cyrillic", no matter what characters and what language
it actually contains.


The language attribute in HTML also has an influence: some examples
are shown on my page.

As I say, it could be that each of us is only seeing part of the
picture. With hindsight, some of my observations might only be
accurate in relation to pages that are advertised as utf-8.

Similarly, if I specify a particular font for "Cyrillic character
set" and access a UTF-8 encoded page, IE does _not_ use that font
for Cyrillic letters on the page.


That depends...

It seems to treat the page content as "Latin based".


That will not happen if you choose a Latin font which contains no
Cyrillic characters (use the MS font properties extension to view the
relevant properties of the font).

As I recall, I can make it use for Cyrillic the font that I configured
for Greek, if I choose a Latin font which has no Cyrillic.

It's an interesting guessing game.


I've set out my guess on the above page. The writing systems are set
out in an ordered list, and my guess was that it works its way down
this list until it finds a font which contains support for the desired
writing system (even if the chosen font's support is incomplete
relative to the one which was configured for that writing system!).

It indirectly affects authoring in the sense that the choice of an
encoding has implications on fonts, though only on pages that do not
set font family (except when the user overrides such settings),


Well, sort-of. The primary guideline is surely to mark up the
document accurately, and leave the client agent to do the best job
that its authors were capable of? But yes, sometimes it's opportune
for document authors to make some allowances for known browser
shortcomings.

However, here the most usual proposal is that authors should offer a
font, or rather a selection of fonts, that the author found to be
viable. Unfortunately, in every case where this has been
investigated, while the suggestion of a font can improve the results
for some subset of browsers, it can make matters worse, sometimes a
lot worse, for some other subset of browsers. So much so that in this
kind of multi-script situation, I would recommend readers who are
having difficulties with the default settings, to try reconfiguring
their browser to ignore any author-specified fonts and work with their
own font defaults for best results.
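
The guess described above (walk an ordered list of writing systems and take the first configured font that has any support for the wanted script) can be written down as a small Python sketch. Everything here is hypothetical: the function name, the font names, the list order and the coverage data are invented for illustration, and the logic models only the guess made in this post, not documented browser behaviour.

```python
def pick_font(script, configured, coverage):
    """Return the first configured font that covers `script`, or None.

    configured: ordered list of (writing_system, font_name) pairs, standing
                in for the browser's per-writing-system font settings.
    coverage:   dict mapping font_name to the set of scripts it has glyphs for.
    """
    for _writing_system, font in configured:
        if script in coverage.get(font, set()):
            return font
    return None


if __name__ == "__main__":
    # Hypothetical settings: the Latin font has no Cyrillic, the font
    # configured for Greek happens to include some Cyrillic glyphs.
    configured = [
        ("Latin", "LatinOnlyFont"),
        ("Greek", "GreekPlusFont"),
        ("Cyrillic", "CyrillicFont"),
    ]
    coverage = {
        "LatinOnlyFont": {"Latin"},
        "GreekPlusFont": {"Latin", "Greek", "Cyrillic"},
        "CyrillicFont": {"Cyrillic"},
    }
    print(pick_font("Cyrillic", configured, coverage))
```

With these invented settings the Cyrillic text ends up in the font configured for Greek, which mirrors the observation reported above.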
Jul 20 '05 #48

On Sat, 24 Jan 2004, Alan J. Flavell wrote:
That will not happen if you choose a Latin font which contains no
Cyrillic characters (use the MS font properties extension to view the
relevant properties of the font).


Oh, perhaps an easier way to do this is to visit IE's font defaults
menu (tools> internet options> general> fonts). When you try to
select a particular language script (i.e. writing system), IE will
present a menu of the available fonts for that language script. By a
process of elimination, the fonts which are not included in that list
do not support the script in question.

And immediately we see the trap! When I carried out my tests in
Win/NT4, the Book Antiqua font provided there did not support Greek
nor Cyrillic. But now that I repeat the test in Win2K, well, you
guessed it: this font, with the same name, supports also Greek and
Cyrillic. Ho hum.

Jul 20 '05 #49

On Sat, 24 Jan 2004, Alan J. Flavell wrote:

[Jukka wrote:]
That is, if you have e.g. charset=iso-8859-5, IE classifies the whole
page content as "Cyrillic", no matter what characters and what language
it actually contains.


The language attribute in HTML also has an influence: some examples
are shown on my page.


Please accept my apologies on this particular point. I now realise I
was misremembering _that_ specific behaviour: it was in fact seen in
Mozilla, not MSIE.

Jul 20 '05 #50
