Lang attribute values


Been searching around, and found
http://www.w3.org/WAI/ER/IG/ert/iso639.htm which is great, as I've been
looking for a guide to what codes are acceptable.

I see stuff like lang="en-us" - that extension, where is that from? Is
there a codification somewhere?
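
For what it's worth, the "en-us" form comes from the language-tag syntax that HTML 4 borrows from RFC 1766 (since revised; BCP 47 is its current home): a primary ISO 639 language code, optionally followed by hyphen-separated subtags, most commonly an ISO 3166 country code. A minimal Python sketch of that split follows; `split_lang_tag` is a hypothetical helper, not part of any library, and it does no validation against the registries.

```python
def split_lang_tag(tag):
    """Split a language tag such as "en-us" into (primary, subtags).

    Tags are case-insensitive, so everything is lowercased first.
    This only splits on hyphens; it does not validate the codes.
    """
    parts = tag.strip().lower().split("-")
    return parts[0], parts[1:]


print(split_lang_tag("en-us"))  # primary "en", one subtag "us"
print(split_lang_tag("de"))     # primary "de", no subtags
```
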
Jul 20 '05
Jukka K. Korpela:
Bertilo Wennergren <be******@gmx.net> wrote:
In the same way "Dostoyevsky" (written exactly like that) is
written in Latin script. There is no need (or should be no need)
telling the browser what it already knows.

It is written in Latin letters, but the word "script" is somewhat
confusing here. There are many different systems of transliterating
Russian names, even in one country, and this is a constant source of
confusion. So the information needed for correct analysis of the word
would include information about the particular transliteration method.


Indeed "script" is a vague term, but I don't think we should mix it with
"transcription system". There are several systems of Latin transcription
of Japanese. They all use Latin script.

But if there were a script attribute, its value could of course consist
of things like "la" (Latin) "la-hep" (Latin script, Hepburn
transcription of Japanese), and also "ipa", "ipa-wide", "ipa-narrow"
etc. Or there could be another attribute for transcription systems.

That would all probably be a bit too much for HTML though.

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #51
Bertilo Wennergren <be******@gmx.net> wrote:
Indeed "script" is a vague term, but I don't think we should mix it
with "transcription system".
My point was that "script" in the vague sense has really no relevance
to markup whereas writing system has. When Russian is written in Latin
letters (using transliteration, basically, and not transcription), it
is a system of writing Russian. It can be viewed as consisting of a
composition of the normal writing system of Russian and a
transliteration method, but that's a different aspect.
But if there were a script attribute, its value could of course
consist
of things like "la" (Latin) "la-hep" (Latin script, Hepburn
transcription of Japanese), and also "ipa", "ipa-wide",
"ipa-narrow" etc.
No, "la" would not identify a writing system - it would refer to a
family of character repertoires, more or less, which is at a completely
different conceptual level. I can understand the idea of using "Latin",
"Cyrillic" etc., because there are languages that have or have had
writing systems that basically differ in the use of the base system of
letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
possibility, and - as mentioned in this thread - it is relatively
obvious even without such metainformation whether e.g. some fragment of
Russian is written in Latin or Cyrillic letters. What is _not_ so
obvious, in many cases, is the specific writing system (e.g., "old" and
"new" Russian orthography, or the choice of a particular
transliteration method).
That would all probably be a bit too much for HTML though.


Some of the IANA registered "language subcodes" actually identify
writing systems. This indicates at least some subjective need for
specifying the writing system. But it's a wrong approach.

The situation is somewhat complex, though, since an orthography reform
is often coupled with some change of language, or could be _viewed_ as
creating a version of a language. But logically orthography is
orthogonal to dialect, jargon, and other variation reflected in a
language subcode.

Does someone really think that a new version of the German language has
been or is being created by the orthography reform that was officially
started in 1998? I don't think so. For adequate use of language
information, e.g. in spelling checking, orthography is relevant, but it
should be specified separately.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #52
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
It seems you may have observed part of the problem, and I've
observed a different part of the problem. Could I persuade you to
take a look at my observations in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html ,
in the part that relates to Win IE, and see how well it fits your
own observations?
Now that I looked at that page again, I realized that it describes
(among other things) the problem I tried to explain. I had read it but
probably forgotten it, since it had not really caused me trouble. But
now it had.
However, here the most usual proposal is that authors should offer
a font, or rather a selection of fonts, that the author found to be
viable. Unfortunately, in every case where this has been
investigated, while the suggestion of a font can improve the
results for some subset of browsers, it can make matters worse,
sometimes a lot worse, for some other subset of browsers.


In situations where the author knows that some font(s) that are
relatively commonly installed contain the characters he uses in a
document, I think it is reasonable to write a font-family suggestion
for body if the font is qualitatively acceptable. I'm naturally
referring to situations where a rich character repertoire is used, so
that we know that common browsers with common default settings will
fail to render all the characters. As a rough rule of thumb, if you use
characters that are not present in Times New Roman, consider suggesting
body { font-family: "Arial Unicode MS"; }
maybe with some other fonts too, if you have checked that each of them
has all the characters you're using.

The sure gain is that a large number of IE users will be able to read
the page without difficulty. The potential loss is that users who
actually have a qualitatively better font in their system and a browser
configured to use it will need an extra action to override the page
settings. I don't like the loss, but I think it's acceptable.

But I recently encountered a problem where Arial Unicode MS is not
sufficient. Not knowing what to do, I decided to make no font
suggestions for the text, since anything I considered would have sure
and considerable drawbacks as well. (This is one of the cases where
creating a PDF alternative is almost a must.)

It's unfortunate that Code2000 is qualitatively so awful. I could
accept it as the fallback font to be used for those characters that are
not present in any other font, but copy text looks horrendous in
Code2000. But using font-family: "Arial Unicode MS", "Code2000" does
not work the defined way on IE, and it makes things worse when a
browser implements it correctly and has both Code2000 and some better
very-large-repertoire font installed.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #53
Jukka K. Korpela:
Bertilo Wennergren <be******@gmx.net> wrote:
But if there were a script attribute, its value could of course
consist
of things like "la" (Latin) "la-hep" (Latin script, Hepburn
transcription of Japanese), and also "ipa", "ipa-wide",
"ipa-narrow" etc. No, "la" would not identify a writing system - it would refer to a
family of character repertoires, more or less, which is at a completely
different conceptual level.
I think we're agreeing here.
I can understand the idea of using "Latin",
"Cyrillic" etc., because there are languages that have or have had
writing systems that basically differ in the use of the base system of
letters (e.g., Latin, Cyrillic, or Arabic). But that's just one
possibility, and - as mentioned in this thread - it is relatively
obvious even without such metainformation whether e.g. some fragment of
Russian is written in Latin or Cyrillic letters. What is _not_ so
obvious, in many cases, is the specific writing system (e.g., "old" and
"new" Russian orthography, or the choice of a particular
transliteration method).
True.
That would all probably be a bit too much for HTML though.

Some of the IANA registered "language subcodes" actually identify
writing systems. This indicates at least some subjective need for
specifying the writing system. But it's a wrong approach.

The situation is somewhat complex, though, since an orthography reform
is often coupled with some change of language, or could be _viewed_ as
creating a version of a language. But logically orthography is
orthogonal to dialect, jargon, and other variation reflected in a
language subcode.
That would seem to mean that a separate attribute "orthography" with a
value from a wide range of codes for various writing systems used for
various languages, would make sense.
Does someone really think that a new version of the German language has
been or is being created by the orthography reform that was officially
started in 1998? I don't think so. For adequate use of language
information, e.g. in spelling checking, orthography is relevant, but it
should be specified separately.


So "<span lang='de' orthography='de-neu'>Schloss</span>" would in
principle be OK then? (Supposing that "de-neu" - or whatever - has been
officially registered as the code for the new German orthography.)

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #54
Jukka K. Korpela:
As a rough rule of thumb, if you use
characters that are not present in Times New Roman, consider suggesting
body { font-family: "Arial Unicode MS"; }
maybe with some other fonts too, if you have checked that each of them
has all the characters you're using.


You should be aware that "Arial Unicode MS" can be installed on Linux
systems, but that on many such systems it will fail to render any
italics. So suggesting that font might disable italics for some users.

If italics are used for emphasized text or citations (or something else)
that could be a problem on pages where emphasis, citation etc. convey
important pieces of information.

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #55
On Sun, 25 Jan 2004, Jukka K. Korpela wrote:
As a rough rule of thumb, if you use
characters that are not present in Times New Roman, consider suggesting
body { font-family: "Arial Unicode MS"; }
maybe with some other fonts too, if you have checked that each of them
has all the characters you're using.
Well, at least if they have Arial Unicode MS, you know that the font
has the rich character repertoire. Whereas many font family names
denote fonts which come in more than one version, having widely
different repertoires - previous discussion has shown numerous
examples.

It's a dilemma. Arial Unicode MS typeface has only one font, whereas
(for example) the Palatino Linotype typeface has also italic, bold and
bold italic fonts. Lucida Sans Unicode typeface also has a fairly
wide repertoire but only one font. When italic, bold etc. have to be
derived from the regular font, the results are suboptimal.
The sure gain is that a large number of IE users will be able to read
the page without difficulty. The potential loss is that users who
actually have a qualitatively better font in their system and a browser
configured to use it will need an extra action to override the page
settings. I don't like the loss, but I think it's acceptable.
It's a value judgement call, which could very well come out different
for each situation. I really don't have a final view on it.

Fortunately, if one uses a central stylesheet then a change of
opinion can be easily implemented!
It's unfortunate that Code2000 is qualitatively so awful.


It's a reasonable choice when repertoire is the overwhelming
consideration, and cosmetics can take a back seat.

(Then there's the problem of monospace.)

cheers
Jul 20 '05 #56
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
In situations where the author knows that some font(s) that are
relatively commonly installed contain the characters he uses in a
document, I think it is reasonable to write a font-family suggestion
[...] As a rough rule of thumb, if you use characters that are not
present in Times New Roman, consider suggesting

body { font-family: "Arial Unicode MS"; }
Wouldn't it be safer to leave <body> alone, and only suggest
an alternate font-family for parts of the document known to
contain the problematic characters?

For example, if I'm composing a Bible-study page that has a
few scattered Greek words, oughtn't it just use:

span.polytonic { font-family: "Palatino Linotype" }
;K
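
The narrower approach suggested above (mark up only the Greek runs and leave `body` alone) can be sketched in Python. This is only an illustration under stated assumptions: `wrap_greek` is a hypothetical helper, the class name `polytonic` is taken from the CSS rule in the post, and "Greek" is approximated as the Greek and Coptic block plus the Greek Extended block.

```python
import html
import re
import unicodedata

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF),
# where the precomposed polytonic letters live.
_GREEK = "\u0370-\u03ff\u1f00-\u1fff"
# One or more Greek words, separated only by whitespace.
_GREEK_RUN = re.compile(rf"[{_GREEK}]+(?:\s+[{_GREEK}]+)*")


def wrap_greek(text, css_class="polytonic"):
    """Escape text for HTML and wrap each run of Greek words in a span."""
    # NFC composes base letter + combining accents into single code points
    # where possible, so the range test above sees them.
    text = unicodedata.normalize("NFC", text)
    escaped = html.escape(text, quote=False)
    return _GREEK_RUN.sub(
        lambda m: f'<span class="{css_class}">{m.group(0)}</span>', escaped
    )


print(wrap_greek("John 1:1 opens with \u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7."))
```

The stylesheet then needs only the `span.polytonic` rule, so the reader's own font choice stays in effect for everything else on the page.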

Jul 20 '05 #57
On Thu, 22 Jan 2004 22:04:46 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
It might be a good idea to extend the euro-centric list
serif, sans-serif, cursive, fantasy
by
naskhi, nastaliq, thuluth
etc.


Sounds reasonable to me. Is "thuluth" what is sometimes called "sülüs"?

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #58
On Fri, 23 Jan 2004 21:17:41 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
Philip Newton <pn*************@newton.digitalspace.net> wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
If those characters were Arabic, then it would be useful to choose,
say, a Persian font if it were known that the language is Farsi.


Or, for a possibly better example, to choose a nastaliq font (the kind
that slopes) for Urdu vs a default naskhi (horizontal) font for Arabic.


That ain't a better example - it's the same example. Both Persian and
Urdu would prefer a nast'aliq typeface.


Ah, I did not know that Persian also preferred nastaliq. Thanks.

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #59
On Fri, 23 Jan 2004 16:32:02 +0200, Henri Sivonen <hs******@iki.fi>
wrote:
In article <Xn****************************@193.229.0.31>, "Jukka K.
Korpela" <jk******@cs.tut.fi> wrote:
Yes. It should see immediately that Latin script is used. But in
addition to this, what's the big idea in selecting fonts according
to language?


I can't find a politically correct way of saying this, but there
are pecking orders of language groups within scripts in terms of
font availability and quality. It's unfortunate.

For example Polish looks ugly if some glyphs come from a "Western"
font and others come from a "Central European" font.


Mmm. Or if you want to have d-with-caron; you often can't use U+010F
LATIN SMALL LETTER D WITH CARON since this will typically have a glyph
with apostrophe after rather than caron above due to Czech and Slovak
typesetting habits (if I interpret the comment in the Unicode standard
correctly). But what if I'm not typesetting Czech or Slovak, but a
language which uses d-with-caron? (This is a real example, though the
language in question is not a natlang.)

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #60
Bertilo Wennergren <be******@gmx.net> wrote:
You should be aware that "Arial Unicode MS" can be installed on
Linux systems, but that on many such systems it will fail to render
any italics. So suggesting that font might disable italics for some
users.


Sounds bad. But I would classify it as a browser error, no matter what
the actual causes are. Such a situation will create problems without my
help too, since if someone installs the font, he probably intends to
use it at least casually, and he can himself tell his browser to use it
as a default font.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #61
Philip Newton <pn*************@newton.digitalspace.net> wrote:
Or if you want to have d-with-caron; you often can't use U+010F
LATIN SMALL LETTER D WITH CARON since this will typically have a
glyph with apostrophe after rather than caron above due to Czech
and Slovak typesetting habits (if I interpret the comment in the
Unicode standard correctly).
If you have d with caron, then U+010F is the correct character.
It is true that the appearance of the character usually has an
apostrophe on the right of it rather than a caron above it, but this is
glyph variation, which does not change the identity of a character.
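
The character-versus-glyph point can be checked against the Unicode character database. A small Python sketch using only the standard `unicodedata` module (the variable names are just for illustration):

```python
import unicodedata

d_caron = "\u010f"

# The character's identity is fixed by its name, whatever the glyph looks like.
print(unicodedata.name(d_caron))  # LATIN SMALL LETTER D WITH CARON

# Its canonical decomposition is plain "d" plus U+030C COMBINING CARON,
# even in fonts that draw the mark as an apostrophe-like shape.
decomposed = unicodedata.normalize("NFD", d_caron)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0064', 'U+030C']
```
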
But what if I'm not typesetting Czech
or Slovak, but a language which uses d-with-caron?
You mean "which uses d-with-caron that should be displayed in a manner
different from the usual one"?
(This is a real
example, though the language in question is not a natlang.)


If it's a conlang, it'll probably lack a registered language code, so
the use of a lang attribute would be somewhat pointless. Besides, the
language should have been designed to use a different character, if the
distinction is essential.

On the practical side, using font settings directly is surely the way
that has much better chances of creating the desired appearance than
using lang="x-fictitional-martian" in the hope of encountering browsers
that think "oh, so this is not Czech or Slovak but some unknown language,
maybe I should find a font where the diacritic really looks like a
caron". :-)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #62
Jukka K. Korpela:
Bertilo Wennergren <be******@gmx.net> wrote:
You should be aware that "Arial Unicode MS" can be installed on
Linux systems, but that on many such systems it will fail to render
any italics. So suggesting that font might disable italics for some
users.

Sounds bad. But I would classify it as a browser error, no matter what
the actual causes are.
It's a problem with the operating system, or rather with its font
handling. TTF fonts get italics only if there is a separate italic font
variant.
Such a situation will create problems without my
help too, since if someone installs the font, he probably intends to
use it at least casually, and he can himself tell his browser to use it
as a default font.


He might have it around for some special uses, intending it to be used
only when he himself chooses it. That's the case for me. I have
obviously not chosen it as a font to be used by default for any encoding
in my browser.

--
Bertilo Wennergren <be******@gmx.net> <http://www.bertilow.com>
Jul 20 '05 #63
On Sun, 25 Jan 2004 20:35:27 +0000 (UTC), "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:
Philip Newton <pn*************@newton.digitalspace.net> wrote:
Or if you want to have d-with-caron; you often can't use U+010F
LATIN SMALL LETTER D WITH CARON since this will typically have a
glyph with apostrophe after rather than caron above due to Czech
and Slovak typesetting habits (if I interpret the comment in the
Unicode standard correctly).
If you have d with caron, then U+010F is the correct character.


Indeed.
It is true that the appearance of the character usually has an
apostrophe on the right of it rather than a caron above it, but this
is glyph variation, which does not change the identity of a
character.
True.

I was reacting to Henri Sivonen saying

: I can't find a politically correct way of saying this, but there
: are pecking orders of language groups within scripts in terms of
: font availability and quality. It's unfortunate.
:
: For example Polish looks ugly if some glyphs come from a "Western"
: font and others come from a "Central European" font.

to point out that Czech and Slovak appear to be higher on the pecking
order here as well, and tend to impose their typographical preferences
on others (not directly, but by choice of those who design the fonts).

Similarly to how somebody writing Polish and who'd prefer a more acutely
sloped accent on his ó will have difficulties due to other languages'
preferences.

Or somebody writing Romanian who'd prefer to have his LATIN SMALL LETTER
S WITH CEDILLAs display with comma below instead, since he's not writing
Turkish. (Some fonts I have display s with cedilla but t with comma
below, which probably looks extra weird in Romanian: I can imagine it'd
be better or more consistent to have both characters appear similar [if
wrong] than to have them appear different.)
But what if I'm not typesetting Czech or Slovak, but a language
which uses d-with-caron?


You mean "which uses d-with-caron that should be displayed in a
manner different from the usual one"?


I mean "which uses d-with-caron that should be displayed in a manner
different from the Czech and Slovak one". Using "usual" rather depends
on the context.

But I suppose that's quibbling. So yes, I suppose I agree. Yes, a
language which uses d-with-caron that can be displayed as d-with-caron
or d-with-circumflex (e.g. in some handwriting styles), but not as
d-with-apostrophe.
Besides, the language should have been designed to use a different
character, if the distinction is essential.


Hm? d-with-caron is the correct character. I'm saying that it's
difficult to get a font showing an appropriate glyph due to pecking
order constraints that determine which language decides what "the"
reference glyph looks like. But the character is unambiguously LATIN
SMALL LETTER D WITH CARON, alongside several other letters with caron
which display correctly (e.g. C, R, or S).

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #64
Philip Newton <pn*************@newton.digitalspace.net> wrote:
Similarly to how somebody writing Polish and who'd prefer a more
acutely sloped accent on his ó will have difficulties due to other
languages' preferences.
I think we agree on the principle that language information could be
relevant to optimal selection of fonts - and this is among the
officially listed benefits of lang markup. But I see this as rather
marginal, mostly on practical grounds, since there's very little in the
direction of supporting this idea, and browsers' attempts at using lang
markup in font selection are basically wrong.

By the way, we can't really blame browser vendors too much. How many
people actually use lang markup? How many do it _right_? (I'm afraid
there are page editors that routinely add lang="en" without telling
their user or anyone else.) Besides, the specifications are vague.
And at the top of the foolishness, HTML 4 has lang, XHTML 1 adds
xml:lang, so maybe we should use both, except that lang appears to be
getting deprecated. Yet, if any software actually utilizes language
markup, I would expect it to know lang more probably than xml:lang.
Or somebody writing Romanian who'd prefer to have his LATIN SMALL
LETTER S WITH CEDILLAs display with comma below instead, since he's
not writing Turkish.


This particular issue is somewhat different, and - not surprisingly -
confused in its own way. According to a statement by the Romanian
standards institute, Romanian uses s with comma, not s with cedilla, so
they see this as a character difference, not glyph difference, and
s with comma has been added into Unicode for this reason, with quite
some handwaving.
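
Since the comma-below letters are separate characters, the difference is visible in the data and not just in the font. A hedged Python sketch: the `unicodedata` calls are standard, while `romanian_commas` and its mapping table are hypothetical illustrations and only make sense for text already known to be Romanian.

```python
import unicodedata

# Cedilla forms mapped to the comma-below forms that were encoded
# separately for Romanian. (s with cedilla is also the Turkish letter.)
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})


def romanian_commas(text):
    """Replace cedilla s/t with comma-below s/t.

    Only sensible for text known to be Romanian; it would corrupt Turkish.
    """
    return text.translate(CEDILLA_TO_COMMA)


# Two different characters, two different names.
for ch in "\u015f\u0219":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```
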

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #65
On Mon, 26 Jan 2004 08:53:22 +0000 (UTC), Jukka K. Korpela
<jk******@cs.tut.fi> wrote:

And at the top of the foolishness, HTML 4 has lang, XHTML 1 adds
xml:lang, so maybe we should use both, except that lang appears to be
getting deprecated.


The XHTML spec says to use both, and xml:lang takes preference. But I
can't see why they changed the attribute.

Is there difference in the syntax of lang and xml:lang? Is there a reason
lang could not also be used in XHTML?
Jul 20 '05 #66
Neal <ne*****@spamrcn.com> wrote:
And at the top of the foolishness, HTML 4 has lang, XHTML 1 adds
xml:lang, so maybe we should use both, except that lang appears to
be getting deprecated.
The XHTML spec says to use both, and xml:lang takes preference.


The XHTML 1.0 spec says so, but XHTML 1.1 has removed lang. On the
other hand, XHTML 1.0 is mostly an exercise in futility, and XHTML 1.1
is at least 1.1 times that. But the XHTML 2.0 draft, too, has xml:lang
only.
But I can't see why they changed the attribute.
To make the world safe for XML. Someone invented the idea that many XML
based markup systems should have an attribute for specifying the
language, so they defined xml:lang. Don't ask me why it needs to be
prefixed. If they wanted to make it a reserved attribute, so that no
XML based system should ever define a lang attribute except for a
particular purpose with a particular syntax and meaning, they could
have said that. But they were lost in namespace and couldn't say it
without invoking "namespaces".
Is there difference in the syntax of lang and xml:lang?
No.
Is there a reason lang could not also be used in XHTML?


Well it _can_ be used in XHTML. There is no formal prohibition in XML
against using any attribute name you like for language information, but
if you read between the lines,
http://www.w3.org/TR/REC-xml#sec-lang-tag
effectively tells that if you have an attribute for language, you had
better use xml:lang. There's no particular reason for the XML spec to
contain that part otherwise, since it does _not_ automatically make
xml:lang part of XML itself. It says: "In valid documents, this
attribute, like any other, must be declared if it is used."

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #67
Philip Newton <pn*************@newton.digitalspace.net> wrote:
Is "thuluth" what is sometimes called "sülüs"?


Zat's ze Turkish vay of spelling Arabic vords. ;-)
Jul 20 '05 #68
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
The language attribute in HTML also has an influence: some examples
are shown on my page.


Please accept my apologies on this particular point. I now realise I
was misremembering _that_ specific behaviour: it was in fact seen in
Mozilla, not MSIE.


To compensate for this short-coming, I can offer you a dependency
of Internet Explorer on the DIR attribute. ;-)
<http://www.unics.uni-hannover.de/nhtcapri/temp/percent.html>
<http://www.unics.uni-hannover.de/nhtcapri/temp/percent.html6>
I have not yet fully understood this bug. Can you reproduce it?
You need to define some typeface without Arabic glyphs (such as
Verdana) as "Latin-preferred typeface". Then IE fails to display
the Arabic percent sign in some instances.
Jul 20 '05 #69
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Does someone really think that a new version of the German language has
been or is being created by the orthography reform that was officially
started in 1998?


The differences are negligible compared with the differences
between en-GB-Oxford and en-US-Usenet. ;-)
Jul 20 '05 #70
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
You should be aware that "Arial Unicode MS" can be installed on
Linux systems, but that on many such systems it will fail to render
any italics. So suggesting that font might disable italics for some
users.


Sounds bad. But I would classify it as a browser error, no matter what
the actual causes are.


That is subject to debate. Many people consider it an error of a
word-processing/layout/drawing program when it fakes an italic style
where no true italic font is available. Some (Mac & Windows) programs
actually don't let you choose "bold" or "italic" if the typeface has
no bold or italic font.
Jul 20 '05 #71
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
(I'm afraid
there are page editors that routinely add lang="en" without telling
their user or anyone else.)


MS Word and other Microsoft programs do this on the basis of the
current keyboard layout. When I type English text using a German
keyboard layout, MS Word includes "\lang1031", i.e. language=German.
(You can check this by saving your documents in RTF.)
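
The check described above can be automated. A rough Python sketch that scans RTF text for `\langN` control words and maps the numeric Windows language ID to a language tag; `rtf_languages` and the tiny lookup table are hypothetical, and only three well-known IDs are included.

```python
import re

# A few Windows language IDs (LCIDs) and the matching language tags.
# The table is deliberately tiny; real documents can contain many others.
LCID_TO_TAG = {
    1031: "de-DE",  # German (Germany)
    1033: "en-US",  # English (United States)
    2057: "en-GB",  # English (United Kingdom)
}

# \langN control word: backslash, "lang", then the decimal language ID.
_LANG_WORD = re.compile(r"\\lang(\d+)")


def rtf_languages(rtf_text):
    """Return the language tags named by \\langN control words, in order.

    Unknown IDs are reported as "lcid-N" (a made-up placeholder).
    """
    ids = (int(m.group(1)) for m in _LANG_WORD.finditer(rtf_text))
    return [LCID_TO_TAG.get(i, f"lcid-{i}") for i in ids]


print(rtf_languages(r"{\rtf1\ansi\lang1031 This is English text.}"))
```
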
Jul 20 '05 #72
On Mon, 26 Jan 2004, Andreas Prilop wrote:
To compensate for this short-coming, I can offer you a dependency
of Internet Explorer on the DIR attribute. ;-)
<http://www.unics.uni-hannover.de/nhtcapri/temp/percent.html>
<http://www.unics.uni-hannover.de/nhtcapri/temp/percent.html6>
I have not yet fully understood this bug. Can you reproduce it?
It seems so. To start the test, I picked, as Latin font, the first
font on the alphabetical list that I had on this win2k system, which
happens to be "Albertus Extra Bold", whose properties are reported to
be:

Supported Unicode Ranges:
^ (yup, this area is completely empty!)

Supported code pages:

1252 Latin 1
1250 Latin 2:East Europe
1254 Turkish
1257 Windows Baltic.
You need to define some typeface without Arabic glyphs (such as
Verdana) as "Latin-preferred typeface".
I guess that meets your criteria. No Arabic unicode ranges nor "code
pages".
Then IE fails to display the Arabic percent sign in some instances.


Indeed: I'm looking at the utf-8 version...

On the left-aligned lines, it's shown as an empty box on
line 3. On the right-aligned lines, it's shown as an empty box on
lines 2 and 3.

How odd. Mine is IE6 version 6.0.2800.1106 with SP1 and a couple of
Q-numbers, for the record.

On the 8859-6 version, on the other hand, they all show as
(Arabic-looking) percent signs.

OK then, some other font choices:

* Verdana: same results as above.

* Lucida Sans Unicode: same (it doesn't have Arabic)

* Palatino Linotype: WOOPS!!! Instead of empty boxes, it displays
fleurs-de-lys in place of the missing percent signs!!!
And then fonts which contain Arabic:

* Arial Unicode MS: all percent signs are shown (you knew that!).

* Code2000: same.
Btw, just for the record, let's see what the other fonts were at
the time, working down the registry list in numerical order till
we get to Arabic:

Greek: Arial Unicode MS
Cyrillic: ditto
Armenian: Code2000
Hebrew: Lucida Sans Unicode

Nothing deliberate - just the relics of earlier tests.

Jul 20 '05 #73
Mad Bad Rabbit <ma**********@yahoo.com> wrote:
body { font-family: "Arial Unicode MS"; }
Wouldn't it be safer to leave <body> alone,


Yes!
and only suggest
an alternate font-family for parts of the document known to
contain the problematic characters?
Perhaps.
For example, if I'm composing a Bible-study page that has a
few scattered Greek words, oughtn't it just use:
span.polytonic { font-family: "Palatino Linotype" }


I used to object to _any_ typeface specification in HTML, thus following
<http://ppewww.ph.gla.ac.uk/~flavell/charset/browsers-fonts.html#dont>
But now I've done it myself for the poor souls using Internet Explorer.
<http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html>
Jul 20 '05 #74
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
You should be aware that "Arial Unicode MS" can be installed on
Linux systems, but that on many such systems it will fail to
render any italics. So suggesting that font might disable italics
for some users.


Sounds bad. But I would classify it as a browser error, no matter
what the actual causes are.


That is subject to debate. Many people consider it an error of a
word-processing/layout/drawing program when it fakes an italic
style where no true italic font is available.


It's theoretically subject to debate, now that HTML 2.0 is just
history. The good old spec _required_ that browsers render <em> and
<strong> as distinct from each other and from normal text. But that's
still pretty much the idea, is it not? So if a browser simply decides
not to italicize or slant text in <em> because the font in use is
e.g. Arial Unicode MS, then it's its responsibility to figure out
something else to make the difference.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #75
On Mon, 26 Jan 2004, Andreas Prilop wrote:
<http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html>


Uh-uh, you're using space and zero-width joiner to exhibit the
initial, final and medial forms. I guess I could do that in my Arabic
Unicode page also?

Jul 20 '05 #76
On Mon, 26 Jan 2004, Alan J. Flavell wrote:
<http://www.unics.uni-hannover.de/nhtcapri/urdu-alphabet.html>


Uh-uh, you're using space and zero-width joiner to exhibit the
initial, final and medial forms. I guess I could do that in my Arabic
Unicode page also?


I was surprised that the zero-width joiner (*) does work with
Mozilla 1.3 and later. However, an earlier version of Netscape (7.0?)
didn't recognize it, IIRC.

Does anyone still have Netscape 7.0 on Windows [2000 or XP]?
Can you tell me whether it shows different glyphs in the third column
of <http://www.unics.uni-hannover.de/nhtcapri/arabic-alphabet.html> ?

[ ـ does work with Netscape 7.0. ]
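
For readers who want to reproduce the trick discussed above, here is a minimal sketch of how the test strings can be built. It assumes a dual-joining letter (ARABIC LETTER BEH is used purely as an example) and a renderer that honours the zero-width joiner; the function name contextual_forms is made up for illustration and is not taken from either of the pages mentioned.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

def contextual_forms(letter):
    """Return strings that request each contextual form of `letter`.

    Strings are in logical order, so a joiner placed after the letter
    makes it behave as if another letter followed it.
    """
    return {
        "isolated": letter,
        "initial": letter + ZWJ,       # joins to a following letter
        "medial": ZWJ + letter + ZWJ,  # joins on both sides
        "final": ZWJ + letter,         # joins to a preceding letter
    }

forms = contextual_forms("\u0628")  # ARABIC LETTER BEH, example only
for name, text in forms.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{name:8} {codepoints}")
```

Whether the four strings actually show four different glyphs depends on the browser and font, which is exactly the point being tested in the posts above.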

Jul 20 '05 #77
In article <Xn***************************@193.229.0.31>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Henri Sivonen <hs******@iki.fi> wrote:
Choosing a font is only one problem. There are others including
line breaking.
Of course the _quality_ of rendering on screen or paper can be affected
by such processes.


I think it is worthwhile to try to improve the quality.
My point was that browsers have been able to present
documents without knowing the language, and they keep doing so (even
now,
The Mozilla feature can improve the quality of rendering in very
realistic cases. It happens to degrade the quality in your rather
theoretical example. But still, all the characters are rendered, so it
is "just" a matter of quality. Isn't it appropriate to optimize the
quality in cases that can plausibly occur on countless pages even if you
can come up with a rare counter-example where the optimization degrades
the quality?
and they always had the option of recognizing language from
actual content
Browsers--being interactive applications that aim for incremental
display--have never really had that option.
When you write <span lang="ru">Dostoyevsky</span>, what would you
want recipients to do with the language data?


Nothing particular. I'm just giving (meta)information.


I no longer appreciate the inclusion of metadata when the inclusion is
not motivated by a realistic use case and is done just for the sake of
providing metadata.
In a sense, here
I'm intentionally more papal than the pope - I am applying an
unconditional Priority 1 WAI guideline that the WAI itself violates.
....
And as I wrote, I don't recommend doing that in practice - but not
because the idea would be wrong. It's the Mozilla misbehavior that
makes it currently impractical.
If you want to make a point about hypocrisy in WAI, why do you point at
Mozilla as something that is misbehaving?
That is, is it
actually useful for transliterated text to come with language data
in any existing or realistic client implementation for any of the
purposes you list in
http://www.cs.tut.fi/~jkorpela/kielimerkkaus/1.html ?
....
In any existing implementation, most probably not.
Doesn't that make the inclusion of language metadata on transliterated
names about as useful as migrating from HTML 4.01 to XHTML 1.0 served as
text/html? :-)
In a realistic implementation, why not?
Because isolated transliterated Russian words marked as Russian are so
rare and there are so many bugs and so little developer time.
Of course they would need to
know or guess the transliteration method, but there's nothing that
prevents them from making educated guesses, except that it means quite
some work.
So guessing the transliteration method would be OK, but making educated
font guesses based on explicit language information is not OK?
I guess we should use lang="und" then.)
Or, rather lang="".
Let's suppose I'm writing a content management system and I choose
to use UTF-8 for all output - -
What advice should I provide authors who want to use the system for
publishing Polish or Chinese text? How should they make their
suggestions?


You mean for fonts?


Yes.
By using font properties in CSS.
The usual advice from Prilop and Flavell is not to do that.
I don't see how lang attributes would help in practice, though it would
be OK to declare the language as a preparation for the future.


For example, on X11 platforms where there are separate fonts for various
8-bit repertoires, Mozilla can choose a Central European (ISO-8859-2)
font instead of a "Western" (ISO-8859-1) one for unaccented Latin
characters as well as the accented ones if the content is marked up as
Polish.

For example, on Mac OS X, which comes with a variety of CJK fonts,
Mozilla can choose a Simplified Chinese font or a Traditional Chinese
font instead of a Japanese font if the content is marked up as zh-CN or
zh-TW respectively.

In both cases the lang attributes do help in *practice*.

Test: http://iki.fi/hsivonen/test/lang.htm8
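
To make the idea concrete, here is a rough sketch of a lookup from a lang value to a font group. The table, the group names and the function font_group_for are all hypothetical; this is not Mozilla's code, only an illustration of falling back from a longer tag such as zh-TW to a shorter one when no exact entry exists.

```python
# Hypothetical table: language tag -> font group. The group names are
# illustrative only and are not taken from any browser's source.
FONT_GROUPS = {
    "pl": "central-european",
    "zh-cn": "simplified-chinese",
    "zh-tw": "traditional-chinese",
    "ja": "japanese",
}

def font_group_for(lang, default="western"):
    """Look up a font group, dropping trailing subtags until one matches."""
    tag = lang.strip().lower()
    while tag:
        if tag in FONT_GROUPS:
            return FONT_GROUPS[tag]
        # Drop the last subtag: "pl-pl" -> "pl", and "pl" -> "" (ends loop).
        tag = tag.rpartition("-")[0]
    return default

print(font_group_for("pl"))
print(font_group_for("zh-TW"))
print(font_group_for("en-US"))
```

An empty lang value, or a tag with no entry at all, simply falls through to the default group, so unmarked content is treated as before.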

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #78
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
How do you suggest the font heuristics should work with UTF-8


What's wrong with displaying Latin characters using the selected Latin
font? And so on.


Sometimes there are different fonts for different subsets of the Latin
Unicode repertoire.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #79
On Tue, 27 Jan 2004, Henri Sivonen wrote:
publishing Polish or Chinese text?

[J.Korpela:] By using font properties in CSS.
The usual advice from Prilop and Flavell is not to do that.


Could I stress, though, that CJK is _not_ my field: I'm aware that
they'd probably want to disambiguate Han unified characters. I'd have
to take advice as to whether that's better achieved by language
markup, font suggestions, or both, though.

But regarding Polish, I think the bottom line is the one I gave
before, although I added a few more words to it yesterday:

It may be that you have the best of intentions, and you could indeed
help some proportion of readers whose browsers are not set-up
optimally, but you also risk causing real harm to some other
proportion of readers. Conversely, readers who are having problems
with displaying what is otherwise a properly-made i18n document on
their browsers, provided of course that those browsers have been set
up well for the writing systems in question, might be advised to try
telling the browser to ignore document-specified font selection

http://ppewww.ph.gla.ac.uk/~flavell/...onts.html#dont

Even in two major browser/families (IE, and Mozilla+relatives), on
Windows, there are considerations which would lead to contradictory
choices here, as far as _font_ proposals are concerned. Other
browsers, unknown to us, may well suffer from other shortcomings. I'm
suggesting it would be better not to take that risk.

[examples snipped]
In both cases the lang attributes do help in *practice*.
I'm quite willing to believe that, in practice.

As you well know, that's not the same as proposing a named font (I'm
just stressing that point in case any reader might have got confused
as we switched the discussion from one to another).
Test: http://iki.fi/hsivonen/test/lang.htm8


Sorry, I don't have any systems conveniently at hand where that shows
the distinction that you're aiming to prove. I'll try to remember to
try it when I can.

Jul 20 '05 #80
On Mon, 26 Jan 2004 08:53:22 +0000 (UTC), "Jukka K. Korpela"
<jk******@cs.tut.fi> wrote:
Philip Newton <pn*************@newton.digitalspace.net> wrote:
Or somebody writing Romanian who'd prefer to have his LATIN SMALL
LETTER S WITH CEDILLAs display with comma below instead, since
he's not writing Turkish.


This particular issue is somewhat different, and - not surprisingly
- confused in its own way. According to a statement by the Romanian
standards institute, Romanian uses s with comma, not s with cedilla,
so they see this as a character difference, not glyph difference,
and s with comma has been added into Unicode for this reason, with
quite some handwaving.


Similarly, Polish could claim that Polish uses, say, o with kreska, not
o with acute - yet as far as I know, Unicode treats this as a glyph
difference.

http://studweb.euv-frankfurt-o.de/tw...ek/kreska.html
says, for example,

You might have heard that the acute accent is used in Polish
language. Wrong! The Polish kreska in a 8 point face seems similar
to acute but if you look closer, you'll discover that a Polish
kreska, when designed according to the requirements of Polish
typography, is differently shaped and placed than the usual acute.

So who can tell, really? I suppose it often boils down to the pecking
order. (That document also talks about language-specific glyph
substituting.)
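
The character-versus-glyph distinction above can be checked against the Unicode data itself. The following is a small sketch using Python's standard unicodedata module: the two s letters are separate characters with different decompositions, while the Polish o has only the one encoding with an acute accent, so any kreska shape has to come from the font.

```python
import unicodedata

# s with cedilla, s with comma below, and o with acute
for ch in ("\u015F", "\u0219", "\u00F3"):
    decomposed = unicodedata.normalize("NFD", ch)
    parts = ", ".join(unicodedata.name(c) for c in decomposed)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```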

Cheers,
Philip
--
Philip Newton <no***********@gmx.li>
That really is my address; no need to remove anything to reply.
If you're not part of the solution, you're part of the precipitate.
Jul 20 '05 #81
