Bytes IT Community
Troubles with ISO 8859-2

P: n/a
I thought I would make a table of the Latin-2 character code. I
use a meta tag to indicate ISO-8859-2, but when I use the numeric
entity, for, say Latin Capital Letter R With Acute, which is
&#192; it gives me the ISO-8859-1 character (either that or the
UTF-8 character, which would be the same thing, I imagine.)

I'm curious about this. I've pretty much given up the project at
this point, but I'm wondering what it is I'm not seeing.

I can provide a URL, but that creates a new set of problems.
FireFox indicates that the page on the server is UTF-8, which may
be a separate issue, since when viewing the page locally FireFox
indicates it's ISO-8859-2. Yet it does appear that my page, even
locally, is displaying Unicode.

The URL for the page is:
http://www.sundry.ws/comp/iso-8859-2.html
but only the first row is finished.

Ian
--
http://www.bookstacks.org/
http://www.sundry.ws/
Jul 23 '05 #1


P: n/a
Ian Rastall wrote:
The URL for the page is:
http://www.sundry.ws/comp/iso-8859-2.html
but only the first row is finished.


Even when the server is set up properly, I still see the A-acute:

http://tranchant.plus.com/tmp/iso-8859-2.htm2

Beyond my knowledge, I'm afraid.

--
Mark.
http://tranchant.plus.com/
Jul 23 '05 #2

P: n/a
On Wed, 1 Dec 2004, Ian Rastall wrote:
I thought I would make a table of the Latin-2 character code.
Handy cross-reference at
http://www.unicode.org/Public/MAPPIN...859/8859-2.TXT
although the values are in hex rather than decimal
I use a meta tag to indicate ISO-8859-2, but when I use the numeric
entity,
"Numerical character reference" is the exact term
for, say Latin Capital Letter R With Acute, which is &#192;
No, it isn't. As the cross-mapping table shows, it's character 0x154,
which in decimal would be 340. You can also see it on my page at
http://ppewww.ph.gla.ac.uk/~flavell/...unidata01.html
it gives me the ISO-8859-1 character
That's what 192 is, so it's working as designed.
I'm curious about this. I've pretty much given up the project at
this point, but I'm wondering what it is I'm not seeing.


You've got a fundamental misunderstanding about how characters are
represented in HTML. There's a section about it in the HTML4
specification.
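
The hex-to-decimal conversion and the character names mentioned above are easy to check. A minimal Python sketch, using only the standard library (the variable names are illustrative and not part of the original post):

```python
import unicodedata

# Code point of the character in question, written in hex as in the mapping table.
code_point = 0x154

# The same value in decimal, which is what a decimal character reference uses.
decimal_value = int("154", 16)

# Look up the official Unicode names for code points 340 and 192.
name_340 = unicodedata.name(chr(code_point))
name_192 = unicodedata.name(chr(192))

print(decimal_value, name_340)
print(192, name_192)
```

Code point 192 is the A with grave, which is why &#192; renders that character no matter which encoding the page declares.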

good luck
Jul 23 '05 #3

P: n/a
In comp.infosystems.www.authoring.html Alan J. Flavell wrote:
for, say Latin Capital Letter R With Acute, which is &#192;
No, it isn't. As the cross-mapping table shows, it's
character 0x154, which in decimal would be 340. You can also
see it on my page at
http://ppewww.ph.gla.ac.uk/~flavell/...unidata01.html


From what I understand, in ISO-8859-2, it's 0xC0:
http://www.columbia.edu/kermit/latin2.html
which is decimal 192.
You've got a fundamental misunderstanding about how characters
are represented in HTML.


Yes. I'm treating this as a learning experience. Any further
thoughts would be appreciated.

Ian
--
http://www.bookstacks.org/
http://www.sundry.ws/
Jul 23 '05 #4

P: n/a
On Wed, 1 Dec 2004, Ian Rastall wrote:
In comp.infosystems.www.authoring.html Alan J. Flavell wrote:
for, say Latin Capital Letter R With Acute, which is &#192;


No, it isn't. As the cross-mapping table shows, it's
character 0x154, which in decimal would be 340.


From what I understand, in ISO-8859-2, it's 0xC0:


Which is why I referred you to

| Handy cross-reference at
| http://www.unicode.org/Public/MAPPIN...859/8859-2.TXT
| although the values are in hex rather than decimal

There are corresponding cross-mapping tables in that hierarchy for
most of the character encodings that you're likely to encounter.[1]
You've got a fundamental misunderstanding about how characters
are represented in HTML.


Yes. I'm treating this as a learning experience. Any further
thoughts would be appreciated.


You could try the HTML4 specification, as I already suggested.

http://www.w3.org/TR/html401/charset.html

Or in (much) more detail, http://www.w3.org/TR/charmod/

Your most important first step, though, is to un-learn what you think
you already know, because it's evidently wrong in at least one key
detail, and it's preventing you from seeing the answer. No offence
intended.

The Document Character Set of HTML, and XHTML, is always
Unicode/iso-10646, no matter what the external character
encoding scheme might be. And *those* are the values you
need to use in numerical character references ( &#number; )
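
As a rough illustration of that rule, the number in a character reference can be derived from the Unicode code point alone, with no reference to the page encoding. A small Python sketch (the helper names `to_decimal_ncr` and `to_hex_ncr` are hypothetical, chosen for this example):

```python
def to_decimal_ncr(ch: str) -> str:
    """Build a decimal numeric character reference from one character."""
    return f"&#{ord(ch)};"


def to_hex_ncr(ch: str) -> str:
    """Build the equivalent hexadecimal numeric character reference."""
    return f"&#x{ord(ch):X};"


r_acute = "\u0154"  # LATIN CAPITAL LETTER R WITH ACUTE

print(to_decimal_ncr(r_acute))
print(to_hex_ncr(r_acute))
```

Neither helper takes an encoding argument, which mirrors the point being made: the reference depends only on the Unicode value.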

good luck

[1] (The one for "Symbol" is out of date, though.)

Jul 23 '05 #5

P: n/a
On 1 Dec 2004, Ian Rastall wrote:
User-Agent: Xnews/06.08.25
I suggest getting at least a MIME-conforming newsreader
when writing about such a subject.
I thought I would make a table of the Latin-2 character code.


What for? Don't we have enough yet? For example:
http://www.unics.uni-hannover.de/nht...european.html2
http://www.unics.uni-hannover.de/nht...al2.html#latin

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 23 '05 #6

P: n/a
On Wed, 1 Dec 2004, Alan J. Flavell wrote:
There are corresponding cross-mapping tables in that hierarchy for
most of the character encodings that you're likely to encounter.[1]

[1] (The one for "Symbol" is out of date, though.)


You can always get the latest version from Dr. Watson.

SCNR

Jul 23 '05 #7

P: n/a
On Wed, 1 Dec 2004, Alan J. Flavell wrote:
| Handy cross-reference at
| http://www.unicode.org/Public/MAPPIN...859/8859-2.TXT

There are corresponding cross-mapping tables in that hierarchy for
most of the character encodings that you're likely to encounter.[1]

[1] (The one for "Symbol" is out of date, though.)


For the Symbol character set, you could also refer to
http://www.unicode.org/Public/MAPPIN...PLE/SYMBOL.TXT

Jul 23 '05 #8

P: n/a
On Wed, 1 Dec 2004, I wrote:
For the Symbol character set, you could also refer to
http://www.unicode.org/Public/MAPPIN...PLE/SYMBOL.TXT


Or look into the "PostScript language reference", 3rd ed.
http://www.google.com/search?q=PLRM.pdf

Jul 23 '05 #9

P: n/a

"Ian Rastall" <id*******@gmail.com> wrote in message
news:Xn***************************@130.133.1.4...
I thought I would make a table of the Latin-2 character code. I
use a meta tag to indicate ISO-8859-2, but when I use the numeric
entity, for, say Latin Capital Letter R With Acute, which is
&#192; it gives me the ISO-8859-1 character (either that or the
UTF-8 character, which would be the same thing, I imagine.)
You're confusing two things: the character encoding of the page and the
Document Character Set. The latter is always Unicode, and the numeric
character references (the &#xxx; codes) always relate to that. So &#192;
will always mean character 192 from the Unicode character set.

The character encoding (UTF-8, 8859-2, etc.) describes the encoding that
maps the actual bytes in the source document to the characters they are
meant to represent. If you use an editor to create an HTML file that
includes a capital A with a grave accent, and the editor uses 8859-1
character representation, then that character is being stored as a single
byte with value 192. If your server serves that file and correctly tells the
client that encoding 8859-1 was used, then the client will interpret it as
capital A with grave, and will display it as the Unicode equivalent (which
is also 192). But, if your server tells the client that encoding 8859-2 was
used, then the client will interpret the 192 as capital R with an acute
accent, and will try to display it as the Unicode equivalent--which is
Unicode character 340.
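
The byte-versus-character distinction described above can be reproduced directly. A minimal Python sketch, assuming the file contains the single byte 192 (the variable names are illustrative):

```python
# The single byte with value 192 (0xC0), as an editor would store it on disk.
raw = bytes([192])

# The same byte decoded under the two declared encodings.
as_latin1 = raw.decode("iso-8859-1")
as_latin2 = raw.decode("iso-8859-2")

# ascii() keeps the output printable on any terminal.
print(ascii(as_latin1), ord(as_latin1))
print(ascii(as_latin2), ord(as_latin2))
```

The byte is identical in both cases; only the declared encoding changes which Unicode character (192 or 340) it is taken to mean.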

If you want to use a numeric character reference to display capital R with
acute, you need the representation &#340;.

I'm curious about this. I've pretty much given up the project at
this point, but I'm wondering what it is I'm not seeing.

I can provide a URL, but that creates a new set of problems.
FireFox indicates that the page on the server is UTF-8, which may
be a separate issue, since when viewing the page locally FireFox
indicates it's ISO-8859-2. Yet it does appear that my page, even
locally, is displaying Unicode.


When the file is served, the server supplies a header that indicates a UTF-8
encoding (or maybe no encoding is specified and the browser assumes UTF-8 by
default--I don't know about this). When you open the file locally, there is
no server-supplied header indicating the encoding used, and the browser
apparently falls back on the META tag.
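
The precedence described here (server header first, META tag as the fallback) might be sketched as follows. This is a simplified model only: the function name `effective_charset` and the `default` parameter are assumptions made for the example, and real browsers apply further steps that are not modelled.

```python
from email.message import Message
from typing import Optional


def effective_charset(content_type_header: Optional[str],
                      meta_charset: Optional[str],
                      default: Optional[str] = None) -> Optional[str]:
    """Pick the charset: the HTTP header wins, then the META tag, then a default."""
    if content_type_header:
        # Reuse the standard library's header parser to extract the charset parameter.
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset
    # No usable header (for example, a file opened locally): fall back to META.
    return meta_charset or default


# Served over HTTP with a header that says UTF-8: the META tag is overridden.
print(effective_charset("text/html; charset=UTF-8", "iso-8859-2"))

# Opened locally, so there is no header: the META tag decides.
print(effective_charset(None, "iso-8859-2"))
```

Under this model, the same file reports UTF-8 when served and ISO-8859-2 when opened from disk, which matches the behaviour observed in the original question.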

But again, as I said earlier, numeric character references are always
interpreted as Unicode.

Jul 23 '05 #10

P: n/a
In comp.infosystems.www.authoring.html Harlan Messinger wrote:
numeric character references are always interpreted as Unicode.


And ... there's the answer. Thank you!

Ian
--
http://www.bookstacks.org/
http://www.sundry.ws/
Jul 23 '05 #11

P: n/a
On Wed, 1 Dec 2004 11:34:45 +0000, Alan J. Flavell wrote:
Your most important first step, though, is to un-learn what you think
you already know, because it's evidently wrong in at least one key
detail, and it's preventing you from seeing the answer. No offence
intended.


I had taken it as an offence at first, but I see what you mean now. My
understanding before was that the purpose of declaring an encoding scheme
was to determine how numeric character references were interpreted. I
thought that if I wanted to insert something other than a US-ASCII
character, I had to escape it. A product of being American, I suppose.

After poring over the links you gave me, and some others, such as Jukka's,
my understanding now is that you can use the actual character, without
having to escape it, as long as the encoding supports it. IOW, I can use
Chinese characters, if I want, as long as I'm declaring UTF-8, and can
somehow copy and paste them into the document. This is where my
understanding is right now, although it's incomplete.

Thanks for helping me figure this out. I did, in fact, have to unlearn a
very fundamental misunderstanding.

Ian
--
http://www.sundry.ws/
http://www.bookstacks.org/
Jul 23 '05 #12

P: n/a
On Sun, 5 Dec 2004, Ian Rastall wrote:
My understanding before was that the purpose of declaring an
encoding scheme was to determine how numeric character references
were interpreted.
The way in which HTML has used this SGML idea is pretty-much the exact
opposite, as you now realise.
I thought that if I wanted to insert something other than a US-ASCII
character, I had to escape it.
The term "Document Character Set" has a specialised meaning in
SGML terminology, and isn't in any way bound to the character encoding
scheme. HTML has chosen a fixed "Document Character Set", namely
iso-10646/Unicode. The &#-notation permits /any/ character of that
set to be represented, no matter what encoding scheme is used - even
us-ascii (or iso-646/IRV if you want to make it less parochial ;-)

But a more compact way of representing documents may be to choose an
appropriate encoding scheme, and represent the characters in that
encoding scheme i.e as coded bytes. Which scheme to choose, would be
influenced by the predominant content of the document.
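
The trade-off between the two representations can be seen with a short Python sketch. The sample string is arbitrary and chosen only for illustration; `xmlcharrefreplace` is a standard Python codec error handler that writes unencodable characters as decimal character references.

```python
import html

# Arbitrary sample text containing two non-ASCII Latin-2 characters.
text = "\u0154\u00e1no"

# Encoded as pure us-ascii, with character references for everything else.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

# Encoded as coded bytes in ISO-8859-2, one byte per character.
as_latin2 = text.encode("iso-8859-2")

print(as_ascii, len(as_ascii))
print(as_latin2, len(as_latin2))

# Both forms carry the same characters once the references are resolved.
round_trip = html.unescape(as_ascii.decode("ascii"))
```

The ASCII form works under any encoding declaration but is longer; the ISO-8859-2 form is compact but only correct if the encoding is declared accurately.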

If you still care about Netscape 4.* versions, then there are some
definite restrictions on what will work in practice, because NN4's
implementation is seriously defective in this regard.
After poring over the links you gave me, and some others, such as
Jukka's, my understanding now is that you can use the actual
character, without having to escape it, as long as the encoding
supports it.
Exactly.
IOW, I can use Chinese characters, if I want, as long as I'm
declaring UTF-8,
Yes; or indeed the various other Chinese encodings which are around...
and can somehow copy and paste them into the document.
Well, that's something which has to be resolved between you and your
authoring software and OS; as long as that software does the right
job, the WWW doesn't care how you generated the document, it only
cares that what you send to the web is in accordance with the
applicable rules. But basically "yes", that is a practical
possibility on appropriate software platforms.
Thanks for helping me figure this out. I did, in fact, have to
unlearn a very fundamental misunderstanding.


I know. It's hard to pitch the message at the right level so that
the person gets the necessary heads-up but without causing offence.

thanks for the feedback.

Maybe this could be helpful
http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html

Jul 23 '05 #13

P: n/a
In article <Pi*******************************@ppepc56.ph.gla. ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
IOW, I can use Chinese characters, if I want, as long as I'm
declaring UTF-8,


Yes; or indeed the various other Chinese encodings which are around...


Side note: When using UTF-8 for Chinese, it is important to also declare
the language using the lang attribute. Otherwise Mozilla will prefer
Japanese fonts for unified ideographs. (zh-CN is considered simplified
and zh-TW is considered traditional.)

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #14

P: n/a
On Sun, 5 Dec 2004, Henri Sivonen wrote:
Side note: When using UTF-8 for Chinese, it is important to also
declare the language using the lang attribute.
I guess you understand this area better than I do - it's outside of
what I understand, or am ever likely to. But at least I think I can
name some of the parts. Here goes:

Unicode includes the so-called Han Unification, which means characters
which play the same role in CJK (Chinese - both kinds -, Japanese and
Korean) writing systems are represented by the same character code;
the preferred glyphs are meant to be disambiguated by additionally
specifying the language (which may indirectly result in an appropriate
font being chosen, but HTML should not get involved in that).

Whereas, when using the various codings which are native to those
"languages"[1], the preferred choice of glyphs is implied by the
coding, which in turn implies the writing system. Specifying the
language is recommended anyway, but isn't essential if the coding is
specific enough[2]

How am I doing so far?
Otherwise Mozilla will prefer Japanese fonts for unified ideographs.
(zh-CN is considered simplified and zh-TW is considered
traditional.)


Doesn't it have a configuration for that?

all the best

[1] Chinese isn't really a "language", so much as a writing system. I
recall two colleagues trying to hold a conversation at lunch in the
canteen of the Institute, but they spoke mutually unintelligible
Chinese languages, and were only able to communicate by scribbling on
the table-napkins, with occasional spoken remarks in their broken
German...

[2] report of analogous effects for some non-CJK writing systems in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html
Jul 23 '05 #15

P: n/a
On Sun, 5 Dec 2004 12:57:01 +0000, Alan J. Flavell wrote:
Maybe this could be helpful
http://ppewww.ph.gla.ac.uk/~flavell/...checklist.html


Yes, that's a good page. I had it bookmarked for a while, but I'll read it
again.

Ian
--
http://www.sundry.ws/
http://www.bookstacks.org/
Jul 23 '05 #16

P: n/a
Ian Rastall wrote:
After poring over the links you gave me, and some others, such as Jukka's,
my understanding now is that you can use the actual character, without
having to escape it, as long as the encoding supports it.
Yes, that's correct. The first thing you need to make sure of though,
is that your editor is actually saving the file with the correct
character encoding. This is usually available in the Save As or Options
dialogs, but not all editors support such features. Be careful if yours
doesn't, because many will actually use Windows-1252 (if you're using
Windows), even though they may claim to use ASCII or ISO-8859-1.

Simply writing the character encoding in the <meta> element without
configuring your editor properly is not good enough, though I've seen
many people try that, and fail. (You should also note that the <meta>
element is not recommended, in favour of a real HTTP Content-Type header.)
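
To illustrate why the Windows-1252 versus ISO-8859-1 mismatch matters, here is a minimal Python sketch. The sample bytes are hypothetical: they stand for what an editor might write for curly quotes when it silently saves as Windows-1252.

```python
import unicodedata

# Bytes 0x93 and 0x94 are curly quotes in Windows-1252.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# Under Windows-1252 the first character is a left double quotation mark.
print(ascii(as_cp1252[0]), unicodedata.name(as_cp1252[0]))

# Under ISO-8859-1 the same byte is a C1 control character (category "Cc").
print(ascii(as_latin1[0]), unicodedata.category(as_latin1[0]))
```

The two encodings agree on most bytes, so the mismatch tends to surface only for characters in the 0x80 to 0x9F range, such as curly quotes and dashes.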
IOW, I can use Chinese characters, if I want, as long as I'm declaring
UTF-8, and can somehow copy and paste them into the document.


To get any character, you can use a tool like the Character Map in
Windows, or an equivalent for your system. Alternatively, you can
generate them using a browser that supports data: URIs:

eg.
data:text/html,&#x2014;

Where you can use any decimal or hexadecimal character references or
HTML character entities. Or, if you have JS enabled, then you can use
these Unicode tools [1], which are part of my copy of the devedge
sidebar [2].
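
As an alternative to the browser trick, the three kinds of reference mentioned above (decimal, hexadecimal, and named entity) can be resolved with a short script. A minimal Python sketch using the standard `html` module; the list name `forms` is illustrative:

```python
import html

# Decimal reference, hexadecimal reference, and named entity for the em dash.
forms = ["&#8212;", "&#x2014;", "&mdash;"]

# html.unescape resolves each reference to the character it stands for.
resolved = [html.unescape(form) for form in forms]

for form, ch in zip(forms, resolved):
    # ascii() keeps the output printable regardless of terminal encoding.
    print(form, ascii(ch), hex(ord(ch)))
```

All three forms resolve to the same character, U+2014, which can then be pasted into a document saved in an encoding that supports it.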

Lastly, you need to ensure that your system has unicode fonts available,
or at least fonts with the glyphs that you're using. Without them,
you'll only see boxes, question marks or whatever place holder character
your editor uses.

[1] http://lachy.id.au/dev/mozilla/sideb...haracter-tools
[2] http://lachy.id.au/blogs/log/2004/10/devedge-sidebar

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://SpreadFirefox.com/ Igniting the Web
Jul 23 '05 #17

P: n/a
Under Subject: Re: Troubles with ISO 8859-2
Henri Sivonen <hs******@iki.fi> wrote:
Side note: When using UTF-8 for Chinese, it is important to also
declare the language using the lang attribute. Otherwise Mozilla
will prefer Japanese fonts for unified ideographs.
It puzzles me why they default to Japanese fonts. My guess is that they
estimate that Japanese is still more widely used on the Web than
Chinese.
(zh-CN is
considered simplified and zh-TW is considered traditional.)


Which is fairly odd, since such coding uses country code to distinguish
between writing systems - even using the TW code, which is politically
very problematic.

However it seems that at least Mozilla Firefox 1.0 also supports
the more logical, IANA registered codes zh-Hans (simplified) and
zh-Hant (traditional), treating mere zh as equivalent to zh-Hans.

I created a simple demo; sorry, the text around it is in Finnish:
http://www.cs.tut.fi/~jkorpela/kieli...html#cjk-table

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #18

P: n/a
In article <Xn****************************@193.229.0.31>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
It puzzles me why they default to Japanese fonts. My guess is that they
estimate that Japanese is still more widely used on the Web than
Chinese.
Other possibilities (I don't know the real reason):
* By chance; had to pick some order
* Japan considered financially more important for AOL/Netscape
* Japanese fonts tend to be of higher quality
(zh-CN is
considered simplified and zh-TW is considered traditional.)


Which is fairly odd, since such coding uses country code to distinguish
between writing systems -


Well, the custom is that the traditional form is used on Taiwan and in
Hong Kong and the simplified form is used in the PRC. So it is not that
odd.
even using the TW code, which is politically
very problematic.
There's also zh-HK. AFAIK, the official meaning of TW is "Taiwan,
Province of China", so it explicitly disasserts that which is
problematic to assert.
However it seems that at least Mozilla Firefox 1.0 also supports
the more logical, IANA registered codes zh-Hans (simplified) and
zh-Hant (traditional), treating mere zh as equivalent to zh-Hans.


Yes, those aliases are supported.
http://lxr.mozilla.org/seamonkey/sou...gGroups.properties#184

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Jul 23 '05 #19

P: n/a
In article <Pi*******************************@ppepc56.ph.gla. ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
Unicode includes the so-called Han Unification, .... Specifying the language is recommended anyway, but
isn't essential if the coding is specific enough[2]
Yes.
Otherwise Mozilla will prefer Japanese fonts for unified ideographs.
(zh-CN is considered simplified and zh-TW is considered
traditional.)


Doesn't it have a configuration for that?


For the pecking order in the absence of language group metadata
(explicit or guessed from the encoding)? AFAIK, it does not.
[1] Chinese isn't really a "language", so much as a writing system.
ISO considers both Mandarin and Cantonese zh, although some would
consider them different languages sharing a writing system.
[2] report of analogous effects for some non-CJK writing systems in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html


BTW, on X11 using the old non-Xft gfx implementation, the rendering of
UTF-8-encoded Central European pages also improves in Mozilla in some
cases if the language has been declared explicitly.

Have you re-tested Yiddish recently? Mozilla is now supposed to treat it
as an alias for Hebrew for the purpose of font selection.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #20

P: n/a
On Fri, 10 Dec 2004, Henri Sivonen wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
report of analogous effects for some non-CJK writing systems in
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html


Have you re-tested Yiddish recently? Mozilla is now supposed to treat it
as an alias for Hebrew for the purpose of font selection.


I hope it's on Alan's agenda. :-)
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #21

P: n/a
In article <Pine.GSO.4.44.0412101356460.6128-100000@s5b004>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #22

P: n/a
In article <hs****************************@news.dnainternet.n et>,
Henri Sivonen <hs******@iki.fi> wrote:
In article <Pine.GSO.4.44.0412101356460.6128-100000@s5b004>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?


Oh, BTW, I didn't find a bug number for this. :-)

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #23

P: n/a
On Fri, 10 Dec 2004, Henri Sivonen wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?


Yes, of course - also Sindhi (sd).

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #24

P: n/a
On Fri, 10 Dec 2004, Henri Sivonen wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?


As I understand it, there's a couple of extra characters needed, which
might or might not appear in a font which otherwise covered Arabic.
But - although that /is/ a problem for Internet Exploder - it's
something that Mozilla happily copes with, and goes and finds the
characters somewhere else (I don't understand what magic it uses, but
it seems to work).

So any consequences are no worse than cosmetic - they don't
prevent the content from being rendered.
Jul 23 '05 #25

P: n/a
On Fri, 10 Dec 2004, Andreas Prilop wrote:
Have you re-tested Yiddish recently? Mozilla is now supposed to treat it
as an alias for Hebrew for the purpose of font selection.


I hope it's on Alan's agenda. :-)
http://www.unics.uni-hannover.de/nht...-attribute.htm


On Mozilla 1.7.3 Win32, I'm seeing the test page's Hebrew and Yiddish
lines displayed the same as each other, and different from all the
others. I'd interpret that as showing that issue has indeed been
fixed.
Jul 23 '05 #26

P: n/a
On Fri, 10 Dec 2004, Alan J. Flavell wrote:
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?


As I understand it, there's a couple of extra characters needed, which
might or might not appear in a font which otherwise covered Arabic.
But - although that /is/ a problem for Internet Exploder - it's
something that Mozilla happily copes with, and goes and finds the
characters somewhere else


This means that glyphs are taken from different typefaces,
resulting in something like this:
http://www.unics.uni-hannover.de/nht...face-arial.gif

However, text labelled LANG=ps, sd, ur (Pashto, Sindhi, Urdu) should
be displayed _completely_ in the typeface I chose for the Arabic script.
On MS Windows, this might be Tahoma with a rich repertoire of Arabic
glyphs. I might choose Verdana as typeface for Latin, Greek, Cyrillic
and Tahoma as typeface for Hebrew and Arabic.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #27

P: n/a
On Fri, 10 Dec 2004, Alan J. Flavell wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm


On Mozilla 1.7.3 Win32, I'm seeing the test page's Hebrew and Yiddish
lines displayed the same as each other, and different from all the
others.


In addition, select different encodings in Mozilla. For example, if
you select ISO-8859-8, then the [first] line without LANG attribute
should be displayed in the same typeface as the Hebrew and Yiddish
lines.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #28

P: n/a
In article <Pine.GSO.4.44.0412101608470.6357-100000@s5b004>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Fri, 10 Dec 2004, Henri Sivonen wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Pashto and Urdu are still not recognized.


Would it be adequate to treat them like Arabic for the purpose of font
selection?


Yes, of course - also Sindhi (sd).


The fix is trivial. See
https://bugzilla.mozilla.org/show_bug.cgi?id=274102

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #29

P: n/a
On Fri, Dec 10, Andreas Prilop inscribed on the eternal scroll:
This means that glyphs are taken from different typefaces,
which is why I mentioned cosmetics, but the result, even if not ideal,
is surely better than Exploder's habit of displaying empty boxes.
resulting in something like this:
http://www.unics.uni-hannover.de/nht...face-arial.gif
So, in the worst cases, it looks like the proverbial ransom note.
However, text labelled LANG=ps, sd, ur (Pashto, Sindhi, Urdu) should
be displayed _completely_ in the typeface I chose for the Arabic script.
On MS Windows, this might be Tahoma with a rich repertoire of Arabic
glyphs.


OK. Were you happy about the situation with initial, medial, final
and isolated forms, that you were mentioning earlier in relation to
some of these Arabic-based script families?

--
Procrastination gives you something to look forward
to putting off tomorrow. -spotted on ahbou
Jul 23 '05 #30

P: n/a
On Sat, 11 Dec 2004, Henri Sivonen wrote:
http://www.unics.uni-hannover.de/nht...-attribute.htm
Updated on 13 Dec 2004: "LANG=xx" added.
Pashto and Urdu are still not recognized.


The fix is trivial. See
https://bugzilla.mozilla.org/show_bug.cgi?id=274102


Thanks!
There is another issue with the LANG attribute, namely with unknown languages
such as "xx" or "mn". Currently, Mozilla 1.7 displays such text in the
Western typeface. However, I think such text should be displayed in the
same way as text without LANG attribute, i.e. in the typeface corresponding
to the encoding (charset) of the page.

Concrete example:
charset=ISO-8859-5 lang="xx"
lang="mn"

[Use the URL given above and choose "ISO-8859-5" in the browser.]
Mozilla 1.7 uses the Western typeface for display. But it should use
the Cyrillic typeface because of charset=ISO-8859-5.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #31

P: n/a
On Sat, 11 Dec 2004, Alan J. Flavell wrote:
However, text labelled LANG=ps, sd, ur (Pashto, Sindhi, Urdu) should
be displayed _completely_ in the typeface I chose for the Arabic script.
On MS Windows, this might be Tahoma with a rich repertoire of Arabic
glyphs.


OK. Were you happy about the situation with initial, medial, final
and isolated forms, that you were mentioning earlier in relation to
some of these Arabic-based script families?


No. The problem is more subtle. Suppose I have chosen Arial as my
preferred Western typeface and Tahoma as my preferred Arabic typeface
(on MS Windows). I have text marked with LANG=ps. Then Mozilla will
first apply the Western typeface Arial because the language "ps"
is currently unrecognized. Arial contains letters for the Arabic
language, but not the special Pashto letters. Pashto letters are
therefore taken from Tahoma (?) or Arial Unicode (?) or whatever.

*However, letters from different typefaces do not join.*

You should be able to see this effect on
http://ppewww.ph.gla.ac.uk/~flavell/...unidata06.html ;-)
when the glyph for ـ is taken from your Western Typeface
Arial and the glyphs for, say, Urdu (0679, 0688, 0691, 06C1, 06BE)
are taken from some other typeface.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #32

P: n/a
On Sat, 11 Dec 2004, Alan J. Flavell wrote:
So, in the worst cases, it looks like the proverbial ransom note.


It might look like
http://www.unics.uni-hannover.de/nht...emp/pashto.gif
All lines contain the word "Pashto", but only the first line
in Tahoma is correct. The other lines have been formatted in
Akhbar, Neskh, Arial Unicode, resp., where the special Pashto
letter "sh" is either taken from another font or does not join.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #33

P: n/a
In article <Pine.GSO.4.44.0412131600080.9276-100000@s5b005>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
There is another issue with the LANG attribute, namely with unknown languages
such as "xx" or "mn". Currently, Mozilla 1.7 displays such text in the
Western typeface.


Could be by design. Could be a bug. Have you filed a bug?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #34

P: n/a
On Mon, 13 Dec 2004, Henri Sivonen wrote:
There is another issue with the LANG attribute, namely with unknown languages
such as "xx" or "mn". Currently, Mozilla 1.7 displays such text in the
Western typeface.


Could be by design. Could be a bug. Have you filed a bug?


No - I'm still waiting for a reaction to bug #266474.

This test page
http://www.unics.uni-hannover.de/nht...p/ticket.html6
should show the unwanted behaviour with the LANG attribute.
In Mozilla for Windows, choose Arial as your Western typeface
and Tahoma as your Arabic typeface. Tahoma on Windows 2000
contains all Urdu letters; Arial does not. The text without
LANG markup should be displayed fine in Tahoma, where the
three letters (t. k t.) join.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #35

P: n/a
In article <Pine.GSO.4.44.0412141551560.14390-100000@s5b004>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Mon, 13 Dec 2004, Henri Sivonen wrote:
There is another issue with the LANG attribute, namely with unknown
languages
such as "xx" or "mn". Currently, Mozilla 1.7 displays such text in the
Western typeface.


Could be by design. Could be a bug. Have you filed a bug?


No - I'm still waiting for a reaction to bug #266474.


How is that bug relevant to this issue?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #36

P: n/a
On Tue, 14 Dec 2004, Henri Sivonen wrote:
No - I'm still waiting for a reaction to bug #266474.


How is that bug relevant to this issue?


It isn't. But I ask myself why I should bother filing bugs
when there is no reaction at all.

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #37
