Zero width space still unsafe?

Jukka reports on
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
that Internet Explorer 6 fails on the "zero width space" U+200B &#8203;

Is this observation still valid? For which versions of MS Windows
does it apply? Does it depend on the encoding (charset)?
I have a test page in three encodings:
http://www.unics.uni-hannover.de/nht...temp/zwsp.html
http://www.unics.uni-hannover.de/nht...mp/zwsp.html11
http://www.unics.uni-hannover.de/nhtcapri/temp/zwsp.tis
After each letter "z" there is a "zero width space". Do you see
an empty box instead? The correct browser behaviour would be
to allow a line break after "zero width space".
http://validator.w3.org does not recognize ISO-8859-11.
Why not?
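
For anyone who wants to reproduce such test pages, here is a minimal sketch of how the text could be generated. The function name and the sample text are made up for illustration, and `iso8859_11` and `tis_620` are the names of Python's standard codecs, not necessarily the charset labels a server would send. Neither of those two Thai encodings has a code position for U+200B, so in them the character can only be written as a numeric character reference.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def encode_test_text(text, encoding):
    """Put a ZWSP after every "z" and encode the result.

    Characters the target encoding cannot represent are written as
    numeric character references such as &#8203;.
    """
    marked = text.replace("z", "z" + ZWSP)
    return marked.encode(encoding, errors="xmlcharrefreplace")


for enc in ("utf-8", "iso8859_11", "tis_620"):
    print(enc, encode_test_text("zzz", enc))
```

In UTF-8 the character is stored directly as three bytes; in the other two encodings it shows up as the reference `&#8203;`.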

Jul 23 '05 #1
On Mon, 20 Dec 2004, Andreas Prilop wrote:
http://validator.w3.org does not recognize ISO-8859-11.
Why not?


Hmmm, Google's hit for this:

http://mail.apps.ietf.org/ietf/charsets/msg01362.html

(which leads to

http://mail.apps.ietf.org/ietf/charsets/msg01363.html )

says that (as of April 2003) it hadn't been registered at IANA.

And it's still not registered with IANA (although 8859-16, which
I think came in at around the same time, is there)
Jul 23 '05 #2
On Mon, 20 Dec 2004 15:46:54 +0100, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
Jukka reports on
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
that Internet Explorer 6 fails on the "zero width space" U+200B ​

Is this observation still valid? For which versions of MS Windows
does it apply? Does it depend on the encoding (charset)?
I have a test page in three encodings:
http://www.unics.uni-hannover.de/nht...temp/zwsp.html
http://www.unics.uni-hannover.de/nht...mp/zwsp.html11
http://www.unics.uni-hannover.de/nhtcapri/temp/zwsp.tis
After each letter "z" there is a "zero width space". Do you see
an empty box instead? The correct browser behaviour would be
to allow a line break after "zero width space".


Mozilla and Firefox behave as required, i.e. no "empty box" and
correct line breaks at various points depending on UA window width.

IE6 (+ latest SP) is also correct for UTF-8, but...

...it shows the box for the other two examples yet still breaks lines at
points either before or after the boxes, depending on window width.
Peculiar behavior :-)
http://validator.w3.org does not recognize ISO-8859-11.
Why not?


Que Nick?

--
Rex
Jul 23 '05 #3
On Mon, 20 Dec 2004, Jan Roland Eriksson wrote:

[IE...]
... shows the box for the other two examples but still linebreaks at
points either before or after the boxes depending on window width.


Strange, it doesn't do that for me (neither IE6 Win2K nor XP SP2).

However, I do believe that both of them have the Japanese language
option installed. Yup: control panel -> regional options shows that
my Win2k has Japanese and various other language options enabled,
though *not* Thai; whereas this XP has the boxes turned on for
"complex script... including Thai" and "East Asian languages".
Jul 23 '05 #4
On Mon, 20 Dec 2004 16:22:53 +0000, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
On Mon, 20 Dec 2004, Jan Roland Eriksson wrote:
[IE...]
... shows the box for the other two examples but still linebreaks at
points either before or after the boxes depending on window width.


Strange, it doesn't do that for me (neither IE6 Win2K nor XP SP2).
...I do believe that both of them have the Japanese language
option installed. Yup: control panel -> regional options shows that
my Win2k has Japanese and various other language options enabled,
though *not* Thai; whereas this XP has the boxes turned on for
"complex script... including Thai" and "East Asian languages".


XP Pro SP2 here and IE6 + latest SP (plus all the latest security stuff, of
course) but no "fancy" languages, only English and Swedish AFAICS.

(I can't read anything but text in Western alphabets anyway :-)

--
Rex
Jul 23 '05 #5
On Mon, 20 Dec 2004, Jan Roland Eriksson wrote:
(I can't read anything but text in Western alphabets anyway :-)


Neither can I, but by installing Japanese I found I got a load of
interesting symbols to display in IE, which were otherwise
unavailable, even though they had no evident relevance to Japanese.

(AFAIR, most of them were previously displaying just fine in Mozilla,
which was finding them from somewhere or other - but IE wasn't finding
them, as I discuss on my browsers-fonts web page.)
Jul 23 '05 #6
In article <Pine.GSO.4.44.0412201534460.12988-100000@s5b004>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
http://www.unics.uni-hannover.de/nht...temp/zwsp.html
http://www.unics.uni-hannover.de/nht...mp/zwsp.html11
http://www.unics.uni-hannover.de/nhtcapri/temp/zwsp.tis
After each letter "z" there is a "zero width space". Do you see
an empty box instead?


I see a box in Firefox (trunk) on OS X.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #7
On Mon, 20 Dec 2004 17:59:50 +0000, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
On Mon, 20 Dec 2004, Jan Roland Eriksson wrote:
(I can't read anything but text in Western alphabets anyway :-)


Neither can I, but by installing Japanese I found I got a load of
interesting symbols to display in IE, which were otherwise
unavailable, even though they had no evident relevance to Japanese.


It works for me, no fancy language packs installed. Is it perhaps
font related?

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

Jul 23 '05 #8
On Mon, 20 Dec 2004, Jim Ley wrote:
It works for me, no fancy language packs installed.


That's a useful data point, thanks. Would that be XP?


Is it perhaps font related?


Could well be - I'm afraid my understanding of Windows internals
is quite lacking - most of what I think I've grasped has been done
by experimenting. And installing and de-installing fonts and language
packs to prove a point rapidly gets stale, as I'm sure you'd agree.

There do seem to be some typographical issues that can only be
resolved by installing the relevant language pack. I'm afraid I
don't really know whether this is one of them or not.
Jul 23 '05 #9
On Mon, 20 Dec 2004 21:23:36 +0000, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
On Mon, 20 Dec 2004, Jim Ley wrote:
It works for me, no fancy language packs installed.


That's a useful data point, thanks. Would that be XP?


Yes XP SP2

The only thing that might be thought of as increasing support for more
chars was manually installing Arial Unicode.

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

Jul 23 '05 #10
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Jukka reports on
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
that Internet Explorer 6 fails on the "zero width space" U+200B


... in "normal" conditions, yes. By "normal" I mean that the font used
is not Arial Unicode MS or Lucida Sans Unicode (or some special font).

It seems to me that the behavior mostly depends on fonts, which in turn
depend on many things. If an author style sheet suggests
font-family: Arial Unicode MS, Lucida Sans Unicode;
then I would say that the great majority of users would see the
document rendered properly in this respect. But such settings may have
drawbacks.

The problem, as I understand it, is this:
- IE 6 (and even IE 4 and IE 5) knows the basic property of U+200B that
a line break is permitted after it
- however it does not know that it has zero width so that the browser
need not render anything for it
- so it uses whatever the font in use has for the character
- and it fails to scan through the available fonts to pick up one that
contains a glyph for the character.
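
The four points above can be modelled in a few lines. This is only a sketch of the behaviour as described in this post, not of IE's real code; the function, its flags and the font coverage table are all hypothetical.

```python
ZERO_WIDTH = {"\u200b"}  # characters that need no visible glyph

# Hypothetical coverage data, for illustration only.
FONTS = {
    "Times New Roman": {"z"},
    "Arial Unicode MS": {"z", "\u200b"},
}


def render_choice(ch, current_font, fonts, *, knows_zero_width, scans_fonts):
    """Return the font to draw `ch` from, or "nothing", or "box"."""
    if knows_zero_width and ch in ZERO_WIDTH:
        return "nothing"      # nothing has to be drawn at all
    if ch in fonts[current_font]:
        return current_font   # the font in use has a glyph
    if scans_fonts:
        for name, coverage in fonts.items():
            if ch in coverage:
                return name   # fall back to any font with a glyph
    return "box"              # missing-glyph box


# The behaviour described above: neither flag is set.
print(render_choice("\u200b", "Times New Roman", FONTS,
                    knows_zero_width=False, scans_fonts=False))
```

With both flags off, the result depends entirely on the font in use, which matches the observation that suggesting Arial Unicode MS makes the box go away.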

So my practical conclusion is that U+200B is not ready for prime time,
and if it is important to suggest permissible line breaks in a long
string, the nonstandard <wbr> is still the practical solution.
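
If the long strings are produced by a program, the hints can be inserted mechanically. A minimal sketch follows; the function name and the chunk size are arbitrary choices for illustration, and the text is assumed to be HTML-escaped already and free of markup.

```python
def add_break_hints(text, every=10, hint="<wbr>"):
    """Insert `hint` between consecutive chunks of `every` characters."""
    if every < 1:
        raise ValueError("every must be at least 1")
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return hint.join(chunks)


print(add_break_hints("abcdefghij", every=4))
```

Passing `hint="&#8203;"` instead would produce the ZWSP variant, with the rendering caveats discussed in this thread.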

For some additional notes, see
http://www.cs.tut.fi/~jkorpela/html/nobr.html#zwsp
where I mention that the HTML 4.01 specification explicitly leaves the
rendering of ZWSP undefined (as one of the white space characters for
which rendering is _not_ defined).

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #11
On Mon, 20 Dec 2004 21:23:36 +0000, "Alan J. Flavell"
<fl*****@ph.gla.ac.uk> wrote:
On Mon, 20 Dec 2004, Jim Ley wrote: [...]
Is it perhaps font related?

Could well be - I'm afraid my understanding of Windows internals
is quite lacking...
The real basic fact is that there is no one single person that knows how
Windows is supposed to work today, not even within MS themselves.

That comes as a result of "outsourcing" of coding work. Most parts of
MS products are today produced in so-called low-cost countries: India,
Russia, China and every other country that is willing to sell the souls
of their people just to get the money in.

For quite some time back it's all about the money, and protection of the
"monopoly". Heck, MS is in a "full control" position of just about every
hard disk producing company in the world. Proved by the fact that it is
cheaper to buy a new HD with Win-something pre installed than it is to
get the same drive all blank from the start :-)
- most of what I think I've grasped has been done by experimenting.


So have we all, but the target keeps moving around :-)

Allow me to predict (based on the last few days' "experimenting") that, given
the right tool, each and every Win NT/XP user can find at least 1000
dead entries in his registry database.

The "registry database" is just another con played on MS users that made
it possible for MS to hide away all the basic idiocy that is buried in
that operating system.

From what I have found, it looks like a garbage dump for both MS and
other applications that get installed in the Win environment.

I'm pretty sure that this (ab)usage of the "registry database" was not
an original idea of Dave Cutler.

--
Rex
Jul 23 '05 #12
On Mon, 20 Dec 2004, Jukka K. Korpela wrote:
It seems to me that the behavior mostly depends on fonts, which in turn
depend on many things. If an author style sheet suggests
font-family: Arial Unicode MS, Lucida Sans Unicode;
then I would say that the great majority of users would see the
document rendered properly in this respect. But such settings may have
drawbacks.
I believe that Tahoma is likely to rate better than L.S.U in this
regard, whereas we shouldn't assume that most people have A.U.MS.

Whereas, if they have a font that's well tuned to their writing
system, then telling MSIE to use any of the above will be a
disservice to them. It's a difficult choice to have to make.
So my practical conclusion is that U+200B is not ready for prime time,
In general I'd have to agree with you. However, the context was
browsing of the Thai writing system, so one might presume that anyone
interested in that would be willing to equip themselves with an
appropriate font and browser settings. The fact that it'll make a
hopeless mess for the rest of us is neither here nor there, since we
can't read it anyway. IMHO and YMMV...
and if it is important to suggest permissible line breaks in a long
string, the nonstandard <wbr> is still the practical solution.
I don't know why that cited Thai page claims that this non-standard
<wbr> is no longer working (for some practical value of the term
"working" ;-)

Mind you, the marker could just as well be <foobar> or <secam>, for
all that most browsers seem to care. Or <x> if you prefer less typing
;-)
For some additional notes, see
http://www.cs.tut.fi/~jkorpela/html/nobr.html#zwsp
where I mention that the HTML 4.01 specification explicitly leaves the
rendering of ZWSP (as one of the white space characters for which
rendering is _not_ defined) explicitly undefined.


Possibly; but there are hints elsewhere that browsers are expected to
apply appropriate typography for the writing system in use, and
Thai evidently needs this, so it's still on the agenda for browser
implementers, no matter that HTML doesn't demand it in so many words.

Jul 23 '05 #13
On Mon, 20 Dec 2004 23:55:56 +0100, Jan Roland Eriksson
<jr****@newsguy.com> wrote:
That come as a result of "outsourcing" for coding works. Most parts of
MS products are today produced in so called low cost countries, India,
Russia, China and every other country that is willing to sell the souls
of their people just to get the money in.
Good, I'm very, very glad that they're using low cost developers;
almost all the problems I've seen with outsourcing have been because of
poor management by the western countries, not low cost developers. It
certainly makes sense for them.
Heck, MS is in a "full control" position of just about every
hard disk producing company in the world. Proved by the fact that it is
cheaper to buy a new HD with Win-something pre installed than it is to
get the same drive all blank from the start :-)
Could you tell me where I get to buy these hard disks? I've never
even seen a hard disk for sale with an operating system on it.
Allow me to predict (as based on last days "experimenting") that, given
the right tool, every and all Win NT/XP user can find at least a 1000
dead entries in his registry data base.


I think there's a good chance that any computer user could find 1000
dead lines of config data.

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

Jul 23 '05 #14
On Mon, 20 Dec 2004, Henri Sivonen wrote:
http://www.unics.uni-hannover.de/nhtcapri/temp/zwsp.tis
After each letter "z" there is a "zero width space". Do you see
an empty box instead?


I see a box in Firefox (trunk) on OS X.


Firefox (Solaris 9) does not display a box - it shows only the letters
and breaks, if necessary, after "z".

The MacThai character set includes the zero width space:
http://www.unicode.org/Public/MAPPIN...APPLE/THAI.TXT
If you don't mind, you might (temporarily) install Thai language
support and see what happens.

I regard the "zero width space" not as a graphic character, but as
a control character like "newline" or "zero width joiner". There's
nothing to display with these characters. What's the point of including
glyphs for "newline" or "zero width space" in a font? Consider a
program that wouldn't do a newline when the font has no glyph for it!
A bit stupid. There's something wrong with programs when they insist
on displaying certain glyphs for the control characters "newline" or
"zero width space".

The mystery is:
How are existing Thai pages written?

Jul 23 '05 #15
In article <Pine.GSO.4.44.0412211523310.12191-100000@s5b003>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Mon, 20 Dec 2004, Henri Sivonen wrote:
http://www.unics.uni-hannover.de/nhtcapri/temp/zwsp.tis
After each letter "z" there is a "zero width space". Do you see
an empty box instead?


I see a box in Firefox (trunk) on OS X.


Firefox (Solaris 9) does not display a box - it shows only the letters
and breaks, if necessary, after "z".

The MacThai character set includes the zero width space:
http://www.unicode.org/Public/MAPPIN...APPLE/THAI.TXT
If you don't mind, you might (temporarily) install Thai language
support and see what happens.


I already have "fonts for additional languages" installed and the Thai
input methods are selectable.

Thai display in Gecko on OS X is broken:
https://bugzilla.mozilla.org/show_bug.cgi?id=225217

In general, Gecko on OS X will continue to be broken for many languages
until the gfx is migrated to ATSUI. I'm not holding my breath.
https://bugzilla.mozilla.org/show_bug.cgi?id=atsui

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #16
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
I regard the "zero width space" not as a graphic character, but as
a control character like "newline" or "zero width joiner".
That's a reasonable idea, but Unicode defines it as "separator, space".
There's nothing to display with these characters.
By definition, zero width space has no width but may get expanded in
formatting.

I'd say it's dual: printable _and_ control character, in the same sense
as the Ascii space is.
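
The general category can be checked against whatever version of the Unicode Character Database a given runtime ships with. A small sketch follows (the helper name is made up). Note that the answer depends on the database version: newer versions list U+200B as Cf (format) rather than Zs (space separator), so the output need not match the "separator, space" description above.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, character name) for `ch`."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in ("\u0020", "\u200b", "\u200c", "\u200d", "\u2028"):
    print(*describe(ch))
```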
What's the point of
including glyphs for "newline" or "zero width space" in a font?
Regarding "newline", depends on what you mean. A program that
cannot handle Ascii CR and LF is probably so broken that nothing helps.
But the _preferred_ line separator in Unicode is LINE SEPARATOR U+2028,
and support for it in programs is fairly limited. Similar considerations
apply to ZERO WIDTH SPACE: programs might fail to recognize it in any
particular meaning but just try to render it. For such situations, a
fallback, in the form of a glyph shape, would be useful. For zero width
space, an empty zero-width glyph is appropriate. LS is a different issue
(maybe it _should_ look like a special symbol that somehow indicates
line separation).
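
Where the font offers no such fallback glyph, a program can also filter the text itself before handing it to a renderer that does not know these characters. The following is a sketch of one possible fallback, with a made-up function name; it gives up the line break opportunity that ZWSP was meant to provide, so it is a last resort rather than proper support.

```python
LINE_SEPARATOR = "\u2028"
ZWSP = "\u200b"


def plain_text_fallback(text):
    """Turn LINE SEPARATOR into a newline and drop zero width spaces."""
    return text.replace(LINE_SEPARATOR, "\n").replace(ZWSP, "")


print(repr(plain_text_fallback("a\u200bb\u2028c")))
```

As a data point on program support: Python's own `str.splitlines()` does treat U+2028 as a line boundary.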
There's something wrong with programs
when they insist of displaying certain glyphs for the control
characters "newline" or "zero width space".


They don't have adequate Unicode support, but who has?

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #17
On Mon, 20 Dec 2004 23:41:24 GMT, ji*@jibbering.com (Jim Ley) wrote:
On Mon, 20 Dec 2004 23:55:56 +0100, Jan Roland Eriksson
<jr****@newsguy.com> wrote:
That come as a result of "outsourcing" for coding works. Most parts of
MS products are today produced in so called low cost countries...
Good, I'm very, very glad that they're using low cost developers,
almost all the problems I've seen with outsourcing has been because of
poor management by the western countries, not low cost developers.
It certainly makes sense for them.


I did not mean to imply that low cost developers are doing a bad job;
on the contrary, in most cases.

But it's my experience from some 25 years in industrial automation that
the problems of creating a good final product are proportional to the
square of the distance between the point of management and the point of
production. It's not only "poor management" but lots of other criteria
that come into this, cultural differences not to be forgotten.
Heck, MS is in a "full control" position of just about every
hard disk producing company in the world. Proved by the fact that it is
cheaper to buy a new HD with Win-something pre installed than it is to
get the same drive all blank from the start :-)


Could you tell me where I get to buy these hard disks? I've never
even seen a hard disk for sale with an operating system on it.


The computer store in the same block where I live could be a good start.
Their argument for selling pre-installed Win drives is that it's
cheaper and I can always go on to reformat the drive myself if I need it
blank.

Sweden has for a number of years been regarded as the most Win-populated
country per capita in the world. There are political reasons
for this, e.g. private PCs can be had as tax-deductible units through
one's own employer. That may have something to do with the status of the HD
market here too.

--
Rex [nuf OT for now]
Jul 23 '05 #18
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Mon, 20 Dec 2004, Jukka K. Korpela wrote:
It seems to me that the behavior mostly depends on fonts, which in
turn depend on many things. If an author style sheet suggests
font-family: Arial Unicode MS, Lucida Sans Unicode;
then I would say that the great majority of users would see the
document rendered properly in this respect. But such settings may
have drawbacks.
I believe that Tahoma is likely to rate better than L.S.U in this
regard, whereas we shouldn't assume that most people have A.U.MS.


But on my system at least (Win98, with Tahoma probably as shipped with
Windows), Tahoma does not contain U+200B. Instead, a square is
displayed.
Whereas, if they have a font that's well tuned to their writing
system, then telling MSIE to use any of the above will be a
disservice to them. It's a difficult choice to have to make.
Indeed. But at least people using MSIE would see the data (assuming the
author has correctly identified the font(s) he suggests so that each of
them contains all the glyphs needed).
In general I'd have to agree with you. However, the context was
browsing of the Thai writing system, so one might presume that
anyone interested in that would be willing to equip themselves with
an appropriate font and browser settings.
I'm afraid I have missed that part of the discussion. Surely for some
specific purposes, we need to make some fair assumptions about the
potential audience.
I don't know why that cited Thai page claims that this non-standard
<wbr> is no longer working (for some practical value of the term
"working" ;-)
Perhaps because Netscape dropped support in some version(s) - but soon
restored it.
Mind you, the marker could just as well be <foobar> or <secam>, for
all that most browsers seem to care. Or <x> if you prefer less
typing ;-)


Do you think so? In my test, foo<foobar>bar gets treated the same way
as foobar.

But now it's time for a really weird observation.

I used MS Word 2000 and inserted (via Insert/Character) a special
character for line break hints (sorry, I just assume they call it that
way in the English version - that's my back-translation), which turns
out to be U+200C ZERO WIDTH NON-JOINER at least when I save as HTML,
i.e. I get &#8204;. Now that's not ZWSP, though similar. But wait...
The HTML that Word spits out contains

<p class=MsoNormal><span lang=FI>foo</span><span dir=RTL></span><span
lang=AR-SA dir=RTL>&#8204;</span><span lang=FI>bar<span style=
'letter-spacing:3.0pt'><o:p></o:p></span></span></p>

and while this is monstrous, it "works" in the sense that there is no box
or bar in place of the special character; instead it works as an
invisible character that permits a simple line break - _even if_ the
font used does not contain that character.

Magic? I was able to reduce this to
foo<span dir="rtl">&#8204;</span>bar
and the same trick works for &#8203; as well.
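
For text generated by a program, the reduced form could be applied mechanically. A sketch, with a made-up function name; it only reproduces the markup from this post, which has been observed to work in IE 6 and has not been checked in other browsers. The input is assumed to be HTML-escaped text without markup.

```python
def zwsp_hack(text):
    """Replace each U+200B with the <span dir="rtl"> wrapper from this post."""
    return text.replace("\u200b", '<span dir="rtl">&#8203;</span>')


print(zwsp_hack("foo\u200bbar"))
```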

Can we declare this an official hack? :-) And should it be more
"semantic", with bdo instead of span?

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #19
On Wed, 22 Dec 2004, Jukka K. Korpela wrote:
But on my system at least (Win98, with Tahoma probably as shipped with
Windows), Tahoma does not contain U+200B. Instead, a square is
displayed.
Thus confirming what I keep saying to others, that the name of a font
is no guarantee of its character repertoire, in general.
Mind you, the marker could just as well be <foobar> or <secam>, for
all that most browsers seem to care. Or <x> if you prefer less
typing ;-)


Do you think so?


Not any longer - sorry! I'm sure I tested this, but it may have been
some years back. My apologies for posting that without checking!!
But now it's time for a really weird observation. [...]
Magic? I was able to reduce this to
foo<span dir="rtl">&#8204;</span>bar
and the same trick works for &#8203; as well.

Can we declare this an official hack? :-)
Bizarre. How many other browsers do we have to try it in before
we can confidently recommend it...?
And should it be more "semantic", with bdo instead of span?


I'll save that question for later, if I may ;-)
Jul 23 '05 #20
Thai is unusual in that it uses spaces between sentences, but no spaces
within sentences.

Breaking between words is done by some combination of a dictionary and
an algorithm that can recognise where a word ends (don't ask me for
details, I am not a programmer). This requires support from the
operating system. Pre-Unicode, there was a special Thai edition of
Windows. With Unicode, Thai support is built in to Windows (though not
necessarily installed by default).

Applications need to use the OS' support for Thai in order to break
between words. This works in recent browsers and in Word for Windows.
It does not work in Word 2004 because Microsoft have not yet made use
of the Thai support in Mac OS X 10.3.
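
To give a rough idea of the dictionary part, here is a toy sketch of greedy longest-match segmentation. The function name and the two-word dictionary are made up for illustration; what the operating system actually does is more sophisticated, and this is not a description of it.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word become one-character words.
    """
    longest = max([len(word) for word in dictionary] + [1])
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary: "ภาษา" (language) and "ไทย" (Thai).
words = segment("ภาษาไทย", {"ภาษา", "ไทย"})
print(words)
# Joining with U+200B marks the permitted break points explicitly.
marked = "\u200b".join(words)
```

An author who cannot rely on the browser's word breaking could use output like `marked` to place explicit break opportunities, subject to the ZWSP rendering problems discussed earlier in the thread.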

--
Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)

Jul 23 '05 #21
On Wed, 22 Dec 2004, Jukka K. Korpela wrote:
But on my system at least (Win98, with Tahoma probably as shipped with
Windows), Tahoma does not contain U+200B. Instead, a square is
displayed.
That's why fonts have a version number, too :-) The character set of
Tahoma has been enlarged with every Windows version. The version that
comes with Windows XP/2003 covers all extended Arabic characters and
is therefore well suited for all languages that use the Arabic script.
However, the context was browsing of the Thai writing system,


I'm afraid I have missed that part of the discussion.


Yes, it was hidden in personal e-mail between Alan and me :-)
Magic? I was able to reduce this to
foo<span dir="rtl">&#8204;</span>bar
and the same trick works for &#8203; as well.

^^^^
Did you mean ZWSP ​ or ZWJ * ?

What about
foo<span dir="rtl"></span>bar
foo*bar
?

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #22
On Thu, 23 Dec 2004, Andreas Prilop wrote:
Jukka:
I'm afraid I have missed that part of the discussion.


Yes, it was hidden in personal e-mail between Alan and me :-)


Not entirely: there had been mentions of iso-8859-11 and Thai on
this thread too, although I'm not blaming Jukka for missing it.
Jul 23 '05 #23
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Magic? I was able to reduce this to
foo<span dir="rtl">&#8204;</span>bar and the same trick works for
&#8203; as well. ^^^^
Did you mean ZWSP ​ or ZWJ * ?


I meant ZWSP as I wrote. As far as I understand, ZWJ is a way to
_prevent_ line breaks.
What about
foo<span dir="rtl"></span>bar
Interesting idea (maybe the magic _is_ just in the dir attribute), but
IE seems to completely ignore the span element (as it should) and treat
the above as just
foobar
foo*bar
?


That was among the alternatives I tested, and there * doesn't
work as it should; instead I see roughly
foo|bar
i.e. a bar-like symbol in place of the special character. Adding <span>
markup without dir attribute does not change this. So it seems that the
magic is in the interaction between that attribute and the special
character.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #24
On Thu, 23 Dec 2004, Jukka K. Korpela wrote:
Magic? I was able to reduce this to
foo<span dir="rtl">&#8204;</span>bar and the same trick works for
&#8203; as well. ^^^^
Did you mean ZWSP ​ or ZWJ * ?


I meant ZWSP as I wrote.


But didn't you write earlier that &#8203; is displayed as an
empty box?
As far as I understand, ZWJ is a way to _prevent_ line breaks.


No, no! ZWJ and ZWNJ have nothing to do with line breaks.
At least, they shall not; they control the shape of Arabic glyphs.

A preliminary document is here:
http://www.unics.uni-hannover.de/nhtcapri/zwnj.html

--
Mars, unlike Earth, has no atmosphere.
The Chicago manual of style, 15th ed., p. 362

Jul 23 '05 #25
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
But didn't you write earlier that &#8203; is displayed as an
empty box?


Yes, and using <span dir="rtl">&#8203;</span> prevents that.
As far as I understand, ZWJ is a way to _prevent_ line breaks.


No, no! ZWJ and ZWNJ have nothing to do with line breaks.


(My point above was that I didn't consider ZWJ since it prevents line
breaks instead of permitting them.)

Well, ZWJ _does_ prevent line breaks and ZWNJ allows line breaks where
they wouldn't otherwise be allowed, don't they? They have line breaking
behavior, even if the reason for their existence might be something
different.

MS Word (even Word 2003) seems to generate ZWNJ when I select a line
breaking hint from the Insert/Character/Special characters menu.
This might reflect some older idea of using ZWNJ for such purposes.
And a casual Web author might get the same idea, e.g. because HTML has
&zwnj; (and &zwj;) but not &zwsp;.
At least, they shall not; they control the shape of Arabic glyphs.


Or joining behavior in general, don't they?

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #26
On Thu, 23 Dec 2004, Jukka K. Korpela wrote:
No, no! ZWJ and ZWNJ have nothing to do with line breaks.
Well, ZWJ _does_ prevent line breaks and ZWNJ allows line breaks where
they wouldn't otherwise be allowed, don't they?


No! I did refer you already to
http://www.unics.uni-hannover.de/nhtcapri/zwnj.html
which shows (among other things) that &zwnj; may be part of a
Persian word. Breaking after or before &zwnj; is not acceptable!
They have line breaking
behavior, even if the reason for their existence might be something
different.
I don't know what you mean by "line breaking behavior". Perhaps you
just mean IE's (broken) behaviour. Please refer to
http://www.unicode.org/reports/tr14/#Table1
http://www.unicode.org/Public/4.0-Up...reak-4.0.0.txt
Line breaking before and after U+200C, U+200D is prohibited.
MS Word (even Word 2003) seems to generate ZWNJ when I select a line
breaking hint from the Insert/Character/Special characters menu.


You just demonstrate (again) that Microsoft's programs are broken
as designed.
Jul 23 '05 #27
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
I don't know what you mean by "line breaking behavior".
Sorry for my confusion.
Perhaps you
just mean IE's (broken) behaviour.
Well, I guess I mainly confused ZWJ and ZWNJ with zero-width spaces.
Please refer to
http://www.unicode.org/reports/tr14/#Table1
http://www.unicode.org/Public/4.0-Up...reak-4.0.0.txt
I stand corrected, but...
Line breaking before and after U+200C, U+200D is prohibited.


...as far as I can see, they are in line breaking class CM, which means
that a line break before the character is prohibited, whereas a line
break after it may or may not be allowed, depending on the next
character.
MS Word (even Word 2003) seems to generate ZWNJ when I select a
line breaking hint from the Insert/Character/Special characters
menu.


You just demonstrate (again) that Microsoft's programs are broken
as designed.


Well, it surely looks _very_ odd now, and might explain some of my
difficulties as a book author (when I had tried to help the layout
process with such hints - which might cause serious trouble when
porting data from MS Word to a publishing program).

Luckily IE does not treat &zwnj; that way. But if you use "Save As Web
page" in MS Word, it actually generates &#8204; (= &zwnj;) from a line
breaking hint, as I mentioned, so Microsoft programs aren't quite
compatible even with other Microsoft programs. (This is really not such
a big surprise.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 23 '05 #28
On Tue, 4 Jan 2005, Jukka K. Korpela wrote:
Line breaking before and after U+200C, U+200D is prohibited.


...as far as I can see, they are in line breaking class CM, which means
that a line break before the character is prohibited, whereas a line
break after it may or may not be allowed, depending on the next
character.


Yes - I tacitly assumed that there are ordinary letters (class AL)
before and after U+200C, U+200D as in my examples.
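
The case with ordinary letters on both sides can be written down as a toy model. This is a sketch only: the class table is hand-written from the classes named in this thread (later versions of the data may classify these characters differently), every other character is simply assumed to be class AL, and it is in no way a full implementation of the line breaking algorithm.

```python
# Hand-written table based on the classes named in this thread.
LINE_CLASS = {"\u200b": "ZW", "\u200c": "CM", "\u200d": "CM"}


def line_class(ch):
    return LINE_CLASS.get(ch, "AL")  # assumption: everything else is a letter


def break_allowed(text, i):
    """May a line be broken between text[i - 1] and text[i]?"""
    if not 0 < i < len(text):
        return False
    if line_class(text[i]) == "ZW":
        return False  # no break before ZWSP
    if line_class(text[i - 1]) == "ZW":
        return True   # a break is allowed after ZWSP
    if line_class(text[i]) == "CM":
        return False  # no break before a combining mark
    # What is left is AL followed by AL (a CM acts like the letter it
    # follows), and there is no break between letters.
    return False


zwnj_text = "ab\u200ccd"
zwsp_text = "ab\u200bcd"
print([break_allowed(zwnj_text, i) for i in range(len(zwnj_text))])
print([break_allowed(zwsp_text, i) for i in range(len(zwsp_text))])
```

With letters on both sides, the model permits no break before or after U+200C, while U+200B permits one after it.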

Jul 23 '05 #29
