To wrap or not to wrap "C++"?

Boris

Can anyone tell me if Opera 9.5 is behaving correctly when wrapping the
word C++, eg:

C+
+

Opera 9.2 didn't wrap C++. For those who use Opera 9.5 there is a test
case at http://www.highscore.de/browsertest/cpp.html (try different window
sizes until Opera 9.5 wraps C++).

Boris

Sep 13 '08 #1

Holger Jeromin

Boris wrote on 13.09.2008 16:23:

Can anyone tell me if Opera 9.5 is behaving correctly when wrapping the
word C++, eg:

C+
+

Opera 9.6 Weekly beta wraps
C
++
and
C+
+

Version 9.60 beta
Build 10427
Platform Win32
Operating system Windows XP

Opera 9.2 didn't wrap C++. For those who use Opera 9.5 there is a test
case at http://www.highscore.de/browsertest/cpp.html (try different window
sizes until Opera 9.5 wraps C++).

--
Kind regards
Holger Jeromin

Sep 13 '08 #2

Ben C

On 2008-09-13, Boris <bo****@web.de> wrote:

Can anyone tell me if Opera 9.5 is behaving correctly when wrapping the
word C++, eg:

C+
+

The specification that defines all this is Unicode Standard Annex #14.

Browsers don't have to follow that specification to claim they support
HTML and/or CSS, but it's the easiest way to support a large number of
world languages.

So technically, Opera is correct pretty much whatever it does, but it
looks like they are basically doing the Unicode rules.

You can read Annex 14 for yourself:

http://unicode.org/reports/tr14/

The long and short of it is that "+" has the "Line breaking class" of PR
or "Prefix" so it's treated a bit like a currency symbol.

Now, I think that means that "by default" you can't break "C+" or "+C",
based on the definition of PR.

Not sure what they mean by "by default". I haven't read the whole spec.

But if you look at the pair table in section 7.3, that says you can
break between "C+" (and "++") but not between "+C", which is just what
Opera is doing. (The class of "C" is "AL"-- "alphabetic" or something).
By 'break between "C+"' I mean break between the "C" and the "+".

Korpela is the expert on this kind of thing.

If you want to prevent wrapping, because after all C++ is a special use
of "+", use white-space: nowrap.

Opera 9.2 didn't wrap C++. For those who use Opera 9.5 there is a test
case at http://www.highscore.de/browsertest/cpp.html (try different
window sizes until Opera 9.5 wraps C++).

It's possible they've been improving their support for world languages
and so sharpened up the Unicode-conformance of their line-breaking
method.

Sep 13 '08 #3

Jukka K. Korpela

Ben C wrote:

On 2008-09-13, Boris <bo****@web.de> wrote:
>Can anyone tell me if Opera 9.5 is behaving correctly when wrapping
the word C++, eg:

C+
+

The specification that defines all this is Unicode Standard Annex #14.

Not really.

Browsers don't have to follow that specification to claim they support
HTML and/or CSS,

Thus, UAX #14 does _not_ define whether the behavior is correct or not.

HTML (or CSS) specifications do not require conformance to the Unicode
Standard. (They define things in terms of it, or rather its partial
equivalent ISO 10646, but that's a different issue.) Moreover, UAX #14,
though part of the standard, is not normative except for a few parts, so
even if Opera claimed Unicode conformance, it could wrap C++ as it likes,
formally speaking.

but it's the easiest way to support a large number of
world languages.

I disagree; see http://www.cs.tut.fi/~jkorpela/unicode/linebr.html for some
arguments. UAX #14 is quite a mess and basically tries to deal with
_general_ principles of line breaking. Yet its rules are often very coarse,
either preventing completely acceptable line breaks or (more often) allowing
foolish line breaks. I would say that it's not very useful except in
exceptional situations where you _must_ break a string somewhere and have no
better guidelines. Unfortunately, web browsers have started implementing
parts of UAX #14 to an increasing extent (though still not very much and
never really consistently). The old principle of treating only whitespace as
an allowable break point generally works better, though it naturally fails for
languages that don't use whitespace between words - but such problems should
be solved in a different way.

The long and short of it is that "+" has the "Line breaking class" of
PR or "Prefix" so it's treated a bit like a currency symbol.

Yes. But the meaning of PR is described vaguely in UAX #14, and the prose
part contradicts the formal parts.

Now, I think that means that "by default" you can't break "C+" or
"+C", based on the definition of PR.

Not sure what they mean by "by default". I haven't read the whole
spec.

I have read the whole spec, and I'm not sure what they mean "by default".

But if you look at the pair table in section 7.3, that says you can
break between "C+" (and "++") but not between "+C", which is just what
Opera is doing. (The class of "C" is "AL"-- "alphabetic" or
something). By 'break between "C+"' I mean break between the "C" and
the "+".

The formal rules before the pair table imply that a "direct break" is
allowed between AL and PR as well as between PR and PR, so in "C++", a break
is permitted anywhere (between C and + as well as between + and +), and
Opera works that way. "Direct break" means that a break is allowed even if
no space intervenes.

Actually, it seems to me that the pair table, as well as the formal rules,
also permits direct break between PR and AL. They prevent a break between PR
and NU (= number), though, so +1 is unbreakable whereas +A is breakable.
Opera 9.5 does not break +A, but who knows whether some next version will
follow UAX #14 to more madness?

If you want to prevent wrapping, because after all C++ is a special
use of "+", use white-space: nowrap.

That's one possibility and works most of the time, but do you really want to
let such things depend on _styling_? I don't think it is a matter of
optional presentational features whether you speak of "C++" or of "C+ +" or
"C ++".

For reasons explained at
http://www.cs.tut.fi/~jkorpela/html/nobr.html
I think it is preferable to use <nobr>C++</nobr>. When standards are wrong,
don't let them prevent you from doing things the best possible way.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Sep 13 '08 #4

Ben C

On 2008-09-13, Jukka K. Korpela <jk******@cs.tut.fi> wrote:

Ben C wrote:

>On 2008-09-13, Boris <bo****@web.de> wrote:
>>Can anyone tell me if Opera 9.5 is behaving correctly when wrapping
the word C++, eg:

C+
+

The specification that defines all this is Unicode Standard Annex #14.

Not really.

OK, but I don't know of another specification for it, and I suspect it
may be the one Opera are actually using.

>Browsers don't have to follow that specification to claim they support
HTML and/or CSS,

Thus, UAX #14 does _not_ define whether the behavior is correct or not.

Yes.

HTML (or CSS) specifications do not require conformance to the Unicode
Standard. (They define things in terms of it, or rather its partial
equivalent ISO 10646, but that's a different issue.) Moreover, UAX #14,
though part of the standard, is not normative except for a few parts, so
even if Opera claimed Unicode conformance, it could wrap C++ as it likes,
formally speaking.

Yes, I did try to say that.

>but it's the easiest way to support a large number of
world languages.

I disagree; see http://www.cs.tut.fi/~jkorpela/unicode/linebr.html for some
arguments.

You make some good points there. But still, implementing line-breaking
for every language you want to support without a specification is quite
a daunting prospect.

Even Japanese and Chinese aren't that easy-- there are various
bracket and quote characters to watch out for.

[...]

The old principle of treating only whitespace as allowable break point
generally works better, though it naturally fails for languages that
don't use whitespace between words - but such problems should be
solved in a different way.

How?

>The long and short of it is that "+" has the "Line breaking class" of
PR or "Prefix" so it's treated a bit like a currency symbol.

Yes. But the meaning of PR is described vaguely in UAX #14, and the prose
part contradicts the formal parts.

>Now, I think that means that "by default" you can't break "C+" or
"+C", based on the definition of PR.

Not sure what they mean by "by default". I haven't read the whole
spec.

I have read the whole spec, and I'm not sure what they mean "by default".

I also note from your document that "PR" is one of the "informative"
classes.

>But if you look at the pair table in section 7.3, that says you can
break between "C+" (and "++") but not between "+C", which is just what
Opera is doing. (The class of "C" is "AL"-- "alphabetic" or
something). By 'break between "C+"' I mean break between the "C" and
the "+".

The formal rules before the pair table imply that a "direct break" is
allowed between AL and PR as well as between PR and PR, so in "C++", a break
is permitted anywhere (between C and + as well as between + and +), and
Opera works that way. "Direct break" means that a break is allowed even if
no space intervenes.

I think that's the same thing the table is saying.

Actually, it seems to me that the pair table, as well as the formal rules,
also permits direct break between PR and AL.

There's definitely a "PR x AL" in LB24. Perhaps you're looking at
LB25...

They prevent a break between PR
and NU (= number), though, so +1 is unbreakable whereas +A is breakable.
Opera 9.5 does not break +A, but who knows whether some next version will
follow UAX #14 to more madness?

I still maintain +A is unbreakable according to LB24.

I wouldn't be surprised if Opera followed UAX #14 pretty strictly
already. They don't strike me as the types to do things by halves.

>If you want to prevent wrapping, because after all C++ is a special
use of "+", use white-space: nowrap.

That's one possibility and works most of the time, but do you really want to
let such things depend on _styling_? I don't think it is a matter of
optional presentational features whether you speak of "C++" or of "C+ +" or
"C ++".

I don't have a strong view on that.

Sep 13 '08 #5

Jukka K. Korpela

Ben C wrote:

But still, implementing line-breaking
for every language you want to support without a specification is
quite a daunting prospect.

UAX #14 does _not_ define line breaking rules for all languages, or for
_any_ language. It specifies some _general_ rules, which largely revolve
around special characters.

If you wanted to have line breaking by the rules of English, Finnish, and
Russian, for example, your main concern should be hyphenation (which is
rather different in nature in those languages). You would need to deal with
some special issues with special characters (e.g. apostrophe and colon)
which may appear in words. The rest would be basically wrapping at
whitespace. Anything you add to that is probably external to all of those
languages. If the text mentions "C++" or the C++ expression "i++", it's to
be handled differently from rules for English, Finnish, and Russian.
Generally, you should treat it as indivisible. And if you need to line wrap
C++ code, for example, special rules are needed, rules specific to the C++
"language".

>The old principle of treating only whitespace as allowable break
point generally works better, though it naturally fails for languages
that
don't use whitespace between words - but such problems should be
solved in a different way.

How?

I'm sure experts on different languages can present good answers to such
questions. After all, languages like Chinese were written and printed long
before Unicode was invented. Part of the rules might be formulated as rules
for line breaking behavior of characters, but they would not take us very
far. General character-level rules would work when some characters are only
used in specific languages that e.g. always allow a break after those
characters. But the rules can be more complicated and much above the
character level.

>Actually, it seems to me that the pair table, as well as the formal
rules, also permits direct break between PR and AL.

There's definitely a "PR x AL" in LB24. Perhaps you're looking at
LB25...

You're right. I somehow managed to lose track when looking at the pair table
_and_ to miss LB24 when quickly searching for all occurrences of "PR" in the
rules.

I still maintain +A is unbreakable according to LB24.

Right.

I wouldn't be surprised if Opera followed UAX #14 pretty strictly
already. They don't strike me as the types to do things by halves.

I'm afraid you might be right. Opera seems to have started to fail to wrap
in a context like
"foo" (bar)
since by UAX #14, a break is not allowed between the ASCII apostrophe and an
opening parenthesis. Opera also follows UAX #14 and the example set by IE in
wrapping between a letter and an opening parenthesis even when no space
intervenes, as in
foo(bar)

Excuse me while I jump on the walls and talk incomprehensibly to myself.

Far from dealing with line wrapping automatically, this effectively forces
authors to use the "nonstandard" tags <nobr> and <wbr> liberally whenever
they use anything but letters, digits, and basic punctuation in texts.

>I don't think it is a
matter of optional presentational features whether you speak of
"C++" or of "C+ +" or "C ++".

I don't have a strong view on that.

I do, because whitespace is significant there; "C++" is a single name,
whereas "C ++" is another name (of another language) followed by a space and
an operator.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Sep 13 '08 #6
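
As a rough illustration of where posts #5 and #6 end up, the pair rules discussed there can be written down as a tiny lookup. This is only a sketch under assumptions: the function names and the table are made up here, only the class pairs mentioned in the thread (AL, PR, NU) are covered, and it reflects the reading of LB24/LB25 agreed on above rather than a full or current UAX #14 implementation.

```python
# Toy model of the class pairs discussed in this thread; not a UAX #14 implementation.
# True = direct break allowed between the two classes, False = break prohibited.
PAIR_BREAK_ALLOWED = {
    ("AL", "PR"): True,   # C|+ : allowed, as both posters read the pair table
    ("PR", "PR"): True,   # +|+ : allowed
    ("PR", "AL"): False,  # +A  : prohibited by "PR x AL" in LB24
    ("PR", "NU"): False,  # +1  : prohibited by "PR x NU"
    ("AL", "AL"): False,  # no break inside a run of letters
}


def line_break_class(ch):
    """Map a character to one of the few classes this sketch knows about."""
    if ch == "+":
        return "PR"
    if ch.isdigit():
        return "NU"
    if ch.isalpha():
        return "AL"
    raise ValueError(f"character {ch!r} is outside this sketch")


def break_opportunities(text):
    """Return each index i where a break is allowed just before text[i]."""
    classes = [line_break_class(ch) for ch in text]
    allowed = []
    for i in range(1, len(classes)):
        pair = (classes[i - 1], classes[i])
        if pair not in PAIR_BREAK_ALLOWED:
            # Refuse to guess for pairs the thread never discussed.
            raise ValueError(f"pair {pair} is not modelled here")
        if PAIR_BREAK_ALLOWED[pair]:
            allowed.append(i)
    return allowed
```

With this table, break_opportunities("C++") returns [1, 2] (a break after "C" and between the two plus signs), while "+A" and "+1" both give an empty list, which matches what Opera 9.5 was reported to do.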

Ben C

On 2008-09-13, Jukka K. Korpela <jk******@cs.tut.fi> wrote:

Ben C wrote:

>But still, implementing line-breaking
for every language you want to support without a specification is
quite a daunting prospect.

UAX #14 does _not_ define line breaking rules for all languages, or for
_any_ language. It specifies some _general_ rules, which largely revolve
around special characters.

But LineBreak.txt gives you a breaking class for most of the characters.
That is effectively the language-specific information.

It's probably script-specific rather than language-specific. But it gets
quite tricky: Korean is sometimes broken at spaces even though it uses
basically ideographs.

If you wanted to have line breaking by the rules of English, Finnish, and
Russian, for example, your main concern should be hyphenation (which is
rather different in nature in those languages).

I didn't know that. But browsers don't do hyphenation anyway.

[...]

>>The old principle of treating only whitespace as allowable break
point generally works better, though it naturally fails for languages
that don't use whitespace between words - but such problems should
be solved in a different way.

How?

I'm sure experts on different languages can present good answers to such
questions.

My original point was that if you just implement UAX #14 you don't need
any experts on all the different languages. I take your word for it that
the results might not be as good.

After all, languages like Chinese were written and printed long before
Unicode was invented. Part of the rules might be formulated as rules
for line breaking behavior of characters, but they would not take us
very far. General character-level rules would work when some
characters are only used in specific languages that e.g. always allow
a break after those characters. But the rules can be more complicated
and much above the character level.

But this does make life awfully difficult for people trying to make
browsers (and word processors, etc.).

[...]

>I wouldn't be surprised if Opera followed UAX #14 pretty strictly
already. They don't strike me as the types to do things by halves.

I'm afraid you might be right. Opera seems to have started to fail to
wrap in a context like "foo" (bar) since by UAX #14, a break is not
allowed between the ASCII apostrophe

Is '"' an ASCII apostrophe? Even so, Opera does refuse to break "foo"
(bar). And " has the same breaking class as '.

and an opening parenthesis. Opera also follows UAX #14 and the example
set by IE in wrapping between a letter and an opening parenthesis even
when no space intervenes, as in foo(bar)

Excuse me while I jump on the walls and talk incomprehensibly to myself.

Knock yourself out. Refusing to break "foo" (bar) but breaking foo(bar)
is egregious.

Sep 13 '08 #7

Dr J R Stockton

In comp.infosystems.www.authoring.html message <op.uhfblsh59dsao3@burk>,
Sat, 13 Sep 2008 16:23:42, Boris <bo****@web.de> posted:

>Can anyone tell me if Opera 9.5 is behaving correctly when wrapping the
word C++, eg:

With sensible rules, a three-character "word" would never be broken.
And I don't much like the idea of breaking to give a two-character
fragment, unless it is positively determined that the word breaks well
there.

--
(c) John Stockton, nr London UK. ??*@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Check boilerplate spelling -- error is a public sign of incompetence.
Never fully trust an article from a poster who gives no full real name.

Sep 13 '08 #8

Hendrik Maryns

On 14-09-08 00:38, Ben C wrote:

Is '"' an ASCII apostrophe? Even so, Opera does refuse to break "foo"
(bar). And " has the same breaking class as '.

But Thunderbird seems to have no problem with it :-) Or was that slrn?

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
www.lieverleven.be
http://catb.org/~esr/faqs/smart-questions.html

Sep 14 '08 #9

Ben C

On 2008-09-14, Hendrik Maryns <ia*******@sneakemail.com> wrote:

On 14-09-08 00:38, Ben C wrote:
>Is '"' an ASCII apostrophe? Even so, Opera does refuse to break "foo"
(bar). And " has the same breaking class as '.

But Thunderbird seems to have no problem with it :-) Or was that slrn?

Vim, which just breaks at spaces.

Sep 14 '08 #10

Boris

On Sat, 13 Sep 2008 22:42:30 +0200, Jukka K. Korpela <jk******@cs.tut.fi>
wrote:

[...]For reasons explained at
http://www.cs.tut.fi/~jkorpela/html/nobr.html
I think it is preferable to use <nobr>C++</nobr>. When standards are
wrong, don't let them prevent you from doing things the best possible
way.

I had asked in the newsgroup opera.page-display before where someone
recommended using &#8288; (see
http://groups.google.com/group/opera...0f9dfc99a1642).
I haven't checked yet though what other browsers are going to do when they
see something like C&#8288;+&#8288;+ - not sure if this opens another
can of worms?

Boris

Sep 14 '08 #11

Jukka K. Korpela

Ben C wrote:

But LineBreak.txt gives you a breaking class for most of the
characters. That is effectively the language-specific information.

It's by definition language-independent: it assigns properties to
characters, no matter which (if any) language they are used in. Admittedly,
_some_ characters are used in one language only. But that's coincidental
and may change without notice.

It's probably script-specific rather than language-specific.

Not really. It's character-specific. The Unicode Standard assigns a script
property to each character, but many characters are used across scripts.

But browsers don't do hyphenation anyway.

That's a big part of the problem. When you don't hyphenate, you often get
horrible layout for texts containing long words. Little does it help to
break poor little "C++" then, and it's just incorrect.

My original point was that if you just implement UAX #14 you don't
need any experts on all the different languages.

I'm afraid that's a common misconception, and UAX #14 doesn't try very hard
to prevent it.

>[...] But the rules can be more complicated
and much above the character level.

But this does make life awfully difficult for people trying to make
browsers (and word processors, etc.).

Actually, word processors often handle it decently, for the languages
they support. Web browsers are much more primitive in handling texts, but
there is no reason why they could not have language-dependent line breaking.

>I'm afraid you might be right. Opera seems to have started to fail to
wrap in a context like "foo" (bar) since by UAX #14, a break is not
allowed between the ASCII apostrophe

Is '"' an ASCII apostrophe?

Sorry, my mistake; it's the ASCII quotation mark, treated as "neutral"
quotation mark in UAX #14. But as you say, the ASCII apostrophe (') has a
similar issue. So do the "left" and "right" quotation marks, U+201C and
U+201D, i.e. the normal English quotes, since they are in fact used in
different ways in different languages.

The original idea in HTML was that any space was a permitted line break
point and no other line breaks would appear, except possibly after a hyphen,
and otherwise no line breaks are generated in formatting (except of course
when explicitly specified in markup). This is coarse and doesn't work at all
for many languages. But at least it does not arbitrarily break strings and
it does not arbitrarily prevent line breaks e.g. between a quoted string and
a parenthetic string when a space intervenes.

--
Yucca, http://www.cs.tut.fi/~jkorpela/

Sep 14 '08 #12
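
The "original idea" described at the end of post #12 (break only at spaces, possibly also after a hyphen) can be approximated with Python's standard textwrap module. This is a sketch under assumptions: it uses a fixed width in characters rather than a browser's pixel-based layout, and wrap_at_whitespace is a name invented here.

```python
import textwrap


def wrap_at_whitespace(text, width, after_hyphen=False):
    """Wrap text so that lines are only ever broken at whitespace.

    Tokens longer than the width are left whole and simply overflow.
    Setting after_hyphen=True also lets textwrap break after hyphens
    in compound words, the one exception mentioned in the post.
    """
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,      # never split a token to make it fit
        break_on_hyphens=after_hyphen,
    )


if __name__ == "__main__":
    sample = 'Written in C++ as "foo" (bar) or foo(bar)'
    for line in wrap_at_whitespace(sample, 12):
        print(line)
```

Under this scheme "C++" and foo(bar) are never split, and "foo" (bar) can still be separated at the space, which are exactly the two properties the post says the old behavior had.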

Ben C

On 2008-09-14, Jukka K. Korpela <jk******@cs.tut.fi> wrote:

Ben C wrote:

>But LineBreak.txt gives you a breaking class for most of the
characters. That is effectively the language-specific information.

It's by definition language-independent: it assigns properties to
characters, no matter which (if any) language they are used in. Admittedly,
_some_ characters are used in one language only. But that's coincidental
and may change without notice.

>It's probably script-specific rather than language-specific.

Not really. It's character-specific. The Unicode Standard assigns a script
property to each character, but many characters are used across scripts.

Where I suppose line-breaking conventions may be different (and also
across languages).

[...]

>My original point was that if you just implement UAX #14 you don't
need any experts on all the different languages.

I'm afraid that's a common misconception, and UAX #14 doesn't try very
hard to prevent it.

I am now starting to feel a bit let-down by UAX #14. Life is not so
simple after all.

Sep 14 '08 #13

Ben C

On 2008-09-14, Boris <bo****@web.de> wrote:

On Sat, 13 Sep 2008 22:42:30 +0200, Jukka K. Korpela <jk******@cs.tut.fi>
wrote:

>[...]For reasons explained at
http://www.cs.tut.fi/~jkorpela/html/nobr.html
I think it is preferable to use <nobr>C++</nobr>. When standards are
wrong, don't let them prevent you from doing things the best possible
way.

I had asked in the newsgroup opera.page-display before where someone
recommended using &#8288; (see
http://groups.google.com/group/opera...0f9dfc99a1642).
I haven't checked yet though what other browsers are going to do when they
see something like C&#8288;+&#8288;+ - not sure if this opens another
can of worms?

It might work in Opera, but not necessarily in other browsers which
aren't so sold on the whole Unicode thing. It is mentioned in Korpela's
nobr page above.

Your choices are basically:

1. that
2. nobr
3. white-space: nowrap

None are worm-free. Choose your compromise.

Sep 14 '08 #14
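
For comparison, the three choices listed in post #14 can be generated side by side with a few helpers. This is a sketch, not a recommendation: the helper names are invented here, and it assumes the character recommended in opera.page-display is U+2060 WORD JOINER. How each browser treats the results is exactly the open question in the thread.

```python
import html

WORD_JOINER = "\u2060"  # assumption: the invisible character referred to as "that"


def with_word_joiners(term):
    """Choice 1: put a word joiner between every pair of characters."""
    return WORD_JOINER.join(term)


def with_nobr(term):
    """Choice 2: wrap the term in the nonstandard <nobr> element."""
    return "<nobr>" + html.escape(term) + "</nobr>"


def with_nowrap_span(term):
    """Choice 3: wrap the term in a span styled with white-space: nowrap."""
    return '<span style="white-space: nowrap">' + html.escape(term) + "</span>"


if __name__ == "__main__":
    for make in (with_word_joiners, with_nobr, with_nowrap_span):
        # repr() makes the otherwise invisible word joiners visible
        print(make.__name__, repr(make("C++")))
```

The first variant changes the text itself, the second the markup, and the third the styling, which is the distinction argued over in posts #4 to #6.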
