Soft-hyphens or breakable points in a string

Hi,

My page has a table with many columns such that the right-side of the
table gets chopped off when printed. I specify a table width of 100%,
but otherwise no cell dimensions are specified. The culprits are 2 wide
columns which contain e-mail addresses.

I can get the page to fit entirely on the printer output if the browser
would break the e-mail address string at the '@' symbol. What I've done
for now is replaced the '@' in all e-mail addresses with
'[space]@[space]' which now wraps nicely and my table fits. However,
this is somewhat undesirable because for e-mail addresses that are
already short, it shows the address with spaces around the '@' symbol.

Is there an HTML trick I can use that tells the browser that it is
permissible, but only if needed, to break the string at the '@' or dot
(.), much like the soft-hyphen does in Word?

Mark
Sep 12 '05 #1
Mark wrote:
My page has a table with many columns such that the right-side of the
table gets chopped off when printed. I specify a table width of 100%,
but otherwise no cell dimensions are specified. The culprits are 2 wide
columns which contain e-mail addresses.
I would primarily consider the possibilities for reducing the amount of
information per row. In the absence of a URL demonstrating the actual
problem, I cannot make any more specific suggestion.

Secondarily, I would consider whether it is possible to reduce the width
requirements of _other_ columns than those containing E-mail addresses.
The reason is that breaking an E-mail address may cause confusion and
even give a wrong idea of what the address is.
I can get the page to fit entirely on the printer output if the browser
would break the e-mail address string at the '@' symbol.
The Unicode line breaking rules define "@" as belonging to line breaking
class AL, i.e. as comparable to alphabetic characters. Although those
rules are generally highly debatable, there is wisdom behind this
particular assignment. The at sign is typically used in contexts like
E-mail addresses, URLs, and programming language constructs where a line
break between "@" and an adjacent letter would not be appropriate. An
E-mail address is basically an unbreakable string that must not contain
whitespace (except in a comment).

Thus, I would avoid breaking an E-mail address at almost any cost.
What I've done
for now is replaced the '@' in all e-mail addresses with
'[space]@[space]' which now wraps nicely and my table fits.
That's even worse, since it introduces whitespace on both sides of "@". A
naive user might even think that the space is part of the address.
(After all, few people in the world know the _exact_ syntax of an E-mail
address, i.e. the variations and complications and special cases that
are allowed.)
Is there an HTML trick I can use that tells the browser that it is
permissible, but only if needed, to break the string at the '@' or dot
(.), much like the soft-hyphen does in Word?


There is the <wbr> trick, e.g.
jkorpela@<wbr>cs.<wbr>tut.<wbr>fi
It's genuinely a trick: it works in most browsing situations but does
not conform to any standard. There's also the standard-conforming way of
using a zero width no-break space, which works very rarely and causes
quite some trouble when it doesn't. See
http://www.cs.tut.fi/~jkorpela/html/nobr.html#suggest
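
For anyone generating such table cells from a script, the markup above can be produced mechanically. The following is only a minimal sketch in Python: the function name add_wbr is made up for illustration, and it simply reproduces the break points of the example above (after "@" and after each dot).

```python
import html
import re


def add_wbr(address: str) -> str:
    """Return HTML for an address with <wbr> after '@' and after each '.'.

    Hypothetical helper, not part of any library.
    """
    # Escape the address first so it is safe to place inside HTML text.
    escaped = html.escape(address)
    # Add a break opportunity after every "@" or "." that is followed
    # by at least one more character (so never at the very end).
    return re.sub(r"([@.])(?=.)", r"\1<wbr>", escaped)


print(add_wbr("jkorpela@cs.tut.fi"))
# prints: jkorpela@<wbr>cs.<wbr>tut.<wbr>fi
```

Since <wbr> is not in any HTML standard at the time of writing, this inherits all the caveats mentioned above.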

According to the reputable "Chicago Manual of Style" (clause 7.44), if a
URL or E-mail address needs to be broken, the break should appear
"between elements, after a colon, a slash, a double slash, or the symbol
@ but before a period or any other punctuation or symbols". I think
there's a wisdom in not breaking after but before a period: a period at
the end of line will easily be seen as terminating the address, whereas
a period at the start of a line suggests that it is a continuation of
the preceding line.
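
The placement rule quoted above (break after "@", but before a period) could be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the function name chicago_breaks and its marker parameter are hypothetical, only "@" and "." are handled (colons and slashes in URLs are left out), and the input is assumed to need no HTML escaping.

```python
def chicago_breaks(address: str, marker: str = "<wbr>") -> str:
    """Allow a line break after '@' and before each '.', never after a period.

    Hypothetical helper; `marker` is whatever break opportunity is in use.
    """

    def before_dots(part: str) -> str:
        # Put the marker in front of every period.
        return part.replace(".", marker + ".")

    local, at, domain = address.partition("@")
    if not at:
        # No "@" present: only the period rule applies.
        return before_dots(local)
    return before_dots(local) + "@" + marker + before_dots(domain)


print(chicago_breaks("jkorpela@cs.tut.fi"))
# prints: jkorpela@<wbr>cs<wbr>.tut<wbr>.fi
```

With this placement a wrapped line starts with ".tut" rather than ending in "cs.", which is the continuation cue described above.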

P.S. The soft hyphen does _not_ work the way you think in MS Word. If
you enter a soft hyphen character, MS Word treats it as yet another
graphic character and displays it on all occasions. You can use an MS
Word command to add "soft hyphen", but what really happens is that a
normal hyphen-minus "-" is inserted, together with invisible extra
information that forbids a line break after it.
Sep 12 '05 #2
On Mon, 12 Sep 2005, Jukka K. Korpela wrote:
P.S. The soft hyphen does _not_ work the way you think in MS Word. If
you enter a soft hyphen character, MS Word treats it as yet another
graphic character and displays it on all occasions. You can use an MS
Word command to add "soft hyphen", but what really happens is that a
normal hyphen-minus "-" is inserted, together with invisible extra
information that forbids a line break after it.


Your statement is meaningless when you don't define which character
the "soft hyphen" is supposed to be. In older versions of MS Word for
Macintosh as well as for Windows, character 31 = 0x1F = 037 was used
for the soft hyphen. I don't know if this is still true for Word 2003.

You can check this by inserting char \037 into a text file or
the expression \'1f into an RTF file.
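
The three notations above name the same code point, and the RTF form is just its two-digit hex escape. A small Python sketch of that arithmetic follows; the helper name rtf_hex_escape is hypothetical, and nothing here says how any particular Word version treats the character.

```python
# Character 31 written in decimal, hexadecimal and octal is one value.
assert 31 == 0x1F == 0o37

WORD_SOFT_HYPHEN = chr(31)  # the same character as the string escape "\037"


def rtf_hex_escape(ch: str) -> str:
    """Return the RTF \\'hh escape for a single 8-bit character (hypothetical helper)."""
    return "\\'{:02x}".format(ord(ch))


# A test string with the character between two syllables.
sample = "hyphen" + WORD_SOFT_HYPHEN + "ation"

print(rtf_hex_escape(WORD_SOFT_HYPHEN))
# prints: \'1f
```

Writing the sample string to a plain text file and opening it in Word is then the check described above.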

Sep 12 '05 #3
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Mon, 12 Sep 2005, Jukka K. Korpela wrote:
P.S. The soft hyphen does _not_ work the way you think in MS Word. If
you enter a soft hyphen character, MS Word treats it as yet another
graphic character and displays it on all occasions. You can use an MS
Word command to add "soft hyphen", but what really happens is that a
normal hyphen-minus "-" is inserted, together with invisible extra
information that forbids a line break after it.
Your statement is meaningless when you don't define which character
the "soft hyphen" is supposed to be.


In the absence of a reference to any other standard or specification,
I think it is fair to postulate the understanding that "soft hyphen" means
the character named that way in ISO 10646, Unicode, and standards in the
ISO 8859 family.
In older versions of MS Word for
Macintosh as well as for Windows, character 31 = 0x1F = 037 was used
for the soft hyphen. I don't know if this is still true for Word 2003.

You can check this by inserting char \037 into a text file or
the expression \'1f into an RTF file.


That might be true - I haven't studied what Word really inserts when you
give the command for inserting an "Optional Hyphen" (that seems to be what
MS Word calls it in the English version). But if I save a document in RTF
format from MS Word, "Optional Hyphen" gets turned into "\-".

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Sep 12 '05 #4
On Mon, 12 Sep 2005, Jukka K. Korpela wrote:
In older versions of MS Word for
Macintosh as well as for Windows, character 31 = 0x1F = 037 was used
for the soft hyphen. I don't know if this is still true for Word 2003.


That might be true - I haven't studied what Word really inserts when you
give the command for inserting an "Optional Hyphen" (that seems to be what
MS Word calls it in the English version).


Here is a text file that contains char 31 = 0x1F several times:
http://www.unics.uni-hannover.de/nht...oft-hyphen.txt
Word 97 recognizes this character as soft hyphen.

Sep 14 '05 #5

On Mon, 12 Sep 2005, Jukka K. Korpela wrote:
Mark wrote:
My page has a table with many columns such that the right-side of the
table gets chopped off when printed. I specify a table width of 100%,
but otherwise no cell dimensions are specified. The culprits are 2 wide
columns which contain e-mail addresses.
I would primarily consider the possibilities for reducing the amount of
information per row. In the absence of a URL demonstrating the actual
problem, I cannot make any more specific suggestion.

Secondarily, I would consider whether it is possible to reduce the width
requirements of _other_ columns than those containing E-mail addresses.
The reason is that breaking an E-mail address may cause confusion and
even give a wrong idea of what the address is.
I can get the page to fit entirely on the printer output if the browser
would break the e-mail address string at the '@' symbol.


The Unicode line breaking rules define "@" as belonging to line breaking
class AL, i.e. as comparable to alphabetic characters. Although those
rules are generally highly debatable, there is wisdom behind this
particular assignment. The at sign is typically used in contexts like
E-mail addresses, URLs, and programming language constructs where a line
break between "@" and an adjacent letter would not be appropriate. An
E-mail address is basically an unbreakable string that must not contain
whitespace (except in a comment).


Check out RFC 822.

[blockquote]

3.1.4. STRUCTURED FIELD BODIES

To aid in the creation and reading of structured fields, the
free insertion of linear-white-space (which permits folding
by inclusion of CRLFs) is allowed between lexical tokens.
Rather than obscuring the syntax specifications for these
structured fields with explicit syntax for this linear-white-
space, the existence of another "lexical" analyzer is assumed.
This analyzer does not apply for unstructured field bodies
that are simply strings of text, as described above. The
analyzer provides an interpretation of the unfolded text
composing the body of the field as a sequence of lexical sym-
bols.

These symbols are:

- individual special characters
- quoted-strings
- domain-literals
- comments
- atoms

The first four of these symbols are self-delimiting. Atoms
are not; they are delimited by the self-delimiting symbols and
by linear-white-space. For the purposes of regenerating
sequences of atoms and quoted-strings, exactly one SPACE is
assumed to exist, and should be used, between them. (Also, in
the "Clarifications" section on "White Space", below, note the
rules about treatment of multiple contiguous LWSP-chars.)

So, for example, the folded body of an address field

":sysmail"@ Some-Group. Some-Org,
Muhammed.(I am the greatest) Ali @(the)Vegas.WBA

is analyzed into the following lexical symbols and types:

:sysmail              quoted string
@                     special
Some-Group            atom
.                     special
Some-Org              atom
,                     special
Muhammed              atom
.                     special
(I am the greatest)   comment
Ali                   atom
@                     atom
(the)                 comment
Vegas                 atom
.                     special
WBA                   atom

The canonical representations for the data in these addresses
are the following strings:

":sysmail"@Some-Group.Some-Org

and

Mu**********@Vegas.WBA

[/blockquote]

Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
                            ^   ^
                            |   |
That example appears to have two spaces in it that are not within
parentheses.

I have received more than one request for anti-virus help sent to the
"mailto:" address on my CIH virus page at
http://www.chebucto.ns.ca/~af380/CIH.html

HREF="mailto:%20af380@( Norman )chebucto( De Forest ).ns( CIH.html ).ca"

With spaces *outside* the parentheses the address still works fine with
Lynx on my ISP's system but some email software on some systems
(**cough**cough**Microsoft**cough**) fails to strip out the spaces outside
the comments when doing a DNS lookup on the hostname and/or when passing
the address in the MAIL TO: command (violating the "when passing such
structured information to other systems, such as mail protocol services"
clause quoted below) and thus fails to send the message.

The quoted passage above, is immediately followed by:

[blockquote]

Note: For purposes of display, and when passing such struc-
tured information to other systems, such as mail proto-
col services, there must be NO linear-white-space
between <word>s that are separated by period (".") or
at-sign ("@") and exactly one SPACE between all other
<word>s. Also, headers should be in a folded form.

[/blockquote]

The "For purposes of display" would appear to rule out the original
poster's use of space but the RFC fails to say what should happen should
an address be longer than the character width of a display (only that
any line-break must be followed by a whitespace character (space or tab)).
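
To make the "canonical representation" idea concrete, here is a rough Python sketch that reduces the folded examples from the RFC to the form without comments or stray whitespace. It is a simplification under stated assumptions, not an RFC 822 parser: the function name canonical_address is made up, nested comments are not handled, and quoted strings are assumed not to contain parentheses or spaced-out dots.

```python
import re


def canonical_address(folded: str) -> str:
    """Strip (comments) and whitespace around '.' and '@' (hypothetical helper)."""
    # Drop parenthesized comments; this pattern ignores nested parentheses.
    no_comments = re.sub(r"\([^()]*\)", "", folded)
    # No linear white space is allowed next to "." or "@".
    tight = re.sub(r"\s*([.@])\s*", r"\1", no_comments)
    # Any other run of white space collapses to exactly one SPACE.
    return re.sub(r"\s+", " ", tight).strip()


print(canonical_address('":sysmail"@ Some-Group. Some-Org'))
print(canonical_address("Muhammed.(I am the greatest) Ali @(the)Vegas.WBA"))
```

A mail program that applied something like this before the DNS lookup would not trip over the spaces outside the comments.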

Thus, I would avoid breaking an E-mail address at almost any cost.
What I've done
for now is replaced the '@' in all e-mail addresses with
'[space]@[space]' which now wraps nicely and my table fits.
That's even worse, since it introduces whitespace on both sides of "@". A
naive user might even think that the space is part of the address.
(After all, few people in the world know the _exact_ syntax of an E-mail
address, i.e. the variations and complications and special cases that
are allowed.)
Is there an HTML trick I can use that tells the browser that it is
permissible, but only if needed, to break the string at the '@' or dot
(.), much like the soft-hyphen does in Word?


There is the <wbr> trick, e.g.
jkorpela@<wbr>cs.<wbr>tut.<wbr>fi
It's genuinely a trick: it works in most browsing situations but does
not conform to any standard. There's also the standard-conforming way of
using a zero width no-break space, which works very rarely and causes
quite some trouble when it doesn't. See
http://www.cs.tut.fi/~jkorpela/html/nobr.html#suggest


Don't you mean the "zero width non-joiner" there (U+200C, &#8204;) (as
opposed to the zero width joiner, U+200D, &#8205;)?

I think that the use of the zero width non-joiner should be the preferred
way of doing things and that "works very rarely and causes quite some
trouble when it doesn't" should be replaced by "however it may be
necessary to get[1] software authors to fix their buggy treatment of
this character which causes quite some trouble when it doesn't work".

If Lynx can handle it properly (as well as the soft hyphen), why can't IE
and Firefox do the same? (I haven't tried it with Opera yet.)

According to the reputable "Chicago Manual of Style" (clause 7.44), if a
URL or E-mail address needs to be broken, the break should appear
"between elements, after a colon, a slash, a double slash, or the symbol
@ but before a period or any other punctuation or symbols". I think
there's a wisdom in not breaking after but before a period: a period at
the end of line will easily be seen as terminating the address, whereas
a period at the start of a line suggests that it is a continuation of
the preceding line.

P.S. The soft hyphen does _not_ work the way you think in MS Word. If
you enter a soft hyphen character, MS Word treats it as yet another
graphic character and displays it on all occasions. You can use an MS
Word command to add "soft hyphen", but what really happens is that a
normal hyphen-minus "-" is inserted, together with invisible extra
information that forbids a line break after it.


[1] The following change is optional depending on the reader's
preferences (may wrap on your display but enter as one long line):
s/get software authors to/beat software authors about the head and shoulders until they/
--
``Why don't you find a more appropiate newsgroup to post this tripe into?
This is a meeting place for a totally differnt kind of "vision impairment".
Catch my drift?'' -- "jim" in alt.disability.blind.social regarding an
off-topic religious/political post, March 28, 2005

Sep 26 '05 #6
"Norman L. DeForest" <af***@chebucto.ns.ca> wrote:
Check out RFC 822.
Why? It has been obsoleted by the IETF.
[blockquote]
Pointless. In future, please cite (relevant) document by clause instead of
<bulkquote>.
That example appears to have two spaces in it that are not within
parentheses.
So? It's still a bad idea to break an E-mail address. Who could guess that
a line break is to be replaced by a space on some occasions and by nothing
on other occasions?
HREF="mailto:%20af380@( Norman )chebucto( De Forest ).ns( CIH.html
).ca"
Irrespective of E-mail address syntax, that violates generic URL syntax,
which forbids unencoded spaces.
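
As an illustration of the encoding point only, a script could percent-encode such an address before putting it into an href. This is a sketch using Python's standard urllib.parse.quote; the function name mailto_href and the example.org address are made up, and encoding the parentheses as well is simply what quote() does by default, not something the URL syntax demands.

```python
from urllib.parse import quote


def mailto_href(address: str) -> str:
    """Build a mailto: URL with spaces and other unsafe characters percent-encoded.

    Hypothetical helper; '@' is kept literal, everything else outside
    letters, digits and "_.-~" is encoded.
    """
    return "mailto:" + quote(address, safe="@")


print(mailto_href("user@( a comment )example.org"))
# prints: mailto:user@%28%20a%20comment%20%29example.org
```

Whether the receiving mail program then copes with the comments is a separate question.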
There's also the standard-conforming way of
using a zero width no-break space, which works very rarely and causes
quite some trouble when it doesn't. See
http://www.cs.tut.fi/~jkorpela/html/nobr.html#suggest


Don't you mean the "zero width non-joiner" there (U+200C, &#8204;) (as
opposed to the zero width joiner, U+200D, &#8205;)?


Please trim your quotes or otherwise indicate what you comment on; the word
"there" is particularly vague. I meant zero width space instead of zero
width no-break space, of course; the cited page tells this correctly
(I usually write web pages more carefully than Usenet postings).
U+200C and U+200D don't belong here, and I don't mention them in my posting
or on my page; they are for affecting ligature behavior and similar issues.
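
Since the similarly named characters are easy to mix up, a quick way to check which is which is to ask the Unicode database for their names. A minimal Python sketch follows; the list of candidates and the describe() helper are only for illustration.

```python
import unicodedata

# Characters that tend to get confused in this kind of discussion.
CANDIDATES = ["\u00ad", "\u200b", "\u200c", "\u200d", "\ufeff"]


def describe(ch: str) -> str:
    """Return code point, Unicode name and decimal character reference."""
    return "U+{:04X} {} (&#{};)".format(ord(ch), unicodedata.name(ch), ord(ch))


for ch in CANDIDATES:
    print(describe(ch))
```

The second entry, U+200B ZERO WIDTH SPACE, is the one meant as a line break opportunity; U+200C and U+200D are the non-joiner and joiner.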
I think that the use of the zero width non-joiner should be the
preferred way of doing things
Which things? Cursive joining?
and that "works very rarely and causes
quite some trouble when it doesn't" should be replaced by "however it
may be necessary to get[1] software authors to fix their buggy
treatment of this character which causes quite some trouble when it
doesn't work".


That's just play with words.
--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Sep 27 '05 #7
