
Translating foreign text into html code - help

P: n/a
I have a paragraph of text pasted into a Word document. It's in Polish,
complete with Polish characters. They show up just fine in Word, but
the program I use for web page programming, HomeSite, won't translate
it. When I paste the text into the code, the special characters are
missing. If they would show up there I could use the Replace Special
Characters feature to change it to the proper code, but it won't even
paste into it correctly. Is there a way to get Word to do this, or
another program or web site that can do this? I've been searching the
web and posting questions and asking everyone, but with no luck so far. I
did find a program for Mac, but I'm using a PC.

Oct 25 '05 #1
23 Replies


P: n/a

gr***@kcls.org wrote:
I have a paragraph of text pasted into a word document, it's in Polish,

complete with polish characters. They show up just fine in word, but
the program I use for web page programming, HomeSite, won't translate
it.
Assuming you want to have the Polish letters "translated" to &#xxx;
notation, rather than Polish text to English text:

When I paste the text into the code, the special characters are missing. If they would show up there I could use the Replace Special
Characters feature to change it to the proper code, but it won't even
paste into it correctly.


Either get a new program for your web authoring work, or get a tool to
change all "special" characters to numeric references. UniRed is a good
choice: http://unired.sourceforge.net/. Paste your Polish text into
UniRed, and save it with the option "Unicode representation: &#DDDD;".
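
If installing another tool is not an option, the same kind of conversion can be sketched in a few lines of Python. This is only an illustration and not a description of what UniRed does internally: the function name `to_numeric_references` is made up for this example, and it relies on Python's built-in `xmlcharrefreplace` error handler, which writes every character that does not fit the target encoding as a decimal `&#DDDD;` reference.

```python
def to_numeric_references(text: str) -> str:
    """Return text with every non-ASCII character written as &#DDDD;.

    Hypothetical helper for illustration. ASCII characters pass through
    unchanged, so markup that is already in the text is left alone.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    # U+0142 (l with stroke) and U+0119 (e with ogonek) become references.
    print(to_numeric_references("Wa\u0142\u0119sa"))  # Wa&#322;&#281;sa
```

The result is plain ASCII, so it should survive pasting into an editor that cannot handle Polish letters.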

By the way, there's nothing "special" about Polish letters, except that
they're not in the Latin-1 character set. If you intend to publish more
Polish text, you should consider serving your content as utf-8, so you
can just use the characters you need, rather than cluttering your page
with numeric references. It will keep your code more legible, and is
far more intuitive. It may mean you'll have to give up on HomeSite, if
it isn't Unicode-capable.
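
As a rough sketch of the utf-8 route, the Python below builds a small page that announces utf-8 in a meta element and then encodes the whole thing as utf-8 bytes. The function name `build_utf8_page` and the page skeleton are assumptions made for this example. The server's Content-Type header should announce the same charset, which is a matter of server configuration and not shown here.

```python
from html import escape


def build_utf8_page(title: str, body_text: str) -> bytes:
    """Return a minimal HTML 4.01 page as UTF-8 encoded bytes.

    Hypothetical helper: the Polish letters are written as themselves,
    with no numeric references; only <, > and & are escaped.
    """
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
        '  "http://www.w3.org/TR/html4/strict.dtd">\n'
        '<html lang="pl">\n'
        "<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body><p>{escape(body_text)}</p></body>\n"
        "</html>\n"
    )
    return page.encode("utf-8")


if __name__ == "__main__":
    data = build_utf8_page("Test", "Wa\u0142\u0119sa")
    # Write the bytes as they are; opening the file in binary mode avoids
    # any re-encoding by the platform default.
    with open("polish.html", "wb") as handle:
        handle.write(data)
```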

Good luck,
Garmt de Vries.

Oct 26 '05 #2

P: n/a
Garmt de Vries wrote:
Either get a new program for your web authoring work, or get a tool to
change all "special" characters to numeric references.


If he is using a sufficiently new version of MS Word, he could just
select File/Save As and select a format like "Web page (filtered)" to
get an HTML version, with numeric references. The "filtered" thing or
something like that means that MS Word refrains from spitting out most
of its usual "Office XML" stuff and you get something reasonable like

<p class=MsoNormal><span lang=FI>This is Polish: Wa&#322;&#281;sa</span></p>

That's what a version of Word produced. Of course, the lang attribute it
inserts is worse than nonsense. It's partly my fault, since I was lazy
and didn't set the language in Word. If I paint the text, set its
language to English, then click on the Polish name and set its language
to Polish, and save as above, I get (here I quote a little more):

<body lang=EN-US>

<div class=Section1>

<p class=MsoNormal>This is Polish: <span lang=PL>Wa&#322;&#281;sa</span></p>

</div>

</body>

Not bad. The class attribute has of course no effect per se, and it
might even be useful at times. Some day someone might wish to use some
styling for paragraphs generated using MS Office software, and the class
name MsoNormal is in practice a rather reliable indicator.
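
To illustrate using the class name as an indicator, here is a small Python sketch that checks whether a page contains any element whose class starts with "Mso". The names `MsoClassCounter` and `looks_word_generated` are invented for this example, and the "starts with Mso" rule is an assumption generalised from the single MsoNormal class shown above.

```python
from html.parser import HTMLParser


class MsoClassCounter(HTMLParser):
    """Count start tags carrying a class name that begins with 'Mso'."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Attribute names arrive lowercased; values keep their case.
            if name == "class" and value:
                if any(part.startswith("Mso") for part in value.split()):
                    self.count += 1


def looks_word_generated(html_text: str) -> bool:
    """Heuristic only: True if at least one Mso* class is present."""
    counter = MsoClassCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count > 0


if __name__ == "__main__":
    print(looks_word_generated("<p class=MsoNormal>This is Polish</p>"))  # True
```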

Setting the language to Polish has hardly any noticeable effect at
present, but it's still the right thing to do. (I guess the most
probable situation where it is useful is when someone opens the HTML
document in MS Word or some compatible program, which recognizes the
lang markup and uses this information in its spelling or grammar checks.
Somewhat deceptively, my version of MS Word has no such checks available
for Polish, so anything I claim to be Polish will "pass", i.e. will not
be flagged by MS Word.)

Oct 26 '05 #3

P: n/a
On Wed, 26 Oct 2005, Jukka K. Korpela wrote:
If he is using a sufficiently new version of MS Word, he could just
select File/Save As and select a format like "Web page (filtered)"
to get an HTML version, with numeric references.
Does Word still have this nasty habit of saving &#number; references
in the range 128-159 decimal, which W3C specifies to be UNUSED?
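
For pages that already contain such references, one possible clean-up is sketched below in Python. It assumes the author meant the windows-1252 character at that position, which is the usual case but still an assumption, and rewrites the reference to the proper Unicode number. The function name `fix_c1_references` is made up for this example.

```python
import re


def fix_c1_references(html_text: str) -> str:
    """Rewrite &#128; .. &#159; as the Unicode reference of the
    windows-1252 character at that position.

    Hypothetical helper. Positions that windows-1252 leaves undefined
    (129, for example) are left untouched.
    """

    def replace(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            try:
                char = bytes([number]).decode("cp1252")
            except UnicodeDecodeError:
                return match.group(0)
            return f"&#{ord(char)};"
        return match.group(0)

    return re.sub(r"&#(\d+);", replace, html_text)


if __name__ == "__main__":
    print(fix_c1_references("&#147;quoted&#148;"))  # &#8220;quoted&#8221;
```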

Also, beware of Word's habit of allowing the user to insert symbols,
and then generating pseudo-HTML which uses Symbol font referring to
Latin-1 characters, instead of using the proper Unicode references. As
a web admin, I keep getting examples of this abuse from our Word-based
authors. Such pseudo-HTML won't work on www-compatible clients (as
you obviously know, but I'm writing this for any other readers who
haven't met this problem yet).
The "filtered" thing or something like that means that MS Word
refrains from spitting out most its usual "Office XML" stuff and you
get something reasonable like

<p class=MsoNormal><span lang=FI>This is Polish: Wa&#322;&#281;sa</span></p>

That's what a version of Word produced. Of course, the lang
attribute it inserts is worse than nonsense. It's partly my fault,
since I was lazy and didn't set the language in Word.
Which reminds me of the time that my Prof. created lots of Word
documents while his version of Word was set to its installation
default of US English; he finally complained about all the wrong
spellings it was proposing to him. But when he changed the Word
setting to British English, he found that it treated all the existing
documents as being in a foreign language (so it wouldn't add spellings
to his local dictionary). All rather confusing, really. But that was
a few years back.
Setting the language to Polish has hardly any noticeable effect at
present, but it's still the right thing to do.


It could adjust the pronunciation of a speaking browser (IBM HPR
supports the concept, although it didn't support Polish the last time
that I looked).

Oct 26 '05 #4

P: n/a
Alan J. Flavell wrote:
Does Word still have this nasty habit of saving &#number; references
in the range 128-159 decimal, which W3C specifies to be UNUSED?
Used the way I described, MS Word 2002 saves characters like the em and
en dash and curly punctuation marks "as such", as octets, and it inserts
a meta tag that says that the encoding is windows-1252. This might not be
optimal, but it's surely better, as a matter of principle, than
generating undefined references. - The same happens in the Save As Web
page (without filtering).
Also, beware of Word's habit of allowing the user to insert symbols,
and then generating pseudo-HTML which uses Symbol font referring to
Latin-1 characters, instead of using the proper Unicode references.


This problem is still present. Word 2002 just emits more "modern" stuff,
using CSS:
<span style='font-family:Symbol'>abc</span>
But I think this only happens when the user has explicitly chosen the
Symbol font. If I simply insert, say, the letter small alpha, using a
font that contains it, MS Word turns it to &#945; instead of generating
something that uses the Symbol font.
Oct 26 '05 #5

P: n/a
Wed, 26 Oct 2005 12:27:56 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
On Wed, 26 Oct 2005, Jukka K. Korpela wrote:
If he is using a sufficiently new version of MS Word, he could just
select File/Save As and select a format like "Web page (filtered)"
to get an HTML version, with numeric references.

I'm curious about the value of "sufficiently". Word 2003 did not do
this when I tried a test just now: instead it inserted actual
characters.
Does Word still have this nasty habit of saving &#number; references
in the range 128-159 decimal, which W3C specifies to be UNUSED?


I have some curly quotes in my test, and it saved them as characters
147 and 148 -- the actual characters, not “ and ”.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
Oct 26 '05 #6

P: n/a
On Wed, 26 Oct 2005, Stan Brown wrote:
I have some curly quotes in my test, and it saved them as characters
147 and 148 -- the actual characters, not “ and ”.


This means "charset=windows-1252". What happens if you throw in
some Greek and Cyrillic letters?

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Oct 26 '05 #7

P: n/a
On Wed, 26 Oct 2005, Jukka K. Korpela wrote:
<p class=MsoNormal>

Some day someone might wish to use some
styling for paragraphs generated using MS Office software, and the class
name MsoNormal is in practice a rather reliable indicator.
Good idea! I put this into my User Stylesheet to expose Word-
generated pages. ;-)
Setting the language to Polish has hardly any noticeable effect at
present, but it's still the right thing to do.


Mozilla and related browsers will display text with "lang=pl"
in the typeface selected for "Central European", which may be
better suited than the typeface selected for "Western European".
http://ppewww.ph.gla.ac.uk/~flavell/...ers-fonts.html

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Oct 26 '05 #8

P: n/a
Wed, 26 Oct 2005 16:21:46 +0200 from Andreas Prilop <nhtcapri@rrzn-
user.uni-hannover.de>:
On Wed, 26 Oct 2005, Stan Brown wrote:
I have some curly quotes in my test, and it saved them as characters
147 and 148 -- the actual characters, not “ and ”.


This means "charset=windows-1252". What happens if you throw in
some Greek and Cyrillic letters?


Still charset="windows-1252",(*) still actual characters 147 and 148,
but &#937; for a capital Omega.

And still a couple of screens worth of inline style sheet for classes
that aren't used in the HTML. :-)
(*) Mozilla 1.7 under Windows XP renders the 148 as a curly quote,
even if I change windows-1252 to iso-8859-1 in the <meta>. Is that
correct behavior?

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
Oct 26 '05 #9

P: n/a
Tim
On Wed, 26 Oct 2005 15:54:47 +0300, Jukka K. Korpela sent:
and it inserts a meta tag that says that the encoding is windows-1252.
This might not be optimal, but it's surely better, as a matter of
principle, than generating undefined references.


I suppose that would give an uploading tool a chance to do something with
the file, if you felt like it. e.g. Tidy and translate all local files
into what you ought to be putting on your webserver.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Oct 27 '05 #10

P: n/a
On Wed, 26 Oct 2005, Stan Brown wrote:
Wed, 26 Oct 2005 12:27:56 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
On Wed, 26 Oct 2005, Jukka K. Korpela wrote:
If he is using a sufficiently new version of MS Word, he could just
select File/Save As and select a format like "Web page (filtered)"
to get an HTML version, with numeric references.

I'm curious about the value of "sufficiently". Word 2003 did not do
this when I tried a test just now: instead it inserted actual
characters.


I hadn't done this myself yet, but I note that if, in Word 2003, I go
to Tools > Options, then on the General tab there's a button for [Web
Options...] - it's possible to choose a number of what appear to be
significant options, which are *NOT* (AFAICS) accessible from the
"save As..." menu (poor interface design!).

Aside from the rather obvious "Encoding" tab, there's a range of what
purport to be browser compatibility options. As you'd expect, there's
no option there to ask for W3C-compatible results :-( , and the only
non-MS browsers which get any kind of mention are Netscape 4 or
earlier.

OK, I just tried setting the "encoding" to what it calls "Western
European (ISO)" (which seems to be a dumbed-down way of saying
iso-8859-1), and saving as filtered HTML. I can report that all the
Windows-1252-specific characters which were included in my sample got
saved as their &#bignumber; Unicode representations.

Kind-of amazing that they finally got there.
I have some curly quotes in my test, and it saved them as characters
147 and 148 -- the actual characters, not “ and ”.


Well, that's technically correct too, then, on condition that the
encoding is announced as windows-1252. But it's rather rude of them
to make that the default.
Oct 27 '05 #11

P: n/a
On Wed, 26 Oct 2005, Stan Brown wrote:
Wed, 26 Oct 2005 16:21:46 +0200 from Andreas Prilop <nhtcapri@rrzn-
user.uni-hannover.de>:

This means "charset=windows-1252". What happens if you throw in
some Greek and Cyrillic letters?
Still charset="windows-1252",(*) still actual characters 147 and 148,
but &#937; for a capital Omega.


That's still consistent, then.
(*) Mozilla 1.7 under Windows XP renders the 148 as a curly quote,
even if I change windows-1252 to iso-8859-1 in the <meta>. Is that
correct behavior?


I don't think there's any definition of "correct" behaviour when
UNUSED characters are included in HTML.

Where there -is- a definition of correct behaviour - one which is
typically violated by MSIE - then I'm glad to see that the Mozilla
folk tend to stand fast against misguided demands to just do what IE
does. At least in their Standards mode(s).

But where no particular correct behaviour is mandated, they seem
content to do what web pages normally expect. And I'd say it's a
historical fact that Win-based browsers (not only MSIE) tend to behave
as if they're working in Windows-1252, even when told that the
document is in iso-8859-1. And this seems to be what you're
describing here.
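
The difference between the two readings can be shown with a few lines of Python. This is just a sketch of how the same two octets decode under each label, not a description of any browser's code; the helper name `code_points` is made up for this example.

```python
def code_points(octets: bytes, encoding: str) -> list:
    """Decode octets with the given encoding and list the code points."""
    return [f"U+{ord(ch):04X}" for ch in octets.decode(encoding)]


if __name__ == "__main__":
    raw = bytes([147, 148])  # the two octets Word wrote for the curly quotes

    # Read as windows-1252: the curly double quotation marks.
    print(code_points(raw, "cp1252"))   # ['U+201C', 'U+201D']

    # Read strictly as iso-8859-1: two C1 control characters.
    print(code_points(raw, "latin-1"))  # ['U+0093', 'U+0094']
```

A browser that shows curly quotes for a page labelled iso-8859-1 is in effect taking the first reading.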

But that behaviour wasn't copied, for example, in Netscape (<=4.*)
implementations for non-Windows platforms (unixoid and, I suppose,
Mac), which is what led to the trenchant[1] remarks on the
"demoronizer" web site http://www.fourmilab.ch/webtools/demoroniser/

cheers

[1]Nebenbei - I still don't know a good German translation for
"trenchant", which I wanted when we were translating something or
other for an FAQ. Online suggestions are "energisch" or "scharf",
neither of which seems quite le mot juste. Ho hum.
Oct 27 '05 #12

P: n/a
Thu, 27 Oct 2005 12:54:39 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
On Wed, 26 Oct 2005, Stan Brown wrote:
(About Word 2003 under Windows XP Pro SP2)
Wed, 26 Oct 2005 12:27:56 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
On Wed, 26 Oct 2005, Jukka K. Korpela wrote:
> If he is using a sufficiently new version of MS Word, he could just
> select File/Save As and select a format like "Web page (filtered)"
> to get an HTML version, with numeric references.


I'm curious about the value of "sufficiently". Word 2003 did not do
this when I tried a test just now: instead it inserted actual
characters.


I hadn't done this myself yet, but I note that if, in Word 2003, I go
to Tools > Options, then on the General tab there's a button for [Web
Options...] - it's possible to choose a number of what appear to be
significant options, which are *NOT* (AFAICS) accessible from the
"save As..." menu (poor interface design!).


I'll say! I did look for options on Save As and didn't see any; I
didn't think to look elsewhere.
Aside from the rather obvious "Encoding" tab,
:-) It's obvious, but that didn't stop me from missing it the first
time, even after following your paragraph above I looked at Options
-> General -> Web (and why isn't "Web" a tab? there are already a
dozen tabs -- adding one would be better than burying it on one of
the pages).

Once you knocked me on the head :-) I went back and picked Encoding
Western European, checking "Always save Web pages in the default
encoding". Then on the Fonts tab, under Character Set(!) I picked
English/Western European/Other Latin.
there's a range of what
purport to be browser compatibility options. As you'd expect, there's
no option there to ask for W3C-compatible results :-( , and the only
non-MS browsers which get any kind of mention are Netscape 4 or
earlier.
Apparently clicking different "what browser this will be viewed on"
choices only changes which items are checked in the five
compatibility options below.
OK, I just tried setting the "encoding" to what it calls "Western
European (ISO)" (which seems to be a dumbed-down way of saying
iso-8859-1), and saving as filtered HTML. I can report that all the
Windows-1252-specific characters which were included in my sample got
saved as their &#bignumber; Unicode representations.
I tried something with curly quotes, a degree sign, and Greek capital
Alpha and Omega. The curly quotes came out as &#8200-odd, much to my
amazement, and the Greek as &#900-odd. The degree sign was written to
the file as a single character, which out of habit I never do; but
I'm pretty sure it's legal since degree is character 176 in my
selected encoding.
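
The same outcome can be reproduced outside Word with a short Python sketch: encode to iso-8859-1 and let anything that does not fit fall back to a numeric reference. Whether Word works this way internally is not known; the function name `encode_with_references` and the sample string are assumptions for this example.

```python
def encode_with_references(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Encode text, writing unencodable characters as &#DDDD; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # Curly quotes, a degree sign, Greek capital Alpha and Omega.
    sample = "\u201c25\u00b0\u201d \u0391\u03a9"
    print(encode_with_references(sample))
    # b'&#8220;25\xb0&#8221; &#913;&#937;'
```

The degree sign stays a single octet (0xB0, decimal 176) because iso-8859-1 contains it; the quotes and the Greek letters do not fit and become references.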

BTW, the character encoding in the <meta> tag was ISO-8859-15, not
ISO-8859-1. AFAIK they're the same except that -15 has a euro symbol.
Kind-of amazing that they finally got there.


Extremely so.

I have a page of Word tips in preparation -- I should write down all
these convolutions while they're fresh in my mind. Thanks for
supplying the missing pieces.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
Oct 27 '05 #13

P: n/a
Thu, 27 Oct 2005 13:08:47 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
On Wed, 26 Oct 2005, Stan Brown wrote:
(*) Mozilla 1.7 under Windows XP renders the 148 as a curly quote,
even if I change windows-1252 to iso-8859-1 in the <meta>. Is that
correct behavior?


I don't think there's any definition of "correct" behaviour when
UNUSED characters are included in HTML.


Makes sense; thanks.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
Oct 27 '05 #14

P: n/a
Tim
On Thu, 27 Oct 2005 12:54:39 +0100, Alan J. Flavell sent:
Aside from the rather obvious "Encoding" tab, there's a range of what
purport to be browser compatibility options. As you'd expect, there's no
option there to ask for W3C-compatible results :-( , and the only non-MS
browsers which get any kind of mention are Netscape 4 or earlier.


Even OpenOffice.org, of which I expect better, only offers similar HTML
export compatibility choices:

HTML 3.2
Microsoft Internet Explorer
Netscape Navigator (the default)
OpenOffice.org Writer

Although, doing a simple test with it set for Nutscrape, it's outputting
HTML 4.0 Transitional. And almost (*) without any errors, though I only
did something as simple as a heading, a bulleted list, and a couple of
paragraphs.

* There's no page title, it didn't ask me to provide one, I had to
manually adjust the page properties to put one in (no hints given, I just
tested an assumption about how it might be done).

I've never really liked HTML authoring tools, though this might just be
usable as a basic word processor for creating a page, that I'd then
post-process. I wouldn't touch MS Word with a ten foot bargepole, though.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Oct 27 '05 #15

P: n/a
On Thu, 27 Oct 2005, Tim wrote:

[of OpenOffice...]
I've never really liked HTML authoring tools, though this might just be
usable as a basic word processor for creating a page, that I'd then
post-process.
OK.
I wouldn't touch MS Word with a ten foot bargepole, though.


No, I wouldn't either, from choice; but the Uni has a campus licence
for the thing, and many of the authors who supply material for web
pages are going to use it anyway, so it's useful if we can find ways
to tame it.

For years I had been relying on a customised configuration of
rtf2html (I mean the one which later transmogrified to Logictran).
But I think this, subject to various caveats which I might post on
other branches of this thread, and with a tad of post-processing,
might be acceptable now.

cheers
Oct 27 '05 #16

P: n/a
On Thu, 27 Oct 2005, Stan Brown wrote, quoting me:
there's a range of what purport to be browser compatibility
options. As you'd expect, there's no option there to ask for
W3C-compatible results :-( , and the only non-MS browsers which
get any kind of mention are Netscape 4 or earlier.
Apparently clicking different "what browser this will be viewed on"
choices only changes which items are checked in the five
compatibility options below.


I'm not sure that's the whole story. Certainly there are some
interactions between the browser compatibility bar and the option
boxes, but certain choices can be made independently. And I suspect
there may be other compatibility features behind the scenes which
aren't evident from those few option boxes.
OK, I just tried setting the "encoding" to what it calls "Western
European (ISO)" (which seems to be a dumbed-down way of saying
iso-8859-1), and saving as filtered HTML. I can report that all the
Windows-1252-specific characters which were included in my sample got
saved as their &#bignumber; Unicode representations.


I tried something with curly quotes, a degree sign, and Greek capital
Alpha and Omega. The curly quotes came out as &#8200-odd, much to my
amazement, and the Greek as &#900-odd. The degree sign was written to
the file as a single character, which out of habit I never do; but
I'm pretty sure it's legal since degree is character 176 in my
selected encoding.


That all looks fine to me.
BTW, the character encoding in the <meta> tag was ISO-8859-15, not
ISO-8859-1.
Ouch! I got iso-8859-1. So maybe this is another effect of the
browser compatibility options? Or is it?

No - hang on - their pull-down menu has "Latin 9 (ISO)" (which is
another way of saying iso-8859-15) as a separate selection than
"Western European (ISO)", but there's no entry that says explicitly
"Latin 1 (ISO)". How odd. So we seem to have an open question here.
"Western European (ISO)" definitely emitted iso-8859-1 for me.
AFAIK they're the same except that -15 has a euro symbol.


My euro character in my test document turned into &#8364; - which is
as it should be in iso-8859-1.
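
For reference, iso-8859-15 differs from iso-8859-1 in eight positions, not only the euro. The Python sketch below lists them by decoding each octet under both encodings; the function name `differing_positions` is made up for this example.

```python
def differing_positions(enc_a: str = "iso-8859-1", enc_b: str = "iso-8859-15"):
    """Return (octet, char_in_a, char_in_b) for each octet that differs."""
    diffs = []
    for value in range(0xA0, 0x100):
        octet = bytes([value])
        a, b = octet.decode(enc_a), octet.decode(enc_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs


if __name__ == "__main__":
    for value, a, b in differing_positions():
        print(f"0x{value:02X}: U+{ord(a):04X} vs U+{ord(b):04X}")

    # The euro has its own octet in -15 but needs a reference in -1.
    print("\u20ac".encode("iso-8859-15"))                     # b'\xa4'
    print("\u20ac".encode("iso-8859-1", "xmlcharrefreplace"))  # b'&#8364;'
```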

Earlier peer-reviewed advice on character encoding had concluded
that there was no point at all in using iso-8859-15 for an HTML
document: by the time that browsers had started supporting -15, they
already had adequate support for utf-8, and so the more-compatible
thing to do, if Windows-1252 was considered rude, was to use utf-8.

Use of 8859-15 for a plain-text document is a different matter
altogether, of course. But we're not concerned with that here.

cheers
Oct 27 '05 #17

P: n/a
Thu, 27 Oct 2005 23:57:30 +0900 from Tim
<ti*@mail.localhost.invalid>:
I've never really liked HTML authoring tools, though this might just be
usable as a basic word processor for creating a page, that I'd then
post-process. I wouldn't touch MS Word with a ten foot bargepole, though.


I agree completely. But like it or not, a lot of people have bought
into MS propaganda that they can create Web pages in Word, so I think
we want to tell them how to minimize the damage.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
Oct 27 '05 #18

P: n/a
Thu, 27 Oct 2005 17:20:32 +0100 from Alan J. Flavell
<fl*****@ph.gla.ac.uk>:
No - hang on - their pull-down menu has "Latin 9 (ISO)" (which is
another way of saying iso-8859-15) as a separate selection than
"Western European (ISO)", but there's no entry that says explicitly
"Latin 1 (ISO)". How odd. So we seem to have an open question here.
"Western European (ISO)" definitely emitted iso-8859-1 for me.


Perhaps because I'm running a US English version and you're running
(I guess) the UK English version?

It's a damn silly difference, but then, this _is_ Microsoft.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
Oct 27 '05 #19

P: n/a
"Jukka K. Korpela" ,comp.infosystems.www.authoring.html:
<p class=MsoNormal>This is Polish: <span lang=PL>Wa&#322;&#281;sa</span></p>
[snip]
Setting the language to Polish has hardly any noticeable effect at
present, but it's still the right thing to do. (I guess the most
probable situation where it is useful is when someone opens the HTML
document in MS Word or some compatible program, which recognizes the
lang markup and uses this information in its spelling or grammar checks.
Somewhat deceptively, my version of MS Word has no such checks available
for Polish, so anything I claim to be Polish will "pass", i.e. will not
be flagged by MS Word.)


It is also useful in all kinds of natural language processing usages, in
particular machine translation (the main issue is that, because of the
high frequency of incorrect lang/xml:lang attributes, NLP software cannot
rely solely on these in the general case).
Oct 27 '05 #20

P: n/a
On Thu, 27 Oct 2005, Alan J. Flavell wrote:

[of "filtered" HTML emitted by Word 2003]
But I think this, subject to various caveats which I might post on
other branches of this thread, and with a tad of post-processing,
might be acceptable now.


I *did* mean to make a negative comment about the fact that the
wretched thing insists on emitting in-line CSS that specifies font
family - and fixed text-sizes in pt units. Like this (shudder!):

<span style='font-size:13.5pt;font-family:"Lucida Sans Unicode"'>‟</span>

The options seemed to include the possibility to change the family, or
to change the numbers of pt units; but what we'd *really* want is the
option to take them out altogether, and apply presentation(s) via an
external stylesheet.

Still, I suppose if we're going to have to post-process the crap for
one reason, we might as well post-process it for *several* reasons.
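
As one example of such post-processing, the Python sketch below removes inline style attributes so that presentation can come from an external stylesheet. It is a rough regular-expression pass, not a real HTML parser: it assumes the attribute values are quoted, as in the Word output above, and it would also match similar-looking text outside tags. The function name `strip_inline_styles` is made up for this example.

```python
import re

# Matches style='...' or style="..." together with the whitespace before it.
_STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:'[^']*'|"[^"]*")""", re.IGNORECASE)


def strip_inline_styles(html_text: str) -> str:
    """Remove quoted style attributes; everything else is left as is."""
    return _STYLE_ATTR.sub("", html_text)


if __name__ == "__main__":
    word_output = (
        "<span style='font-size:13.5pt;"
        "font-family:\"Lucida Sans Unicode\"'>text</span>"
    )
    print(strip_inline_styles(word_output))  # <span>text</span>
```

The leftover empty span elements are harmless, but a second pass could drop them too.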

The nice thing about the rtf2html process that I was doing earlier was
that if you didn't like what was coming out, you changed the
conversion template to suit, and re-ran the whole conversion from a
single click - no post-processing phase was used. But it had
sufficient rough edges that some other process was really needed by
now. In fact to be honest it had grown long whiskers....

regards
Oct 28 '05 #21

P: n/a
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla and related browsers will display text with "lang=pl"
in the typeface selected for "Central European", which may be
better suited than the typeface selected for "Western European".


I've been thinking about this but I still cannot decide whether
this problem is serious enough to justify avoiding language markup.

It's not that bad if a block of quoted text in Polish appears in a
different font; it might even be useful. But if individual names and other
words inside paragraphs jump out that way, it can be nasty.

I suppose the problem is not very common, since Mozilla & Co. seem to have
default settings so that the same fonts are used for "Western European" and
"Central European".

But should we tell authors to explicitly suggest font-family for all text,
if they use language markup for inline elements? After all, the "guess font
from your guess of language" game is played only when a page does not tell
which font should be used.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Nov 1 '05 #22

P: n/a
On Tue, 1 Nov 2005, Jukka K. Korpela wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla and related browsers will display text with "lang=pl"
in the typeface selected for "Central European", which may be
better suited than the typeface selected for "Western European".
^^^^^^^^^^^^^

It's not that bad if a block of quoted text in Polish appears in a
different font; it might even be useful.


Err ... that's exactly what I wrote.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Nov 2 '05 #23

P: n/a
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Tue, 1 Nov 2005, Jukka K. Korpela wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Mozilla and related browsers will display text with "lang=pl"
in the typeface selected for "Central European", which may be better
suited than the typeface selected for "Western European".
^^^^^^^^^^^^^

It's not that bad if a block of quoted text in Polish appears in a
different font; it might even be useful.


Err ... that's exactly what I wrote.


I think there's a considerable difference in attitude.

What I tried to say, between the lines at least, was that the feature is mostly
_harmful_, though usually not very serious for a couple of reasons - but
I admitted that for _blocks_ of text it might even be useful. I think it's
misleading to present the feature as a benefit of using language markup, when
it is in fact a risk - though a controllable risk, if you know about it
and use explicit font suggestions.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Nov 3 '05 #24
