Bytes IT Community

convert non-western languages to HTML from Word

P: n/a
I really need some help with language conversion to HTML. My
translators are translating into Word and I need to convert Word to
HTML. It's been a while since I've worked with Unicode, and I know that
not all the fonts being used are Unicode. Is there a way to strip all
the junk out of the MS Word filtered (yeah right) HTML? All I want are
the basic formatting tags: no spans, fonts, divs, or CSS. But I don't want
to lose any language identification in the doctype or meta tags, or any
directionality. Any help would be greatly appreciated. This is for a
highly ranked (Google) non-profit site.
Thanks
Jan 19 '08 #1
10 Replies


P: n/a
On 2008-01-19, annalisa <an*******@yahoo.com> wrote:
> I really need some help with language conversion to HTML. My
> translators are translating into Word and I need to convert Word to
> HTML. It's been awhile since I've worked with Unicode and know that
> not all the fonts being used are unicode.
Never mind the fonts. What you want out of the Word docs is the
characters. You need to figure out how Word has encoded the output and
then probably transcode it to UTF-8 (you don't have to use UTF-8 but
it's simpler).

A good transcoding program is "iconv".
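As a minimal sketch of the transcoding step, assuming the Word export turned out to be in Windows-1256 (an assumption for illustration; the actual encoding has to be checked per file), the same thing can be done in Python. The function name `transcode_to_utf8` is hypothetical. With iconv itself, the equivalent would be along the lines of `iconv -f WINDOWS-1256 -t UTF-8 in.html > out.html`.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # Strict decoding (the default) raises UnicodeDecodeError if the
    # guessed source encoding does not match the data, instead of
    # silently producing mojibake.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Assumed example: the Arabic word "salaam" saved as Windows-1256.
    legacy_bytes = "\u0633\u0644\u0627\u0645".encode("cp1256")
    utf8_bytes = transcode_to_utf8(legacy_bytes, "cp1256")
    print(utf8_bytes.decode("utf-8"))
```

If the page is served as UTF-8 afterwards, the charset declared in the meta tag has to be updated to match.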
> Is there a way to strip all the junk out of the MS Word filtered (yeah
> right) HTML.
I am lucky enough not to be speaking from experience of having had to do
that but I would start with Python and BeautifulSoup.
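A rough sketch of that kind of clean-up follows. To keep it self-contained it uses the standard library's `html.parser` in place of BeautifulSoup, and the whitelists (`KEEP_TAGS`, `KEEP_ATTRS`) and the names `WordHtmlCleaner` and `clean_word_html` are assumptions chosen for illustration, not a tested recipe for real Word output. It drops `style`/`script` content, unknown tags and presentational attributes, but keeps the doctype, meta tags, and any `lang` or `dir` attributes.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists; adjust to the formatting you actually want to keep.
KEEP_TAGS = {"html", "head", "title", "meta", "body", "p", "br", "h1", "h2",
             "h3", "ul", "ol", "li", "b", "i", "em", "strong", "a"}
KEEP_ATTRS = {"lang", "dir", "href", "charset", "http-equiv", "content", "name"}
VOID_TAGS = {"meta", "br"}
DROP_CONTENT = {"style", "script"}  # element content is discarded entirely


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0   # > 0 while inside <style> or <script>
        self.span_stack = []  # True for each open <span> that was kept

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep the doctype

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = [(k, v or "") for k, v in attrs if k in KEEP_ATTRS]
        if tag == "span":
            # Keep a span only when it carries language or direction info.
            keep = any(k in ("lang", "dir") for k, _ in kept)
            self.span_stack.append(keep)
            if not keep:
                return
        elif tag not in KEEP_TAGS:
            return
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "span":
            if self.span_stack and self.span_stack.pop():
                self.out.append("</span>")
            return
        if tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html: str) -> str:
    parser = WordHtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments, including Word's conditional-comment blocks, are dropped because the sketch does not override `handle_comment`. Badly nested spans in the input could confuse the simple span stack, so the output is worth checking by eye.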
> All I want are
> the basic formatting tags no spans,fonts, divs, css, but I don't want
> to lose any language identification in the doctype or metatags or
> directionality.
Directionality should just work-- the characters are stored from "start"
to "end" and it's up to the browser to lay them out right-to-left or
left-to-right where appropriate.

An interesting question though is whether your authors have used special
characters like RLO and RLE, and whether if they have Word will save
them out as the Unicode characters.

Then you have to decide whether to leave them in the output, or to
replace them with the equivalent unicode-bidi properties. I don't know
which has better browser support.
Jan 19 '08 #2

P: n/a
On Sat, 19 Jan 2008, Ben C wrote:
> Directionality should just work-- the characters are stored from "start"
> to "end" and it's up to the browser to lay them out right-to-left or
> left-to-right where appropriate.
Directionality doesn't "just work" - on the contrary, the bidirectional
algorithm refers to seven control or formatting characters and explains
how to use them. In HTML however, you should replace them with DIR
markup. Read more at
* http://www.unics.uni-hannover.de/nht.../if.tut.sc.www
> An interesting question though is whether your authors have used special
> characters like RLO and RLE, and whether if they have Word will save
> them out as the Unicode characters.
> Then you have to decide whether to leave them in the output, or to
> replace them with the equivalent unicode-bidi properties.
By "unicode-bidi properties", do you mean "CSS properties"?
Normally, you should prefer HTML markup (DIR attribute) to
CSS properties and to Unicode control characters.
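
To make the replacement concrete, here is a small sketch that turns the embedding and override characters into markup. It assumes the usual correspondence (LRE/RLE become an inline element with `dir`, LRO/RLO become `bdo`, and PDF becomes the matching end tag), and it assumes the input is the already-escaped text of a single paragraph, so no embedding crosses a block boundary. The function name `bidi_controls_to_markup` is hypothetical.

```python
# Opening control character -> (start tag, end tag)
OPENERS = {
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def bidi_controls_to_markup(text: str) -> str:
    out = []
    closers = []  # end tags for the currently open embeddings/overrides
    for ch in text:
        if ch in OPENERS:
            start_tag, end_tag = OPENERS[ch]
            out.append(start_tag)
            closers.append(end_tag)
        elif ch == PDF:
            # An unmatched PDF is simply dropped.
            if closers:
                out.append(closers.pop())
        else:
            out.append(ch)
    # Close anything left open at the end of the paragraph.
    while closers:
        out.append(closers.pop())
    return "".join(out)
```

LRM and RLM (U+200E, U+200F) are left untouched here, since they have no element equivalent.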

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Jan 22 '08 #3

P: n/a
On Sun, 20 Jan 2008, Jukka K. Korpela wrote:
>> An interesting question though is whether your authors have used
>> special characters like RLO and RLE, and whether if they have Word
>> will save them out as the Unicode characters.
>
> That might be a problem... but in my test, RLO doesn't seem to work
> even in Word,
You must first install
http://www.microsoft.com/globaldev/h...pintlsupp.mspx
http://www.microsoft.com/globaldev/h...kintlsupp.mspx
> so why would an author use it?
Authors should avoid these Unicode characters in HTML:
http://www.unics.uni-hannover.de/nht...l-text#control
You could use character references like &#x202B;. Check at
http://www.unics.uni-hannover.de/nht...t-to-left.html
whether they work in your browser.
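
As an illustration of the character-reference approach, the following sketch rewrites the literal control characters in a source string as hexadecimal references, so the source file stays readable in any editor. The name `controls_to_char_refs` and the choice of hexadecimal notation are assumptions; decimal references would work the same way.

```python
import re

# LRM, RLM and the five embedding/override controls U+202A to U+202E.
BIDI_CONTROLS = re.compile("[\u200E\u200F\u202A-\u202E]")


def controls_to_char_refs(text: str) -> str:
    """Replace literal bidi control characters with numeric references."""
    return BIDI_CONTROLS.sub(lambda m: f"&#x{ord(m.group()):04X};", text)
```

For example, a string containing a literal U+202B followed later by U+202C comes back with `&#x202B;` and `&#x202C;` in their place.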

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Jan 22 '08 #4

P: n/a
On 2008-01-22, Andreas Prilop <ap*********@trashmail.net> wrote:
> On Sat, 19 Jan 2008, Ben C wrote:
>> Directionality should just work-- the characters are stored from "start"
>> to "end" and it's up to the browser to lay them out right-to-left or
>> left-to-right where appropriate.
>
> Directionality doesn't "just work" - on the contrary, the bidirectional
> algorithm refers to seven control or formatting characters and explains
> how to use them.
But do they work in Word? And do you know how widely used they are in
general?
> In HTML however, you should replace them with DIR
> markup.
> Read more at
> * http://www.unics.uni-hannover.de/nht.../if.tut.sc.www
Useful link, thanks.
>> An interesting question though is whether your authors have used special
>> characters like RLO and RLE, and whether if they have Word will save
>> them out as the Unicode characters.
>> Then you have to decide whether to leave them in the output, or to
>> replace them with the equivalent unicode-bidi properties.
>
> By "unicode-bidi properties", do you mean "CSS properties"?
Well, I mean one particular CSS property: unicode-bidi.
Jan 22 '08 #5

P: n/a
On Tue, 22 Jan 2008, Ben C wrote:
>> Directionality doesn't "just work" - on the contrary, the bidirectional
>> algorithm refers to seven control or formatting characters and explains
>> how to use them.
>
> But do they work in Word?
First install
http://www.microsoft.com/globaldev/h...kintlsupp.mspx
http://www.microsoft.com/globaldev/h...pintlsupp.mspx

Then download
http://www.unics.uni-hannover.de/nht...t-to-left.html

Open this file in your editor(s) and compare.
> And do you know how widely used they are in general?
Right-to-left works with Mozilla 1.0, Internet Explorer 5.0;
but I cannot test every browser. Everybody can test with
http://www.unics.uni-hannover.de/nht...t-to-left.html
because this page uses only (or mostly) ASCII digits.
You don't need to read Arabic or Hebrew.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Jan 23 '08 #6

P: n/a
On Tue, 22 Jan 2008, Ben C wrote:
>> http://www.unics.uni-hannover.de/nht...l-text#control
>
> I'm not concerned about the difficulty of editing the source text
> (you may have a point there, but it's not the one I'm interested in).
Then we speak about different things.
> Are you sure using character references instead of UTF-8 makes any
> difference?
I wrote about the *source* text, not the page rendering.

In the above mentioned paragraph, there are two links to
http://www.unics.uni-hannover.de/nht...haracters.html
http://www.unics.uni-hannover.de/nht...haracters.text

Have you read them? As you can (or should) see, the page display
is (or should be) the same. However one line in the source text
is f^H messed up.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
Jan 23 '08 #7

P: n/a
On 2008-01-23, Andreas Prilop <ap*********@trashmail.net> wrote:
> On Tue, 22 Jan 2008, Ben C wrote:
> [...]
>> And do you know how widely used they are in general?
>
> Right-to-left works with Mozilla 1.0, Internet Explorer 5.0;
> but I cannot test every browser. Everybody can test with
No I meant to ask, do you know how often writers of Arabic and other
bidirectional languages actually use RLE, RLO, etc.?
Jan 23 '08 #8

P: n/a
On Wed, 23 Jan 2008, Ben C wrote:
> No I meant to ask, do you know how often writers of Arabic and other
> bidirectional languages actually use RLE, RLO, etc.?
I have no idea.

--
Bugs in Internet Explorer 7
http://www.unics.uni-hannover.de/nhtcapri/ie7-bugs
Jan 23 '08 #9

P: n/a
On 2008-01-23, Andreas Prilop <ap*********@trashmail.net> wrote:
> On Tue, 22 Jan 2008, Ben C wrote:
>>> http://www.unics.uni-hannover.de/nht...l-text#control
>>
>> I'm not concerned about the difficulty of editing the source text
>> (you may have a point there, but it's not the one I'm interested in).
>
> Then we speak about different things.
>> Are you sure using character references instead of UTF-8 makes any
>> difference?
>
> I wrote about the *source* text, not the page rendering.
Ah I see what you mean: you mean when viewing the source in a
bidirectional text editor the bidi algorithm (as applied by the editor)
will be confounded by the newlines.

On the other hand if you use &#x202B; in the source, they won't have any
effect at all in the editor, which also won't be right. But at least
it may be less confusing.

I haven't ever used a bidi text editor, but do they insert newlines or
U+2028s when you press the RETURN key?

U+2028 in the source text is not a great idea because it wouldn't be
collapsed according to the CSS 2.1 specification. Such an editor would
therefore be unsuitable for editing HTML intended for use in a browser. If
it did insert newlines, OTOH, it wouldn't be very useful for bidirectional
text editing, at least not in the conventional fashion in which lines
are explicitly broken to wrap them.

CSS 2.1 should probably specify that U+2028s should be collapsed
whenever LFs are.
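
If an editor did leave U+2028 in the file, one possible workaround (a sketch only, under the assumption that an ordinary line feed is an acceptable substitute in the HTML source) would be to normalize the file before publishing. The helper name `normalize_line_separators` is hypothetical.

```python
LINE_SEPARATOR = "\u2028"


def normalize_line_separators(source: str) -> tuple[str, int]:
    """Replace each U+2028 with LF and report how many were replaced."""
    count = source.count(LINE_SEPARATOR)
    return source.replace(LINE_SEPARATOR, "\n"), count
```

The returned count makes it easy to see whether a given editor produces U+2028 at all.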

But I am interested to know if you have any experience with
bidirectional text editors and what their behaviour is.
Jan 23 '08 #10

P: n/a
On Wed, 23 Jan 2008, Ben C wrote:
>> I wrote about the *source* text, not the page rendering.
>
> Ah I see what you mean: you mean when viewing the source in a
> bidirectional text editor the bidi algorithm (as applied by the editor)
> will be confounded by the newlines.
Not by the newlines but by the bidi control characters.
> On the other hand if you use &#x202B; in the source, they won't have any
> effect at all, in the editor, which also won't be right.
Not right? What do you mean?
> U+2028 in the source text is not a great idea
I never wrote about U+2028 or &#x2028;
but only about U+202A to U+202E.

--
Top-posting.
What's the most irritating thing on Usenet?
Jan 23 '08 #11
