MS Word to XHTML

Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

Any comments on the quality/effectiveness of suitable products also
welcomed.

Sep 11 '05 #1
__/ [Caversham] on Sunday 11 September 2005 06:02 \__
Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

Any comments on the quality/effectiveness of suitable products also
welcomed.


I would advise you to do the following:

* Download Open Office 2 beta (openoffice.org )

* Install it on your Windows machine

* Open the Word document in Open Office

* Save or export as HTML

* Fragment the output as required, probably by hand (WYSIWYG programs like
Word have no notion of structure or semantics)

* Run HTML Tidy on the resulting HTML (find it at tidy.sourceforge.net)

* Modify output to fit XHTML standards

* Use search & replace for the task above

* Lastly, make sure your code validates (W3C validator)
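
Before the W3C validator step, a quick local well-formedness check can catch unclosed or mismatched tags. The following is only a sketch: it uses Python's standard xml.etree.ElementTree parser, the helper name is_well_formed is made up for illustration, and it does not validate against the XHTML DTD (named entities such as &nbsp; will also be rejected here because no DTD is loaded).

```python
import xml.etree.ElementTree as ET

def is_well_formed(xhtml_text):
    """Return (True, None) if the text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xhtml_text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)

# Sample inputs, invented for this example.
good = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
        '<body><p>Hi<br /></p></body></html>')
bad = '<html><body><p>Hi<br></p></body></html>'  # <br> is never closed

print(is_well_formed(good))
print(is_well_formed(bad)[0])
```

Passing this check only means the markup is well-formed XML; the W3C validator is still needed to confirm it is valid XHTML.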

Good luck,

Roy

--
Roy S. Schestowitz | "Slashdot is standard-compliant... in Japan"
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
7:40am up 17 days 6:08, 3 users, load average: 2.10, 2.08, 1.85
Sep 11 '05 #2

On Sun, 11 Sep 2005, Roy Schestowitz wrote (seen on alt.html):

[...]
* Fragment the output as required, probably by hand (WYSIWYG programs
like Word have no notion of structure or semantics)


This isn't by any means aimed at you personally, but your posting
triggered a response from me, and it looks as if knowledge is proceeding
backwards.

Proper use of MS Word uses Styles, oriented towards the structure of the
document. (If I had my way, I'd rip the direct styling buttons out of the
main menu of Word, and hide them away in an Advanced Users menu). Such
properly-made Word documents are reasonably capable of being converted
well to structural HTML, and a stylesheet suitable for web use can then be
applied (it usually won't be the same "style sheet" (= style template) as
would be suitable for a printed Word document, of course!).
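
To make the point about Styles concrete, here is a minimal sketch in Python of how paragraph style names could drive structural HTML. It assumes the document has already been reduced to (style name, text) pairs by some extraction step that is not shown; the mapping table and the function name paragraphs_to_html are illustrative assumptions, not the behaviour of any particular converter.

```python
from html import escape

# Hypothetical mapping from Word paragraph style names to HTML elements.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Normal": "p",
    "Quote": "blockquote",
}

def paragraphs_to_html(paragraphs):
    """Convert (style_name, text) pairs to HTML, one element per paragraph."""
    lines = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")  # unknown styles fall back to <p>
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)

# Invented sample data standing in for a styled Word document.
doc = [
    ("Heading 1", "Introduction"),
    ("Normal", "Cheese & cream."),
    ("Heading 2", "Scope"),
]
print(paragraphs_to_html(doc))
```

If every paragraph were styled "Normal" and made to look like a heading with direct formatting, this mapping would have nothing to work with, which is the problem with visually formatted documents.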

I had some experience, around 1997-8, with the (payware) rtftohtml program
- subsequently renamed and marketed under the company name Logictran - it
had this pretty-much sorted out. I must admit I haven't got experience of
it since the change of name, but I can say that the principles of the
original program seemed to be what I was looking for, unlike most of the
other pseudo-WYSIWYG garbage from other places (that offended all sense of
what is suitable for the WWW).

With that rtftohtml program, decently structured Word could be turned into
decently structured HTML, and split on chapter or section headings quite
automatically, with HTML indexes and table of contents generated
automatically. OK, there were some rough edges, but at least the
principles showed up just fine. I find it sad that some 7 years later we
seem to have fallen back to the stone age of direct styling and
pseudo-WYSIWYG in most of the Word conversions that I have seen.

[Note - there are other programs called rtftohtml or rtf2html - it may be
that some of them do a similar job, I can't speak for or against them,
I'm just commenting as a reasonably satisfied user of version 4 of this
particular program from around 1998 onwards.]
Sep 11 '05 #3
Alan J. Flavell wrote:
On Sun, 11 Sep 2005, Roy Schestowitz wrote (seen on alt.html):

[...]
* Fragment the output as required, probably by hand (WYSIWYG programs
like Word have no notion of structure or semantics)

This isn't by any means aimed at you personally, but your posting
triggered a response from me, and it looks as if knowledge is proceeding
backwards.

Proper use of MS Word uses Styles, oriented towards the structure of the
document. (If I had my way, I'd rip the direct styling buttons out of the
main menu of Word, and hide them away in an Advanced Users menu). Such
properly-made Word documents are reasonably capable of being converted
well to structural HTML, and a stylesheet suitable for web use can then be
applied (it usually won't be the same "style sheet" (= style template) as
would be suitable for a printed Word document, of course!).

I had some experience, around 1997-8, with the (payware) rtftohtml program
- subsequently renamed and marketed under the company name Logictran - it
had this pretty-much sorted out. I must admit I haven't got experience of
it since the change of name, but I can say that the principles of the
original program seemed to be what I was looking for, unlike most of the
other pseudo-WYSIWYG garbage from other places (that offended all sense of
what is suitable for the WWW).

With that rtftohtml program, decently structured Word could be turned into
decently structured HTML, and split on chapter or section headings quite
automatically, with HTML indexes and table of contents generated
automatically. OK, there were some rough edges, but at least the
principles showed up just fine. I find it sad that some 7 years later we
seem to have fallen back to the stone age of direct styling and
pseudo-WYSIWYG in most of the Word conversions that I have seen.

[Note - there are other programs called rtftohtml or rtf2html - it may be
that some of them do a similar job, I can't speak for or against them,
I'm just commenting as a reasonably satisfied user of version 4 of this
particular program from around 1998 onwards.]


Word XP and upwards stores its documents in XML format, doesn't it? You
could probably write your own XSLT to turn it into HTML fairly easily.

--
x theSpaceGirl (miranda)

# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
# this post (c) Miranda Thomas 2005
# explicitly no permission given to Forum4Designers
# to duplicate this post.
Sep 11 '05 #4
Roy Schestowitz wrote:
* Run HTML Tidy on the resulting HTML (find it at tidy.sourceforge.net)
* Modify output to fit XHTML standards
* Use search & replace for the task above


Tidy can do all of this -- use the "-asxhtml" option.
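
For anyone scripting this step, a minimal Python sketch of calling Tidy with that option might look like the following. The file names are placeholders, the helper names tidy_command and run_tidy are made up, and it assumes a tidy executable is on the PATH; -q (quiet) and -o (output file) are standard Tidy options alongside -asxhtml.

```python
import shutil
import subprocess

def tidy_command(src, dest):
    """Build a Tidy command line that writes XHTML for src into dest."""
    return ["tidy", "-q", "-asxhtml", "-o", dest, src]

def run_tidy(src, dest):
    """Run Tidy if it is installed; return its exit code, or None if missing."""
    if shutil.which("tidy") is None:
        return None
    # Tidy exits with 1 for warnings and 2 for errors, so check=True is avoided.
    return subprocess.run(tidy_command(src, dest)).returncode

# Placeholder file names, for illustration only.
print(tidy_command("export.html", "export.xhtml"))
```

An exit code of 1 usually still produces usable output; 2 means Tidy found errors it could not fix on its own.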

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Sep 11 '05 #5
On Sun, 11 Sep 2005, SpaceGirl wrote:
Alan J. Flavell wrote:
[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]

Word XP and upwards stores its documents in XML format doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.

You could probably write your own XSLT to turn it into HTML fairly
easily.


There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.
Sep 11 '05 #6
__/ [Toby Inkster] on Sunday 11 September 2005 10:02 \__
Roy Schestowitz wrote:
* Run HTML Tidy on the resulting HTML (find it at tidy.sourceforge.net)
* Modify output to fit XHTML standards
* Use search & replace for the task above


Tidy can do all of this -- use the "-asxhtml" option.


I didn't know about the existence of this option. Perhaps I am using a
(very) old version of tidy. I wasn't impressed the last time I used it,
which was over a year ago. I must also have thought about complex cases
when I suggested the steps above. Placements of images, for example, might
pose some difficulties, especially if they float.

OO.org will be a decent tool for steering away from non-standard attributes
and hard-coded fonts. The last thing the World Wide Web needs is more code
that is 'made up', which non-MS browsers like Firefox must accept and adapt
to. Sad, yet inevitable.

It sometimes upsets me that kids at school are taught to compose using
WYSIWYG paradigms. It only encourages information to be uninterpretable.
Like Zeldman once said, people used to toss bottles out the car's window
until they realised the impact of carelessness and laziness (misquotation,
but something to that effect anyway).

Roy

--
Roy S. Schestowitz | "Computers are useless. They only solve problems"
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
1:35pm up 17 days 12:03, 3 users, load average: 0.67, 0.94, 0.88
Sep 11 '05 #7
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
On Sun, 11 Sep 2005, SpaceGirl wrote:
Alan J. Flavell wrote:
[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]
Word XP and upwards stores its documents in XML format doesn't it?


So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
You could probably write your own XSLT to turn it into HTML fairly
easily.


There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.


I fully agree with you on that point. Any attempt at rephrasing the same
ideas would only dilute them. As a way forward, I suggest that
the OP, who clearly wants to publish material on the Web, learn LaTeX.
Should the idea of editing raw text seem daunting, I suggest LyX <lyx.org>
[LyX: front-end to LaTeX]. Five minutes with LyX would help anyone realise
the difference and convey the idea, e.g. varying outputs, styles,
imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess". The
presentation I deliver on Wednesday is well-formed XHTML <
http://schestowitz.com/Weblog/archiv...blic-speaking/ > and is
powered by Eric Meyer's S5.

Roy

--
Roy S. Schestowitz | "Software sucks. Open Source sucks less."
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
1:45pm up 17 days 12:13, 3 users, load average: 0.51, 0.58, 0.70
Sep 11 '05 #8
Alan J. Flavell wrote:
You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk.


That is a very nice analogy -- I must try to remember it.

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Sep 11 '05 #9
"Caversham" <ac******@yahoo .com> writes:
Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?


I have a macro "Wrocco" that extracts XML from a document,
including paragraph and character styles and document
properties, but not everything (no formatting or tables).

The VBA source code and some links to other resources can
be found in the project page:

http://www.purl.org/stefan_ram/pub/wrocco_en

If you would use any tool to create XML from Word (including
XHTML), you could then use XSLT to split this into multiple
pages, I assume.
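
As a rough illustration of that splitting step, here is a sketch that uses Python's xml.etree.ElementTree in place of XSLT. It assumes the input is already well-formed XHTML with top-level h1 headings directly inside body; the function name split_on_h1 and the sample document are invented for the example, and anything before the first h1 is simply dropped.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialise XHTML as the default namespace

def split_on_h1(xhtml_text):
    """Split well-formed XHTML into (title, page_text) pairs, one per <h1>."""
    root = ET.fromstring(xhtml_text)
    body = root.find(f"{{{XHTML_NS}}}body")
    sections = []  # each entry: (heading text, [elements in that section])
    for child in body:
        if child.tag == f"{{{XHTML_NS}}}h1":
            sections.append(("".join(child.itertext()).strip(), [child]))
        elif sections:
            sections[-1][1].append(child)
        # content before the first <h1> is ignored in this sketch
    pages = []
    for title, elements in sections:
        html = ET.Element(f"{{{XHTML_NS}}}html")
        head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
        ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title
        page_body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
        page_body.extend(elements)
        pages.append((title, ET.tostring(html, encoding="unicode")))
    return pages

# Invented sample document with two top-level sections.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Book</title></head>'
    "<body><h1>One</h1><p>First.</p><h1>Two</h1><p>Second.</p></body></html>"
)
pages = split_on_h1(sample)
toc = [title for title, _ in pages]  # the titles double as a table of contents
print(toc)
```

Each page could then be written to its own file, with the list of titles used to build an index page linking to them.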

Sep 11 '05 #10
