MS Word to XHTML

Is there any macro / other tool - free or commercial - that can split
long Word docs into multiple XHTML pages?

Any comments on the quality/effectiveness of suitable products also
welcomed.

Sep 11 '05
Hi,

At 12:19:53 on Sunday 11 September 2005, in the groups {microsoft.public.word.vba.general,microsoft.public.word.docmanagement,alt.html,comp.text.xml}, Alan J. Flavell <fl*****@ph.gla.ac.uk> wrote:
Word XP and upwards stores its documents in XML format doesn't it?


So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
You could probably write your own XSLT to turn it into HTML fairly
easily.


There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.


I wholeheartedly applaud your brilliant analysis. You stated your point very clearly.

It's depressing to see what a tiny percentage of people realize (or bother with) the importance of structural markup.

The future does not look bright. I have seen so-called 'IT classes' where they make innocent people believe they are IT experts because they can change the background color of characters typed in Word...

regards,
--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Spread the wiki (http://www.wikipedia.org)
Sep 11 '05 #11
Roy Schestowitz wrote:
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__

On Sun, 11 Sep 2005, SpaceGirl wrote:

Alan J. Flavell wrote:


[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]

Word XP and upwards stores its documents in XML format doesn't it?


So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
Word documents, being style based, are easy to convert. Use XSLT to
strip out all the crap so that all you end up with is basic HTML - <p>'s
and <h>'s. I wasn't suggesting that anything more complicated than that
should be attempted - but I HAVE seen it done pretty successfully with
Word 2003 files. In the case of that client (although I wasn't part of
the team who wrote those tools), their customers would submit Word
documents and the XSLT would convert them into both HTML and PDFs, and
the reproduction was almost perfect (styling and colours anyway).
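
A minimal sketch of the style-based approach described above, written in Python rather than XSLT; it is not the tool that team actually built. It assumes the input is Word 2003 XML (the `w:` namespace below), that headings carry paragraph styles named Heading1 to Heading3 (a hypothetical mapping), and that everything else becomes a plain paragraph. All run-level formatting is discarded.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace used by Word 2003 XML documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Hypothetical style-to-element map; real documents may use other style names.
STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def wordml_to_html(xml_text):
    """Return a list of HTML fragments, one per non-empty Word paragraph."""
    root = ET.fromstring(xml_text)
    fragments = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        # Anything without a recognised heading style falls back to <p>.
        tag = STYLE_TO_TAG.get(style_name, "p")
        # Join the text runs; bold, colour and font information is dropped.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        if text.strip():
            fragments.append(f"<{tag}>{escape(text)}</{tag}>")
    return fragments


SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
    <w:p><w:r><w:t>Cheese &amp; cream.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

for line in wordml_to_html(SAMPLE):
    print(line)
```

On the sample this prints an <h1> followed by a <p>. A paragraph that was merely set in large bold type, with no heading style applied, would come out as an ordinary <p>.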
You could probably write your own XSLT to turn it into HTML fairly
easily.


There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.
That wasn't what I saw, but like I said I wasn't on that team. As far as
I could tell they wrote a simple parser.
You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.


Strange, as I've never had a problem. Generally I have to do it in a
sort of round-robin of programs: first save your Word documents as PDF,
then save the PDF as a web page. It works just fine.

<snip stuff I can't be bothered to read, seeing as everyone else is being
so fucking rude>
--
x theSpaceGirl (miranda)

# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
# this post (c) Miranda Thomas 2005
# explicitly no permission given to Forum4Designers
# to duplicate this post.
Sep 11 '05 #12
__/ [SpaceGirl] on Sunday 11 September 2005 20:46 \__
Roy Schestowitz wrote:
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__

On Sun, 11 Sep 2005, SpaceGirl wrote:
Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything
relevant to say about it.]
Word XP and upwards stores its documents in XML format doesn't it?

So what? XML is only a format for defining markup. If the markup
doesn't do anything meaningful (specifically - if it only creates a
visual result on a printed page, without having any significant
structure) then it's not going to turn into effective HTML: it'd just
be the usual garbage in / garbage out that we're accustomed to with
Word conversions to soi-disant "web" format.
Word documents, being style based, are easy to convert. Use XSLT to
strip out all the crap so that all you end up with is basic HTML - <p>'s
and <h>'s. I wasn't suggesting that anything more complicated than that
should be attempted - but I HAVE seen it done pretty successfully with
Word 2003 files. In the case of that client (although I wasn't part of
the team who wrote those tools), their customers would submit Word
documents and the XSLT would convert them into both HTML and PDFs, and
the reproduction was almost perfect (styling and colours anyway).
You could probably write your own XSLT to turn it into HTML fairly
easily.

There seems to be some kind of conceptual disconnect here. Most Word
documents (in my experience) simply don't contain the necessary
structure for useful conversion to HTML: they've been created as a
purely visual construction for printing onto paper. It's irrelevant
what underlying technology you use (RTF, XML, SGML, whatever) - the
problem is that the source material simply does not represent the
needed structures, *because the document authors do not put it there*.
That wasn't what I saw, but like I said I wasn't on that team. As far as
I could tell they wrote a simple parser.

I believe that's possible, but it depends on the standard that the author
sticks to. Word does not /force/ the author to add structural information.
Hence hacks are possible that leave parts of the document with no
structural markup at all.
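
To illustrate the point about author discipline, here is a rough Python sketch that labels each paragraph of a Word 2003 XML document by whether it carries a heading style or is only bold through direct formatting. The namespace is the Word 2003 one; the "Heading" style-name prefix and the label names are assumptions for illustration, and the bold check is simplified (it ignores `w:b` elements that switch bold off via an attribute).

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def classify_paragraphs(xml_text):
    """Label each paragraph 'styled-heading', 'bold-only' or 'plain'."""
    labels = []
    for para in ET.fromstring(xml_text).iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val", "") if style is not None else ""
        runs = para.findall("w:r", NS)
        # Simplification: a run counts as bold if it has any <w:b> element.
        all_bold = bool(runs) and all(
            r.find("w:rPr/w:b", NS) is not None for r in runs
        )
        if style_name.startswith("Heading"):
            labels.append("styled-heading")
        elif all_bold:
            # Looks like a heading on paper, but nothing in the file says so.
            labels.append("bold-only")
        else:
            labels.append("plain")
    return labels


SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Real heading</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Fake heading</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

print(classify_paragraphs(SAMPLE))  # ['styled-heading', 'bold-only', 'plain']
```

A converter can map the first paragraph to a heading element with confidence; for the second it can only guess, which is where the conversion quality starts to depend on the author rather than the tool.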

You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk. And the kind of "fresh milk" that's needed here is
logically structured text markup. Not visual formatting. Until the
authors of Word documents can grasp that, the prospects for conversion
of Word to web formats are poor, IMHO.


Strange, as I've never had a problem. Generally I have to do it in a
sort of round-robin of programs: first save your Word documents as PDF,
then save the PDF as a web page. It works just fine.

I have had bad experiences converting PDFs to HTML. I even wrote about this
very conversion <http://schestowitz.com/Weblog/archives/2005/05/24/pdf-to-html/>
because I found it frustrating. PDF positions embedded objects to fit the
medium, e.g. A4 paper, so it is bound to lose what is needed for a good
conversion.

<snip stuff I can't be bothered to read, seeing as everyone else is being
so fucking rude>

Are you referring to me? Did I say anything rude? Please clarify if
possible.

Roy
Sep 12 '05 #13
On Sun, 11 Sep 2005, Roy Schestowitz wrote:
To suggest ways forward, I suggest that
the OP, who clearly wants to publish material on the Web, learns LaTeX.
Well, this drifts somewhat off the topic of some of the crossposted
groups, but our physicists are accustomed to writing their
publications in some form of latex, and I can say that when I was
handling the web-ifying of their publications, several years back, I
was (for the most part) getting good results from a program called
latex2html, and most problems were attributable to identifiable
causes, none of which were usually a major hindrance. (Back then we
had to make do with the deplorable HTML version called HTML/3.2, but,
aside from that, the principles seemed right).
Should the idea of editing raw text become daunting, I suggest LyX
<lyx.org> [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
anyone realise the difference and convey the idea, e.g. varying
outputs, styles, imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess".


googled!

It's really the principles which count here: but in practical terms,
I'm sure you're right in aiming at a format which promotes >doing the
right thing< by default - as opposed to one which has prominent
direct-formatting buttons on its user interface, and logical markup as
an apparently advanced topic which, I'm afraid, too many authors
seem to disdain learning.

all the best
Sep 12 '05 #14
[Groups distribution reduced]

__/ [Alan J. Flavell] on Monday 12 September 2005 17:33 \__
On Sun, 11 Sep 2005, Roy Schestowitz wrote:
To suggest ways forward, I suggest that
the OP, who clearly wants to publish material on the Web, learns LaTeX.


Well, this drifts somewhat off the topic of some of the crossposted
groups, but our physicists are accustomed to writing their
publications in some form of latex, and I can say that when I was
handling the web-ifying of their publications, several years back, I
was (for the most part) getting good results from a program called
latex2html, and most problems were attributable to identifiable
causes, none of which were usually a major hindrance. (Back then we
had to make do with the deplorable HTML version called HTML/3.2, but,
aside from that, the principles seemed right).

I use latex2html almost religiously. I estimate that about 1000 pages in my
site are in one form or another a product of latex2html, which has always
produced better output than lyx2html, for example. I discussed latex2html
in depth a couple of days ago and I continue to promote it.

Should the idea of editing raw text become daunting, I suggest LyX
<lyx.org> [LyX: Front-end to LaTeX]. 5 minutes with LyX would help
anyone realise the difference and convey the idea, e.g. varying
outputs, styles, imposition of structure, etc.

Only a few days ago, somebody in the LyX mailing lists mentioned his
upcoming presentation on "Word: What you See Is What a Mess".


googled!

It's really the principles which count here: but in practical terms,
I'm sure you're right in aiming at a format which promotes >doing the
right thing< by default - as opposed to one which has prominent
direct-formatting buttons on its user interface, and logical markup as
an apparently advanced topic which, I'm afraid, too many authors
seem to disdain learning.

all the best

Only last night I was in a similar position involving my supervisor who
heads the Computer Science Department [I believe it is sensible to make
this public given the nature of the discussion]. For a Windows-centric
person like him, who uses Office almost exclusively, it was difficult
to satisfy a Linux-dominated department. Conversion of a Word document to
HTML, also to be embedded in E-mail (I must bite my tongue) was never a
good idea. The final outcome is a PDF attachment with hyperlinks. My
arguments about standards, structure-based composition and the like seem to
have led to this result, which I suspect many will be satisfied with.

Best Wishes,

Roy

--
Roy S. Schestowitz | "Avoid missing ball for higher score"
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
6:10pm up 18 days 13:16, 3 users, load average: 0.66, 0.29, 0.29
Sep 12 '05 #15
Toby Inkster wrote:
Alan J. Flavell wrote:
You might as well try to convert cheese into fresh cream: both are
fine milk products, it's true, but instead of trying to convert the
one into the other, you'd do better to produce them both starting from
fresh milk.


That is a very nice analogy -- I must try to remember it.


The others in common use are

Turning hamburgers back into cows
Turning scrambled eggs back into chickens

///Peter

Sep 13 '05 #16
