clean up html document created by Word

jd
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff

Mar 30 '07 #1
9 Replies


On Mar 30, 12:20 pm, "jd" <chima...@gmail.com> wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
You could try Beautiful Soup at http://www.crummy.com/software/Beaut...mentation.html

Python is good for parsing HTML/XML, so you could also try googling
for Python HTML parsing.

Mike

Mar 30 '07 #2
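A minimal sketch of the parsing approach, offered only as an illustration. It does not use Beautiful Soup (a third-party package); it swaps in the standard library's html.parser instead. The class name WordHtmlFilter, the helper clean_word_html, and the rules for what counts as Word cruft (namespaced tags such as <o:p>, class names starting with "Mso", inline style attributes, comments) are assumptions of this sketch, not something taken from the thread.

```python
from html import escape
from html.parser import HTMLParser


class WordHtmlFilter(HTMLParser):
    """Re-emit HTML while dropping markup that looks Word-specific (heuristic)."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _is_office_tag(tag):
        # Assumption: namespaced tags (<o:p>, <v:shape>, <w:...>) come from Word.
        return ":" in tag

    @staticmethod
    def _keep_attr(name, value):
        if name in ("style", "lang") or name.startswith("xmlns"):
            return False
        if name == "class" and (value or "").startswith("Mso"):
            return False
        return True

    def _emit_tag(self, tag, attrs, closing=""):
        kept = [(n, v) for n, v in attrs if self._keep_attr(n, v)]
        text = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in kept
        )
        self.out.append(f"<{tag}{text}{closing}>")

    def handle_starttag(self, tag, attrs):
        if not self._is_office_tag(tag):
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if not self._is_office_tag(tag):
            self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        if not self._is_office_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    # handle_comment and handle_decl are not overridden, so comments
    # (including Word's conditional comments) and the doctype are dropped.


def clean_word_html(text):
    parser = WordHtmlFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Calling clean_word_html(open("word.html").read()) would return the filtered markup as a string (the file name is hypothetical). Note that the contents of <style> blocks pass through untouched, so the style sheet Word embeds would still need separate handling.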

jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
The non-python solution:

http://www.w3.org/People/Raggett/tidy/

Peter
Mar 30 '07 #3

jkn
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm interested!

jon N

Mar 30 '07 #4

jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
From that very page I linked to:

"""
Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word
bulks out HTML files with stuff for round-tripping presentation between
HTML and Word. If you are more concerned about using HTML on the Web, check
out Tidy's "Word-2000" config option! Of course Tidy does a good job on
Word'97 files as well!
"""

Peter

Mar 30 '07 #5
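For anyone who wants to drive Tidy from Python, here is a hedged sketch that shells out to the command-line tool with the Word-2000 option quoted above. It assumes a tidy executable is installed and on the PATH; the function names and the file names in the usage note are made up for illustration.

```python
import shutil
import subprocess


def build_tidy_command(src, dest):
    """Build the argument list for a quiet Tidy run with Word-2000 cleanup."""
    # -q: suppress non-essential output; -o: write the result to dest.
    return ["tidy", "-q", "--word-2000", "yes", "-o", dest, src]


def run_tidy(src, dest):
    """Run Tidy and return its exit status (0 = ok, 1 = warnings, 2 = errors)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # No check=True: Tidy reports plain warnings with exit status 1.
    return subprocess.run(build_tidy_command(src, dest)).returncode
```

Something like run_tidy("word.html", "clean.html") would then write the cleaned file next to the original. The same setting can go in a Tidy config file as "word-2000: yes".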

Tidy can now perform wonders on HTML saved from Microsoft Word 2000!
Word bulks out HTML files with stuff for round-tripping presentation
between HTML and Word. If you are more concerned about using HTML on the
Web, check out Tidy's "Word-2000"
<http://www.w3.org/People/Raggett/tidy/#word2000> config option! Of
course Tidy does a good job on Word'97 files as well!
-- source: http://www.w3.org/People/Raggett/tidy/

jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm interested!

jon N

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy
Mar 30 '07 #6

jd:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
It's not an easy job, and it may require some manual editing, because
that HTML is the worst I have seen. You can use Tidy (there is a GUI
too) and follow its suggestions to manually remove the offending
things; in the end Tidy is able to digest the file and return
cleaned-up HTML. But then you have only just started: you need to
process it even more.

One solution is to avoid creating the HTML with Word in the first
place, or to use something more like Word 97 to create it. Dreamweaver
can also help with the trashy HTML from Word 2000+, but usually not
enough.

If the structure of the HTML document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with WordPad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Arachnophilia, etc.) to
convert it to HTML. In general, I've never found a really good way to
convert RTF to very good HTML.

Bye,
bearophile

Mar 30 '07 #7

jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
demoroniser - correct moronic and gratuitously incompatible HTML generated
by Microsoft applications

http://www.fourmilab.ch/webtools/demoroniser/

Unless you want to write your own with Python.

Mar 30 '07 #8

jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
There is a Microsoft add-on for Word called 'HTML filter' which helps
to reduce the mess. Get it here:

http://www.microsoft.com/downloads/d...displaylang=EN

Run it first, and then apply the other 'cleaning' methods suggested
in this thread.

Claudio
Mar 30 '07 #9

jd
Wow, thanks for all the great responses!

Here's my summary:

- demoroniser (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/
- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/d...displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with the "Word-2000" configuration: It's already bundled in with
my editor (PSPad), so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool), but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/

-- jeff

Mar 31 '07 #10
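As a rough illustration of the "blank paragraph" point in the summary above, the following sketch strips paragraphs that contain nothing but &nbsp;, whitespace, line breaks, or empty <o:p> wrappers. The pattern and the function name drop_blank_paragraphs are assumptions of this sketch; a regular expression is a blunt tool for HTML and may well miss variants that Word produces.

```python
import re

# A paragraph whose only content is whitespace, &nbsp;, <br>, or <o:p> tags.
BLANK_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|\u00a0|</?o:p>|<br\s*/?>)*</p>\s*",
    re.IGNORECASE,
)


def drop_blank_paragraphs(html_text):
    """Remove spacer paragraphs such as <p>&nbsp;</p> (heuristic)."""
    return BLANK_PARAGRAPH.sub("", html_text)
```

It could be applied to the output of the Microsoft tool before or after running Tidy. Paragraphs that carry real text are left alone because the pattern only matches when nothing else sits between the opening and closing tags.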