473,386 Members | 1,819 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

clean up html document created by Word

jd
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff

Mar 30 '07 #1
9 3454
On Mar 30, 12:20 pm, "jd" <chima...@gmail.comwrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
You could try Beautiful Soup at http://www.crummy.com/software/Beaut...mentation.html

Python is good for parsing HTML/XML, so you could also try googling
Python parsing as well.

Mike

Mar 30 '07 #2
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
The non-python solution:

http://www.w3.org/People/Raggett/tidy/

Peter
Mar 30 '07 #3
jkn
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm intererested!

jon N

Mar 30 '07 #4
jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
From that very page I linked to:

"""
Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word
bulks out HTML files with stuff for round-tripping presentation between
HTML and Word. If you are more concerned about using HTML on the Web, check
out Tidy's "Word-2000" config option! Of course Tidy does a good job on
Word'97 files as well!
"""

Peter

Mar 30 '07 #5
Tidy can now perform wonders on HTML saved from Microsoft Word 2000!
Word bulks out HTML files with stuff for round-tripping presentation
between HTML and Word. If you are more concerned about using HTML on the
Web, check out Tidy's "Word-2000"
<http://www.w3.org/People/Raggett/tidy/#word2000config option! Of
course Tidy does a good job on Word'97 files as well!
-- source: http://www.w3.org/People/Raggett/tidy/

jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm intererested!

jon N

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy
Mar 30 '07 #6
jd:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
It's not an easy job, and it may require some manual editing, because
that html is the worst I have seen. You can use Tidy, there is a GUI
too, and you can use its suggestions to manually remove the offending
things, at the end Tidy is able to digest it, and return a cleaned up
html. But then you have just started, you need to process it even
more.

A solution is to avoid creating the Html in the first place, or to use
something more like Word 97 to create it. Dreamweaver too is able to
help with Word2000+ trashy html, but usually not enough.

If the structure of the Html document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with Wordpad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Aracnophobia, etc) to
convert it to Html. Generally I've never found a really good way to
convert Rtf to a very good Html.

Bye,
bearophile

Mar 30 '07 #7
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
demoroniser - correct moronic and gratuitously incompatible HTML generated
by Microsoft applications

http://www.fourmilab.ch/webtools/demoroniser/

Unless you want to write your own with Python.

Mar 30 '07 #8
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
There is a Microsoft add-on for Word which helps to reduce the mess
called 'HTML filter'. Go for it here:

http://www.microsoft.com/downloads/d...displaylang=EN

run it and then use afterwards the other in this thread suggested
'cleaning' methods.

Claudio
Mar 30 '07 #9
jd
Wow, thanks for all the great responses!

Here's my summary:

- demoronizer (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/
- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/d...displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with Win2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use a more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/

-- jeff

Mar 31 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Travis Pupkin | last post by:
Hi, I use a WYSIWYG rich text editor on a few web sites for clients to manage their content, but a lot of them use word and paste into the box and end up with a lot of crappy code as a result. ...
20
by: Al Moritz | last post by:
Hi all, I was always told that the conversion of Word files to HTML as done by Word itself sucks - you get a lot of unnecessary code that can influence the design on web browsers other than...
8
by: Omey Samaroo | last post by:
Dear Access Gurus, What I would like to do is create a word document that merges some fields in a query. The next step is to have the document e-mailed to a group of users. What I have done so...
0
by: Niyazi | last post by:
Hi, I created application that store the data in SQL SERVER that reside on network. The client also use this application to access the resources provided with application. But is the client want...
2
by: s.danyal.k | last post by:
Hi All, I have created an application in C# that converts HTML file to MS word documents. The HTML file may also have images , for e.g "<img src='http://www.google.com.pk/images/hp0.gif'></img>"....
19
by: thisis | last post by:
Hi All, i have this.asp page: <script type="text/vbscript"> Function myFunc(val1ok, val2ok) ' do something ok myFunc = " return something ok" End Function </script>
232
by: robert maas, see http://tinyurl.com/uh3t | last post by:
I'm working on examples of programming in several languages, all (except PHP) running under CGI so that I can show both the source files and the actually running of the examples online. The first...
10
by: Greg Lovern | last post by:
I have a very large html table created by MS Word, saved as it's "Web Page, Filtered" file type. Every html table cell has lots of formatting tags. Most of the file size is that formatting. Is...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.