clean up html document created by Word

jd
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff

Mar 30 '07 #1
On Mar 30, 12:20 pm, "jd" <chima...@gmail.com> wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
You could try Beautiful Soup at http://www.crummy.com/software/Beaut...mentation.html

Python is good for parsing HTML/XML, so you could also try googling
"Python parsing".

Mike

Mar 30 '07 #2
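
For anyone wanting to try the parsing route in Python, here is a minimal sketch of the idea. It uses the standard library's html.parser rather than Beautiful Soup so that it runs without extra installs. The class name, the function name and the set of attributes to drop are my own assumptions, not anything from Word or Beautiful Soup, and it assumes you want bare structural markup with no Word styling left at all.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the attributes worth dropping; adjust to taste.
WORD_ATTRS = {"class", "style", "lang"}


class WordCruftStripper(HTMLParser):
    """Re-emit HTML while dropping Word-specific tags, attributes and comments."""

    def __init__(self):
        # Keep entities such as &nbsp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _is_word_tag(tag):
        # Namespaced tags such as <o:p> or <v:shape> come from Office.
        return ":" in tag

    def handle_starttag(self, tag, attrs):
        if self._is_word_tag(tag):
            return
        kept = [
            (name, value) for name, value in attrs
            if name not in WORD_ATTRS and ":" not in name
        ]
        parts = [tag] + [
            name if value is None else f'{name}="{escape(value, quote=True)}"'
            for name, value in kept
        ]
        self.out.append("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit the opening tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if not self._is_word_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        # Keep the doctype.
        self.out.append(f"<!{decl}>")

    # handle_comment is deliberately not overridden, so comments are dropped,
    # including <!--[if gte mso 9]> ... <![endif]--> blocks.


def strip_word_cruft(html_text):
    parser = WordCruftStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

This leaves empty <span> wrappers and the <style> block in place; removing those would need a second pass.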
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
The non-python solution:

http://www.w3.org/People/Raggett/tidy/

Peter
Mar 30 '07 #3
jkn
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm interested!

jon N

Mar 30 '07 #4
jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
From that very page I linked to:

"""
Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word
bulks out HTML files with stuff for round-tripping presentation between
HTML and Word. If you are more concerned about using HTML on the Web, check
out Tidy's "Word-2000" config option! Of course Tidy does a good job on
Word'97 files as well!
"""

Peter

Mar 30 '07 #5
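
If you want to drive Tidy from Python, a minimal sketch could look like the following. It assumes a `tidy` executable is on your PATH; the function names are hypothetical, and the option set is just one plausible choice built around the "Word-2000" option quoted above.

```python
import subprocess


def build_tidy_command(src_path, dest_path):
    """Return the argument list for one Tidy run with Word clean-up enabled."""
    return [
        "tidy",
        "-quiet",                 # suppress the summary chatter
        "--word-2000", "yes",     # the "Word-2000" option quoted above
        "--clean", "yes",         # replace presentational tags with style rules
        "--show-warnings", "no",
        "-output", dest_path,     # write the result here instead of stdout
        src_path,
    ]


def run_tidy(src_path, dest_path):
    """Run Tidy; exit status 0 or 1 (warnings only) counts as success."""
    result = subprocess.run(build_tidy_command(src_path, dest_path))
    if result.returncode > 1:
        raise RuntimeError(f"tidy reported errors (exit status {result.returncode})")
    return dest_path
```

Calling run_tidy("report.htm", "report_clean.htm") would then write the cleaned file next to the original (both file names are made up for the example).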
Tidy can now perform wonders on HTML saved from Microsoft Word 2000!
Word bulks out HTML files with stuff for round-tripping presentation
between HTML and Word. If you are more concerned about using HTML on the
Web, check out Tidy's "Word-2000"
<http://www.w3.org/People/Raggett/tidy/#word2000> config option! Of
course Tidy does a good job on Word'97 files as well!
-- source: http://www.w3.org/People/Raggett/tidy/

jkn wrote:
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.

If Beautiful Soup does, then I'm interested!

jon N

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy
Mar 30 '07 #6
jd:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
It's not an easy job, and it may require some manual editing, because
that HTML is the worst I have seen. You can use Tidy (there is a GUI
too) and follow its suggestions to manually remove the offending
things; in the end Tidy is able to digest the file and return cleaned-up
HTML. But then you have only just started: you need to process it even
more.

One solution is to avoid creating the HTML in the first place, or to use
something more like Word 97 to create it. Dreamweaver can also
help with the trashy HTML from Word 2000+, but usually not enough.

If the structure of the HTML document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with WordPad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Arachnophilia, etc.) to
convert it to HTML. In general, I've never found a really good way to
convert RTF to very good HTML.

Bye,
bearophile

Mar 30 '07 #7
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
demoroniser - correct moronic and gratuitously incompatible HTML generated
by Microsoft applications

http://www.fourmilab.ch/webtools/demoroniser/

Unless you want to write your own with Python.

Mar 30 '07 #8
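
For the "write your own with Python" route, here is a small sketch of one thing demoroniser is known for: turning Microsoft's "smart" punctuation into plain ASCII. The mapping and the function name are my own assumptions, and this covers only a small part of what the real tool does.

```python
# Assumption: the text has already been decoded (for example with
# open(path, encoding="cp1252")), so the Windows-1252 punctuation
# shows up as these Unicode characters.
CP1252_PUNCTUATION = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
}

_TABLE = str.maketrans(CP1252_PUNCTUATION)


def plain_punctuation(text):
    """Replace Word's typographic punctuation with ASCII equivalents."""
    return text.translate(_TABLE)
```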
jd wrote:
I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.

Thanks...

-- jeff
There is a Microsoft add-on for Word called 'HTML filter' that helps to
reduce the mess. Get it here:

http://www.microsoft.com/downloads/d...displaylang=EN

Run it, and afterwards use the other 'cleaning' methods suggested in
this thread.

Claudio
Mar 30 '07 #9
jd
Wow, thanks for all the great responses!

Here's my summary:

- demoroniser (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/
- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/d...displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with the Word-2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/

-- jeff

Mar 31 '07 #10
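
The "blank" paragraphs mentioned in the summary can be stripped in a follow-up pass. A minimal sketch, assuming they look like <p>&nbsp;</p> with nothing nested inside the paragraph (the pattern and function name are hypothetical):

```python
import re

# Matches a paragraph whose only content is whitespace and/or non-breaking
# spaces, written as &nbsp;, &#160; or the literal U+00A0 character.
BLANK_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;|\u00a0)*</p>\s*",
    re.IGNORECASE,
)


def drop_blank_paragraphs(html_text):
    """Remove spacer paragraphs and the whitespace that follows them."""
    return BLANK_PARAGRAPH.sub("", html_text)
```

A paragraph that wraps the &nbsp; in a <span> or similar would not match; handling that reliably is a job for a real parser rather than a regular expression.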
