I am looking for python code (working or sample code) that can take an
html document created by Microsoft Word and clean it up (if you've
never had to look at a Word-generated html document, consider yourself
lucky ;-) Alternatively, if you know of a non-python solution, I'd
like to hear about it.
Thanks...
-- jeff
On Mar 30, 12:20 pm, "jd" <chima...@gmail.com> wrote:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-) Alternatively, if you know of a non-python solution, I'd
> like to hear about it.
> Thanks...
> -- jeff
You could try Beautiful Soup at http://www.crummy.com/software/Beaut...mentation.html
Python is good for parsing HTML/XML, so you could also try googling
Python parsing as well.
Mike
jd wrote:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-) Alternatively, if you know of a non-python solution, I'd
> like to hear about it.
The non-python solution: http://www.w3.org/People/Raggett/tidy/
Peter
IIUC, the original poster is asking about 'cleaning up' in the sense
of removing the swathes of unnecessary and/or redundant 'cruft' that
Word puts in there, rather than making valid HTML out of invalid HTML.
Again, IIUC, HTMLtidy does not do this.
If Beautiful Soup does, then I'm interested!
jon N
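For what it's worth, the kind of cruft removal jkn describes can be sketched with nothing but the standard library. This is not Beautiful Soup, and the patterns below are assumptions about what typical Word output contains (conditional comments, `o:`/`v:`/`w:`/`st1:` tags, `Mso*` classes, inline styles, `xmlns` attributes), not a complete list. `strip_word_cruft` is a hypothetical helper name.

```python
import re

# Illustrative patterns for common Word-generated markup (assumed, not exhaustive).
CONDITIONAL_COMMENT = re.compile(r'<!--\[if.*?<!\[endif\]-->', re.DOTALL | re.IGNORECASE)
OFFICE_TAG = re.compile(r'</?(?:o|v|w|st1):[^>]*>', re.IGNORECASE)
MSO_CLASS = re.compile(r'\s+class=(?:"Mso[^"]*"|Mso[^\s>]*)', re.IGNORECASE)
STYLE_ATTR = re.compile(r'\s+style=(?:"[^"]*"|\'[^\']*\')', re.IGNORECASE)
XMLNS_ATTR = re.compile(r'\s+xmlns(?::\w+)?="[^"]*"', re.IGNORECASE)


def strip_word_cruft(html):
    """Remove Word-specific markup, keeping the basic tags and text."""
    # Conditional comments go first, since they can contain the other patterns.
    # Note that STYLE_ATTR drops every inline style, not only the mso-* ones.
    for pattern in (CONDITIONAL_COMMENT, OFFICE_TAG, MSO_CLASS, STYLE_ATTR, XMLNS_ATTR):
        html = pattern.sub('', html)
    return html


sample = ('<p class=MsoNormal style="mso-margin-top-alt:auto">Hello<o:p></o:p></p>'
          '<!--[if gte mso 9]><xml>junk</xml><![endif]-->')
print(strip_word_cruft(sample))
```

Regular expressions are a blunt instrument for HTML. A real parser is the safer choice for anything beyond a quick pass, but the sketch shows which pieces of the markup are the removable ones.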
jkn wrote:
> IIUC, the original poster is asking about 'cleaning up' in the sense
> of removing the swathes of unnecessary and/or redundant 'cruft' that
> Word puts in there, rather than making valid HTML out of invalid HTML.
> Again, IIUC, HTMLtidy does not do this.
From that very page I linked to:
"""
Tidy can now perform wonders on HTML saved from Microsoft Word 2000! Word
bulks out HTML files with stuff for round-tripping presentation between
HTML and Word. If you are more concerned about using HTML on the Web, check
out Tidy's "Word-2000" config option! Of course Tidy does a good job on
Word'97 files as well!
"""
Peter
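If you want to drive Tidy from Python, one way is to shell out to it. This is a minimal sketch, assuming a `tidy` executable is on the PATH. It passes the "Word-2000" config option mentioned above on the command line as `--word-2000 yes`. The helper names `tidy_word_command` and `run_tidy` are made up for illustration.

```python
import subprocess


def tidy_word_command(path, tidy_exe="tidy"):
    """Build a Tidy command line with the Word-2000 cleanup switched on."""
    # -q suppresses the non-essential summary output.
    return [tidy_exe, "-q", "--word-2000", "yes", path]


def run_tidy(path):
    """Run Tidy on a file and return the cleaned HTML as a string."""
    result = subprocess.run(tidy_word_command(path), capture_output=True, text=True)
    # Tidy exits with 1 when it only had warnings and 2 when it hit errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Building the argument list in its own function keeps it easy to add further options from a Tidy config file later without touching the code that runs the process.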
Tidy can now perform wonders on HTML saved from Microsoft Word 2000!
Word bulks out HTML files with stuff for round-tripping presentation
between HTML and Word. If you are more concerned about using HTML on the
Web, check out Tidy's "Word-2000"
<http://www.w3.org/People/Raggett/tidy/#word2000> config option! Of
course Tidy does a good job on Word'97 files as well!
-- source: http://www.w3.org/People/Raggett/tidy/
jkn wrote:
> IIUC, the original poster is asking about 'cleaning up' in the sense
> of removing the swathes of unnecessary and/or redundant 'cruft' that
> Word puts in there, rather than making valid HTML out of invalid HTML.
> Again, IIUC, HTMLtidy does not do this.
> If Beautiful Soup does, then I'm interested!
> jon N
--
Shane Geiger
IT Director
National Council on Economic Education sg*****@ncee.net | 402-438-8958 | http://www.ncee.net
Leading the Campaign for Economic and Financial Literacy
jd:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-) Alternatively, if you know of a non-python solution, I'd
> like to hear about it.
It's not an easy job, and it may require some manual editing, because
that HTML is the worst I have seen. You can use Tidy (there is a GUI
too) and follow its suggestions to remove the offending things by hand;
eventually Tidy is able to digest the file and return cleaned-up
HTML. But then you have only just started: you need to process it even
more.
One solution is to avoid creating the HTML in the first place, or to use
something more like Word 97 to create it. Dreamweaver can also
help with trashy Word 2000+ HTML, but usually not enough.
If the structure of the HTML document is simple enough, and assuming
you are using Windows, you can open it with Word, save it as RTF,
reopen it with WordPad, save it again to remove some trash, and then
use something else (like Word 97, or maybe even Arachnophilia, etc.) to
convert it to HTML. In general, though, I've never found a really good
way to convert RTF to very good HTML.
Bye,
bearophile
jd wrote:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-) Alternatively, if you know of a non-python solution, I'd
> like to hear about it.
> Thanks...
> -- jeff
demoroniser - correct moronic and gratuitously incompatible HTML generated
by Microsoft applications http://www.fourmilab.ch/webtools/demoroniser/
Unless you want to write your own with Python.
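If you do roll your own, one class of problem that demoroniser is known for fixing is Microsoft's "smart" punctuation. Here is a minimal sketch of that single step in Python. The mapping is a small illustrative subset chosen for this example, not demoroniser's actual table, and `plain_punctuation` is a hypothetical name.

```python
# A few Windows-1252 punctuation characters common in Word output, mapped to
# plain ASCII. Illustrative subset only (assumed, not taken from demoroniser).
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

_TABLE = str.maketrans(SMART_PUNCTUATION)


def plain_punctuation(text):
    """Replace Word's typographic punctuation with ASCII equivalents."""
    return text.translate(_TABLE)


print(plain_punctuation("\u201cHi\u201d \u2014 it\u2019s done\u2026"))
```

This assumes the document has already been decoded to a Python string. If the file is really Windows-1252, open it with `encoding="cp1252"` first.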
jd wrote:
> I am looking for python code (working or sample code) that can take an
> html document created by Microsoft Word and clean it up (if you've
> never had to look at a Word-generated html document, consider yourself
> lucky ;-) Alternatively, if you know of a non-python solution, I'd
> like to hear about it.
> Thanks...
> -- jeff
There is a Microsoft add-on for Word called 'HTML filter' that helps to
reduce the mess. Get it here: http://www.microsoft.com/downloads/d...displaylang=EN
Run it, and afterwards use the other 'cleaning' methods suggested in
this thread.
Claudio
Wow, thanks for all the great responses!
Here's my summary:
- demoroniser (from John Walker) is designed to solve some very
particular problems that could be considered bugs. However, it
doesn't remove the unnecessary html generated by Word. http://www.fourmilab.ch/webtools/demoroniser/
- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html". The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it. Both approaches preserve the
"blank" paragraphs (basically, <p> </p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document. http://www.microsoft.com/downloads/d...displaylang=EN
BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.
- Tidy with the Word-2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features). The tidy output could use more whitespace to
improve html readability, but I assume I can change the config file to
do this. No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up. http://www.w3.org/People/Raggett/tidy/
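The "blank" paragraphs left behind by the Microsoft tool could be dropped in a small post-processing step. A minimal sketch, assuming the blank paragraphs contain only whitespace, `&nbsp;`, a literal non-breaking space, or `<br>` tags; `drop_blank_paragraphs` is a hypothetical helper name.

```python
import re

# A <p> element whose only content is whitespace, &nbsp;, a literal
# non-breaking space, or <br> tags. Trailing whitespace is removed as well.
BLANK_PARAGRAPH = re.compile(
    r'<p\b[^>]*>(?:\s|&nbsp;|\u00a0|<br\s*/?>)*</p>\s*',
    re.IGNORECASE,
)


def drop_blank_paragraphs(html):
    """Remove paragraphs that exist only to add vertical spacing."""
    return BLANK_PARAGRAPH.sub('', html)


page = '<p>One</p>\n<p>&nbsp;</p>\n<p class="x"> </p>\n<p>Two</p>'
print(drop_blank_paragraphs(page))
```

Paragraphs where the blank is wrapped in another element (for example a `<span>` around the `&nbsp;`) are not caught by this pattern and would need a parser-based pass.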
-- jeff