Re: How to use win32com to convert a MS WORD doc to HTML ?

Lave wrote:
Hi, all !

I'm a total newbie :)

I want to convert MS Word docs to HTML. I found that the Python Windows
extension win32com can do this, but I can't find the method, and I
can't find any helpful documentation.

You have broadly two approaches here, both
involving automating Word (i.e. using the
COM object model it exposes, referred to
in another post in this thread).

1) Use the COM model to have Word load your
doc, and SaveAs it in HTML format. Advantage:
it's relatively straightforward. Disadvantage:
you're at the mercy of whatever HTML Word emits.

2) Use the COM model to iterate over the paragraphs
in your document, emitting your own HTML. Advantage:
you get control. Disadvantage: the more complex your
doc, the more work you have to do. (What do you do with
images, for example? Internal links?)

To do the first, just record a macro in Word to
do what you want and then reproduce the macro
in Python. Something like this:

<code>
import win32com.client

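# GetObject has Word open the file and returns its Document object
doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
# FileFormat=8 is Word's wdFormatHTML constant
doc.SaveAs (FileName="c:/data/temp/songs.html", FileFormat=8)
doc.Close ()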

</code>
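
If you need this for more than one file, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: it starts its own Word instance with DispatchEx rather than using GetObject, it uses Python 3, and the names html_path_for and convert_to_html are made up for illustration. The value 10 is Word's wdFormatFilteredHTML constant, which usually gives leaner markup than 8.

```python
import os

WD_FORMAT_HTML = 8            # wdFormatHTML, the value used above
WD_FORMAT_FILTERED_HTML = 10  # wdFormatFilteredHTML, less Office-specific markup


def html_path_for(doc_path):
    """Return doc_path with its extension swapped for .html (hypothetical helper)."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".html"


def convert_to_html(doc_path, file_format=WD_FORMAT_HTML):
    """Have Word open doc_path and save an HTML copy next to it (Windows only)."""
    # Imported here so html_path_for stays usable without pywin32 installed.
    import win32com.client

    # Word resolves relative paths against its own working directory,
    # so hand it absolute paths.
    doc_path = os.path.abspath(doc_path)
    out_path = html_path_for(doc_path)

    # Assumption: a private Word instance, so Quit() does not close
    # documents the user already has open.
    word = win32com.client.DispatchEx("Word.Application")
    try:
        doc = word.Documents.Open(doc_path)
        try:
            doc.SaveAs(out_path, FileFormat=file_format)
        finally:
            doc.Close(False)  # False: do not prompt to save changes
    finally:
        word.Quit()
    return out_path
```

Calling convert_to_html ("c:/data/temp/songs.doc") should then leave songs.html beside the original, assuming Word is installed on the machine.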

To do the second, you have to roll your own html
doc. Crudely, this would do it:

<code>
import codecs
import win32com.client

doc = win32com.client.GetObject ("c:/data/temp/songs.doc")
with codecs.open ("c:/data/temp/s2.html", "w", encoding="utf8") as f:
    f.write ("<html><body>")
    # One <p> per Word paragraph, with the Word style name as its CSS class
    for para in doc.Paragraphs:
        text = para.Range.Text
        style = para.Style.NameLocal
        f.write ('<p class="%(style)s">%(text)s</p>\n' % locals ())
    f.write ("</body></html>")

doc.Close ()

</code>
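
One thing the crude version glosses over: the paragraph text goes into the HTML as-is, so a stray "<" or "&" in the document will break the markup, and a style name containing spaces (such as "Heading 1") turns into several CSS classes. A minimal sketch of a per-paragraph helper that deals with both is below. It assumes Python 3 (for html.escape), and css_class and para_to_html are hypothetical names, not part of win32com. It also strips the trailing "\r" that Word puts at the end of a paragraph's Range.Text (and the "\x07" that ends table cells).

```python
import html
import re


def css_class(style_name):
    """Turn a Word style name into a single CSS-friendly class name."""
    # Collapse anything that is not a letter, digit, underscore or hyphen.
    return re.sub(r"[^A-Za-z0-9_-]+", "-", style_name).strip("-")


def para_to_html(style_name, text):
    """Render one paragraph as an escaped <p> element."""
    # Drop Word's end-of-paragraph and end-of-cell markers.
    text = text.rstrip("\r\n\x07")
    return '<p class="%s">%s</p>' % (css_class(style_name), html.escape(text))


if __name__ == "__main__":
    print(para_to_html("Heading 1", "Tom & Jerry <live>\r"))
```

In the loop above you would then write para_to_html (style, text) followed by a newline, instead of the formatted string.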

TJG
Aug 19 '08 #1