.doc to HTML and PDF conversion with Python

Alexander Klingenstein

I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately. Once I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

Is there a way to do what I want, either with a Python lib, a 3rd party app, or maybe by remote controlling Word (a la VBA) and "printing" to PDF with a distiller?
I already tried wvWare from GnuWin32, but it has problems with big image files embedded in the .doc file (looks like an mmap error).

Alex
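
For reference, the PDF-to-HTML step described above can be driven from Python without going through a shell. This is only a sketch: it assumes the pdftohtml binary is installed and on the PATH, it uses subprocess in place of popen, and the helper names pdftohtml_command and pdf_to_html are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_dir):
    """Build the argument list for one pdftohtml run (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    out_base = Path(out_dir) / pdf_path.stem
    # -noframes asks pdftohtml for a single HTML file instead of a frameset.
    return ["pdftohtml", "-noframes", str(pdf_path), str(out_base)]


def pdf_to_html(pdf_path, out_dir):
    """Run pdftohtml and return the HTML path it is expected to write."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = pdftohtml_command(pdf_path, out_dir)
    # check=True raises CalledProcessError if pdftohtml exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1] + ".html")


if __name__ == "__main__":
    # Assumes a file named example.pdf exists in the current directory.
    print(pdf_to_html("example.pdf", "html_out"))
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.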

Oct 14 '06 #1

Luap777

Alexander Klingenstein wrote:

I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately. Once I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.

Is there some reason you really want to convert to PDF first? You can
get much better HTML right from the Word doc. You'll lose a lot of info
going from PDF to HTML.

Something like this can open doc in Word, save as HTML, then close doc.

import os, win32com.client

# EnsureDispatch generates the type library wrappers, which is what
# populates win32com.client.constants (plain Dispatch may leave it empty).
wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
wdApp.Visible = 1

def SaveDocAsHTML(docPath, htmlPath):
    # Word resolves relative paths against its own working directory,
    # so pass absolute paths.
    doc = wdApp.Documents.Open(os.path.abspath(docPath))
    # See
    # mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
    # in Word VBA help doc for more info.

    # Saves all text and formatting with HTML tags so that the
    # resulting document can be viewed in a Web browser.
    doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatHTML)
    # Saves text with HTML tags with minimal cascading style sheet
    # formatting. The resulting document can be viewed in a Web browser.
    #doc.SaveAs(os.path.abspath(htmlPath),
    #           win32com.client.constants.wdFormatFilteredHTML)
    doc.Close()

And if you aren't satisfied with the ugly HTML you're likely to get,
you can try running µTidylib (http://utidylib.berlios.de/) on the
output after this step also.
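
If PDF is still needed alongside the HTML, the same COM route might cover it. The sketch below rests on assumptions that are not from this thread: it needs a Word version that can save PDF itself (Word 2007 with the Save as PDF add-in, or newer; Word 2003 and earlier cannot, so a PDF printer driver would be required there), it hard-codes the WdSaveFormat value wdFormatPDF = 17, and the names pdf_path_for and convert_folder_to_pdf are invented for the example.

```python
import os

WD_FORMAT_PDF = 17  # WdSaveFormat.wdFormatPDF; assumes Word 2007 or newer


def pdf_path_for(doc_path, out_dir):
    """Return the absolute PDF path for a .doc file (hypothetical helper)."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.abspath(os.path.join(out_dir, base + ".pdf"))


def convert_folder_to_pdf(in_dir, out_dir):
    """Open every .doc in in_dir with Word and save it as PDF in out_dir."""
    # pywin32 is imported here so pdf_path_for stays usable without it.
    import win32com.client

    os.makedirs(out_dir, exist_ok=True)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for name in sorted(os.listdir(in_dir)):
            if not name.lower().endswith(".doc"):
                continue
            src = os.path.abspath(os.path.join(in_dir, name))
            doc = word.Documents.Open(src)
            try:
                doc.SaveAs(pdf_path_for(name, out_dir), FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close(False)  # close without saving changes to the .doc
    finally:
        word.Quit()


if __name__ == "__main__":
    # Folder names are placeholders; requires Windows, Word and pywin32.
    convert_folder_to_pdf("docs_in", "pdf_out")
```

The try/finally blocks are there so that a failure on one document does not leave a hidden Word process running.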

Thank you,
Paul

Oct 14 '06 #2

Eric_Dexter

Google won't do a good job with .doc files, but they may do PDF to HTML
and back. It's one document at a time, though; I only mention it to make
fun of them. Here is my resume, converted from a monster.com .doc file:

http://docs.google.com/View?docid=dftrj73t_3cfwjdv

Oct 15 '06 #3
