Parsing MS Word docs -- tutorial request

bp.tralfamadore

All,

I am trying to write a script that will parse and extract data from an
MS Word document (perhaps from tables). Can / would anyone refer me to
a tutorial on how to do that? I am aware of, and have downloaded, the
pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for Word, so help with that would also be
welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::

Oct 28 '08 #1

Okko Willeboordsed

Get a copy of Python Programming on Win32 (ISBN 1-56592-621-8).
Use Google and Word's VBA documentation for help.

Oct 29 '08 #2

Mike Driscoll

Also check out MSDN. The win32com module in PyWin32 is a thin wrapper
around COM, so most of the syntax shown on MSDN or in VB examples can be
translated directly to Python. There's also a PyWin32 mailing list,
which is quite helpful:

http://mail.python.org/mailman/listinfo/python-win32
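
To illustrate how closely a VB example maps onto Python, here is a minimal sketch, assuming Windows with Word and pywin32 installed. The helper names `open_word_document` and `paragraph_texts` and the file path are made up for illustration; the COM calls (`Documents.Open`, `Paragraphs`, `Range.Text`, `Close`, `Quit`) are the same ones the VBA version uses.

```python
import os
from contextlib import contextmanager

# The VBA this was translated from:
#   Set app = CreateObject("Word.Application")
#   Set doc = app.Documents.Open("C:\reports\sample.doc", False, True)
#   For Each p In doc.Paragraphs
#       Debug.Print p.Range.Text
#   Next
#   doc.Close False
#   app.Quit


@contextmanager
def open_word_document(path):
    """Open a Word file read-only through COM and always shut Word down."""
    # Imported lazily: win32com only exists on Windows with pywin32 installed.
    import win32com.client

    app = win32com.client.Dispatch("Word.Application")
    app.Visible = False
    doc = None
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly.
        # COM wants an absolute path, so resolve it first.
        doc = app.Documents.Open(os.path.abspath(path), False, True)
        yield doc
    finally:
        if doc is not None:
            doc.Close(False)  # False: do not save changes
        app.Quit()


def paragraph_texts(doc):
    """Return the text of every paragraph, without the trailing paragraph mark."""
    return [p.Range.Text.rstrip("\r") for p in doc.Paragraphs]


if __name__ == "__main__":
    # Hypothetical path, replace with a real document.
    with open_word_document(r"C:\reports\sample.doc") as doc:
        for line in paragraph_texts(doc):
            print(line)
```

Wrapping the open/close pair in a context manager is a design choice, not something COM requires; it just makes it harder to leave a hidden Word process running when the script fails halfway.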

Mike

Oct 29 '08 #3

Reedick, Andrew

Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
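
As a starting point for the "perhaps from tables" part of the question, the sketch below walks the Tables collection described in the Word Object Model. It assumes `doc` is a Word Document COM object, obtained for example via `win32com.client.Dispatch("Word.Application").Documents.Open(path)`. The function names are illustrative, and tables with merged cells are not handled.

```python
def cell_text(cell):
    """Return a cell's text without Word's end-of-cell markers."""
    # Word ends every cell's Range.Text with "\r\x07"
    # (a paragraph mark followed by the end-of-cell mark).
    return cell.Range.Text.rstrip("\r\x07")


def table_rows(table):
    """Read one table into a list of rows, each a list of cell strings."""
    rows = []
    # COM collections in Word are 1-based, hence range(1, Count + 1).
    for r in range(1, table.Rows.Count + 1):
        row = []
        for c in range(1, table.Columns.Count + 1):
            # Caveat: Cell(r, c) raises an error when the cell does not
            # exist, which happens in tables with merged cells.
            row.append(cell_text(table.Cell(r, c)))
        rows.append(row)
    return rows


def extract_tables(doc):
    """Return every table in the document as nested lists of strings."""
    return [table_rows(table) for table in doc.Tables]
```

The result is a plain list of tables, each a list of rows, so it can be handed straight to the `csv` module or any other code that does not know about COM.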

Oct 29 '08 #4

Kay Schluehr

One can convert MS Word documents into a web archive format called
MHTML (HTML with Office-specific XML markup, wrapped in a MIME
container). If I remember correctly, those documents have an .mht
extension. The result is a huge amount of (nevertheless structured)
markup gibberish together with the text. If one spends time and
attention, one can find patterns in the markup (it is XML-like and
human readable).

A few years ago I used this conversion to implement roughly the
following algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied a href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (public specifications that were
originally not connected).
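
The first part of step 2 (finding the colour-marked sections) could be sketched roughly as follows. This is not the original implementation: it assumes the .mht file is a MIME archive whose first text/html part holds the document, and that Word writes highlighted runs as `<span>` elements whose style attribute mentions `background`. Both are assumptions to verify against a real export. The state machine from steps 2 and 3 is not shown.

```python
from email import message_from_bytes
from html.parser import HTMLParser


def html_from_mht(raw):
    """Return the first text/html part of an MHTML (MIME) archive as a string."""
    message = message_from_bytes(raw)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    raise ValueError("no text/html part found")


class HighlightFinder(HTMLParser):
    """Collect the text inside <span> elements whose style sets a background."""

    def __init__(self, marker="background"):
        super().__init__()
        self.marker = marker  # assumed substring of the highlight style
        self.depth = 0        # span nesting depth inside a highlighted span
        self.current = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        # Once inside a highlighted span, count nested spans too, so the
        # matching end tag can be identified.
        if self.depth or self.marker in style.lower():
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.found.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def highlighted_sections(raw):
    """Return the text of every highlighted span in an .mht file's bytes."""
    finder = HighlightFinder()
    finder.feed(html_from_mht(raw))
    finder.close()
    return finder.found
```

Typical use would be `highlighted_sections(open("spec.mht", "rb").read())`, where `spec.mht` is a placeholder file name.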

Kay

Oct 29 '08 #5

Terry Reedy

A related solution is to use OpenOffice to convert the document to
OpenDocument Format (ODF), a zipped collection of XML files, and then
use ODFPY to parse the XML and access the contents as linked objects.
http://opendocumentfellowship.com/de...projects/odfpy
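
For a quick look at what the converted file contains, the sketch below reads tables from an .odt file using only the standard library (`zipfile` and `xml.etree.ElementTree`) instead of ODFPY, so it shows the underlying format rather than ODFPY's API. The function name `odt_tables` is illustrative. It only looks at paragraphs that are direct children of a cell and rows that are direct children of a table, so header-row groups and lists inside cells are skipped.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by OpenDocument text files.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def odt_tables(path_or_file):
    """Return every table in an .odt file as a list of rows of cell strings."""
    # An .odt file is a zip archive; the document body lives in content.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    # iter() also finds nested tables; they appear as separate entries.
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        for row in table.iterfind("table:table-row", NS):
            cells = []
            for cell in row.iterfind("table:table-cell", NS):
                paragraphs = cell.iterfind("text:p", NS)
                # Join the paragraphs of one cell with newlines.
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Called as `odt_tables("report.odt")` (a placeholder name), it returns nested lists that can be processed without any further knowledge of the XML.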

Oct 29 '08 #6

bp.tralfamadore

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::

Oct 29 '08 #7
