parsing MS word docs -- tutorial request

All,

I am trying to write a script that will parse and extract data from a
MS Word document. Can / would anyone refer me to a tutorial on how to
do that? (perhaps from tables). I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::
Oct 28 '08 #1
Get a copy of Python Programming on Win32, ISBN 1-56592-621-8.
Use Google and the VBA documentation for help.

Oct 29 '08 #2
Also check out MSDN. The win32com module in PyWin32 is a thin wrapper
around COM, so most of the syntax on MSDN or in VB examples can be
translated directly to Python. There's also a PyWin32 mailing list,
which is quite helpful:

http://mail.python.org/mailman/listinfo/python-win32

Mike
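
To illustrate how directly VBA maps onto Python, here is a minimal sketch, assuming Windows with Word and pywin32 installed. The helper names open_word_document and paragraph_texts are made up for this example; the COM members used (Documents.Open, Paragraphs, Range.Text, Close, Quit) belong to the Word object model, and the VBA equivalent is noted in the comments.

```python
from contextlib import contextmanager


@contextmanager
def open_word_document(path):
    """Open a Word file read-only via COM and always shut Word down again.

    `path` should be an absolute path; Word resolves relative paths
    against its own working directory, not the script's.
    """
    import win32com.client  # pywin32, Windows only; imported lazily

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # VBA: Set doc = Documents.Open(FileName, ConfirmConversions, ReadOnly)
        doc = word.Documents.Open(path, False, True)
        try:
            yield doc
        finally:
            doc.Close(False)  # VBA: doc.Close SaveChanges:=False
    finally:
        word.Quit()  # VBA: Application.Quit


def paragraph_texts(doc):
    """Return the text of every non-empty paragraph in a Word document."""
    texts = []
    # VBA: For Each p In doc.Paragraphs: Debug.Print p.Range.Text: Next
    for paragraph in doc.Paragraphs:
        text = paragraph.Range.Text.strip()  # drops the trailing paragraph mark
        if text:
            texts.append(text)
    return texts
```

Typical use would be `with open_word_document(r"C:\docs\report.doc") as doc: print(paragraph_texts(doc))`, where the path is only a placeholder.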
Oct 29 '08 #3
Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
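
Since the original question mentions tables, here is a rough sketch of reading them through the object model linked above. It assumes `doc` is a Word Document COM object (for example, the result of `win32com.client.Dispatch("Word.Application").Documents.Open(path)`). The function names are invented for this example; Tables, Rows.Count, Columns.Count, Cell and Range.Text are Word object model members.

```python
def table_to_rows(table):
    """Convert one Word Table object into a list of rows of cell strings.

    Word collections are 1-based. Cell text ends with a paragraph mark
    and an end-of-cell marker ("\r\x07"), which are stripped here.
    Assumption: the table is a plain grid; tables with merged cells can
    raise a COM error when a (row, column) position does not exist.
    """
    rows = []
    for r in range(1, table.Rows.Count + 1):
        row = []
        for c in range(1, table.Columns.Count + 1):
            raw = table.Cell(r, c).Range.Text
            row.append(raw.rstrip("\r\x07"))
        rows.append(row)
    return rows


def extract_tables(doc):
    """Return every table in the document as a nested list of strings."""
    return [table_to_rows(table) for table in doc.Tables]
```

Each table comes back as a list of rows, so `extract_tables(doc)[0][0]` would be the header row of the first table, if the document has one.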
Oct 29 '08 #4
One can convert MS Word documents into MHTML, a MIME-wrapped HTML
format (not strictly XML). If I remember correctly, those documents
have an .mht extension. The result is a huge amount of (nevertheless
structured) markup gibberish together with the text. With some time
and attention one can find patterns in the markup (it is tag-based
and human readable).

A few years ago I used this conversion to implement roughly the
following algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied an href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (public specifications that were
originally disconnected).

Kay
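
A rough sketch of the first two steps above (finding the colour-marked text), using only the standard library. It is not the original implementation. An .mht file is a MIME container, so the email package can unpack it. The assumption that Word writes highlights as a span whose style contains "background" (for instance background:yellow;mso-highlight:yellow) should be checked against your own files. The names html_from_mhtml, HighlightCollector and find_highlights are made up for this example.

```python
import email
from html.parser import HTMLParser


def html_from_mhtml(raw_bytes):
    """Return the first text/html part of an MHTML (MIME) document as str."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # undoes quoted-printable/base64
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    raise ValueError("no text/html part found")


class HighlightCollector(HTMLParser):
    """Collect the text inside <span> elements that set a background colour."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # span nesting depth inside a highlighted region
        self.current = []     # text fragments of the region being read
        self.highlights = []  # finished highlighted strings

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        # Assumption: a highlight shows up as "background" in the style.
        if self.depth or "background" in style.lower():
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.highlights.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def find_highlights(html):
    """Return the text of every highlighted span found in the HTML."""
    collector = HighlightCollector()
    collector.feed(html)
    collector.close()
    return collector.highlights
```

The surrounding markup of each hit could then be inspected to derive the structure for the later steps; that part depends heavily on the documents involved and is left out here.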

Oct 29 '08 #5
A related solution is to use OpenOffice to convert the document to
OpenDocument Format (ODF), a zip archive containing several XML files,
and then use ODFPY to parse the XML and access the contents as linked
objects.
http://opendocumentfellowship.com/de...projects/odfpy
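
To show what "zipped multiple XML" means in practice, here is a small sketch that reads tables from an already converted .odt file using only the standard library instead of ODFPY. It assumes the conversion in OpenOffice has been done beforehand. The function name read_odt_tables is made up; content.xml and the table/text namespaces come from the OpenDocument format.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument format
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def read_odt_tables(odt_file):
    """Return every table in an .odt file as a list of rows of cell strings.

    `odt_file` may be a path or a binary file-like object.
    Simplifications: only rows that are direct children of a table are
    read (header-row groups are skipped), repeated-column attributes are
    ignored, and nested tables are also returned as separate tables.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        for row in table.findall("table:table-row", NS):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                paragraphs = cell.findall("text:p", NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

ODFPY wraps the same XML in Python objects, which is likely more convenient once the structure needs to be modified rather than just read.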

Oct 29 '08 #6

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::
Oct 29 '08 #7
