parsing MS word docs -- tutorial request

All,

I am trying to write a script that will parse and extract data from an
MS Word document (perhaps from tables). Can anyone refer me to a
tutorial on how to do that? I am aware of, and have downloaded, the
pywin32 extensions, but am unsure how to proceed. I'm not familiar
with the COM API for Word, so help with that would also be welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::
Oct 28 '08 #1
Get a copy of Python Programming on Win32 (ISBN 1-56592-621-8).
Use Google and the VBA documentation for help.

Oct 29 '08 #2
On Oct 29, 4:32 am, Okko Willeboordsed <okko.willeboor...@gmail.com> wrote:
Get a copy of; Python Programming on Win32, ISBN 1-56592-621-8
Use Google and VBA for help

Also check out MSDN: the win32 module is a thin wrapper, so most of
the syntax on MSDN or in VB examples can be translated directly to
Python. There's also a PyWin32 mailing list, which is quite helpful:

http://mail.python.org/mailman/listinfo/python-win32

Mike
Oct 29 '08 #3

Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
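
As a starting point, here is a minimal sketch of reading tables through Word's COM object model with pywin32. It assumes Windows with Word installed and a document whose tables have no merged cells. The function names (`clean_cell_text`, `extract_tables`, `read_word_tables`) are made up for this example; the `Documents`, `Tables`, `Rows`, `Columns` and `Cell(...).Range.Text` members come from the Word object model linked above.

```python
import os


def clean_cell_text(raw):
    """Strip the end-of-cell marker (CR + BEL) that Word appends to cell text."""
    return raw.rstrip("\r\x07").strip()


def extract_tables(doc):
    """Return every table in a Word Document object as a list of rows.

    `doc` is a COM Document object. Collections in Word's object model
    are 1-based. Assumes no merged cells: Cell(r, c) raises an error for
    a cell that does not exist.
    """
    tables = []
    for t in range(1, doc.Tables.Count + 1):
        table = doc.Tables(t)
        rows = []
        for r in range(1, table.Rows.Count + 1):
            row = [
                clean_cell_text(table.Cell(r, c).Range.Text)
                for c in range(1, table.Columns.Count + 1)
            ]
            rows.append(row)
        tables.append(rows)
    return tables


def read_word_tables(path):
    """Open `path` in Word via COM, extract its tables, then shut Word down."""
    import win32com.client  # from pywin32; needs Windows with Word installed

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(os.path.abspath(path))
        try:
            return extract_tables(doc)
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

Something like `read_word_tables(r"C:\docs\report.doc")` (a hypothetical path) would then return a list of tables, each a list of rows of strings. Keeping the COM start-up separate from `extract_tables` makes the extraction loop easy to exercise without Word.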
Oct 29 '08 #4
One can save MS Word documents as MHTML, a MIME-wrapped HTML archive
(not strictly XML, although the markup is XML-like). If I remember
correctly, those files have an .mht extension. The result is a huge
amount of structured but verbose markup mixed in with the text. With
some time and attention one can find patterns in the markup, since it
is human-readable.

A few years ago I used this conversion to implement roughly the
following algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour-marked section and determined its
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that had the
same structure.
4. I applied an href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (public specifications that were
originally disconnected).
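
A minimal sketch of steps 1 and 2 above (finding the colour-marked text), not the original implementation. It assumes the .mht file is a MIME message with a text/html part, and that Word writes highlighted runs as `<span>` elements whose `style` contains `mso-highlight`; check your own file and adjust the marker. The names `html_from_mhtml`, `HighlightFinder` and `find_highlighted` are made up for this example. The state machine and the linking steps are not shown.

```python
import email
from html.parser import HTMLParser


def html_from_mhtml(raw_bytes):
    """Return the first text/html part of an MHTML (MIME) archive as a str."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable / base64 transfer encoding
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    raise ValueError("no text/html part found")


class HighlightFinder(HTMLParser):
    """Collect the text inside <span> elements whose style sets a highlight."""

    def __init__(self, marker="mso-highlight"):
        super().__init__()
        self.marker = marker  # assumed style fragment, not guaranteed
        self.stack = []       # one bool per open <span>: highlighted or not
        self.found = []       # text of each highlighted run

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            hit = self.marker in style.lower()
            if hit and not any(self.stack):
                self.found.append("")  # start a new highlighted run
            self.stack.append(hit)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if any(self.stack):
            self.found[-1] += data


def find_highlighted(raw_bytes):
    """Return the highlighted text runs found in an .mht file's bytes."""
    finder = HighlightFinder()
    finder.feed(html_from_mhtml(raw_bytes))
    return finder.found
```

Usage would be along the lines of `find_highlighted(open("spec.mht", "rb").read())`, where `spec.mht` is a hypothetical file name.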

Kay

Oct 29 '08 #5
Kay Schluehr wrote:
One can save MS Word documents as MHTML, a MIME-wrapped HTML archive
(not strictly XML, although the markup is XML-like). If I remember
correctly, those files have an .mht extension. The result is a huge
amount of structured but verbose markup mixed in with the text. With
some time and attention one can find patterns in the markup, since it
is human-readable.

A related solution is to use OpenOffice to convert the document to
OpenDocument Format, a zipped collection of XML files, and then use
ODFPY to parse the XML and access the contents as linked objects.
http://opendocumentfellowship.com/de...projects/odfpy
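
To show what "zipped multiple XML" means in practice, here is a minimal sketch that pulls table text out of an .odt file. It uses only the standard library (`zipfile` and `xml.etree.ElementTree`) rather than ODFPY, and it assumes the Word file has already been converted to .odt. The function name `odt_tables` is made up for this example; the namespace URIs are the ones defined by the OpenDocument format.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument format
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def odt_tables(path_or_file):
    """Return each table in an .odt file as a list of rows of cell strings.

    Simplifications: nested tables are not kept separate, and repeated
    or covered (merged) cells are not expanded.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        # The document body lives in content.xml inside the zip archive
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        for row in table.iter("{%s}table-row" % NS["table"]):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                paragraphs = cell.findall("text:p", NS)
                # Join the text of all paragraphs in the cell
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling `odt_tables("report.odt")` (a hypothetical file name) returns a list of tables, each a list of rows. ODFPY wraps the same XML in Python objects, which is likely more convenient for anything beyond simple extraction.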

Oct 29 '08 #6

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::
Oct 29 '08 #7
