parsing MS word docs -- tutorial request

All,

I am trying to write a script that will parse and extract data from a
MS Word document. Can / would anyone refer me to a tutorial on how to
do that? (perhaps from tables). I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.

Any help would be appreciated. Thanks for your attention and
patience.

::bp::
Oct 28 '08 #1
Get a copy of Python Programming on Win32 (ISBN 1-56592-621-8).
Use Google and the Word VBA documentation for help.

Oct 29 '08 #2
On Oct 29, 4:32 am, Okko Willeboordsed <okko.willeboor...@gmail.com>
wrote:
Get a copy of Python Programming on Win32, ISBN 1-56592-621-8
Use Google and VBA for help

Also check out MSDN: pywin32's win32com module is a thin wrapper around
COM, so most of the syntax on MSDN or in VB examples translates almost
directly to Python. There's also a PyWin32 mailing list which is quite helpful:

http://mail.python.org/mailman/listinfo/python-win32
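
To illustrate how a VBA idiom such as ActiveDocument.Tables(1).Cell(1, 1).Range.Text carries over, here is a minimal sketch using win32com.client from pywin32. It assumes Windows with Word installed; the function names clean_cell_text and read_word_tables are hypothetical, not part of any library.

```python
import os


def clean_cell_text(raw):
    """Strip the end-of-cell marker that Word appends to a cell's text."""
    # Through COM, a cell's Range.Text ends with "\r\x07"
    # (paragraph mark followed by the cell mark).
    return raw.rstrip("\r\x07").strip()


def read_word_tables(path):
    """Return every table in the document as a list of rows of cell strings."""
    # Imported here so clean_cell_text stays usable without pywin32 installed.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    tables = []
    try:
        # Positional arguments: FileName, ConfirmConversions=False, ReadOnly=True.
        doc = word.Documents.Open(os.path.abspath(path), False, True)
        try:
            for table in doc.Tables:
                rows = {}
                # Walking Range.Cells avoids Cell(row, col), which may raise
                # an error on tables that contain merged cells.
                for cell in table.Range.Cells:
                    rows.setdefault(cell.RowIndex, []).append(
                        clean_cell_text(cell.Range.Text))
                tables.append([rows[r] for r in sorted(rows)])
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
    return tables
```

Calling read_word_tables on a document path (for example a hypothetical C:\docs\report.doc) should give one list per table, each holding one list of strings per row.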

Mike
Oct 29 '08 #3
Word Object Model:
http://msdn.microsoft.com/en-us/library/bb244515.aspx

Google for sample code to get you started.
Oct 29 '08 #4
One can save MS Word documents as MHTML, Word's single-file web page
format. If I remember correctly those documents had an .mht
extension. The result is a huge amount of (nevertheless structured)
markup gibberish together with the text. If one spends time and attention
one can find patterns in the markup (it is HTML/XML-style markup and
human readable).
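
An .mht file is a MIME multipart message, so the HTML part can be pulled out with Python's standard email package before hunting for patterns in the markup. A minimal sketch; html_from_mht is a hypothetical helper name, and it assumes the document body is the first text/html part in the file.

```python
import email
from email import policy


def html_from_mht(raw_bytes):
    """Return the first text/html part of an MHTML (.mht) file as a string."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (often
            # quoted-printable) and decodes with the declared charset.
            return part.get_content()
    return None
```

Reading the file in binary mode and passing the bytes in leaves all decoding to the email package.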

A few years ago I used this conversion to implement roughly the following
algorithm:

1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied a href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.

This way I could link two documents (those were public specifications
that were originally not connected).
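
Steps 1 and 2 above, finding the colour-marked passages, might look roughly like this with the standard html.parser module. This is only a sketch and does not reconstruct the state machine or the linking steps. It assumes the highlight shows up as a span whose style attribute contains background:yellow, which is an assumption about Word's HTML output; HighlightFinder and find_highlights are hypothetical names.

```python
from html.parser import HTMLParser


class HighlightFinder(HTMLParser):
    """Collect the text inside <span> elements that carry a highlight style."""

    def __init__(self, marker="background:yellow"):
        super().__init__()
        self.marker = marker   # assumed style fragment for the colour marker
        self.stack = []        # one bool per open <span>: is it highlighted?
        self.current = []      # text pieces of the highlight being read
        self.found = []        # finished highlighted passages

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            self.stack.append(self.marker in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            was_highlighted = self.stack.pop()
            # Record the passage once the outermost highlighted span closes.
            if was_highlighted and not any(self.stack):
                self.found.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if any(self.stack):
            self.current.append(data)


def find_highlights(html):
    """Return the text of every highlighted passage, in document order."""
    finder = HighlightFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

The returned strings could then serve as the search keys for the second document.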

Kay

Oct 29 '08 #5
A related solution is to use OpenOffice to convert to OpenDocument
Format (ODF), a zip archive of several XML files, and then use odfpy to
parse the XML and access the contents as linked objects.
http://opendocumentfellowship.com/de...projects/odfpy
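
To show what "zipped XML" means in practice, here is a minimal sketch that reads the tables of an .odt file using only the standard library (zipfile and xml.etree.ElementTree) instead of odfpy. The function name odt_tables is hypothetical, and the sketch assumes the tables are not nested inside one another.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument format.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def odt_tables(source):
    """Return each table in an .odt file as a list of rows of cell strings."""
    # The document body lives in content.xml inside the zip archive.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        # iter() also reaches rows wrapped in a header-rows element.
        for row in table.iter("{%s}table-row" % NS["table"]):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                paragraphs = cell.findall("text:p", NS)
                cells.append(
                    "\n".join("".join(p.itertext()) for p in paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

The source argument can be a file path or a file-like object, as accepted by zipfile.ZipFile.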

Oct 29 '08 #6

Thanks everyone -- very helpful!
I really appreciate your help -- that is what makes the world a
wonderful place.

peace.

::bp::
Oct 29 '08 #7
