
Parsing MS Word Document?

I would like to be able to open, read, and extract data from a report that
is produced in MS Word. The doc seems to contain embedded spreadsheets. I
would like to extract some of the data from the spreadsheets and feed it
into another application. I've been reading a little bit about OLE and MS
Word and sure would like to find a module that hides some of this so-called
innovation from me.

Thanks,
Bill
Jul 18 '05 #1


"MrBill" <no****@nospam.com> writes:
> I would like to be able to open, read, and extract data from a report that
> is produced in MS Word. The doc seems to contain embedded spreadsheets. I
> would like to extract some of the data from the spreadsheets and feed it
> into another application. I've been reading a little bit about OLE and MS
> Word and sure would like to find a module that hides some of this so-called
> innovation from me.


:-) Yeah, isn't all that baroque complexity wonderful?

1. Alex Martelli's suggestion on this list: use RTF. Word can import
and export to it. You can automate that from VB or Python in the
usual COM ways (see 3.). I don't know whether you'll get useful
RTF out of embedded Excel sheets, though.

2. Use OpenOffice via PyUNO.

3. As you already know, use the MS Office object models, with Python
for Windows extensions (or ctypes, if you're brave). Perhaps ADO
is what you're looking for? IIRC, ADO isn't too complicated and
can treat Excel sheets as data sources just as it does for
relational databases.
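
To make the ADO idea in 3. concrete, here is a rough sketch. It assumes the spreadsheet has already been saved out of the Word document as a standalone .xls file, that pywin32 is installed, and that the Jet 4.0 OLE DB provider is available on the machine. The function names, the path, and the default sheet name "Sheet1" are placeholders of mine, not anything from the original report.

```python
def excel_connection_string(xls_path, header=True):
    """Build a Jet OLE DB connection string for an .xls workbook."""
    hdr = "Yes" if header else "No"
    return (
        "Provider=Microsoft.Jet.OLEDB.4.0;"
        f"Data Source={xls_path};"
        f'Extended Properties="Excel 8.0;HDR={hdr}"'
    )


def read_sheet(xls_path, sheet="Sheet1"):
    """Return all rows of one worksheet as lists of cell values (Windows only)."""
    # Imported lazily because pywin32 exists only on Windows.
    import win32com.client

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(excel_connection_string(xls_path))
    try:
        rs = win32com.client.Dispatch("ADODB.Recordset")
        # ADO addresses a worksheet as a table named [SheetName$].
        rs.Open(f"SELECT * FROM [{sheet}$]", conn)
        rows = []
        while not rs.EOF:
            rows.append(
                [rs.Fields.Item(i).Value for i in range(rs.Fields.Count)]
            )
            rs.MoveNext()
        rs.Close()
        return rows
    finally:
        conn.Close()
```

Whether this helps with sheets that are still embedded in the .doc is a separate question; the sketch only covers the "Excel sheet as a data source" part.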

For simpler Word docs (no embedded stuff), there are other tools out
there, but they'd be no use in this case.

A useful tip for 3. is to record a VB macro in Word, then edit it to
something sane. You can keep it in VB, or do the relatively trivial
edits required to convert it to Python. Here's an example on
automating RTF generation:

http://www.google.com/groups?q=autho...box.com&rnum=1
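
As an illustration of the macro-to-Python conversion, a recorded macro that saves the active document as RTF might come out roughly like the sketch below. It assumes pywin32 and a local Word installation; the helper names are made up for the example, and the recorded VB line in the comment is only what such a macro typically looks like, not taken from the linked post.

```python
import os

WD_FORMAT_RTF = 6  # numeric value of Word's wdFormatRTF constant


def rtf_path_for(doc_path):
    """Return the sibling .rtf path for a Word document path."""
    root, _ext = os.path.splitext(doc_path)
    return root + ".rtf"


def export_rtf(doc_path):
    """Open doc_path in Word via COM and save an RTF copy next to it."""
    # Imported lazily because pywin32 exists only on Windows.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word resolves relative paths against its own working directory,
        # so pass absolute paths.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        out = os.path.abspath(rtf_path_for(doc_path))
        # Recorded VB would read something like:
        #   ActiveDocument.SaveAs FileName:=out, FileFormat:=wdFormatRTF
        doc.SaveAs(out, WD_FORMAT_RTF)
        doc.Close(False)  # False: do not prompt to save changes
        return out
    finally:
        word.Quit()
```

The edits from VB are mostly mechanical: named arguments become ordinary call arguments, and the wd* constants are replaced by their numeric values (or looked up through win32com.client.constants after running makepy).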
John
Jul 18 '05 #2

Thanks John,
This should get me started.
Bill
Jul 18 '05 #3

MrBill wrote:
> I would like to be able to open, read, and extract data from a report that
> is produced in MS Word. The doc seems to contain embedded spreadsheets. I
> would like to extract some of the data from the spreadsheets and feed it
> into another application. I've been reading a little bit about OLE and MS
> Word and sure would like to find a module that hides some of this so-called
> innovation from me.


Here is another strategy:

1. Load the document into MS Word. Save the document as HTML.

2. Run the `links` Web browser on the file with the -dump option.
This will convert the HTML into plain text. Example:

links -dump mydoc.html > mydoc.txt

3. Use Python to extract information from the resulting plain text
file.
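
For step 3, a minimal sketch might look like this. It assumes that table cells in the dumped text end up separated by runs of two or more spaces, which you should check against your own mydoc.txt since the exact layout depends on the document and the browser. The function names are just for illustration.

```python
import re


def split_columns(line):
    """Split one line of dumped text into cells on runs of 2+ whitespace."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def extract_rows(text, min_cells=2):
    """Keep only lines that look like table rows (at least min_cells cells)."""
    rows = []
    for line in text.splitlines():
        cells = split_columns(line)
        if len(cells) >= min_cells:
            rows.append(cells)
    return rows
```

To use it, read the file produced in step 2 (for example with open("mydoc.txt").read()) and pass the text to extract_rows(). Ordinary prose lines, where words are separated by single spaces, collapse to one cell and are skipped.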

Another suggestion: the `links` Web browser formats tables
differently from, and perhaps better than, `lynx`, but you might
try `lynx`, too.

Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
dk******@rexx.com
Jul 18 '05 #4
