Parsing MS Word Document?

I would like to be able to open, read, and extract data from a report that
is produced in MS Word. The doc seems to contain embedded spreadsheets. I
would like to extract some of the data from the spreadsheets and feed it
into another application. I've been reading a little bit about OLE and MS
Word and sure would like to find a module that hides some of this so-called
innovation from me.

Thanks,
Bill
Jul 18 '05 #1

:-) Yeah, isn't all that baroque complexity wonderful?

1. Alex Martelli's suggestion on this list: use RTF. Word can import
   and export it. You can automate that from VB or Python in the
   usual COM ways (see 3.). I don't know whether you'll get useful
   RTF out of embedded Excel sheets, though.

2. Use OpenOffice via PyUNO.

3. As you already know, use the MS Office object models, with the
   Python for Windows extensions (or ctypes, if you're brave). Perhaps
   ADO is what you're looking for? IIRC, ADO isn't too complicated and
   can treat Excel sheets as data sources just as it does for
   relational databases.
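
The ADO idea in 3. could look roughly like the sketch below. It rests on several assumptions that are not from the thread: the embedded sheet has already been saved out as a standalone workbook, the Jet OLE DB provider is installed, and pywin32 is available. The helper names `excel_connection_string` and `read_sheet` are made up for illustration.

```python
def excel_connection_string(xls_path, header=True):
    """Build a Jet OLE DB connection string for an .xls workbook."""
    hdr = "Yes" if header else "No"  # HDR=Yes: first row holds column names
    return (
        "Provider=Microsoft.Jet.OLEDB.4.0;"
        f"Data Source={xls_path};"
        f'Extended Properties="Excel 8.0;HDR={hdr}"'
    )


def read_sheet(xls_path, sheet="Sheet1"):
    """Return every row of one worksheet as a list of lists (Windows only)."""
    import win32com.client  # pywin32; imported lazily so this file loads anywhere

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open(excel_connection_string(xls_path))
    try:
        rs = win32com.client.Dispatch("ADODB.Recordset")
        # ADO exposes a worksheet as a table named "[SheetName$]".
        rs.Open(f"SELECT * FROM [{sheet}$]", conn)
        rows = []
        while not rs.EOF:
            rows.append(
                [rs.Fields.Item(i).Value for i in range(rs.Fields.Count)]
            )
            rs.MoveNext()
        rs.Close()
        return rows
    finally:
        conn.Close()
```

Whether this helps with sheets that live only inside the Word file is an open question; it covers the case where the spreadsheet can be obtained as its own file.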

For simpler Word docs (no embedded stuff), there are other tools out
there, but they'd be no use in this case.

A useful tip for 3. is to record a VB macro in Word, then edit it to
something sane. You can keep it in VB, or do the relatively trivial
edits required to convert it to Python. Here's an example on
automating RTF generation:

http://www.google.com/groups?q=autho...box.com&rnum=1
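
For a rough idea of what a recorded macro looks like once converted to Python, here is a minimal sketch that opens a document in Word and saves a copy as RTF. It assumes Windows with Word and pywin32 installed; `rtf_path_for` and `save_as_rtf` are hypothetical helper names, not taken from the linked post.

```python
import os

WD_FORMAT_RTF = 6  # numeric value of Word's wdFormatRTF constant


def rtf_path_for(doc_path):
    """Return an absolute path next to doc_path with an .rtf extension."""
    base, _ext = os.path.splitext(os.path.abspath(doc_path))
    return base + ".rtf"


def save_as_rtf(doc_path):
    """Open doc_path in Word and save a copy as RTF (Windows only)."""
    import win32com.client  # pywin32; imported lazily so this file loads anywhere

    out_path = rtf_path_for(doc_path)
    word = win32com.client.Dispatch("Word.Application")
    try:
        # Word resolves relative paths against its own working directory,
        # so always hand it absolute paths.
        doc = word.Documents.Open(os.path.abspath(doc_path))
        doc.SaveAs(out_path, FileFormat=WD_FORMAT_RTF)
        doc.Close(False)  # False: close without saving changes to the original
    finally:
        word.Quit()
    return out_path
```

The VB macro recorder emits the same `Documents.Open` and `SaveAs` calls, which is why the translation is mostly a matter of syntax.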
John
Jul 18 '05 #2
Thanks John,
This should get me started.
Bill
Jul 18 '05 #3

Here is another strategy:

1. Load the document into MS Word. Save the document as HTML.

2. Run the `links` Web browser on the file with the -dump option.
This will convert the HTML into plain text. Example:

links -dump mydoc.html > mydoc.txt

3. Use Python to extract information from the resulting plain text
file.
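
Step 3 might start from something like the sketch below. The layout of the dumped text is an assumption here: table rows are taken to have their cells separated by `|` characters or by runs of two or more spaces, and rule lines are taken to consist only of `+`, `-`, `=` and `|`. Check a real dump before relying on it; `parse_dump` is a made-up helper name.

```python
import re

# Blank lines and table rules such as "+-----+" or "|----+----|".
BORDER = re.compile(r"^[\s+\-|=]*$")
# A cell boundary: a pipe with optional padding, or two or more spaces.
SEPARATOR = re.compile(r"\s*\|\s*|\s{2,}")


def parse_dump(text):
    """Return the table rows found in dumped text as lists of cell strings."""
    rows = []
    for line in text.splitlines():
        if BORDER.match(line):
            continue  # skip rules and empty lines
        stripped = line.strip().strip("|").strip()
        cells = [cell for cell in SEPARATOR.split(stripped) if cell]
        if len(cells) > 1:  # single-cell lines are treated as ordinary prose
            rows.append(cells)
    return rows


if __name__ == "__main__":
    with open("mydoc.txt") as handle:
        for row in parse_dump(handle.read()):
            print(row)
```

Lines that yield only one cell are dropped on purpose, so headings and running text do not end up mixed in with the spreadsheet data.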

Another suggestion: the `links` Web browser formats tables
differently from, and perhaps better than, `lynx`, but you might
try `lynx` too.

Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
dk******@rexx.com
Jul 18 '05 #4
