On 28 Okt., 15:25, bp.tralfamad...@gmail.com wrote:
All,
I am trying to write a script that will parse and extract data from a
MS Word document. *Can / would anyone refer me to a tutorial on how to
do that? *(perhaps from tables). *I am aware of, and have downloaded
the pywin32 extensions, but am unsure of how to proceed -- I'm not
familiar with the COM API for word, so help for that would also be
welcome.
Any help would be appreciated. *Thanks for your attention and
patience.
::bp::
One can convert MS-Word documents into some class of XML documents
called MHTML. If I remember correctly those documents had an .mht
extension. The result is a huge amount of ( nevertheless structured )
markup gibberish together with text. If one spends time and attention
one can find pattern in the markup ( we have XML and it's human
readable ).
A few years ago I used this conversion to implement roughly following
thing algorithm:
1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour marked section and determined the
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
equally structured.
4. I applied a href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.
This way I could link two documents ( those were public specifications
being originally disconnected ).
Kay