MS Word parser

Hi all,
I'm currently using antiword to extract content from MS Word files.
Is there another way to do this without relying on any command prompt
application?

Jun 13 '07 #1
ke********@gmail.com wrote:
> Hi all,
> I'm currently using antiword to extract content from MS Word files.
> Is there another way to do this without relying on any command prompt
> application?

Well you haven't given your environment, but is there
anything to stop you from controlling Word itself via
COM? I'm no Word expert, but looking around, this
seems to work:

<code>
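import win32com.client

# Requires Windows with Word installed; Word is driven through COM.
word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open("c:/temp/temp.doc")
text = doc.Range().Text
doc.Close(False)  # close without saving changes
word.Quit()

# Write the text as UTF-8 (text mode with an explicit encoding).
with open("c:/temp/temp.txt", "w", encoding="utf-8") as f:
    f.write(text)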
</code>

TJG
Jun 13 '07 #2
On Jun 13, 1:28 am, Tim Golden <m...@timgolden.me.uk> wrote:
> Well you haven't given your environment, but is there
> anything to stop you from controlling Word itself via
> COM?

Tim,
I'm on Linux (RedHat) so using Word is not an option for me. Any
other suggestions?

Jun 13 '07 #3
On 2007-06-13, ke********@gmail.com <ke********@gmail.com> wrote:
> Tim,
> I'm on Linux (RedHat) so using Word is not an option for me. Any
> other suggestions?

There is OpenOffice which has a Python API to it (called UNO). But
piping through antiword is probably easier.
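
For reference, here is a minimal sketch of the piping approach, assuming antiword is installed and on the PATH and that it prints the extracted text to standard output when given a file name. The function name `extract_text` and its `command` parameter are made up for this illustration; decoding the output as UTF-8 is also an assumption and may need adjusting for your locale.

```python
import shutil
import subprocess


def extract_text(path, command=("antiword",)):
    """Run a converter program on `path` and return what it prints to stdout.

    `command` defaults to plain antiword with no options (an assumption);
    pass a different tuple to use another converter.
    """
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    result = subprocess.run(
        [*command, path],      # argument list, so no shell quoting issues
        capture_output=True,   # collect stdout and stderr in memory
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("c:/temp/temp.doc")`, or whatever path you have, returns the text as a Python string, so the rest of the program never touches the command line directly even though antiword still runs underneath.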
Jun 14 '07 #4
ke********@gmail.com wrote:
> I'm currently using antiword to extract content from MS Word files.
> Is there another way to do this without relying on any command prompt
> application?

There is also wvWare (http://wvware.sourceforge.net/), but it is also
primarily a command-line application. Both of these programs are open
source, so you could (with a bit of work) wrap them with Swig or Pyrex
to access them directly from Python.

- Josiah
Jun 18 '07 #5
