Extracting text from Word document (for regular expression matching)

I would be very grateful for any help with the following:

I currently have the code below. It opens an MS Word document and
uses the .NET regular expression classes to check whether any term
matches within the document. When I run the code I get a parse error.
I think there is an escape character in the Word doc format, or
perhaps matching against the entire document is not a good idea.

public DataRow[] getMatches()
{
    ArrayList matches = new ArrayList();

    StreamReader sr = null;

    foreach (DataRow dr in theData.Rows)
    {
        string rx = dr["Term Name"].ToString();
        sr = File.OpenText(inputFilePath);

        if (Regex.IsMatch(rx, sr.ReadToEnd()))
        {
            matches.Add(dr);
        }
    }

    sr.Close();
    return (DataRow[])matches.ToArray(typeof(DataRow));
}
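
A likely source of the parse error is the argument order: Regex.IsMatch takes the input string first and the pattern second, so the call above passes the whole document to the regex parser as a pattern. The sketch below shows the swapped order on plain strings. TermMatcher and FindTerms are hypothetical names, and treating each term as literal text via Regex.Escape is an assumption about the "Term Name" data, not something the code above does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TermMatcher
{
    // Returns the terms that occur somewhere in text.
    // Assumption: terms are literal strings, so they are escaped
    // before being used as patterns.
    public static List<string> FindTerms(string text, IEnumerable<string> terms)
    {
        List<string> found = new List<string>();
        foreach (string term in terms)
        {
            if (string.IsNullOrEmpty(term))
            {
                continue; // an empty pattern would match any text
            }

            // Regex.IsMatch(input, pattern): the text to search comes first.
            if (Regex.IsMatch(text, Regex.Escape(term)))
            {
                found.Add(term);
            }
        }
        return found;
    }
}
```

Reading the file once before the loop and passing its contents as text would also avoid reopening it for every row. This does not address the Word format itself: a .doc file read with File.OpenText still contains formatting data alongside the text.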

Is there any way of either:

1) Extracting just the text from the Word document programmatically?
(I.e. I don't want all the extra formatting data that MS Word stores)
2) Parsing it into 'words'?
3) Putting all the words into a string array?
4) All of the above

I can probably do 2, 3 and 4, but I am struggling to think of a way to
do 1.
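
For items 2 and 3, here is a minimal sketch, assuming the document text has already been extracted into a plain string (item 1 is not covered here). WordSplitter and ToWords are hypothetical names, and defining a "word" as a run of \w characters is an assumption that splits on apostrophes and hyphens.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Splits plain text into an array of words, where a word is a
    // run of letters, digits or underscores (the regex class \w).
    public static string[] ToWords(string text)
    {
        MatchCollection hits = Regex.Matches(text, @"\w+");
        string[] words = new string[hits.Count];
        for (int i = 0; i < hits.Count; i++)
        {
            words[i] = hits[i].Value;
        }
        return words;
    }
}
```

Calling WordSplitter.ToWords("Hello, Word doc!") gives the three words Hello, Word and doc, with the punctuation dropped.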

Any help would be much appreciated...

Cheers,

Mark.
Nov 16 '05 #1