Extracting text from Word document (for regular expression matching)

Mico

I would be very grateful for any help with the following:

I currently have the code below. It opens an MS Word document and
uses the .NET regular expression library (System.Text.RegularExpressions)
to check whether there is a match within the document. When I run the
code I get a parser error. I think there is an escape character in the
Word doc format, or perhaps trying to do a match against the entire
document is not a good idea.

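public DataRow[] getMatches()
{
    ArrayList matches = new ArrayList();

    // Read the file once (not once per row) and dispose the reader promptly.
    // Note: this still reads the raw Word file, not just its text.
    string text;
    using (StreamReader sr = File.OpenText(inputFilePath))
    {
        text = sr.ReadToEnd();
    }

    foreach (DataRow dr in theData.Rows)
    {
        string rx = dr["Term Name"].ToString();

        // Regex.IsMatch takes (input, pattern). Passing the document as the
        // pattern makes the regex parser try to parse the whole file.
        if (Regex.IsMatch(text, rx))
        {
            matches.Add(dr);
        }
    }

    return (DataRow[])matches.ToArray(typeof(DataRow));
}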

Is there any way of either:

1) Extracting just the text from the Word document programmatically?
(I.e. I don't want all the extra stuff that MS stores)
2) Parsing it into 'words'?
3) Putting all the words into a string array?
4) All of the above

I can probably do 2, 3 and 4, but I am struggling to think of a way to
do 1.
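
For what it's worth, 2 and 3 could be sketched roughly like this, assuming the plain text has already been extracted somehow (which is the part I can't do). WordSplitter and SplitIntoWords are made-up names, and treating a "word" as a run of \w characters is just my assumption:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Hypothetical helper: splits already-extracted plain text into words.
    // Assumption: a "word" is any run of \w characters (letters, digits, underscore).
    public static string[] SplitIntoWords(string text)
    {
        MatchCollection found = Regex.Matches(text, @"\w+");
        string[] words = new string[found.Count];
        for (int i = 0; i < found.Count; i++)
        {
            words[i] = found[i].Value;
        }
        return words;
    }
}
```

So WordSplitter.SplitIntoWords("Hello, Word document!") should give an array holding "Hello", "Word" and "document". It does nothing about the Word file format itself.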

Any help would be much appreciated...

Cheers,

Mark.

Nov 16 '05 #1
