Bytes | Software Development & Data Engineering Community
[STRING] extract a word and text around it

teo
Hello,

I need to extract a word plus some of the text that
precedes and follows it (about 30 + 30 chars)
from a long text document,

like the description that Google returns when
it has found a given word.

For example, from:

"Sylvia Brunner, a marine mammals researcher at the museum in Fairbanks,
identified the decomposing carcass and oversaw its recovery on Wednesday.
The "bloated, black thing on the beach" was about 12 feet from the river's
edge, she said."

I have to find the word 'carcass'
and finally return:
"identified the decomposing carcass and oversaw its recovery on"

---

Which is the *fastest* method in VB.NET?

In VB6 I would have used
InStr (with the Binary option because it is faster)
to find the position of the word,
then Mid to extract the preceding text,
then Mid to extract the following text,
and then built up my phrase this way: text1 & word & text2.

Any suggestions for VB.NET?
New methods, StringBuilder, regular expressions... or something else?

--------
Thanks
(examples are obviously much appreciated ;-) )

Jun 17 '06 #1
teo,
| Which is the *fast* method in VbNet?
It sounds like you have already identified the methods and simply want someone else
to test them for you. Why not test them yourself? You probably already
have the situation (program) and data to test them with.

I would probably use a regular expression, as regex feels like the "correct"
solution (not necessarily the fastest method). The trick is going to be
ensuring that it is an efficient expression and not a poorly performing
one... For example, using a lazy quantifier instead of a greedy one on the
30 characters before and after the word...

If I have time later I will see what RegEx I can come up with...

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
Jun 17 '06 #2
Use the same method as you would in VB6. Use the IndexOf method to find
the string and the Substring method to extract the part of the text.

However, I don't see the reason for getting the preceding text and
following text, just to put them together, when the string that you want
already exists in the text.
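
A minimal sketch of the IndexOf/Substring approach described above. The function name `ExtractSnippet` and the `radius` parameter are illustrative assumptions, not something from the thread; the ordinal comparison is chosen as the nearest equivalent of VB6's binary InStr.

```vbnet
Imports System

Module SnippetDemo
    ' Hypothetical helper: returns the first occurrence of word plus up to
    ' radius characters on each side, or an empty string if word is absent.
    Function ExtractSnippet(ByVal text As String, ByVal word As String, ByVal radius As Integer) As String
        ' Ordinal (binary) search, comparable to InStr with vbBinaryCompare.
        Dim pos As Integer = text.IndexOf(word, StringComparison.Ordinal)
        If pos < 0 Then Return String.Empty

        ' Clamp both ends so Substring never runs outside the string.
        Dim first As Integer = Math.Max(0, pos - radius)
        Dim last As Integer = Math.Min(text.Length, pos + word.Length + radius)
        Return text.Substring(first, last - first)
    End Function

    Sub Main()
        Dim text As String = "Sylvia Brunner, a marine mammals researcher " & _
            "at the museum in Fairbanks, identified the decomposing carcass and oversaw " & _
            "its recovery on Wednesday. The ""bloated, black thing on the beach"" was " & _
            "about 12 feet from the river's edge, she said."
        Console.WriteLine(ExtractSnippet(text, "carcass", 30))
    End Sub
End Module
```

Because the window is counted in characters, the snippet can begin or end in the middle of a word; trimming to word boundaries would be an extra step.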

Jun 17 '06 #3
Teo,
Here is a regex:

' Requires: Imports System.Text.RegularExpressions (and System.Diagnostics for Debug)
Dim input As String = "Sylvia Brunner, a marine mammals researcher " & _
    "at the museum in Fairbanks, identified the decomposing carcass and oversaw " & _
    "its recovery on Wednesday. The ""bloated, black thing on the beach"" was " & _
    "about 12 feet from the river's edge, she said."
' Lazy quantifier for the text before the word, greedy for the text after it.
Dim pattern As String = ".{1,30}?carcass.{1,30}"

Dim match As Match = Regex.Match(input, pattern, RegexOptions.Multiline)

If match.Success Then
    Debug.WriteLine(match.Value)
End If
--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
Jun 17 '06 #4
teo
On Sat, 17 Jun 2006 23:12:22 +0200, Göran Andersson <gu***@guffa.com>
wrote:
> Use the same method as you would in VB6. Use the IndexOf method to find
> the string and the Substring method to extract the part of the text.
>
> However, I don't see the reason for getting the preceding text and
> following text, just to put them together, when the string that you want
> already exists in the text.


The fact is that I'm building a search engine, and I need to
format the searched word in bold,
so I need two chunks of text
so that I can format my final string like this:
plain Text1 + bold word + plain Text2.
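
A rough sketch of splitting the snippet into the three parts described here. It assumes the output is HTML and marks the word with `<b>` tags, which the post does not state; `HighlightSnippet` and `radius` are hypothetical names, and HTML-encoding of the surrounding text is left out.

```vbnet
Imports System

Module BoldDemo
    ' Hypothetical helper: builds "before + <b>word</b> + after" around the first match.
    Function HighlightSnippet(ByVal text As String, ByVal word As String, ByVal radius As Integer) As String
        Dim pos As Integer = text.IndexOf(word, StringComparison.OrdinalIgnoreCase)
        If pos < 0 Then Return String.Empty

        Dim first As Integer = Math.Max(0, pos - radius)
        Dim afterStart As Integer = pos + word.Length
        Dim afterLen As Integer = Math.Min(radius, text.Length - afterStart)

        Dim before As String = text.Substring(first, pos - first)
        ' Taken from the document so the original casing is preserved.
        Dim matched As String = text.Substring(pos, word.Length)
        Dim after As String = text.Substring(afterStart, afterLen)

        ' Assumption: bold is expressed with HTML <b> tags.
        Return before & "<b>" & matched & "</b>" & after
    End Function

    Sub Main()
        Dim text As String = "Sylvia Brunner, a marine mammals researcher " & _
            "at the museum in Fairbanks, identified the decomposing carcass and oversaw " & _
            "its recovery on Wednesday."
        Console.WriteLine(HighlightSnippet(text, "carcass", 30))
    End Sub
End Module
```

Keeping the three pieces as separate strings also works for non-HTML output, for example when each piece is appended to a rich text control with its own font style.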

Because I have to extract the full text from a column of a DB
(and then extract only a part of it, as described above),
do you know whether SQL syntax can perform such an extraction?

Or do I have to extract the string using the VB methods
after having loaded the full text with a DataReader?

Jun 17 '06 #5
teo
Thanks.
I really didn't want you to do the work for me;
I only wanted to know the names of the functions
that are advisable for this case...

Jun 18 '06 #6
| The fact is that I'm building a searching engine, and I need to
| format the searched word as Bold,
Rather than searching for the text each time, have you considered "indexing"
each document?

Then, when you need to do a search, you simply check the index, which
returns where in the text the word was found.
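
One possible shape for such an index, as a sketch only: a dictionary from each lower-cased word to the character offsets where it occurs. The name `BuildIndex` and the choice of `\w+` as the definition of a "word" are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module IndexDemo
    ' Hypothetical helper: maps each lower-cased word to the offsets where it starts.
    Function BuildIndex(ByVal text As String) As Dictionary(Of String, List(Of Integer))
        Dim index As New Dictionary(Of String, List(Of Integer))()
        ' Assumption: a "word" is any run of word characters (\w+).
        For Each m As Match In Regex.Matches(text, "\w+")
            Dim key As String = m.Value.ToLowerInvariant()
            Dim positions As List(Of Integer) = Nothing
            If Not index.TryGetValue(key, positions) Then
                positions = New List(Of Integer)()
                index.Add(key, positions)
            End If
            positions.Add(m.Index)
        Next
        Return index
    End Function

    Sub Main()
        Dim text As String = "Sylvia Brunner, a marine mammals researcher " & _
            "at the museum in Fairbanks, identified the decomposing carcass and oversaw " & _
            "its recovery on Wednesday."
        Dim index As Dictionary(Of String, List(Of Integer)) = BuildIndex(text)

        Dim hits As List(Of Integer) = Nothing
        If index.TryGetValue("carcass", hits) Then
            For Each pos As Integer In hits
                ' Each offset can be passed to Substring to cut out the context.
                Console.WriteLine(pos)
            Next
        End If
    End Sub
End Module
```

The index is built once per document; each search is then a single hash lookup rather than a scan of the text.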

--
Hope this helps
Jay B. Harlow [MVP - Outlook]
.NET Application Architect, Enthusiast, & Evangelist
T.S. Bradley - http://www.tsbradley.net
Jun 18 '06 #7
teo
On Sat, 17 Jun 2006 22:42:07 -0500, "Jay B. Harlow [MVP - Outlook]"
<Ja************@tsbradley.net> wrote:
> | The fact is that I'm building a searching engine, and I need to
> | format the searched word as Bold,
> Rather than searching for the text each time, have you considered "indexing"
> each document?
>
> Then, when you need to do a search, you simply check the index, which
> returns where in the text the word was found.


I know that there is such an option,
but I didn't consider it as a solution
because it seemed to me that it would require
a lot of work up front;
also, searching for a given word in the resulting huge list of
indexed words
would, I think, take a lot of time, maybe as long as
searching for the given word every time.
I'm only guessing; I have no benchmark....

Maybe I'll implement such a solution
once I've finished the method I'm developing now.

----------

Another question:

the RegExp sample we discussed above
returns 30 + 30 chars, regardless of where the 30 on the left start.
I'll try to explain what I mean.

I'd like the chunk of text on the left
to start where the sentence containing the given word starts
(so that the very first letter is capitalized),
the way Google displays its results;
that is, if you search for 'Lewinsky',
Google returns:

To maintain the *Lewinsky* Story's original feel, we will leave much
of this ... These were gifts the president had originally given
to Ms. Lewinsky himself. ...

In this case,
the letter 'T' is at position -16,
so I give up the preceding 14 chars,
start right at position -16,
and increase the chunk of text on the
right to 44 (= 30 + 14).

If the 'T' isn't within the 30 chars on the left,
no problem, I accept the old 30+30 solution.

Is it possible?

-----

Basically,
we need to find the . (dot) char
that signals that a sentence is about to start (within the 30 chars on the left).

If a single RegExp doesn't work,
we could maybe double the first RegExp (60 + 60)
and then
find the dot char with a second RegExp
and simply extract the 60-char chunk of
text that follows it.

What about this?
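
A sketch of the rule described above, done with IndexOf/LastIndexOf rather than a regular expression. If a dot is found within the `radius` characters to the left of the word, the snippet starts just after it and the characters given up on the left are added on the right; otherwise it falls back to 30 + 30. `ExtractFromSentenceStart` is a hypothetical name. Treating every dot as a sentence end is a simplification, so an abbreviation such as "Ms." directly before the word would be mistaken for one.

```vbnet
Imports System

Module SentenceSnippetDemo
    ' Hypothetical helper implementing the "start at the sentence" rule.
    Function ExtractFromSentenceStart(ByVal text As String, ByVal word As String, ByVal radius As Integer) As String
        Dim pos As Integer = text.IndexOf(word, StringComparison.Ordinal)
        If pos < 0 Then Return String.Empty

        ' Default: the plain window of radius chars on the left.
        Dim first As Integer = Math.Max(0, pos - radius)

        If pos > first Then
            ' Search backwards from the char before the word, staying inside the left window.
            Dim dot As Integer = text.LastIndexOf("."c, pos - 1, pos - first)
            If dot >= 0 Then
                first = dot + 1
                ' Skip the blanks that follow the dot.
                While first < pos AndAlso Char.IsWhiteSpace(text(first))
                    first += 1
                End While
            End If
        End If

        ' Chars not used on the left are added on the right, so the total
        ' stays at word + 2 * radius (for example 16 + 44 instead of 30 + 30).
        Dim length As Integer = Math.Min(word.Length + 2 * radius, text.Length - first)
        Return text.Substring(first, length)
    End Function

    Sub Main()
        Dim text As String = "Sylvia Brunner, a marine mammals researcher " & _
            "at the museum in Fairbanks, identified the decomposing carcass and oversaw " & _
            "its recovery on Wednesday. The ""bloated, black thing on the beach"" was " & _
            "about 12 feet from the river's edge, she said."
        ' "bloated" has a sentence-ending dot within 30 chars on its left,
        ' so the snippet starts at "The".
        Console.WriteLine(ExtractFromSentenceStart(text, "bloated", 30))
    End Sub
End Module
```

For "carcass" in the same text there is no dot within 30 characters on the left, so the function returns the ordinary 30 + 30 window.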



Jun 18 '06 #8
