Parse text into words? - Visual Basic .NET

jim_adams

I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim

Jun 22 '06 #1

Travers Naran

You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().
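
To make the two regex options concrete, here is a rough sketch rather than a tuned solution. The module and function names are made up for illustration, and the word pattern (runs of letters) is an assumption about what counts as a word. RegexOptions.Compiled is the standard .NET option for compiling the pattern up front.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexWords
    ' Separator class taken from the post; "+" collapses runs of separators.
    Private ReadOnly Separators As New Regex("[ ,.!?;:]+", RegexOptions.Compiled)

    ' Assumed definition of a word: one or more Unicode letters.
    Private ReadOnly WordPattern As New Regex("\p{L}+", RegexOptions.Compiled)

    ' Option 1: split on separators. Text that starts or ends with a
    ' separator yields an empty string at that end of the array.
    Function SplitWords(ByVal text As String) As String()
        Return Separators.Split(text)
    End Function

    ' Option 2: match the words themselves and ignore everything else.
    Function MatchWords(ByVal text As String) As List(Of String)
        Dim result As New List(Of String)
        For Each m As Match In WordPattern.Matches(text)
            result.Add(m.Value)
        Next
        Return result
    End Function
End Module
```

For gigabytes of input you would presumably feed this one line or block at a time rather than one giant string.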

The other option is to write a lexical analyzer (lexer). There might
be some .NET equivalents of the old reliable Lex and Flex. I'm not sure
they'd be faster in this case, and they seem like massive overkill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)

Jun 22 '06 #2

Scott M.

Why can't you split on the space and strip out the punctuation (since there
will only be a limited number of punctuation types)? This seems to be the
most efficient and simple way to do it.

Dim x As String = veryLargeString

' Turn each punctuation mark that is followed by a space into a single space
x = x.Replace(", ", " ")
x = x.Replace(". ", " ")
x = x.Replace(": ", " ")
x = x.Replace("; ", " ")

' Split the cleaned-up text on spaces
Dim y As String() = x.Split(" "c)

Jun 22 '06 #3

Cor Ligthert [MVP]

Jim,

If I understand you correctly, the combination of the VB function InStr and
a SortedList will be the quickest way to achieve what you want.

You loop through your text, and each time InStr finds a delimiter you update
the starting point for the next InStr call and store the word you found as
the key of a key/value pair in the SortedList.
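
A minimal sketch of that loop, assuming words are separated by single spaces and that punctuation has already been dealt with. The module and function names are hypothetical. InStr and Mid are the 1-based VB runtime functions.

```vbnet
Imports System.Collections.Generic

Module InStrWordScan
    ' Hypothetical helper: collects the unique space-delimited words of text.
    Function UniqueWords(ByVal text As String) As SortedList(Of String, Boolean)
        Dim words As New SortedList(Of String, Boolean)(StringComparer.Ordinal)
        Dim start As Integer = 1 ' InStr and Mid count from 1

        While start <= text.Length
            Dim pos As Integer = InStr(start, text, " ")
            If pos = 0 Then pos = text.Length + 1 ' no more spaces: take the rest

            If pos > start Then ' skip empty pieces caused by repeated spaces
                Dim word As String = Mid(text, start, pos - start)
                If Not words.ContainsKey(word) Then words.Add(word, True)
            End If

            start = pos + 1 ' continue after the delimiter just found
        End While

        Return words
    End Function
End Module
```

The Boolean value is only a placeholder; the keys are the word list.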

http://msdn.microsoft.com/library/de...vafctinstr.asp

http://msdn.microsoft.com/library/de...classtopic.asp

With Regex you can be sure of one thing: it will probably take at least 50
times longer than the approach above.

I hope this helps,

Cor

Jun 23 '06 #4

Cor Ligthert [MVP]

Scott,

In the past I have suggested this as a kind of seventh alternative (more for fun).

It works, but it is slow with huge strings, even slower than Regex.

(We tested this once in this newsgroup; maybe you will remember it now that
I mention it.)

Cor

Jun 23 '06 #5

Larry Lard

Travers Naran wrote:

> ji*******@hotmail.com wrote:
>> I need a very efficient way to parse large amounts of text (GBs) on
>> word boundaries. Words will then be added to an array as long as they
>> haven't already been added. Splitting on a space is a bit too basic
>> since punctuation will remain. Maybe regex?
>
> You've got a few choices. The Regex split can do what you want; just
> split on [ ,.!?;:]. You could also define a regex for your words and
> use Matches().

Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

> The other option is to write a lexical analyzer (lexer). There might
> be some .Net equivalents of the old reliable Lex and Flex. Not sure if
> they'd be faster in this case, and seem like massive overkill to me.
>
> Or if you're really insane, you can hand-write a lexical analyzer. :-)

No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Proceed through this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Done.
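
In code, the steps above might look something like this. It is only a sketch: the names are invented, "process it" is assumed to mean "add to a set of unique words", and HashSet(Of String) needs .NET 3.5 or later (on 2.0 a Dictionary(Of String, Boolean) would play the same role).

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module CharScan
    ' Hypothetical helper: one pass over the characters, collecting unique words.
    Function UniqueWords(ByVal text As String) As HashSet(Of String)
        Dim seen As New HashSet(Of String)()
        Dim current As New StringBuilder() ' the 'current word'

        For Each c As Char In text
            If Char.IsLetter(c) Then
                current.Append(c)
            ElseIf current.Length > 0 Then
                ' Non-letter: the current word is complete
                seen.Add(current.ToString())
                current.Length = 0 ' reset to the empty string
            End If
        Next

        ' Flush a word that runs up to the end of the text
        If current.Length > 0 Then seen.Add(current.ToString())
        Return seen
    End Function
End Module
```

For files too large to hold in memory, the same loop should work on blocks read through a StreamReader, as long as the 'current word' is carried over between blocks.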

--
Larry Lard
Replies to group please

Jun 23 '06 #6

guy

Jim,
is it essential that ALL words are added to your array? If not, you could
probably optimise this by only processing the first few GB, perhaps checking how
many new words have been added after each GB, or each 10000 words, or whatever.

My bet is that you will quite quickly find that you are adding very few
words, and that these will be highly specialized ones, so you only need to
read the first few GB.

hth

guy

Jun 23 '06 #7

jim_adams

I need a list of unique words among all documents. Since many of the
documents will contain technical terms, now and then it's likely that a
new term will pop up.

Jun 23 '06 #8

jim_adams

Hi Cor,

Thanks for the tip. I was always under the impression that doing
string parsing in a loop was very inefficient, and that regex was the
"enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a BinarySearch against a sorted
array of the words found so far.

Maybe I should go down this brute force route.
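
For step 3, something along these lines is what I have in mind. It is a sketch only: AddIfNew is a made-up name, and I am assuming a List(Of String) kept in sorted order instead of a raw array so that new words can be inserted in place.

```vbnet
Imports System.Collections.Generic

Module SortedWordList
    ' Hypothetical helper: inserts word into the sorted list unless it is
    ' already there. Returns True when the word was new.
    Function AddIfNew(ByVal sorted As List(Of String), ByVal word As String) As Boolean
        Dim index As Integer = sorted.BinarySearch(word, StringComparer.Ordinal)
        If index >= 0 Then Return False ' already present

        ' A negative result is the bitwise complement of the insertion point
        sorted.Insert(Not index, word)
        Return True
    End Function
End Module
```

The lookup is a binary search, but each insert still shifts the tail of the list, so it may get slow once the word list is large.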

Thanks,

Jim

Jun 23 '06 #9

Travers Naran

Larry Lard wrote:

> Travers Naran wrote:
>> You've got a few choices. The Regex split can do what you want; just
>> split on [ ,.!?;:]. You could also define a regex for your words and
>> use Matches().
>
> Regex is overkill for this problem, and for gigabytes of text we need
> to think about performance slightly earlier than we normally would.

Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.

>> Or if you're really insane, you can hand-write a lexical analyzer. :-)
>
> No _lexical_ analysis is involved here - all we are doing is parsing.
> This seems to me to be the simplest approach:
>
> - Get the text into a Char array
> - Proceed through this array one Char at a time, maintaining an
> initially-empty 'current word'
> - When a character is read:
> - - if it is a letter character, append it to the 'current word'
> - - if it is not a letter character, the 'current word' is complete:
> process it, and reset the 'current word' to the empty string

Um, that IS lexical analysis.

Jun 23 '06 #10

Larry Lard

Travers Naran wrote:

> Larry Lard wrote:
>> Travers Naran wrote:
>>> You've got a few choices. The Regex split can do what you want; just
>>> split on [ ,.!?;:]. You could also define a regex for your words and
>>> use Matches().
>>
>> Regex is overkill for this problem, and for gigabytes of text we need
>> to think about performance slightly earlier than we normally would.
>
> Have you tested the performance yet? Because a pre-compiled regex can
> be surprisingly fast.

Sure, but is it going to be faster than the below?

>>> Or if you're really insane, you can hand-write a lexical analyzer. :-)
>>
>> No _lexical_ analysis is involved here - all we are doing is parsing.
>> This seems to me to be the simplest approach:
>>
>> - Get the text into a Char array
>> - Proceed through this array one Char at a time, maintaining an
>> initially-empty 'current word'
>> - When a character is read:
>> - - if it is a letter character, append it to the 'current word'
>> - - if it is not a letter character, the 'current word' is complete:
>> process it, and reset the 'current word' to the empty string
>
> Um, that IS lexical analysis.

My mistake.

--
Larry Lard
Replies to group please

Jun 23 '06 #11
