Parse text into words?

I need a very efficient way to parse large amounts of text (GBs) on
word boundaries. Words will then be added to an array as long as they
haven't already been added. Splitting on a space is a bit too basic
since punctuation will remain. Maybe regex?

Thanks for any insights.

Jim

Jun 22 '06 #1
ji*******@hotmail.com wrote:
> I need a very efficient way to parse large amounts of text (GBs) on
> word boundaries. Words will then be added to an array as long as they
> haven't already been added. Splitting on a space is a bit too basic
> since punctuation will remain. Maybe regex?


You've got a few choices. The Regex split can do what you want; just
split on [ ,.!?;:]. You could also define a regex for your words and
use Matches().
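
A minimal sketch of both regex options, assuming .NET 2.0 or later. The module and function names are made up for illustration, the "+" after the character class is an addition to merge runs of separators, and "\w+" is only one possible definition of a word:

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexWordSketch
    ' Option 1: split on the separator class suggested in the post.
    ' The trailing "+" (an assumption, not in the post) merges runs of
    ' separators such as ", " so fewer empty entries come back.
    Public Function SplitWords(ByVal text As String) As List(Of String)
        Dim words As New List(Of String)
        For Each piece As String In Regex.Split(text, "[ ,.!?;:]+")
            ' Split can still return an empty entry at either end.
            If piece.Length > 0 Then words.Add(piece)
        Next
        Return words
    End Function

    ' Option 2: describe what a word looks like and collect every match.
    Public Function MatchWords(ByVal text As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In Regex.Matches(text, "\w+")
            words.Add(m.Value)
        Next
        Return words
    End Function
End Module
```

Both functions return the words in document order, duplicates included; removing duplicates is a separate step.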

The other option is to write a lexical analyzer (lexer). There might
be some .Net equivalents of the old reliable Lex and Flex. Not sure if
they'd be faster in this case, and they seem like massive overkill to me.

Or if you're really insane, you can hand-write a lexical analyzer. :-)

Jun 22 '06 #2
Why can't you split on the space and remove the punctuation first (there
are only a limited number of punctuation marks)? This seems to be the
most efficient and simple way to do it.

' Assumes veryLargeString already holds the text to parse.
Dim x As String = veryLargeString

' Replace each punctuation-plus-space pair with a single space.
x = x.Replace(", ", " ")
x = x.Replace(". ", " ")
x = x.Replace(": ", " ")
x = x.Replace("; ", " ")

' Split on spaces to get the individual words.
Dim y As String() = x.Split(" "c)

Jun 22 '06 #3
Jim,

If I understand you correctly, the combination of the VB InStr function
and a SortedList will be the quickest way to achieve what you want.

You then go through your text in a loop: each time a word is found, you
update the starting point passed to InStr and store the word as the key
of a key/value pair in the SortedList.

http://msdn.microsoft.com/library/de...vafctinstr.asp

http://msdn.microsoft.com/library/de...classtopic.asp
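
One way the InStr and SortedList combination might look, as a rough sketch only. It assumes words are separated by single spaces, trims a made-up punctuation set from each token, and uses the generic SortedList(Of String, Boolean) from .NET 2.0; the module and function names are hypothetical:

```vbnet
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module InStrWordSketch
    ' Punctuation trimmed from each token; this set is an assumption.
    Private ReadOnly Punctuation As Char() = ",.!?;:".ToCharArray()

    Public Function UniqueWords(ByVal text As String) As SortedList(Of String, Boolean)
        Dim words As New SortedList(Of String, Boolean)
        Dim start As Integer = 1                  ' InStr positions are 1-based
        While start <= text.Length
            ' Find the next space at or after the current starting point.
            Dim spaceAt As Integer = InStr(start, text, " ")
            If spaceAt = 0 Then spaceAt = text.Length + 1   ' last token
            Dim token As String = Mid(text, start, spaceAt - start).Trim(Punctuation)
            ' The word is the key, so a repeated word is stored only once.
            If token.Length > 0 Then words(token) = True
            ' Move the starting point just past the space that was found.
            start = spaceAt + 1
        End While
        Return words
    End Function
End Module
```

The Boolean value is a placeholder; only the keys matter here. Comparison is case-sensitive, so "The" and "the" count as different words unless the text is lower-cased first.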

One thing you can be sure of with Regex: it will probably take at least
50 times longer than the approach above.

I hope this helps,

Cor

Jun 23 '06 #4
Scott,

In the past I have suggested this as a kind of seventh alternative (more
for fun).

It works, but it is slow with huge strings, even slower than Regex.

(We tested this once in this newsgroup; maybe you will remember it now
that I mention it.)

Cor

"Scott M." <s-***@nospam.nospam> wrote in message
news:OR**************@TK2MSFTNGP03.phx.gbl...
> Why can't you split on the space and remove the punctuation first (there
> are only a limited number of punctuation marks)? This seems to be the
> most efficient and simple way to do it.
>
> Dim x As String = veryLargeString
>
> x = x.Replace(", ", " ")
> x = x.Replace(". ", " ")
> x = x.Replace(": ", " ")
> x = x.Replace("; ", " ")
>
> Dim y As String() = x.Split(" "c)

Jun 23 '06 #5

Travers Naran wrote:
> ji*******@hotmail.com wrote:
>> I need a very efficient way to parse large amounts of text (GBs) on
>> word boundaries. Words will then be added to an array as long as they
>> haven't already been added. Splitting on a space is a bit too basic
>> since punctuation will remain. Maybe regex?
>
> You've got a few choices. The Regex split can do what you want; just
> split on [ ,.!?;:]. You could also define a regex for your words and
> use Matches().


Regex is overkill for this problem, and for gigabytes of text we need
to think about performance slightly earlier than we normally would.

> The other option is to write a lexical analyzer (lexer). There might
> be some .Net equivalents of the old reliable Lex and Flex. Not sure if
> they'd be faster in this case, and they seem like massive overkill to me.
>
> Or if you're really insane, you can hand-write a lexical analyzer. :-)


No _lexical_ analysis is involved here - all we are doing is parsing.
This seems to me to be the simplest approach:

- Get the text into a Char array
- Proceed through this array one Char at a time, maintaining an
initially-empty 'current word'
- When a character is read:
- - if it is a letter character, append it to the 'current word'
- - if it is not a letter character, the 'current word' is complete:
process it, and reset the 'current word' to the empty string

Done.
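
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: "process it" is taken to mean "add it to a list", a StringBuilder stands in for the 'current word', and a final flush is added for text that ends in a letter (the list above does not mention that case). The names are hypothetical:

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module CharScanSketch
    Public Function ScanWords(ByVal text As String) As List(Of String)
        Dim words As New List(Of String)
        Dim current As New StringBuilder()      ' the 'current word'
        For Each c As Char In text.ToCharArray()
            If Char.IsLetter(c) Then
                ' Letter: extend the current word.
                current.Append(c)
            ElseIf current.Length > 0 Then
                ' Non-letter: the current word is complete.
                words.Add(current.ToString())
                current.Length = 0              ' reset to the empty string
            End If
        Next
        ' Flush the last word when the text does not end in a separator.
        If current.Length > 0 Then words.Add(current.ToString())
        Return words
    End Function
End Module
```

Because only letters extend a word, an apostrophe splits "It's" into "It" and "s", and digits are dropped; whether that is acceptable depends on the documents. For gigabytes of input the same loop would presumably run over chunks read from a stream rather than one Char array.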

--
Larry Lard
Replies to group please

Jun 23 '06 #6
guy
Jim,
Is it essential that ALL words are added to your array? If not, you could
probably optimise this by only processing the first few GB, perhaps checking
how many words have been added for each GB, or every 10,000 words, or whatever.

My bet is that you will quite quickly find that you are adding very few
words, and that these will be highly specialized ones, so you only need to
read the first few GB.
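
A small sketch of how that check could be made, assuming the words have already been extracted by some other means. It reports how many previously unseen words turn up in each batch; the function name and the batchSize parameter are made up for illustration:

```vbnet
Imports System.Collections.Generic

Module NewWordRateSketch
    ' For each batch of batchSize words, count how many were not seen before.
    Public Function NewWordsPerBatch(ByVal words As IEnumerable(Of String), ByVal batchSize As Integer) As List(Of Integer)
        Dim seen As New Dictionary(Of String, Boolean)
        Dim counts As New List(Of Integer)
        Dim inBatch As Integer = 0
        Dim added As Integer = 0
        For Each word As String In words
            If Not seen.ContainsKey(word) Then
                seen.Add(word, True)
                added += 1
            End If
            inBatch += 1
            If inBatch = batchSize Then
                ' Batch complete: record its count and start a new one.
                counts.Add(added)
                inBatch = 0
                added = 0
            End If
        Next
        ' Record a partial final batch, if there is one.
        If inBatch > 0 Then counts.Add(added)
        Return counts
    End Function
End Module
```

If the counts fall towards zero, later batches are adding little; if rare terms keep appearing, the whole input still has to be read.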

hth

guy

Jun 23 '06 #7
I need a list of unique words among all documents. Since many of the
documents will contain technical terms, now and then it's likely that a
new term will pop up.

Jun 23 '06 #8
Hi Cor,

Thanks for the tip. I was always under the impression that doing
string parsing in a loop was very inefficient, and that regex was the
"enlightened" way.

My first hunch would have been to:

1) replace punctuation with spaces
2) split on spaces
3) step through the array one by one, doing a BinarySearch against a
sorted array.

Maybe I should go down this brute force route.
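
The three steps could look something like this. It is a sketch under assumptions: the punctuation set is invented, a List(Of String) stands in for the sorted array so that new words can be inserted in place, and comparison is ordinal and case-sensitive. The names are hypothetical:

```vbnet
Imports System
Imports System.Collections.Generic

Module SortedArraySketch
    Public Function UniqueSorted(ByVal text As String) As List(Of String)
        ' 1) Replace punctuation with spaces (the set is an assumption).
        For Each p As Char In ",.!?;:"
            text = text.Replace(p, " "c)
        Next
        ' 2) Split on spaces, dropping the empty entries left by runs of spaces.
        Dim tokens As String() = text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
        ' 3) Binary-search a sorted list and insert only the words that are new.
        Dim sorted As New List(Of String)
        For Each token As String In tokens
            Dim index As Integer = sorted.BinarySearch(token, StringComparer.Ordinal)
            ' A negative result is the bitwise complement of the insert position.
            If index < 0 Then sorted.Insert(Not index, token)
        Next
        Return sorted
    End Function
End Module
```

Each Replace call and the Split copy the whole string, so on very large inputs this would need to run chunk by chunk, and the cost of the copies is worth measuring.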

Thanks,

Jim

Cor Ligthert [MVP] wrote:
> Jim,
>
> If I understand you correctly, the combination of the VB InStr function
> and a SortedList will be the quickest way to achieve what you want.
>
> You then go through your text in a loop: each time a word is found, you
> update the starting point passed to InStr and store the word as the key
> of a key/value pair in the SortedList.
>
> http://msdn.microsoft.com/library/de...vafctinstr.asp
>
> http://msdn.microsoft.com/library/de...classtopic.asp
>
> One thing you can be sure of with Regex: it will probably take at least
> 50 times longer than the approach above.
>
> I hope this helps,
>
> Cor

Jun 23 '06 #9

Larry Lard wrote:
> Travers Naran wrote:
>> You've got a few choices. The Regex split can do what you want; just
>> split on [ ,.!?;:]. You could also define a regex for your words and
>> use Matches().
>
> Regex is overkill for this problem, and for gigabytes of text we need
> to think about performance slightly earlier than we normally would.


Have you tested the performance yet? Because a pre-compiled regex can
be surprisingly fast.
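
For anyone who wants to measure it, a minimal sketch of a pre-compiled regex with a simple timer around it. The "\w+" pattern and all names are assumptions, and no timings are claimed here; the numbers depend entirely on the input and the machine:

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module CompiledRegexSketch
    ' Built once and reused. RegexOptions.Compiled trades a higher
    ' start-up cost for faster matching afterwards.
    Private ReadOnly WordPattern As New Regex("\w+", RegexOptions.Compiled)

    ' Counts the words matched by the pattern.
    Public Function CountWords(ByVal text As String) As Integer
        Return WordPattern.Matches(text).Count
    End Function

    ' Returns the elapsed milliseconds for one pass over the text.
    Public Function TimeCount(ByVal text As String) As Long
        Dim watch As Stopwatch = Stopwatch.StartNew()
        CountWords(text)
        watch.Stop()
        Return watch.ElapsedMilliseconds
    End Function
End Module
```

Timing the same input through a hand-written character loop would give a like-for-like comparison.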

>> Or if you're really insane, you can hand-write a lexical analyzer. :-)
>
> No _lexical_ analysis is involved here - all we are doing is parsing.
> This seems to me to be the simplest approach:
>
> - Get the text into a Char array
> - Proceed through this array one Char at a time, maintaining an
> initially-empty 'current word'
> - When a character is read:
> - - if it is a letter character, append it to the 'current word'
> - - if it is not a letter character, the 'current word' is complete:
> process it, and reset the 'current word' to the empty string


Um, that IS lexical analysis.

Jun 23 '06 #10

Travers Naran wrote:
> Larry Lard wrote:
>> Travers Naran wrote:
>>> You've got a few choices. The Regex split can do what you want; just
>>> split on [ ,.!?;:]. You could also define a regex for your words and
>>> use Matches().
>> Regex is overkill for this problem, and for gigabytes of text we need
>> to think about performance slightly earlier than we normally would.
>
> Have you tested the performance yet? Because a pre-compiled regex can
> be surprisingly fast.


Sure, but is it going to be faster than the below?

>>> Or if you're really insane, you can hand-write a lexical analyzer. :-)
>
>> No _lexical_ analysis is involved here - all we are doing is parsing.
>> This seems to me to be the simplest approach:
>>
>> - Get the text into a Char array
>> - Proceed through this array one Char at a time, maintaining an
>> initially-empty 'current word'
>> - When a character is read:
>> - - if it is a letter character, append it to the 'current word'
>> - - if it is not a letter character, the 'current word' is complete:
>> process it, and reset the 'current word' to the empty string
>
> Um, that IS lexical analysis.


My mistake.

--
Larry Lard
Replies to group please

Jun 23 '06 #11
