Tokenizing a large buffer

Hi,
I have a large text string and a bunch of regular expression patterns
I need to find within it - in other words, I want to find all the
'tokens' inside it. Of course I could use the regex engine; the thing
is, if I have 500 different patterns, that means 500 searches over the
buffer. Does anybody have a general idea of how this could be sped
up? In other words, I am looking to do what the old 'lex' tool in Unix
used to do - one pass over the string, finding all patterns and turning
them into tokens... what I need is not THAT complex, so I am wondering
if the .NET regex engine can search for many patterns in a single pass.

And if the answer is no - is there any lex implementation in .NET? :)

Thanks
Jonathan
Dec 18 '07 #1
On Tue, 18 Dec 2007 08:04:41 -0800, Jonathan Sion <yo**@nobhillsoft.com>
wrote:
>I have a large text string and a bunch of regular expression patterns
>I need to find within it <snip> I am looking to do what the old 'lex'
>tool in Unix used to do - one pass over the string, finding all patterns
>and turning them into tokens...
>
>And if the answer is no - is there any lex implementation in .NET? :)
I don't know how lex was implemented, and I don't know whether a state
machine is the best way to solve the problem. But I do know that it's a
reasonable way to solve the problem, and that I wrote a simple
implementation a while ago and posted it here. You can see it in this post:
http://groups.google.com/group/micro...06f696d4500b77

I see some things in the implementation that I'd probably do differently
if I were doing it again today, but it ought to work. Or at least
something like it.
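
Just to give the flavor of the state-machine idea, here is a very
stripped-down sketch - it handles literal keyword tokens only, nothing
like full regular expressions - that folds all the keywords into one trie
(a simple state graph) and finds every occurrence with a single
left-to-right sweep over the buffer, using a short lookahead from each
position (C#, and the token set is obviously made up):

using System;
using System.Collections.Generic;

// Stripped-down single-pass matcher: literal keywords only, folded into
// one trie (a simple state graph).
class TrieNode
{
    public readonly Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
    public string Token;    // set when a keyword ends at this node
}

class SinglePassMatcher
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string keyword)
    {
        TrieNode node = _root;
        foreach (char c in keyword)
        {
            TrieNode child;
            if (!node.Next.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Next[c] = child;
            }
            node = child;
        }
        node.Token = keyword;
    }

    public void Scan(string buffer)
    {
        for (int start = 0; start < buffer.Length; start++)
        {
            // Walk the trie as far as the characters at 'start' allow,
            // reporting every keyword that ends along the way.
            TrieNode node = _root;
            int i = start;
            while (i < buffer.Length && node.Next.TryGetValue(buffer[i], out node))
            {
                if (node.Token != null)
                    Console.WriteLine("'{0}' at {1}", node.Token, start);
                i++;
            }
        }
    }

    static void Main()
    {
        SinglePassMatcher matcher = new SinglePassMatcher();
        matcher.Add("SELECT");
        matcher.Add("SET");
        matcher.Add("FROM");
        matcher.Scan("SELECT x FROM t");
    }
}

Handling character classes, repetition and so on means more states and
transitions, but the shape of the thing is the same.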

For what it's worth, just how large is this "large text string"? And how
frequently do you need to do this? If this is something that your code
needs to do over and over on a frequent basis, optimizing the
implementation would be useful. But I'd guess that 500 searches on even a
100K string or so wouldn't take that long, just to do it once.

There's some value in using the brute-force method, as it will keep the
code a _lot_ simpler. I wouldn't worry about the performance unless you
have a good reason to believe it will be a problem.

Hope that helps.

Pete
Dec 18 '07 #2

If you create the regexes as compiled (via the constructor options) and
cache them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since each regex has to match at the start of the remaining text, they
can all fail very fast when they don't match.

In short, I would first do some testing to make sure you're not trying
to solve a problem that doesn't exist.
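
Roughly what I have in mind (just a sketch, with made-up token patterns):
the \G anchor means a pattern can only match at the position you pass to
Match(), so every regex that doesn't apply at the current position fails
immediately instead of scanning the rest of the buffer.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class AnchoredScan
{
    static void Main()
    {
        // Placeholder token definitions; \G = "match exactly at the
        // position passed to Match", so a miss fails at once.
        KeyValuePair<string, Regex>[] tokenDefs = new KeyValuePair<string, Regex>[]
        {
            new KeyValuePair<string, Regex>("WS",     new Regex(@"\G\s+",          RegexOptions.Compiled)),
            new KeyValuePair<string, Regex>("NUMBER", new Regex(@"\G\d+",          RegexOptions.Compiled)),
            new KeyValuePair<string, Regex>("IDENT",  new Regex(@"\G[A-Za-z_]\w*", RegexOptions.Compiled)),
        };

        string buffer = "SELECT 42 FROM foo";
        int pos = 0;
        while (pos < buffer.Length)
        {
            bool matched = false;
            foreach (KeyValuePair<string, Regex> def in tokenDefs)
            {
                Match m = def.Value.Match(buffer, pos);
                if (m.Success)
                {
                    Console.WriteLine("{0}: '{1}' at {2}", def.Key, m.Value, pos);
                    pos += m.Length;   // advance past the token just found
                    matched = true;
                    break;
                }
            }
            if (!matched)
                pos++;   // no token recognized here; skip one character
        }
    }
}

The order of the token definitions matters, of course, since the first
one that matches at the current position wins.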

HTH,

Sam


On Tue, 18 Dec 2007 08:04:41 -0800 (PST), Jonathan Sion
<yo**@nobhillsoft.com> wrote:
>Hi,
>I have a large text string and a bunch of regular expression patterns
>I need to find within it - in other words, I want to find all the
>'tokens' inside it. <snip>
>
>And if the answer is no - is there any lex implementation in .NET? :)
Dec 18 '07 #3
On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <sa********@nomail.com>
wrote:
>If you create the regexes as compiled (via the constructor options) and
>cache them in memory, and make sure that each regex is anchored to the
>beginning of the search string <snip>
>
>In short, I would first do some testing to make sure you're not trying
>to solve a problem that doesn't exist.
In addition, how long can a Regex pattern string be? Assuming there's no
practical limit -- that is, you can put any arbitrary string there -- then
since Regex supports alternation ("or") in the pattern, you could just
build a single pattern string with all of the tokens in it.

So rather than looping with multiple searches using Regex, just loop over
the tokens when creating the pattern string. Then let Regex do all the
hard work.
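
For example (a rough sketch only - the token names and patterns here are
made up), using a named group per alternative so you can still tell which
token each match came from:

using System;
using System.Text.RegularExpressions;

class CombinedPattern
{
    static void Main()
    {
        // Made-up token definitions, OR-ed into one pattern. Each
        // alternative gets a named group identifying the token.
        string pattern = @"(?<Keyword>\b(?:SELECT|FROM|WHERE)\b)"
                       + @"|(?<Number>\b\d+\b)"
                       + @"|(?<Identifier>[A-Za-z_]\w*)";

        Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

        foreach (Match m in regex.Matches("SELECT 42 FROM customers"))
        {
            foreach (string groupName in regex.GetGroupNames())
            {
                // Group "0" is the whole match; report only the named
                // alternative that actually matched this token.
                if (groupName != "0" && m.Groups[groupName].Success)
                    Console.WriteLine("{0}: '{1}' at {2}", groupName, m.Value, m.Index);
            }
        }
    }
}

One thing to watch: when two alternatives could match at the same
position, the leftmost alternative in the pattern wins, so the order in
which you emit the token patterns matters.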

Would this be as fast or faster than the state graph? I don't know...it
depends on whether the Regex authors put some effort into optimizing that
case. I don't know enough about Regex (implementation _or_ API :) ) to
have an answer to that. But even if they didn't, obviously Sam and I
agree that the simpler code is better as long as there's no direct
evidence that performance is actually going to be an issue.

Pete
Dec 18 '07 #4
Hi Pete,
I thought about OR-ing them all into a single pattern. That would
certainly improve performance somewhat, but the problem is that many of
the regular expressions overlap, and when you run a combined pattern the
search stops on the first one found, while I need all of them. Right now
I have around 500 regex patterns (one for each token); some could be
OR-ed together, which would save me some time, but some cannot. I guess
lex was created for a reason, and I thought to ask here whether anybody
knows of a .NET implementation.

For those who asked whether performance really is a problem - good
question, and the answer is yes. Running hundreds of search patterns
over a buffer could take 2-3 minutes, and I have plenty of such buffers
(basically, I am running over all the stored procedures in a database to
find errors in the code - I am building a T-SQL compiler), so running
over an entire database, if it's large, means executing it and going for
a drink every time. I will proceed with Pete's advice; I am sure it will
help performance quite a bit, though like I said, I don't think it's the
ultimate solution.

Thanks,
J
Dec 19 '07 #5
On Dec 19, 1:49 pm, Jonathan Sion <y...@nobhillsoft.com> wrote:

<snip>
>For those who asked whether performance really is a problem - good
>question, and the answer is yes. Running hundreds of search patterns
>over a buffer could take 2-3 minutes, and I have plenty of such buffers
When you say "could take 2-3 minutes" do you mean you've measured it
with a particular set of search patterns?

Jon
Dec 19 '07 #6
Yes - it could take something like 2 minutes for a single stored procedure
(a really large one), and when it comes to a database of 700 procedures
it means going out for an adult beverage every time I run it :)
Dec 20 '07 #7
On Dec 20, 3:53 pm, Jonathan Sion <y...@nobhillsoft.com> wrote:
>It could take something like 2 minutes for a single stored procedure
>(a really large one), and when it comes to a database of 700 procedures
>it means going out for an adult beverage every time I run it :)
Are you already precompiling the regular expressions? How complicated
are the expressions?
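
By precompiling I mean something along these lines - each Regex
constructed once, with RegexOptions.Compiled, and then reused for every
buffer rather than rebuilt per search (the names below are just
placeholders):

using System;
using System.Text.RegularExpressions;

class PrecompiledDemo
{
    // Constructed once and reused; RegexOptions.Compiled emits IL for the
    // pattern, which costs more up front but matches faster on every buffer.
    private static readonly Regex Identifier =
        new Regex(@"[A-Za-z_]\w*", RegexOptions.Compiled);

    static void Main()
    {
        // Stand-ins for the stored procedure texts.
        string[] buffers = { "SELECT a FROM t1", "UPDATE t2 SET b = 1" };

        foreach (string text in buffers)
        {
            foreach (Match m in Identifier.Matches(text))
                Console.WriteLine("'{0}' at {1}", m.Value, m.Index);
        }
    }
}

If you are currently constructing a new Regex inside the loop for every
buffer, that alone could account for a lot of the time.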

Jon
Dec 20 '07 #8

2 minutes for a single procedure is way too much. I suggest you post
some sample code and a sample procedure. You don't need to post all 500
expressions - just 2-3 representative ones and 2-3 representative
procedures, as well as the relevant code that generates the regexes and
runs the loop.

Sam

On Thu, 20 Dec 2007 07:53:40 -0800 (PST), Jonathan Sion
<yo**@nobhillsoft.com> wrote:
>Yes - it could take something like 2 minutes for a single stored procedure
>(a really large one), and when it comes to a database of 700 procedures
>it means going out for an adult beverage every time I run it :)
Dec 20 '07 #9
Hi,
I am sorry for the delay in my response. Here is a sample of the code. I
have an object, oTokBase, which I defined; it simply holds a regular
expression pattern (in its TokenDef string property) and also holds the
options that I pass to the .NET regex engine (case sensitivity,
multiline, etc.). This is the iteration, and I have about 500 patterns.
I have to plead ignorance when asked whether the regexes are 'compiled'
or not - is that worth reading about? (i.e. will it help performance?)
Also, I apologize for the VB.NET code in a C# group... the code is very
similar, and I came to this group knowing that the more serious
programmers would be here : )

Here's the code. Thanks!
--------------------------------------------------------

For Each oTokBase In Me.m_TokDefs

    oTokDef = oTokBase

    'Build the regex for this token definition and search the whole buffer:
    Dim oRegex As Regex = New Regex(oTokDef.TokenDef, oTokDef.RegExpOptions)
    Dim mc As MatchCollection = oRegex.Matches(Me.m_sBuffer)

    If mc.Count > 0 Then
        'Found one or more occurrences of this token:
        Dim m As Match
        Dim iCharPosInLine As Integer, iLine As Integer
        For Each m In mc
            Dim oToken As New Token(oTokDef.TokenDef, m.Value, oTokDef.TokenID)
            oToken.Position.AbsolutePos = m.Index
            Me.GetPosInLine(m.Index, iCharPosInLine, iLine)
            oToken.Position.CharPosInLine = iCharPosInLine
            oToken.Position.Line = iLine
            Me.TokenPositions.Add(oToken)
        Next
    End If

NextToken:
Next
Dec 27 '07 #10

"Jonathan Sion" <yo**@nobhillsoft.comwrote in message
news:4b**********************************@q77g2000 hsh.googlegroups.com...
>Hi,
>I have a large text string and a bunch of regular expression patterns
>I need to find within it <snip>
>
>And if the answer is no - is there any lex implementation in .NET? :)
google gives -- http://sourceforge.net/projects/csflex/

Dec 31 '07 #11
