
Tokenizing a large buffer

Hi,
I have a large text string, and a bunch of regular expression patterns
I need to find within it - in other words, I need to find all the
'tokens' inside it. Of course I could use the regex engine; the thing
is, if I have 500 different patterns, that means 500 searches over the
buffer. Does anybody have any general idea on how this could be sped
up? In other words, I am looking to do what the old 'LEX' tool on Unix
used to do - one pass over the string, finding all patterns and making
them tokens... What I need is not THAT complex, so I am wondering
if .NET regex could search for many patterns in a single pass.

And if the answer is no - is there a lex implementation for .NET? :)

Thanks
Jonathan
Dec 18 '07 #1
10 Replies


On Tue, 18 Dec 2007 08:04:41 -0800, Jonathan Sion <yo**@nobhillsoft.com>
wrote:
><snip>
>I am looking to do what the old 'LEX' tool on Unix used to do - one
>pass over the string, finding all patterns and making them tokens...
>
>And if the answer is no - is there a lex implementation for .NET? :)
I don't know how lex was implemented. And I don't know whether a state
machine is the best way to solve the problem. But I do know that it's a
reasonable way to solve the problem, and that I wrote a simple
implementation a while ago and posted it here. You can see it in this post:
http://groups.google.com/group/micro...06f696d4500b77

I see some things in the implementation that I'd probably do differently
if I were doing it again today, but it ought to work. Or at least
something like it.
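
For flavor, here's a toy C# sketch of the state-graph idea, reduced to
plain keywords rather than full regexes (the keyword list is made up):
build a character graph once, then walk it over the input.

    using System;
    using System.Collections.Generic;

    class TrieNode
    {
        // One outgoing edge per character; Token is set where a keyword ends.
        public Dictionary<char, TrieNode> Next = new Dictionary<char, TrieNode>();
        public string Token;
    }

    class StateGraphDemo
    {
        static void Main()
        {
            // Build the graph once from the keyword list.
            var root = new TrieNode();
            foreach (string kw in new[] { "select", "set", "int", "intersect" })
            {
                TrieNode node = root;
                foreach (char c in kw)
                {
                    if (!node.Next.TryGetValue(c, out TrieNode child))
                        node.Next[c] = child = new TrieNode();
                    node = child;
                }
                node.Token = kw;
            }

            // Scan: from each position, follow edges as far as they go,
            // reporting every keyword that ends along the way.
            string text = "select intersect set";
            for (int i = 0; i < text.Length; i++)
            {
                TrieNode node = root;
                for (int j = i; j < text.Length && node.Next.TryGetValue(text[j], out node); j++)
                    if (node.Token != null)
                        Console.WriteLine("'" + node.Token + "' at " + i);
            }
        }
    }

A real lexer generator collapses the patterns into a DFA so the scan
doesn't restart at every position, but the one-pass flavor is the same.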

For what it's worth, just how large is this "large text string"? And how
frequently do you need to do this? If this is something that your code
needs to do over and over on a frequent basis, optimizing the
implementation would be useful. But I'd guess that 500 searches on even a
100K string or so wouldn't take that long, just to do it once.

There's some value in using the brute-force method, as it will keep the
code a _lot_ simpler. I wouldn't worry about the performance unless you
have a good reason to believe it will be a problem.

Hope that helps.

Pete
Dec 18 '07 #2


If you generate the regexes as compiled (constructor param) and cache
them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text then they can
all fail very fast if they don't match.

In short, I would do some testing first to make sure you're not trying
to solve a problem that doesn't exist.
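
To make the anchored loop concrete, here's a rough C# sketch with
made-up patterns. One .NET wrinkle: it's \G, not ^, that anchors a
match at the position you pass to Match(text, pos).

    using System;
    using System.Text.RegularExpressions;

    class AnchoredScanDemo
    {
        static void Main()
        {
            // Made-up token patterns; \G pins each match to the current
            // scan position, so a non-match fails immediately.
            Regex[] regexes =
            {
                new Regex(@"\G\d+", RegexOptions.Compiled),
                new Regex(@"\G[A-Za-z_]\w*", RegexOptions.Compiled),
                new Regex(@"\G\s+", RegexOptions.Compiled),
            };

            string text = "foo 123 bar";
            int pos = 0;
            while (pos < text.Length)
            {
                Match hit = null;
                foreach (Regex r in regexes)
                {
                    Match m = r.Match(text, pos);
                    if (m.Success) { hit = m; break; }  // first pattern wins
                }
                if (hit == null) { pos++; continue; }   // skip unknown char
                Console.WriteLine("'" + hit.Value + "' at " + hit.Index);
                pos += hit.Length;
            }
        }
    }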

HTH,

Sam


On Tue, 18 Dec 2007 08:04:41 -0800 (PST), Jonathan Sion
<yo**@nobhillsoft.com> wrote:
><snip>
Dec 18 '07 #3

On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <sa********@nomail.com>
wrote:
>
>If you generate the regexes as compiled (constructor param) and cache
>them in memory, and make sure that each regex is anchored to the
>beginning of the search string, then I don't think looping through all
>the regexes as you move through the text will perform badly at all.
>Since the regexes have to match at the start of the text then they can
>all fail very fast if they don't match.
>
>In short, I would do some testing first to make sure you're not trying
>to solve a problem that doesn't exist.
In addition, how long can a Regex pattern string be? Assuming there's no
practical limit -- that is, you can put any arbitrary string there -- then
since Regex supports alternation (a boolean "or") in the search pattern,
you could just have a single pattern string with all of the tokens in it.

So rather than looping with multiple searches using Regex, just loop over
the tokens when building the search string. Then let Regex do all the hard
work.
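
Roughly like this (a C# sketch with made-up token definitions; wrapping
each token in a named group tells you which one matched):

    using System;
    using System.Text.RegularExpressions;

    class CombinedPatternDemo
    {
        static void Main()
        {
            // Made-up token definitions standing in for the real 500.
            var tokens = new (string Name, string Pattern)[]
            {
                ("Comment", @"--[^\r\n]*"),
                ("Number",  @"\d+"),
                ("Ident",   @"[A-Za-z_]\w*"),
            };

            // OR everything into one pattern, one named group per token,
            // so a single Matches() pass reports which token matched where.
            string combined = string.Join("|",
                Array.ConvertAll(tokens, t => $"(?<{t.Name}>{t.Pattern})"));
            var regex = new Regex(combined, RegexOptions.Compiled);

            foreach (Match m in regex.Matches("SELECT 42 -- the answer"))
                foreach (var t in tokens)
                    if (m.Groups[t.Name].Success)
                    {
                        Console.WriteLine($"{t.Name} '{m.Value}' at {m.Index}");
                        break;
                    }
        }
    }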

Would this be as fast or faster than the state graph? I don't know...it
depends on whether the Regex authors put some effort into optimizing that
case. I don't know enough about Regex (implementation _or_ API :) ) to
have an answer to that. But even if they didn't, obviously Sam and I
agree that the simpler code is better as long as there's no direct
evidence that performance is actually going to be an issue.

Pete
Dec 18 '07 #4

Hi Pete,
I thought about OR-ing them all into a single pattern. That would
certainly improve performance somewhat, but the problem is that many of
the regular expressions overlap, and when you run a combined pattern it
stops on the first one found, while I need all of them. Right now I have
around 500 regex patterns (one for each token); some could be OR-ed
together, which would save me some time, but some cannot. I guess LEX
was created for a reason, and I thought I'd ask here if anybody knows
of a .NET implementation.
For those who asked whether performance really is a problem - good
question, and the answer is yes. Running hundreds of search patterns
over a buffer can take 2-3 minutes, and I have plenty of such buffers
(basically, I am running over all the stored procedures in a database
to find errors in the code - I am building a T-SQL compiler), so running
over an entire database, if it's large, means executing it and going for
a drink every time. I will proceed with Pete's advice; I am sure it will
help performance quite a bit, though like I said, I don't think it's the
ultimate solution.
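
To show what I mean by overlap, a tiny C# illustration with two made-up
keywords - .NET alternation is ordered, not longest-match:

    using System;
    using System.Text.RegularExpressions;

    class AlternationOrderDemo
    {
        static void Main()
        {
            // The first alternative that matches wins, so the shorter
            // keyword shadows the longer one...
            Console.WriteLine(Regex.Match("intersect", "int|intersect").Value);   // "int"

            // ...unless the longer token is listed first.
            Console.WriteLine(Regex.Match("intersect", "intersect|int").Value);   // "intersect"
        }
    }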
thanks,
J
Dec 19 '07 #5

On Dec 19, 1:49 pm, Jonathan Sion <y...@nobhillsoft.com> wrote:

<snip>
>For those who asked whether performance really is a problem - good
>question, and the answer is yes. Running hundreds of search patterns
>over a buffer can take 2-3 minutes, and I have plenty of such buffers
When you say "could take 2-3 minutes" do you mean you've measured it
with a particular set of search patterns?

Jon
Dec 19 '07 #6

Yes.
It could take something like 2 minutes for a single stored procedure
(a really large one), and when it comes to a database of 700 procedures,
it means going out for an adult beverage every time I run it : )
Dec 20 '07 #7

On Dec 20, 3:53 pm, Jonathan Sion <y...@nobhillsoft.com> wrote:
>It could take something like 2 minutes for a single stored procedure
>(a really large one), and when it comes to a database of 700 procedures,
>it means going out for an adult beverage every time I run it : )
Are you already precompiling the regular expressions? How complicated
are the expressions?

Jon
Dec 20 '07 #8


2 minutes for a single procedure is way too much. I suggest you post
some sample code and a sample procedure. You don't need all 500
expressions - just 2-3 representative ones and 2-3 representative
procedures, as well as the relevant code that generates the regexes
and does the loop.

Sam

On Thu, 20 Dec 2007 07:53:40 -0800 (PST), Jonathan Sion
<yo**@nobhillsoft.com> wrote:
><snip>
Dec 20 '07 #9

Hi,
I am sorry for the delay in my response. Here is some sample code. I
have an object, oTokBase, which I defined; it simply holds a regular
expression pattern (in its TokenDef string property) along with the
options I pass to the .NET RE engine (case sensitivity, multiline,
etc.). This is the iteration, and I have about 500 patterns. I have to
plead ignorance when asked whether the regexes are 'compiled' or not -
is that worth reading about? (i.e. will it help performance?) Also, I
apologize for the VB.NET code in a C# group... the code is very
similar, and I came to this group knowing that the more serious
programmers would be here : )

Here's the code. Thanks!
--------------------------------------------------------

For Each oTokBase In Me.m_TokDefs

    oTokDef = oTokBase
    'build the Regex for this token definition (rebuilt on every pass):
    Dim oRegex As New Regex(oTokDef.TokenDef, oTokDef.RegExpOptions)
    'run the search:
    Dim mc As MatchCollection = oRegex.Matches(Me.m_sBuffer)
    If mc.Count > 0 Then
        'aha! found:
        Dim iCharPosInLine As Integer, iLine As Integer
        For Each m As Match In mc
            Dim oToken As New Token(oTokDef.TokenDef, m.Value, oTokDef.TokenID)
            oToken.Position.AbsolutePos = m.Index
            'translate the absolute index into line/column coordinates:
            Me.GetPosInLine(m.Index, iCharPosInLine, iLine)
            oToken.Position.CharPosInLine = iCharPosInLine
            oToken.Position.Line = iLine
            Me.TokenPositions.Add(oToken)
        Next
    End If

Next
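
For reference, the 'compiled' idea Sam and Jon are hinting at could look
something like this rough C# sketch (the cache class and its names are
made up): build each Regex once with RegexOptions.Compiled and reuse it
across all 700 procedures, instead of constructing one inside the loop
for every buffer. Compiled regexes cost more to build but match faster,
which pays off when the same 500 patterns scan many buffers.

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Hypothetical cache: pattern + options -> compiled Regex,
    // built once and reused for every stored procedure.
    static class RegexCache
    {
        private static readonly Dictionary<string, Regex> cache =
            new Dictionary<string, Regex>();

        public static Regex Get(string pattern, RegexOptions options)
        {
            string key = options + "\n" + pattern;
            if (!cache.TryGetValue(key, out Regex r))
            {
                r = new Regex(pattern, options | RegexOptions.Compiled);
                cache[key] = r;
            }
            return r;
        }
    }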
Dec 27 '07 #10


"Jonathan Sion" <yo**@nobhillsoft.comwrote in message
news:4b**********************************@q77g2000 hsh.googlegroups.com...
><snip>
>And if the answer is no - is there a lex implementation for .NET? :)

Google gives -- http://sourceforge.net/projects/csflex/

Dec 31 '07 #11
