473,938 Members | 23,191 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Tokenizing a large buffer

Hi,
I got a large text string, and a bunch of regular expression patterns
i need to find within the string. in other words, to find all the
'tokens' inside it. of course I could use the regexp engine, the thing
is, if I got 500 different patterns, this means 500 searches on the
buffer. Does anybody has any general idea on how this could be sped
up? In other words, I am looking to do what the old 'LEX' tool in unix
used to do - one pass on the string, finding all patterns and making
them tokens... what I need is not THAT complex, so I am wondering
if .net regexp could search for manny patterns in a single pass.

and if the answer is no - is there any lex implementation in .net? :)

Thanks
Jonathan
Dec 18 '07 #1
10 1580
On Tue, 18 Dec 2007 08:04:41 -0800, Jonathan Sion <yo**@nobhillso ft.com>
wrote:
I got a large text string, and a bunch of regular expression patterns
i need to find within the string. in other words, to find all the
'tokens' inside it. of course I could use the regexp engine, the thing
is, if I got 500 different patterns, this means 500 searches on the
buffer. Does anybody has any general idea on how this could be sped
up? In other words, I am looking to do what the old 'LEX' tool in unix
used to do - one pass on the string, finding all patterns and making
them tokens... what I need is not THAT complex, so I am wondering
if .net regexp could search for manny patterns in a single pass.

and if the answer is no - is there any lex implementation in .net? :)
I don't know how lex was implemented. And I don't know whether a state
machine is the best way to solve the problem. But I do know that it's a
reasonable way to solve the problem, and that I wrote a simple
implementation awhile ago and posted it here. You can see it in this post:
http://groups.google.com/group/micro...06f696d4500b77

I see some things in the implementation that I'd probably do differently
if I were doing it again today, but it ought to work. Or at least
something like it.

For what it's worth, just how large is this "large text string"? And how
frequently do you need to do this? If this is something that your code
needs to do over and over on a frequent basis, optimizing the
implementation would be useful. But I'd guess that 500 searches on even a
100K string or so wouldn't take that long, just to do it once.

There's some value in using the brute-force method, as it will keep the
code a _lot_ simpler. I wouldn't worry about the performance unless you
have a good reason to believe it will be a problem.

Hope that helps.

Pete
Dec 18 '07 #2

If you generate the regex's as compiled (constructor param) and cache
them in memory, and make sure that each regex is anchored to the
begining of the search string, then I don't think looping through all
the regexes as you move through the text will perform bad at all.
Since the regexes have to match at the start of the text then they can
all fail very fast if they don't match.

In short, I would do some testing to make sure you're not trying to
solve a problem that doesn't exist first.

HTH,

Sam

------------------------------------------------------------
We're hiring! B-Line Medical is seeking .NET
Developers for exciting positions in medical product
development in MD/DC. Work with a variety of technologies
in a relaxed team environment. See ads on Dice.com.

On Tue, 18 Dec 2007 08:04:41 -0800 (PST), Jonathan Sion
<yo**@nobhillso ft.comwrote:
>Hi,
I got a large text string, and a bunch of regular expression patterns
i need to find within the string. in other words, to find all the
'tokens' inside it. of course I could use the regexp engine, the thing
is, if I got 500 different patterns, this means 500 searches on the
buffer. Does anybody has any general idea on how this could be sped
up? In other words, I am looking to do what the old 'LEX' tool in unix
used to do - one pass on the string, finding all patterns and making
them tokens... what I need is not THAT complex, so I am wondering
if .net regexp could search for manny patterns in a single pass.

and if the answer is no - is there any lex implementation in .net? :)

Thanks
Jonathan
Dec 18 '07 #3
On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <sa********@nom ail.com>
wrote:
>
If you generate the regex's as compiled (constructor param) and cache
them in memory, and make sure that each regex is anchored to the
begining of the search string, then I don't think looping through all
the regexes as you move through the text will perform bad at all.
Since the regexes have to match at the start of the text then they can
all fail very fast if they don't match.

In short, I would do some testing to make sure you're not trying to
solve a problem that doesn't exist first.
In addition, how long can a Regex match string be? Assuming there's no
practical limit -- that is, you can put any arbitrary string there -- then
since Regex supports boolean "or" in the search pattern string, you could
just have a single search pattern string with all of the tokens in it.

So rather than looping with multiple searches using Regex, just loop on
the tokens in creating the search string. Then let Regex do all the hard
work.

Would this be as fast or faster than the state graph? I don't know...it
depends on whether the Regex authors put some effort into optimizing that
case. I don't know enough about Regex (implementation _or_ API :) ) to
have an answer to that. But even if they didn't, obviously Sam and I
agree that the simpler code is better as long as there's no direct
evidence that performance is actually going to be an issue.

Pete
Dec 18 '07 #4
Hi Pete,
I thought about OR-ing them all into a single pattern. this would
certainly improve SOME performance, but the problem is, many regular
expressions overlap, and when you ran a search pattern, it would stop
on the first one found, while I need all of them. Right now, I got
around 500 regexp patterns (one for each token) and while some could
be or-ed, and would save me some time, some cannot. I guess LEX was
created for a reason, and I thought to ask here if anybody knows of
a .net implementation.
for those here who asked if performance really is a problem - good
question, and the answer is yes. running hunderds of search patterns
on a buffer could take 2-3 minutes, and I got plenty of such buffers
(basically, I am running over all stored procedures in a database, to
find erros in code. I am building a TSQL compiler) and so running over
entire database, if its large, means executing it and going for a
drink every time. I will procceed with pete's advice, this could help
performance quite a bit I am sure.. though like I said, its not the
ultimate solution, I think
thanks,
J
Dec 19 '07 #5
On Dec 19, 1:49 pm, Jonathan Sion <y...@nobhillso ft.comwrote:

<snip>
for those here who asked if performance really is a problem - good
question, and the answer is yes. running hunderds of search patterns
on a buffer could take 2-3 minutes, and I got plenty of such buffers
When you say "could take 2-3 minutes" do you mean you've measured it
with a particular set of search patterns?

Jon
Dec 19 '07 #6
yes,
It could take something like 2 minutes for a single stored procedure
(a really large one) and when it comes to a database of 700 procedures
it means going out for an adult beverage every time I run it : )
Dec 20 '07 #7
On Dec 20, 3:53 pm, Jonathan Sion <y...@nobhillso ft.comwrote:
It could take something like 2 minutes for a single stored procedure
(a really large one) and when it comes to a database of 700 procedures
it means going out for an adult beverage every time I run it : )
Are you already precompiling the regular expressions? How complicated
are the expressions?

Jon
Dec 20 '07 #8

2 minutes for a single procedure is way too much. I suggest you post
some sample code and a sample procedure. Don't need 500 expressions,
but 2-3 representative ones and 2-3 representative procedures, as well
as the relevant code that generates the regexp and does the loop.

Sam

------------------------------------------------------------
We're hiring! B-Line Medical is seeking .NET
Developers for exciting positions in medical product
development in MD/DC. Work with a variety of technologies
in a relaxed team environment. See ads on Dice.com.
On Thu, 20 Dec 2007 07:53:40 -0800 (PST), Jonathan Sion
<yo**@nobhillso ft.comwrote:
>yes,
It could take something like 2 minutes for a single stored procedure
(a really large one) and when it comes to a database of 700 procedures
it means going out for an adult beverage every time I run it : )
Dec 20 '07 #9
Hi,
I am sorry for the delay in my response. Here is a sample code. I got
my object oTokBase, which i defined, and simply holds a regular
expression pattern (in the TokenDef string property) and also holds
the 'options' that I transfer to the .net RE engine (case sensitive,
multiline, etc...) so this is the iteration, and i got about 500
patterns. I have to plead ignorance when asked if its 'compiled' or
not - is it worth reading about? (i.e. will it help performance) also,
I apologize for the VB.NET code in a c# group... the code is very
similar, and I came to this group knowing that the more serious
programmers would be here : )

here's the code, Thanks!
--------------------------------------------------------

For Each oTokBase In Me.m_TokDefs

oTokDef = oTokBase
Dim oRegex As Regex = New Regex(oTokDef.T okenDef,
oTokDef.RegExpO ptions)
'run the search:
Dim mc As MatchCollection = oRegex.Matches( Me.m_sBuffer)
If mc.Count 0 Then
'aha! found:
Dim m As Match
Dim iCharPosInLine As Integer, iLine As Integer
For Each m In mc
Dim oToken As New Token(oTokDef.T okenDef, m.Value,
oTokDef.TokenID )
oToken.Position .AbsolutePos = m.Index
Me.GetPosInLine (m.Index, iCharPosInLine, iLine)
oToken.Position .CharPosInLine = iCharPosInLine
oToken.Position .Line = iLine
Me.TokenPositio ns.Add(oToken)
Next

End If
NextToken:
Next
Dec 27 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

28
8100
by: David Rubin | last post by:
I looked on google for an answer, but I didn't find anything short of using boost which sufficiently answers my question: what is a good way of doing string tokenization (note: I cannot use boost). For example, I have tried this: #include <algorithm> #include <cctype> #include <climits> #include <deque> #include <iostream>
0
913
by: zhimin | last post by:
Hi, I'm writing a program to send large file(100m) through dotnet using TCPListener & TCPClient, I'm sending the file with a ask and response loop: 1. Client send a flag 1 to server indicate it has data send to server. 2. Client send the buffer block size. 3. Client send the actual buffer to the server. 4. Server send a flag 1 to client indicating that the buffer has been successfully receeived. 5. The next loop until all data of the...
1
4024
by: lwickland | last post by:
Summary: System.Net.ScatterGatherBuffers.MemoryChuck allocates inordinately large bytes when sending large post data. The following application consumes inordinate quantities of memory. My code does not explicitly allocate memory in a loop nor does it explicitly allocate large blocks of memory. Yet, the application’s memory footprint will grow as large as 370 MB. Rarely will it run to completion; usually, it throws an out of memory...
12
6207
by: Sharon | last post by:
I’m wrote a small DLL that used the FreeImage.DLL (that can be found at http://www.codeproject.com/bitmap/graphicsuite.asp). I also wrote a small console application in C++ (unmanaged) that uses the DLL above. Now the application, together with the above DLL’s is successfully loading a TIF image file (62992 x 113386 Pixels, Huffman RLE compression, 3200 x 3200 DPI resolution, binary colored (1 Bit Per Pixel), file on disk size 43.08...
2
4652
by: gauravkhanna | last post by:
Hi All I need some help for the below problem: Scenario We need to send large binary files (audio file of about 10 MB or so) from the client machine (.Net Windows based application, located outside the home network) to the Web Server and then retrieve the file back from the web server to the client.
7
10836
by: =?Utf-8?B?TW9iaWxlTWFu?= | last post by:
Hello everyone: I am looking for everyone's thoughts on moving large amounts (actually, not very large, but large enough that I'm throwing exceptions using the default configurations). We're doing a proof-of-concept on WCF whereby we have a Windows form client and a Server. Our server is a middle-tier that interfaces with our SQL 05 database server.
3
9275
by: patrickdepinguin | last post by:
Hi, I need to write large quantities of data to a file in C. The data comes from statistics that are continuously gathered from a simulator, and in order to not slow the whole thing down I would obviously want the writes to go as fast and efficient as possible. Since I/O operations are rather slow, I was thinking that using a large buffer would be better than writing each data point every time. Each data point calls my function, at which...
17
10010
by: byte8bits | last post by:
How does C++ safely open and read very large files? For example, say I have 1GB of physical memory and I open a 4GB file and attempt to read it like so: #include <iostream> #include <fstream> #include <string> using namespace std; int main () {
1
2244
by: Lambda | last post by:
As I know, when I use ifstream and ofstream to read and write file, it will use a stream buffer internally. How large is the stream buffer? If I want to write a large file, need I define a stream buffer myself? In what situation, user-defined stream buffer is useful?
0
10125
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9963
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
1
11281
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9853
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
8207
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
7377
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
6072
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
2
4441
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
3495
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.