Bytes | Software Development & Data Engineering Community
Regular expression rejecting invalid files

Hi,

I am using a regular expression to read records from a text file. But when
reading files with invalid formats, it takes ages before the program rejects
the file. So I want to optimise the expression to reject invalid files
faster.

The valid files are well-formed and look something like this:

Once upon a time
CODE NUMBER:123
There was a little lamb
CODE NUMBER:2134

Each record is terminated by a form feed, and the regular expression is
something like this:
.Pattern = "(.*)\r\nCODE NUMBER:(\d+)\r\n\f"

Any ideas on how to speed up file rejection?

Regards
Bertrand
May 8 '06 #1
Since you have newlines embedded in your regex, you are obviously
reading the entire file in before comparing it to your pattern, rather
than reading a record at a time. Then your regex has to look through
the whole thing to see if your pattern is there. Since binary files
can be large, you're probably taking a hit on the file I/O, and then
another hit on the regex.

I would probably try reading a dozen bytes from the beginning of each
file, and make sure each of the characters I got was alphanumeric,
whitespace, or some small set of punctuation. If it was, I'd go to the
full I/O and regex; if not, I'd assume I had a binary file and go on to
the next one.
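
A minimal sketch of that pre-check, in Python for illustration only (the original code appears to use a different regex engine). The names `ALLOWED`, `looks_like_text`, and `sniff_file`, the exact punctuation set, and the 12-byte sample size are all assumptions, not anything from the original program:

```python
import string

# Assumed whitelist: ASCII letters, digits, whitespace (this includes the
# form feed) and a small, arbitrary set of punctuation characters.
ALLOWED = set(
    (string.ascii_letters + string.digits + string.whitespace + ".,:;!?'\"-")
    .encode("ascii")
)


def looks_like_text(head: bytes) -> bool:
    """Return True if every byte of the sample is in the whitelist."""
    return all(b in ALLOWED for b in head)


def sniff_file(path: str, sample_size: int = 12) -> bool:
    """Read only the first few bytes of the file and check them."""
    with open(path, "rb") as fh:
        return looks_like_text(fh.read(sample_size))
```

Only files for which `sniff_file` returns True would then be read in full and handed to the regex. Note that an empty file passes this check, so it would still need to be rejected by the later parsing step.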

May 8 '06 #2
Thanks, that is one way to do it. It does not seem to be I/O though. I think
that perhaps my expression is too "loose" in the sense that it does not
include file start/end symbols. If I could only make the grammar stricter,
then I would presume that files would be rejected sooner. Any ideas whether
this is possible?

Regards
Bertrand

May 10 '06 #3
Since you're reading the entire file into a string before executing
your regex, the start of file is the start of string. The way your
regex is coded, the regex has to go all the way through the file before
it can reject it (that (.*) at the beginning). Is it really necessary
to capture everything that comes before CODE NUMBER?

If it is, you might try something like "^([a-zA-Z ]{5}.*)" in place of
your (.*). Without knowing what your "Once upon a time"s really look
like, it's kind of hard to say.
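
One way to sketch the anchoring idea, in Python for illustration only: match each record exactly where the previous one ended, and give up at the first position that does not match, instead of letting the engine search for a match from every offset in the file. The names `RECORD` and `parse_records` are hypothetical, and the sketch assumes each record's text is a single line with no form feed in it, which may not hold for the real files:

```python
import re

# One record: a single line of text, then the CODE NUMBER line, then a
# form feed. [^\r\n\f]* is used instead of .* so the text part can never
# run past a line break or a record terminator (an assumption about the data).
RECORD = re.compile(r"([^\r\n\f]*)\r\nCODE NUMBER:(\d+)\r\n\f")


def parse_records(text):
    """Return a list of (text, code) pairs, or None if the file is invalid."""
    records = []
    pos = 0
    while pos < len(text):
        # match() only tries the pattern at `pos`; it never scans ahead
        # looking for a later starting point.
        m = RECORD.match(text, pos)
        if m is None:
            return None  # reject at the first bad record
        records.append((m.group(1), int(m.group(2))))
        pos = m.end()
    return records
```

With this shape, a file whose first record is malformed is rejected after a single match attempt. Trailing data after the last valid record also causes rejection, which is roughly the stricter "grammar" asked about earlier in the thread.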

May 10 '06 #4
