Quicker reg exps?

Hi, I've written a reg exp for capturing a group of numbers from text files in the following format:
-1.4326 s < 0.6758 s < 1.4334 s
Any of the numbers can be positive or negative and the units (s) can change or even be absent. What I wanted was the three numbers (signs included)! Here was the reg exp I used to capture:
    $line =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!
The problem is that this must be used hundreds of thousands of times per file, so speed is an issue! Does anyone have any ideas to make this reg exp faster? I'm not fully aware of which reg exp constructs incur speed penalties.
Thanks!
Sep 24 '08 #1
numberwhun
3,509 Expert Mod 2GB
The only thing I can really think of right off (due to it still being early and my brain is still sleeping), is to work to make your regex non-greedy if you can. You can read about it here and here.
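
As a rough illustration of what "greedy" versus "non-greedy" means here, the sketch below applies both forms to the sample line from the question. The variable names are only for illustration, and which form is faster depends on the input, so it is worth measuring rather than assuming.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '-1.4326 s < 0.6758 s < 1.4334 s';

# Greedy: .* first runs to the end of the line, then backtracks
# until it finds the LAST "<" it can stop in front of.
my ($greedy) = $line =~ /^(.*)</;

# Non-greedy: .*? grows one character at a time and stops in
# front of the FIRST "<".
my ($lazy) = $line =~ /^(.*?)</;

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

The greedy capture is '-1.4326 s < 0.6758 s ' and the non-greedy capture is '-1.4326 s '.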

Having a more exact regular expression is one key to speed. Also, in the beginning of your regex you have the following:

    [ |-]
I assume that the spacing before the pipe symbol is supposed to be a space, but to a regex, it's just white space and not part of the regex. To indicate a space in a regex, you would use \s, not an actual space.

Regards,

Jeff
Sep 24 '08 #2
Ganon11
3,652 Expert 2GB
Jeff,

A space inside a character class (such as the one he has) matches just that - a space. Whitespace is matched normally inside regexes unless a certain option is turned on (which I forget right now). In other words,

    $line =~ /This is a test./;
will correctly match "This is a test." but not "Thisisatest."

    C:\Users\Ganon11>perl
    while (1) {
       chomp(my $line = <STDIN>);
       if ($line =~ /This is a test./) {
          print "Successful match.\n";
       } else {
          print "No match.\n";
       }
    }
    ^Z
    This is a test.
    Successful match.
    Thisisatest.
    No match.
    ^C
Similarly,

    $line =~ /(\w+)[ \t]/;
will match "Dogs ", "Cats ", but not "Mouse".

    C:\Users\Ganon11>perl
    while (1) {
        chomp(my $line = <STDIN>);
        if ($line =~ /(\w+)[ \t]/) {
            print "Successful match.\n";
        } else {
            print "No match.\n";
        }
    }
    ^Z
    Dogs
    No match.
    Dogs and
    Successful match.
    Cats
    Successful match.
    There was a tab in the previous line
    Successful match.
    Mousenospace
    No match.
    ^C
The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n].
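
The option mentioned earlier in this post is presumably the /x modifier, which tells Perl to ignore unescaped whitespace in the pattern. A minimal sketch of the difference, using made-up test strings:

```perl
use strict;
use warnings;

my $text = 'Thisisatest.';

# Without /x the spaces in the pattern are literal, so this fails.
my $plain = $text =~ /This is a test\./ ? 1 : 0;

# With /x, unescaped whitespace in the pattern is ignored, so this succeeds.
my $extended = $text =~ /This is a test\./x ? 1 : 0;

# Inside a bracketed character class a space stays literal even under /x.
my $class = 'a b' =~ /a [ ] b/x ? 1 : 0;

print "plain=$plain extended=$extended class=$class\n";
```

This prints plain=0 extended=1 class=1, so a space inside [ ] keeps matching a space with or without /x.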
Sep 24 '08 #3
KevinADC
4,059 Expert 2GB
Try:

    $line =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/o;
The "o" on the end might also give some performance boost, but you would have to test the code to see whether that is true for your application.
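
One way to run that test is the core Benchmark module. The sketch below is only a starting point: the sub names are hypothetical and the single sample line stands in for real data. Note that /o only changes behaviour for patterns that interpolate variables, so with a constant pattern like this one little or no difference should be expected.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample line taken from the question; real input would come from the files.
my $line = '-1.4326 s < 0.6758 s < 1.4334 s';

# The suggested pattern with the /o modifier (hypothetical helper name).
sub with_o {
    my @n = $_[0] =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/o;
    return @n;
}

# The same pattern without /o (hypothetical helper name).
sub without_o {
    my @n = $_[0] =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/;
    return @n;
}

# Sanity check: show the three captured numbers.
print join(', ', without_o($line)), "\n";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    with_o    => sub { my @n = with_o($line) },
    without_o => sub { my @n = without_o($line) },
});
```

The same harness can be pointed at any other candidate pattern by adding another entry to the hash.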
Sep 24 '08 #4
numberwhun
3,509 Expert Mod 2GB
"The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n]."
Plus, with \s, you can add the quantifiers to match none or many, whereas I believe you would have to include as many spaces as you expect the way he has done it. I was just looking at efficiency, but also wasn't aware you could use a literal space as such.
Sep 24 '08 #5
Ganon11
3,652 Expert 2GB
You could use [ \t\n]+ or [ \t\n]* just like \s, it's just faster to write \s+ or \s*. I think.
Sep 24 '08 #6
KevinADC
4,059 Expert 2GB
\s is actually a character class, not just a meta character. It's like \d ([0-9]) or \w ([a-zA-Z0-9_]) and not like \t or \n, which are meta characters that have only one interpolated meaning (tab and newline). Its actual meaning may also vary between older versions of Perl and newer ones.

According to the Perl 5.10 documentation:

\s matches a whitespace character, the set [\ \t\r\n\f] and others
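
A quick way to see where \s and [ \t\n] differ is to test a few single characters against both. This is a small sketch with hypothetical helper names, not an exhaustive list of what \s matches:

```perl
use strict;
use warnings;

# Hypothetical helpers: does a single character match \s, or the class [ \t\n]?
sub in_s     { return $_[0] =~ /\A\s\z/      ? 1 : 0 }
sub in_class { return $_[0] =~ /\A[ \t\n]\z/ ? 1 : 0 }

my %chars = (
    'space'           => ' ',
    'tab'             => "\t",
    'newline'         => "\n",
    'carriage return' => "\r",
    'form feed'       => "\f",
);

for my $name (sort keys %chars) {
    my $c = $chars{$name};
    printf "%-16s \\s: %-3s  [ \\t\\n]: %s\n",
        $name,
        in_s($c)     ? 'yes' : 'no',
        in_class($c) ? 'yes' : 'no';
}
```

Carriage return and form feed match \s but not [ \t\n], so the two are not interchangeable on files with Windows line endings.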
Sep 24 '08 #7
