Quicker reg exps?

Hi, I've written a reg exp for capturing a group of numbers from text files in the following format:
-1.4326 s < 0.6758 s < 1.4334 s
Any of the numbers can be positive or negative and the units (s) can change or even be absent. What I wanted was the three numbers (signs included)! Here was the reg exp I used to capture:
  $line =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!
The problem is that this must be used hundreds of thousands of times per file, so speed is an issue! Does anyone have any ideas to make this reg exp faster? I'm not sure which reg exp constructs incur speed penalties.
Thanks!
Sep 24 '08 #1
6 Replies
numberwhun
Expert Mod 2GB
The only thing I can think of right off (it's still early and my brain is still asleep) is to make your regex non-greedy if you can. You can read about it here and here.
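
To make the greedy versus non-greedy distinction concrete, here is a small sketch using the sample line from the question. The variable names are only illustrative, and the negated class [^<]* is an alternative added for comparison, not something proposed in this post.

```perl
use strict;
use warnings;

# Sample line from the question (single spaces, as displayed in the post).
my $line = '-1.4326 s < 0.6758 s < 1.4334 s';

# Greedy: .* grabs the whole line, then backtracks to the LAST '<'.
my ($greedy) = $line =~ /^(.*)</;

# Non-greedy: .*? grows one character at a time and stops at the FIRST '<'.
my ($lazy) = $line =~ /^(.*?)</;

# Negated class: [^<]* cannot run past a '<', so it also stops at the first one.
my ($negated) = $line =~ /^([^<]*)</;

print "greedy:  '$greedy'\n";     # '-1.4326 s < 0.6758 s '
print "lazy:    '$lazy'\n";       # '-1.4326 s '
print "negated: '$negated'\n";    # '-1.4326 s '
```

Whether the non-greedy or the negated-class form is faster depends on the data and the perl version, so treat this as a demonstration of what each form matches, not as a speed claim.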

Having a more exact regular expression is one key to speed. Also, in the beginning of your regex you have the following:

  [ |-]
I assume that the spacing before the pipe symbol is supposed to be a space, but to a regex, it's just white space and not part of the regex. To indicate a space in a regex, you would use \s, not an actual space.

Regards,

Jeff
Sep 24 '08 #2
Ganon11
Expert 2GB
Jeff,

A space inside a character class (such as the one he has) matches just that - a space. Whitespace in a regex is matched literally unless the /x modifier is turned on, and even then a space inside a character class still counts. In other words,

  $line =~ /This is a test./;
will correctly match "This is a test." but not "Thisisatest."

  C:\Users\Ganon11>perl
  while (1) {
     chomp(my $line = <STDIN>);
     if ($line =~ /This is a test./) {
        print "Successful match.\n";
     } else {
        print "No match.\n";
     }
  }
  ^Z
  This is a test.
  Successful match.
  Thisisatest.
  No match.
  ^C
Similarly,

  $line =~ /(\w+)[ \t]/;
will match "Dogs ", "Cats ", but not "Mouse".

  C:\Users\Ganon11>perl
  while (1) {
          chomp(my $line = <STDIN>);
          if ($line =~ /(\w+)[ \t]/) {
                  print "Successful match.\n";
          } else {
                  print "No match.\n";
          }
  }
  ^Z
  Dogs
  No match.
  Dogs and
  Successful match.
  Cats
  Successful match.
  There was a tab in the previous line
  Successful match.
  Mousenospace
  No match.
  ^C
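
The behaviour in the two transcripts above can be condensed into a non-interactive sketch. It also checks which characters the class [ |-] from the question accepts, and what changes under the /x modifier (the option that makes whitespace in a pattern insignificant). The hash and variable names are only illustrative.

```perl
use strict;
use warnings;

# Inside a character class, space, '|' and a trailing '-' are all literal.
my %in_class;
$in_class{$_} = /^[ |-]$/ ? 1 : 0 for ' ', '|', '-', '+';

# Without /x, the spaces in the pattern must appear in the input.
my $plain = ('Thisisatest.' =~ /This is a test\./) ? 1 : 0;

# With /x, whitespace outside a character class is ignored.
my $extended = ('Thisisatest.' =~ /This is a test\./x) ? 1 : 0;

# Even under /x, a space inside [...] still matches a literal space.
my $class_x = ('a b' =~ /a [ ] b/x) ? 1 : 0;

printf "'%s' accepted by [ |-]: %d\n", $_, $in_class{$_} for ' ', '|', '-', '+';
print "plain=$plain extended=$extended class_x=$class_x\n";
```

Note that [ |-] accepts a literal pipe as well as a space or a minus sign, because | is not an alternation operator inside a character class.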
The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n].
Sep 24 '08 #3
KevinADC
Expert 2GB
Try:

  $line =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/o;
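The "o" on the end asks perl to compile the pattern only once. That only makes a difference when the pattern interpolates a variable, which this one does not, so it is unlikely to give a performance boost here; you would have to test the code to see what is true for your application.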
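
One way to run that test is with the core Benchmark module. The following is a minimal sketch, not a definitive measurement. It assumes a sample line in which positive numbers are padded with a leading space (the original pattern's [ |-] needs some character in front of the digits), and the iteration count of 200,000 is an arbitrary choice. The names $original and $suggested are illustrative.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample line: positive numbers carry a leading pad space.
my $line = '-1.4326 s <  0.6758 s <  1.4334 s';

# Pattern from the question. Note that it has four capture groups,
# so the third number ends up in $4, not $3.
my $original = qr!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!;

# Pattern suggested above, compiled once with qr// and without /o.
my $suggested = qr/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/;

# In list context a successful match returns the captured groups.
my @nums = $line =~ $suggested;
print "captured: @nums\n";

# Compare the two patterns on the same input; results vary by machine.
cmpthese(200_000, {
    original  => sub { my @m = $line =~ $original },
    suggested => sub { my @m = $line =~ $suggested },
});
```

For a realistic comparison, feed both patterns a representative sample of lines from the actual files, since the relative cost depends on how much backtracking real input causes.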
Sep 24 '08 #4
numberwhun
Expert Mod 2GB
Ganon11 wrote: "The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n]."
Plus, with \s you can add quantifiers to match none or many, whereas I believe you would have to include as many spaces as you expect, the way he has done it. I was just looking at efficiency, and also wasn't aware you could use a literal space like that.
Sep 24 '08 #5
Ganon11
Expert 2GB
You could use [ \t\n]+ or [ \t\n]* just like \s; it's just quicker to type \s+ or \s*. I think.
Sep 24 '08 #6
KevinADC
Expert 2GB
\s is actually a character class, not just a metacharacter. It's like \d ([0-9]) or \w ([a-zA-Z0-9_]), and not like \t or \n, which are escapes that each stand for exactly one character (tab and newline). Its exact meaning may also vary between older versions of Perl and newer ones.

According to the perl 5.10 documentation:

\s matches a whitespace character, the set [\ \t\r\n\f] and others
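
The difference between \s and the narrower class [ \t\n] discussed earlier in the thread can be checked directly. This is a small sketch with illustrative names; it covers only the five characters listed in the documentation quote above, not the "others".

```perl
use strict;
use warnings;

# The five characters named in the quoted documentation.
my %names = (
    " "  => 'space',
    "\t" => 'tab',
    "\n" => 'newline',
    "\r" => 'carriage return',
    "\f" => 'form feed',
);

# For each character, record whether \s and [ \t\n] accept it.
my %result;
for my $ch (sort keys %names) {
    $result{ $names{$ch} } = {
        s_class => ($ch =~ /\A\s\z/      ? 1 : 0),
        narrow  => ($ch =~ /\A[ \t\n]\z/ ? 1 : 0),
    };
}

printf "%-16s \\s=%d  [ \\t\\n]=%d\n",
    $_, $result{$_}{s_class}, $result{$_}{narrow}
    for sort keys %result;
```

Carriage return and form feed are matched by \s but not by [ \t\n], so the two are not interchangeable on input that contains Windows line endings.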
Sep 24 '08 #7
