Bytes IT Community

Quicker reg exps?

P: 1
Hi, I've written a reg exp for capturing a group of numbers from text files in the following format:
-1.4326 s < 0.6758 s < 1.4334 s
Any of the numbers can be positive or negative and the units (s) can change or even be absent. What I wanted was the three numbers (signs included)! Here was the reg exp I used to capture:
    $line =~ m!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!
The problem is that this must be used hundreds of thousands of times per file, so speed is an issue! Does anyone have any ideas to make this reg exp faster? I'm not fully aware of which reg exp constructs incur speed penalties.
Thanks!
Sep 24 '08 #1
6 Replies


numberwhun
Expert Mod 2.5K+
P: 3,503
The only thing I can really think of right off (it's still early and my brain is still sleeping) is to make your regex non-greedy if you can. You can read about it here and here.
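To make the greedy versus non-greedy distinction concrete, here is a minimal sketch using the sample line from the question. The variable names are only illustrative. It also shows a negated character class, which captures the same text as the non-greedy form without relying on lazy matching.

```perl
use strict;
use warnings;

my $line = "-1.4326 s < 0.6758 s < 1.4334 s";

# Greedy: .* grabs as much as it can, then backs off to the LAST '<'.
my ($greedy) = $line =~ /(.*)</;

# Non-greedy: .*? grows one character at a time and stops at the FIRST '<'.
my ($lazy) = $line =~ /(.*?)</;

# Negated class: [^<]* can never run past a '<', so it also stops at the first one.
my ($negated) = $line =~ /([^<]*)</;

print "greedy:  '$greedy'\n";
print "lazy:    '$lazy'\n";
print "negated: '$negated'\n";
```

Which form is fastest depends on the data, so it is worth timing the alternatives on real input rather than assuming.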

Having a more exact regular expression is one key to speed. Also, in the beginning of your regex you have the following:

    [ |-]
I assume that the spacing before the pipe symbol is supposed to be a space, but to a regex, it's just whitespace and not part of the regex. To indicate a space in a regex, you would use a \s, not an actual space.

Regards,

Jeff
Sep 24 '08 #2

Ganon11
Expert 2.5K+
P: 3,652
Jeff,

A space inside a character class (such as the one he has) matches just that - a space. Whitespace is matched normally inside regexes unless a certain option is turned on (which I forget right now). In other words,

    $line =~ /This is a test./;
will correctly match "This is a test." but not "Thisisatest."

    C:\Users\Ganon11>perl
    while (1) {
       chomp(my $line = <STDIN>);
       if ($line =~ /This is a test./) {
          print "Successful match.\n";
       } else {
          print "No match.\n";
       }
    }
    ^Z
    This is a test.
    Successful match.
    Thisisatest.
    No match.
    ^C
Similarly,

    $line =~ /(\w+)[ \t]/;
will match "Dogs ", "Cats ", but not "Mouse".

    C:\Users\Ganon11>perl
    while (1) {
            chomp(my $line = <STDIN>);
            if ($line =~ /(\w+)[ \t]/) {
                    print "Successful match.\n";
            } else {
                    print "No match.\n";
            }
    }
    ^Z
    Dogs
    No match.
    Dogs and
    Successful match.
    Cats
    Successful match.
    There was a tab in the previous line
    Successful match.
    Mousenospace
    No match.
    ^C
The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n].
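For reference, the option in question is most likely the /x modifier, which makes Perl ignore unescaped whitespace in the pattern (outside bracketed character classes). A small sketch, with illustrative variable names, that also checks how the pipe behaves inside a character class like [ |-]:

```perl
use strict;
use warnings;

# Default behaviour: a literal space in the pattern matches a literal space.
my $plain = "This is a test." =~ /This is a test\./ ? 1 : 0;

# With /x the spaces in the pattern are ignored, so the pattern is
# effectively /Thisisatest\./ and matches the run-together string...
my $extended = "Thisisatest." =~ /This is a test\./x ? 1 : 0;

# ...and no longer matches the string that contains real spaces.
my $x_spaced = "This is a test." =~ /This is a test\./x ? 1 : 0;

# Inside a bracketed class, | is a literal pipe, not alternation:
# [ |-] matches a space, a pipe, or a hyphen.
my $pipe = "|" =~ /\A[ |-]\z/ ? 1 : 0;

print "plain=$plain extended=$extended x_spaced=$x_spaced pipe=$pipe\n";
```

So if the intent of [ |-] was "a space or a minus sign", the pipe is not needed; [ -] would express that.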
Sep 24 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
try:

    $line =~ m/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/o;
The "o" on the end might also give some performance boost, but you would have to test the code to see if that is true for your application.
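A rough way to test this is sketched below. The helper name extract_triple and the padded sample line are assumptions for illustration, not something from the original files. The sketch precompiles both patterns with qr// and only runs the timing comparison (via the core Benchmark module) when the script is called with --bench. As far as I know, /o only matters when the pattern interpolates variables, so it is unlikely to change anything for a constant pattern like this one.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Suggested pattern: negated classes instead of .*? between the numbers.
my $negated = qr/(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)[^<]+<\s+(-?\d+\.\d+)/;

# Pattern from the question. It has four groups, so the numbers land in
# $1, $2 and $4, and each number must be preceded by a space, pipe or hyphen.
my $original = qr!([ |-]\d+\.\d+)\s+.*?<\s+([ |-]\d+\.\d+)\s+(.*?)<\s+([ |-]\d+\.\d+)!;

# Hypothetical helper: returns the three signed numbers from one line,
# or an empty list when the line does not match.
sub extract_triple {
    my ($line) = @_;
    return $line =~ $negated;    # list context yields ($1, $2, $3)
}

my @with_units    = extract_triple("-1.4326 s < 0.6758 s < 1.4334 s");
my @without_units = extract_triple("-1.4326 < 0.6758 < 1.4334");
print "@with_units\n";
print "@without_units\n";

# ASSUMPTION: in the real files positive numbers are padded with an extra
# space (a sign column); the original pattern needs that to match at all.
my $padded = "-1.4326 s <  0.6758 s <  1.4334 s";
my @padded = extract_triple($padded);

# Run "perl script.pl --bench" to compare the two patterns on this machine.
if (@ARGV && $ARGV[0] eq '--bench') {
    cmpthese(-1, {
        original => sub { my @n = $padded =~ $original },
        negated  => sub { my @n = $padded =~ $negated },
    });
}
```

Timings on a single sample line are only indicative; running the comparison against a real input file would give a more trustworthy answer.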
Sep 24 '08 #4

numberwhun
Expert Mod 2.5K+
P: 3,503
Ganon11 wrote: "The special character \s is special only because it matches any kind of whitespace - therefore, I believe \s is equivalent to [ \t\n]."
Plus, with \s you can add quantifiers to match zero or more, whereas I believe you would have to include as many literal spaces as you expect the way he has done it. I was just looking at efficiency, and wasn't aware you could use a literal space like that.
Sep 24 '08 #5

Ganon11
Expert 2.5K+
P: 3,652
You could use [ \t\n]+ or [ \t\n]* just like \s, it's just faster to write \s+ or \s*. I think.
Sep 24 '08 #6

KevinADC
Expert 2.5K+
P: 4,059
\s is actually a character class, not just a meta character. It's like \d ([0-9]) or \w ([a-zA-Z0-9_]) and not like \t or \n, which are meta characters that have only one interpolated meaning (tab and newline). Its actual meaning may also vary between older versions of perl and newer ones.

According to the perl 5.10 documentation:

\s matches a whitespace character, the set [\ \t\r\n\f] and others
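A quick check of that difference, as a sketch: the hash names below are just for illustration. It tests five common whitespace characters against both \s and the narrower [ \t\n] class.

```perl
use strict;
use warnings;

my %named = (
    'space'           => " ",
    'tab'             => "\t",
    'newline'         => "\n",
    'carriage return' => "\r",
    'form feed'       => "\f",
);

my %matches;
for my $name (sort keys %named) {
    my $ch = $named{$name};
    # \A and \z anchor to the true start and end of the string,
    # so a newline is tested as a character rather than skipped by $.
    $matches{$name}{backslash_s} = $ch =~ /\A\s\z/      ? "yes" : "no";
    $matches{$name}{narrow}      = $ch =~ /\A[ \t\n]\z/ ? "yes" : "no";
    printf "%-16s \\s: %-3s  [ \\t\\n]: %s\n",
        $name, $matches{$name}{backslash_s}, $matches{$name}{narrow};
}
```

Carriage return and form feed match \s but not [ \t\n], so the two are not interchangeable on files with Windows line endings.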
Sep 24 '08 #7
