
Need help with a Regular Expression

P: 25
Hi,
I am trying to understand a concept in Perl regular expressions: how do I write a regex so that the * metacharacter is not greedy?

Here is my code:-
#!/usr/bin/perl
use strict;
my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time.";

my $b;
if ($sentence =~ /and(.*)\./s)
{
    $b = $1;
    print "The following is the output:-\n";
    print "$b\n";
}

#Output:-
The following is the output:-
first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time

The * operator is very greedy, so I get the output shown above.
I want the output to run just from the last occurrence of "and" up to the ".", like the following:-

lack of the arbitrary limitations
of many scripting languages at the time

So how do I achieve that? I tried using the repetition modifier {} after "and", but that does not work either.
I would appreciate it if you could help me with this.

Thanks in advance,
Sangith
Jan 10 '08 #1
2 Replies


KevinADC
Expert 2.5K+
P: 4,059
Regular expressions are probably one of the more complicated things about Perl (and many other languages) that the casual Perl coder will have to learn. A significant thing to note is that a regular expression will try to match a pattern as early as it can in a string. The word "and" occurs several times in the string, so Perl matches the first occurrence, just after Larry Wall: "Larry Wall and".
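To illustrate the "as early as it can" rule, here is a minimal sketch. The string in $text is a shortened, hypothetical stand-in for the sentence in the question, not the original. Making the quantifier non-greedy (.*?) only changes where the match ends; the match still starts at the first "and".

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the sentence in the question.
my $text = "Created by Larry Wall and first released in 1987. "
         . "Based on AWK and C, and lack of limits at the time.";

# Greedy: starts at the first "and ", runs to the LAST "." in the string.
my ($greedy) = $text =~ /and (.*)\./;

# Non-greedy: starts at the same first "and ", stops at the FIRST ".".
my ($lazy) = $text =~ /and (.*?)\./;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both captures begin with "first released", so switching to a non-greedy quantifier does not move the match to a later "and".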

In order to match the last occurrence you actually want to use greedy matching:

/.*and (.*)\./

The first '.*' will match up to the last occurrence of "and " (and-space). So you have to learn how to take advantage of greedy matching, and when to use it and when not to. But your problem is further complicated because the string spans multiple lines (at least it looks that way in your post). To match across the line breaks, use the "s" modifier at the end of the regexp. This tells Perl to treat the string as one long line: with "s", the '.' metacharacter matches any character, including newlines (without it, '.' matches anything except a newline).
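A small sketch of what the "s" modifier changes, again on a shortened, hypothetical stand-in string rather than the original sentence. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the multi-line sentence.
my $text = "based on AWK and C,\nand lack of limitations\nof the time.";

# Without /s, '.' cannot cross a newline, so "and " and the final "."
# would have to sit on the same line. They do not, so the match fails.
my $without_s = ($text =~ /.*and (.*)\./) ? $1 : undef;

# With /s, '.' also matches "\n", and the greedy leading '.*' pushes
# the match forward to the last "and ".
my $with_s = ($text =~ /.*and (.*)\./s) ? $1 : undef;

# With /s, '.' matches a trailing newline as well.
my ($line) = "abc\n" =~ /^(.*)$/s;

print defined($without_s) ? "without /s: $without_s\n" : "without /s: no match\n";
print "with /s: $with_s\n";
print "captured length: ", length($line), "\n";
```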

This is one way it could be done:

#!/usr/bin/perl
use strict;
my $sentence = "Perl is a dynamic programming language created by Larry Wall and first released in 1987,
Perl is based on the brace-delimited block style of AWK and C,
and was widely adopted for its strengths in text processing
and lack of the arbitrary limitations
of many scripting languages at the time.";
my $r;
if ($sentence =~ /.*and (.*)\./s)
{
   $r = $1;
   print "The following is the output:-\n";
   print "$r\n";
}

This is a bit contrived to fit the string you posted. The pattern you want to match appears to start at the beginning of a line within the string. But if you did not know where the pattern started in the string, you would probably have to use a different search pattern to avoid substring matches like "land" or "sand".
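One possible way to avoid those substring matches is the \b word-boundary assertion. The sketch below uses a made-up string (it contains "Holland" on purpose) and is only one option, not the only fix.

```perl
use strict;
use warnings;

# Made-up string: "Holland" contains "and " as a substring.
my $text = "the land is dry\nand sand is fine\n"
         . "and lack of limits\nin Holland at the time.";

# Plain pattern: the last "and " is the tail of "Holland ".
my ($naive) = $text =~ /.*and (.*)\./s;

# \b on both sides requires "and" to stand alone as a word.
my ($bounded) = $text =~ /.*\band\b\s+(.*)\./s;

print "naive:   $naive\n";
print "bounded: $bounded\n";
```

Here the plain pattern captures "at the time", while the \b version starts after the last standalone "and" and captures "lack of limits" through "at the time".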

Here is a link that might help you:

http://perldoc.perl.org/perlretut.html

Take it a little at a time if it's confusing.
Jan 10 '08 #2

P: 25
Hi Kevin,
Thank you so much for your help! Your approach works just great!
I am using this perl code for parsing my text file. The string that I am searching for in the file is a fixed one and will not occur as a part of any other string, so this approach is the best one for me.

Thanks again,
Sangith


Jan 10 '08 #3
