473,399 Members | 3,888 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,399 software developers and data experts.

Regular expressions (multiple match problem)

I have recently been experimenting with GNU C library regular
expression functions and noticed a problem with pattern matching. It
seems to recognize only the first match but ignoring the rest of them.
An example:

mikko.c:
-----

#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(int argc, char *argv[]) {
regex_t p;
regmatch_t pm[2];
regcomp(&p,"k",0);
regexec(&p,"mikko",2,pm,0);
printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
regfree(&p);
return 0;
}

-----

This intends to match regular expression 'k' against string 'mikko'
and return start and end of two first matches in the array pm of
regmatch_t:s. The output is, however:

$ ./mikko
start=2 end=3
start=-1 end=-1

instead of the expected

start=2 end=3
start=3 end=4

Is this a bug in GNU library or have I overlooked something? I have
not found any examples from the Internet of multiple subexpression
matching with <regex.heither.
With more complicated regular expressions it usually seems to return
only the first match as here, but with wildcards the largest match,
nevertheless only one of them.

Thanks,

Mikko Nummelin
Apr 2 '08 #1
5 8724
In article <ba**********************************@x41g2000hsb. googlegroups.com>,
mikko.n <mn******@gmail.comwrote:
>I have recently been experimenting with GNU C library regular
expression functions and noticed a problem with pattern matching.
Then you should ask in a GNU newsgroup. Regular expressions are
not part of the C standard, so the proper usage of
any particular regular expression library should be discussed
in the appropriate forum for that library.
--
"They called it golf because all the other four letter words
were taken." -- Walter Hagen
Apr 2 '08 #2
On 2 Apr 2008 at 6:20, mikko.n wrote:
I have recently been experimenting with GNU C library regular
expression functions and noticed a problem with pattern matching. It
seems to recognize only the first match but ignoring the rest of them.
An example:

mikko.c:
-----

#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(int argc, char *argv[]) {
regex_t p;
regmatch_t pm[2];
regcomp(&p,"k",0);
regexec(&p,"mikko",2,pm,0);
printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
regfree(&p);
return 0;
}

-----

This intends to match regular expression 'k' against string 'mikko'
and return start and end of two first matches in the array pm of
regmatch_t:s. The output is, however:

$ ./mikko
start=2 end=3
start=-1 end=-1

instead of the expected

start=2 end=3
start=3 end=4

Is this a bug in GNU library or have I overlooked something? I have
not found any examples from the Internet of multiple subexpression
matching with <regex.heither.
With more complicated regular expressions it usually seems to return
only the first match as here, but with wildcards the largest match,
nevertheless only one of them.
The problem is that you misunderstand what a match is.

If the regex matches, then pm[0] contains the offsets of the (first)
match for the whole regex. But pm[1],... don't contain the offets for
subsequent matches of the whole regex, but rather contain the offsets of
any parenthesized subexpressions that matched (in the match recorded in
pm[0]).

For example, try:

#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void)
{
regex_t p;
regmatch_t pm[2];
regcomp(&p,"k\\(.\\)",0);
regexec(&p,"mikko",2,pm,0);
printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
regfree(&p);
return 0;
}
$ ./a
start=2 end=4
start=3 end=4

Apr 2 '08 #3
On 2 huhti, 11:01, Antoninus Twink <nos...@nospam.invalidwrote:
On 2 Apr 2008 at 6:20, mikko.n wrote:
I have recently been experimenting with GNU C library regular
expression functions and noticed a problem with pattern matching. It
seems to recognize only the first match but ignoring the rest of them.
An example:
mikko.c:
-----
#include <stdio.h>
#include <regex.h>
#include <sys/types.h>
int main(int argc, char *argv[]) {
regex_t p;
regmatch_t pm[2];
regcomp(&p,"k",0);
regexec(&p,"mikko",2,pm,0);
printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
regfree(&p);
return 0;
}
-----
This intends to match regular expression 'k' against string 'mikko'
and return start and end of two first matches in the array pm of
regmatch_t:s. The output is, however:
$ ./mikko
start=2 end=3
start=-1 end=-1
instead of the expected
start=2 end=3
start=3 end=4
Is this a bug in GNU library or have I overlooked something? I have
not found any examples from the Internet of multiple subexpression
matching with <regex.heither.
With more complicated regular expressions it usually seems to return
only the first match as here, but with wildcards the largest match,
nevertheless only one of them.

The problem is that you misunderstand what a match is.

If the regex matches, then pm[0] contains the offsets of the (first)
match for the whole regex. But pm[1],... don't contain the offets for
subsequent matches of the whole regex, but rather contain the offsets of
any parenthesized subexpressions that matched (in the match recorded in
pm[0]).

For example, try:

#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void)
{
regex_t p;
regmatch_t pm[2];
regcomp(&p,"k\\(.\\)",0);
regexec(&p,"mikko",2,pm,0);
printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
regfree(&p);
return 0;

}

$ ./a
start=2 end=4
start=3 end=4
Is there then a simple alternative which would work so that it returns
all the matches of the original regexp in the text?

Mikko Nummelin
Apr 2 '08 #4
mikko.n wrote, On 02/04/08 09:37:
On 2 huhti, 11:01, Antoninus Twink <nos...@nospam.invalidwrote:
>On 2 Apr 2008 at 6:20, mikko.n wrote:
<snip>
Is there then a simple alternative which would work so that it returns
all the matches of the original regexp in the text?
As Walter suggested, ask in a GNU group or mailing list where your
question would be topical (there is one specifically for regexp) instead
of comp.lang.c where it is not.

I note that this time you have added a cross post to
comp.unix.programmer where your question might be topical, but why
continue posting where it is not?
--
Flash Gordon
Apr 2 '08 #5
On 2 Apr 2008 at 8:37, mikko.n wrote:
Is there then a simple alternative which would work so that it returns
all the matches of the original regexp in the text?
Just use a loop, like this:
#include <stdio.h>
#include <regex.h>
#include <sys/types.h>

int main(void)
{
regex_t p;
regmatch_t pm;
char *s="mikko mikko";
regoff_t last_match=0;
regcomp(&p, "k", 0);
while(regexec(&p, s+last_match, 1, &pm, 0) == 0) {
printf("start=%d end=%d\n", pm.rm_so + last_match, pm.rm_eo + last_match);
last_match += pm.rm_so+1;
}
regfree(&p);
return 0;
}

Apr 2 '08 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Kenneth McDonald | last post by:
I'm working on the 0.8 release of my 'rex' module, and would appreciate feedback, suggestions, and criticism as I work towards finalizing the API and feature sets. rex is a module intended to make...
2
by: Christian Staffe | last post by:
Hi, I would like to check for a partial match between an input string and a regular expression using the Regex class in .NET. By partial match, I mean that the input string could not yet be...
4
by: Ben Dewey | last post by:
Hey, I have only been playing with regular expressions for some time. I am working on some code that parses and object 560 event log. I have created two expressions the first one which works...
7
by: Billa | last post by:
Hi, I am replaceing a big string using different regular expressions (see some example at the end of the message). The problem is whenever I apply a "replace" it makes a new copy of string and I...
9
by: Pete Davis | last post by:
I'm using regular expressions to extract some data and some links from some web pages. I download the page and then I want to get a list of certain links. For building regular expressions, I use...
25
by: Mike | last post by:
I have a regular expression (^(.+)(?=\s*).*\1 ) that results in matches. I would like to get what the actual regular expression is. In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART...
5
by: teo | last post by:
I need to implement a boolean evaluation in a Regular Expression like this: (aaa AND bbb) OR (ccc AND ddd) (see the #3 case) - - - 1) If I need to match a single word only,
3
by: Zeba | last post by:
Hi guys, I need some help regarding regular expressions. Consider the following statement : System.Text.RegularExpressions.Match match =...
10
by: Thomas Dybdahl Ahle | last post by:
Hi, I'm writing a program with a large data stream to which modules can connect using regular expressions. Now I'd like to not have to test all expressions every time I get a line, as most of...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.