
preg_match at offset

Hello,

I want to split a given string into tokens which are defined by regexes:

// example tokens - a bit more complex in reality
$tokens = array(
    'NUMBER' => '~^\d+~',
    'NAME'   => '~^[A-Za-z]+~',
    'ANY'    => '~^.~' ); // make sure there is always a match

while ( $input !== '' )
{
    foreach ( $tokens as $name => $regex )
    {
        if ( preg_match( $regex, $input, $data ) )
        {
            addNewTokenToResult( $name, $data );
            // remove the matched token from the input
            $input = substr( $input, strlen( $data[0] ) );
            break;
        }
    }
}

(Code just to illustrate my approach, not actually tested.)

The problem is that I must cut off each found token from the input
string. This requires the string to be copied in memory and is therefore
not very efficient. Also, there is the extra substr() call each time. So
I thought it would be more elegant if preg_match could match a token at
a specified offset within the input (i.e. the "^" anchor would not match
the start of the input string, but at the offset position).

In other words, I would like something like this:

$string = "abcfoobarfoo";
$offset = 3; // corresponds to the first "f" in the string

// return true having matched the first "foo"
preg_match_at_offset( $offset, '/^foo/', $string );
// return false, as "abc" is not at offset 3
preg_match_at_offset( $offset, '/^abc/', $string );

At first I thought the offset parameter of preg_match could do this, but
the manual (<http://de3.php.net/preg_match>) says:

"Note: Using offset is not equivalent to passing substr($subject,
$offset) to preg_match() in place of the subject string, because pattern
can contain assertions such as ^, $ or (?<=x)."

So the only way to achieve what I want seems to be to use substr() to
cut off everything before the offset and do a preg_match on the rest,
using the "^" anchor.

Is there no way to have a pattern match at a specific offset only?

Greetings,
Thomas

--
Just because many of them are wrong does not mean they are right!
(Coluche)
Nov 1 '08 #1
8 Replies


On Sat, 01 Nov 2008 14:06:24 +0100, Thomas Mlynarczyk wrote:
> Hello,
>
> I want to split a given string into tokens which are defined by regexes:
>
> // example tokens - a bit more complex in reality

It might be more helpful to let us know what you're really working
with.

> $tokens = array(
> 'NUMBER' => '~^\d+~',

The match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.

> 'NAME' => '~^[A-Za-z]+~',

And what if a name contains characters like á, é, ö, ü, or ä? You should
use "\w", which is locale-specific. Although, in your case, this
regex would be more suitable:

// exclude "\d" and "_" from "\w"
'NAME' => '~[^\W\d_]+~';

> 'ANY' => '~^.~' ); // make sure there is always a match

It seems like your tokenizer will be quite inefficient with so much
use of regex. Is there any way you can approach your problem so that
you don't need to use regex for simple tokens? For example, using
is_numeric() would be faster than a regex. Instead of using the
'ANY' approach, maybe you could just use a 'continue' statement in
your loop.

[snip]
> At first I thought the offset parameter of preg_match could do this, but
> the manual (<http://de3.php.net/preg_match>) says:
>
> "Note: Using offset is not equivalent to passing substr($subject,
> $offset) to preg_match() in place of the subject string, because pattern
> can contain assertions such as ^, $ or (?<=x)."

Read up on the "\G" assertion, which keeps track of match positions.

<URL:http://php.net/manual/en/regexp.reference.php>
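
As a rough sketch of how "\G" could answer the original question: the function name preg_match_at_offset() below is the hypothetical one from the first post, not a PHP built-in, and the argument order simply follows that post. It only wraps preg_match() with its offset argument and expects the caller to start the pattern with "\G".

```php
<?php
// Hypothetical helper named after the wish in the original post.
// The pattern is expected to start with \G, which only matches at
// the position given by preg_match()'s offset argument.
function preg_match_at_offset( $offset, $regex, $subject, &$match = null )
{
    return preg_match( $regex, $subject, $match, 0, $offset ) === 1;
}

$string = 'abcfoobarfoo';
$offset = 3; // corresponds to the first "f" in the string

var_dump( preg_match_at_offset( $offset, '/\Gfoo/', $string ) ); // bool(true)
var_dump( preg_match_at_offset( $offset, '/\Gabc/', $string ) ); // bool(false)
```

No substr() copy of the input is needed here, because the subject string is passed unchanged and only the offset moves.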
--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);
Nov 2 '08 #2

On Sun, 02 Nov 2008 06:59:38 GMT, Curtis wrote:
[snip]
> // exclude "\d" and "_" from "\w"
> 'NAME' => '~[^\W\d_]+~';

Sorry, forgot to get rid of the semi-colon:

'NAME' => '~[^\W\d_]+~',

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);
Nov 2 '08 #3

Curtis wrote:
> It might be more helpful to let us know what you're really working
> with.

Well, it's supposed to be a general-purpose lexer receiving both the
input string and the tokens array in the constructor. The result would
either go to a parser, or be used to generate a nice syntax-highlighted
output of the source code. I also want to have the possibility of
working with languages like Yaml which use indentation to convey grouping
information, but that would be a second step.

>> $tokens = array(
>> 'NUMBER' => '~^\d+~',
>
> Match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.
>
>> 'NAME' => '~^[A-Za-z]+~',
>
> And if a name contains characters like, á, é, ö, ü, or ä? You should
> use "\w", which is locale-specific. Although, in your case, this
> regex would be more suitable:
>
> // exclude "\d" and "_" from "\w"
> 'NAME' => '~[^\W\d_]+~';

That's why I wrote "example tokens" and "just to illustrate my approach".

> It seems like your tokenizer will be quite inefficient with so much
> use of regex. Is there any way you can approach your problem so that
> you don't need to use regex for simple tokens?

Indeed, that is a point I should consider. I would somehow have to
include the information in my tokens array whether the token is to be
preg_match()'ed or a simple string to match. But I wonder if that extra
logic would not take longer than always using a regex. I suppose the
regex engine is clever enough to optimize simple regexes, especially
since tokens usually appear several times in the input.

Also, performance is not that important to me here. If it were, I
probably wouldn't use PHP for the job.

> Instead of using the
> 'ANY' approach, maybe you could just use a 'continue' statement in
> your loop.

Yes, you are right. I will do that. Anyway, an 'ANY' would mean that
none of the "real" tokens matched, so it's a syntax error.

> Read up on the "\G" assertion keeping track of match positions.
> <URL:http://php.net/manual/en/regexp.reference.php>

Thanks - indeed, that sounds like just what I need. Meanwhile I have
tried the A modifier, and that seems to work, even though the
documentation is not very clear here:

$tokens = array(
    'T_IDENTIFIER' => '~[A-Za-z_][A-Za-z0-9_]*~',
    'T_NUMBER'     => '~[+-]?\\d+~',
    // others omitted for brevity
);

function tokenize( $input, $tokens )
{
    // normalize line endings (\r\n and lone \r) to \n
    $input  = preg_replace( '~\r\n?~', "\n", $input );
    $result = array();
    $offset = 0;
    $finish = strlen( $input );
    $line   = 1;

    while ( $offset < $finish )
    {
        foreach ( $tokens as $name => $regex )
        {
            // the A modifier anchors the match at $offset
            if ( preg_match( $regex . 'A', $input, $data, 0, $offset ) )
            {
                // $data becomes array( name, line, matched text, groups... )
                array_unshift( $data, $name, $line );
                $offset += strlen( $data[2] );
                $line   += substr_count( $data[2], "\n" );
                $result[] = $data;
                continue 2;
            }
        }
        // no token matched at this offset
        $result[] = array( 'T_ERROR', 'Error in line ' . $line );
        break;
    }
    return $result;
}

I will try out the \G assertion -- but, assuming both work (A and \G),
is there any reason to prefer one over the other?
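
For comparison, here is a minimal sketch of the same loop written with "\G" inside each pattern instead of appending the A modifier. The function name tokenize_with_g(), the T_SPACE token and the error text are made up for this illustration, and line counting is left out to keep it short.

```php
<?php
// Illustrative token table: every pattern starts with \G, so it can
// only match exactly at the offset passed to preg_match().
$tokens = array(
    'T_IDENTIFIER' => '~\G[A-Za-z_][A-Za-z0-9_]*~',
    'T_NUMBER'     => '~\G[+-]?\d+~',
    'T_SPACE'      => '~\G\s+~',
);

function tokenize_with_g( $input, $tokens )
{
    $result = array();
    $offset = 0;
    $finish = strlen( $input );

    while ( $offset < $finish )
    {
        foreach ( $tokens as $name => $regex )
        {
            // Require a non-empty match so the loop always advances.
            if ( preg_match( $regex, $input, $data, 0, $offset ) && $data[0] !== '' )
            {
                $result[] = array( $name, $data[0] );
                $offset  += strlen( $data[0] );
                continue 2;
            }
        }
        $result[] = array( 'T_ERROR', 'No token matches at offset ' . $offset );
        break;
    }
    return $result;
}

print_r( tokenize_with_g( 'foo 42', $tokens ) );
```

One difference worth knowing when choosing between the two: the A modifier anchors the whole pattern, while "\G" only anchors the branch it appears in, so in a pattern like '~\Ga|b~' the "b" alternative is not anchored.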

Thanks for your suggestions.

Greetings,
Thomas

--
Just because many of them are wrong does not mean they are right!
(Coluche)
Nov 2 '08 #4

On Sun, 02 Nov 2008 15:31:40 +0100, th****@mlynarczyk-webdesign.de
wrote:
> Curtis wrote:
>> It might be more helpful to let us know what you're really working
>> with.
>
> Well, it's supposed to be a general-purpose lexer receiving both the
> input string and the tokens array in the constructor. The result would
> either go to a parser, or be used to generate a nice syntax-highlighted
> output of the source code. I also want to have the possibility of
> working with languages like Yaml which use indent to convey grouping
> information, but that would be a second step.

Sounds interesting. You might be able to take care of your syntax
highlighting with GeSHi.

<URL:http://qbnz.com/highlighter/>

>>> $tokens = array(
>>> 'NUMBER' => '~^\d+~',
>>
>> Match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.
>>
>>> 'NAME' => '~^[A-Za-z]+~',
>>
>> And if a name contains characters like, á, é, ö, ü, or ä? You should
>> use "\w", which is locale-specific. Although, in your case, this
>> regex would be more suitable:
>>
>> // exclude "\d" and "_" from "\w"
>> 'NAME' => '~[^\W\d_]+~';
>
> That's why I wrote "example tokens" and "just to illustrate my approach".

That is why I initially asked to know what you were working with. In
the meantime, I just wanted to point out some things about
those regexes that might be unexpected or problematic, and which might
also be of note to other readers.

>> It seems like your tokenizer will be quite inefficient with so much
>> use of regex. Is there any way you can approach your problem so that
>> you don't need to use regex for simple tokens?
>
> Indeed, that is a point I should consider. I would somehow have to
> include the information in my tokens array whether the token is to be
> preg_match()'ed or a simple string to match. But I wonder if that extra
> logic would not take longer than always using a regex. I suppose the
> regex engine is clever enough to optimize simple regexes, especially,
> since tokens usually appear several times in the input.

In places where there is a significant number of contiguous simple
patterns, I suspect it would help quite a bit, depending on the size
of the data.

> Also, performance is not that important to me here. If it was, I
> probably wouldn't use PHP for the job.

As opposed to what? From personal experience, PHP performs very well
when installed as a PHP module. Its performance, when compared to
other commonly used interpreted languages for the Web, is comparable.

ISTM, any bottlenecks encountered, in your case, will be in the
algorithm, not the language.

>> Instead of using the
>> 'ANY' approach, maybe you could just use a 'continue' statement in
>> your loop.
>
> Yes, you are right. I will do that. Anyway, an 'ANY' would mean that
> none of the "real" tokens matched, so it's a syntax error.
>
>> Read up on the "\G" assertion keeping track of match positions.
>> <URL:http://php.net/manual/en/regexp.reference.php>
>
> Thanks - indeed, that sounds like just what I need. Meanwhile I have
> tried the A modifier, and that seems to work, even though the
> documentation is not very clear here:
> [snip]
>
> I will try out the \G assertion -- but, assuming both work (A and \G),
> is there any reason to prefer one over the other?
Here's an excerpt from the PCRE documentation:

"The \G assertion is true only when the current matching position is
at the start point of the match, as specified by the offset argument
of preg_match(). It differs from \A when the value of offset is non-
zero. It is available since PHP 4.3.3."

Also, if you use the "PREG_OFFSET_CAPTURE" flag, the index at which
grouped matches occurred will be stored in the match array.
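
A small sketch of what that flag does, reusing the example string from the first post (the two-group pattern is only for illustration). The reported offsets count from the start of the whole subject string, not from the offset argument.

```php
<?php
$string = 'abcfoobarfoo';

// Match two groups starting exactly at offset 3 and record where each matched.
preg_match( '~\G(foo)(bar)~', $string, $m, PREG_OFFSET_CAPTURE, 3 );

// Each entry is array( matched text, byte offset in the full subject ).
print_r( $m );
// $m[0] is array( 'foobar', 3 ), $m[1] is array( 'foo', 3 ), $m[2] is array( 'bar', 6 )
```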

Until you progress along to a functioning example, it will be hard to
suggest any concrete, specific help. For now, all I can think of, is
to use the keys in your token arrays to indicate whether or not the
token should be parsed with regex.
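
To make that idea a bit more concrete, here is one possible sketch. The "re:" key prefix, the token names and the helper match_token_at() are all invented for this example. Literal tokens are compared in place with substr_compare(), so they need neither a regex nor a substr() copy.

```php
<?php
// Assumed convention: keys starting with "re:" hold regexes,
// every other key holds a literal string to compare directly.
$tokens = array(
    'T_ARROW'   => '=>',
    'T_LPAREN'  => '(',
    're:T_NAME' => '~\G[A-Za-z_]\w*~',
);

// Returns the matched text at $offset, or null if the token does not match there.
function match_token_at( $input, $offset, $key, $value )
{
    if ( strncmp( $key, 're:', 3 ) === 0 )
    {
        return preg_match( $value, $input, $m, 0, $offset ) ? $m[0] : null;
    }
    // Literal token: compare the bytes at $offset without copying the input.
    $length = strlen( $value );
    return substr_compare( $input, $value, $offset, $length ) === 0 ? $value : null;
}

var_dump( match_token_at( 'a => b', 2, 'T_ARROW', $tokens['T_ARROW'] ) );     // string(2) "=>"
var_dump( match_token_at( 'a => b', 2, 're:T_NAME', $tokens['re:T_NAME'] ) ); // NULL
var_dump( match_token_at( 'a => b', 0, 're:T_NAME', $tokens['re:T_NAME'] ) ); // string(1) "a"
```

Whether this is actually faster than using regexes throughout would have to be measured on real input.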

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);
Nov 3 '08 #5

Curtis wrote:
> <URL:http://qbnz.com/highlighter/>

Thanks for the link - I have downloaded it and will try out this tool.

[Using simple string functions instead of regexes]
> In places where there are a significant amount of contiguous simple
> patterns, I suspect it would help quite a bit, depending on the size
> of the data.

I suppose I will just have to try it out and see what it brings. But
then I prefer the code to be as simple as possible, so if I see no
particular need for a speed-up, I guess I will just make do with the regexes.

>> Also, performance is not that important to me here. If it was, I
>> probably wouldn't use PHP for the job.
>
> As opposed to what? From personal experience, PHP performs very well
> when installed as a PHP module. Its performance, when compared to
> other commonly used interpreted languages for the Web, is comperable.

As opposed to a compiled language. If I were to invent my very own
scripting language and use it for real stuff, using PHP as
lexer/parser/interpreter would not be a good idea. But for having, say,
a PHP application which uses Yaml for its config files, the latter could
be parsed, "translated" to native PHP arrays and cached as such. Thus, I
would have the comfort of writing my config in Yaml and still get full
PHP performance. (Symfony does this, but I'm the kind of programmer who
wants to do it all by myself, even if that means re-inventing the wheel.
I can learn a lot this way.)

> Here's an excerpt from the PCRE documentation:
>
> "The \G assertion is true only when the current matching position is
> at the start point of the match, as specified by the offset argument
> of preg_match(). It differs from \A when the value of offset is non-
> zero. It is available since PHP 4.3.3."
It seems as if the \G assertion does the same as the A modifier (not the
\A assertion!). At least my test worked:

$string = 'abcfoobarfoo';
echo preg_match( '~foo~A', $string, $data, 0, 3 ); // 1
echo preg_match( '~bar~A', $string, $data, 0, 3 ); // 0
echo preg_match( '~abc~A', $string, $data, 0, 3 ); // 0

Same result if I use '~\Gfoo~' etc. Strangely, '~\Afoo~' etc. gives me
three 0's instead of the expected 001. Maybe I misunderstood something
here. But the most important is that /\G/ or /.../A do solve my original
problem.

Thanks & greetings,
Thomas
--
Just because many of them are wrong does not mean they are right!
(Coluche)
Nov 3 '08 #6

On Mon, 03 Nov 2008 16:22:03 +0100, th****@mlynarczyk-webdesign.de
wrote:
> Curtis wrote:
> [snip]
>> As opposed to what? From personal experience, PHP performs very well
>> when installed as a PHP module. Its performance, when compared to
>> other commonly used interpreted languages for the Web, is comperable.
>
> As opposed to a compiled language...

Well yes, when compared to C or C++, for example, it seems natural
the performance would be noticeably better in certain cases.

> If I was to invent my very own
> scripting language and use it for real stuff, using PHP as
> lexer/parser/interpreter would not be a good idea. But for having, say,
> a PHP application which uses Yaml for its config files, the latter could
> be parsed, "translated" to native PHP arrays and cached as such. Thus, I
> would have the comfort of writing my config in Yaml and still get full
> PHP performance. (Symfony does this, but I'm the kind of programmer who
> wants to do it all by myself, even if that means re-inventing the wheel.
> I can learn a lot this way.)

I can definitely understand where you are coming from here. I have
done a fair amount of learning from the desire to understand how
things work.

>> Here's an excerpt from the PCRE documentation:
>>
>> "The \G assertion is true only when the current matching position is
>> at the start point of the match, as specified by the offset argument
>> of preg_match(). It differs from \A when the value of offset is non-
>> zero. It is available since PHP 4.3.3."
>
> It seems as if the \G assertion does the same as the A modifier (not the
> \A assertion!). At least my test worked:
>
> $string = 'abcfoobarfoo';
> echo preg_match( '~foo~A', $string, $data, 0, 3 ); // 1
> echo preg_match( '~bar~A', $string, $data, 0, 3 ); // 0
> echo preg_match( '~abc~A', $string, $data, 0, 3 ); // 0
>
> Same result if I use '~\Gfoo~' etc. Strangely, '~\Afoo~' etc. gives me
> three 0's instead of the expected 001. Maybe I misunderstood something
> here. But the most important is that /\G/ or /.../A do solve my original
> problem.
From what I read in the manual about the /A modifier, it is
equivalent to starting your regex with the \A assertion (which
matches at the beginning of the *subject*, even in multiline mode).
The beginning of the subject being at the start of the specified
offset, or 0, by default.

As for the \G assertion, I have yet to have the opportunity to put it
into practical use. I think I'll try to get a better grasp of
its usage to satisfy my own curiosity, too.

In any case, I hope all goes well in your project.

P.S.: there may be the off chance that PHP's built-in tokenizer
functions will be useful to you, too:

<URL:http://us3.php.net/manual/en/intro.tokenizer.php>
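
For readers who have not used those functions, a minimal sketch: token_get_all() and token_name() come from PHP's tokenizer extension and only understand PHP source code, so this covers the "parsing PHP" case rather than a general-purpose lexer. The sample source string is made up.

```php
<?php
$source = '<?php echo 1 + 2;';

foreach ( token_get_all( $source ) as $token )
{
    if ( is_array( $token ) )
    {
        // array( token id, matched text, line number )
        echo token_name( $token[0] ), ': ', var_export( $token[1], true ), "\n";
    }
    else
    {
        // single-character tokens such as "+" or ";" come back as plain strings
        echo 'CHAR: ', $token, "\n";
    }
}
```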

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);
Nov 3 '08 #7

On Mon, 03 Nov 2008 23:52:51 GMT, dy****@sig.invalid wrote:
[snip]
> From what I read in the manual about the /A modifier, it is
> equivalent to starting your regex with the \A assertion (which
> matches at the beginning of the *subject*, even in multiline mode).
> The beginning of the subject being at the start of the specified
> offset, or 0, by default.

Seems I misread the manual, and your post. My above quoted text is
rubbish, sorry.

When I ran a test using both the /A modifier and the \A assertion at
the beginning of the pattern, the two did not behave the same way:

$s = 'abcfoobar';
$m = array();

echo preg_match('/foobar/A', $s, $m, 0, 3); // 1
echo preg_match('/\Afoo/', $s, $m, 0, 3); // 0
echo preg_match('/\Afoo/', substr($s,3), $m, 0, 0); // 1

So, ISTM, regardless of offset, \A will look at the start of the
string, so the assertion seems to fail when an offset is greater than
0. Yet the /A modifier seems to adjust with the offset.

Sorry for the mistake.

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);
Nov 4 '08 #8

Curtis wrote:
> From what I read in the manual about the /A modifier, it is
> equivalent to starting your regex with the \A assertion (which
> matches at the beginning of the *subject*, even in multiline mode).

PHP Manual: "The \A, \Z, and \z assertions differ from the traditional
circumflex and dollar [...] in that they only ever match at the very
start and end of the subject string, whatever options are set."

According to the above, \A is a "stronger" version of "^", that always
matches at the real, very start of the complete string passed to a regex
function. Thus, not just the portion from a specified offset onwards. My
tests seem to confirm that.

PHP Manual: "The \G assertion is true only when the current matching
position is at the start point of the match, as specified by the offset
argument of preg_match(). It differs from \A when the value of offset is
non-zero."

That would be the behaviour which you think \A has. Also, the manual
explicitly states that \A ignores any offset.

PHP Manual: "If [the /A modifier] is set, the pattern is forced to be
"anchored", that is, it is constrained to match only at the start of the
string which is being searched (the "subject string")."

This is a bit vague with respect to an offset parameter, but as my tests
have shown, "subject string" here means the part of the string that starts
at the offset. Thus /A is equivalent to \G, not to \A, as illogical as that
seems given the names chosen.

Thus, we have:

\G == /A
\A == ^ without /m
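
The summary above can be checked with a short script. This is only a sketch using the example string from the first post; the label texts are arbitrary.

```php
<?php
$string = 'abcfoobarfoo';
$offset = 3; // position of the first "foo"

$patterns = array(
    '~foo~A'  => 'A modifier',
    '~\Gfoo~' => '\G assertion',
    '~\Afoo~' => '\A assertion',
    '~^foo~'  => '^ anchor',
);

foreach ( $patterns as $regex => $label )
{
    // Prints 1 for the A modifier and \G, 0 for \A and ^.
    printf( "%-14s %d\n", $label, preg_match( $regex, $string, $m, 0, $offset ) );
}
```
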

> <URL:http://us3.php.net/manual/en/intro.tokenizer.php>

Yes, I have tried it already. And I would use it, rather than anything
"self-made", for parsing PHP.

Greetings,
Thomas
--
Just because many of them are wrong does not mean they are right!
(Coluche)
Nov 4 '08 #9
