preg_match at offset

Thomas Mlynarczyk

Hello,

I want to split a given string into tokens which are defined by regexes:

// example tokens - a bit more complex in real
$tokens = array(
    'NUMBER' => '~^\d+~',
    'NAME'   => '~^[A-Za-z]+~',
    'ANY'    => '~^.~' ); // make sure there is always a match

while ( $input !== '' )
{
    foreach ( $tokens as $name => $regex )
    {
        if ( preg_match( $regex, $input, $data ) )
        {
            addNewTokenToResult( $name, $data );
            // remove the matched token from the input
            $input = substr( $input, strlen( $data[0] ) );
            break;
        }
    }
}

(Code just to illustrate my approach, not actually tested.)

The problem is that I must cut off each found token from the input
string. This requires the string to be copied in memory and is therefore
not very efficient. Also, there is the extra substr() call each time. So
I thought it would be more elegant if preg_match could match a token at
a specified offset within the input (i.e. the "^" anchor would not match
the start of the input string, but at the offset position).

In other words, I would like something like this:

$string = "abcfoobarfoo";
$offset = 3; // corresponds to the first "f" in the string

// return true having matched the first "foo"
preg_match_at_offset( $offset, '/^foo/', $string );
// return false, as "abc" is not at offset 3
preg_match_at_offset( $offset, '/^abc/', $string );

At first I thought the offset parameter of preg_match could do this, but
the manual (<http://de3.php.net/preg_match>) says:

"Note: Using offset is not equivalent to passing substr($subject,
$offset) to preg_match() in place of the subject string, because pattern
can contain assertions such as ^, $ or (?<=x)."

So the only way to achieve what I want seems to be to use substr() to
cut off everything before the offset and do a preg_match on the rest,
using the "^" anchor.

Is there no way to have a pattern match at a specific offset only?

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

Nov 1 '08 #1

Curtis

On Sat, 01 Nov 2008 14:06:24 +0100, Thomas Mlynarczyk wrote:

> Hello,
>
> I want to split a given string into tokens which are defined by regexes:
>
> // example tokens - a bit more complex in real

It might be more helpful to let us know what you're really working
with.

> $tokens = array(
> 'NUMBER' => '~^\d+~',

Match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.

> 'NAME' => '~^[A-Za-z]+~',

And if a name contains characters like á, é, ö, ü, or ä? You should
use "\w", which is locale-specific. Although, in your case, this
regex would be more suitable:

// exclude "\d" and "_" from "\w"
'NAME' => '~[^\W\d_]+~';

> 'ANY' => '~^.~' ); // make sure there is always a match

It seems like your tokenizer will be quite inefficient with so much
use of regex. Is there any way you can approach your problem so that
you don't need to use regex for simple tokens? For example, using
is_numeric() would be faster than a regex. Instead of using the
'ANY' approach, maybe you could just use a 'continue' statement in
your loop.
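
A minimal sketch of how is_numeric() treats the number formats listed above. Note the assumption that the candidate text has already been isolated: is_numeric() tests a whole string, so it cannot by itself find where a number ends inside a longer input. The sample strings are illustrative only.

```php
<?php
// Sample strings chosen for illustration; they are not from the original post.
$candidates = array( '+7', '-5.123e-10', '.5', '7up', '' );

$numeric = array();
foreach ( $candidates as $text )
{
    // is_numeric() accepts signs, decimals and exponents, but only
    // when the entire string is a number
    $numeric[$text] = is_numeric( $text );
}

var_export( $numeric );
```

The first three strings are reported as numeric, while '7up' and the empty string are not.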

[snip]

> At first I thought the offset parameter of preg_match could do this, but
> the manual (<http://de3.php.net/preg_match>) says:
>
> "Note: Using offset is not equivalent to passing substr($subject,
> $offset) to preg_match() in place of the subject string, because pattern
> can contain assertions such as ^, $ or (?<=x)."

Read up on the "\G" assertion keeping track of match positions.

<URL:http://php.net/manual/en/regexp.reference.php>
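
A minimal sketch of what this could look like, assuming the pattern starts with \G and the position is passed as the fifth argument of preg_match(). The subject string is the one from the original question; the variable names are illustrative.

```php
<?php
$string = 'abcfoobarfoo';
$offset = 3; // position of the first "f"

// \G is true only at the position where the search started ($offset),
// so the pattern cannot slide forward to a later occurrence.
$fooAtOffset = preg_match( '/\Gfoo/', $string, $m, 0, $offset );
$abcAtOffset = preg_match( '/\Gabc/', $string, $m, 0, $offset );

// Without \G the engine scans ahead from $offset and finds "bar" at position 6.
$barAnywhere = preg_match( '/bar/', $string, $m, 0, $offset );
$barAtOffset = preg_match( '/\Gbar/', $string, $m, 0, $offset );

echo $fooAtOffset, $abcAtOffset, $barAnywhere, $barAtOffset, "\n"; // 1010
```

No substr() call is needed, so the input string is never copied.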
--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);

Nov 2 '08 #2

Curtis

On Sun, 02 Nov 2008 06:59:38 GMT, Curtis wrote:
[snip]

> // exclude "\d" and "_" from "\w"
> 'NAME' => '~[^\W\d_]+~';

Sorry, forgot to get rid of the semicolon:

'NAME' => '~[^\W\d_]+~',
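
A short sketch of what that character class matches. The input is plain ASCII, chosen for illustration, so the result does not depend on the locale.

```php
<?php
// [^\W\d_] reads as "not a non-word character, not a digit, not an underscore",
// i.e. whatever \w matches minus digits and "_".
preg_match_all( '~[^\W\d_]+~', 'foo_42bar baz', $m );

print_r( $m[0] ); // foo, bar, baz
```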

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);

Nov 2 '08 #3

Thomas Mlynarczyk

Curtis wrote:

> It might be more helpful to let us know what you're really working
> with.

Well, it's supposed to be a general-purpose lexer receiving both the
input string and the tokens array in the constructor. The result would
either go to a parser, or be used to generate a nice syntax-highlighted
output of the source code. I also want to have the possibility of
working with languages like Yaml which use indent to convey grouping
information, but that would be a second step.

>> $tokens = array(
>> 'NUMBER' => '~^\d+~',
>
> Match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.
>
>> 'NAME' => '~^[A-Za-z]+~',
>
> And if a name contains characters like á, é, ö, ü, or ä? You should
> use "\w", which is locale-specific. Although, in your case, this
> regex would be more suitable:
>
> // exclude "\d" and "_" from "\w"
> 'NAME' => '~[^\W\d_]+~';

That's why I wrote "example tokens" and "just to illustrate my approach".

> It seems like your tokenizer will be quite inefficient with so much
> use of regex. Is there any way you can approach your problem so that
> you don't need to use regex for simple tokens?

Indeed, that is a point I should consider. I would somehow have to
include the information in my tokens array whether the token is to be
preg_match()'ed or a simple string to match. But I wonder if that extra
logic would not take longer than always using a regex. I suppose the
regex engine is clever enough to optimize simple regexes, especially,
since tokens usually appear several times in the input.

Also, performance is not that important to me here. If it was, I
probably wouldn't use PHP for the job.

> Instead of using the
> 'ANY' approach, maybe you could just use a 'continue' statement in
> your loop.

Yes, you are right. I will do that. Anyway, an 'ANY' would mean that
none of the "real" tokens matched, so it's a syntax error.

> Read up on the "\G" assertion keeping track of match positions.
> <URL:http://php.net/manual/en/regexp.reference.php>

Thanks - indeed, that sounds like just what I need. Meanwhile I have
tried the A modifier, and that seems to work, even though the
documentation is not very clear here:

$tokens = array(
    'T_IDENTIFIER' => '~[A-Za-z_][A-Za-z0-9_]*~',
    'T_NUMBER'     => '~[+-]?\\d+~',
    // others omitted for brevity
);

function tokenize( $input, $tokens )
{
    // normalize CRLF and lone CR line endings to LF
    $input = preg_replace( '~\r\n?~', "\n", $input );
    $result = array();
    $offset = 0;
    $finish = strlen( $input );
    $line = 1;

    while ( $offset < $finish )
    {
        foreach ( $tokens as $name => $regex )
        {
            // the A modifier anchors the match at $offset
            if ( preg_match( $regex . 'A', $input, $data, 0, $offset ) )
            {
                // prepend token name and line number; the matched text is now $data[2]
                array_unshift( $data, $name, $line );
                $offset += strlen( $data[2] );
                $line += substr_count( $data[2], "\n" );
                $result[] = $data;
                continue 2;
            }
        }
        // no token matched at $offset
        $result[] = array( 'T_ERROR', 'Error in line ' . $line );
        break;
    }
    return $result;
}

I will try out the \G assertion -- but, assuming both work (A and \G),
is there any reason to prefer one over the other?

Thanks for your suggestions.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

Nov 2 '08 #4

Curtis

On Sun, 02 Nov 2008 15:31:40 +0100, th****@mlynarczyk-webdesign.de
wrote:

> Curtis wrote:
>
>> It might be more helpful to let us know what you're really working
>> with.
>
> Well, it's supposed to be a general-purpose lexer receiving both the
> input string and the tokens array in the constructor. The result would
> either go to a parser, or be used to generate a nice syntax-highlighted
> output of the source code. I also want to have the possibility of
> working with languages like Yaml which use indent to convey grouping
> information, but that would be a second step.

Sounds interesting. You might be able to take care of your syntax
highlighting with GeSHi.

<URL:http://qbnz.com/highlighter/>

>>> $tokens = array(
>>> 'NUMBER' => '~^\d+~',
>>
>> Match would fail on +7, -7, 5.123, -5.123e-10, .5, etc.
>>
>>> 'NAME' => '~^[A-Za-z]+~',
>>
>> And if a name contains characters like á, é, ö, ü, or ä? You should
>> use "\w", which is locale-specific. Although, in your case, this
>> regex would be more suitable:
>>
>> // exclude "\d" and "_" from "\w"
>> 'NAME' => '~[^\W\d_]+~';
>
> That's why I wrote "example tokens" and "just to illustrate my approach".

That is why I initially asked to know what you were working with. In
the meantime, I just wanted to point out some possible things about
those regexes that might be unexpected or problematic, which might
also be of note to other readers.

>> It seems like your tokenizer will be quite inefficient with so much
>> use of regex. Is there any way you can approach your problem so that
>> you don't need to use regex for simple tokens?
>
> Indeed, that is a point I should consider. I would somehow have to
> include the information in my tokens array whether the token is to be
> preg_match()'ed or a simple string to match. But I wonder if that extra
> logic would not take longer than always using a regex. I suppose the
> regex engine is clever enough to optimize simple regexes, especially,
> since tokens usually appear several times in the input.

In places where there are a significant amount of contiguous simple
patterns, I suspect it would help quite a bit, depending on the size
of the data.

> Also, performance is not that important to me here. If it was, I
> probably wouldn't use PHP for the job.

As opposed to what? From personal experience, PHP performs very well
when installed as a PHP module. Its performance, when compared to
other commonly used interpreted languages for the Web, is comparable.

ISTM, any bottlenecks encountered, in your case, will be in the
algorithm, not the language.

>> Instead of using the
>> 'ANY' approach, maybe you could just use a 'continue' statement in
>> your loop.
>
> Yes, you are right. I will do that. Anyway, an 'ANY' would mean that
> none of the "real" tokens matched, so it's a syntax error.
>
>> Read up on the "\G" assertion keeping track of match positions.
>> <URL:http://php.net/manual/en/regexp.reference.php>
>
> Thanks - indeed, that sounds like just what I need. Meanwhile I have
> tried the A modifier, and that seems to work, even though the
> documentation is not very clear here:

[snip]

> I will try out the \G assertion -- but, assuming both work (A and \G),
> is there any reason to prefer one over the other?

Here's an excerpt from the PCRE documentation:

"The \G assertion is true only when the current matching position is
at the start point of the match, as specified by the offset argument
of preg_match(). It differs from \A when the value of offset is non-
zero. It is available since PHP 4.3.3."

Also, if you use the "PREG_OFFSET_CAPTURE" flag, the index at which
grouped matches occurred will be stored in the match array.
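
A small sketch of that flag in combination with an offset. The pattern and subject string are illustrative only.

```php
<?php
$string = 'abcfoobarfoo';

// With PREG_OFFSET_CAPTURE every entry of $m becomes array( text, byte offset ).
// The fifth argument still tells preg_match() where to start searching;
// the reported offsets count from the start of $string, not from that position.
preg_match( '/(bar)(foo)/', $string, $m, PREG_OFFSET_CAPTURE, 3 );

echo $m[0][0], ' at ', $m[0][1], "\n"; // barfoo at 6
echo $m[1][0], ' at ', $m[1][1], "\n"; // bar at 6
echo $m[2][0], ' at ', $m[2][1], "\n"; // foo at 9
```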

Until you progress along to a functioning example, it will be hard to
suggest any concrete, specific help. For now, all I can think of, is
to use the keys in your token arrays to indicate whether or not the
token should be parsed with regex.
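
One way this idea might be sketched, under a made-up convention: keys starting with "L_" mark literal strings, and every other key marks a regex. The function name matchAt() and the token names are hypothetical, not part of any posted code.

```php
<?php
// Hypothetical convention: an "L_" key prefix means "compare literally",
// any other key means "treat the value as a regex".
$tokens = array(
    'L_ARROW'  => '=>',
    'L_LPAREN' => '(',
    'T_NUMBER' => '~\d+~',
);

// Returns the text matched at $offset by the given token, or null if it does not match there.
function matchAt( $name, $pattern, $input, $offset )
{
    if ( strncmp( $name, 'L_', 2 ) === 0 )
    {
        // substr_compare() compares in place, without copying the rest of $input
        $length = strlen( $pattern );
        return substr_compare( $input, $pattern, $offset, $length ) === 0 ? $pattern : null;
    }
    // regex tokens are anchored at $offset with the A modifier
    return preg_match( $pattern . 'A', $input, $m, 0, $offset ) ? $m[0] : null;
}

var_dump( matchAt( 'L_ARROW', $tokens['L_ARROW'], 'a=>12', 1 ) );   // string(2) "=>"
var_dump( matchAt( 'T_NUMBER', $tokens['T_NUMBER'], 'a=>12', 3 ) ); // string(2) "12"
var_dump( matchAt( 'L_LPAREN', $tokens['L_LPAREN'], 'a=>12', 1 ) ); // NULL
```

Whether the prefix check pays for itself would have to be measured on real input.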

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);

Nov 3 '08 #5

Thomas Mlynarczyk

Curtis wrote:

> <URL:http://qbnz.com/highlighter/>

Thanks for the link - I have downloaded it and will try out this tool.

[Using simple string functions instead of regexes]

> In places where there are a significant amount of contiguous simple
> patterns, I suspect it would help quite a bit, depending on the size
> of the data.

I suppose I will just have to try it out and see what it brings. But
then I prefer the code to be as simple as possible, so if I see no
particular need for a speed-up, I guess I will just do with the regexes.

>> Also, performance is not that important to me here. If it was, I
>> probably wouldn't use PHP for the job.
>
> As opposed to what? From personal experience, PHP performs very well
> when installed as a PHP module. Its performance, when compared to
> other commonly used interpreted languages for the Web, is comparable.

As opposed to a compiled language. If I was to invent my very own
scripting language and use it for real stuff, using PHP as
lexer/parser/interpreter would not be a good idea. But for having, say,
a PHP application which uses Yaml for its config files, the latter could
be parsed, "translated" to native PHP arrays and cached as such. Thus, I
would have the comfort of writing my config in Yaml and still get full
PHP performance. (Symfony does this, but I'm the kind of programmer who
wants to do it all by myself, even if that means re-inventing the wheel.
I can learn a lot this way.)

> Here's an excerpt from the PCRE documentation:
>
> "The \G assertion is true only when the current matching position is
> at the start point of the match, as specified by the offset argument
> of preg_match(). It differs from \A when the value of offset is
> non-zero. It is available since PHP 4.3.3."

It seems as if the \G assertion does the same as the A modifier (not the
\A assertion!). At least my test worked:

$string = 'abcfoobarfoo';
echo preg_match( '~foo~A', $string, $data, 0, 3 ); // 1
echo preg_match( '~bar~A', $string, $data, 0, 3 ); // 0
echo preg_match( '~abc~A', $string, $data, 0, 3 ); // 0

Same result if I use '~\Gfoo~' etc. Strangely, '~\Afoo~' etc. gives me
three 0's instead of the expected 001. Maybe I misunderstood something
here. But the most important is that /\G/ or /.../A do solve my original
problem.

Thanks & greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

Nov 3 '08 #6

Curtis

On Mon, 03 Nov 2008 16:22:03 +0100, th****@mlynarczyk-webdesign.de
wrote:

> Curtis wrote:

[snip]

>> As opposed to what? From personal experience, PHP performs very well
>> when installed as a PHP module. Its performance, when compared to
>> other commonly used interpreted languages for the Web, is comparable.
>
> As opposed to a compiled language...

Well yes, when compared to C or C++, for example, it seems natural
the performance would be noticeably better in certain cases.

> If I was to invent my very own
> scripting language and use it for real stuff, using PHP as
> lexer/parser/interpreter would not be a good idea. But for having, say,
> a PHP application which uses Yaml for its config files, the latter could
> be parsed, "translated" to native PHP arrays and cached as such. Thus, I
> would have the comfort of writing my config in Yaml and still get full
> PHP performance. (Symfony does this, but I'm the kind of programmer who
> wants to do it all by myself, even if that means re-inventing the wheel.
> I can learn a lot this way.)

I can definitely understand where you are coming from here. I have
done a fair amount of learning from the desire to understand how
things work.

>> Here's an excerpt from the PCRE documentation:
>>
>> "The \G assertion is true only when the current matching position is
>> at the start point of the match, as specified by the offset argument
>> of preg_match(). It differs from \A when the value of offset is
>> non-zero. It is available since PHP 4.3.3."
>
> It seems as if the \G assertion does the same as the A modifier (not the
> \A assertion!). At least my test worked:
>
> $string = 'abcfoobarfoo';
> echo preg_match( '~foo~A', $string, $data, 0, 3 ); // 1
> echo preg_match( '~bar~A', $string, $data, 0, 3 ); // 0
> echo preg_match( '~abc~A', $string, $data, 0, 3 ); // 0
>
> Same result if I use '~\Gfoo~' etc. Strangely, '~\Afoo~' etc. gives me
> three 0's instead of the expected 001. Maybe I misunderstood something
> here. But the most important is that /\G/ or /.../A do solve my original
> problem.

From what I read in the manual about the /A modifier, it is
equivalent to starting your regex with the \A assertion (which
matches at the beginning of the *subject*, even in multiline mode).
The beginning of the subject being at the start of the specified
offset, or 0, by default.

As for the \G assertion, I have yet to have the opportunity to put it
into practical use. I think I'll try to get a better grasp of
its usage to satisfy my own curiosity, too.

In any case, I hope all goes well in your project.

P.S.: there may be the off chance that PHP's built-in tokenizer
functions will be useful to you, too:

<URL:http://us3.php.net/manual/en/intro.tokenizer.php>

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);

Nov 3 '08 #7

Curtis

On Mon, 03 Nov 2008 23:52:51 GMT, dy****@sig.invalid wrote:
[snip]

> From what I read in the manual about the /A modifier, it is
> equivalent to starting your regex with the \A assertion (which
> matches at the beginning of the *subject*, even in multiline mode).
> The beginning of the subject being at the start of the specified
> offset, or 0, by default.

Seems I misread the manual, and your post. My above quoted text is
rubbish, sorry.

When I ran a test using the /A modifier and another using the \A
assertion at the beginning, the results contradicted what I wrote:

$s = 'abcfoobar';
$m = array();

echo preg_match('/foobar/A', $s, $m, 0, 3); // 1
echo preg_match('/\Afoo/', $s, $m, 0, 3); // 0
echo preg_match('/\Afoo/', substr($s,3), $m, 0, 0); // 1

So, ISTM, regardless of offset, \A will look at the start of the
string, so the assertion seems to fail when an offset is greater than
0. Yet the /A modifier seems to adjust with the offset.

Sorry for the mistake.

--
Curtis
$email = str_replace('sig.invalid', 'gmail.com', $from);

Nov 4 '08 #8

Thomas Mlynarczyk

Curtis wrote:

> From what I read in the manual about the /A modifier, it is
> equivalent to starting your regex with the \A assertion (which
> matches at the beginning of the *subject*, even in multiline mode).

PHP Manual: "The \A, \Z, and \z assertions differ from the traditional
circumflex and dollar [...] in that they only ever match at the very
start and end of the subject string, whatever options are set."

According to the above, \A is a "stronger" version of "^", that always
matches at the real, very start of the complete string passed to a regex
function. Thus, not just the portion from a specified offset onwards. My
tests seem to confirm that.

PHP Manual: "The \G assertion is true only when the current matching
position is at the start point of the match, as specified by the offset
argument of preg_match(). It differs from \A when the value of offset is
non-zero."

That would be the behaviour which you think \A has. Also, the manual
explicitly states that \A ignores any offset.

PHP Manual: "If [the /A modifier] is set, the pattern is forced to be
"anchored", that is, it is constrained to match only at the start of the
string which is being searched (the "subject string")."

This is a bit vague with respect to an offset parameter, but as my tests
have shown, "subject string" here means the part of the string that
starts at the offset. And thus /A is equivalent to \G, not to \A, as
illogical as that seems given the names chosen.

Thus, we have:

\G == /A
\A == ^ without /m
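
A minimal sketch that puts the four variants side by side, reusing the subject string and offset from the earlier test. The labels and the $results array are illustrative only.

```php
<?php
$string = 'abcfoobarfoo';

// Each pattern is tried with the search starting at offset 3 (the first "foo").
$patterns = array(
    'A modifier'   => '~foo~A',
    '\G assertion' => '~\Gfoo~',
    '\A assertion' => '~\Afoo~',
    '^ without m'  => '~^foo~',
);

$results = array();
foreach ( $patterns as $label => $regex )
{
    $results[$label] = preg_match( $regex, $string, $m, 0, 3 );
    echo str_pad( $label, 14 ), $results[$label], "\n";
}
```

The first two report 1 and the last two report 0: /A and \G anchor at the offset, while \A and ^ (without /m) anchor at the real start of the string.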

> <URL:http://us3.php.net/manual/en/intro.tokenizer.php>

Yes, I have tried it already. And I would use it, rather than anything
"self-made", for parsing PHP.

Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)

Nov 4 '08 #9
