preg_match_all: combine different patterns

Han
I know this is possible (because preg can do almost anything!), but can't
get a handle on the syntax.

I have an HTML string:

<font size="3"><a
href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a></font><br><img width="200" height="200" border="0"
src="http://www.example.com/images/3456789.jpg"><b>$8.00</b>

I would like to extract the two substrings:

1) Everything between the anchor tags INCLUDING the opening/closing tags
2) The price between the bold tags EXCLUDING the opening/closing tags

I want the result to look like this:

<a href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a>$8.00

Finding the patterns separately is straightforward (untested):

preg_match_all ("|<font size=\"3\">(<a.*?</a>)</font>|i", $html, $out);

preg_match_all ("|<b>(\$\d{1,3}\.\d{2})</b>|i", $html, $out);

How can these be combined into one pattern?

Thanks in advance.
Jul 17 '05 #1
3 Replies


Hi,

Han wrote:
I know this is possible (because preg can do almost anything!), but can't
get a handle on the syntax.

Yep!

Han wrote:
How can these be combined into one pattern?

preg_match_all ('|<font size="3">(<a.*?</a>)</font>.*?<b>(.*?)</b>|is',
$html, $out);

$out will then be an array:

$out[0] // The whole text that each match applies to
$out[1] // The anchor element (first group)
$out[2] // The price (second group)

Han wrote:
Thanks in advance.

No problem.
You may want to take a look at a book called "Mastering Regular Expressions"
from O'Reilly. Great stuff!
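
For anyone trying this out, here is a minimal sketch of how the $out array fills up with the default PREG_PATTERN_ORDER. It assumes the sample markup from the question, and a variant of the combined pattern with a lazy bridge and the s modifier so that the dot can cross line breaks; treat it as an illustration rather than the one true pattern.

```php
<?php
// Sample markup from the question (nowdoc, so "$" is never interpolated).
$html = <<<'HTML'
<font size="3"><a
href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a></font><br><img width="200" height="200" border="0"
src="http://www.example.com/images/3456789.jpg"><b>$8.00</b>
HTML;

// Combined pattern: lazy quantifiers, "s" so the dot also matches newlines.
$pattern = '|<font size="3">(<a.*?</a>)</font>.*?<b>(.*?)</b>|is';

$count = preg_match_all($pattern, $html, $out);

// Default PREG_PATTERN_ORDER: $out[N] lists group N for every match.
for ($i = 0; $i < $count; $i++) {
    // Anchor element followed directly by the price, as asked for.
    echo $out[1][$i] . $out[2][$i] . "\n";
}
```

With more than one product in $html, the loop simply runs once per match.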


Alexander

Jul 17 '05 #2

Han wrote:
<font size="3"><a
href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a></font><br><img width="200" height="200" border="0"
src="http://www.example.com/images/3456789.jpg"><b>$8.00</b>

You need to elaborate a bit. As you're using preg_match_all, I
suspect there are an indefinite number of occurrences, and the
properties differ each time.

Is your markup really always in that format? Are attribute values
always quoted using double quotation marks? No more attributes in
font(!)? Tags never span multiple lines? Might the font size
change? Might a bold element that you don't want to match appear
after the image? Yada, yada, yada.

Han wrote:
preg_match_all ("|<font size=\"3\">(<a.*?</a>)</font>|i", $html, $out);

preg_match_all ("|<b>(\$\d{1,3}\.\d{2})</b>|i", $html, $out);

How can these be combined into one pattern?

Append the latter to the former. But there's a bit to sort out.

If the pattern is enclosed in double quotation marks, the dollar
sign ("$") must be escaped properly, with two backslashes.
Otherwise PHP itself consumes the single backslash, and PCRE
receives a bare $, which it interprets as an assertion rather
than a literal dollar sign. A preferable solution, however, is to
use single quotation marks, so as to avoid any befuddlement and
cause significantly less clutter.
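
The quoting issue can be checked with a short sketch like the one below. The variable names are only for illustration, and the subject string is cut down to the bold element.

```php
<?php
// In double quotes PHP turns \$ into a bare $, so PCRE sees an
// end-of-subject assertion followed by \d, which can never match here.
$broken = "|<b>(\$\d{1,3}\.\d{2})</b>|i";

// Two backslashes in double quotes: PHP passes \$ through to PCRE.
$escaped = "|<b>(\\$\d{1,3}\.\d{2})</b>|i";

// Single quotes: the backslash reaches PCRE untouched.
$fixed = '|<b>(\$\d{1,3}\.\d{2})</b>|i';

$html = '<b>$8.00</b>';

$brokenCount = preg_match_all($broken, $html, $m1);
$fixedCount  = preg_match_all($fixed, $html, $m2);

var_dump($escaped === $fixed);            // both spell the same pattern
echo $brokenCount, ' ', $fixedCount, "\n"; // match counts for each variant
```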

The dot metacharacter doesn't match newlines by default. Setting
the s modifier alters its behaviour, allowing it to match
newlines; this allows the text to span across multiple lines.
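
A small sketch of the difference, assuming an anchor element that is split over three lines like the one in the question (the URL is shortened for the example):

```php
<?php
// Anchor element written across several lines.
$html = "<a\nhref=\"http://www.example.com\">\nMickey Mouse</a>";

// Without "s" the dot stops at the first line break, so nothing matches.
$withoutS = preg_match('`<a.*?</a>`i', $html);

// With "s" (PCRE_DOTALL) the dot also matches newlines.
$withS = preg_match('`<a.*?</a>`is', $html, $m);

echo $withoutS, ' ', $withS, "\n";
```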

A slightly more forgiving pattern (using single quotation marks):

`
<font\s.*size\s*=\s*(["\']?)\+?[1-7]\1.*>(<a.*</a\s*>)</font\s*>
.*
<b\s*>(\$\d{1,3}\.\d{2})</b\s*>
`Usix

The pattern is fundamentally the result of coupling your two
regexes together. The bridge is simply a match of any character,
including newlines, until a bold element with the required content
is found.

There are subtle nuances however. The deprecated font start-tag
now accepts more attributes; it may also be written across
multiple lines, so too can the anchor element; and the size
attribute value needn't be quoted.
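
To see the pattern at work, here is a sketch that runs it over two products. The second product (Donald Duck, with an extra face attribute and an unquoted size) is made up for the example to exercise the more forgiving parts; PREG_SET_ORDER is my choice, not a requirement.

```php
<?php
// First product is the sample from the question; the second is hypothetical.
$html = <<<'HTML'
<font size="3"><a
href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a></font><br><img width="200" height="200" border="0"
src="http://www.example.com/images/3456789.jpg"><b>$8.00</b>
<font face="Arial" size=2><a href="http://www.example.com?product=1">
Donald Duck</a></font><br><b>$12.50</b>
HTML;

// U = ungreedy, s = dot matches newlines, i = caseless, x = ignore whitespace.
$pattern = '`
    <font\s.*size\s*=\s*(["\']?)\+?[1-7]\1.*>(<a.*</a\s*>)</font\s*>
    .*
    <b\s*>(\$\d{1,3}\.\d{2})</b\s*>
`Usix';

// PREG_SET_ORDER groups the captures per match rather than per group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$results = [];
foreach ($matches as $m) {
    // $m[1] is the quote character, $m[2] the anchor, $m[3] the price.
    $results[] = $m[2] . $m[3];
}

echo implode("\n", $results), "\n";
```

Note that the quote character takes up group 1, so the anchor and the price move to groups 2 and 3.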

If I were you, I might get out of the habit of demarcating
patterns with vertical lines ("|"). The grave accent ("`")
character is rarely used, and, hence, seems a wise alternative.
You might also consider the hash ("#") character, but be careful:
you can't then comment patterns with "#" under PCRE_EXTENDED [1],
because an unescaped hash ends the pattern. With other delimiters,
the "(?# comment )" syntax is available too. It took me a while
to notice that!
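
As a sketch of a commented pattern, assuming grave accents as delimiters so that "#" stays free, both comment styles can sit side by side under the x modifier:

```php
<?php
// Backtick delimiters leave "#" free for end-of-line comments under "x".
$pattern = '`
    <b\s*>                  # opening bold tag
    ( \$ \d{1,3} \. \d{2} ) (?# price, captured as group 1 )
    </b\s*>                 # closing bold tag
`ix';

preg_match($pattern, '<B>$8.00</B>', $m);

echo $m[1], "\n";
```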

Good luck.
[1] Pattern Modifiers,
http://www.php.net/manual/en/pcre.pattern.modifiers.php

--
Jock
Jul 17 '05 #3

Han
John,

Thanks for your detailed response.

You're right, I could have been more specific, but your assumptions were
correct. The markup is always the same and there are many occurrences.

After giving the matter some thought, I don't believe it's possible to do
everything required in one pass. There are multiple items that need to be
collected and manipulated so I have decided to grab larger blocks then loop
through the matches array, breaking everything out into smaller specific
elements.
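
Roughly, the two-pass idea could look like the sketch below. The block pattern and the array keys (link, price, image) are placeholders for illustration, not the exact code in use.

```php
<?php
$html = <<<'HTML'
<font size="3"><a
href="http://www.example.com?product=3456789&amp;company=3528">
Mickey Mouse</a></font><br><img width="200" height="200" border="0"
src="http://www.example.com/images/3456789.jpg"><b>$8.00</b>
HTML;

// Pass 1: grab each product block, from the font tag to the closing bold tag.
preg_match_all('`<font\s.*?</b>`is', $html, $blocks);

$products = [];
foreach ($blocks[0] as $block) {
    // Pass 2: pull the individual pieces out of one block at a time.
    $item = [];
    if (preg_match('`<a\s.*?</a>`is', $block, $m)) {
        $item['link'] = $m[0];
    }
    if (preg_match('`<b>(\$\d{1,3}\.\d{2})</b>`i', $block, $m)) {
        $item['price'] = $m[1];
    }
    if (preg_match('`<img\s[^>]*src="([^"]*)"`i', $block, $m)) {
        $item['image'] = $m[1];
    }
    $products[] = $item;
}

print_r($products);
```

Each small pattern stays simple this way, and a missing element in one block does not lose the rest of that block.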

Your comments and syntax helped a lot though and have given me several ideas
of how to streamline the process.

Much appreciated!

Jul 17 '05 #4
