Parsing content for links

I have a content management system that has links within the content
field in the database and I need to verify if those links are correct.
What I need is a PHP script that queries the database and then parses
the content field to find all the <a href> tags and get the href
attribute value and the link text.

Does anyone have a way of doing this or a regex to do this?

Thanks,
Tony

Feb 21 '07 #1
2 Replies


Tony wrote:
> I have a content management system that has links within the content
> field in the database and I need to verify if those links are correct.
> What I need is a PHP script that queries the database and then parses
> the content field to find all the <a href> tags and get the href
> attribute value and the link text.
>
> Does anyone have a way of doing this or a regex to do this?

preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+" .
"(.*?)[\"\']+.*?>" . "([^<]+|.*?)?<\/a>/",
$html, $matches); // hrefs end up in $matches[1], link text in $matches[2]
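
To read the results back out, something along these lines should work. It is only a sketch: the sample $html string is made up, and the pattern is a lightly adapted copy of the one above (the leading < and the i flag are assumptions, not part of the original).

```php
<?php
// Made-up sample input; in practice this would be the content field
// fetched from the database.
$html = '<p><a href="http://example.com/">Example</a> and <a href=\'page.html\'>a page</a></p>';

// Adapted copy of the pattern (leading "<" and the "i" flag are assumptions).
$pattern = '/<a\s+[^>]*?href\s?=[\s"\']+(.*?)["\']+.*?>([^<]+|.*?)?<\/a>/i';

// With the default ordering (PREG_PATTERN_ORDER), $matches[1] holds every
// href value and $matches[2] holds every link text, in the same order.
$count = preg_match_all($pattern, $html, $matches);

for ($i = 0; $i < $count; $i++) {
    echo $matches[1][$i], ' => ', $matches[2][$i], "\n";
}
```

Because the href and the text sit at the same index in the two arrays, a plain counter loop is enough to pair them up.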
--
Arjen
http://www.hondenpage.com - My site about dogs
Feb 22 '07 #2

Tony wrote:
> I have a content management system that has links within the content
> field in the database and I need to verify if those links are correct.
> What I need is a PHP script that queries the database and then parses
> the content field to find all the <a href> tags and get the href
> attribute value and the link text.
>
> Does anyone have a way of doing this or a regex to do this?
>
> Thanks,
> Tony

Yeah, regex would be easiest, and there should be plenty out there,
but I might do something like this:

$re = '%
<a\s[^<>]*? # href may or may not come first
href\s*=\s*([\'"]) # capture single/double quote

# match an absolute URI (relative links have no scheme and are skipped)
(
[\w.+-]+:(?://)? # scheme
[^?#\'"<>]+ # authority and path

# possible query string and fragment
(?: \\? [^#\'"<>]* )?
(?: \\# [^\'"<>]* )?
)

\1 # captured quote from above
[^<>]* # possible remaining attributes
>( .*? ) # allow for nested tags
</a> # closing <a> tag
%xis';

The match for the URI would be in $match[2] and the text for the <a>
tag is in $match[3].

Just use this $re var in the preg_* functions.
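
For instance, with preg_match_all and PREG_SET_ORDER each link comes back as its own array, so the indexes above line up per match. A rough sketch, assuming a condensed stand-in pattern with the same three capture groups (quote, URI, link text); extract_links() and the sample $html are made-up names for illustration only:

```php
<?php
// Hypothetical helper: returns one array('uri' => ..., 'text' => ...) per link.
// The pattern is a condensed stand-in with the same group layout:
// 1 = quote character, 2 = URI, 3 = link text.
function extract_links($html)
{
    $re = '%<a\s[^<>]*?href\s*=\s*([\'"])([^\'"<>]+)\1[^<>]*>(.*?)</a>%is';

    $links = array();
    // PREG_SET_ORDER groups the captures per match instead of per group.
    if (preg_match_all($re, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $links[] = array(
                'uri'  => $match[2],
                // strip_tags() drops any nested markup inside the link text
                'text' => trim(strip_tags($match[3])),
            );
        }
    }
    return $links;
}

// Made-up sample input
$html = '<p><a class="x" href="http://example.com/a?b=1#c">Example <b>site</b></a></p>';
print_r(extract_links($html));
```

From there each 'uri' can be handed to whatever check decides whether the link is still good.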

Hope this helps,
Curtis
Feb 22 '07 #3
