Bytes IT Community

Extract any URL from any string?

geturl.php

Too much code to paste here, but have a look at http://www.liarsscourge.com/

So far, I have not found a string that can break this...

Any built-in functions or suggestions for improvement?

Thanks in advance.
Feb 10 '07 #1
14 Replies


Daz
On Feb 10, 8:41 pm, "deko" <d...@nospam.com> wrote:
geturl.php

Too much code to paste here, but have a look at http://www.liarsscourge.com/

So far, I have not found a string that can break this...

Any built-in functions or suggestions for improvement?

Thanks in advance.

I don't want to sound negative here, but what exactly is the point?
I ask because I see no reason why you can't extract the URL with a
single regex, and then use another one to validate it. Validation is
simple enough; the question is how valid you want it to be. Do you
need to specify each TLD, or just match a general pattern? In either
case, two or three regexes at most should do what you are after. With
a little extra crafting, you should be able to extract multiple URLs
in one go.

Feb 10 '07 #2
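[For illustration, the single-regex extraction Daz describes might look something like this - a sketch with made-up sample text, not the actual geturl.php code:]

```php
<?php
// One pass with preg_match_all pulls every http(s) URL-ish token out of
// free text: match the scheme, then run until whitespace or a quote/bracket.
$text = 'see http://example.com/a and https://example.org/?q=1 too';
preg_match_all('~https?://[^\s"\'<>]+~i', $text, $m);
print_r($m[0]); // [http://example.com/a, https://example.org/?q=1]
```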

deko wrote:
geturl.php

Too much code to paste here, but have a look at
http://www.liarsscourge.com/
So far, I have not found a string that can break this...

Any built-in functions or suggestions for improvement?
1. Increase the error_reporting level and you will find some sloppy notices
2. Have a look at parse_url(), which might be useful
3. Use preg_* functions instead of the POSIX ereg* functions (performance)
4. Strings like the following cause infinite loops:

getURL('fofo http://discovery.co.uk/../foo');

Probable fix:

= Replace:

if (!eregi("^(com|net|org...)$", $urlString_a[$i])) {
...
}

= With:

if (preg_match("!^(com|net|org...)[^$]!", $urlString_a[$i], $m)) {
$urlString_a[$i] = $m[1];
}
JW
Feb 10 '07 #3
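[To illustrate JW's second suggestion: parse_url() only splits a URL into its components - it does not validate. A quick sketch:]

```php
<?php
// parse_url() returns an associative array of URL components.
$parts = parse_url('http://example.com/foo/bar?q=1#frag');
echo $parts['scheme'];   // http
echo $parts['host'];     // example.com
echo $parts['path'];     // /foo/bar
echo $parts['query'];    // q=1
echo $parts['fragment']; // frag
```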

> 1. Increase the error_reporting level and you will find some sloppy notices
> 2. Have a look at parse_url(), which might be useful
> 3. Use preg_* functions instead of the POSIX ereg* functions (performance)
> 4. Strings like the following cause infinite loops:
> getURL('fofo http://discovery.co.uk/../foo');
Outstanding. Thanks for the constructive feedback.
Feb 11 '07 #4

geturl 2.0...
1. Increase the error_reporting level and you will find some sloppy notices
so I have a few undefined variables... I thought this did not matter with PHP
(script vs. compiled code)
2. Have a look at parse_url(), which might be useful
well, if urlString = http://netforex.subdomain.net.foo'>, parse_url() returns:
netforex.subdomain.net.foo' - which is not much help. Nevertheless, I'm using
it on line 79 for validation - but I think it would be better validation if I
could do this:

if ($urlLink = parse_url($urlLink))
{
    return $urlLink['host'];
}
??
3. Use preg_* functions instead of POSIX ereg* function (performance)
I was able to replace an eregi with preg_match on lines 18 and 29, but unsure of
the syntax for line 44 ... suggestions?
4. Strings like the following cause infinite loops:

getURL('fofo http://discovery.co.uk/../foo');
fixed! see line 38.

- - - - -

I'm wondering how much of a hack that for loop is (lines 26 - 53). Seems to
work okay...

Feb 11 '07 #5

http://www.liarsscourge.com <<== the script is here
Feb 11 '07 #6

deko wrote:
so I have a few undefined variables... I thought this did not matter
with PHP (script vs. compiled code)
It doesn't, but it's just good practice to do so (and increasing the
error_reporting level saves you a lot of time debugging when making typos in
variable names).
well, if urlString = http://netforex.subdomain.net.foo'>,
parse_url() returns: netforex.subdomain.net.foo'- which is not much
The parse_url function can indeed easily be fooled, so you will have to add
some additional validation. But it's quite useful for what you are doing
(and pretty fast).
I was able to replace an eregi with preg_match on lines 18 and 29,
but unsure of the syntax for line 44 ... suggestions?
You mean the following line?
if (eregi("^(com|net|org...)$", $urlString_a[$i]))
You can simply replace this with:
if (preg_match("!^(com|net|org...)$!i", $urlString_a[$i]))

JW

Feb 11 '07 #7

>so I have a few undefined variables... I thought this did not matter
>with PHP (script vs. compiled code)

It doesn't, but it's just good practice to do so (and increasing the
error_reporting level saves you a lot of time debugging when making typos in
variable names).
Understood. Adjusting error_reporting is helpful for debugging and
optimization.
>well, if urlString = http://netforex.subdomain.net.foo'>,
parse_url() returns: netforex.subdomain.net.foo'- which is not much

The parse_url function can indeed easily be fooled, so you will have to add
some additional validation. But it's quite useful for what you are doing (and
pretty fast).
I'm not sure where parse_url is useful, precisely because it is so easily
fooled. The strings I'm working with are entirely unpredictable. That's why I
explode the candidate URL into domains and inspect each one in a for loop.
>I was able to replace an eregi with preg_match on lines 18 and 29,
but unsure of the syntax for line 44 ... suggestions?

You mean the following line?
>if (eregi("^(com|net|org...)$", $urlString_a[$i]))

You can simply replace this with:
>if (preg_match("!^(com|net|org...)$!i", $urlString_a[$i]))
Thanks, but I'm still baffled by that syntax - why the second negation '!' after
the first '$' ? I understand that '$' means end of line, and 'i' means case
insensitive... is '$!' saying "not ending with"?

I still want to replace the eregi code (lines 44, 51, 55, 85) with preg_match,
but the preg_match syntax seems counterintuitive... I just haven't figured it
out yet. What's up with the delimiters?

http://www.liarsscourge.com <<== latest code here

Thanks again for the help!

Feb 12 '07 #8

http://www.liarsscourge.com/ <<== this is better

known bug: if an email address appears in the test string before a valid URL,
the script will not find the URL

Feb 12 '07 #9

deko wrote:
http://www.liarsscourge.com/ <<== this is better

known bug: if an email address appears in the test string before a valid
URL, the script will not find the URL
Hard-coding TLDs is generally not useful, as you never know when
unexpected ones may be put into use. Plus, you do not allow for any
schemes other than http(s).

You do not support valid URLs that carry user info in the authority:

<http://bob1234z_.sss:li*****@example.com/foo/bar
baz/index.bak.php?q=my&q2=query#frag2>

This is indeed a valid URL, but your algorithm fails. It's far more
useful to use a single regex, anticipate any scheme (look at wikipedia
or search some RFCs for valid URI format), and any TLD.

parse_url is not meant to be used for validation, as stated in the PHP
docs themselves.

This is an example implementing the regex I made, recently:

<?php
$re = '%
( [\w.+-]+ : (?://)? ) # scheme name

( [^/]+ ) # authority, domain
( / [^?]+ )? # path, if exists

# query and fragment, which may or may not exist
(?:
\\? # query initializer
( [^#]+ ) # grab query
(?: \\# ([\w-]+) )? # fragment, if exists
)?
%x';

$s = 'Welcome.to{"http://user:pa**@example.com/foo
bar".-/index.bak.php?q=query&r=arrr#frag2_2borky borked!';
if (preg_match($re, $s, $m)) {
    echo '<p>Original: <code>' . $s . '</code></p>';
    echo '<p>Extrapolation: <a href="'
        . htmlentities($m[0], ENT_QUOTES) . '">' . ($m[1].$m[2])
        . '</a> (full URI in link, see status bar).</p>';
}
else {
    echo 'Not a valid URI.';
}
?>

Curtis
Feb 12 '07 #10

> Hard coding TLDs is generally not useful, as you never know when
> unexpected ones may be put in use. Plus, you do not allow variance for
> different schemes other than http(s).
>
> parse_url is not meant to be used for validation, as stated in the PHP
> docs themselves.
>
> This is an example implementing the regex I made, recently: [...]
Thanks, I'll try to use this. As for Internet address validation, I've come to
this conclusion: I can only validate known quantities - that is, the scheme
(http, https, ftp) and the TLD. Granted, TLDs come and go, but not often
enough to avoid validating against a list. As for domain names, I can only
validate format - that is, 255 or fewer characters and no non-alphanumeric
characters other than a hyphen. geturl.php is my first attempt at implementing
this. I'll have another rev posted at http://liarsscourge.com shortly. Thanks
for your help!

Feb 12 '07 #11

Here's a possible way to validate a host's domain name(s):

$invalid = "(~ ` ! @ # $ % ^ & * ( ) _+ = { } [ ] \ | : ; " ' < , ? /)"
// pseudo code
$url_a = parse_url($url);
$urlHost = $url_a['host'];
$urlHost_a = explode('.', $urlHost);
for ($i = count($urlHost_a) - 2; $i > 0; $i--) // skips TLD
{
    if (preg_match($invalid, $urlHost_a[$i]))
    {
        echo 'failed on ' . $urlHost_a[$i] . '<br>';
    }
    else
    {
        echo 'valid domain name(s)<br>';
    }
}

Any suggestions on how to construct that $invalid pattern? I'm not sure what
syntax to use or what characters need to be escaped.

Feb 12 '07 #12
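[An alternative to enumerating every invalid character is to invert the test and whitelist what a DNS label may contain: letters, digits, and hyphens. A sketch - validLabel() is a hypothetical helper, and 63 characters is the per-label maximum from the DNS spec:]

```php
<?php
// Hypothetical helper: a DNS label is valid if it is 1-63 characters
// long and contains nothing outside letters, digits, and hyphens.
function validLabel($label)
{
    return strlen($label) >= 1
        && strlen($label) <= 63
        && !preg_match('/[^0-9a-z-]/i', $label);
}

var_dump(validLabel('netforex')); // true
var_dump(validLabel('foo_bar'));  // false - underscore is not allowed
```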

deko wrote:
Thanks, but I'm still baffled by that syntax - why the second
negation '!' after the first '$' ? I understand that '$' means end
of line, and 'i' means case insensitive... is '$!' saying "not ending
with"?
The exclamation marks are used to indicate the beginning and the end of the
pattern. The following will also work:

#(pattern)#
|(pattern)|
?(pattern)?

...or any character that's not in use by the pattern. This is not a
restriction, however, because you can escape characters with a backslash
when they occur both in the pattern and as pattern delimiters:

|(a\|b)|
JW
Feb 13 '07 #13
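[To make JW's point concrete - the first character of the pattern is the delimiter, and a delimiter character occurring inside the pattern must be escaped. A sketch:]

```php
<?php
// These are equivalent: only the delimiter character differs.
var_dump(preg_match('!^abc$!', 'abc')); // 1
var_dump(preg_match('#^abc$#', 'abc')); // 1

// When the delimiter also appears in the pattern, escape it:
var_dump(preg_match('|^a\|b$|', 'a|b')); // 1
```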

deko wrote:
> Here's a possible way to validate a host's domain name(s): [...]
>
> Any suggestions on how to construct that $invalid pattern? I'm not sure
> what syntax to use or what characters need to be escaped.
Again, I would implement it all in one regex, using the /x modifier to
maintain readability. You could use that list of invalid chars in a
negative lookahead - but there are plenty of ways to do this. I'm just
not sure how worthwhile it would be to go all out.

Curtis
Feb 14 '07 #14

>Any suggestions on how to construct that $invalid pattern? I'm not sure
>what syntax to use or what characters need to be escaped.

Again, I would implement it all in one regex, using the /x modifier to
maintain readability. You could use that list of invalid chars in a
negative lookahead - but there are plenty of ways to do this. I'm just
not sure how worthwhile it would be to go all out.
This seems to be working:

preg_match('/[^0-9a-z-]/i', $urlHost_a[$d])

http://www.liarsscourge.com

Feb 14 '07 #15
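[Putting the thread's pieces together - parse the host, split it into labels, and reject any label with characters outside [0-9a-z-]. A sketch with a made-up URL, not the final geturl.php:]

```php
<?php
// Extract the host component directly, then whitelist-check each label.
$host = parse_url('http://www.example-site.com/path', PHP_URL_HOST);

$ok = ($host !== false && $host !== null);
foreach (explode('.', (string) $host) as $label) {
    if ($label === '' || preg_match('/[^0-9a-z-]/i', $label)) {
        $ok = false; // empty label or a character outside the whitelist
    }
}

var_dump($host); // "www.example-site.com"
var_dump($ok);   // true
```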

This discussion thread is closed

Replies have been disabled for this discussion.