Bytes IT Community

Best way to parse a url for validity?

I have checkURL(http://globalwarmingawareness2007.org.uk,
globalwarmingawareness2007.org.uk)

I see almost everyone using regular expressions. But I don't completely
trust them. Don't know if this code is the best way to find if a user
entered a valid URL and to avoid SQL injection from the URL.

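function checkURL($url, $name)
{
    global $incorrect_input;

    // Prepend a scheme only when one is missing; otherwise parse_url()
    // would see "http://http://..." and report "http" as the host.
    $full = preg_match('#^https?://#i', $url) ? $url : "http://" . $url;
    $data = parse_url($full);
    if (!$data)
        die($incorrect_input[1] . $name);

    // parse_url() omits absent components, so default each one to ''.
    $host     = isset($data['host'])     ? $data['host']     : '';
    $path     = isset($data['path'])     ? $data['path']     : '';
    $query    = isset($data['query'])    ? $data['query']    : '';
    $fragment = isset($data['fragment']) ? $data['fragment'] : '';

    // Host must start with a letter or a digit
    if (!preg_match('/^[A-Za-z0-9]/', $host))
        die($incorrect_input[1] . $name);

    // Host must contain a .
    if (!preg_match('/([A-Za-z0-9]+\.)+/', $host))
        die($incorrect_input[1] . $name);

    // Host must not end with a .
    if (preg_match('/\.$/', $host))
        die($incorrect_input[1] . $name);

    // Each label may contain only letters, digits, - and _
    // (explode() replaces split(), which was removed in PHP 7)
    $array = explode('.', $host);
    $arraysize = count($array);

    for ($i = 0; $i < $arraysize; $i++)
    {
        if (preg_match('/[^A-Za-z0-9\-_]+/', $array[$i]))
            die($incorrect_input[1] . $name);
    }

    // Path: only allow letters, digits, and - . _ /
    if ($path)
    {
        $len = strlen($path);
        for ($i = 0; $i < $len; $i++)
        {
            $ascii = ord($path[$i]);
            if (($ascii < 65 || $ascii > 90) &&     // not A-Z
                ($ascii < 48 || $ascii > 57) &&     // not 0-9
                ($ascii < 97 || $ascii > 122))      // not a-z
            {
                // 45 is -, 46 is ., 95 is _, 47 is /
                if ($ascii != 45 && $ascii != 46 && $ascii != 95 && $ascii != 47)
                    die($incorrect_input[1] . $name);
            }
        }
    }

    // Do not allow more than one consecutive slash in the path
    if (preg_match('/\/{2,}/', $path))
        die($incorrect_input[1] . $name);

    // Query: only letters, digits, and / - _ = &, with no doubled = or &
    if ($query)
    {
        if (preg_match('/[^A-Za-z0-9\/\-_=&]+/', $query))
            die($incorrect_input[1] . $name);
        if (preg_match('/[=&]{2,}/', $query))
            die($incorrect_input[1] . $name);
    }

    // Fragment: only letters, digits, and - _ .
    if ($fragment)
    {
        if (preg_match('/[^A-Za-z0-9\-_.]+/', $fragment))
            die($incorrect_input[1] . $name);
    }

    return($url);
}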
Apr 26 '07 #1
2 Replies


On Apr 26, 11:52 pm, Rick Stem <ricks...@yahoo.com> wrote:
> I have checkURL(http://globalwarmingawareness2007.org.uk,
> globalwarmingawareness2007.org.uk)
>
> I see almost everyone using regular expressions. But I don't completely
> trust them. Don't know if this code is the best way to find if a user
> entered a valid URL and to avoid SQL injection from the URL.

No, it isn't the best way. The code above restricts the URL to a small
subset of valid URLs, and it doesn't prevent SQL injection, which can
arrive in a POST payload as well as in GET parameters.
Architecturally it isn't the right way to think about the problem
either. IMHO it's the easy answer (restrict, restrict, restrict), and
it's no substitute for accepting all valid URLs, even ones carrying
injection attempts, and then filtering the input and output of your
scripts.
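
A minimal sketch of what filtering at the script's input and output can
look like, assuming PDO with the SQLite driver is available. The `links`
table and the helper names `saveLink()` and `renderLink()` are made up
for illustration and are not from the original post. The URL is bound as
a parameter on the way into the database and escaped on the way out to
HTML, so its contents never need to be restricted for SQL's sake.

```php
<?php
// Hypothetical schema: an in-memory table with a single url column.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (url TEXT NOT NULL)');

// Input side: the value is bound, never spliced into the SQL text.
function saveLink(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('INSERT INTO links (url) VALUES (:url)');
    $stmt->execute(array(':url' => $url));
}

// Output side: escape before echoing the value into an HTML page.
function renderLink($url)
{
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">' . $safe . '</a>';
}

// A URL carrying an injection attempt is stored as plain data.
$hostile = "http://example.com/?q=' OR '1'='1";
saveLink($pdo, $hostile);
$stored = $pdo->query('SELECT url FROM links')->fetchColumn();
```

The same binding works whether the value came from GET or POST, which is
why it is usually done where the query is built, not where the URL is
checked.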
That said, restricting requests up front can have its place. Have you
tried using mod_security?
Doing it within PHP means restricting yourself when it comes to
application adjustments, rewrites, and non-ASCII language support.
Besides all this, the approach above doesn't lend itself to easy
adjustment, whereas a simple, more readable block of regexps would do,
once you've made the leap of faith (shown by others to be a worthwhile
leap) into the world of regexps, which you can indeed trust despite
their complexity.
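
As a rough sketch of what such a readable block might look like, here is
a single pattern written with the PCRE `x` modifier so that each part
carries its own comment. The function name `looksLikeHttpUrl()` is
hypothetical, and the pattern is deliberately simple: it assumes
http/https only and ASCII host names, and it is not a full RFC 3986
validator.

```php
<?php
// Hypothetical helper: a loose shape check for http(s) URLs.
function looksLikeHttpUrl($url)
{
    $pattern = '~^
        https?://                                  # scheme
        (?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+     # one or more labels, each followed by a dot
        [a-z]{2,}                                  # top-level domain
        (?::\d{1,5})?                              # optional port
        (?:/[^\s?\#]*)?                            # optional path
        (?:\?[^\s\#]*)?                            # optional query
        (?:\#\S*)?                                 # optional fragment
    \z~ix';
    return preg_match($pattern, $url) === 1;
}
```

Adjusting the rules then means editing one commented line. PHP 5.2 and
later also ship `filter_var($url, FILTER_VALIDATE_URL)`, which may be
enough if a stock check will do.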

Apr 27 '07 #2

"Rick Stem" <ri******@yahoo.com> wrote in message
news:f0*********@news4.newsguy.com...
|I have checkURL(http://globalwarmingawareness2007.org.uk,
| globalwarmingawareness2007.org.uk)
|
| I see almost everyone using regular expressions. But I don't completely
| trust them. Don't know if this code is the best way to find if a user
| entered a valid URL and to avoid SQL injection from the URL.

JESUS CHRIST!!!

'don't trust them'? you mean 'i couldn't write one if it meant i'd get
laid'.

i don't 'trust' the code you've just written! have you completely overlooked
the fact that php has built-in functions that break out a url into the
pieces you're looking for? do you not know that even if it 'looks' valid, it
may point to nowhere?

'don't trust them'...i'm still laughing.
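
For what it's worth, a small sketch of both points: `parse_url()` does
the splitting, and a DNS lookup with `checkdnsrr()` tells you whether
the host resolves at all. The helper names `hostOf()` and
`pointsSomewhere()` are invented here, and the optional `$resolver`
argument is an assumption added so the lookup can be swapped out when
testing without a network.

```php
<?php
// Hypothetical helper: host of an http(s) URL, or null if it cannot be parsed.
function hostOf($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $scheme = strtolower($parts['scheme']);
    return ($scheme === 'http' || $scheme === 'https') ? $parts['host'] : null;
}

// True only if the URL parses and its host has an address record.
// $resolver is an assumed hook; by default the real DNS is queried.
function pointsSomewhere($url, $resolver = null)
{
    $host = hostOf($url);
    if ($host === null) {
        return false;
    }
    if ($resolver === null) {
        return checkdnsrr($host, 'A') || checkdnsrr($host, 'AAAA');
    }
    return (bool) call_user_func($resolver, $host);
}
```

A host that resolves can still serve nothing at that path, so this only
rules out names that do not exist. A host given as a bare IP address has
no DNS record to find, so this sketch would report it as pointing
nowhere.
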
Apr 27 '07 #3
