If I have random and unpredictable user agent strings containing URLs, what is
the best way to extract the URL?
For example, let's say the string looks like this:
registered NYSE 943 <a href="http://netforex.net">Forex Trading Network
Organization</a> in**@netforex.org
What's the best way to extract http://netforex.net ?
I have code that checks for identifiable browsers and bots, but when the agent
string has no identifiable information other than a URL, I want to grab the URL.
Here's a first crack at it:
..
..
..
[code omitted]
..
..
..
elseif (stripos($agent, "http://") !== false)
{
    // keep everything from the first "http://" onward, then pull out the host
    $agent = stristr($agent, "http://");
    $agent = parse_url($agent);
    $agent = $agent['host'];
    //check for subdomains
    $agent_a = explode(".", $agent);
    $agent_r = array_reverse($agent_a);
    $sub = count($agent_r) - 1;
    // first three characters of the last label, to drop junk trailing the TLD
    $tld3 = substr($agent_r[0], 0, 3);
    $domain = "";
    if (preg_match('/^(com|net|org|edu|biz|gov)$/i', $tld3)) //common tld's
    {
        // rebuild the host left to right, stopping before the TLD
        while ($sub > 0)
        {
            $domain = $domain.$agent_r[$sub].".";
            $sub--;
        }
        $refurl = $domain.$tld3;
    }
    $referrer = "<a href='".$refurl."'>".$refurl."</a>";
}
else
{
    $referrer = "unknown";
}
Are there any PHP functions that will help here? How should I handle
subdomains and international domains?
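
For comparison, here is a minimal sketch of a shorter route, assuming PHP 7.1
or later. It uses preg_match() to cut the URL out of the surrounding junk and
parse_url() to read the host. The function name extract_url_host is
hypothetical, and the character class that ends the URL (whitespace, quotes,
angle brackets) is an assumption about what these agent strings contain.

```php
<?php
// Hypothetical helper: returns the host of the first http(s) URL found in
// $agent, or null when no URL is present.
function extract_url_host(string $agent): ?string
{
    // Assumption: a URL ends at the first whitespace, quote, or angle bracket.
    if (!preg_match('~https?://[^\s"\'<>]+~i', $agent, $m)) {
        return null;
    }
    // parse_url() returns the whole host, subdomains included.
    $host = parse_url($m[0], PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : null;
}

// A string like the example above.
$agent = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a> in**@netforex.org';
echo extract_url_host($agent), "\n";
```

Because parse_url() hands back the complete host, subdomains come through
without rebuilding the name label by label, and no TLD whitelist is needed.
International names are not handled specially here; if the intl extension is
available, idn_to_ascii() may be worth a look for those.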
Thanks in advance.