Best way to extract URL from random string?

If I have random and unpredictable user agent strings containing URLs, what is
the best way to extract the URL?

For example, let's say the string looks like this:

registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a> in**@netforex.org

What's the best way to extract http://netforex.net ?

I have code that checks for identifiable browsers and bots, but when the agent
string has no identifiable information other than a URL, I want to grab the URL.

Here's a first crack at it:
..
..
..
[code omitted]
..
..
..
elseif (eregi("http://", $agent))
{
    $agent = stristr($agent, "http://");
    $agent = parse_url($agent);
    $agent = $agent['host'];
    //check for subdomains
    $agent_a = explode(".", $agent);
    $agent_r = array_reverse($agent_a);
    $sub = count($agent_r) - 1;
    $tld3 = substr($agent_r[0], 0, 3);
    if (eregi("^(com|net|org|edu|biz|gov)$", $tld3)) //common tld's
    {
        while ($sub > 0)
        {
            $domain = $domain.$agent_r[$sub].".";
            $sub--;
        }
        $refurl = $domain.$tld3;
    }
    $referrer = "<a href='".$refurl."'>".$refurl."</a>";
}
else
{
    $referrer = "unknown";
}

Are there any PHP functions that will help here? How should I handle subdomains?
What about international domains?

Thanks in advance.

Feb 9 '07 #1
5 Replies


How about:

if (preg_match('/\\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', $subject, $result)) {
    $url = $result[0];
} else {
    $url = "";
}
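
As a quick check, here is a minimal sketch that wraps the pattern above in a function and runs it against the sample string from the question. The function name extract_first_url() is hypothetical (it is not from this thread), and the sample string assumes the anchor tag in the question was well-formed.

```php
<?php
// Hypothetical helper (not from the thread): returns the first URL found
// in $subject, or an empty string when nothing matches.
function extract_first_url(string $subject): string
{
    // Same pattern as in the reply above: scheme, "://", then URL characters,
    // ending on a character that is not trailing punctuation.
    $pattern = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i';
    if (preg_match($pattern, $subject, $result)) {
        return $result[0]; // index 0 holds the whole match
    }
    return '';
}

$agent = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a>';
echo extract_first_url($agent), "\n"; // http://netforex.net
```

The match stops at the closing double quote because that character is not in the allowed set, so the surrounding HTML does not leak into the result.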


Feb 9 '07 #2

On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
> Are there any PHP functions that will help here? How to handle sub domains?
> International domains?
>
> Thanks in advance.

well, you found parse_url
you might want to use regular expressions as well

$long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
{
    $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
    $parts = parse_url($url);
    if ( preg_match('/(.+)\.\w+\.\w+/',$parts['host'],$matches) )
        echo $matches[1]; // something.else
}
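
The subdomain step above can also be done without a second regex. This is only a sketch: split_host() is a hypothetical name, and it makes the same assumption as the pattern above, namely that the registered domain is exactly the last two labels. That assumption is wrong for suffixes such as "co.uk".

```php
<?php
// Hypothetical helper (not from the thread): split a host name into
// [subdomain part, last two labels]. Assumes a two-label registered
// domain, so "www.example.co.uk" would be split incorrectly.
function split_host(string $host): array
{
    $labels = explode('.', $host);
    $domain = implode('.', array_slice($labels, -2));   // e.g. example.com
    $sub    = implode('.', array_slice($labels, 0, -2)); // e.g. something.else
    return [$sub, $domain];
}

$parts = parse_url('http://something.else.example.com/blah/?joe=bob');
[$sub, $domain] = split_host($parts['host']);
echo $sub, ' | ', $domain, "\n"; // something.else | example.com
```

For a host with no subdomain, such as example.com, the first element comes back as an empty string.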

Feb 9 '07 #3

Rik
On Fri, 09 Feb 2007 22:02:18 +0100, BKDotCom <bk***********@yahoo.com> wrote:
> On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
>> Are there any PHP functions that will help here? How to handle sub domains?
>> International domains?
>>
>> Thanks in advance.
>
> well, you found parse_url
> you might want to use regular expressions as well
>
> $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
> if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )

Afaik protocols can only be a-z+, you don't need a capturing group around the
entire pattern (the full match is already in $matches[0]), and the URL should
have at least one character, so slightly optimised it would be:

'|[a-z]+://[^\s"\']+|i'

> {
> $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob

$url = $matches[0];
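
A small sketch of that pattern in use, assuming you want every URL in the string rather than just the first. The function name find_urls() and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): return every substring that
// looks like scheme://something, using the simplified pattern above.
function find_urls(string $text): array
{
    // No capturing group is needed: index 0 of the result holds the full matches.
    preg_match_all('|[a-z]+://[^\s"\']+|i', $text, $matches);
    return $matches[0];
}

$agent = 'see "http://netforex.net" and ftp://example.org/file.txt for details';
print_r(find_urls($agent));
// Array ( [0] => http://netforex.net [1] => ftp://example.org/file.txt )
```

Note that this pattern only stops at whitespace and quotes, so trailing punctuation such as ")" or "," directly after a URL would be kept as part of the match.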

--
Rik Wasmus
Feb 9 '07 #4

"BKDotCom" <bk***********@yahoo.com> wrote in message
news:11*********************@k78g2000cwa.googlegroups.com...
> On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
>> Are there any PHP functions that will help here? How to handle sub domains?
>> International domains?
>>
>> Thanks in advance.
>
> well, you found parse_url
> you might want to use regular expressions as well
>
> $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
> if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
> {
>     $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
>     $parts = parse_url($url);
>     if ( preg_match('/(.+)\.\w+\.\w+/',$parts['host'],$matches) )
>         echo $matches[1]; // something.else
> }

use regex to handle subdomains... I see!

but wouldn't the first few lines of my original code be a more efficient
starting point?
elseif (eregi("http://", $agent))
{
    $agent = stristr($agent, "http://");
    $agent = parse_url($agent);
    // now use preg_match() to return everything beginning with a '.' up to the next word boundary (?)

still testing...
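
One caveat with that starting point: stristr() returns everything from "http://" to the end of the string, and parse_url() is not designed to cope with trailing junk, so the host may come back with extra text attached. A minimal sketch of one way around this, assuming the URL ends at the first whitespace, quote, or angle bracket (host_from_agent() is a hypothetical name):

```php
<?php
// Hypothetical helper (not from the thread): find "http://" without a regex,
// cut the URL at the first delimiter, then let parse_url() extract the host.
function host_from_agent(string $agent): ?string
{
    $start = stripos($agent, 'http://');
    if ($start === false) {
        return null; // no http URL in the string
    }
    $rest = substr($agent, $start);
    // Assumption: a URL ends at the first whitespace, quote, or angle bracket.
    $url = substr($rest, 0, strcspn($rest, " \t\r\n\"'<>"));
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? $host : null; // parse_url may return null or false
}

$agent = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network Organization</a>';
echo host_from_agent($agent), "\n"; // netforex.net
```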

Feb 10 '07 #5

"Rik" <lu************@hotmail.com> wrote in message
news:op.tnh2l8sgqnv3q9@misant...
> On Fri, 09 Feb 2007 22:02:18 +0100, BKDotCom <bk***********@yahoo.com> wrote:
>> On Feb 9, 2:15 pm, "deko" <d...@nospam.com> wrote:
>>> Are there any PHP functions that will help here? How to handle sub domains?
>>> International domains?
>>>
>>> Thanks in advance.
>>
>> well, you found parse_url
>> you might want to use regular expressions as well
>>
>> $long_string = 'A HREF="http://something.else.example.com/blah/?joe=bob"';
>> if ( preg_match('|([^\s"\']*://[^\s"\']*)|',$long_string,$matches) )
>
> Afaik protocols can only be a-z+, you don't have to capture the entire
> match, and the url should have at least one character, so a little
> optimised it would be:
>
> '|[a-z]+://[^\s"\']+|i'
>
>> {
>> $url = $matches[1]; // http://something.else.example.com/blah/?joe=bob
>
> $url = $matches[0];

===========================================

I've been thinking about this... see http://www.liarsscourge.com

I need to decide:

1) what TLDs I will accept
2) what protocols I will accept

so...

1 = common TLDs, including international TLDs
2 = http only

next...

-- assemble array of common/international TLDs
-- construct regex to search for TLDs in this array

developing...
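
The plan above could be sketched roughly as follows. Everything here is an assumption for illustration: the TLD list is deliberately short, the function names are made up, and internationalized (non-ASCII) host names are not handled.

```php
<?php
// Hypothetical, deliberately short list of accepted TLDs/suffixes.
$tlds = ['com', 'net', 'org', 'edu', 'biz', 'gov', 'co.uk', 'de', 'fr', 'jp'];

// Build a pattern that accepts http only and a host ending in a listed TLD.
function build_url_pattern(array $tlds): string
{
    // Escape each entry so the "." in "co.uk" is matched literally.
    $quoted = array_map(fn($t) => preg_quote($t, '~'), $tlds);
    $alternatives = implode('|', $quoted);
    // One or more "label." parts, then an accepted TLD. The lookahead rejects
    // hosts that simply continue, e.g. "example.community" or "example.com.xx".
    return '~http://((?:[a-z0-9-]+\.)+(?:' . $alternatives . '))'
         . '(?![a-z0-9-]|\.[a-z0-9])~i';
}

// Return the lower-cased host of the first acceptable URL, or null.
function extract_host(string $agent, array $tlds): ?string
{
    if (preg_match(build_url_pattern($tlds), $agent, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

echo extract_host('x <a href="http://netforex.net">y</a>', $tlds), "\n"; // netforex.net
```

Keeping the TLDs in an array means the accepted set can be changed without touching the pattern itself.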

Feb 10 '07 #6
