Hello,
I want to create my own index of websites based on my own criteria rather than big G's.
For example, I might like to index websites according to what they have in their "author" meta tag (and reject any site that doesn't have one).
I will use this idea as a working example.
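A minimal sketch of the "author" check, assuming PHP's DOM extension is available. The function name `extractAuthor()` is made up for illustration; it is not part of any library.

```php
<?php
// Sketch only: extractAuthor() is a hypothetical helper name.

/**
 * Return the trimmed content of <meta name="author">,
 * or null when the tag is missing or blank.
 */
function extractAuthor(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Compare case-insensitively so name="Author" matches too.
        if (strtolower(trim($meta->getAttribute('name'))) === 'author') {
            $author = trim($meta->getAttribute('content'));
            if ($author !== '') {
                return $author;
            }
        }
    }
    return null;
}
```

A page would then be written to the index only when `extractAuthor()` returns something other than null.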
Here is my suggested methodology. I would like some input on problem areas, how to make it better, etc.
Methodology:
I will start the script off by giving it a URL known to be in the area of interest.
The script will grab the page content and then check whether it meets my criteria.
If the page has an author tag and it is not blank, write the website URL and the author content to a MySQL table row. Then read all the links on that page into an array.
Follow the first link and do the same as above. Follow the first link on that page, and so on.
When it reaches a dead end (a page with no links except already visited URLs), go to the second link in the last array of links and follow that until a dead end.
Now the problem I see is storing all these arrays of links so that the script can return and process them if it hits a dead end.
What do you think?
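One way to avoid keeping a separate array of links per page is a single stack of pending URLs plus a set of visited URLs. The sketch below assumes that approach. `crawlDepthFirst()` and the `$getLinks` callback are hypothetical names, and a made-up link graph stands in for real pages so the example runs without network access.

```php
<?php
// Sketch only: crawlDepthFirst() and $getLinks are hypothetical names.

/**
 * Visit pages depth-first starting at $startUrl.
 * $getLinks takes a URL and returns the array of URLs found on that page.
 * Returns the URLs in the order they were visited.
 */
function crawlDepthFirst(string $startUrl, callable $getLinks, int $maxPages = 1000): array
{
    $stack   = [$startUrl]; // URLs still waiting to be processed
    $visited = [];          // URL => true, for cheap lookups
    $order   = [];

    while ($stack !== [] && count($order) < $maxPages) {
        $url = array_pop($stack);
        if (isset($visited[$url])) {
            continue; // already reached through another page
        }
        $visited[$url] = true;
        $order[] = $url;

        // Push in reverse so the first link on the page is popped next.
        foreach (array_reverse($getLinks($url)) as $link) {
            if (!isset($visited[$link])) {
                $stack[] = $link;
            }
        }
    }
    return $order;
}

// Usage with a made-up link graph standing in for real pages.
$graph = [
    'a' => ['b', 'c'],
    'b' => ['d', 'a'],
    'c' => ['d'],
    'd' => [],
];
$order = crawlDepthFirst('a', function (string $url) use ($graph): array {
    return $graph[$url] ?? [];
});
```

Because a dead end simply means the next pop returns an older pending link, the "go back to the last array" step falls out of the stack for free. The `$maxPages` limit is an arbitrary safety cap.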
Atli, Recognized Expert
Hey.
You could use your MySQL database to store the links.
Personally, I would set up a database that stores a specific page, all its sub-pages, and all the outside links found on that page, each in a different table.
Then you could just crawl through a site until you have reached a dead end, pick a link from the outside-link table and start again.
That way there would be no need to store a massive number of URLs in a PHP array and risk running out of resources.
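A rough sketch of that table layout, using PDO. All table, column and function names here are hypothetical, and an in-memory SQLite database stands in for MySQL purely so the example is self-contained. Links are assumed to be absolute URLs already.

```php
<?php
// Sketch only: table, column and function names are hypothetical.
// SQLite in memory stands in for MySQL; swap the DSN for a real server.

function createSchema(PDO $pdo): void
{
    $pdo->exec('CREATE TABLE pages (url VARCHAR(255) PRIMARY KEY, author VARCHAR(255))');
    $pdo->exec('CREATE TABLE sub_pages (url VARCHAR(255) PRIMARY KEY, visited INTEGER NOT NULL DEFAULT 0)');
    $pdo->exec('CREATE TABLE outside_links (url VARCHAR(255) PRIMARY KEY, visited INTEGER NOT NULL DEFAULT 0)');
}

/** File each link under sub_pages or outside_links, depending on its host. */
function storeLinks(PDO $pdo, string $pageUrl, array $links): void
{
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);
    foreach ($links as $link) {
        $table = (parse_url($link, PHP_URL_HOST) === $pageHost) ? 'sub_pages' : 'outside_links';
        // Check first so each URL is queued only once.
        $check = $pdo->prepare("SELECT COUNT(*) FROM $table WHERE url = ?");
        $check->execute([$link]);
        if ((int) $check->fetchColumn() === 0) {
            $pdo->prepare("INSERT INTO $table (url) VALUES (?)")->execute([$link]);
        }
    }
}

/** Take one unvisited URL from a table and mark it visited; null when none is left. */
function nextUnvisited(PDO $pdo, string $table): ?string
{
    if (!in_array($table, ['sub_pages', 'outside_links'], true)) {
        throw new InvalidArgumentException("Unknown table: $table");
    }
    $url = $pdo->query("SELECT url FROM $table WHERE visited = 0 LIMIT 1")->fetchColumn();
    if ($url === false) {
        return null;
    }
    $pdo->prepare("UPDATE $table SET visited = 1 WHERE url = ?")->execute([$url]);
    return $url;
}

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createSchema($pdo);
storeLinks($pdo, 'http://example.com/', [
    'http://example.com/about',
    'http://example.org/',
    'http://example.com/about', // duplicate, stored once
]);
```

The crawler would keep calling `nextUnvisited($pdo, 'sub_pages')` until it returns null, then take its next starting point from `outside_links`.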
Thanks for your suggestion.
I was also thinking about the method of accessing the pages.
Somewhere I read that cURL is faster than using fopen() and fread() to get the data. Is that true?
Would it be best to use cURL?
I have taken a look at the Snoopy class, but that seems to use cURL only for SSL websites (https). Maybe I did not read the class properly and it is using cURL for everything.
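For comparison, here is a minimal sketch of fetching a page with PHP's own curl extension. It makes no claim about speed. The helper name `fetchPage()`, the agent string and the timeout values are all made up for illustration. Note that the comments on `$curl_path` in the class say Snoopy calls the cURL binary only for SSL content and does not use PHP's built-in cURL functions.

```php
<?php
// Sketch only: fetchPage() is a hypothetical helper; timeouts are arbitrary.
// Requires PHP's curl extension.

/** Return the response body for $url, or null on a transfer error or non-2xx status. */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'MyIndexer/0.1', // made-up agent string
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

The returned string could be passed straight to whatever code checks the page against the indexing criteria.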
Here is the class:
Sorry it is a bit long - but well commented.
Would this be the best to use (if it is using cURL)?
Any input much appreciated.
<?php

/*************************************************

Snoopy - the PHP net client
Author: Monte Ohrt <monte@ispi.net>
Copyright (c): 1999-2000 ispi, all rights reserved
Version: 1.01

* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with this library; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

You may contact the author of Snoopy by e-mail at:
monte@ispi.net

Or, write to:
Monte Ohrt
CTO, ispi
237 S. 70th suite 220
Lincoln, NE 68510

The latest version of Snoopy can be obtained from:
http://snoopy.sourceforge.net/

*************************************************/

class Snoopy
-
{
-
/**** Public variables ****/
-
-
/* user definable vars */
-
-
var $host = "www.php.net"; // host name we are connecting to
-
var $port = 80; // port we are connecting to
-
var $proxy_host = ""; // proxy host to use
-
var $proxy_port = ""; // proxy port to use
-
var $proxy_user = ""; // proxy user to use
-
var $proxy_pass = ""; // proxy password to use
-
-
var $agent = "Snoopy v1.2.3"; // agent we masquerade as
-
var $referer = ""; // referer info to pass
-
var $cookies = array(); // array of cookies to pass
-
// $cookies["username"]="joe";
-
var $rawheaders = array(); // array of raw headers to send
-
// $rawheaders["Content-type"]="text/html";
-
-
var $maxredirs = 35; // http redirection depth maximum. 0 = disallow
-
var $lastredirectaddr = ""; // contains address of last redirected address
-
var $offsiteok = true; // allows redirection off-site
-
var $maxframes = 0; // frame content depth maximum. 0 = disallow
-
var $expandlinks = true; // expand links to fully qualified URLs.
-
// this only applies to fetchlinks()
-
// submitlinks(), and submittext()
-
var $passcookies = true; // pass set cookies back through redirects
-
// NOTE: this currently does not respect
-
// dates, domains or paths.
-
-
var $user = ""; // user for http authentication
-
var $pass = ""; // password for http authentication
-
-
// http accept types
-
var $accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*";
-
-
var $results = ""; // where the content is put
-
-
var $error = ""; // error messages sent here
-
var $response_code = ""; // response code returned from server
-
var $headers = array(); // headers returned from server sent here
-
var $maxlength = 500000; // max return data length (body)
-
var $read_timeout = 0; // timeout on read operations, in seconds
-
// supported only since PHP 4 Beta 4
-
// set to 0 to disallow timeouts
-
var $timed_out = false; // if a read operation timed out
-
var $status = 0; // http request status
-
-
var $temp_dir = "/tmp"; // temporary directory that the webserver
-
// has permission to write to.
-
// under Windows, this should be C:\temp
-
-
var $curl_path = "/usr/local/bin/curl";
-
// Snoopy will use cURL for fetching
-
// SSL content if a full system path to
-
// the cURL binary is supplied here.
-
// set to false if you do not have
-
// cURL installed. See http://curl.haxx.se
-
// for details on installing cURL.
-
// Snoopy does *not* use the cURL
-
// library functions built into php,
-
// as these functions are not stable
-
// as of this Snoopy release.
-
-
/**** Private variables ****/
-
-
var $_maxlinelen = 4096; // max line length (headers)
-
-
var $_httpmethod = "GET"; // default http request method
-
var $_httpversion = "HTTP/1.0"; // default http request version
-
var $_submit_method = "POST"; // default submit method
-
var $_submit_type = "application/x-www-form-urlencoded"; // default submit type
-
var $_mime_boundary = ""; // MIME boundary for multipart/form-data submit type
-
var $_redirectaddr = false; // will be set if page fetched is a redirect
-
var $_redirectdepth = 0; // increments on an http redirect
-
var $_frameurls = array(); // frame src urls
-
var $_framedepth = 0; // increments on frame depth
-
-
var $_isproxy = false; // set if using a proxy server
-
var $_fp_timeout = 30; // timeout for socket connection
-
-
/*======================================================================*\
-
Function: fetch
-
Purpose: fetch the contents of a web page
-
(and possibly other protocols in the
-
future like ftp, nntp, gopher, etc.)
-
Input: $URI the location of the page to fetch
-
Output: $this->results the output text from the fetch
-
\*======================================================================*/
-
-
function fetch($URI)
-
{
-
-
//preg_match("|^([^:]+)://([^:/]+)(:[\d]+)*(.*)|",$URI,$URI_PARTS);
-
$URI_PARTS = parse_url($URI);
-
if (!empty($URI_PARTS["user"]))
-
$this->user = $URI_PARTS["user"];
-
if (!empty($URI_PARTS["pass"]))
-
$this->pass = $URI_PARTS["pass"];
-
if (empty($URI_PARTS["query"]))
-
$URI_PARTS["query"] = '';
-
if (empty($URI_PARTS["path"]))
-
$URI_PARTS["path"] = '';
-
-
switch(strtolower($URI_PARTS["scheme"]))
-
{
-
case "http":
-
$this->host = $URI_PARTS["host"];
-
if(!empty($URI_PARTS["port"]))
-
$this->port = $URI_PARTS["port"];
-
if($this->_connect($fp))
-
{
-
if($this->_isproxy)
-
{
-
// using proxy, send entire URI
-
$this->_httprequest($URI,$fp,$URI,$this->_httpmethod);
-
}
-
else
-
{
-
$path = $URI_PARTS["path"].($URI_PARTS["query"] ? "?".$URI_PARTS["query"] : "");
-
// no proxy, send only the path
-
$this->_httprequest($path, $fp, $URI, $this->_httpmethod);
-
}
-
-
$this->_disconnect($fp);
-
-
if($this->_redirectaddr)
-
{
-
/* url was redirected, check if we've hit the max depth */
-
if($this->maxredirs > $this->_redirectdepth)
-
{
-
// only follow redirect if it's on this site, or offsiteok is true
-
if(preg_match("|^http://".preg_quote($this->host)."|i",$this->_redirectaddr) || $this->offsiteok)
-
{
-
/* follow the redirect */
-
$this->_redirectdepth++;
-
$this->lastredirectaddr=$this->_redirectaddr;
-
$this->fetch($this->_redirectaddr);
-
}
-
}
-
}
-
-
if($this->_framedepth < $this->maxframes && count($this->_frameurls) > 0)
-
{
-
$frameurls = $this->_frameurls;
-
$this->_frameurls = array();
-
-
foreach($frameurls as $frameurl) // each() was removed in PHP 8
-
{
-
if($this->_framedepth < $this->maxframes)
-
{
-
$this->fetch($frameurl);
-
$this->_framedepth++;
-
}
-
else
-
break;
-
}
-
}
-
}
-
else
-
{
-
return false;
-
}
-
return true;
-
break;
-
case "https":
-
if(!$this->curl_path)
-
return false;
-
if(function_exists("is_executable"))
-
if (!is_executable($this->curl_path))
-
return false;
-
$this->host = $URI_PARTS["host"];
-
if(!empty($URI_PARTS["port"]))
-
$this->port = $URI_PARTS["port"];
-
if($this->_isproxy)
-
{
-
// using proxy, send entire URI
-
$this->_httpsrequest($URI,$URI,$this->_httpmethod);
-
}
-
else
-
{
-
$path = $URI_PARTS["path"].($URI_PARTS["query"] ? "?".$URI_PARTS["query"] : "");
-
// no proxy, send only the path
-
$this->_httpsrequest($path, $URI, $this->_httpmethod);
-
}
-
-
if($this->_redirectaddr)
-
{
-
/* url was redirected, check if we've hit the max depth */
-
if($this->maxredirs > $this->_redirectdepth)
-
{
-
// only follow redirect if it's on this site, or offsiteok is true
-
if(preg_match("|^http://".preg_quote($this->host)."|i",$this->_redirectaddr) || $this->offsiteok)
-
{
-
/* follow the redirect */
-
$this->_redirectdepth++;
-
$this->lastredirectaddr=$this->_redirectaddr;
-
$this->fetch($this->_redirectaddr);
-
}
-
}
-
}
-
-
if($this->_framedepth < $this->maxframes && count($this->_frameurls) > 0)
-
{
-
$frameurls = $this->_frameurls;
-
$this->_frameurls = array();
-
-
foreach($frameurls as $frameurl) // each() was removed in PHP 8
-
{
-
if($this->_framedepth < $this->maxframes)
-
{
-
$this->fetch($frameurl);
-
$this->_framedepth++;
-
}
-
else
-
break;
-
}
-
}
-
return true;
-
break;
-
default:
-
// not a valid protocol
-
$this->error = 'Invalid protocol "'.$URI_PARTS["scheme"].'"'."\n"; // "\n" must be double-quoted to be a newline
-
return false;
-
break;
-
}
-
return true;
-
}
-
-
/*======================================================================*\
-
Function: submit
-
Purpose: submit an http form
-
Input: $URI the location to post the data
-
$formvars the formvars to use.
-
format: $formvars["var"] = "val";
-
$formfiles an array of files to submit
-
format: $formfiles["var"] = "/dir/filename.ext";
-
Output: $this->results the text output from the post
-
\*======================================================================*/
-
-
function submit($URI, $formvars="", $formfiles="")
-
{
-
unset($postdata);
-
-
$postdata = $this->_prepare_post_body($formvars, $formfiles);
-
-
$URI_PARTS = parse_url($URI);
-
if (!empty($URI_PARTS["user"]))
-
$this->user = $URI_PARTS["user"];
-
if (!empty($URI_PARTS["pass"]))
-
$this->pass = $URI_PARTS["pass"];
-
if (empty($URI_PARTS["query"]))
-
$URI_PARTS["query"] = '';
-
if (empty($URI_PARTS["path"]))
-
$URI_PARTS["path"] = '';
-
-
switch(strtolower($URI_PARTS["scheme"]))
-
{
-
case "http":
-
$this->host = $URI_PARTS["host"];
-
if(!empty($URI_PARTS["port"]))
-
$this->port = $URI_PARTS["port"];
-
if($this->_connect($fp))
-
{
-
if($this->_isproxy)
-
{
-
// using proxy, send entire URI
-
$this->_httprequest($URI,$fp,$URI,$this->_submit_method,$this->_submit_type,$postdata);
-
}
-
else
-
{
-
$path = $URI_PARTS["path"].($URI_PARTS["query"] ? "?".$URI_PARTS["query"] : "");
-
// no proxy, send only the path
-
$this->_httprequest($path, $fp, $URI, $this->_submit_method, $this->_submit_type, $postdata);
-
}
-
-
$this->_disconnect($fp);
-
-
if($this->_redirectaddr)
-
{
-
/* url was redirected, check if we've hit the max depth */
-
if($this->maxredirs > $this->_redirectdepth)
-
{
-
if(!preg_match("|^".$URI_PARTS["scheme"]."://|", $this->_redirectaddr))
-
$this->_redirectaddr = $this->_expandlinks($this->_redirectaddr,$URI_PARTS["scheme"]."://".$URI_PARTS["host"]);
-
-
// only follow redirect if it's on this site, or offsiteok is true
-
if(preg_match("|^http://".preg_quote($this->host)."|i",$this->_redirectaddr) || $this->offsiteok)
-
{
-
/* follow the redirect */
-
$this->_redirectdepth++;
-
$this->lastredirectaddr=$this->_redirectaddr;
-
if( strpos( $this->_redirectaddr, "?" ) > 0 )
-
$this->fetch($this->_redirectaddr); // the redirect has changed the request method from post to get
-
else
-
$this->submit($this->_redirectaddr,$formvars, $formfiles);
-
}
-
}
-
}
-
-
if($this->_framedepth < $this->maxframes && count($this->_frameurls) > 0)
-
{
-
$frameurls = $this->_frameurls;
-
$this->_frameurls = array();
-
-
foreach($frameurls as $frameurl) // each() was removed in PHP 8
-
{
-
if($this->_framedepth < $this->maxframes)
-
{
-
$this->fetch($frameurl);
-
$this->_framedepth++;
-
}
-
else
-
break;
-
}
-
}
-
-
}
-
else
-
{
-
return false;
-
}
-
return true;
-
break;
-
case "https":
-
if(!$this->curl_path)
-
return false;
-
if(function_exists("is_executable"))
-
if (!is_executable($this->curl_path))
-
return false;
-
$this->host = $URI_PARTS["host"];
-
if(!empty($URI_PARTS["port"]))
-
$this->port = $URI_PARTS["port"];
-
if($this->_isproxy)
-
{
-
// using proxy, send entire URI
-
$this->_httpsrequest($URI, $URI, $this->_submit_method, $this->_submit_type, $postdata);
-
}
-
else
-
{
-
$path = $URI_PARTS["path"].($URI_PARTS["query"] ? "?".$URI_PARTS["query"] : "");
-
// no proxy, send only the path
-
$this->_httpsrequest($path, $URI, $this->_submit_method, $this->_submit_type, $postdata);
-
}
-
-
if($this->_redirectaddr)
-
{
-
/* url was redirected, check if we've hit the max depth */
-
if($this->maxredirs > $this->_redirectdepth)
-
{
-
if(!preg_match("|^".$URI_PARTS["scheme"]."://|", $this->_redirectaddr))
-
$this->_redirectaddr = $this->_expandlinks($this->_redirectaddr,$URI_PARTS["scheme"]."://".$URI_PARTS["host"]);
-
-
// only follow redirect if it's on this site, or offsiteok is true
-
if(preg_match("|^http://".preg_quote($this->host)."|i",$this->_redirectaddr) || $this->offsiteok)
-
{
-
/* follow the redirect */
-
$this->_redirectdepth++;
-
$this->lastredirectaddr=$this->_redirectaddr;
-
if( strpos( $this->_redirectaddr, "?" ) > 0 )
-
$this->fetch($this->_redirectaddr); // the redirect has changed the request method from post to get
-
else
-
$this->submit($this->_redirectaddr,$formvars, $formfiles);
-
}
-
}
-
}
-
-
if($this->_framedepth < $this->maxframes && count($this->_frameurls) > 0)
-
{
-
$frameurls = $this->_frameurls;
-
$this->_frameurls = array();
-
-
foreach($frameurls as $frameurl) // each() was removed in PHP 8
-
{
-
if($this->_framedepth < $this->maxframes)
-
{
-
$this->fetch($frameurl);
-
$this->_framedepth++;
-
}
-
else
-
break;
-
}
-
}
-
return true;
-
break;
-
-
default:
-
// not a valid protocol
-
$this->error = 'Invalid protocol "'.$URI_PARTS["scheme"].'"'."\n"; // "\n" must be double-quoted to be a newline
-
return false;
-
break;
-
}
-
return true;
-
}
-
-
/*======================================================================*\
-
Function: fetchlinks
-
Purpose: fetch the links from a web page
-
Input: $URI where you are fetching from
-
Output: $this->results an array of the URLs
-
\*======================================================================*/
-
-
function fetchlinks($URI)
-
{
-
if ($this->fetch($URI))
-
{
-
if($this->lastredirectaddr)
-
$URI = $this->lastredirectaddr;
-
if(is_array($this->results))
-
{
-
for($x=0;$x<count($this->results);$x++)
-
$this->results[$x] = $this->_striplinks($this->results[$x]);
-
}
-
else
-
$this->results = $this->_striplinks($this->results);
-
-
if($this->expandlinks)
-
$this->results = $this->_expandlinks($this->results, $URI);
-
return true;
-
}
-
else
-
return false;
-
}
-
-
/*======================================================================*\
-
Function: fetchform
-
Purpose: fetch the form elements from a web page
-
Input: $URI where you are fetching from
-
Output: $this->results the resulting html form
-
\*======================================================================*/
-
-
function fetchform($URI)
-
{
-
-
if ($this->fetch($URI))
-
{
-
-
if(is_array($this->results))
-
{
-
for($x=0;$x<count($this->results);$x++)
-
$this->results[$x] = $this->_stripform($this->results[$x]);
-
}
-
else
-
$this->results = $this->_stripform($this->results);
-
-
return true;
-
}
-
else
-
return false;
-
}
-
-
-
/*======================================================================*\
-
Function: fetchtext
-
Purpose: fetch the text from a web page, stripping the links
-
Input: $URI where you are fetching from
-
Output: $this->results the text from the web page
-
\*======================================================================*/
-
-
function fetchtext($URI)
-
{
-
if($this->fetch($URI))
-
{
-
if(is_array($this->results))
-
{
-
for($x=0;$x<count($this->results);$x++)
-
$this->results[$x] = $this->_striptext($this->results[$x]);
-
}
-
else
-
$this->results = $this->_striptext($this->results);
-
return true;
-
}
-
else
-
return false;
-
}
-
-
/*======================================================================*\
-
Function: submitlinks
-
Purpose: grab links from a form submission
-
Input: $URI where you are submitting from
-
Output: $this->results an array of the links from the post
-
\*======================================================================*/
-
-
function submitlinks($URI, $formvars="", $formfiles="")
-
{
-
if($this->submit($URI,$formvars, $formfiles))
-
{
-
if($this->lastredirectaddr)
-
$URI = $this->lastredirectaddr;
-
if(is_array($this->results))
-
{
-
for($x=0;$x<count($this->results);$x++)
-
{
-
$this->results[$x] = $this->_striplinks($this->results[$x]);
-
if($this->expandlinks)
-
$this->results[$x] = $this->_expandlinks($this->results[$x],$URI);
-
}
-
}
-
else
-
{
-
$this->results = $this->_striplinks($this->results);
-
if($this->expandlinks)
-
$this->results = $this->_expandlinks($this->results,$URI);
-
}
-
return true;
-
}
-
else
-
return false;
-
}
-
-
/*======================================================================*\
-
Function: submittext
-
Purpose: grab text from a form submission
-
Input: $URI where you are submitting from
-
Output: $this->results the text from the web page
-
\*======================================================================*/
-
-
function submittext($URI, $formvars = "", $formfiles = "")
-
{
-
if($this->submit($URI,$formvars, $formfiles))
-
{
-
if($this->lastredirectaddr)
-
$URI = $this->lastredirectaddr;
-
if(is_array($this->results))
-
{
-
for($x=0;$x<count($this->results);$x++)
-
{
-
$this->results[$x] = $this->_striptext($this->results[$x]);
-
if($this->expandlinks)
-
$this->results[$x] = $this->_expandlinks($this->results[$x],$URI);
-
}
-
}
-
else
-
{
-
$this->results = $this->_striptext($this->results);
-
if($this->expandlinks)
-
$this->results = $this->_expandlinks($this->results,$URI);
-
}
-
return true;
-
}
-
else
-
return false;
-
}
-
-
-
-
/*======================================================================*\
-
Function: set_submit_multipart
-
Purpose: Set the form submission content type to
-
multipart/form-data
-
\*======================================================================*/
-
function set_submit_multipart()
-
{
-
$this->_submit_type = "multipart/form-data";
-
}
-
-
-
/*======================================================================*\
-
Function: set_submit_normal
-
Purpose: Set the form submission content type to
-
application/x-www-form-urlencoded
-
\*======================================================================*/
-
function set_submit_normal()
-
{
-
$this->_submit_type = "application/x-www-form-urlencoded";
-
}
-
-
-
-
-
/*======================================================================*\
-
Private functions
-
\*======================================================================*/
-
-
-
/*======================================================================*\
-
Function: _striplinks
-
Purpose: strip the hyperlinks from an html document
-
Input: $document document to strip.
-
Output: $match an array of the links
-
\*======================================================================*/
-
-
function _striplinks($document)
-
{
-
preg_match_all("'<\s*a\s.*?href\s*=\s* # find <a href=
-
([\"\'])? # find single or double quote
-
(?(1) (.*?)\\1 | ([^\s\>]+)) # if quote found, match up to next matching
-
# quote, otherwise match up to next space
-
'isx",$document,$links);
-
-
-
// catenate the non-empty matches from the conditional subpattern
-
-
foreach($links[2] as $key => $val) // each() was removed in PHP 8
-
{
-
if(!empty($val))
-
$match[] = $val;
-
}
-
-
foreach($links[3] as $key => $val) // each() was removed in PHP 8
-
{
-
if(!empty($val))
-
$match[] = $val;
-
}
-
-
// return the links
-
return isset($match) ? $match : array(); // empty array when the page has no links
-
}
-
-
/*======================================================================*\
-
Function: _stripform
-
Purpose: strip the form elements from an html document
-
Input: $document document to strip.
-
Output: $match the form elements as one string
-
\*======================================================================*/
-
-
function _stripform($document)
-
{
-
preg_match_all("'<\/?(FORM|INPUT|SELECT|TEXTAREA|(OPTION))[^<>]*>(?(2)(.*(?=<\/?(option|select)[^<>]*>[\r\n]*)|(?=[\r\n]*))|(?=[\r\n]*))'Usi",$document,$elements);
-
-
// catenate the matches
-
$match = implode("\r\n",$elements[0]);
-
-
// return the form elements
-
return $match;
-
}
-
-
-
-
/*======================================================================*\
-
Function: _striptext
-
Purpose: strip the text from an html document
-
Input: $document document to strip.
-
Output: $text the resulting text
-
\*======================================================================*/
-
-
function _striptext($document)
-
{
-
-
// I didn't use preg eval (//e) since that is only available in PHP 4.0.
-
// so, list your entities one by one here. I included some of the
-
// more common ones.
-
-
$search = array("'<script[^>]*?>.*?</script>'si", // strip out javascript
-
"'<[\/\!]*?[^<>]*?>'si", // strip out html tags
-
"'([\r\n])[\s]+'", // strip out white space
-
"'&(quot|#34|#034|#x22);'i", // replace html entities
-
"'&(amp|#38|#038|#x26);'i", // added hexadecimal values
-
"'&(lt|#60|#060|#x3c);'i",
-
"'&(gt|#62|#062|#x3e);'i",
-
"'&(nbsp|#160|#xa0);'i",
-
"'&(iexcl|#161);'i",
-
"'&(cent|#162);'i",
-
"'&(pound|#163);'i",
-
"'&(copy|#169);'i",
-
"'&(reg|#174);'i",
-
"'&(deg|#176);'i",
-
"'&(#39|#039|#x27);'",
-
"'&(euro|#8364);'i", // europe
-
"'&a(uml|UML);'", // german
-
"'&o(uml|UML);'",
-
"'&u(uml|UML);'",
-
"'&A(uml|UML);'",
-
"'&O(uml|UML);'",
-
"'&U(uml|UML);'",
-
"'&szlig;'i",
-
);
-
$replace = array( "",
-
"",
-
"\\1",
-
"\"",
-
"&",
-
"<",
-
">",
-
" ",
-
chr(161),
-
chr(162),
-
chr(163),
-
chr(169),
-
chr(174),
-
chr(176),
-
chr(39),
-
chr(128),
-
"ä",
-
"ö",
-
"ü",
-
"Ä",
-
"Ö",
-
"Ü",
-
"ß",
-
);
-
-
$text = preg_replace($search,$replace,$document);
-
-
return $text;
-
}
-
-
/*======================================================================*\
-
Function: _expandlinks
-
Purpose: expand each link into a fully qualified URL
-
Input: $links the links to qualify
-
$URI the full URI to get the base from
-
Output: $expandedLinks the expanded links
-
\*======================================================================*/
-
-
function _expandlinks($links,$URI)
-
{
-
-
preg_match("/^[^\?]+/",$URI,$match);
-
-
$match = preg_replace("|/[^\/\.]+\.[^\/\.]+$|","",$match[0]);
-
$match = preg_replace("|/$|","",$match);
-
$match_part = parse_url($match);
-
$match_root =
-
$match_part["scheme"]."://".$match_part["host"];
-
-
$search = array( "|^http://".preg_quote($this->host)."|i",
-
"|^(\/)|i",
-
"|^(?!http://)(?!mailto:)|i",
-
"|/\./|",
-
"|/[^\/]+/\.\./|"
-
);
-
-
$replace = array( "",
-
$match_root."/",
-
$match."/",
-
"/",
-
"/"
-
);
-
-
$expandedLinks = preg_replace($search,$replace,$links);
-
-
return $expandedLinks;
-
}
-
-
/*======================================================================*\
-
Function: _httprequest
-
Purpose: go get the http data from the server
-
Input: $url the url to fetch
-
$fp the current open file pointer
-
$URI the full URI
-
$body body contents to send if any (POST)
-
Output:
-
\*======================================================================*/
-
-
function _httprequest($url,$fp,$URI,$http_method,$content_type="",$body="")
-
{
-
$cookie_headers = '';
-
if($this->passcookies && $this->_redirectaddr)
-
$this->setcookies();
-
-
$URI_PARTS = parse_url($URI);
-
if(empty($url))
-
$url = "/";
-
$headers = $http_method." ".$url." ".$this->_httpversion."\r\n";
-
if(!empty($this->agent))
-
$headers .= "User-Agent: ".$this->agent."\r\n";
-
if(!empty($this->host) && !isset($this->rawheaders['Host'])) {
-
$headers .= "Host: ".$this->host;
-
if(!empty($this->port))
-
$headers .= ":".$this->port;
-
$headers .= "\r\n";
-
}
-
if(!empty($this->accept))
-
$headers .= "Accept: ".$this->accept."\r\n";
-
if(!empty($this->referer))
-
$headers .= "Referer: ".$this->referer."\r\n";
-
if(!empty($this->cookies))
-
{
-
if(!is_array($this->cookies))
-
$this->cookies = (array)$this->cookies;
-
-
reset($this->cookies);
-
if ( count($this->cookies) > 0 ) {
-
$cookie_headers .= 'Cookie: ';
-
foreach ( $this->cookies as $cookieKey => $cookieVal ) {
-
$cookie_headers .= $cookieKey."=".urlencode($cookieVal)."; ";
-
}
-
$headers .= substr($cookie_headers,0,-2) . "\r\n";
-
}
-
}
-
if(!empty($this->rawheaders))
-
{
-
if(!is_array($this->rawheaders))
-
$this->rawheaders = (array)$this->rawheaders;
-
foreach($this->rawheaders as $headerKey => $headerVal) // each() was removed in PHP 8
-
$headers .= $headerKey.": ".$headerVal."\r\n";
-
}
-
if(!empty($content_type)) {
-
$headers .= "Content-type: $content_type";
-
if ($content_type == "multipart/form-data")
-
$headers .= "; boundary=".$this->_mime_boundary;
-
$headers .= "\r\n";
-
}
-
if(!empty($body))
-
$headers .= "Content-length: ".strlen($body)."\r\n";
-
if(!empty($this->user) || !empty($this->pass))
-
$headers .= "Authorization: Basic ".base64_encode($this->user.":".$this->pass)."\r\n";
-
-
//add proxy auth headers
-
if(!empty($this->proxy_user))
-
$headers .= 'Proxy-Authorization: ' . 'Basic ' . base64_encode($this->proxy_user . ':' . $this->proxy_pass)."\r\n";
-
-
-
$headers .= "\r\n";
-
-
// set the read timeout if needed
-
if ($this->read_timeout > 0)
-
socket_set_timeout($fp, $this->read_timeout);
-
$this->timed_out = false;
-
-
fwrite($fp,$headers.$body,strlen($headers.$body));
-
-
$this->_redirectaddr = false;
-
unset($this->headers);
-
-
while($currentHeader = fgets($fp,$this->_maxlinelen))
-
{
-
if ($this->read_timeout > 0 && $this->_check_timeout($fp))
-
{
-
$this->status=-100;
-
return false;
-
}
-
-
if($currentHeader == "\r\n")
-
break;
-
-
// if a header begins with Location: or URI:, set the redirect
-
if(preg_match("/^(Location:|URI:)/i",$currentHeader))
-
{
-
// get URL portion of the redirect
-
preg_match("/^(Location:|URI:)[ ]+(.*)/i",chop($currentHeader),$matches);
-
// look for :// in the Location header to see if hostname is included
-
if(!preg_match("|\:\/\/|",$matches[2]))
-
{
-
// no host in the path, so prepend
-
$this->_redirectaddr = $URI_PARTS["scheme"]."://".$this->host.":".$this->port;
-
// eliminate double slash
-
if(!preg_match("|^/|",$matches[2]))
-
$this->_redirectaddr .= "/".$matches[2];
-
else
-
$this->_redirectaddr .= $matches[2];
-
}
-
else
-
$this->_redirectaddr = $matches[2];
-
}
-
-
if(preg_match("|^HTTP/|",$currentHeader))
-
{
-
if(preg_match("|^HTTP/[^\s]*\s(.*?)\s|",$currentHeader, $status))
-
{
-
$this->status= $status[1];
-
}
-
$this->response_code = $currentHeader;
-
}
-
-
$this->headers[] = $currentHeader;
-
}
-
-
$results = '';
-
do {
-
$_data = fread($fp, $this->maxlength);
-
if (strlen($_data) == 0) {
-
break;
-
}
-
$results .= $_data;
-
} while(true);
-
-
if ($this->read_timeout > 0 && $this->_check_timeout($fp))
-
{
-
$this->status=-100;
-
return false;
-
}
-
-
// check if there is a redirect meta tag
-
-
if(preg_match("'<meta[\s]*http-equiv[^>]*?content[\s]*=[\s]*[\"\']?\d+;[\s]*URL[\s]*=[\s]*([^\"\']*?)[\"\']?>'i",$results,$match))
-
-
{
-
$this->_redirectaddr = $this->_expandlinks($match[1],$URI);
-
}
-
-
// have we hit our frame depth and is there frame src to fetch?
-
if(($this->_framedepth < $this->maxframes) && preg_match_all("'<frame\s+.*src[\s]*=[\'\"]?([^\'\"\>]+)'i",$results,$match))
-
{
-
$this->results[] = $results;
-
for($x=0; $x<count($match[1]); $x++)
-
$this->_frameurls[] = $this->_expandlinks($match[1][$x],$URI_PARTS["scheme"]."://".$this->host);
-
}
-
// have we already fetched framed content?
-
elseif(is_array($this->results))
-
$this->results[] = $results;
-
// no framed content
-
else
-
$this->results = $results;
-
-
return true;
-
}
-
-
/*======================================================================*\
-
Function: _httpsrequest
-
Purpose: go get the https data from the server using curl
-
Input: $url the url to fetch
-
$URI the full URI
-
$body body contents to send if any (POST)
-
Output:
-
\*======================================================================*/
-
-
function _httpsrequest($url,$URI,$http_method,$content_type="",$body="")
-
{
-
if($this->passcookies && $this->_redirectaddr)
-
$this->setcookies();
-
-
$headers = array();
-
-
$URI_PARTS = parse_url($URI);
-
if(empty($url))
-
$url = "/";
-
// GET ... header not needed for curl
-
//$headers[] = $http_method." ".$url." ".$this->_httpversion;
-
if(!empty($this->agent))
-
$headers[] = "User-Agent: ".$this->agent;
-
if(!empty($this->host))
-
if(!empty($this->port))
-
$headers[] = "Host: ".$this->host.":".$this->port;
-
else
-
$headers[] = "Host: ".$this->host;
-
if(!empty($this->accept))
-
$headers[] = "Accept: ".$this->accept;
-
if(!empty($this->referer))
-
$headers[] = "Referer: ".$this->referer;
-
if(!empty($this->cookies))
-
{
-
if(!is_array($this->cookies))
-
$this->cookies = (array)$this->cookies;
-
-
reset($this->cookies);
-
if ( count($this->cookies) > 0 ) {
-
$cookie_str = 'Cookie: ';
-
foreach ( $this->cookies as $cookieKey => $cookieVal ) {
-
$cookie_str .= $cookieKey."=".urlencode($cookieVal)."; ";
-
}
-
$headers[] = substr($cookie_str,0,-2);
-
}
-
}
-
if(!empty($this->rawheaders))
-
{
-
if(!is_array($this->rawheaders))
-
$this->rawheaders = (array)$this->rawheaders;
-
foreach($this->rawheaders as $headerKey => $headerVal) // each() was removed in PHP 8
-
$headers[] = $headerKey.": ".$headerVal;
-
}
-
if(!empty($content_type)) {
-
if ($content_type == "multipart/form-data")
-
$headers[] = "Content-type: $content_type; boundary=".$this->_mime_boundary;
-
else
-
$headers[] = "Content-type: $content_type";
-
}
-
if(!empty($body))
-
$headers[] = "Content-length: ".strlen($body);
-
if(!empty($this->user) || !empty($this->pass))
-
$headers[] = "Authorization: BASIC ".base64_encode($this->user.":".$this->pass);
-
-
$cmdline_params = ""; // initialise before appending to it
for($curr_header = 0; $curr_header < count($headers); $curr_header++) {
-
$safer_header = strtr( $headers[$curr_header], "\"", " " );
-
$cmdline_params .= " -H \"".$safer_header."\"";
-
}
-
-
if(!empty($body))
-
$cmdline_params .= " -d ".escapeshellarg($body); // quote the body for the shell
-
-
if($this->read_timeout > 0)
-
$cmdline_params .= " -m ".$this->read_timeout;
-
-
$headerfile = tempnam($this->temp_dir, "sno"); // $temp_dir is a class property, not a local variable
-
-
// quote the header file and the URI so a crafted URI cannot inject shell commands
exec($this->curl_path." -D ".escapeshellarg($headerfile).$cmdline_params." ".escapeshellarg($URI),$results,$return);
-
-
        if($return)
        {
            // remove the temporary header file before bailing out
            unlink($headerfile);
            $this->error = "Error: cURL could not retrieve the document, error $return.";
            return false;
        }

        $results = implode("\r\n",$results);

        $result_headers = file($headerfile);

        $this->_redirectaddr = false;
        $this->headers = array();

        for($currentHeader = 0; $currentHeader < count($result_headers); $currentHeader++)
        {

            // if a header begins with Location: or URI:, set the redirect
            if(preg_match("/^(Location: |URI: )/i",$result_headers[$currentHeader]))
            {
                // get URL portion of the redirect
                // (the prefix must not include the space, or \s+ can never match a Location header)
                preg_match("/^(Location:|URI:)\s+(.*)/i",chop($result_headers[$currentHeader]),$matches);
                // look for :// in the Location header to see if hostname is included
                if(!preg_match("|\:\/\/|",$matches[2]))
                {
                    // no host in the path, so prepend
                    $this->_redirectaddr = $URI_PARTS["scheme"]."://".$this->host.":".$this->port;
                    // eliminate double slash
                    if(!preg_match("|^/|",$matches[2]))
                        $this->_redirectaddr .= "/".$matches[2];
                    else
                        $this->_redirectaddr .= $matches[2];
                }
                else
                    $this->_redirectaddr = $matches[2];
            }

            if(preg_match("|^HTTP/|",$result_headers[$currentHeader]))
                $this->response_code = $result_headers[$currentHeader];

            $this->headers[] = $result_headers[$currentHeader];
        }

        // check if there is a redirect meta tag

        if(preg_match("'<meta[\s]*http-equiv[^>]*?content[\s]*=[\s]*[\"\']?\d+;[\s]*URL[\s]*=[\s]*([^\"\']*?)[\"\']?>'i",$results,$match))
        {
            $this->_redirectaddr = $this->_expandlinks($match[1],$URI);
        }

        // have we hit our frame depth and is there frame src to fetch?
        if(($this->_framedepth < $this->maxframes) && preg_match_all("'<frame\s+.*src[\s]*=[\'\"]?([^\'\"\>]+)'i",$results,$match))
        {
            $this->results[] = $results;
            for($x=0; $x<count($match[1]); $x++)
                $this->_frameurls[] = $this->_expandlinks($match[1][$x],$URI_PARTS["scheme"]."://".$this->host);
        }
        // have we already fetched framed content?
        elseif(is_array($this->results))
            $this->results[] = $results;
        // no framed content
        else
            $this->results = $results;

        unlink($headerfile);

        return true;
    }

/*======================================================================*\
    Function:   setcookies()
    Purpose:    set cookies for a redirection
\*======================================================================*/

    function setcookies()
    {
        for($x=0; $x<count($this->headers); $x++)
        {
            if(preg_match('/^set-cookie:[\s]+([^=]+)=([^;]+)/i', $this->headers[$x],$match))
                $this->cookies[$match[1]] = urldecode($match[2]);
        }
    }


/*======================================================================*\
    Function:   _check_timeout
    Purpose:    checks whether timeout has occurred
    Input:      $fp file pointer
\*======================================================================*/

    function _check_timeout($fp)
    {
        if ($this->read_timeout > 0) {
            $fp_status = socket_get_status($fp);
            if ($fp_status["timed_out"]) {
                $this->timed_out = true;
                return true;
            }
        }
        return false;
    }

/*======================================================================*\
    Function:   _connect
    Purpose:    make a socket connection
    Input:      $fp file pointer
\*======================================================================*/

    function _connect(&$fp)
    {
        if(!empty($this->proxy_host) && !empty($this->proxy_port))
        {
            $this->_isproxy = true;

            $host = $this->proxy_host;
            $port = $this->proxy_port;
        }
        else
        {
            $host = $this->host;
            $port = $this->port;
        }

        $this->status = 0;

        if($fp = fsockopen(
                    $host,
                    $port,
                    $errno,
                    $errstr,
                    $this->_fp_timeout
                    ))
        {
            // socket connection succeeded

            return true;
        }
        else
        {
            // socket connection failed
            $this->status = $errno;
            // each case needs a break; without it every error fell
            // through to the default message
            switch($errno)
            {
                case -3:
                    $this->error="socket creation failed (-3)";
                    break;
                case -4:
                    $this->error="dns lookup failure (-4)";
                    break;
                case -5:
                    $this->error="connection refused or timed out (-5)";
                    break;
                default:
                    $this->error="connection failed (".$errno.")";
            }
            return false;
        }
    }

/*======================================================================*\
    Function:   _disconnect
    Purpose:    disconnect a socket connection
    Input:      $fp file pointer
\*======================================================================*/

    function _disconnect($fp)
    {
        return(fclose($fp));
    }


/*======================================================================*\
    Function:   _prepare_post_body
    Purpose:    Prepare post body according to encoding type
    Input:      $formvars  - form variables
                $formfiles - form upload files
    Output:     post body
\*======================================================================*/

    function _prepare_post_body($formvars, $formfiles)
    {
        settype($formvars, "array");
        settype($formfiles, "array");
        $postdata = '';

        if (count($formvars) == 0 && count($formfiles) == 0)
            return;

        switch ($this->_submit_type) {
            case "application/x-www-form-urlencoded":
                // foreach replaces each(), which was removed in PHP 8
                foreach ($formvars as $key => $val) {
                    if (is_array($val) || is_object($val)) {
                        foreach ($val as $cur_val) {
                            $postdata .= urlencode($key)."[]=".urlencode($cur_val)."&";
                        }
                    } else
                        $postdata .= urlencode($key)."=".urlencode($val)."&";
                }
                break;

            case "multipart/form-data":
                $this->_mime_boundary = "Snoopy".md5(uniqid(microtime()));

                foreach ($formvars as $key => $val) {
                    if (is_array($val) || is_object($val)) {
                        foreach ($val as $cur_val) {
                            $postdata .= "--".$this->_mime_boundary."\r\n";
                            // {$key}[] emits name="key[]"; the old \[\] left literal backslashes in the name
                            $postdata .= "Content-Disposition: form-data; name=\"{$key}[]\"\r\n\r\n";
                            $postdata .= "$cur_val\r\n";
                        }
                    } else {
                        $postdata .= "--".$this->_mime_boundary."\r\n";
                        $postdata .= "Content-Disposition: form-data; name=\"$key\"\r\n\r\n";
                        $postdata .= "$val\r\n";
                    }
                }

                foreach ($formfiles as $field_name => $file_names) {
                    settype($file_names, "array");
                    foreach ($file_names as $file_name) {
                        if (!is_readable($file_name)) continue;

                        // file_get_contents() also copes with empty files,
                        // where fread($fp, 0) fails on PHP 8
                        $file_content = file_get_contents($file_name);
                        $base_name = basename($file_name);

                        $postdata .= "--".$this->_mime_boundary."\r\n";
                        $postdata .= "Content-Disposition: form-data; name=\"$field_name\"; filename=\"$base_name\"\r\n\r\n";
                        $postdata .= "$file_content\r\n";
                    }
                }
                $postdata .= "--".$this->_mime_boundary."--\r\n";
                break;
        }

        return $postdata;
    }
}

?>
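
On the cURL question: the part of the class pasted above shells out to a curl binary through exec(). If PHP's cURL extension is loaded, the fetch-and-check-author step from the first post could be done in-process instead. The following is only a rough sketch under that assumption. fetch_page() and extract_author_meta() are hypothetical helper names (they are not part of Snoopy), and the user agent string, timeout and redirect limit are made-up values.

```php
<?php
// Sketch only: both helpers below are hypothetical, not part of Snoopy.

/**
 * Fetch a URL with the cURL extension.
 * Returns the response body, or null if the transfer failed.
 */
function fetch_page(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,      // assumed limit
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // assumed name
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/**
 * Return the content of the first non-blank <meta name="author"> tag,
 * or null if the page has none.
 */
function extract_author_meta(string $html): ?string
{
    // collect every <meta ...> tag, then inspect its attributes
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    $attrPattern = '/([a-z][a-z0-9_-]*)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    foreach ($tags[0] as $tag) {
        $attrs = [];
        preg_match_all($attrPattern, $tag, $found, PREG_SET_ORDER);
        foreach ($found as $a) {
            // groups 2, 3, 4 hold double-quoted, single-quoted and bare values
            $attrs[strtolower($a[1])] = $a[4] ?? $a[3] ?? $a[2];
        }
        if (isset($attrs['name'], $attrs['content'])
            && strtolower($attrs['name']) === 'author') {
            $author = trim(html_entity_decode($attrs['content'], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
            if ($author !== '') {
                return $author;
            }
        }
    }
    return null;
}
```

In the crawl loop this would be used roughly as: call fetch_page() on the next URL, skip the page if it returns null, then write the URL and author to the table only when extract_author_meta() returns a non-null value. A regex is used here to avoid extra dependencies; a real HTML parser would be more robust on messy markup.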