Shuan wrote:
I am trying to grab sites like craigslist, parse with regular
expression and put some content into database.
$request -fetch( $region_link );
if( !$request -error ){
$pageContent = $request -results;
$regionpattern =
"/<a[^>]*href=\"(\/s\/SL\/sg_maY.*)\".*>.*<img.*alt=\"(.*)\".*id=\"btn.*\">/
siU";
if(preg_match_all( $regionpattern, $pageContent, $categorylinks ))
I was almost tempted to say it was a greedyness issue, before I spotted the /U.
Dodged a bullet there :-).
If I interprete you regex correctly, try this rewrite (I tend to use dots very
sparingly, I'm more a fan of negative character classes, in which proper
greediness is more usefull). I'm not really sure it will gain much on the
resources consumption, but we can try:
'|<a[^>]*?href="(/s/SL/sg_maY[^"]*)"[^>]*>.*?<img[^>]*?alt="([^"]*)"[^>]*?id="bt
n[^"]*"[^>]*>|si
I'd suggest a foreach loop also, instead your for loop:
foreach($categorylinks[1] as $link){
$category_link="http://www.mysite.com".$link;
include( "pagecrawler.php" );//I'm still curious what this does....
}
Or if you do use capture 2:
if(preg_match_all( $regionpattern, $pageContent, $categorylinks,
PREG_SET_ORDER)){
foreach($categorylinks as $link){
$category_link="http://www.mysite.com".$link[1];
include( "pagecrawler.php" );//I'm still curious what this does....
}
}
If you still have issues I'd like to see/know the actual site you're leeching
right now :-).(If you're trying to get a page all at once, be sure to unset()
unused/past variables.) I don't know what your actual pagecrawler.php does, but
if it doesn't use capture 2 you might as well not capture it.
Grtz,
--
Rik Wasmus