Developing an RSS scraper, but getting lost in all the regexps

Hello,

I'm trying to scrape the daily titles from http://www.doopes.com/?cat=35444&lan...xc=&inc=&opt=0
but I'm getting lost using preg_match. Can someone help me with this script?

Thanks in advance!

[php]
<?php

$today = date("Y-m-d");

// Get page
$url = "http://www.doopes.com/?cat=35444&lang=1&num=5&mode=0&from=$today&to=$today&exc=&inc=&opt=0";
//$data = implode("", file($url));

$ch = curl_init();
$timeout = 5;
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);

// Get content items
preg_match_all ("/<tbody>([^`]*?)<\/table>/", $data, $matches);

// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Latest Scene Releases</title>
<description>Scene releases of <?php echo $today;?> provided by http://flx-tech.net</description>
<link>http://www.flx-tech.net</link>
<language>en-us</language>


<?php
// Loop through each content item
foreach ($matches[0] as $match) {
// First, get title
preg_match ("/<td>([^`]*?)<\/td/", $match, $temp);
$title = $temp['1'];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp['1'];
$url = trim($url);

// Third, get text
preg_match ("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
$text = $temp['1'];
$text = trim($text);

// Fourth, and finally, get author
preg_match ("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
$author = $temp['1'];
$author = trim($author);

// Echo RSS XML
echo "<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[ \n";
echo $text . "\n";
echo " ]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
echo "\t\t</item>\n";
}
?>
</channel>
</rss>
[/php]
Sep 4 '07 #1
pbmods
Changed thread title to better describe the problem.

Heya, FLX.

Have a look at Magpie.
Sep 4 '07 #2
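
As far as I know, Magpie is aimed at reading feeds that already exist. If the page itself has to be scraped, one way to avoid the regexps is to let PHP's DOM extension parse the HTML and pick the links out with XPath. The snippet below is only a rough sketch: the sample table markup, the helper names extract_items() and render_item(), and the idea that each title is a link inside a table cell are all assumptions, since the real page structure isn't shown in the thread. The link prefix is passed in as a parameter (the posted script hard-codes http://www.phpit.net there).

```php
<?php
// Assumed stand-in for the downloaded page; the real markup may differ.
$html = <<<HTML
<table><tbody>
<tr><td><a href="/release/1">First.Release-GRP</a></td></tr>
<tr><td><a href="/release/2">Second &amp; Third-GRP</a></td></tr>
</tbody></table>
HTML;

// Hypothetical helper: return title/url pairs for every link in a table cell.
function extract_items(string $html): array
{
    // Scraped HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $items = [];
    foreach ($xpath->query('//table//td//a[@href]') as $a) {
        $items[] = [
            'title' => trim($a->textContent),   // entities are already decoded here
            'url'   => $a->getAttribute('href'),
        ];
    }
    return $items;
}

// Hypothetical helper: build one <item>, escaping text for XML.
function render_item(array $item, string $base): string
{
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };
    return "\t\t<item>\n"
        . "\t\t\t<title>" . $esc($item['title']) . "</title>\n"
        . "\t\t\t<link>" . $esc($base . $item['url']) . "</link>\n"
        . "\t\t</item>\n";
}

// Assumes the hrefs are relative to the site root.
foreach (extract_items($html) as $item) {
    echo render_item($item, 'http://www.doopes.com');
}
```

In the original script, $html would be the $data string returned by curl_exec(), and the foreach would replace the preg_match loop between the channel header and the closing </channel> tag. The XPath expression is the part to adjust once you know which cells actually hold the titles.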