Developing an RSS scraper, but getting lost in all the regexps

Hello,

I'm trying to scrape daily titles from http://www.doopes.com/?cat=35444&lan...xc=&inc=&opt=0
But I'm getting lost using preg_match. Can someone help me with this script?

Thanks in advance!

[php]
<?php

$today = date("Y-m-d");

// Get page
$url = "http://www.doopes.com/?cat=35444&lang=1&num=5&mode=0&from=$today&to=$today&exc=&inc=&opt=0";
//$data = implode("", file($url));

$ch = curl_init();
$timeout = 5;
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);

// Get content items
preg_match_all ("/<tbody>([^`]*?)<\/table>/", $data, $matches);

// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Latest Scene Releases</title>
<description>Scene releases of <?php echo $today;?> provided by http://flx-tech.net</description>
<link>http://www.flx-tech.net</link>
<language>en-us</language>

<?php
// Loop through each content item
foreach ($matches[0] as $match) {
// First, get title
preg_match ("/<td>([^`]*?)<\/td/", $match, $temp);
$title = $temp['1'];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp['1'];
$url = trim($url);

// Third, get text
preg_match ("/<p>([^`]*?)<span class=\"byline\">/", $match, $temp);
$text = $temp['1'];
$text = trim($text);

// Fourth, and finally, get author
preg_match ("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
$author = $temp['1'];
$author = trim($author);

// Echo RSS XML
echo "<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>http://www.phpit.net" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[ \n";
echo $text . "\n";
echo " ]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
echo "\t\t</item>\n";
}
?>
</channel>
</rss>
[/php]

Sep 4 '07 #1

pbmods

Changed thread title to better describe the problem.

Heya, FLX.

Have a look at Magpie.
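
If you still want to pull the titles out of the HTML yourself, one regex-free option is PHP's built-in DOM extension. The following is only a rough sketch: the sample markup, the XPath expression, and the helper names `extract_items()` and `render_item()` are all assumptions made for illustration, since the real structure of the doopes.com page is not shown in this thread. It loads the markup with `DOMDocument`, selects the links inside table cells with `DOMXPath`, and escapes each value before writing it into an `<item>`.

```php
<?php
// Hypothetical sample markup; the real page layout is an assumption.
$html = <<<HTML
<table>
  <tbody>
    <tr><td><a href="/?rel=1">Some.Release-GRP</a></td></tr>
    <tr><td><a href="/?rel=2">Other &amp; Release-GRP</a></td></tr>
  </tbody>
</table>
HTML;

// Hypothetical helper: returns one ['title' => ..., 'url' => ...] entry
// for every link found inside a table cell.
function extract_items(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $items = [];
    // Assumed selector: adjust it to match the real page.
    foreach ($xpath->query('//table//td/a[@href]') as $link) {
        $items[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
        ];
    }
    return $items;
}

// Hypothetical helper: builds one RSS <item>, escaping text for XML.
function render_item(array $item): string
{
    $flags = ENT_XML1 | ENT_QUOTES;
    return "<item>\n"
        . "\t<title>" . htmlspecialchars($item['title'], $flags, 'UTF-8') . "</title>\n"
        . "\t<link>" . htmlspecialchars($item['url'], $flags, 'UTF-8') . "</link>\n"
        . "</item>\n";
}

$items = extract_items($html);
foreach ($items as $item) {
    echo render_item($item);
}
```

In the original script you would pass the `$data` string returned by `curl_exec()` to `extract_items()` in place of the sample markup. Escaping with `htmlspecialchars()` matters because a title containing `&` or `<` would otherwise produce a feed that is not well-formed XML.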

Sep 4 '07 #2
