Stripping HTML from RSS feed

Jason
Guest
 
Posts: n/a
#1: Jul 21 '06
First things first, let me say that I couldn't decide whether to post
this to the PHP ng or to an XML ng. I know from experience that you
guys know what you're talking about, though, and all of my questions
boil down to "how to do this in PHP," so I hope I picked the right one ;-)

For about a year, I've been importing Yahoo News headlines into my site
via their RSS feed. But I would much rather import Google News
headlines because I can make them specific to my location. The problem
is that their RSS feed includes HTML, and the script I use can't figure
it out.

Here's an example tag from their most recent feed:

<description><br><table border=0 width= valign=top cellpadding=2
cellspacing=7><tr><td valign=top><a
href="http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://www.raleighchronicle.com/2006072105.html&cid=1108150860&ei=hUzBRO_JMpe2aInH 4PsB">Truck
Driver Wins $100,000 In Lottery; <b>NC</b> Powerball
Winners</a><br><font size=-1><font color=#6f6f6f>Raleigh
Chronicle,&nbsp;NC&nbsp;-</font> <nobr>2 hours
ago</nobr></font><br><font size=-1>REIDSVILLE -- According to the
<b>NC</b> Lottery Commission, when Dennis Mebane collected his prize
this week from a winning $100,000 instant scratch-off ticket, he
<b>...</b></font><br></table></description>


The XML tag is <description>, but the script I use to parse it
(rss2array, a cookie-cutter script that I downloaded) can't
differentiate between <description> and <table>. Usually, I would call
<description> by using the variable $rss['items'][$i]['description']
(where $i is the index counter), but with this output I would have to
do something like
$rss['items'][$i]['description']['table'><'tr'><'td'><'b><'font'>...

So I guess I really have 2 questions:

1. Is there a way to strip HTML completely out of the tag? Or better
yet, to make rss2array read the HTML as actual HTML instead of XML
tags?

2. If not, is there a better way to read the XML than through the use
of rss2array? I'm a fairly experienced coder, but don't really use XML
often enough to have a good grasp of the logic.

TIA,

Jason

Noodle
Guest
 
Posts: n/a
#2: Jul 22 '06

re: Stripping HTML from RSS feed


See if this works:

1. Use $rss['items'][$i]['description'] to return the contents of the
description node (including html markup) as a string.
2. Use the 'strip_tags' function to remove unwanted HTML from the
returned string.

See http://www.php.net/manual/en/function.strip-tags.php for more info.
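
A minimal sketch of those two steps, assuming rss2array hands back the
description as a string with the markup still in it. The $rss array
below is hand-built sample data, not real rss2array output, and
cleanDescription() is a hypothetical helper name. strip_tags() removes
tags without leaving a gap, so the sketch pads each tag with a space
first and then collapses the extra whitespace.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text.
function cleanDescription(string $html): string
{
    // Pad every tag with a leading space so "Winners</a><br>Raleigh"
    // does not collapse into "WinnersRaleigh" once the tags are gone.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Turn entities such as &nbsp; back into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace (including non-breaking spaces) into
    // single spaces; fall back to the uncollapsed text if the input is
    // not valid UTF-8 and preg_replace() returns null.
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);

    return trim($collapsed ?? $text);
}

// Sample data shaped like the array described in the question (assumed).
$rss = [
    'items' => [
        ['description' => '<b>NC</b> Powerball<br>Winners'],
        ['description' => 'Raleigh Chronicle,&nbsp;NC'],
    ],
];

for ($i = 0; $i < count($rss['items']); $i++) {
    echo cleanDescription($rss['items'][$i]['description']), "\n";
}
```

This only helps if the description actually arrives as a non-empty
string; if the parser drops it, the feed has to be read another way.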



Jason
Guest
 
Posts: n/a
#3: Jul 22 '06

re: Stripping HTML from RSS feed


I'm afraid that didn't work. It reads the variable as empty.

As a test, I tried print_r($rss), and while it reads everything else
correctly, it shows ['description'] as an empty variable. The only ones
that it reads correctly are the ones that don't have HTML.

Knowing this, it's got to be a "problem" with rss2array. I put
"problem" in quotes, because technically I think it's the XML file
that's flawed, but there's not much I can do about that. What other way
can I parse an XML database using PHP?

- Jason


Noodle
Guest
 
Posts: n/a
#4: Jul 22 '06

re: Stripping HTML from RSS feed


There are two things I can suggest:

1. Use the DOM XML functions to extract the values (see
http://au3.php.net/domxml), then use the strip_tags() function,
or
2. Use regular expressions to remove the HTML tags from the XML before
you pass it to rss2array,

e.g.

// Remove all <p> tags (opening and closing), keeping their contents
$xml = preg_replace('/<\/?p\b[^>]*>/i', '', $xml);

// Remove all <font> tags, including nested ones
$xml = preg_replace('/<\/?font\b[^>]*>/i', '', $xml);

//etc...
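
For the first option, here is a rough sketch. Note that it uses PHP 5's
built-in DOM extension (DOMDocument) rather than the older DOM XML
functions linked above; that swap is my assumption about what is
installed. It also assumes the feed is well-formed XML, with the markup
inside <description> entity-escaped or wrapped in CDATA. The sample feed
and the readItems() name are made up for illustration.

```php
<?php
// Hypothetical helper: return the title and a plain-text description
// for every <item> in an RSS document.
function readItems(string $xml): array
{
    // Collect libxml parse errors quietly instead of emitting warnings.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        libxml_clear_errors();
        return []; // ill-formed XML: nothing can be read
    }

    $items = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        $row = [];
        foreach (['title', 'description'] as $name) {
            $node = $item->getElementsByTagName($name)->item(0);
            // textContent is the unescaped character data, so any HTML
            // in it is plain text that strip_tags() can clean up.
            $row[$name] = $node ? trim(strip_tags($node->textContent)) : '';
        }
        $items[] = $row;
    }
    return $items;
}

// Made-up sample feed with entity-escaped HTML in <description>.
$feed = <<<'XML'
<rss version="2.0"><channel>
<item>
<title>Truck Driver Wins $100,000 In Lottery</title>
<description>&lt;br&gt;&lt;b&gt;NC&lt;/b&gt; Lottery Commission</description>
</item>
</channel></rss>
XML;

print_r(readItems($feed));
```

If loadXML() fails on the real feed, the XML itself is ill-formed and
the regular-expression cleanup in option 2 would have to run first.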

