Converting HTML elements into XML/RSS

mickjames

Hi,

I'd like to include the whole web page content (as opposed to just the
headlines) into RSS/XML to enable people to read them via rss feed
readers.

Question: how to convert HTML elements such as href, img, b, p, etc
into XML?
I've seen someone use the following in their RSS feed but I don't like
it because <pre> doesn't produce a nice format:

<content:encoded><![CDATA[
<PRE>
blah blah blah..

Here is a sample HTML code. What would be the best way to put it into
XML, more specifically, convert those HTML elements.

----------------
<b>CAESAR</b> Et tu, Brute! Then fall,
<a
href=http://www.epilepsiemuseum.de/raum6/caesar.jpg>Caesar</a>.<br>
Dies
<p>
<b>CINNA</b> Liberty! Freedom! Tyranny is dead!
Run hence, proclaim, cry it about the streets.
<a href=http://www.shakespeare-online.com/>Read more</a>.
-----------------

Thanks for all the help!

Mick James

Jul 20 '05 #1

Andy Dingley

On 6 Jan 2005 13:43:19 -0800, mi*******@gmail.com wrote:

I'd like to include the whole web page content (as opposed to just the
headlines) into RSS/XML to enable people to read them via rss feed
readers.

Read this
http://diveintomark.org/archives/200...compatible-rss

Ask again if anything is unclear.

Jul 20 '05 #2

mickjames

Thanks.

So all the HTML needs to be enclosed in <description> and tags need to
be escaped with &lt; and &gt;?

Jul 20 '05 #3

Nick Kew

In article <11**********************@f14g2000cwb.googlegroups .com>,
mi*******@gmail.com writes:

I'd like to include the whole web page content (as opposed to just the
headlines) into RSS/XML to enable people to read them via rss feed
readers.

Uh, that's a lot of content for what users are expecting to be a summary.
Why use a feed if it doesn't save your users anything?

Question: how to convert HTML elements such as href, img, b, p, etc
into XML?

Bearing in mind the above, freely mix it, just using namespaces to
distinguish the elements. Since you're already breaking the purpose
of a feed, working normally with conventional client software presumably
isn't an issue.

Here is a sample HTML code. What would be the best way to put it into

Looks more like tag-soup to me.

--
Nick Kew

Jul 20 '05 #4

mickjames

Thanks for your reply. Yes, I understand that RSS is meant for summary,
not the whole content, but a lot of readers ask for the whole thing.
One imagines, they prefer to read using an rss feed reader instead of
using a web browser.

One question I didn't get the answer to in all my searching is: how to
code HTML tags such as href, img, p, b, etc when converting an HTML
page to .rss page?

Putting everything in CDATA or is there a better way?
A short example would be helpful.

Thanks a lot!

Jul 20 '05 #5
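
For the CDATA route asked about in #5, a minimal sketch in Python. The helper name cdata_description is made up for illustration, and the sample fragment is adapted from #1 with the attribute quoted. The one subtlety is that a CDATA section ends at the first "]]>", so the sketch splits any such sequence across two adjacent sections. The result is equivalent to escaping with &lt; and &gt;: an XML parser hands the reader the same characters either way.

```python
import xml.etree.ElementTree as ET


def cdata_description(html_fragment):
    """Wrap an HTML fragment in a CDATA section inside <description>.

    Any "]]>" inside the fragment is split across two CDATA sections,
    because the first "]]>" would otherwise end the section early.
    """
    safe = html_fragment.replace("]]>", "]]]]><![CDATA[>")
    return "<description><![CDATA[" + safe + "]]></description>"


fragment = '<b>CINNA</b> Liberty! <a href="http://www.shakespeare-online.com/">Read more</a>.'
xml_text = cdata_description(fragment)
print(xml_text)

# Round trip: an XML parser gives back the original HTML source as plain text.
print(ET.fromstring(xml_text).text == fragment)
```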

Andy Dingley

On 6 Jan 2005 15:15:54 -0800, mi*******@gmail.com wrote:

So all the HTML needs to be enclosed in <description> and tags need to
be escaped with &lt; and &gt;?

Yes. Ampersands might also cause problems and should already have been
escaped, but it's common in HTML that they aren't.

You should also "fix" any entity references that are in the HTML,
such as &eacute; or &nbsp;. This needs to be done whether there are
tags involved or not - they're one of the most common intermittent
reasons for an RSS feed to become invalid. Such entities are defined
in HTML, but aren't already defined in XML or RSS.

"Fixing" them can be either replacing the initial ampersand with &amp;
or replacing the "named" form of the entity reference with the
corresponding numeric form. The numeric form is probably best to use,
because that will render correctly even if the consumer doesn't
properly expand the encoded entities.

--
Smert' spamionam

Jul 20 '05 #6
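
A minimal sketch of the numeric-form approach from #6, in Python with only the standard library. The function names are made up for illustration, and this is one way to do it rather than the method anyone in the thread used. It expands HTML-only named entities to characters, escapes &, < and > for XML, and then writes every non-ASCII character as a numeric reference. The five entities XML already defines are left alone so that they get escaped like any other ampersand.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape

# Entities XML already defines; these must stay as written in the HTML.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_html_entities(html_text):
    """Replace HTML-only named entities (&eacute;, &nbsp;, ...) with characters."""
    def to_char(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return chr(name2codepoint[name])
    return NAMED_ENTITY.sub(to_char, html_text)


def encode_for_description(html_text):
    """Return text that is safe to place inside an RSS <description> element."""
    escaped = escape(expand_html_entities(html_text))  # escapes &, < and >
    # Write non-ASCII characters as numeric references such as &#233;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(encode_for_description("<b>Caf&eacute;</b>&nbsp;R&amp;D"))
```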

Andy Dingley

On Fri, 7 Jan 2005 01:25:36 +0000, ni**@hugin.webthing.com (Nick Kew)
wrote:

Why use a feed if it doesn't save your users anything?

Why do you assume the function of my RSS feed? I've built many
feeds that are anything but "newsfeeds". I think my record was 20MB
content size in a <description> element, for a very
application-specific intranet task. However it's still perfectly
compliant RSS 1.0.

Question: how to convert HTML elements such as href, img, b, p, etc
into XML?

Bearing in mind the above, freely mix it, just using namespaces to
distinguish the elements.

You can't use namespacing, because the content is HTML rather than
XHTML. Apart from the standards-based argument and the fact that
namespacing just doesn't make sense for HTML, it's also impractical to
expect the incoming HTML content to be well-formed as an XML fragment
(or even valid HTML!).

Remember that RSS is a _feed_, not a one-off document (I wish Winer
would recognise this). Like all layered protocols you have to be very
careful that your implementations are not only correct for one
demonstration example, they have to be demonstrably correct for all
possible inputs.

Since you're already breaking the purpose of a feed,

Rubbish. RSS does _NOT_ define any notion of "purpose", or what's
"appropriate" to use it for. Besides which, the notion of content
encoding HTML fragments within the <description> element is very well
established.
--
Smert' spamionam

Jul 20 '05 #7

Nick Kew

In article <11*********************@f14g2000cwb.googlegroups. com>,
mi*******@gmail.com writes:

One imagines, they prefer to read using an rss feed reader instead of
using a web browser.

Hmmm. I think it should be the job of the Client to present it
sensibly. An RSS feed is to the Web as a newsgroup or mail folder
listing (from, subject, date) is to Usenet or Email. IMHO.

(you've presumably seen how Opera presents RSS feeds?)

One question I didn't get the answer to in all my searching is: how to
code HTML tags such as href, img, p, b, etc when converting an HTML
page to .rss page?

The core Site Valet tools offer options to present reports as RDF.
Since these are markup analysis tools, the more verbose options
embed the original markup, so all system messages can be properly
referenced to it. This uses a namespace to describe it, and
looks a little like XSLT with things like:
<ml:element name="a">
<ml:attribute name="href">foo</ml:attribute>

Putting everything in CDATA or is there a better way?
A short example would be helpful.

I don't think the above reply is really relevant to your question:
I was solving a different problem! But you already have Andy's reply.

--
Nick Kew

Jul 20 '05 #8

Colin

Hey,

I'd like to include the whole web page content (as opposed to just the
headlines) into RSS/XML to enable people to read them via rss feed
readers.

Question: how to convert HTML elements such as href, img, b, p, etc
into XML?

Why don't you just use software to create the feed that will convert it for you,
so that you don't have to worry about it? There are a couple of options; I know
FeedForAll http://www.feedforall.com has a WYSIWYG editor that will do this.

Best,
Colin

Jul 20 '05 #9

mickjames

WYSIWYG is not an option. I need to do it via script on Linux.

Would someone tell me how the following HTML snippet should be encoded
in an RSS file:

<b>This is a test.</a>
<a href=foo.html>Bar</a>.
<img src=baz.jpg>
<p>

I tried using &lt; etc but RSS readers simply display the
equivalent HTML, rather than rendering it.

Jul 20 '05 #10
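
For the snippet in #10, a minimal sketch of the entity-escaped form, written in Python so it can run from a script on Linux. It assumes the standard library's xml.etree.ElementTree, which escapes &, < and > in element text when serialising; the item title is a placeholder. The snippet is used exactly as posted, stray </a> included, because escaping does not care whether the HTML is well formed.

```python
import xml.etree.ElementTree as ET

# The HTML fragment from #10, unchanged.
html_fragment = (
    "<b>This is a test.</a>\n"
    "<a href=foo.html>Bar</a>.\n"
    "<img src=baz.jpg>\n"
    "<p>"
)

item = ET.Element("item")
ET.SubElement(item, "title").text = "Test item"          # placeholder title
ET.SubElement(item, "description").text = html_fragment  # escaped on output

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The text must be escaped exactly once. Handing already-escaped text to a serialiser that escapes again turns &lt; into &amp;lt;, which is one possible reason for a reader showing the tags as text. Whether a given reader renders the unescaped result as HTML is still up to that reader.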

Henri Sivonen

In article <g1************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:

Here is a sample HTML code. What would be the best way to put it into

Looks more like tag-soup to me.

"Entity-encoded HTML" *is* tag soup transported over XML character data.
To make things worse, RSS provides no way of communicating whether the
characters reported by the XML processor are presentable text or tag
soup source that needs another level of parsing.

To make matters still worse, the problem has propagated from RSS 0.92
and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
processing, even though there is no spec text supporting "entity-encoded
HTML" in titles in any version of RSS or in descriptions in RSS 0.91 and
RSS 1.0.

For example, Sage misrenders the title "Tag Soup: How Mac IE 5 and
Safari handle <x> <y> </x> </y>" in http://www.hut.fi/u/hsivonen/feed.xml

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/

Jul 20 '05 #11
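
The ambiguity described in #11 can be sketched in a few lines of Python. The two titles below are invented for illustration: the first means its angle brackets literally, the second intends them as markup. After XML parsing both are just strings containing angle brackets, and nothing in the parsed result says which reading was intended.

```python
import xml.etree.ElementTree as ET

# Hypothetical titles: one is literal text, one is entity-encoded HTML.
literal_title = "<title>How browsers handle &lt;x&gt; &lt;y&gt;</title>"
markup_title = "<title>Breaking: &lt;b&gt;big&lt;/b&gt; news</title>"

parsed = [ET.fromstring(source).text for source in (literal_title, markup_title)]
for text in parsed:
    print(repr(text))
```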

Andy Dingley

On Sat, 08 Jan 2005 00:22:20 +0200, Henri Sivonen <hs******@iki.fi>
wrote:

To make matters still worse, the problem has propagated from RSS 0.92
and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
processing,

No, RSS 1.0 is clear over this - although the others do have a
problem. The RSS 1.0 spec wasn't written in the sloppy manner of the
others.

Jul 20 '05 #12

Henri Sivonen

In article <15********************************@4ax.com>,
Andy Dingley <di*****@codesmiths.com> wrote:

On Sat, 08 Jan 2005 00:22:20 +0200, Henri Sivonen <hs******@iki.fi>
wrote:
To make matters still worse, the problem has propagated from RSS 0.92
and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
processing,

No, RSS 1.0 is clear over this - although the others do have a
problem. The RSS 1.0 spec wasn't written in the sloppy manner of the
others.

My point was that the problem has propagated to RSS 1.0 *processing*.
That is, there's software that assumes "entity-escaped HTML" in RSS 1.0
*titles*, even though there is no spec text to back it up.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/

Jul 20 '05 #13

mickjames

So can anyone show me how to put this HTML fragment into RSS/XML?

Jul 20 '05 #14

Nick Kew

In article <uo********************************@4ax.com>,
Andy Dingley <di*****@codesmiths.com> writes:

On Fri, 7 Jan 2005 01:25:36 +0000, ni**@hugin.webthing.com (Nick Kew)
wrote:
Why use a feed if it doesn't save your users anything?

Why do you assume the function of my RSS feed? I've built many

I don't. I made an inference from the wording of the OP.

Bearing in mind the above, freely mix it, just using namespaces to
distinguish the elements.

You can't use namespacing, because the content is HTML rather than
XHTML.

Nonsense. Just map the HTML trivially to XHTML.

Apart from the standards-based argument and the fact that
namespacing just doesn't make sense for HTML, it's also impractical to
expect the incoming HTML content to be well-formed as an XML fragment
(or even valid HTML!).

Not necessary. There's no shortage of software that'll parse HTML
and XHTML to the same representation or event stream.

Remember that RSS is a _feed_, not a one-off document (I wish Winer
would recognise this). Like all layered protocols you have to be very
careful that your implementations are not only correct for one
demonstration example, they have to be demonstrably correct for all
possible inputs.

Yes, and?

Since you're already breaking the purpose of a feed,

Rubbish. RSS does _NOT_ define any notion of "purpose", or what's
"appropriate" to use it for.

Erm, I read the OP as implying a conventional/familiar purpose. What
in the references to "web page content", "rss *feed* readers", or the
ugly-tagsoup-html sample, leads you to suppose otherwise?

--
Nick Kew

Jul 20 '05 #15

thufir.hawat

Nick Kew wrote:

In article <uo********************************@4ax.com>,
Andy Dingley <di*****@codesmiths.com> writes:

[..]

Not necessary. There's no shortage of software that'll parse HTML
and XHTML to the same representation or event stream.

[..]

what's meant by "event stream," please?
thanks,

Thufir Hawat

Jul 20 '05 #16

Nick Kew

In article <11**********************@f14g2000cwb.googlegroups .com>,
th**********@mail.com writes:

[..]
Not necessary. There's no shortage of software that'll parse HTML
and XHTML to the same representation or event stream.

[..]

what's meant by "event stream," please?

Google for SAX.

--
Nick Kew

Jul 20 '05 #17
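
To make "event stream" concrete, a small sketch in Python. It uses html.parser.HTMLParser from the standard library, which is not SAX itself but reports the same kind of start-tag, end-tag and text callbacks, and it accepts tag soup. The class name EventLogger is made up for illustration; the input is the snippet from #10.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Collect parser callbacks as a list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("text", data.strip()))


parser = EventLogger()
parser.feed("<b>This is a test.</a>\n<a href=foo.html>Bar</a>.\n<img src=baz.jpg>\n<p>")
parser.close()

for event in parser.events:
    print(event)
```

The parser reports the stray </a> and the unclosed <b> and <p> as they appear; turning that stream into well-formed XHTML is a separate step left to the consumer.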

Venkata Srinivasulu

hai

Jul 20 '05 #18
