Storing HTML in XML

Hi,

Is it possible for me to store HTML tags inside XML nodes? I need some
way to share news headlines. Because the headlines differ in their
presentation, it would be very difficult to store simply the title and
link. If possible, how would I do this?

Burnsy

Aug 10 '05 #1
At 14:44:40 on Wednesday 10 August 2005, in the forum {comp.text.xml}, <bi******@yahoo.co.uk> wrote:
Is it possible for me to store HTML tags inside XML nodes? I need some
way to share news headlines. Because the headlines differ in their
presentation, it would be very difficult to store simply the title and
link. If possible, how would I do this?

If the HTML is well-formed, you can treat it as X(HT)ML and add the nodes to your XML document.

--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Vincit omnia simplicitas
Keep it simple
Aug 10 '05 #2
bi******@yahoo.co.uk wrote:
Is it possible for me to store HTML tags inside XML nodes?

Yes, but it's not pretty.
http://diveintomark.org/archives/200...compatible-rss

I need some way to share news headlines.

Then use RSS 1.0 or Atom 1.0.
This is very much a ready-invented wheel.

http://xml.coverpages.org/ni2005-07-15-a.html

Aug 10 '05 #3
Joris Gillis wrote:
If the HTML is well-formed, you can treat it as X(HT)ML
and add the nodes to your XML document


This is problematic (unworkably so, in my enormous experience of doing
it).

- It's probably a fragment, not a whole HTML document.

- If it is a fragment, then it may have multiple root elements, or none
at all. You can manipulate this in XML, but you have to be careful to
use fragment tools on it, not node trees.

- If it's HTML, you just can't guarantee well-formedness. Even quite
well-behaved HTML can omit closing tags, especially if it's an
arbitrary selection from a larger page.

- There's the issue of HTML entities that aren't declared in XML.

- Externally supplied HTML will have garbage in it - one day.

- HTML isn't XML. Applying XML rules to it, such as minimising a
non-empty element with no content (like <script src="foo" ></script> )
can cause no end of trouble downstream.
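
Several of the failure modes listed above are easy to reproduce by pushing typical fragments through a strict XML parser. The following is only a sketch using Python's standard xml.etree.ElementTree; the sample fragments and the names `samples` and `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments, one per problem mentioned in the list.
samples = {
    "unclosed tag": "<p>first line<br>second line</p>",
    "undeclared entity": "<p>fish&nbsp;chips</p>",
    "two root elements": "<b>Headline</b><i>Subhead</i>",
    "well-formed XHTML": "<p>first line<br/>second line</p>",
}


def is_well_formed(fragment):
    """Return True if a strict XML parser accepts the fragment as a document."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


results = {label: is_well_formed(text) for label, text in samples.items()}

# XML serialisers may minimise an empty element, which HTML consumers
# can misread for elements such as <script>.
script = ET.fromstring('<script src="foo"></script>')
minimised = ET.tostring(script, encoding="unicode")
expanded = ET.tostring(script, encoding="unicode", short_empty_elements=False)

print(results)
print(minimised)
print(expanded)
```

Only the last sample parses. The `short_empty_elements=False` flag is one way to keep the explicit end tag when serialising with this particular library.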

Aug 10 '05 #4
di*****@codesmiths.com wrote:
bi******@yahoo.co.uk wrote:

Is it possible for me to store HTML tags inside XML nodes?

Yes, but it's not pretty.
http://diveintomark.org/archives/200...compatible-rss

I need some way to share news headlines.

Then use RSS 1.0 or Atom 1.0
This is very much a ready-invented wheel.


Hehe. RSS has clearly gone the way of HTML. Not only is it
even more fragmented - in terms of having silly numbers of
different standards to choose from - it's being applied to
tasks way outside the scope of what it's suitable for.

That of course is the consequence of real-world popularity.

--
Not me guv
Aug 10 '05 #5
Hi Andy,

At 19:32:00 on Wednesday 10 August 2005, in the forum {comp.text.xml}, <di*****@codesmiths.com> wrote:
Joris Gillis wrote:
If the HTML is well-formed, you can treat it as X(HT)ML
and add the nodes to your XML document

I stated this wrong. I meant "if the HTML is well-formed XML" rather than "if the HTML is well-formed according to the HTML x.xx recommendation".

This is problematic (unworkably so, in my enormous experience of doing
it).

- It's probably a fragment, not a whole HTML document.

- If it is a fragment, then it may have multiple root elements, or none
at all. You can manipulate this in XML, but you have to be careful to
use fragment tools on it, not node trees.

- If it's HTML, you just can't guarantee well-formedness. Even quite
well-behaved HTML can omit closing tags, especially if it's an
arbitrary selection from a larger page.

- There's the issue of HTML entities that aren't declared in XML.

- Externally supplied HTML will have garbage in it - one day.

- HTML isn't XML. Applying XML rules to it, such as minimising a
non-empty element with no content (like <script src="foo" ></script> )
can cause no end of trouble downstream.


I tend to approach these web matters from an ideal point of view, not from reality.

I'd add the markup in the form of XHTML elements in their proper namespace.
But then again, I'm not a developer, just a hobbyist. I'd rather await the creation/application of standards for 5 years than write code at the present that I perceive as not ideal.
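
As a rough illustration of the "XHTML elements in their proper namespace" idea, here is a sketch in Python with the standard xml.etree.ElementTree. The container names `item` and `headline`, the `h` prefix and the helper `build_item` are assumptions invented for the example, and the headline markup has to be well-formed XML already for this to work.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("h", XHTML_NS)  # prefix chosen for this example only


def build_item(headline_xhtml):
    """Wrap well-formed XHTML markup in a hypothetical <item><headline> container."""
    item = ET.Element("item")
    headline = ET.SubElement(item, "headline")
    # Parsing inside a div that declares the XHTML namespace puts every
    # child element into that namespace.
    wrapper = ET.fromstring(f'<div xmlns="{XHTML_NS}">{headline_xhtml}</div>')
    headline.append(wrapper)
    return item


item = build_item('Big <em>news</em> <a href="http://example.com/">here</a>')
serialized = ET.tostring(item, encoding="unicode")
print(serialized)
```

A consumer can then pick out the XHTML by namespace, for example with `item.find(".//{http://www.w3.org/1999/xhtml}em")`, without confusing it with the surrounding vocabulary.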

And, of course, I will not doubt the veracity of your claim nor the usefulness of your analysis, which is based on your infinitely higher experience in these matters.

regards,
--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Vincit omnia simplicitas
Keep it simple
Aug 10 '05 #6
Nick Kew wrote:
di*****@codesmiths.com wrote:
bi******@yahoo.co.uk wrote:

Is it possible for me to store HTML tags inside XML nodes?

Yes, but it's not pretty.
http://diveintomark.org/archives/200...compatible-rss

I need some way to share news headlines.

Then use RSS 1.0 or Atom 1.0
This is very much a ready-invented wheel.


Hehe. RSS has clearly gone the way of HTML. Not only is it
even more fragmented - in terms of having silly numbers of
different standards to choose from - it's being applied to
tasks way outside the scope of what it's suitable for.


Yes. Trash it and use Atom.

///Peter
--
sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
&;top"
Aug 10 '05 #7
On Wed, 10 Aug 2005 19:28:11 +0100, Nick Kew <ni**@asgard.webthing.com>
wrote:
Hehe. RSS has clearly gone the way of HTML.

Oh, it's _much_ worse than that!
You know my opinion of Dave Winer - 'nuff said.

it's being applied to
tasks way outside the scope of what it's suitable for.


Not at all. RSS 1.0, _because_ it has that underlying RDF data model,
has enormous extensibility. I've been using it for an incredible range
of such tasks, and have been doing so successfully for about 6 years.
With RSS 1.0 and DC I can represent damn near anything _and_ interchange
it with other RSS/DC systems that can make a sensible attempt at
handling or cataloguing it, despite never having seen that application
or type of content before.
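
For readers who have not seen the combination, an RSS 1.0 item carrying Dublin Core metadata might be built roughly like this. It is a sketch in Python with xml.etree.ElementTree; the URL, title, creator and date are invented sample values, `make_item` is a hypothetical helper, and a complete RSS 1.0 document would also need a channel element, which is left out here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("rdf", RDF), ("rss", RSS), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def make_item(url, title, creator, date):
    """Build one RSS 1.0 item extended with two Dublin Core properties."""
    item = ET.Element(f"{{{RSS}}}item", {f"{{{RDF}}}about": url})
    ET.SubElement(item, f"{{{RSS}}}title").text = title
    ET.SubElement(item, f"{{{RSS}}}link").text = url
    # Properties from another vocabulary sit next to the core RSS ones.
    ET.SubElement(item, f"{{{DC}}}creator").text = creator
    ET.SubElement(item, f"{{{DC}}}date").text = date
    return item


root = ET.Element(f"{{{RDF}}}RDF")
root.append(make_item("http://example.com/news/1", "Storm warning",
                      "Newsdesk", "2005-08-10"))
document = ET.tostring(root, encoding="unicode")
print(document)
```

A consumer that knows nothing about the application can still read the `dc:` properties, which is roughly the interchange benefit described above.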

RSS 2.0 is of course beneath contempt. Jury's still out on Atom, but
the 0.3->1.0 debacle didn't help its case.
Aug 10 '05 #8
bi******@yahoo.co.uk wrote:
: Hi,

: Is it possible for me to store HTML tags inside XML nodes? I need some
: way to share news headlines. Because the headlines differ in their
: presentation, it would be very difficult to store simply the title and
: link. If possible, how would I do this?

Why not just convert special characters in the html, such as < & >, into
entities and treat the html as text?

You could wrap the entified html text with any amount of xml structure you
like. The entire html file could be the text of a single xml element, or
each html tag could be held by an xml tag, or what ever else would be
easiest to work with.

<the-entire-html-file>
&lt;html&gt; &lt;head ... etc ...
</the-entire-html-file>

<a-tag original="&lt;html&gt;" />
<a-tag original="&lt;head&gt;" />
<a-tag original="&lt;title&gt;" />This is the original text
<an-end-tag original="&lt;/title&gt;" />
<an-end-tag original="&lt;/head&gt;" />
<a-tag original="&lt;body&gt;" />welcome to my web site
<an-end-tag original="&lt;/body&gt;" />
<an-end-tag original="&lt;/html&gt;" />

or what ever
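
A minimal sketch of the first variant (the whole HTML fragment as the text of one element), assuming Python and its standard xml.etree.ElementTree. The element name `headline` and the helper `wrap_headline` are made up; the library does the entity conversion when serialising and undoes it when parsing.

```python
import xml.etree.ElementTree as ET


def wrap_headline(html_fragment):
    """Store an HTML fragment as plain text inside a hypothetical <headline> element."""
    element = ET.Element("headline")
    element.text = html_fragment  # assigned as text, so it is escaped on output
    return ET.tostring(element, encoding="unicode")


# Deliberately sloppy HTML: an unclosed <a> and a bare <br>.
fragment = '<b>Storm</b> warning &amp; <a href="/news/1">details<br>'
stored = wrap_headline(fragment)
recovered = ET.fromstring(stored).text

print(stored)
print(recovered == fragment)
```

Because the fragment is only text as far as XML is concerned, it does not need to be well-formed, and the consumer gets back exactly the string that went in.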

$0.10

--

This space not for rent.
Aug 10 '05 #9
On 10 Aug 2005 16:08:21 -0800, yf***@vtn1.victoria.tc.ca (Malcolm
Dew-Jones) wrote:
Why not just convert special characters in the html, such as < & >, into
entities and treat the html as text?


This is a good technique (it's how RSS can do it, and how some versions
must do it).

One caveat is that you must _always_ do this. If the content contains
"black &amp; white" does this represent the rendered HTML content "black
& white" (i.e. it has been encoded), or is it really "black &amp;
white", such as might appear in an HTML tutorial? It's simply
impossible to infer this from context in a consuming application, so
creators must be consistent in how the rule is applied - either always
or never, but not in some sort of "on demand" rule.
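
The ambiguity can be shown in a few lines. This is a sketch in Python using xml.sax.saxutils.escape; the two sample strings and the variable names are invented for the illustration.

```python
from xml.sax.saxutils import escape

# Two different pieces of HTML source.
rendered_html = "black &amp; white"      # a browser shows: black & white
tutorial_html = "black &amp;amp; white"  # a browser shows: black &amp; white

# A consistent producer escapes every payload exactly once.
always_escaped = [escape(rendered_html), escape(tutorial_html)]

# An inconsistent producer escapes the first payload but not the second.
sometimes_escaped = [escape(rendered_html), tutorial_html]

print(always_escaped[0] == always_escaped[1])        # payloads stay distinct
print(sometimes_escaped[0] == sometimes_escaped[1])  # payloads collide
```

With the inconsistent producer both payloads arrive as the same string, so no consumer can tell which HTML was meant.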

Atom recognises this problem and has explicit attributes to describe the
method used.
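
In Atom 1.0 that attribute is `type`, with the values "text", "html" and "xhtml". A consumer might branch on it roughly as below. This is a sketch in Python, not a complete Atom reader: the helper `title_as_html` is hypothetical, the sample feed is invented and leaves out other required Atom elements, and the "xhtml" case is not handled.

```python
import html
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def title_as_html(title):
    """Turn an Atom title element into HTML source, based on its type attribute."""
    kind = title.get("type", "text")  # Atom treats a missing type as "text"
    content = title.text or ""
    if kind == "text":
        # Plain text: escape it so it can be dropped into an HTML page.
        return html.escape(content, quote=False)
    if kind == "html":
        # Escaped HTML: the XML parser has already undone one level of escaping.
        return content
    raise ValueError(f"type {kind!r} is not handled in this sketch")


feed = ET.fromstring(
    f'<feed xmlns="{ATOM_NS}">'
    '<entry><title type="text">black &amp; white</title></entry>'
    '<entry><title type="html">black &amp;amp; &lt;b&gt;white&lt;/b&gt;</title></entry>'
    "</feed>"
)
titles = [title_as_html(t) for t in feed.iter(f"{{{ATOM_NS}}}title")]
print(titles)
```

The producer states the rule once per element, so the consumer never has to guess how many times to unescape.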
Aug 10 '05 #10
Malcolm Dew-Jones wrote:
bi******@yahoo.co.uk wrote:
: Hi,

: Is it possible for me to store HTML tags inside XML nodes? I need some
: way to share news headlines. Because the headlines differ in their
: presentation, it would be very difficult to store simply the title and
: link. If possible, how would I do this?

Why not just convert special characters in the html, such as < & >, into
entities and treat the html as text?

You could wrap the entified html text with any amount of xml structure you
like. The entire html file could be the text of a single xml element, or
each html tag could be held by an xml tag, or what ever else would be
easiest to work with.

<the-entire-html-file>
&lt;html&gt; &lt;head ... etc ...
</the-entire-html-file>

<a-tag original="&lt;html&gt;" />
<a-tag original="&lt;head&gt;" />
<a-tag original="&lt;title&gt;" />This is the original text
<an-end-tag original="&lt;/title&gt;" />
<an-end-tag original="&lt;/head&gt;" />
<a-tag original="&lt;body&gt;" />welcome to my web site
<an-end-tag original="&lt;/body&gt;" />
<an-end-tag original="&lt;/html&gt;" />

or what ever

That would be
<html:element name="html" id="elt0">
<html:element name="head" id="elt1">
<html:element name="title" id="elt2">
<html:text id="text0">This is the original text</html:text>
</html:element>
.... etc
And for those entities:
<html:entity type="alpha" name="amp" id="ent0"/>

Works very well, and of course is easy either to
manipulate or to reconstruct the original from.
All it needs is an HTML parser to construct -
well-formedness of the original HTML is not a requirement.
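
A rough sketch of that approach, assuming Python's standard html.parser.HTMLParser as the HTML parser. The output vocabulary (`html-fragment`, `element`, `attribute`, `text`, `entity`) is invented here and only loosely follows the markup above, and the class does no browser-style repair: a start tag is simply nested inside whatever is currently open.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HtmlToXml(HTMLParser):
    """Record tag-soup HTML as a well-formed XML tree (illustrative vocabulary)."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}  # never take an end tag

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities as separate events
        self.root = ET.Element("html-fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], "element", name=tag)
        for key, value in attrs:
            ET.SubElement(node, "attribute", name=key).text = value or ""
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element of that name; ignore strays.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].get("name") == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        ET.SubElement(self.stack[-1], "text").text = data

    def handle_entityref(self, name):
        ET.SubElement(self.stack[-1], "entity", name=name)

    def handle_charref(self, name):
        ET.SubElement(self.stack[-1], "entity", name="#" + name)


parser = HtmlToXml()
parser.feed("<p>black &amp; <b>white<p>next")  # unclosed tags on purpose
parser.close()
xml_text = ET.tostring(parser.root, encoding="unicode")
print(xml_text)
```

The input here is not well-formed, yet the result is, because the XML tree is built from parser events and never from the raw markup.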
$0.10


Inflation? :-)

--
Nick Kew
Aug 11 '05 #11
In <42******@news.victoria.tc.ca>, on 08/10/2005
at 04:08 PM, yf***@vtn1.victoria.tc.ca (Malcolm Dew-Jones) said:
Why not just convert special characters in the html, such as < & >,
into entities and treat the html as text?


That wouldn't have the same semantics. If the OP wants to eventually
render the text properly, then he must eventually serve <b> as <b>,
not as &lt;b&gt;.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Aug 11 '05 #12
Shmuel (Seymour J.) Metz <sp******@library.lspace.org.invalid> wrote:
In <42******@news.victoria.tc.ca>, on 08/10/2005
at 04:08 PM, yf***@vtn1.victoria.tc.ca (Malcolm Dew-Jones) said:
Why not just convert special characters in the html, such as < & >,
into entities and treat the html as text?


That wouldn't have the same semantics. If the OP wants to eventually
render the text properly, then he must eventually serve <b> as <b>,
not as &lt;b&gt;.


It should be fine, as long as it's applied correctly.

Consider: You're treating the inbound HTML as plain text. Plain text
must be correctly escaped. Thus, do (in Perl)

$html =~ s/\&/\&amp;/g;
$html =~ s/</\&lt;/g;
$html =~ s/>/\&gt;/g;

This will correctly handle all conversions necessary for these
characters ("&amp;" becomes "&amp;amp;", etc.). On extracting from the
XML container, do

$html =~ s/\&gt;/>/g;
$html =~ s/\&lt;/</g;
$html =~ s/\&amp;/\&/g;

(to be honest, I don't remember if you have to escape the & in the
first part, but it harms nothing)

This will correctly and adequately handle the escaping. Now, if you put
broken HTML in, it'll still be broken coming out... but you'll get back
what you put in, at least. Assuming nothing goofy like whitespace
removal happens, of course.
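
The order of the substitutions matters: the ampersand has to be escaped first and unescaped last. Here is a small sketch of the same round trip in Python rather than Perl; the function names and the sample string are invented, and `unescape_wrong_order` exists only to show what goes wrong.

```python
def escape_markup(text):
    """Escape for storage as XML text: ampersand first, then angle brackets."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def unescape_markup(text):
    """Reverse escape_markup: angle brackets first, ampersand last."""
    return text.replace("&gt;", ">").replace("&lt;", "<").replace("&amp;", "&")


def unescape_wrong_order(text):
    """Ampersand first: this also decodes entities that were part of the HTML."""
    return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")


# HTML that itself contains entities, as an HTML tutorial might.
sample = "<p>Write &lt;b&gt; for bold &amp; more</p>"
stored = escape_markup(sample)

print(unescape_markup(stored) == sample)
print(unescape_wrong_order(stored) == sample)
```

With the wrong order the stored `&amp;lt;` first becomes `&lt;` and is then decoded again to `<`, so the tutorial text turns into live markup.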
Keith
--
Keith Davies
ke**********@kjdavies.org
ke**********@gmail.com
http://www.kjdavies.org/

"Trying to sway him from his current kook-rant with facts is like trying
to create a vacuum in a room by pushing the air out with your hands."
-- Matt Frisch
Aug 12 '05 #13
