
HTML to XML

2peachy
#1: Jul 20 '05

hello... I am brand new to this...
I did a search with no results...

how do you convert an html page into an xml page

2peach

Johannes Koch
#2: Jul 20 '05

re: HTML to XML


2peachy wrote:
> hello... I am brand new to this...
> I did a search with no results...
>
> how do you convert an html page into an xml page ?

For valid HTML documents you can use sx from OpenSP. Or use tidy to
output XHTML.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)

Andy Dingley
#3: Jul 20 '05

re: HTML to XML


On Tue, 13 Jan 2004 20:29:34 -0600, 2peachy
<2peachy.1005cn@mail.forum4designers.com> wrote:
>how do you convert an html page into an xml page ?

How long is a piece of string ?


How many pages are you dealing with ? Is this a one-off "I want to
convert my site" or a regular "I want to scrape stock prices from
another site and make them into an XML feed" ?

What's "HTML" ? Is this well-coded valid HTML 3.2 / 4.0, XHTML or
some tag-soup written by a M$oft tool ? What happens if it's not
valid ? Can your code crash, abandon the page, scream for human help,
or must it make a best-attempt ?

Can you avoid this altogether ? Can you obtain the content by some
friendlier means, such as RSS, direct access to the database, or some
other source ?

Why do you want to do it ? There are no "XML pages", there are only
XML documents. If you want to end up with "a web page" at the end of
it, then raw XML isn't enough of a finishing point, you need to take
it further.

What is "XML" ? What DTD or Schema are you aiming at ?


For one-offs, use Dave Raggett's Tidy (easily obtained via HTMLKit).
Even if you're not looking for an XHTML output, Tidy can be an
excellent pre-processor for sorting out ugly Tag Soup.

For screen-scrapes, use your favourite scripting language (Perl is
always a good start, but you could use Python or even JavaScript) and
use someone else's HTML parser.
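
As a rough illustration of the "someone else's HTML parser" approach, here is a minimal Python sketch that feeds tag soup through the standard library's `html.parser` and rebuilds it as a well-formed XML tree with `xml.etree.ElementTree`. The `HtmlToXml` class, the `html_to_xml` function and the `document` wrapper element are names invented for this example, and the recovery rules (ignore stray end tags, close unclosed elements at end of input) are one assumption about what a "best attempt" might mean.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Best-attempt converter: builds an XML tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Hypothetical wrapper element, so the output always has one root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (<input disabled>) get their own name as value.
        attrib = {name: (value if value is not None else name)
                  for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        # Text belongs to the current element, or trails its last child.
        current = self.stack[-1]
        if len(current):
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml("<p>Unclosed paragraph<br>with a <b>bold</i> word"))
```

This does not validate anything and makes no attempt to clean up element or attribute names that are illegal in XML, so for really ugly input it is probably still worth running Tidy first.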

RSS 1.0 is a good XML format to target for generic screen scraping,
even if you don't think your content is "relevant" to a newsfeed (but
RSS 0.92 isn't).
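
To make the RSS 1.0 suggestion concrete, here is a small Python sketch that wraps scraped (title, link) pairs in an RSS 1.0 style RDF document using `xml.etree.ElementTree`. The `build_feed` function, its parameters and the example URLs are made up for illustration. The element layout follows my understanding of the RSS 1.0 structure (a `channel` with an `rdf:Seq` table of contents, followed by `item` elements); in particular the `rdf:resource` attribute on `rdf:li` is an assumption that should be checked against the specification before relying on it.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"

# Ask ElementTree to serialise with an rdf: prefix and RSS as the default namespace.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("", RSS_NS)


def rdf(name):
    return f"{{{RDF_NS}}}{name}"


def rss(name):
    return f"{{{RSS_NS}}}{name}"


def build_feed(channel_url, channel_title, description, items):
    """items is a list of (title, link) pairs, e.g. from a screen scrape."""
    root = ET.Element(rdf("RDF"))

    channel = ET.SubElement(root, rss("channel"), {rdf("about"): channel_url})
    ET.SubElement(channel, rss("title")).text = channel_title
    ET.SubElement(channel, rss("link")).text = channel_url
    ET.SubElement(channel, rss("description")).text = description

    # Table of contents: one rdf:li per item, pointing at the item's URL.
    toc = ET.SubElement(ET.SubElement(channel, rss("items")), rdf("Seq"))

    for title, link in items:
        ET.SubElement(toc, rdf("li"), {rdf("resource"): link})
        item = ET.SubElement(root, rss("item"), {rdf("about"): link})
        ET.SubElement(item, rss("title")).text = title
        ET.SubElement(item, rss("link")).text = link

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_feed(
        "http://example.org/prices",
        "Scraped prices",
        "Hypothetical feed built from a screen scrape",
        [("Widget: 4.20", "http://example.org/prices#widget"),
         ("Gadget: 13.37", "http://example.org/prices#gadget")],
    ))
```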

--
Do whales have krillfiles ?