XHTML to XML conversion

I'm trying to do some "screen scraping", and am using
<http://www.oreilly.com/catalog/xmlhks/> for inspiration.

First I'd like to convert XHTML to XML, or extract XML from XHTML, I'm
not sure how to phrase that.

"Use Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It
for Data"
<http://hacks.oreilly.com/pub/h/2125>

Is what I'd like to do down the line, but for now I'm working on
something simpler.
First,

"Convert an HTML Document to XHTML with HTML Tidy"
<http://hacks.oreilly.com/pub/h/2054>

Instead of Tidy, I went with TagSoup
<http://mercury.ccil.org/~cowan/XML/tagsoup/>.
Then I'd like to go from XHTML to XML in order to:

"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>

How do I get the XML from the XHTML, please?

Here's what I have:

[thufir@arrakis tagSoup]$
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:13 IST 2005
[thufir@arrakis tagSoup]$ pwd
/home/thufir/Desktop/tagSoup
[thufir@arrakis tagSoup]$ ll
total 60
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ java -jar tagsoup.jar --files google.html
src: google.html dst: google.xhtml
[thufir@arrakis tagSoup]$ ll
total 76
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 10568 Aug 14 23:34 google.xhtml
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ cat google.xhtml -n
1 <?xml version="1.0" standalone="yes"?>
2
3 <html version="-//W3C//DTD HTML 4.01 Transitional//EN"
xmlns="http://www.w3.org/1999/xhtml"><head><title>Google
Directory</title><style>&lt;!--
4 body,td,a,p,.h{font-family: arial,sans-serif;}
.h{color:#008000}
.q{text-decoration:none; color:#0000cc;}
5 //--&gt;</style><script>
6 &lt;!--
7 function sf(){document.f.q.focus();}
8 // --&gt;
9 </script></head><body bgcolor="#ffffff" text="#000000"
link="#3300cc" vlink="#660066" alink="#ff0000" onload="sf();">
10 <center>
11 <table cellpadding="0" cellspacing="0" border="0"><tr><td
align="right" colspan="1" rowspan="1" valign="bottom"><img
src="http://www.google.com/images/hp0.gif" width="158" height="78"
alt="Google Directory"></img></td><td colspan="1" rowspan="1"
valign="bottom"><img src="http://www.google.com/images/hp1.gif"
width="50" height="78" alt=""></img></td><td colspan="1" rowspan="1"
valign="bottom"><img src="http://www.google.com/images/hp2.gif"
width="68" height="78" alt=""></img></td></tr><tr><td align="right"
colspan="1" rowspan="1" valign="top" class="h"><b>Directory</b></td><td
colspan="1" rowspan="1" valign="top"><img
src="http://www.google.com/images/hp3.gif" width="50" height="32"
alt=""></img></td><td colspan="1" rowspan="1" valign="top"
class="h"></td></tr></table><br clear="none"></br><table border="0"
cellspacing="0" cellpadding="0"><tr><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="0" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="0a" href="http://www.google.com/webhp?hl=en"><font
size="-1">Web</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="1" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="1a" href="http://www.google.com/imghp?hl=en"><font
size="-1">Images</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="2" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="2a" href="http://www.google.com/grphp?hl=en"><font
size="-1">Groups</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="3" bgcolor="#008000" width="95"><font color="#ffffff"
size="-1"><b>Directory</b></font></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="4" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="4a" href="http://www.google.com/nwshp?hl=en"><font
size="-1">News</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td colspan="1" rowspan="1"
width="15"> </td></tr><tr><td colspan="12" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><form
enctype="application/x-www-form-urlencoded" method="get"
action="http://www.google.com/search" name="f"><table cellpadding="0"
cellspacing="0"><tr align="middle" valign="center"><td colspan="1"
rowspan="1" width="150"> </td><td colspan="1" rowspan="1"><input
maxlength="256" type="text" name="q" size="40"
value=""></input><script>document.f.q.focus();</script><input
type="submit" name="btnG" value="Google Search"></input><input
type="hidden" name="hl" value="en"></input><input type="hidden"
name="cat" value="gwd/Top"></input></td><td align="left" colspan="1"
rowspan="1" width="150"><font size="-2"> • <a
shape="rect" href="http://www.google.com/dirhelp.html">Directory
Help</a></font></td></tr></table></form><p><font color="#008000"><b>The
web organized by topic into categories.</b></font></p><p></p><table
align="center" width="1%" border="0" cellspacing="7"
cellpadding="0"><tr><td colspan="4" rowspan="1" bgcolor="#008000"><img
width="1" height="1" alt=""></img></td></tr><tr><td colspan="1"
rowspan="1"> </td><td colspan="1" nowrap="nowrap" rowspan="1">
12 <b><a shape="rect" href="/Top/Arts/">Arts</a></b><br
clear="none"></br>
13 <font size="-1"><a shape="rect"
href="/Top/Arts/Movies/">Movies</a>, <a shape="rect"
href="/Top/Arts/Music/">Music</a>, <a shape="rect"
href="/Top/Arts/Television/">Television</a>, ...</font><p>
14 <b><a shape="rect" href="/Top/Business/">Business</a></b><br
clear="none"></br>
15 <font size="-1"><a shape="rect"
href="/Top/Business/Major_Companies/">Companies</a>, <a shape="rect"
href="/Top/Business/Financial_Services/">Finance</a>, <a shape="rect"
href="/Top/Business/Employment/">Jobs</a>, ...</font></p><p>
16 <b><a shape="rect" href="/Top/Computers/">Computers</a></b><br
clear="none"></br>
17 <font size="-1"><a shape="rect"
href="/Top/Computers/Internet/">Internet</a>, <a shape="rect"
href="/Top/Computers/Hardware/">Hardware</a>, <a shape="rect"
href="/Top/Computers/Software/">Software</a>, ...</font></p><p>
18 <b><a shape="rect" href="/Top/Games/">Games</a></b><br
clear="none"></br>
19 <font size="-1"><a shape="rect"
href="/Top/Games/Board_Games/">Board</a>, <a shape="rect"
href="/Top/Games/Roleplaying/">Roleplaying</a>, <a shape="rect"
href="/Top/Games/Video_Games/">Video</a>, ...</font></p><p>
20 <b><a shape="rect" href="/Top/Health/">Health</a></b><br
clear="none"></br>
21 <font size="-1"><a shape="rect"
href="/Top/Health/Alternative/">Alternative</a>, <a shape="rect"
href="/Top/Health/Fitness/">Fitness</a>, <a shape="rect"
href="/Top/Health/Medicine/">Medicine</a>, ...</font></p><p>
22 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
23 <b><a shape="rect" href="/Top/Home/">Home</a></b><br
clear="none"></br>
24 <font size="-1"><a shape="rect"
href="/Top/Home/Consumer_Information/">Consumers</a>, <a shape="rect"
href="/Top/Home/Homeowners/">Homeowners</a>, <a shape="rect"
href="/Top/Home/Family/">Family</a>, ...</font><p>
25 <b><a shape="rect" href="/Top/Kids_and_Teens/">Kids and
Teens</a></b><br clear="none"></br>
26 <font size="-1"><a shape="rect"
href="/Top/Kids_and_Teens/Computers/">Computers</a>, <a shape="rect"
href="/Top/Kids_and_Teens/Entertainment/">Entertainment</a>, <a
shape="rect" href="/Top/Kids_and_Teens/School_Time/">School</a>,
...</font></p><p>
27 <b><a shape="rect" href="/Top/News/">News</a></b><br
clear="none"></br>
28 <font size="-1"><a shape="rect"
href="/Top/News/Media/">Media</a>, <a shape="rect"
href="/Top/News/Newspapers/">Newspapers</a>, <a shape="rect"
href="/Top/News/Current_Events/">Current Events</a>, ...</font></p><p>
29 <b><a shape="rect"
href="/Top/Recreation/">Recreation</a></b><br
clear="none"></br> 30 <font size="-1"><a shape="rect"
href="/Top/Recreation/Food/">Food</a>, <a shape="rect"
href="/Top/Recreation/Outdoors/">Outdoors</a>, <a shape="rect"
href="/Top/Recreation/Travel/">Travel</a>, ...</font></p><p>
31 <b><a shape="rect" href="/Top/Reference/">Reference</a></b><br
clear="none"></br>
32 <font size="-1"><a shape="rect"
href="/Top/Reference/Education/">Education</a>, <a shape="rect"
href="/Top/Reference/Libraries/">Libraries</a>, <a shape="rect"
href="/Top/Reference/Maps/">Maps</a>, ...</font></p><p>
33 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
34 <b><a shape="rect" href="/Top/Regional/">Regional</a></b><br
clear="none"></br>
35 <font size="-1"><a shape="rect"
href="/Top/Regional/Asia/">Asia</a>, <a shape="rect"
href="/Top/Regional/Europe/">Europe</a>, <a shape="rect"
href="/Top/Regional/North_America/">North America</a>, ...</font><p>
36 <b><a shape="rect" href="/Top/Science/">Science</a></b><br
clear="none"></br>
37 <font size="-1"><a shape="rect"
href="/Top/Science/Biology/">Biology</a>, <a shape="rect"
href="/Top/Science/Social_Sciences/Psychology/">Psychology</a>, <a
shape="rect" href="/Top/Science/Physics/">Physics</a>,
...</font></p><p>
38 <b><a shape="rect" href="/Top/Shopping/">Shopping</a></b><br
clear="none"></br>
39 <font size="-1"><a shape="rect"
href="/Top/Shopping/Vehicles/Autos/">Autos</a>, <a shape="rect"
href="/Top/Shopping/Clothing/">Clothing</a>, <a shape="rect"
href="/Top/Shopping/Gifts/">Gifts</a>, ...</font></p><p>
40 <b><a shape="rect" href="/Top/Society/">Society</a></b><br
clear="none"></br>
41 <font size="-1"><a shape="rect"
href="/Top/Society/Issues/">Issues</a>, <a shape="rect"
href="/Top/Society/People/">People</a>, <a shape="rect"
href="/Top/Society/Religion_and_Spirituality/">Religion</a>,
...</font></p><p>
42 <b><a shape="rect" href="/Top/Sports/">Sports</a></b><br
clear="none"></br>
43 <font size="-1"><a shape="rect"
href="/Top/Sports/Basketball/">Basketball</a>, <a shape="rect"
href="/Top/Sports/Football/">Football</a>, <a shape="rect"
href="/Top/Sports/Soccer/">Soccer</a>, ...</font></p><p>
44 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="3" rowspan="1"><b><a shape="rect"
href="/Top/World/">World</a></b><br clear="none"></br>
45 <font size="-1"><a shape="rect"
href="/Top/World/Deutsch/">Deutsch</a>, <a shape="rect"
href="/Top/World/Espa%C3%B1ol/">Español</a>, <a shape="rect"
href="/Top/World/Fran%C3%A7ais/">Français</a>, <a shape="rect"
href="/Top/World/Italiano/">Italiano</a>, <a shape="rect"
href="/Top/World/Japanese/">Japanese</a>, <a shape="rect"
href="/Top/World/Korean/">Korean</a>, <a shape="rect"
href="/Top/World/Nederlands/">Nederlands</a>, <a shape="rect"
href="/Top/World/Polska/">Polska</a>, <a shape="rect"
href="/Top/World/Svenska/">Svenska</a>, ...</font><p>
46 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="1" nowrap="nowrap" rowspan="1"><font
size="-1"> </font></td></tr><tr><td colspan="4" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><font size="-1"><a
shape="rect"
href="http://www.google.com/ads/">Advertise with Us</a> - <a
shape="rect"
href="http://www.google.com/about.html">Jobs, Press, Cool Stuff...</a></font><p><font
face="arial,sans-serif" size="-1"> ©2004 Google</font></p><br
clear="none"></br><table align="center" border="0" bgcolor="#336600"
cellpadding="3" cellspacing="0"><tr><td colspan="1" rowspan="1"> <table
width="100%" cellpadding="2" cellspacing="0" border="0"><tr
align="center"><td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2" color="#ffffff">Help build the largest
human-edited directory on the web.</font></td></tr><tr align="center"
bgcolor="#cccccc"><td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2">
47 <a shape="rect" href="http://dmoz.org/add.html">
48 Submit a Site</a> - <a shape="rect"
href="http://dmoz.org/about.html"><b>Open Directory Project</b></a> -
49 <a shape="rect" href="http://dmoz.org/cgi-bin/apply.cgi">Become
an Editor</a> </font>
50 </td></tr></table>
51 </td></tr></table>
52 </center></body></html>
53
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:57 IST 2005
[thufir@arrakis tagSoup]$
Thanks,

Thufir

Aug 15 '05 #1
12 Replies


ha**********@gmail.com wrote:
I'm trying to do some "screen scraping",
As a general rule, this sucks. It's a vile process and very brittle
(they change their site without telling you, your code dies). It's
impossible to say how hard or easy it is - it's massively dependent on
the target page you're scraping. Even within one site, one page may be
easy and another a nightmare.

It's also increasingly unnecessary and even illegal to do it. Chances
are that if they _want_ you to have the content there will be an RSS
version of it, and if they don't then they'll get pissy with suits over
it.

So these days you can quite probably go the easy route, and if you
can't then there's problems ahead anyway.
First I'd like to convert XHTML to XML,


If your input is XHTML, then life is a lot easier than if it's HTML.
XHTML _is_ XML, which means that it should be amenable to processing
with XML tools - these are generally much easier to work with than HTML
parsing tools.
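
To make "XHTML is XML" concrete, here is a minimal sketch in Python (my choice for illustration; nothing in this thread uses it). It feeds a small hand-written XHTML fragment, standing in for the TagSoup output, to the standard library's generic XML parser. The fragment and the `list_links` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written XHTML fragment, standing in for TagSoup's output.
XHTML = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Google Directory</title></head>
  <body>
    <b><a shape="rect" href="/Top/Arts/">Arts</a></b>
    <b><a shape="rect" href="/Top/Business/">Business</a></b>
  </body>
</html>"""

# XHTML elements live in this namespace, so lookups must be qualified.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def list_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element in the document."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), a.text) for a in root.findall(".//x:a", NS)]


links = list_links(XHTML)
for href, text in links:
    print(href, text)
```

No HTML-specific parser is involved: the only XHTML-specific detail is the namespace URI, which has to be supplied when looking elements up.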

OTOH, XHTML is rare on the web. Even where you do see it, Appendix C
means that it has to be served up as HTML rather than XML (and may no
longer work correctly as XML). Additionally much of it is still just
broken, as the web always has been. Be wary of any page with externally
served ads on it!
You will probably get your project developed most effectively by first
hacking around with a few well-behaved RSS or Atom feeds (BBC and
Google are good sources). Learn to work through half the problem before
you have to dive into the nasty half of straining through random tag
soup.

Your previously described architecture looked like an awful lot of
layers - I've never needed to use that many stages of processing.

Aug 15 '05 #2

di*****@codesmiths.com wrote:
ha**********@gmail.com wrote:

It's also increasingly unnecessary and even illegal to do it.
Why should it be illegal to save a (public) html-file and modify it? You can
save it with the "save as" function of your browser as well!

If you save it for your own use I do not think it is illegal.

Chances are that if they _want_ you to have the content there will be an RSS version of it, and if they don't then they'll get pissy with suits
over it.


They might get pissy, but tell me: why do they publish that information on
the web?!

regards

Andreas
Aug 15 '05 #3

Andreas Baier wrote:
....
Chances are that if they _want_ you to have the content there will be an

RSS
version of it, and if they don't then they'll get pissy with suits
over it.


They might get pissy, but tell me: why do they publish that information on
the web?!

....

Ok, let's take this to alt.ethics.web ;)

If there's a beef, it should really be with O'Reilly for publishing the
hack <http://hacks.oreilly.com/pub/h/2125>. Of course, they're
probably protected by the "free speech" rights part of the
constitution, but IANAL. (Bill of rights? which part?)

Anyhow, it's for personal use. I'm not republishing the data, which'd
be slimy. I don't know that it's illegal, but that'd be slimy.
Whether it's illegal or slimy, I'm sure there's a book on it. I'm sure
there are books on spam, for example.

-Thufir

Aug 15 '05 #4

di*****@codesmiths.com wrote:
....
If your input is XHTML, then life is a lot easier than if it's HTML.
XHTML _is_ XML, which means that it should be amenable to processing
with XML tools - these are generally much easier to work with than HTML
parsing tools.

....

TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
the XHTML file for this trial run. I'm more trying to understand than
do anything practical at this stage.

If XHTML is XML, can the hack
"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>

be run on the XHTML file to create the XSLT Identity Stylesheet? If
not, is there some other "hack" to do something like that, but with
XHTML?

Once you have XHTML you have XML because, as you said, XHTML is XML.
However, there's all that extra stuff in there which makes it XHTML.
An XSL Stylesheet can turn the XHTML file into plain XML?

At the moment I just want to get some sort of XSLT Stylesheet to work
with as a baseline to try to understand this. Can Relaxer create an
Identity Stylesheet for an XHTML file? Once I have an Identity
Stylesheet, that'd be something to work with.

I'm also working on this from the direction of Cocoon as per "Use
Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It for
Data" <http://hacks.oreilly.com/pub/h/2125>. Right now I'm just trying
to figure out how to convert XHTML to XML. I know that an XSLT can be
used, but can an Identity Stylesheet for an XHTML file be generated?
Thanks,

Thufir

Aug 15 '05 #5

ha**********@gmail.com wrote:
If XHTML is XML, can the hack
"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>
be run on the XHTML file to create the XSLT Identity Stylesheet? If
not, is there some other "hack" to do something like that, but with
XHTML?
An identity stylesheet won't do anything for you. When run on an XSL
processor, it just takes an XML document and spits out the same document.
Once you have XHTML you have XML because, as you said, XHTML is XML.
However, there's all that extra stuff in there which makes it XHTML.
An XSL Stylesheet can turn the XHTML file into plain XML?


Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
XSL, XML Schema, eclipse .project files, ant build scripts, and
thousands of others.

I guess you mean, can you use an XSL stylesheet to extract the data in
which you are interested, from the XHTML? Yes, you can, but as someone
pointed out, it will be pretty brittle -- someone introduces or removes
a <span> around "your" data, and you will probably have to edit your
transform stylesheet. You will get tired.
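
The brittleness can be sketched in a few lines. This is an illustration in Python with ElementTree's limited XPath rather than an XSLT stylesheet, and the two page fragments and both helper functions are invented for the example. The same trade-off applies to XPath expressions inside a stylesheet.

```python
import xml.etree.ElementTree as ET

NS = {"x": "http://www.w3.org/1999/xhtml"}

BEFORE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><b>42</b></p>
</body></html>"""

# The same page after the site wraps the value in a <span>.
AFTER = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p><span><b>42</b></span></p>
</body></html>"""


def strict_extract(text):
    """Follow one exact child path; any structural change breaks it."""
    node = ET.fromstring(text).find("x:body/x:p/x:b", NS)
    return None if node is None else node.text


def loose_extract(text):
    """Search all descendants; tolerates extra wrapper elements."""
    node = ET.fromstring(text).find(".//x:b", NS)
    return None if node is None else node.text


print(strict_extract(BEFORE), strict_extract(AFTER))
print(loose_extract(BEFORE), loose_extract(AFTER))
```

The loose version survives the added `<span>`, but it will also happily match the wrong `<b>` if the page grows another one, so neither approach removes the maintenance burden.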

BTW, tools do exist to generate extractors, that can select data on web
pages and then be set loose to suck data out of a large set of
almost-identical, or identically-generated pages. But these are expensive.

Soren
Aug 15 '05 #6

Soren Kuula wrote:
....
An identity stylesheet won't do anything for you. When run on an XSL
processor, it just takes an XML document and spits out the same document.
That's ok, it'd be a starting point.

So, Relaxer should be able to generate an identity stylesheet?
Then I could modify the stylesheet to actually extract the data?
.... Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
XSL, XML Schema, eclipse .project files, ant build scripts, and
thousands of others.

I guess you mean, can you use an XSL stylesheet to extract the data in
which you are interested, from the XHTML? Yes, you can, but as someone
pointed out, it will be pretty brittle -- someone introduces or removes
a <span> around "your" data, and you will probably have to edit your
transform stylesheet. You will get tired.

....

This is just for a one off, to see how it works. I recognize the
brittleness of it conceptually, although I don't exactly know what a
span is, besides being a linear algebra term.

If it breaks, that's ok.
-Thufir

Aug 15 '05 #7

On 15 Aug 2005 15:36:44 -0700, "ha**********@gmail.com"
<ha**********@gmail.com> wrote:
TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
the XHTML file for this trial run. I'm more trying to understand than
do anything practical at this stage.
You have roughly three problems to solve.

- Turning HTML tag soup into a sensible document

- Turning "information" into "data"

- Turning your minimal raw data into something useful.
TagSoup appears to solve the first one for you - a series of SAX events
may be enough to work from, without even needing to save it as a
"document".
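
As a rough sketch of working straight from SAX events, the following uses Python's `xml.sax` module as a stand-in (TagSoup itself is a Java SAX parser, so this shows the idea, not TagSoup's API). The `LinkCollector` class and the sample input are assumptions made for the example.

```python
import xml.sax


class LinkCollector(xml.sax.ContentHandler):
    """Collect href attributes from <a> start-element events."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Namespace processing is off by default in xml.sax, so `name`
        # is the raw tag name as written in the document.
        if name == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])


XHTML = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="/Top/Arts/">Arts</a>, <a href="/Top/Games/">Games</a>
</body></html>"""

collector = LinkCollector()
xml.sax.parseString(XHTML, collector)
print(collector.hrefs)
```

Nothing is written to disk and no tree is built: the handler sees each element as it streams past and keeps only what it needs.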

The second is the hard one, and the one that's most dependent on the
target site. A well-coded semantically-detailed site is easy,
pixel-based "visual design" can be almost impossible. You need to
identify relationships in the page such that "the row below the row
containing the string "Weather" will have the expected temperature in the
third column" - then you implement something (perhaps in complicated
XPath and simple XSLT) that can implement this rule and extract the
useful datum.
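
A rule like the "Weather" one can be sketched without XSLT at all. The version below is plain Python over an invented, namespace-free table; the `value_below` helper and the sample data are hypothetical, and a real page would need the XHTML namespace and far more defensive handling.

```python
import xml.etree.ElementTree as ET

# Made-up table: the datum we want sits in the row after the "Weather" row.
PAGE = """<table>
  <tr><td>Weather</td><td>Today</td><td>Tomorrow</td></tr>
  <tr><td>London</td><td>cloudy</td><td>19</td></tr>
  <tr><td>Sport</td><td>-</td><td>-</td></tr>
</table>"""


def value_below(table, marker, column):
    """Return the text of `column` (0-based) in the row after the row containing `marker`."""
    rows = table.findall("tr")
    for index, row in enumerate(rows[:-1]):
        cells = [td.text for td in row.findall("td")]
        if marker in cells:
            below = rows[index + 1].findall("td")
            return below[column].text if column < len(below) else None
    return None


temperature = value_below(ET.fromstring(PAGE), "Weather", 2)
print(temperature)
```

The rule is tied to the layout (row order, column position), which is exactly why this step depends so heavily on the target site.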

Processing the raw data out into a useful output is a perfect XSLT task.
This is relatively easy.
If XHTML is XML, can the hack
"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>
I have no idea what that is - it just looks like a title to me.

I can't even think what an "identity stylesheet" would be either - at
least not in any useful context.

An XSL Stylesheet can turn the XHTML file into plain XML?


There's no such thing as "plain XML". All XML documents have a schema -
although there's an abstract concept of "plain XML", you can't have any
real concrete document without some level of schema. You might not write
a formal schema, you might not even think through exactly what's in it,
but as soon as you start giving your elements names, then you've started
defining some degree of schema. Now if you have to have a schema, and if
it has to represent web content, then you might as well be using XHTML
for it !

The intermediate format (between steps 2 & 3) has less reason to be
XHTML. It's more likely to be application specific and _just_ holding
the core data. The schema here could be custom-rolled, or it might be
some sort of pre-existing weather schema, ebXML catalogue information,
or even good old Dublin Core.
As a development route, I'd suggest trying to turn some feed (Amazon top
10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
into a usable page. This should be an easy enough example to work on -
then you can try a more awkward site. I think that hand-coding a simple
example will give you better experience than diving in with Cocoon and
having random magic happen in front of you that you don't really
understand what it's doing (and Cocoon isn't my obvious thought for a
first tool to use).
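
A hand-coded version of the "feed into a usable page" exercise might look like the sketch below. It assumes an Atom feed and uses Python's standard library; the feed content, the example.org URLs and the `feed_to_html` helper are made up for illustration, and a real feed would be fetched rather than embedded.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Invented two-entry feed standing in for a real one.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Top stories</title>
  <entry><title>First &amp; foremost</title><link href="http://example.org/1"/></entry>
  <entry><title>Second story</title><link href="http://example.org/2"/></entry>
</feed>"""


def feed_to_html(feed_text):
    """Render the feed title as a heading and each entry as a linked list item."""
    root = ET.fromstring(feed_text)
    items = []
    for entry in root.findall("a:entry", ATOM):
        title = entry.findtext("a:title", default="", namespaces=ATOM)
        link = entry.find("a:link", ATOM)
        href = link.get("href", "") if link is not None else ""
        items.append('<li><a href="%s">%s</a></li>'
                     % (escape(href, quote=True), escape(title)))
    heading = root.findtext("a:title", default="", namespaces=ATOM)
    return "<h1>%s</h1>\n<ul>\n%s\n</ul>" % (escape(heading), "\n".join(items))


page = feed_to_html(FEED)
print(page)
```

Because the feed is already well-formed and semantically labelled, the whole job is a lookup and a bit of string formatting, with none of the layout guessing that scraping needs.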
--
Cats have nine lives, which is why they rarely post to Usenet.
Aug 16 '05 #8

Andy Dingley wrote:
....
I have no idea what that is - it just looks like a title to me.
Heh, I was *hoping* someone who owned the book would respond ;)
I can't even think what an "identity stylesheet" would be either - at
least not in any useful context.
I don't think it's useful in and of itself. In the example from that
book, IIRC they take an xml document, time.xml I believe, create an
XSLT, run the two through ?xerces? and the result is, essentially,
time.xml; no, it's not useful.

The point is to automatically create a stylesheet which does nothing,
then edit the stylesheet, versus creating the stylesheet from scratch.
It's in the same section as using GUI XSL editors.
An XSL Stylesheet can turn the XHTML file into plain XML?
There's no such thing as "plain XML". All XML documents have a schema -
although there's an abstract concept of "plain XML", you can't have any
real concrete document without some level of schema. You might not write
a formal schema, you might not even think through exactly what's in it,
but as soon as you start giving your elements names, then you've started
defining some degree of schema. Now if you have to have a schema, and if
it has to represent web content, then you might as well be using XHTML
for it !


Ah, thank you. I learned something, the "schema" "thing" makes more
sense now. I don't quite get where XML leaves off and XHTML starts,
although I do know a bit. XML doesn't have reserved words(?) while
XHTML does, like <p> for paragraph. XML being more general. I should
read a bit more about XHTML.

.... As a development route, I'd suggest trying to turn some feed (Amazon top
10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
into a usable page. This should be an easy enough example to work on -
then you can try a more awkward site. I think that hand-coding a simple
example will give you better experience than diving in with Cocoon and
having random magic happen in front of you that you don't really
understand what it's doing (and Cocoon isn't my obvious thought for a
first tool to use).

....

Ok, that sounds good. I think I bit off a bit more than I can chew at
the moment. I'll work from a feed then.
Thanks,

Thufir

Aug 16 '05 #9

ha**********@gmail.com wrote:
Heh, I was *hoping* someone who owned the book would respond ;)


I haven't bought an O'Reilly in years.

I can't even think what an "identity stylesheet" would be either - at
least not in any useful context.


Ah - I think I see what this "identity" stylesheet is about.

An identity transform turns "A" into "A". There's an obvious way to
write one in XSLT that uses wildcards to copy everything, as the
identity transform. However (given a schema or even an example of
input) it would be possible to generate a "longhand" identity
stylesheet that did each element explicitly. This could then be
modified to process each element differently, as you required it.

However this is just a time-saving measure for writing it, not some
fundamental technique. You can code your own pretty easily.
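
To show the idea of "start from identity, then override", here is a sketch in Python rather than XSLT (in XSLT the wildcard version is a single template matching `@*|node()` that copies and recurses). The `transform` and `font_to_span` functions are hypothetical names for this example, and ElementTree drops comments on parsing, so this is only an approximation of a true identity transform.

```python
import xml.etree.ElementTree as ET


def transform(node, handlers=None):
    """Copy `node` recursively; a handler for a tag replaces the default copy."""
    handlers = handlers or {}
    if node.tag in handlers:
        return handlers[node.tag](node, handlers)
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text, copy.tail = node.text, node.tail
    for child in node:
        copy.append(transform(child, handlers))
    return copy


def font_to_span(node, handlers):
    """Example override: rename <font> to <span>, dropping its attributes."""
    span = ET.Element("span")
    span.text, span.tail = node.text, node.tail
    for child in node:
        span.append(transform(child, handlers))
    return span


SOURCE = '<p><font size="-1"><a href="/Top/Arts/">Arts</a>, ...</font></p>'
doc = ET.fromstring(SOURCE)

# With no handlers the output equals the input: the identity transform.
identity = ET.tostring(transform(doc), encoding="unicode")
# One handler changes one element type and leaves the rest untouched.
modified = ET.tostring(transform(doc, {"font": font_to_span}), encoding="unicode")
print(identity)
print(modified)
```

With an empty handler table you get "A" back from "A"; each handler you add is the equivalent of editing one template in a generated stylesheet.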

I've seen any number of "stylesheet generator" tools over the years,
from Schematron onwards. Supposedly you can transform anything to
anything, with auto-generated XSLT, based on input schemas and clever
code. However this whole area is a technique that has singularly
_failed_ to deliver useful products (unusual for XML tools). I'm
enormously skeptical about them.

Aug 16 '05 #10

di*****@codesmiths.com wrote:
....
An identity transform turns "A" into "A". There's an obvious way to
write one in XSLT that uses wildcards to copy everything, as the
identity transform. However (given a schema or even an example of
input) it would be possible to generate a "longhand" identity
stylesheet that did each element explicitly. This could then be
modified to process each element differently, as you required it.

However this is just a time-saving measure for writing it, not some
fundamental technique. You can code your own pretty easily.

....
Take matrix A. Then there's the identity matrix I.

AI=?=IA

I forget. heh.
-Thufir

Aug 16 '05 #11



help on creading my pimp page

*** Sent via Developersdex http://www.developersdex.com ***
Aug 31 '05 #12

edgar arizmendi wrote:
help on creading my pimp page

*** Sent via Developersdex http://www.developersdex.com ***

why?

Sep 5 '05 #13
