XHTML to XML conversion

I'm trying to do some "screen scraping", and am using
<http://www.oreilly.com/catalog/xmlhks/> for inspiration.

First I'd like to convert XHTML to XML, or extract XML from XHTML, I'm
not sure how to phrase that.

"Use Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It
for Data"
<http://hacks.oreilly.com/pub/h/2125>

Is what I'd like to do down the line, but for now I'm working on
something simpler.
First,

"Convert an HTML Document to XHTML with HTML Tidy"
<http://hacks.oreilly.com/pub/h/2054>

Instead of Tidy, I went with TagSoup
<http://mercury.ccil.org/~cowan/XML/tagsoup/>.
Then I'd like to go from XHTML to XML in order to:

"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>

How do I get the XML from the XHTML, please?

Here's what I have:

[thufir@arrakis tagSoup]$
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:13 IST 2005
[thufir@arrakis tagSoup]$ pwd
/home/thufir/Desktop/tagSoup
[thufir@arrakis tagSoup]$ ll
total 60
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ java -jar tagsoup.jar --files google.html
src: google.html dst: google.xhtml
[thufir@arrakis tagSoup]$ ll
total 76
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 10568 Aug 14 23:34 google.xhtml
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ cat google.xhtml -n
1 <?xml version="1.0" standalone="yes"?>
2
3 <html version="-//W3C//DTD HTML 4.01 Transitional//EN"
xmlns="http://www.w3.org/1999/xhtml"><head><title>Google
Directory</title><style>&lt;!--
4 body,td,a,p,.h{ font-family: arial,sans-serif;}
.h{color:#008000}
.q{text-decoration:none; color:#0000cc;}
5 //--&gt;</style><script>
6 &lt;!--
7 function sf(){document.f.q.focus();}
8 // --&gt;
9 </script></head><body bgcolor="#ffffff" text="#000000"
link="#3300cc" vlink="#660066" alink="#ff0000" onload="sf();">
10 <center>
11 <table cellpadding="0" cellspacing="0" border="0"><tr> <td
align="right" colspan="1" rowspan="1" valign="bottom" ><img
src="http://www.google.com/images/hp0.gif" width="158" height="78"
alt="Google Directory"></img></td><td colspan="1" rowspan="1"
valign="bottom" ><img src="http://www.google.com/images/hp1.gif"
width="50" height="78" alt=""></img></td><td colspan="1" rowspan="1"
valign="bottom" ><img src="http://www.google.com/images/hp2.gif"
width="68" height="78" alt=""></img></td></tr><tr><td align="right"
colspan="1" rowspan="1" valign="top" class="h"><b>Directory</b></td><td
colspan="1" rowspan="1" valign="top"><img
src="http://www.google.com/images/hp3.gif" width="50" height="32"
alt=""></img></td><td colspan="1" rowspan="1" valign="top"
class="h"></td></tr></table><br clear="none"></br><table border="0"
cellspacing="0" cellpadding="0" ><tr><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="0" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="0a" href="http://www.google.com/webhp?hl=en"><font
size="-1">Web</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="1" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="1a" href="http://www.google.com/imghp?hl=en"><font
size="-1">Images</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="2" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="2a" href="http://www.google.com/grphp?hl=en"><font
size="-1">Groups</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="3" bgcolor="#008000" width="95"><font color="#ffffff"
size="-1"><b>Directory </b></font></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="4" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="4a" href="http://www.google.com/nwshp?hl=en"><font
size="-1">News</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td colspan="1" rowspan="1"
width="15"> </td></tr><tr><td colspan="12" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><form
enctype="application/x-www-form-urlencoded" method="get"
action="http://www.google.com/search" name="f"><table cellpadding="0"
cellspacing="0" ><tr align="middle" valign="center" ><td colspan="1"
rowspan="1" width="150"> </td><td colspan="1" rowspan="1"><input
maxlength="256" type="text" name="q" size="40"
value=""></input><script>document.f.q.focus();</script><input
type="submit" name="btnG" value="Google Search"></input><input
type="hidden" name="hl" value="en"></input><input type="hidden"
name="cat" value="gwd/Top"></input></td><td align="left" colspan="1"
rowspan="1" width="150"><font size="-2"> • <a
shape="rect" href="http://www.google.com/dirhelp.html">Directory
Help</a></font></td></tr></table></form><p><font color="#008000" ><b>The
web organized by topic into categories.</b></font></p><p></p><table
align="center" width="1%" border="0" cellspacing="7"
cellpadding="0" ><tr><td colspan="4" rowspan="1" bgcolor="#008000"><img
width="1" height="1" alt=""></img></td></tr><tr><td colspan="1"
rowspan="1"> </td><td colspan="1" nowrap="nowrap" rowspan="1">
12 <b><a shape="rect" href="/Top/Arts/">Arts</a></b><br
clear="none"></br>
13 <font size="-1"><a shape="rect"
href="/Top/Arts/Movies/">Movies</a>, <a shape="rect"
href="/Top/Arts/Music/">Music</a>, <a shape="rect"
href="/Top/Arts/Television/">Television</a>, ...</font><p>
14 <b><a shape="rect" href="/Top/Business/">Business</a></b><br
clear="none"></br>
15 <font size="-1"><a shape="rect"
href="/Top/Business/Major_Companies/">Companies </a>, <a shape="rect"
href="/Top/Business/Financial_Services/">Finance</a>, <a shape="rect"
href="/Top/Business/Employment/">Jobs</a>, ...</font></p><p>
16 <b><a shape="rect" href="/Top/Computers/">Computers </a></b><br
clear="none"></br>
17 <font size="-1"><a shape="rect"
href="/Top/Computers/Internet/">Internet</a>, <a shape="rect"
href="/Top/Computers/Hardware/">Hardware</a>, <a shape="rect"
href="/Top/Computers/Software/">Software</a>, ...</font></p><p>
18 <b><a shape="rect" href="/Top/Games/">Games</a></b><br
clear="none"></br>
19 <font size="-1"><a shape="rect"
href="/Top/Games/Board_Games/">Board</a>, <a shape="rect"
href="/Top/Games/Roleplaying/">Roleplaying</a>, <a shape="rect"
href="/Top/Games/Video_Games/">Video</a>, ...</font></p><p>
20 <b><a shape="rect" href="/Top/Health/">Health</a></b><br
clear="none"></br>
21 <font size="-1"><a shape="rect"
href="/Top/Health/Alternative/">Alternative</a>, <a shape="rect"
href="/Top/Health/Fitness/">Fitness</a>, <a shape="rect"
href="/Top/Health/Medicine/">Medicine</a>, ...</font></p><p>
22 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
23 <b><a shape="rect" href="/Top/Home/">Home</a></b><br
clear="none"></br>
24 <font size="-1"><a shape="rect"
href="/Top/Home/Consumer_Information/">Consumers </a>, <a shape="rect"
href="/Top/Home/Homeowners/">Homeowners</a>, <a shape="rect"
href="/Top/Home/Family/">Family</a>, ...</font><p>
25 <b><a shape="rect" href="/Top/Kids_and_Teens/">Kids and
Teens</a></b><br clear="none"></br>
26 <font size="-1"><a shape="rect"
href="/Top/Kids_and_Teens/Computers/">Computers </a>, <a shape="rect"
href="/Top/Kids_and_Teens/Entertainment/">Entertainment </a>, <a
shape="rect" href="/Top/Kids_and_Teens/School_Time/">School</a>,
...</font></p><p>
27 <b><a shape="rect" href="/Top/News/">News</a></b><br
clear="none"></br>
28 <font size="-1"><a shape="rect"
href="/Top/News/Media/">Media</a>, <a shape="rect"
href="/Top/News/Newspapers/">Newspapers</a>, <a shape="rect"
href="/Top/News/Current_Events/">Current Events</a>, ...</font></p><p>
29 <b><a shape="rect"
href="/Top/Recreation/">Recreation</a></b><br
clear="none"></br>
30 <font size="-1"><a shape="rect"
href="/Top/Recreation/Food/">Food</a>, <a shape="rect"
href="/Top/Recreation/Outdoors/">Outdoors</a>, <a shape="rect"
href="/Top/Recreation/Travel/">Travel</a>, ...</font></p><p>
31 <b><a shape="rect" href="/Top/Reference/">Reference </a></b><br
clear="none"></br>
32 <font size="-1"><a shape="rect"
href="/Top/Reference/Education/">Education </a>, <a shape="rect"
href="/Top/Reference/Libraries/">Libraries </a>, <a shape="rect"
href="/Top/Reference/Maps/">Maps</a>, ...</font></p><p>
33 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
34 <b><a shape="rect" href="/Top/Regional/">Regional</a></b><br
clear="none"></br>
35 <font size="-1"><a shape="rect"
href="/Top/Regional/Asia/">Asia</a>, <a shape="rect"
href="/Top/Regional/Europe/">Europe</a>, <a shape="rect"
href="/Top/Regional/North_America/">North America</a>, ...</font><p>
36 <b><a shape="rect" href="/Top/Science/">Science</a></b><br
clear="none"></br>
37 <font size="-1"><a shape="rect"
href="/Top/Science/Biology/">Biology</a>, <a shape="rect"
href="/Top/Science/Social_Sciences/Psychology/">Psychology</a>, <a
shape="rect" href="/Top/Science/Physics/">Physics</a>,
...</font></p><p>
38 <b><a shape="rect" href="/Top/Shopping/">Shopping</a></b><br
clear="none"></br>
39 <font size="-1"><a shape="rect"
href="/Top/Shopping/Vehicles/Autos/">Autos</a>, <a shape="rect"
href="/Top/Shopping/Clothing/">Clothing</a>, <a shape="rect"
href="/Top/Shopping/Gifts/">Gifts</a>, ...</font></p><p>
40 <b><a shape="rect" href="/Top/Society/">Society</a></b><br
clear="none"></br>
41 <font size="-1"><a shape="rect"
href="/Top/Society/Issues/">Issues</a>, <a shape="rect"
href="/Top/Society/People/">People</a>, <a shape="rect"
href="/Top/Society/Religion_and_Spirituality/">Religion</a>,
...</font></p><p>
42 <b><a shape="rect" href="/Top/Sports/">Sports</a></b><br
clear="none"></br>
43 <font size="-1"><a shape="rect"
href="/Top/Sports/Basketball/">Basketball</a>, <a shape="rect"
href="/Top/Sports/Football/">Football</a>, <a shape="rect"
href="/Top/Sports/Soccer/">Soccer</a>, ...</font></p><p>
44 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="3" rowspan="1"><b> <a shape="rect"
href="/Top/World/">World</a></b><br clear="none"></br>
45 <font size="-1"><a shape="rect"
href="/Top/World/Deutsch/">Deutsch</a>, <a shape="rect"
href="/Top/World/Espa%C3%B1ol/">Español</a>, <a shape="rect"
href="/Top/World/Fran%C3%A7ais/">Français</a>, <a shape="rect"
href="/Top/World/Italiano/">Italiano</a>, <a shape="rect"
href="/Top/World/Japanese/">Japanese</a>, <a shape="rect"
href="/Top/World/Korean/">Korean</a>, <a shape="rect"
href="/Top/World/Nederlands/">Nederlands</a>, <a shape="rect"
href="/Top/World/Polska/">Polska</a>, <a shape="rect"
href="/Top/World/Svenska/">Svenska</a>, ...</font><p>
46 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="1" nowrap="nowrap" rowspan="1"><font
size="-1"> </font></td></tr><tr><td colspan="4" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><font size="-1"><a
shape="rect"
href="http://www.google.com/ads/">Advertise with Us</a> - <a
shape="rect"
href="http://www.google.com/about.html">Jobs, Press, Cool Stuff...</a></font><p><font
face="arial,sans-serif" size="-1"> ©2004 Google</font></p><br
clear="none"></br><table align="center" border="0" bgcolor="#336600"
cellpadding="3" cellspacing="0" ><tr><td colspan="1" rowspan="1"> <table
width="100%" cellpadding="2" cellspacing="0" border="0"><tr
align="center"> <td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2" color="#ffffff" >Help build the largest
human-edited directory on the web.</font></td></tr><tr align="center"
bgcolor="#cccccc"><td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2">
47 <a shape="rect" href="http://dmoz.org/add.html">
48 Submit a Site</a> - <a shape="rect"
href="http://dmoz.org/about.html"><b> Open Directory Project</b></a> -
49 <a shape="rect" href="http://dmoz.org/cgi-bin/apply.cgi">Become
an Editor</a> </font>
50 </td></tr></table>
51 </td></tr></table>
52 </center></body></html>
53
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:57 IST 2005
[thufir@arrakis tagSoup]$
Thanks,

Thufir

Aug 15 '05 #1
ha**********@gmail.com wrote:
> I'm trying to do some "screen scraping",

As a general rule, this sucks. It's a vile process and very brittle
(they change their site without telling you, your code dies). It's
impossible to say how hard or easy it is - it's massively dependent on
the target page you're scraping. Even within one site, one page may be
easy and another a nightmare.

It's also increasingly unnecessary and even illegal to do it. Chances
are that if they _want_ you to have the content there will be an RSS
version of it, and if they don't then they'll get pissy with suits over
it.

So these days you can quite probably go the easy route, and if you
can't then there are problems ahead anyway.

> First I'd like to convert XHTML to XML,


If your input is XHTML, then life is a lot easier than if it's HTML.
XHTML _is_ XML, which means that it should be amenable to processing
with XML tools - these are generally much easier to work with than HTML
parsing tools.
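
For what it's worth, here is a minimal sketch of what "processing with XML tools" can look like. It assumes Python's standard `xml.etree.ElementTree` and a made-up, cut-down stand-in for the TagSoup output; the sample markup and the `extract_links` name are illustrative assumptions, not anything from TagSoup or the book.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for google.xhtml.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Google Directory</title></head>
  <body>
    <b><a shape="rect" href="/Top/Arts/">Arts</a></b>
    <b><a shape="rect" href="/Top/Business/">Business</a></b>
  </body>
</html>"""

# The elements live in the XHTML namespace, so lookups have to name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text):
    """Return (link text, href) pairs for every <a> element."""
    root = ET.fromstring(xhtml_text)
    return [(a.text, a.get("href")) for a in root.iterfind(".//x:a", NS)]


if __name__ == "__main__":
    for text, href in extract_links(XHTML):
        print(text, href)
```

The point is only that a plain XML parser accepts the document as it stands; the one XHTML-specific wrinkle is the namespace on every element.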

OTOH, XHTML is still rare on the web. Even where you do see it, Appendix C
means that it has to be served up as HTML rather than XML (and may no
longer work correctly as XML). Additionally much of it is still just
broken, as the web always has been. Be wary of any page with externally
served ads on it!

You will probably get your project developed most effectively by first
hacking around with a few well-behaved RSS or Atom feeds (BBC and
Google are good sources). Learn to work through half the problem before
you have to dive into the nasty half of straining through random tag
soup.

Your previously described architecture looked like an awful lot of
layers - I've never needed to use that many stages of processing.

Aug 15 '05 #2
di*****@codesmiths.com wrote:
> ha**********@gmail.com wrote:
>
> It's also increasingly unnecessary and even illegal to do it.

Why should it be illegal to save a (public) html-file and modify it? You can
save it with the "save as" function of your browser as well!

If you save it for your own use I do not think it is illegal.

> Chances are that if they _want_ you to have the content there will be an
> RSS version of it, and if they don't then they'll get pissy with suits
> over it.


They might get pissy, but tell me: why do they publish that information on
the web?!

regards

Andreas
Aug 15 '05 #3
Andreas Baier wrote:
> ...
>> Chances are that if they _want_ you to have the content there will be an
>> RSS version of it, and if they don't then they'll get pissy with suits
>> over it.
>
> They might get pissy, but tell me: why do they publish that information on
> the web?!
>
> ...

Ok, let's take this to alt.ethics.web ;)

If there's a beef, it should really be with O'Reilly for publishing the
hack <http://hacks.oreilly.com/pub/h/2125>. Of course, they're
probably protected by the "free speech" rights part of the
constitution, but IANAL. (Bill of rights? which part?)

Anyhow, it's for personal use. I'm not republishing the data, which'd
be slimy. I don't know that it's illegal, but that'd be slimy.
Whether it's illegal or slimy, I'm sure there's a book on it. I'm sure
there are books on spam, for example.

-Thufir

Aug 15 '05 #4
di*****@codesmiths.com wrote:
> ...
> If your input is XHTML, then life is a lot easier than if it's HTML.
> XHTML _is_ XML, which means that it should be amenable to processing
> with XML tools - these are generally much easier to work with than HTML
> parsing tools.
>
> ...

TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
the XHTML file for this trial run. I'm more trying to understand than
do anything practical at this stage.

If XHTML is XML, can the hack
"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>

be run on the XHTML file to create the XSLT Identity Stylesheet? If
not, is there some other "hack" to do something like that, but with
XHTML?

Once you have XHTML you have XML because, as you said, XHTML is XML.
However, there's all that extra stuff in there which makes it XHTML.
An XSL Stylesheet can turn the XHTML file into plain XML?

At the moment I just want to get some sort of XSLT Stylesheet to work
with as a baseline to try to understand this. Can Relaxer create an
Identity Stylesheet for an XHTML file? Once I have an Identity
Stylesheet, that'd be something to work with.

I'm also working on this from the direction of Cocoon as per "Use
Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It for
Data" <http://hacks.oreilly.com/pub/h/2125>. Right now I'm just trying
to figure out how to convert XHTML to XML. I know that an XSLT can be
used, but can an Identity Stylesheet for an XHTML file be generated?
Thanks,

Thufir

Aug 15 '05 #5
ha**********@gmail.com wrote:
> If XHTML is XML, can the hack
> "Generate an XSLT Identity Stylesheet with Relaxer"
> <http://hacks.oreilly.com/pub/h/2069>
> be run on the XHTML file to create the XSLT Identity Stylesheet? If
> not, is there some other "hack" to do something like that, but with
> XHTML?

An identity stylesheet won't do anything for you. When run on an XSL
processor, it just takes an XML document and spits out the same document.

> Once you have XHTML you have XML because, as you said, XHTML is XML.
> However, there's all that extra stuff in there which makes it XHTML.
> An XSL Stylesheet can turn the XHTML file into plain XML?


Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
XSL, XML Schema, eclipse .project files, ant build scripts, and
thousands of others.

I guess you mean, can you use an XSL stylesheet to extract the data in
which you are interested, from the XHTML? Yes, you can, but as someone
pointed out, it will be pretty brittle -- someone introduces or removes
a <span> around "your" data, and you will probably have to edit your
transform stylesheet. You will get tired.
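
To make the brittleness concrete, here is a small sketch in Python (standard library only) rather than XSLT. The two page snippets and both function names are invented for illustration; the first function behaves like a rule that matches one fixed location, the second like a looser "anywhere below" rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after the site wraps the value in a <span>.
BEFORE = "<html><body><p>Price: <b>42</b></p></body></html>"
AFTER = "<html><body><p>Price: <span><b>42</b></span></p></body></html>"


def by_exact_path(page):
    """Look for <b> at one fixed location: html/body/p/b."""
    return ET.fromstring(page).findtext("body/p/b")


def by_descendant(page):
    """Looser rule: the first <b> anywhere in the document."""
    return ET.fromstring(page).findtext(".//b")


if __name__ == "__main__":
    print(by_exact_path(BEFORE), by_exact_path(AFTER))
    print(by_descendant(BEFORE), by_descendant(AFTER))
```

The exact path stops matching as soon as the `<span>` appears, while the looser rule survives that change but will happily pick up the wrong `<b>` on a busier page. Either way somebody ends up editing the rule.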

BTW, tools do exist to generate extractors, that can select data on web
pages and then be set loose to suck data out of a large set of
almost-identical, or identically-generated pages. But these are expensive.

Soren
Aug 15 '05 #6
Soren Kuula wrote:
> ...
> An identity stylesheet won't do anything for you. When run on an XSL
> processor, it just takes an XML document and spits out the same document.

That's ok, it'd be a starting point.

So, Relaxer should be able to generate an identity stylesheet?
Then I could modify the stylesheet to actually extract the data?

> ...
> Really, XHTML is plain XML. It's one XML-based language. Others are RSS,
> XSL, XML Schema, eclipse .project files, ant build scripts, and
> thousands of others.
>
> I guess you mean, can you use an XSL stylesheet to extract the data in
> which you are interested, from the XHTML? Yes, you can, but as someone
> pointed out, it will be pretty brittle -- someone introduces or removes
> a <span> around "your" data, and you will probably have to edit your
> transform stylesheet. You will get tired.
>
> ...

This is just for a one off, to see how it works. I recognize the
brittleness of it conceptually, although I don't exactly know what a
span is, besides being a linear algebra term.

If it breaks, that's ok.
-Thufir

Aug 15 '05 #7
On 15 Aug 2005 15:36:44 -0700, "ha**********@gmail.com"
<ha**********@gmail.com> wrote:
> TagSoup <http://mercury.ccil.org/~cowan/XML/tagsoup/> nicely creates
> the XHTML file for this trial run. I'm more trying to understand than
> do anything practical at this stage.

You have roughly three problems to solve.

- Turning HTML tag soup into a sensible document

- Turning "information" into "data"

- Turning your minimal raw data into something useful.

TagSoup appears to solve the first one for you - a series of SAX events
may be enough to work from, without even needing to save it as a
"document".

The second is the hard one, and the one that's most dependent on the
target site. A well-coded semantically-detailed site is easy,
pixel-based "visual design" can be almost impossible. You need to
identify relationships in the page such that "the row below the row
containing the string 'Weather' will have the expected temperature in the
third column" - then you implement something (perhaps in complicated
XPath and simple XSLT) that can implement this rule and extract the
useful datum.
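
As a rough illustration of that kind of rule, here is a sketch in Python (standard library only) instead of XPath/XSLT. The table contents and the `temperature_below` name are made up; in XPath 1.0 the same rule would be roughly `//tr[td='Weather']/following-sibling::tr[1]/td[3]`, which ElementTree's limited XPath subset can't evaluate, hence the explicit loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an already-cleaned page.
PAGE = """<table>
  <tr><td>Sport</td><td>Scores</td><td>More</td></tr>
  <tr><td>Weather</td><td>Today</td><td>Tomorrow</td></tr>
  <tr><td>Dublin</td><td>cloudy</td><td>17C</td></tr>
</table>"""


def temperature_below(table, label="Weather"):
    """Third cell of the row that follows the first row containing `label`."""
    rows = table.findall(".//tr")
    for i, row in enumerate(rows[:-1]):
        # "Containing the string": look at all text inside the row.
        if label in "".join(row.itertext()):
            cells = rows[i + 1].findall("td")
            return cells[2].text if len(cells) >= 3 else None
    return None


if __name__ == "__main__":
    print(temperature_below(ET.fromstring(PAGE)))
```

Everything site-specific sits in that one function, so when the page layout changes there is only one place to fix.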

Processing the raw data out into a useful output is a perfect XSLT task.
This is relatively easy.

> If XHTML is XML, can the hack
> "Generate an XSLT Identity Stylesheet with Relaxer"
> <http://hacks.oreilly.com/pub/h/2069>

I have no idea what that is - it just looks like a title to me.

I can't even think what an "identity stylesheet" would be either - at
least not in any useful context.

> An XSL Stylesheet can turn the XHTML file into plain XML?


There's no such thing as "plain XML". All XML documents have a schema -
although there's an abstract concept of "plain XML", you can't have any
real concrete document without some level of schema. You might not write
a formal schema, you might not even think through exactly what's in it,
but as soon as you start giving your elements names, then you've started
defining some degree of schema. Now if you have to have a schema, and if
it has to represent web content, then you might as well be using XHTML
for it !

The intermediate format (between steps 2 & 3) has less reason to be
XHTML. It's more likely to be application specific and _just_ holding
the core data. The schema here could be custom-rolled, or it might be
some sort of pre-existing weather schema, ebXML catalogue information,
or even good old Dublin Core.

As a development route, I'd suggest trying to turn some feed (Amazon top
10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
into a usable page. This should be an easy enough example to work on -
then you can try a more awkward site. I think that hand-coding a simple
example will give you better experience than diving in with Cocoon and
having random magic happen in front of you that you don't really
understand what it's doing (and Cocoon isn't my obvious thought for a
first tool to use).
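
A hand-coded first attempt along those lines might look like the sketch below. It assumes Python's standard library and a tiny invented Atom document; the feed contents, the example.org URLs and the `feed_to_html` name are placeholders, not a real feed.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Hypothetical two-entry feed standing in for a real "top stories" feed.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Top stories</title>
  <entry><title>First &amp; foremost</title><link href="http://example.org/1"/></entry>
  <entry><title>Second story</title><link href="http://example.org/2"/></entry>
</feed>"""


def feed_to_html(feed_text):
    """Turn each Atom entry into an <li> holding a link to the story."""
    root = ET.fromstring(feed_text)
    items = []
    for entry in root.findall("a:entry", ATOM):
        title = entry.findtext("a:title", default="", namespaces=ATOM)
        link = entry.find("a:link", ATOM)
        href = link.get("href", "") if link is not None else ""
        items.append('<li><a href="%s">%s</a></li>' % (escape(href), escape(title)))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(feed_to_html(FEED))
```

With a feed as input there is no tag soup to strain, so all the effort goes into the "data to useful page" half of the problem.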
--
Cats have nine lives, which is why they rarely post to Usenet.
Aug 16 '05 #8
Andy Dingley wrote:
> ...
> I have no idea what that is - it just looks like a title to me.

Heh, I was *hoping* someone who owned the book would respond ;)

> I can't even think what an "identity stylesheet" would be either - at
> least not in any useful context.

I don't think it's useful in and of itself. In the example from that
book, IIRC they take an xml document, time.xml I believe, create an
XSLT, run the two through ?xerces? and the result is, essentially,
time.xml; no, it's not useful.

The point is to automatically create a stylesheet which does nothing,
then edit the stylesheet, versus creating the stylesheet from scratch.
It's in the same section as using GUI XSL editors.

>> An XSL Stylesheet can turn the XHTML file into plain XML?
>
> There's no such thing as "plain XML". All XML documents have a schema -
> although there's an abstract concept of "plain XML", you can't have any
> real concrete document without some level of schema. You might not write
> a formal schema, you might not even think through exactly what's in it,
> but as soon as you start giving your elements names, then you've started
> defining some degree of schema. Now if you have to have a schema, and if
> it has to represent web content, then you might as well be using XHTML
> for it !


Ah, thank you. I learned something, the "schema" "thing" makes more
sense now. I don't quite get where XML leaves off and XHTML starts,
although I do know a bit. XML doesn't have reserved words(?) while
XHTML does, like <p> for paragraph. XML being more general. I should
read a bit more about XHTML.

> ...
> As a development route, I'd suggest trying to turn some feed (Amazon top
> 10 books ? BBC top stories ?) into RSS 1.0 or Atom, then turning that
> into a usable page. This should be an easy enough example to work on -
> then you can try a more awkward site. I think that hand-coding a simple
> example will give you better experience than diving in with Cocoon and
> having random magic happen in front of you that you don't really
> understand what it's doing (and Cocoon isn't my obvious thought for a
> first tool to use).
>
> ...

Ok, that sounds good. I think I bit off a bit more than I can chew at
the moment. I'll work from a feed then.
Thanks,

Thufir

Aug 16 '05 #9
ha**********@gmail.com wrote:
> Heh, I was *hoping* someone who owned the book would respond ;)

I haven't bought an O'Reilly in years.

>> I can't even think what an "identity stylesheet" would be either - at
>> least not in any useful context.

Ah - I think I see what this "identity" stylesheet is about.

An identity transform turns "A" into "A". There's an obvious way to
write one in XSLT that uses wildcards to copy everything, as the
identity transform. However (given a schema or even an example of
input) it would be possible to generate a "longhand" identity
stylesheet that did each element explicitly. This could then be
modified to process each element differently, as you required it.

However this is just a time-saving measure for writing it, not some
fundamental technique. You can code your own pretty easily.
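
The same "copy everything, then override the bits you care about" pattern can be sketched outside XSLT too. Below is a Python version using only the standard library; it is an analogy to an identity stylesheet, not what Relaxer generates. The `time`/`hour`/`minute` document is a guess at what a time.xml might contain, and `transform` and `rename_hour` are made-up names.

```python
import xml.etree.ElementTree as ET


def transform(elem, overrides=None):
    """Copy elem recursively. `overrides` maps a tag name to a function that
    builds the output element for that tag instead of the plain copy."""
    overrides = overrides or {}
    if elem.tag in overrides:
        out = overrides[elem.tag](elem)
    else:
        out = ET.Element(elem.tag, dict(elem.attrib))
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        out.append(transform(child, overrides))
    return out


# Hypothetical input document.
DOC = "<time><hour>11</hour><minute>59</minute></time>"


def rename_hour(elem):
    """Example override: emit <h> where the input has <hour>."""
    return ET.Element("h", dict(elem.attrib))


if __name__ == "__main__":
    src = ET.fromstring(DOC)
    print(ET.tostring(transform(src), encoding="unicode"))
    print(ET.tostring(transform(src, {"hour": rename_hour}), encoding="unicode"))
```

With no overrides the output is the input again, which is all an identity transform does; each entry added to `overrides` plays the part of one edited template. Comments and processing instructions are not carried over here, since `ET.fromstring` drops them by default.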

I've seen any number of "stylesheet generator" tools over the years,
from Schematron onwards. Supposedly you can transform anything to
anything, with auto-generated XSLT, based on input schemas and clever
code. However this whole area is a technique that has singularly
_failed_ to deliver useful products (unusual for XML tools). I'm
enormously skeptical about them.

Aug 16 '05 #10
