XHTML to XML conversion

I'm trying to do some "screen scraping", and am using
<http://www.oreilly.com/catalog/xmlhks/> for inspiration.

First I'd like to convert XHTML to XML, or extract XML from XHTML; I'm
not sure how to phrase that.

"Use Cocoon to Create a Well-Formed View of a Web Page, Then Scrape It
for Data"
<http://hacks.oreilly.com/pub/h/2125>

is what I'd like to do down the line, but for now I'm working on
something simpler.
First,

"Convert an HTML Document to XHTML with HTML Tidy"
<http://hacks.oreilly.com/pub/h/2054>

Instead of Tidy, I went with TagSoup
<http://mercury.ccil.org/~cowan/XML/tagsoup/>.
Then I'd like to go from XHTML to XML in order to:

"Generate an XSLT Identity Stylesheet with Relaxer"
<http://hacks.oreilly.com/pub/h/2069>

How do I get the XML from the XHTML, please?

Here's what I have:

[thufir@arrakis tagSoup]$
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:13 IST 2005
[thufir@arrakis tagSoup]$ pwd
/home/thufir/Desktop/tagSoup
[thufir@arrakis tagSoup]$ ll
total 60
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ java -jar tagsoup.jar --files google.html
src: google.html dst: google.xhtml
[thufir@arrakis tagSoup]$ ll
total 76
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 10568 Aug 14 23:34 google.xhtml
-rw-rw-r-- 1 thufir thufir 42207 Aug 14 23:32 tagsoup.jar
[thufir@arrakis tagSoup]$ cat google.xhtml -n
1 <?xml version="1.0" standalone="yes"?>
2
3 <html version="-//W3C//DTD HTML 4.01 Transitional//EN"
xmlns="http://www.w3.org/1999/xhtml"><head><title>Google
Directory</title><style>&lt;!--
4 body,td,a,p,.h{ font-family: arial,sans-serif;}
.h{color:#008000}
.q{text-decoration:none; color:#0000cc;}
5 //--&gt;</style><script>
6 &lt;!--
7 function sf(){document.f.q.focus();}
8 // --&gt;
9 </script></head><body bgcolor="#ffffff" text="#000000"
link="#3300cc" vlink="#660066" alink="#ff0000" onload="sf();">
10 <center>
11 <table cellpadding="0" cellspacing="0" border="0"><tr><td
align="right" colspan="1" rowspan="1" valign="bottom"><img
src="http://www.google.com/images/hp0.gif" width="158" height="78"
alt="Google Directory"></img></td><td colspan="1" rowspan="1"
valign="bottom"><img src="http://www.google.com/images/hp1.gif"
width="50" height="78" alt=""></img></td><td colspan="1" rowspan="1"
valign="bottom"><img src="http://www.google.com/images/hp2.gif"
width="68" height="78" alt=""></img></td></tr><tr><td align="right"
colspan="1" rowspan="1" valign="top" class="h"><b>Directory</b></td><td
colspan="1" rowspan="1" valign="top"><img
src="http://www.google.com/images/hp3.gif" width="50" height="32"
alt=""></img></td><td colspan="1" rowspan="1" valign="top"
class="h"></td></tr></table><br clear="none"></br><table border="0"
cellspacing="0" cellpadding="0"><tr><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="0" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="0a" href="http://www.google.com/webhp?hl=en"><font
size="-1">Web</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="1" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="1a" href="http://www.google.com/imghp?hl=en"><font
size="-1">Images</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="2" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="2a" href="http://www.google.com/grphp?hl=en"><font
size="-1">Groups</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="3" bgcolor="#008000" width="95"><font color="#ffffff"
size="-1"><b>Directory</b></font></td><td colspan="1" rowspan="1"
width="15"> </td><td align="center" colspan="1" nowrap="nowrap"
rowspan="1" id="4" bgcolor="#efefef" width="95"><a shape="rect"
class="q" id="4a" href="http://www.google.com/nwshp?hl=en"><font
size="-1">News</font></a></td><td colspan="1" rowspan="1"
width="15"> </td><td colspan="1" rowspan="1"
width="15"> </td></tr><tr><td colspan="12" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><form
enctype="application/x-www-form-urlencoded" method="get"
action="http://www.google.com/search" name="f"><table cellpadding="0"
cellspacing="0"><tr align="middle" valign="center"><td colspan="1"
rowspan="1" width="150"> </td><td colspan="1" rowspan="1"><input
maxlength="256" type="text" name="q" size="40"
value=""></input><script>document.f.q.focus();</script><input
type="submit" name="btnG" value="Google Search"></input><input
type="hidden" name="hl" value="en"></input><input type="hidden"
name="cat" value="gwd/Top"></input></td><td align="left" colspan="1"
rowspan="1" width="150"><font size="-2"> • <a
shape="rect" href="http://www.google.com/dirhelp.html">Directory
Help</a></font></td></tr></table></form><p><font color="#008000"><b>The
web organized by topic into categories.</b></font></p><p></p><table
align="center" width="1%" border="0" cellspacing="7"
cellpadding="0"><tr><td colspan="4" rowspan="1" bgcolor="#008000"><img
width="1" height="1" alt=""></img></td></tr><tr><td colspan="1"
rowspan="1"> </td><td colspan="1" nowrap="nowrap" rowspan="1">
12 <b><a shape="rect" href="/Top/Arts/">Arts</a></b><br
clear="none"></br>
13 <font size="-1"><a shape="rect"
href="/Top/Arts/Movies/">Movies</a>, <a shape="rect"
href="/Top/Arts/Music/">Music</a>, <a shape="rect"
href="/Top/Arts/Television/">Television</a>, ...</font><p>
14 <b><a shape="rect" href="/Top/Business/">Business</a></b><br
clear="none"></br>
15 <font size="-1"><a shape="rect"
href="/Top/Business/Major_Companies/">Companies</a>, <a shape="rect"
href="/Top/Business/Financial_Services/">Finance</a>, <a shape="rect"
href="/Top/Business/Employment/">Jobs</a>, ...</font></p><p>
16 <b><a shape="rect" href="/Top/Computers/">Computers</a></b><br
clear="none"></br>
17 <font size="-1"><a shape="rect"
href="/Top/Computers/Internet/">Internet</a>, <a shape="rect"
href="/Top/Computers/Hardware/">Hardware</a>, <a shape="rect"
href="/Top/Computers/Software/">Software</a>, ...</font></p><p>
18 <b><a shape="rect" href="/Top/Games/">Games</a></b><br
clear="none"></br>
19 <font size="-1"><a shape="rect"
href="/Top/Games/Board_Games/">Board</a>, <a shape="rect"
href="/Top/Games/Roleplaying/">Roleplaying</a>, <a shape="rect"
href="/Top/Games/Video_Games/">Video</a>, ...</font></p><p>
20 <b><a shape="rect" href="/Top/Health/">Health</a></b><br
clear="none"></br>
21 <font size="-1"><a shape="rect"
href="/Top/Health/Alternative/">Alternative</a>, <a shape="rect"
href="/Top/Health/Fitness/">Fitness</a>, <a shape="rect"
href="/Top/Health/Medicine/">Medicine</a>, ...</font></p><p>
22 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
23 <b><a shape="rect" href="/Top/Home/">Home</a></b><br
clear="none"></br>
24 <font size="-1"><a shape="rect"
href="/Top/Home/Consumer_Information/">Consumers</a>, <a shape="rect"
href="/Top/Home/Homeowners/">Homeowners</a>, <a shape="rect"
href="/Top/Home/Family/">Family</a>, ...</font><p>
25 <b><a shape="rect" href="/Top/Kids_and_Teens/">Kids and
Teens</a></b><br clear="none"></br>
26 <font size="-1"><a shape="rect"
href="/Top/Kids_and_Teens/Computers/">Computers</a>, <a shape="rect"
href="/Top/Kids_and_Teens/Entertainment/">Entertainment</a>, <a
shape="rect" href="/Top/Kids_and_Teens/School_Time/">School</a>,
...</font></p><p>
27 <b><a shape="rect" href="/Top/News/">News</a></b><br
clear="none"></br>
28 <font size="-1"><a shape="rect"
href="/Top/News/Media/">Media</a>, <a shape="rect"
href="/Top/News/Newspapers/">Newspapers</a>, <a shape="rect"
href="/Top/News/Current_Events/">Current Events</a>, ...</font></p><p>
29 <b><a shape="rect"
href="/Top/Recreation/">Recreation</a></b><br
clear="none"></br>
30 <font size="-1"><a shape="rect"
href="/Top/Recreation/Food/">Food</a>, <a shape="rect"
href="/Top/Recreation/Outdoors/">Outdoors</a>, <a shape="rect"
href="/Top/Recreation/Travel/">Travel</a>, ...</font></p><p>
31 <b><a shape="rect" href="/Top/Reference/">Reference</a></b><br
clear="none"></br>
32 <font size="-1"><a shape="rect"
href="/Top/Reference/Education/">Education</a>, <a shape="rect"
href="/Top/Reference/Libraries/">Libraries</a>, <a shape="rect"
href="/Top/Reference/Maps/">Maps</a>, ...</font></p><p>
33 </p></td><td colspan="1" nowrap="nowrap" rowspan="1">
34 <b><a shape="rect" href="/Top/Regional/">Regional</a></b><br
clear="none"></br>
35 <font size="-1"><a shape="rect"
href="/Top/Regional/Asia/">Asia</a>, <a shape="rect"
href="/Top/Regional/Europe/">Europe</a>, <a shape="rect"
href="/Top/Regional/North_America/">North America</a>, ...</font><p>
36 <b><a shape="rect" href="/Top/Science/">Science</a></b><br
clear="none"></br>
37 <font size="-1"><a shape="rect"
href="/Top/Science/Biology/">Biology</a>, <a shape="rect"
href="/Top/Science/Social_Sciences/Psychology/">Psychology</a>, <a
shape="rect" href="/Top/Science/Physics/">Physics</a>,
...</font></p><p>
38 <b><a shape="rect" href="/Top/Shopping/">Shopping</a></b><br
clear="none"></br>
39 <font size="-1"><a shape="rect"
href="/Top/Shopping/Vehicles/Autos/">Autos</a>, <a shape="rect"
href="/Top/Shopping/Clothing/">Clothing</a>, <a shape="rect"
href="/Top/Shopping/Gifts/">Gifts</a>, ...</font></p><p>
40 <b><a shape="rect" href="/Top/Society/">Society</a></b><br
clear="none"></br>
41 <font size="-1"><a shape="rect"
href="/Top/Society/Issues/">Issues</a>, <a shape="rect"
href="/Top/Society/People/">People</a>, <a shape="rect"
href="/Top/Society/Religion_and_Spirituality/">Religion</a>,
...</font></p><p>
42 <b><a shape="rect" href="/Top/Sports/">Sports</a></b><br
clear="none"></br>
43 <font size="-1"><a shape="rect"
href="/Top/Sports/Basketball/">Basketball</a>, <a shape="rect"
href="/Top/Sports/Football/">Football</a>, <a shape="rect"
href="/Top/Sports/Soccer/">Soccer</a>, ...</font></p><p>
44 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="3" rowspan="1"><b> <a shape="rect"
href="/Top/World/">World</a></b><br clear="none"></br>
45 <font size="-1"><a shape="rect"
href="/Top/World/Deutsch/">Deutsch</a>, <a shape="rect"
href="/Top/World/Espa%C3%B1ol/">Español</a>, <a shape="rect"
href="/Top/World/Fran%C3%A7ais/">Français</a>, <a shape="rect"
href="/Top/World/Italiano/">Italiano</a>, <a shape="rect"
href="/Top/World/Japanese/">Japanese</a>, <a shape="rect"
href="/Top/World/Korean/">Korean</a>, <a shape="rect"
href="/Top/World/Nederlands/">Nederlands</a>, <a shape="rect"
href="/Top/World/Polska/">Polska</a>, <a shape="rect"
href="/Top/World/Svenska/">Svenska</a>, ...</font><p>
46 </p></td></tr><tr><td colspan="1" rowspan="1"> </td><td
colspan="1" nowrap="nowrap" rowspan="1"><font
size="-1"> </font></td></tr><tr><td colspan="4" rowspan="1"
bgcolor="#008000"><img width="1" height="1"
alt=""></img></td></tr></table><br clear="none"></br><font size="-1"><a
shape="rect"
href="http://www.google.com/ads/">Advertise with Us</a> - <a
shape="rect"
href="http://www.google.com/about.html">Jobs, Press, Cool Stuff...</a></font><p><font
face="arial,sans-serif" size="-1"> ©2004 Google</font></p><br
clear="none"></br><table align="center" border="0" bgcolor="#336600"
cellpadding="3" cellspacing="0"><tr><td colspan="1" rowspan="1"> <table
width="100%" cellpadding="2" cellspacing="0" border="0"><tr
align="center"><td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2" color="#ffffff">Help build the largest
human-edited directory on the web.</font></td></tr><tr align="center"
bgcolor="#cccccc"><td colspan="1" rowspan="1"><font face="sans-serif,
Arial, Helvetica" size="2">
47 <a shape="rect" href="http://dmoz.org/add.html">
48 Submit a Site</a> - <a shape="rect"
href="http://dmoz.org/about.html"><b> Open Directory Project</b></a> -
49 <a shape="rect" href="http://dmoz.org/cgi-bin/apply.cgi">Become
an Editor</a> </font>
50 </td></tr></table>
51 </td></tr></table>
52 </center></body></html>
53
[thufir@arrakis tagSoup]$ date
Sun Aug 14 23:34:57 IST 2005
[thufir@arrakis tagSoup]$

Thanks,

Thufir
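
On the question of getting XML from the XHTML: a well-formed XHTML file such as the google.xhtml produced above is already XML, so in principle any XML parser can read it directly. A minimal sketch in Python, assuming only the standard library; the SAMPLE string is a hypothetical, trimmed stand-in for google.xhtml and extract_links is an illustrative helper, not part of TagSoup:

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <html> element in the TagSoup output.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical excerpt standing in for google.xhtml (assumption: trimmed by hand).
SAMPLE = """<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Google Directory</title></head>
<body><b><a shape="rect" href="/Top/Arts/">Arts</a></b><br clear="none"></br>
<font size="-1"><a shape="rect" href="/Top/Arts/Movies/">Movies</a>,
<a shape="rect" href="/Top/Arts/Music/">Music</a></font></body></html>"""


def extract_links(xhtml_text):
    """Parse XHTML as plain XML and return (href, text) pairs for each <a>."""
    root = ET.fromstring(xhtml_text)
    tag = "{%s}a" % XHTML_NS  # ElementTree spells namespaced tags as {uri}name
    return [(a.get("href"), "".join(a.itertext()).strip()) for a in root.iter(tag)]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

The one thing that tends to trip people up is the default namespace: every element lives in the XHTML namespace, so a bare "a" would match nothing here.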

Aug 15 '05
di*****@codesmiths.com wrote:
....
An identity transform turns "A" into "A". There's an obvious way to
write one in XSLT that uses wildcards to copy everything, as the
identity transform. However (given a schema or even an example of
input) it would be possible to generate a "longhand" identity
stylesheet that did each element explicitly. This could then be
modified to process each element differently, as you required it.

However this is just a time-saving measure for writing it, not some
fundamental technique. You can code your own pretty easily.
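
The idea in the quoted paragraph can be sketched outside XSLT as well. The following is a rough Python analogue, not an XSLT stylesheet: it copies a tree unchanged by default and lets a per-element handler override the copy for chosen tags. The names identity, handlers, and b_to_strong are hypothetical, and the standard-library parser drops comments and processing instructions, so this is only an identity over elements, attributes, and text:

```python
import xml.etree.ElementTree as ET


def identity(elem, handlers=None):
    """Recursively copy elem; handlers maps a tag name to a function that
    may modify the freshly copied element (the "longhand" hook)."""
    handlers = handlers or {}
    new = ET.Element(elem.tag, dict(elem.attrib))
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(identity(child, handlers))
    handler = handlers.get(elem.tag)
    if handler is not None:
        handler(new)
    return new


def b_to_strong(elem):
    """Example per-element rule: rename <b> to <strong>."""
    elem.tag = "strong"


doc = ET.fromstring("<p>Hello <b>world</b>, <i>again</i></p>")

# With no handlers the copy serializes exactly like the input.
same = identity(doc)

# With one handler, only the matching element is treated differently.
changed = identity(doc, {"b": b_to_strong})

if __name__ == "__main__":
    print(ET.tostring(same, encoding="unicode"))
    print(ET.tostring(changed, encoding="unicode"))
```

Starting from "copy everything" and then adding one rule per element is the same working style the generated longhand stylesheet is meant to support.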

....
Take matrix A. Then there's the identity matrix I.

AI=?=IA

I forget. heh.
-Thufir
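
For the matrix analogy: for a square matrix A and the identity matrix I of the same size, AI = IA = A. A small pure-Python check, where matmul and identity_matrix are hypothetical helpers written just for this illustration:

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def identity_matrix(n):
    """Return the n-by-n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
I = identity_matrix(2)

if __name__ == "__main__":
    print(matmul(A, I) == A, matmul(I, A) == A)
```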

Aug 16 '05 #11


help on creading my pimp page

*** Sent via Developersdex http://www.developersdex.com ***
Aug 31 '05 #12
edgar arizmendi wrote:
help on creading my pimp page

*** Sent via Developersdex http://www.developersdex.com ***

why?

Sep 5 '05 #13
