comparing HTML tables

Hi,

I have pairs of tables that I need to compare, producing another table
that shows the differences. I can't just open them in separate browser
windows and look for differences, because I have many such table pairs,
so the process has to be automated.

Tables can differ in a number of ways: columns and rows can be added
or missing, and the values of cells can change. As an example of the
differences, I'm attaching two tables below.

Ideally, I would like some library that would do the comparison,
something that would use the Xerces library.

Any clues would be great.
Thanks,
Irek

******************************

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>cost</title>
</head>
<body>
<table border="yes" summary="cost">
<tr>
<th class="text" colspan="2">Category</th>
<th class="text">Cost</th>
<th class="text">Total</th>
</tr>
<tr>
<th class="text" rowspan="4">Link Cost</th>
<th class="text">Cable</th>
<td class="money">0.00</td>
<td class="money" rowspan="4">44,615.00</td>
</tr>
<tr>
<th class="text">Fiber</th>
<td class="money">18,575.00</td>
</tr>
<tr>
<th class="text">Channel</th>
<td class="money">26,040.00</td>
</tr>
<tr>
<th class="text">Equipment</th>
<td class="money">0.00</td>
</tr>
<tr>
<th class="text" rowspan="2">Node Cost</th>
<th class="text">Electrical</th>
<td class="money">130,820.00</td>
<td class="money" rowspan="2">163,180.00</td>
</tr>
<tr>
<th class="text">Optical</th>
<td class="money">32,360.00</td>
</tr>
<tr class="total">
<th class="text" colspan="3">Total Network Cost</th>
<td class="money">207,795.00</td>
</tr>
</table>
</body>
</html>

******************************

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>cost</title>
</head>
<body>
<table border="yes" summary="cost">
<tr>
<th class="text" colspan="2">Category</th>
<th class="text">Cost</th>
</tr>
<tr>
<th class="text" rowspan="3">Link Cost</th>
<th class="text">Fiber</th>
<td class="money">18,575.00</td>
</tr>
<tr>
<th class="text">Channel</th>
<td class="money">26,040.00</td>
</tr>
<tr>
<th class="text">Equipment</th>
<td class="money">200.00</td>
</tr>
<tr>
<th class="text" rowspan="2">Node Cost</th>
<th class="text">Electrical</th>
<td class="money">130,820.00</td>
</tr>
<tr>
<th class="text">Optical</th>
<td class="money">32,360.00</td>
</tr>
</table>
</body>
</html>

May 12 '07 #1
On 12 May, 14:47, irek.szczesn...@gmail.com wrote:
> I have table pairs that I need to compare,

This is a good reason to use XHTML rather than HTML. Convert it with
Tidy if necessary.

Then use XSLT to do your comparison.

May 14 '07 #2

Andy Dingley <di*****@codesmiths.com> wrote in
<11**********************@l77g2000hsb.googlegroups.com>:
> On 12 May, 14:47, irek.szczesn...@gmail.com wrote:
>> I have table pairs that I need to compare,
>
> This is a good reason to use XHTML rather than HTML.
> Convert it with Tidy if necessary.
>
> Then use XSLT to do your comparison.

I don't believe XHTML conversion is strictly necessary if
XSLT is to be used for comparison. At least some of the
common XSLT processors are perfectly capable of
transforming DOM trees built using an HTML parser
(xsltproc --html comes to mind). On the other hand, coding
an xmldiff is hardly what I would call a trivial task, so
it would probably be better to convert to XHTML, then use
one of the existing xmldiff implementations instead of
XSLT.

--
Pavel Lepin
May 14 '07 #3
On 14 May, 10:15, Pavel Lepin <p.le...@ctncorp.com> wrote:
> it would probably be better to convert to XHTML, then use
> one of the existing xmldiff implementations instead of
> XSLT.

xmldiff is rarely the same in an application sense as "compare
tables". For one thing you need to restrict the scope to _just_ the
table, secondly there can be a wide range of variations that are
merely "whitespace" to an application but they're significant to
canonical XML.

May 14 '07 #4
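
The two points in the post above (limit the comparison to the table itself, and ignore whitespace that only matters to the serialisation) can be sketched without XSLT. This is a minimal Python illustration, not something posted in the thread. It uses the standard library's `html.parser`; the names `TableCellExtractor` and `table_rows` are made up for this example, and it assumes every `tr`, `th` and `td` has an explicit end tag, as in the sample tables.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of th/td cells, row by row, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the cell being built

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so layout-only changes are ignored.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return the table content of an HTML document as a list of rows."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Two documents whose tables differ only in indentation, line breaks or surrounding page content then yield equal lists, so a plain `==` is enough to tell whether the cell content changed.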

Andy Dingley <di*****@codesmiths.com> wrote in
<11**********************@h2g2000hsg.googlegroups.com>:
> On 14 May, 10:15, Pavel Lepin <p.le...@ctncorp.com> wrote:
>> it would probably be better to convert to XHTML, then use
>> one of the existing xmldiff implementations instead of
>> XSLT.
>
> xmldiff is rarely the same in an application sense as
> "compare tables". For one thing you need to restrict the
> scope to _just_ the table,

If that is the case, then pulling the stuff-to-be-compared
out of the source documents is, indeed, a job for XSLT. I'm
just not really sure about the whole idea of implementing
xmldiff in XSLT. It would be challenging enough with XSLT2,
and outright maddening with XSLT1.

> secondly there can be a wide range of variations that are
> merely "whitespace" to an application but they're
> significant to canonical XML.

Shouldn't that be vice versa? In any case, I believe XML,
when properly used, should not rely on serialisation
details; goes against the grain. And that'd be the
behaviour I would expect out of any worthy xmldiff
implementation.

--
Pavel Lepin
May 14 '07 #5
On 14 May, 13:37, Pavel Lepin <p.le...@ctncorp.com> wrote:
> Andy Dingley <ding...@codesmiths.com> wrote in
> <1179140351.311695.307...@h2g2000hsg.googlegroups.com>:
>> On 14 May, 10:15, Pavel Lepin <p.le...@ctncorp.com> wrote:
>>> it would probably be better to convert to XHTML, then use
>>> one of the existing xmldiff implementations instead of
>>> XSLT.
>>
>> xmldiff is rarely the same in an application sense as
>> "compare tables". For one thing you need to restrict the
>> scope to _just_ the table,
>
> If that is the case, then pulling the stuff-to-be-compared
> out of the source documents is, indeed, a job for XSLT. I'm
> just not really sure about the whole idea of implementing
> xmldiff in XSLT.

Who said that? If I _wanted_ xmldiff, I'd use it.

In fact, if I could use an application-independent diff (rather than
needing a smarter application-aware one), then I'd probably just use a
line-based plaintext diff. For two web pages that are generated by the
same code, their serialisations are usually going to be pretty
similar, so line-based comparison is "adequate". It also handles SGML
as easily as XML.
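
For reference, a line-based plaintext diff of this kind takes only a few lines. The sketch below is an illustration added here, not code from the thread. It uses Python's standard `difflib.unified_diff`; the wrapper name `line_diff` and the `old`/`new` labels are arbitrary choices for the example.

```python
import difflib


def line_diff(old_text, new_text):
    """Return the unified diff of two documents as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",   # inputs carry no newlines, so headers should not either
    ))
```

Lines starting with `-` exist only in the old document and lines starting with `+` only in the new one. The result knows nothing about rows, columns or spans, which is the limitation discussed in the rest of the thread.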

>> secondly there can be a wide range of variations that are
>> merely "whitespace" to an application but they're
>> significant to canonical XML.
>
> Shouldn't that be vice versa? In any case, I believe XML,
> when properly used, should not rely on serialisation
> details;

Canonical XML is what you get when you take the underlying XML
Infoset, and ignore any variations that can be caused merely by
permissible variations in serialisation. It's a basic requirement for
any sort of xmldiff that it will be able to compare canonical forms.

However, in most application cases this isn't particularly useful.
It's very common (as an example) to need to compare unsorted table
rows where one column is an ordinal identifier (identifying the pairs
of rows to be compared), other columns are "data" to be compared, and
some columns are timestamps or trivial annotations that should be
ignored. This is quite easy for a competent XSLT coder to do a second
time (it's admittedly hard to do _anything_ in XSLT for the first
time). It's impractical for a low-level xmldiff to understand what's
significant and what isn't, though.

May 14 '07 #6
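
The application-aware comparison described in the post above (pair rows by an identifier column, compare the data columns, skip the columns that don't matter) is shown here in Python rather than XSLT, purely to illustrate the logic. It is a sketch under assumptions: rows are taken to be already extracted into dictionaries, identifiers are unique within each table, and the function name `compare_keyed_rows` and the column names in the usage example are invented for the illustration.

```python
def compare_keyed_rows(old_rows, new_rows, key, ignore=()):
    """Compare two unsorted lists of row dicts, pairing rows by row[key].

    Returns a list of (kind, identifier, details) tuples, where kind is
    "removed", "added" or "changed". For "changed", details maps each
    differing column to an (old_value, new_value) pair.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    skipped = {key} | set(ignore)
    changes = []
    for ident in sorted(old_by_key.keys() | new_by_key.keys()):
        if ident not in new_by_key:
            changes.append(("removed", ident, None))
        elif ident not in old_by_key:
            changes.append(("added", ident, None))
        else:
            old, new = old_by_key[ident], new_by_key[ident]
            columns = sorted((old.keys() | new.keys()) - skipped)
            details = {c: (old.get(c), new.get(c))
                       for c in columns if old.get(c) != new.get(c)}
            if details:
                changes.append(("changed", ident, details))
    return changes


old = [
    {"item": "Cable", "cost": "0.00", "stamp": "run 1"},
    {"item": "Fiber", "cost": "18,575.00", "stamp": "run 1"},
    {"item": "Equipment", "cost": "0.00", "stamp": "run 1"},
]
new = [
    {"item": "Equipment", "cost": "200.00", "stamp": "run 2"},
    {"item": "Fiber", "cost": "18,575.00", "stamp": "run 2"},
]
report = compare_keyed_rows(old, new, key="item", ignore=("stamp",))
```

Row order plays no part in the result, and the `stamp` column never shows up as a difference because it is listed in `ignore`.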

Andy Dingley <di*****@codesmiths.com> wrote in
<11**********************@e65g2000hsc.googlegroups.com>:
> On 14 May, 13:37, Pavel Lepin <p.le...@ctncorp.com> wrote:
>> Andy Dingley <ding...@codesmiths.com> wrote in
>> <1179140351.311695.307...@h2g2000hsg.googlegroups.com>:
>>> xmldiff is rarely the same in an application sense as
>>> "compare tables". For one thing you need to restrict
>>> the scope to _just_ the table,
>>
>> If that is the case, then pulling the
>> stuff-to-be-compared out of the source documents is,
>> indeed, a job for XSLT. I'm just not really sure about
>> the whole idea of implementing xmldiff in XSLT.
>
> Who said that? If I _wanted_ xmldiff, I'd use it.

The problem sounded general enough to me that a fairly
complete implementation of 'treediff' would be in order.

> In fact, if I could use an application-independent diff
> (rather than needing a smarter application-aware one),
> then I'd probably just use a line-based plaintext diff.
> For two web pages that are generated by the same code,
> their serialisations are usually going to be pretty
> similar, so line-based comparison is "adequate".

Granted, but IME line-based comparison is a bit too prone to
breakage when used on tree-like markup.

> It also handles SGML as easily as XML.

All of that is certainly true. But I believe we're lacking
information on the problem the OP is facing to make a
really good guess which of the options discussed (XSLT
application-specific comparison, xmldiff, plain-text diff)
would probably be best for him. So I think the best course
of action is to *list* those options for the OP's
consideration.

>>> secondly there can be a wide range of variations that
>>> are merely "whitespace" to an application but they're
>>> significant to canonical XML.
>>
>> Shouldn't that be vice versa? In any case, I believe XML,
>> when properly used, should not rely on serialisation
>> details;
>
> Canonical XML is what you get when you take the underlying
> XML Infoset, and ignore any variations that can be caused
> merely by permissible variations in serialisation. It's a
> basic requirement for any sort of xmldiff that it will be
> able to compare canonical forms.

That's precisely what I was talking about.

> However, in most application cases this isn't particularly
> useful. It's very common (as an example) to need to
> compare unsorted table rows where one column is an ordinal
> identifier (identifying the pairs of rows to be compared),
> other columns are "data" to be compared, and some columns
> are timestamps or trivial annotations that should be
> ignored.

I was under the impression the OP didn't care much for the
underlying meaning of his documents, and was mostly
interested in diffing the structure of his documents
intelligently. In case I'm mistaken, canned solutions such
as diffs of any sort wouldn't do him much good indeed.
Perhaps all of that confusion can be attributed to the
ambiguity of the 'table' term in HTML context ('table' as
in 'markup used to denote a table' vs. 'table' as
in 'tabular data').

> This is quite easy for a competent XSLT coder to
> do a second time (it's admittedly hard to do _anything_ in
> XSLT for the first time).

Aww, XSLT is not that bad. Actually, it's quite fun even for
newbies (given that newbies in question are willing to
stretch their minds a bit).

> It's impractical for a low-level xmldiff to understand
> what's significant and what isn't though.

Definitely, it's just that I was recently tinkering with
xmldiffing, and OP's description of his problem sounded
suspiciously close to my final formulation of *my* problem.
(So perhaps that's just a case of tunnel vision on my
part.)

--
Pavel Lepin
May 14 '07 #7
Many thanks for the responses to my post.

On May 14, 3:22 pm, Andy Dingley <ding...@codesmiths.com> wrote:
> However, in most application cases this isn't particularly useful.
> It's very common (as an example) to need to compare unsorted table
> rows where one column is an ordinal identifier (identifying the pairs
> of rows to be compared), other columns are "data" to be compared, and
> some columns are timestamps or trivial annotations that should be
> ignored. This is quite easy for a competent XSLT coder to do a second
> time (it's admittedly hard to do _anything_ in XSLT for the first
> time).

Could you please point me to some example of XSLT that compares two
tables? By a table I mean the HTML table, something that is encoded
by the "table", "td" and other tags. The tables that I have use the
colspan and rowspan attributes.
Thanks,
Irek

May 14 '07 #8
> All of that is certainly true. But I believe we're lacking
> information on the problem the OP is facing to make a
> really good guess which of the options discussed (XSLT
> application-specific comparison, xmldiff, plain-text diff)
> would probably be best for him. So I think the best course
> of action is to *list* those options for the OP's
> consideration.

My apologies if I wasn't clear enough. I know that a plain-text diff is
no good for me. xmldiff could perhaps be useful, but as far as I
could tell from my short Internet research, there is no accepted
diffing protocol for XML. Moreover, tools like Xydiff, diffxml and
xmldiff seem unmaintained and forgotten.

And since I'm dealing with data that's encoded with the HTML table
tags, I'm looking for a way to compare the tables and produce a
table that shows the differences. My tables use the colspan and
rowspan attributes, so they don't have a strictly tabular format,
i.e. one in which the number of data cells equals the product of
the number of rows and the number of columns. I examined tools like
HTMLMatch, HTMLDiff and Compare HTML, but the tables of differences
they produced were just wrong.
Best,
Irek

May 14 '07 #9
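
One way to deal with the colspan and rowspan problem raised in the last post is to expand each table into a full grid first, copying every spanning cell into all the slots it covers, and only then compare slot by slot. The Python sketch below illustrates that idea. It is not a solution anyone in the thread proposed, and the names `SpanTableParser`, `expand_grid` and `diff_grids` are hypothetical. It assumes one table per document, explicit end tags, and numeric span values.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Parse a table into rows of (text, rowspan, colspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None   # [text fragments, rowspan, colspan] inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan", 1)), int(a.get("colspan", 1))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())
            self.rows[-1].append((text, rowspan, colspan))
            self._cell = None


def expand_grid(html):
    """Return {(row, col): text}, with spanning cells copied into every slot."""
    parser = SpanTableParser()
    parser.feed(html)
    parser.close()
    grid = {}
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:   # slot already taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid


def diff_grids(old, new):
    """Return {(row, col): (old_text, new_text)} for every slot that differs.

    None stands for a slot that exists in only one of the two grids.
    """
    return {pos: (old.get(pos), new.get(pos))
            for pos in sorted(old.keys() | new.keys())
            if old.get(pos) != new.get(pos)}
```

The comparison is purely positional, so a deleted row (such as "Cable" in the sample tables) shifts every row beneath it and shows up as a long run of changes. Handling that well would mean aligning rows by their header cells before comparing, which this sketch does not attempt.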