Peter Flynn wrote:
> Pass the page through HTML Tidy so that it becomes well-formed XHTML.
Or use an HTML-to-XML-API parser, such as NekoHTML (part of the Xerces
family). Though Tidy has the advantage that, like a parser, it will
attempt to guess what completely bogus/broken/atrocious HTML was
intended to mean; I think NekoHTML is intended mostly for HTML that is
at least vaguely reasonable.
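To make the idea concrete, here is a rough sketch of the "sloppy HTML in,
well-formed XML out" step. It uses neither Tidy nor NekoHTML; it is a
stand-in built only on Python's standard library (html.parser and
xml.etree.ElementTree), and it is far less forgiving than either tool.
The names SloppyHTMLToTree, html_to_xml, VOID_TAGS and the "root" wrapper
element are my own inventions for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML (a partial list, for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SloppyHTMLToTree(HTMLParser):
    """Build an ElementTree from HTML, closing whatever the author forgot."""

    def __init__(self):
        super().__init__()
        # Hypothetical wrapper element so the output always has one root.
        self.root = ET.Element("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) get their name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Return a well-formed XML string for the given (possibly broken) HTML."""
    builder = SloppyHTMLToTree()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")


# Unquoted attribute, unclosed <b> and <a>, bare <br>: still yields valid XML.
print(html_to_xml("<p>Hello <b>world<br><a href=x.html>link</p>"))
```

Once the page is in that form you can point XPath or XSLT at it. For
real-world pages I would still reach for Tidy or NekoHTML, since this
sketch makes no attempt at the smarter guessing those tools do.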
As Peter said, and as I hinted: If you're doing this as a personal tool,
and are willing to continue to maintain it every time the folks running
the search engine break it, you can probably make this work well enough.
If you're doing it as a business tool, whoever runs that search engine
is going to work very hard to shut you down unless you've contracted
with them -- and if you've got a contract, you can probably pay them for
an XML interface for the search that doesn't include the advertising,
avoiding the whole problem.
Remember, search results are their product. They're putting a lot of
money into the software, machines, and network resources, and they're
providing the service to noncommercial users for no fee. They really are
entitled to make a fair profit on the commercial users and/or those
who aren't willing to look at advertising.
--
Joe Kesselman / Beware the fury of a patient man. -- John Dryden