Hi,
can anybody help me clean up really messy HTML from a news site
into really clean XHTML, which I would then like to analyze with
some qualitative analysis tools (probably exporting to plain
ASCII in the meantime, but not necessarily)? I can do a little
cleaning by hand, but with some hundreds of web pages, I was
hoping I could create an XSL stylesheet for the conversion.
I have downloaded this page
(http://news.bostonherald.com/localRe...6&format=text;
a copy is available at
http://www.ceplovi.cz/matej/tmp/downloaded.html). Then I ran it
through tidy (http://www.ceplovi.cz/matej/tmp/tidyfied.html). I
would love to get some really minimal, HTML 2.0-like XHTML
(something like http://www.ceplovi.cz/matej/tmp/clean.xhtml).
Is there any tool for doing things like that? I hoped to write
an XSL stylesheet myself, but I am quite a newbie in the XSL
arena, and there are some things I have not managed to do:
1) How do I tell the XSL processor to "skip everything between
<body> and the block-level tag that contains the same text as
<title>, minus some constant text or regex (e.g.,
"\s*BostonHerald.com.*:" or at least "BostonHerald.com -
Local/Regional News:")"? Of course, any closing tags left over
from the skipped part should be omitted as well.
2) How do I remove all tables without removing their content?
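
To pin down what the two steps above mean, here is a rough sketch
of the intended behaviour. It is written in Python rather than
XSL and is not a tested solution for the real pages. Assumptions:
the input is already well-formed XML (e.g. tidy's XML output with
numeric entities), the headline sits in one of the tags listed in
BLOCK_TAGS, and the fixed part of <title> ends at the first colon
(a non-greedy variant of the regex above). All function and
constant names are made up for illustration.

```python
import re
from xml.dom import minidom

# Assumption: the constant part of <title> ends at the first colon.
TITLE_PREFIX = re.compile(r"^\s*BostonHerald\.com.*?:\s*")
# Assumption: the headline lives in one of these block-level tags.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div"}
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th"}


def text_of(node):
    """Return all text below node with whitespace normalised."""
    parts = []

    def walk(n):
        if n.nodeType == n.TEXT_NODE:
            parts.append(n.data)
        for child in n.childNodes:
            walk(child)

    walk(node)
    return " ".join("".join(parts).split())


def find_headline(doc):
    """Question 1: first block element whose text is <title> minus the prefix."""
    title = text_of(doc.getElementsByTagName("title")[0])
    headline = TITLE_PREFIX.sub("", title)
    for el in doc.getElementsByTagName("*"):
        if el.tagName in BLOCK_TAGS and text_of(el) == headline:
            return el
    return None


def drop_everything_before(el):
    """Question 1: remove whatever precedes el inside <body>."""
    node = el
    while node.parentNode is not None and node.tagName != "body":
        while node.previousSibling is not None:
            node.parentNode.removeChild(node.previousSibling)
        node = node.parentNode


def unwrap_tables(doc):
    """Question 2: delete table markup but keep what is inside it."""
    for el in list(doc.getElementsByTagName("*")):
        if el.tagName in TABLE_TAGS:
            parent = el.parentNode
            # Move the children up in front of the element, then drop it.
            while el.firstChild is not None:
                parent.insertBefore(el.firstChild, el)
            parent.removeChild(el)


# Made-up miniature page, not the real Boston Herald markup.
SAMPLE = (
    "<html><head><title>BostonHerald.com - Local/Regional News: "
    "Big story</title></head><body><div>nav junk</div>"
    "<table><tr><td><h1>Big story</h1><p>Text.</p></td></tr></table>"
    "</body></html>"
)

if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    drop_everything_before(find_headline(doc))
    unwrap_tables(doc)
    # prints <body><h1>Big story</h1><p>Text.</p></body>
    print(doc.getElementsByTagName("body")[0].toxml())
```

Step 1 runs before step 2 on purpose: the wrapper cells that are
left over around the headline are table markup, so the unwrapping
step removes them as well.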
Does anybody know of any such solution, or at least an example
of such a thing?
Thanks for any help,
Matej Cepl
--
Matej Cepl,
GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC
138 Highland Ave. #10, Somerville, Ma 02143, (617) 623-1488