
Cleaning the mess of newssite HTML

Hi,

can anybody help me clean up really messy HTML from a news
site into really clean XHTML? I would then like to run some
qualitative analysis on it (probably exporting to plain ASCII
in the meantime, but not necessarily). I can do a little
cleaning by hand, but with some hundreds of web pages, I
hoped that I could create an XSL stylesheet for the
conversion.

I have downloaded this page
(http://news.bostonherald.com/localRe...6&format=text;
a copy is available at
http://www.ceplovi.cz/matej/tmp/downloaded.html). Then I ran it
through tidy (http://www.ceplovi.cz/matej/tmp/tidyfied.html). I
would love to get some really minimal, HTML 2.0-like XHTML
(something like http://www.ceplovi.cz/matej/tmp/clean.xhtml).

Is there any tool for doing things like that? I hoped to write
an XSL stylesheet myself, but I am quite a newbie in the XSL arena,
and there are some things I did not manage to do:

1) How do I tell the XSL processor to skip everything between <body>
and the block-level tag that contains the same text as <title>,
minus some constant text or regex (e.g.,
"\s*BostonHerald.com.*:" or at least "BostonHerald.com -
Local/Regional News:")? Of course, any closing tags left over
from the skipped part should be omitted as well.
2) How do I remove all tables without removing their content?
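
For question 2, one possible approach outside XSL is a small stream filter that re-emits the markup and simply leaves out the table-related tags, so their content stays in place. The following is only a sketch using Python's standard html.parser module; the names TableStripper and strip_tables are made up for illustration, and it assumes the input has already been tidied. Comments and the doctype are dropped, and script/style content is not treated specially.

```python
from html import escape
from html.parser import HTMLParser

# Tags that only provide table layout; their content is kept.
TABLE_TAGS = {"table", "thead", "tbody", "tfoot", "tr", "td", "th",
              "caption", "colgroup", "col"}


class TableStripper(HTMLParser):
    """Re-emit markup, omitting table tags but keeping what is inside them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    @staticmethod
    def _render_attrs(attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        return "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag not in TABLE_TAGS:
            self.out.append("<{}{}>".format(tag, self._render_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Keep XHTML-style empty elements such as <br /> as they are.
        if tag not in TABLE_TAGS:
            self.out.append("<{}{} />".format(tag, self._render_attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in TABLE_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser; escape again.
        self.out.append(escape(data, quote=False))


def strip_tables(markup):
    """Return markup with all table tags removed and their content kept."""
    parser = TableStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling strip_tables() on the text of a tidied page returns the same markup with the cell contents promoted to wherever the table used to be.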

Does anybody know of any such solution, or at least an example of
such a thing?
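
As an example of the kind of thing meant in question 1, here is a rough sketch in Python rather than XSL, using only xml.etree.ElementTree from the standard library. It assumes the tidied page is well-formed XML (for instance, tidy run with numeric entities), that the headline sits in one of a small, guessed set of block-level tags, and that the site prefix in <title> can be stripped with a regex. The function names and the BLOCK_TAGS set are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed set of block-level tags that may hold the headline.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div"}


def local(tag):
    """Strip an XML namespace such as {http://www.w3.org/1999/xhtml}."""
    return tag.rsplit("}", 1)[-1]


def norm(text):
    """Collapse runs of whitespace so texts can be compared."""
    return " ".join(text.split())


def find(root, name):
    return next((el for el in root.iter() if local(el.tag) == name), None)


def trim_to_headline(root, prefix_pattern):
    """Make <body> start at the block whose text equals the trimmed <title>.

    Everything before that block is dropped, and the elements that wrapped
    it (tables, divs, ...) are unwrapped. Returns False if nothing matched.
    """
    title, body = find(root, "title"), find(root, "body")
    if title is None or body is None:
        return False
    headline = norm(re.sub(prefix_pattern, "", "".join(title.itertext())))
    if not headline:
        return False
    start = next(
        (el for el in body.iter()
         if local(el.tag) in BLOCK_TAGS
         and norm("".join(el.itertext())) == headline),
        None,
    )
    if start is None:
        return False

    parents = {child: parent for parent in body.iter() for child in parent}
    kept = [start]
    node = start
    while node is not body:
        parent = parents[node]
        siblings = list(parent)
        if node is not start:
            # Text that followed the wrapper's closing tag is preserved.
            kept[-1].tail = (kept[-1].tail or "") + (node.tail or "")
        kept.extend(siblings[siblings.index(node) + 1:])
        node = parent
    body.text = None
    body[:] = kept
    return True
```

With a tree obtained from ET.parse() on the tidied file, a call such as trim_to_headline(tree.getroot(), r"^\s*BostonHerald\.com.*?:\s*") would leave the headline as the first child of <body>; the non-greedy pattern is a guess based on the title format quoted above.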

Thanks for any help,

Matej Cepl

--
Matej Cepl,
GPG Finger: 89EF 4BC6 288A BF43 1BAB 25C3 E09F EF25 D964 84AC
138 Highland Ave. #10, Somerville, Ma 02143, (617) 623-1488
Jul 20 '05 #1


Matej Cepl wrote:
can anybody help me with the cleaning of really messy HTML from
the news site into really clean XHTML, which I would like to
then analyze with some qualitative analysis (probably exporting
to plain ASCII in meantime, but not necessarily). I can do some
little cleaning by hand, but when there some hundreds of
webpages, I hoped that I could create some XSL stylesheet for
conversion.


Sorry, only later did I find the right keywords to search for on
groups.google.com, and found out that HTML screen scraping is already
a well-developed field.

Matej

Jul 20 '05 #2
