Bytes IT Community

Seek HTML cleanup utilities

I would like to make a number of changes to HTML files that HTML Tidy
does not currently support. Most of them arise from OCR recognition
errors, and many from the way my OCR program, FineReader, saves to
HTML. I have begun to write stream-editing scripts in Python, but
wonder whether someone else has already done so. It would save me a
lot of time to use or modify existing utilities, and I would
appreciate pointers to any that are available. Please respond
by email.

Some of the kinds of cleanup I want to be able to do include:

1. Removal of empty tag pairs.

2. Trimming/moving whitespace around tags:
a. Removal of whitespace following a <p> and preceding
a </p>.
b. Moving whitespace that follows a start tag to precede
it, and whitespace that precedes an end tag to follow it.

3. Moving certain punctuation -- comma, period,
semi-colon, etc. -- outside of certain end tags, such
as </i>, </b>, etc.

4. Removal of certain attributes:
a. In <font> tag, face="Times New Roman" (or
whatever) so that it will be viewed with default font face.
b. In <font> tag, size="2" (or whatever) so that it
will be viewed with default font size.

5. Changing of certain attributes:
a. In <font> tag, absolute size="4" to relative
size="+1" (or whatever).

6. Changing of certain tags:
a. <em> to <i>.
b. <strong> to <b>.

7. Removal of certain tags, such as <p>, from around
all the contents of table cells.

8. For all tables, removal of empty topmost and
bottommost rows, leftmost and rightmost columns.
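
To make items 1, 2a, 3 and 6 concrete, here is a minimal sketch using only Python's standard re module. The function names and the default tag lists are hypothetical, and the regular-expression approach is an assumption: it treats the file as plain text and does not understand comments, scripts or <pre> blocks, so it is a starting point rather than a finished utility.

```python
import re


def rename_tags(html):
    """Item 6: change <em> to <i> and <strong> to <b>, keeping attributes."""
    html = re.sub(r'<(/?)em\b', r'<\1i', html, flags=re.IGNORECASE)
    html = re.sub(r'<(/?)strong\b', r'<\1b', html, flags=re.IGNORECASE)
    return html


def trim_paragraph_whitespace(html):
    """Item 2a: drop whitespace right after <p ...> and right before </p>."""
    html = re.sub(r'(<p\b[^>]*>)\s+', r'\1', html, flags=re.IGNORECASE)
    html = re.sub(r'\s+(</p>)', r'\1', html, flags=re.IGNORECASE)
    return html


def move_punctuation_out(html, tags=('i', 'b')):
    """Item 3: move punctuation sitting just inside an end tag to just after it."""
    names = '|'.join(re.escape(t) for t in tags)
    pattern = re.compile(r'([,.;:]+)(</(?:%s)>)' % names, re.IGNORECASE)
    return pattern.sub(r'\2\1', html)


def remove_empty_pairs(html, tags=('b', 'i', 'font')):
    """Item 1: remove empty pairs such as <b></b>, keeping any inner whitespace.

    The substitution is repeated until nothing changes, so a pair that only
    becomes empty after an inner pair is removed is caught on a later pass.
    """
    names = '|'.join(re.escape(t) for t in tags)
    pattern = re.compile(r'<(%s)\b[^>]*>(\s*)</\1>' % names, re.IGNORECASE)
    previous = None
    while previous != html:
        previous = html
        html = pattern.sub(r'\2', html)
    return html
```

Order matters if these are chained: running rename_tags first means a later call to move_punctuation_out only has to look for </i> and </b>. The tag list for remove_empty_pairs is deliberately short, since removing every empty pair would also delete empty table cells and named anchors.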

I could go on, but this provides a sample.
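
The attribute changes in items 4 and 5 could be sketched the same way. The function name clean_font_tags and the size_map parameter below are hypothetical, and the particular mapping (drop size 2, turn size 4 into +1) is only an assumption taken from the examples in the list; a real script would need whatever mapping suits the documents at hand.

```python
import re

# Matches an opening <font ...> tag and captures its attribute text.
FONT_TAG = re.compile(r'<font\b([^>]*)>', re.IGNORECASE)

# Matches a face= or size= attribute with a quoted or unquoted value.
FONT_ATTR = re.compile(
    r'''\s+(face|size)\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)''', re.IGNORECASE)


def clean_font_tags(html, size_map=None):
    """Items 4 and 5: drop face=, and drop or rewrite size= in <font> tags.

    size_map maps an absolute size to its replacement; None means
    "remove the attribute". Sizes not listed are left untouched.
    """
    if size_map is None:
        size_map = {'2': None, '4': '+1'}  # assumed example mapping

    def fix_attr(match):
        name = match.group(1).lower()
        value = match.group(2).strip('"\'')
        if name == 'face':
            return ''  # item 4a: fall back to the default face
        if value in size_map:
            new_value = size_map[value]
            if new_value is None:
                return ''  # item 4b: fall back to the default size
            return ' size="%s"' % new_value  # item 5a: absolute to relative
        return match.group(0)

    def fix_tag(match):
        return '<font%s>' % FONT_ATTR.sub(fix_attr, match.group(1))

    return FONT_TAG.sub(fix_tag, html)
```

A tag that loses all of its attributes is left behind as a bare <font>, which does nothing on its own; removing such tags together with their matching </font> would need a separate pass that tracks nesting.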

Please visit my website at http://www.constitution.org to see what
kinds of HTML documents I am producing.
Jul 18 '05 #1