
Parsing HTML - modify URLs

I am trying to parse an HTML page and modify only the URLs inside tags,
e.g. inside IMG, A, SCRIPT and FRAME tags.

I have built one using HTMLParser.HTMLParser and it works fine... on
good HTML. Having googled around, it looks like HTMLParser choking on
dodgy HTML is a common theme.

I would have difficulty using regular expressions, as I want to
modify local reference URLs as well as absolute ones.
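
To illustrate the local versus absolute distinction, here is a minimal sketch of resolving both kinds of URL through one function, using the Python 3 standard library (urllib.parse). BASE_URL, PROXY_PREFIX and rewrite_url are hypothetical names, and the "prefix and quote" rewrite is only an assumed example of a modification, not the one actually wanted here.

```python
from urllib.parse import quote, urljoin, urlsplit

# Hypothetical values, for illustration only.
BASE_URL = "http://example.com/docs/index.html"    # address of the page being parsed
PROXY_PREFIX = "http://proxy.example/fetch?url="   # assumed rewrite target


def rewrite_url(url, base=BASE_URL):
    """Resolve a local reference against the base, then apply the rewrite."""
    # urljoin leaves absolute URLs as they are and resolves relative ones.
    absolute = urljoin(base, url)
    # Leave non-HTTP schemes such as mailto: or javascript: untouched.
    if urlsplit(absolute).scheme not in ("http", "https"):
        return url
    return PROXY_PREFIX + quote(absolute, safe="")


if __name__ == "__main__":
    for candidate in ("img/logo.png", "/top.html", "http://other.example/x"):
        print(candidate, "->", rewrite_url(candidate))
```

Because the local reference is resolved first, the rewrite itself only ever has to deal with absolute URLs.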

It would be nice to just override the error handling of HTMLParser,
but short of digging into the source code, that's not a documented
technique :-)

Has anyone got any suggestions? This is to go on a server as a CGI, and
I don't have shell access or anything like that, so I'd like to avoid
installing mxTidy. Does anyone know of an HTML parsing library that will
let me write most of the page back out unmodified and change only the
contents of some of the tags?
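
For reference, the kind of pass-through rewriter described above can be sketched as follows. This is a hedged example, not a tested solution: it assumes the Python 3 html.parser module (which tolerates malformed markup rather than raising), and UrlRewriter, URL_ATTRS and rewrite_html are hypothetical names. Tags without URL attributes are echoed verbatim; only tags that carry a URL are rebuilt, so their original quoting and attribute-name case are not preserved.

```python
import html
from html.parser import HTMLParser

# Attributes that usually hold a URL, keyed by tag name.
# An illustrative, incomplete list (assumption, extend as needed).
URL_ATTRS = {
    "a": {"href"}, "img": {"src"}, "script": {"src"}, "frame": {"src"},
    "iframe": {"src"}, "link": {"href"}, "form": {"action"},
}


class UrlRewriter(HTMLParser):
    """Echo the document back out, rewriting only URL-bearing attributes."""

    def __init__(self, rewrite):
        # convert_charrefs=False so entities reach us separately and can be
        # written back out instead of being decoded into the text.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []

    def _start(self, tag, attrs, selfclosing):
        wanted = URL_ATTRS.get(tag, ())
        if not any(name in wanted and value for name, value in attrs):
            # Nothing to change: emit the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:              # bare attribute, e.g. <script defer>
                parts.append(name)
                continue
            if name in wanted:
                value = self.rewrite(value)
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s%s>" % (" ".join(parts), " /" if selfclosing else ""))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, False)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, True)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_html(source, rewrite):
    parser = UrlRewriter(rewrite)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p class=x>Hi &amp; <a href="page.html">link</a><img src=\'pic.png\'></p>'
    print(rewrite_html(page, lambda url: "http://example.com/" + url))
```

Since this uses only the standard library, nothing needs installing on the server, which matters without shell access.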



Jul 18 '05 #1
