Bytes IT Community

Parsing HTML - modify URLs

I am trying to parse an HTML page and modify only the URLs within tags,
e.g. inside IMG, A, SCRIPT and FRAME tags.

I have built one using HTMLParser.HTMLParser and it works fine... on
good HTML. Having done a Google search, it looks like HTMLParser
choking on dodgy HTML is a common theme.

I would have difficulty using regular expressions, as I want to
modify local (relative) reference URLs as well as absolute ones.
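
To illustrate the local-versus-absolute point, here is a minimal sketch of resolving both kinds of reference against the page's own address. It assumes Python 3's standard `urllib.parse.urljoin`; `PAGE_URL` and `absolutise` are hypothetical names used only for this example.

```python
from urllib.parse import urljoin

# Hypothetical address of the page being rewritten.
PAGE_URL = "http://example.com/docs/index.html"


def absolutise(ref, base=PAGE_URL):
    """Resolve a reference found in a tag against the page's own URL."""
    return urljoin(base, ref)


print(absolutise("img/logo.png"))                # relative to the page's directory
print(absolutise("/style.css"))                  # relative to the site root
print(absolutise("http://other.example.org/x"))  # already absolute, left alone
```

Once every reference has been resolved this way, the same rewriting rule can be applied to local and absolute URLs alike.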

It would be nice to just override the error handling of HTMLParser -
but short of digging in the source code it's not a documented
technique :-)

Has anyone got any suggestions? This is to go on a server as a CGI, and
I don't have shell access or anything like that, so I'd like to avoid
installing mxTidy. Does anyone know an HTML parsing library that will
allow me to write out most of the page unmodified and modify just the
contents of some of the tags?
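
For concreteness, the kind of pass-through rewriter described above might look like the sketch below. It is only an illustration under stated assumptions: it uses Python 3's standard `html.parser.HTMLParser` (not the older `HTMLParser.HTMLParser` module mentioned in the question), and `URL_ATTRS`, `URLRewriter` and `rewrite_urls` are hypothetical names. Everything is echoed back as parsed, and only start tags carrying a URL attribute are rebuilt.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical table of tag -> attribute holding a URL; extend as needed.
URL_ATTRS = {"a": "href", "img": "src", "script": "src", "frame": "src"}


class URLRewriter(HTMLParser):
    """Echo the page back, rebuilding only start tags that carry a URL."""

    def __init__(self, rewrite):
        # convert_charrefs=False keeps entities as separate events,
        # so they can be written back out as they came in.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []

    def _emit_start(self, tag, attrs, self_closing):
        url_attr = URL_ATTRS.get(tag)
        if not any(n == url_attr and v for n, v in attrs):
            # Nothing to change: copy the tag's original source text.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:              # bare attribute, e.g. "defer"
                parts.append(name)
                continue
            if name == url_attr:
                value = self.rewrite(value)
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + (" />" if self_closing else ">"))

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, False)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, True)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_urls(page, rewrite):
    """Return page with rewrite() applied to every URL listed in URL_ATTRS."""
    parser = URLRewriter(rewrite)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)


base = "http://example.com/dir/"  # hypothetical page location
print(rewrite_urls('<p><A HREF=page.html>link</a> <img src="pic.png">',
                   lambda u: urljoin(base, u)))
```

Note that tag and attribute names in rewritten tags, and all end tags, come out lower-cased, so the output is not byte-for-byte identical to the input. The sketch needs nothing beyond the standard library, which suits a CGI host without shell access.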


Jul 18 '05 #1