Parsing HTML - modify URLs

I am trying to parse an HTML page and modify only the URLs inside tags,
e.g. inside IMG, A, SCRIPT, FRAME tags etc...

I have built one using HTMLParser.HTMLParser and it works fine.... on
good HTML. Having done a Google search, it looks like HTMLParser choking
on dodgy HTML is a common theme.

I would have difficulties using regular expressions, as I want to
modify local (relative) reference URLs as well as absolute ones.
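
To illustrate the local-versus-absolute point: a minimal sketch of resolving both kinds of reference against the page's own address with the standard library's urljoin. This assumes Python 3, where it lives in urllib.parse (in Python 2 it was in the urlparse module). BASE and the helper name absolutise are made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical address of the page being rewritten.
BASE = "http://example.com/docs/index.html"

def absolutise(link, base=BASE):
    """Resolve a link against the page address.

    Relative links are joined onto the base; links that are already
    absolute come back unchanged.
    """
    return urljoin(base, link)

if __name__ == "__main__":
    for link in ("img/a.png", "/top.html", "../up.html", "http://other.org/x"):
        print(link, "->", absolutise(link))
```

Once every reference has been made absolute like this, local and absolute URLs can be handled by the same rewriting step.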

It would be nice to just override the error handling of HTMLParser,
but short of digging in the source code, it's not a documented
technique :-)

Anyone got any suggestions? This is to go on a server as a CGI, and
I don't have shell access or anything like that, so I'd like to avoid
installing mxTidy. Does anyone know an HTML parsing library that will
allow me to write most of the page back out unmodified and just modify
the contents of some of the tags?
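
For reference, a rough sketch of the kind of pass-through rewriter described above. It is not the code from this post. It assumes Python 3, where the class lives in html.parser and, as far as I know, no longer raises on malformed markup. The names URL_ATTRS, URLRewriter and rewrite_html, the tag-to-attribute table, and the "/proxy?u=" prefix in the usage lines are all invented for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Tag -> attribute that carries a URL. Illustrative, not a complete list.
URL_ATTRS = {"a": "href", "img": "src", "script": "src", "frame": "src"}

class URLRewriter(HTMLParser):
    """Copy the page through, rebuilding only tags that carry a URL."""

    def __init__(self, rewrite):
        # convert_charrefs=False so entities are reported separately
        # and can be written back out as they were found.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite  # callable: old URL -> new URL
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        url_attr = URL_ATTRS.get(tag)
        if url_attr is None or not any(n == url_attr and v for n, v in attrs):
            self.out.append(raw)  # nothing to change: keep the source text
            return
        parts = [tag]
        for name, value in attrs:
            if name == url_attr and value:
                value = self.rewrite(value)
            if value is None:  # bare attribute such as "defer"
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        end = " />" if raw.endswith("/>") else ">"
        self.out.append("<" + " ".join(parts) + end)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)  # the "/>" is re-added there

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)

def rewrite_html(html_text, rewrite):
    parser = URLRewriter(rewrite)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

if __name__ == "__main__":
    page = '<p class=x>Hi <a href="page.html">link</a><img src=pic.png><b>unclosed'
    print(rewrite_html(page, lambda url: "/proxy?u=" + url))
```

Tags without a URL attribute are emitted exactly as they appeared in the source, so sloppy quoting and unclosed elements pass through untouched. Only the rebuilt tags get normalised (lowercased names, double-quoted values), and end tags are written back in lowercase.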



Jul 18 '05 #1