I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:
// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");
//Remove any HTML specific characters (e.g. " or &)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");
// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");
Is there a way I can combine all four patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:
replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");
Thanks.
--
Best Regards
Andrew Dixon