Hi Chris.
Thanks. The reason I used \\s{2,} is so that it only replaces runs of two or
more whitespace characters. That way the text doesn't all end up as one great
long string without any spaces, which is what replacing \\s+ with "" would do.
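
To make the difference concrete, here is a minimal sketch comparing the two
quantifiers. The class and method names (WhitespaceDemo, collapseRuns,
collapseAll, deleteAll) and the sample string are made up for illustration;
only String.replaceAll is the real API.

```java
public class WhitespaceDemo {
    // Replace only runs of two or more whitespace characters with one space.
    // A lone space, tab or newline is left untouched.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // Replace every whitespace run, including single characters, with one space.
    static String collapseAll(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Delete all whitespace, which is what happens if \\s+ is replaced with "".
    static String deleteAll(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String text = "one two\n\n  three\tfour";
        System.out.println(collapseRuns(text)); // the tab before "four" survives
        System.out.println(collapseAll(text));  // one two three four
        System.out.println(deleteAll(text));    // onetwothreefour
    }
}
```

So with " " as the replacement, \\s+ still leaves single spaces between words.
It is only the empty replacement that glues everything together.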
--
Best Regards
>>> Andrew Dixon
www.depictions.net - Sell your photographs online and set your own price.
"Chris" <chris2k01@hotmail.com> wrote in message
news:b4HIb.105301$ss5.27559@clgrps13...[color=blue]
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Andrew Dixon - Depictions.net wrote:
>[color=green]
> > Hi Everyone.
> >
> > I have been working on some code that strips the HTML code out of an
> > HTML page leaving just the text on the page. At the moment this is
> > what I have:
> >
> > // Strip all tags
> > replacePattern = "<(.|\n)+?>";
> > pageHTML = pageHTML.replaceAll(replacePattern,"");
> >
> > // Remove any HTML entities (e.g. &quot; or &amp;)
> > replacePattern = "&(.|\n)+?;";
> > pageHTML = pageHTML.replaceAll(replacePattern,"");
> >
> > // Remove whitespace
> > replacePattern = "\\s{2,}";
> > pageHTML = pageHTML.replaceAll(replacePattern," ");
> >
> > Is there a way I can combine all three patterns into one expression
> > so I can make the code more efficient? I've not really worked with
> > RegEx so any advice would be most welcome. Can I do something like:
> >
> > replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
> > pageHTML = pageHTML.replaceAll(replacePattern,"");
> >
> > Thanks.[/color]
>
> Hi,
> I'm not that familiar with regular expressions myself, but according
> to the documentation, you should be able to use something like this:
>
> (<(.|\n)+?>)|(&(.|\n)+?;)
>
> to match either of the first two items in your list (tags and
> entities). The whitespace thing, I would recommend keeping a separate
> operation, since you really want to replace each block of whitespace
> with " ", but you want to replace tags and entities with "". Also,
> isn't it easier to use "\\s+" than "\\s{2,}" for whitespace? Finally,
> another point to ponder: using your whitespace replacement system
> will put the entire output on one line. I'd think about changing your
> expression if you want to keep linebreaks where they are instead of
> turning them into spaces.
>
> - --
> Chris
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.2 (GNU/Linux)
>
> iD8DBQE/8xNbwxczzJRavJYRAlHQAJ9kY1USFtv36iInWnR0v6hqTL0Pzw CZAYZ2
> gILykD4bpg2T8Io/eZJ+M1Q=
> =DX+T
> -----END PGP SIGNATURE-----[/color]
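
For reference, the suggestion in the quoted message (one alternation for tags
and entities, plus a separate whitespace pass) might look roughly like the
sketch below. The class name StripHtml, the method strip and the sample input
are assumptions for illustration. It uses Pattern.DOTALL so that "." also
matches line breaks, in place of the (.|\n) group, and precompiles both
patterns so they are not recompiled on every call.

```java
import java.util.regex.Pattern;

public class StripHtml {
    // Tags OR entities in one pattern. Both are replaced with "".
    // DOTALL makes '.' match line terminators, so tags spanning lines still match.
    private static final Pattern MARKUP =
            Pattern.compile("<.+?>|&.+?;", Pattern.DOTALL);

    // Whitespace stays a separate pass because its replacement is " ", not "".
    private static final Pattern WHITESPACE = Pattern.compile("\\s{2,}");

    static String strip(String html) {
        String noMarkup = MARKUP.matcher(html).replaceAll("");
        return WHITESPACE.matcher(noMarkup).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<p>Fish &amp; chips</p>\n\n<p>Tea</p>";
        System.out.println(strip(html)); // Fish chips Tea
    }
}
```

One caveat: like the original pattern, "&.+?;" will match from a bare "&" in
ordinary text up to the next ";", so it can eat more than an entity. This is a
quick text extraction, not a real HTML parser.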