Combining Regular Expressions

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");

//Remove any HTML specific characters (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");

Is there a way I can combine all three patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:

replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");

Thanks.
--

Best Regards
Andrew Dixon

Jul 17 '05 #1
5 Replies


Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:

(<(.|\n)+?>)|(&(.|\n)+?;)

to match either of the first two items in your list (tags and
entities). I would recommend keeping the whitespace replacement as a
separate operation, since you really want to replace each block of whitespace
with " ", but you want to replace tags and entities with "". Also,
isn't it easier to use "\\s+" than "\\s{2,}" for whitespace? Finally,
another point to ponder: using your whitespace replacement system
will put the entire output on one line. I'd think about changing your
expression if you want to keep linebreaks where they are instead of
turning them into spaces.
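
A minimal sketch of the suggestion above, assuming plain java.util.regex: tags and entities share one alternation because both are replaced with "", while whitespace stays a second pass because its replacement is " ". The class and method names (HtmlTextStripper, strip) are made up for illustration; the patterns are the ones quoted in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the thread.
final class HtmlTextStripper {
    // Tags OR entities, joined with '|' (same pattern as quoted above).
    private static final Pattern MARKUP =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");
    // Runs of two or more whitespace characters, as in the original post.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s{2,}");

    static String strip(String pageHTML) {
        // Pass 1: tags and entities are both replaced with the empty string.
        String text = MARKUP.matcher(pageHTML).replaceAll("");
        // Pass 2: whitespace needs a different replacement, so it stays separate.
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Fish &amp; chips,\n  <b>twice</b></p>"));
    }
}
```

String.replaceAll compiles its pattern on every call, so holding the compiled Pattern objects in constants is probably where most of any speed-up would come from, more than from merging the expressions. Note that the entity pattern deletes entities rather than decoding them, and a stray "&" followed later by ";" would take the text in between with it.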

--
Chris
Jul 17 '05 #2

Java regular expressions can be combined with the '|' (alternation) operator.
But if your objective is retrieving text from HTML documents, you can
instead override the handleText() method of
javax.swing.text.html.HTMLEditorKit.ParserCallback.
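
As a rough sketch of that approach, assuming the Swing HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) is available: the callback below collects every text chunk the parser reports and joins the chunks with single spaces. The class and method names (HtmlTextExtractor, extractText) are hypothetical, and joining with a space is my own choice rather than anything prescribed by the API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; the names are illustrative, not from the thread.
final class HtmlTextExtractor {
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per chunk of character data between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><p>Fish &amp; chips</p><p>twice</p></body></html>"));
    }
}
```

Unlike the regex version, the parser hands back entities already decoded, so "&amp;" arrives as "&" instead of being deleted.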
Jul 17 '05 #3

You might want to look at http://htmlparser.sourceforge.net

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Jul 17 '05 #4

Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces runs of two
or more whitespace characters, so that it doesn't all end up as one great
long string with any spaces, which is what \\s+ would do.

--

Best Regards
Andrew Dixon

www.depictions.net - Sell your photographs online and set your own price.

Jul 17 '05 #5

Hi,
I have to assume you mean "without any spaces" rather than
"with any spaces".

If that assumption is wrong, then I'm afraid I don't quite understand
the question.

If it is right, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any string of
one space or more and replace it with one space - which makes sense.
Replacing one space with one space is still one space. I think you
thought I meant to replace it with "", which would indeed remove all
spaces. However, *your* example is also slightly flawed, assuming
your replacement string is "", in that where there are 2 or more
spaces, it will take all of them and remove them entirely, rather
than replacing them with one space. If, on the other hand, your
replacement string is " ", then both examples work equally well, but
my regexp is shorter than yours :) Largely academic at this point
though.
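
The difference between the two whitespace patterns can be checked with a small sketch like the one below (the class and method names are invented for illustration). With " " as the replacement the two agree on runs of spaces, but they part ways on a lone newline or tab: "\\s{2,}" leaves it alone, while "\\s+" turns it into a space.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
final class WhitespaceCollapse {
    // Original poster's pattern: only runs of two or more whitespace characters.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // Suggested alternative: every run of one or more whitespace characters.
    static String collapseAll(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String sample = "one  two\nthree\t\tfour";
        // The single newline survives here.
        System.out.println(collapseRuns(sample));
        // Everything ends up on one line, separated by single spaces.
        System.out.println(collapseAll(sample));
        // Replacing with "" instead removes every space.
        System.out.println(sample.replaceAll("\\s+", ""));
    }
}
```

So "\\s{2,}" is the one to keep if single line breaks should be preserved, and "\\s+" if the whole text should end up on one line.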

--
Chris
Jul 17 '05 #6
