Combining Regular Expressions

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");

//Remove any HTML specific characters (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");

Is there a way I can combine all three patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:

replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");

Thanks.
--

Best Regards
Andrew Dixon

Jul 17 '05 #1
Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:

(<(.|\n)+?>)|(&(.|\n)+?;)

to match either of the first two items in your list (tags and
entities). I would recommend keeping the whitespace replacement as a
separate operation, since you want to replace each block of whitespace
with " " but replace tags and entities with "". Also, isn't it easier
to use "\\s+" than "\\s{2,}" for whitespace? Finally, another point to
ponder: \s also matches line breaks, so your whitespace replacement
turns any line break that is part of a run of whitespace into a space,
and much of the output can end up on one line. I'd think about changing
your expression if you want to keep linebreaks where they are instead
of turning them into spaces.
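
A minimal sketch of the suggestion above, assuming the page is already held in a String. The class and method names (HtmlStripSketch, strip, charClassOnly) are made up for illustration and are not from the thread. It joins the tag and entity patterns with '|' and keeps the whitespace step separate, and it also shows why the square-bracket attempt does not combine the patterns: each [...] is a character class that matches one single character from the listed set.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class HtmlStripSketch {
    // Tags OR entities, joined with '|' as suggested in the reply.
    private static final Pattern TAG_OR_ENTITY =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");

    // Whitespace stays a separate step because its replacement is " ", not "".
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s{2,}");

    static String strip(String pageHTML) {
        String text = TAG_OR_ENTITY.matcher(pageHTML).replaceAll("");
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    // The first bracketed piece of the original attempt, on its own: a
    // character class that removes any single '<', '(', '.', '|', newline,
    // ')', '+', '?' or '>' character, not whole tags.
    static String charClassOnly(String pageHTML) {
        return pageHTML.replaceAll("[<(.|\n)+?>]", "");
    }

    public static void main(String[] args) {
        String html = "<p>Fish &amp; chips,\n   <b>twice</b></p>";
        System.out.println(strip(html));                // Fish chips, twice
        System.out.println(charClassOnly("<p>x.</p>")); // px/p
    }
}
```

Compiling the patterns once and reusing them is a design choice rather than a requirement; String.replaceAll with the same pattern strings gives the same result.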

--
Chris
Jul 17 '05 #2
Java regular expressions can be combined with the '|' operator. But if
your objective is retrieving text from html documents, you can
effectively use HTMLEditorKit.ParserCallback#handleText() method.
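
A rough sketch of the handleText() approach, assuming the Swing HTML parser (ParserDelegator) is driven directly from a Reader. The class name HandleTextSketch and the method extractText are invented for this example, and joining the text chunks with a single space is an arbitrary choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
final class HandleTextSketch {
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();

        // The parser calls handleText() once per chunk of text it finds
        // between tags; the other callbacks keep their empty defaults.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' '); // separator between chunks (an assumption)
                }
                out.append(data);
            }
        };

        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Fish &amp; chips, <b>twice</b></p>"));
    }
}
```

Because a real parser does the work here, tags never need to be matched by a regular expression at all.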
Jul 17 '05 #3
You might want to look at http://htmlparser.sourceforge.net

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Jul 17 '05 #4
Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces where there is
2 spaces or greater so that it doesn't all end up as one great long string
with any spaces which is what \\s+ would do.

--

Best Regards
Andrew Dixon

www.depictions.net - Sell your photographs online and set your own price.

Jul 17 '05 #5
Andrew Dixon - Depictions.net wrote:
Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces where
there is 2 spaces or greater so that it doesn't all end up as one
great long string with any spaces which is what \\s+ would do.


Hi,
I have to assume you mean "without any spaces" rather than
"with any spaces".

If that is not the case, then I'm afraid I don't quite understand
the question.

If it is, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any run of
one or more whitespace characters and replace it with one space,
which makes sense: replacing one space with one space still leaves
one space. I think you thought I meant to replace it with "", which
would indeed remove all spaces. However, *your* example is also
slightly flawed, assuming your replacement string is "", in that
where there are 2 or more spaces, it will remove them entirely
rather than replacing them with one space. If, on the other hand,
your replacement string is " ", then both examples work equally well
on runs of spaces, but my regexp is shorter than yours :) Largely
academic at this point though.
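
The comparison above can be checked with a small sketch. The class and method names are made up for illustration. One difference that may matter: \s also matches tabs and line breaks, so with " " as the replacement "\\s+" turns a lone line break into a space, while "\\s{2,}" leaves it alone.

```java
// Hypothetical helper class; the names are illustrative only.
final class WhitespaceSketch {
    // Chris's version: every run of one or more whitespace characters
    // becomes exactly one space.
    static String collapseAll(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Andrew's version: only runs of two or more whitespace characters
    // become one space; single whitespace characters are untouched.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // The flawed case from the reply: "" as the replacement glues together
    // words that were separated by two or more whitespace characters.
    static String deleteRuns(String s) {
        return s.replaceAll("\\s{2,}", "");
    }

    public static void main(String[] args) {
        String text = "one two   three\nfour";
        System.out.println(collapseAll(text));  // one two three four
        System.out.println(collapseRuns(text)); // keeps the line break before "four"
        System.out.println(deleteRuns(text));   // "two" and "three" run together
    }
}
```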

--
Chris
Jul 17 '05 #6
