Combining Regular Expressions

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove any HTML entities (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Collapse each run of two or more whitespace characters into one space
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");

Is there a way I can combine all three patterns into one expression so I can
make the code more efficient? I have not really worked with regular expressions, so any
advice would be most welcome. Can I do something like:

replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");

Thanks.
--

Best Regards
Andrew Dixon

Jul 17 '05 #1
Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:

(<(.|\n)+?>)|(&(.|\n)+?;)

to match either of the first two items in your list (tags and
entities). I would recommend keeping the whitespace step as a separate
operation, since you want to replace each block of whitespace
with " " but replace tags and entities with "". Also,
isn't it easier to use "\\s+" than "\\s{2,}" for whitespace? Finally,
another point to ponder: your whitespace replacement
will put the entire output on one line. I'd think about changing your
expression if you want to keep line breaks where they are instead of
turning them into spaces.
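
A minimal sketch of that two-step approach, using the patterns exactly as quoted in this thread. The class name `HtmlTextStripper` and the method `strip` are made up for illustration. Note that the entity pattern will also delete ordinary text that happens to sit between a bare "&" and a later ";", and that `(.|\n)+?` can be slow or overflow the stack on very large inputs, so treat this as a starting point only.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the thread.
class HtmlTextStripper {
    // Tags OR entities, joined with '|' as suggested above.
    private static final Pattern MARKUP =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");
    // Whitespace stays a separate step because its replacement differs.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s{2,}");

    static String strip(String pageHTML) {
        // Step 1: tags and entities are replaced with the empty string.
        String text = MARKUP.matcher(pageHTML).replaceAll("");
        // Step 2: each run of two or more whitespace characters becomes one space.
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Prints: Fish chips today
        System.out.println(strip("<p>Fish &amp; chips</p>\n\n<b>today</b>"));
    }
}
```

Compiling the patterns once with `Pattern.compile` avoids recompiling them on every call, which is probably a bigger saving than merging the expressions.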

--
Chris
Jul 17 '05 #2
Java regular expressions can be combined with the '|' operator. But if
your objective is to retrieve text from HTML documents, you can
use the HTMLEditorKit.ParserCallback#handleText() method instead.
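
As a rough sketch of the parser-based route, assuming the HTML is already in a String: the class name `HtmlTextExtractor` and the method `extractText` are hypothetical, while `ParserDelegator` and `HTMLEditorKit.ParserCallback` are the standard Swing classes. Each text run is followed by a space here purely as an illustrative choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; only the javax.swing.text.html calls are standard API.
class HtmlTextExtractor {
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of text the parser finds between tags.
                text.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```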
Jul 17 '05 #3
You might want to look at http://htmlparser.sourceforge.net

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Jul 17 '05 #4
Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces where there is
2 spaces or greater so that it doesn't all end up as one great long string
with any spaces which is what \\s+ would do.

--

Best Regards
Andrew Dixon

www.depictions.net - Sell your photographs online and set your own price.

Jul 17 '05 #5
Hi,
First, I have to assume you mean "without any spaces" rather than
"with any spaces".

If that is not the case, then I'm afraid I don't quite understand
the question.

If it is, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any string of
one space or more and replace it with one space - which makes sense.
Replacing one space with one space is still one space. I think you
thought I meant to replace it with "", which would indeed remove all
spaces. However, *your* example is also slightly flawed, assuming
your replacement string is "", in that where there are 2 or more
spaces, it will take all of them and remove them entirely, rather
than replacing them with one space. If, on the other hand, your
replacement string is " ", then both examples work equally well, but
my regexp is shorter than yours :) Largely academic at this point
though.
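
The comparison above can be checked with a small demo. The class `WhitespaceDemo` and its method names are made up for illustration. One difference the discussion does not mention: a lone tab or newline is matched by "\\s+" but not by "\\s{2,}", so the two patterns are only interchangeable when single whitespace characters are always plain spaces.

```java
// Hypothetical demo class comparing the two whitespace patterns from the thread.
class WhitespaceDemo {
    static String withPlus(String s, String replacement) {
        return s.replaceAll("\\s+", replacement);
    }

    static String withTwoOrMore(String s, String replacement) {
        return s.replaceAll("\\s{2,}", replacement);
    }

    public static void main(String[] args) {
        // One space, then three spaces, then two newlines.
        String sample = "one two   three\n\nfour";

        // Replacement " ": both patterns give "one two three four".
        System.out.println(withPlus(sample, " "));
        System.out.println(withTwoOrMore(sample, " "));

        // Replacement "": "\\s+" removes every space, "\\s{2,}" keeps single ones.
        System.out.println(withPlus(sample, ""));       // onetwothreefour
        System.out.println(withTwoOrMore(sample, ""));  // one twothreefour
    }
}
```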

--
Chris
Jul 17 '05 #6
