Bytes | Software Development & Data Engineering Community

Combining Regular Expressions

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove any HTML-specific characters (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");

Is there a way I can combine all three patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:

replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");

Thanks.
--

Best Regards
Andrew Dixon

Jul 17 '05 #1
Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:

(<(.|\n)+?>)|(&(.|\n)+?;)

to match either of the first two items in your list (tags and
entities). I would recommend keeping the whitespace replacement as a
separate operation, since you really want to replace each block of
whitespace with " ", but you want to replace tags and entities with "".
Also, isn't it easier to use "\\s+" than "\\s{2,}" for whitespace?
Finally, another point to ponder: your whitespace replacement will put
most of the output on one line (any line break next to other whitespace
becomes a space). I'd think about changing your expression if you want
to keep line breaks where they are instead of turning them into spaces.
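
For what it's worth, here is a minimal sketch of that two-step approach in Java. The class and method names (HtmlTextStripper, strip) are made up for illustration, and the patterns are the ones from this thread, not a robust HTML parser. Note that the square brackets in the proposed combined pattern would turn each part into a character class (a set of single characters), which is why '|' between groups is used instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the thread.
class HtmlTextStripper {
    // Tags OR entities, joined with '|'. Both are replaced with "".
    private static final Pattern MARKUP =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");

    // Runs of two or more whitespace characters, replaced with a single space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s{2,}");

    static String strip(String html) {
        // Step 1: remove tags and entities in a single pass.
        String withoutMarkup = MARKUP.matcher(html).replaceAll("");
        // Step 2: collapse whitespace separately, since the replacement differs.
        return WHITESPACE.matcher(withoutMarkup).replaceAll(" ");
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(strip("<p>Hello &amp;   <b>world</b></p>"));
    }
}
```

Compiling the patterns once and reusing them is probably where any real efficiency gain comes from, more than from merging everything into one expression. Also be aware that this removes entities rather than decoding them, so &amp; disappears instead of becoming "&".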

--
Chris
Jul 17 '05 #2
Java regular expressions can be combined with the '|' operator. But if
your objective is retrieving text from HTML documents, you can
use the HTMLEditorKit.ParserCallback#handleText() method instead.
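
A rough sketch of the handleText() approach, assuming the Swing HTML parser that ships with the JDK (ParserDelegator) is acceptable for your pages. The class and method names (HtmlTextExtractor, extractText) are hypothetical, and joining the text chunks with a space is my own choice, not something the API requires.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative.
class HtmlTextExtractor {
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText() for each run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Assumption: separate chunks with a space so words do not run together.
                text.append(data).append(' ');
            }
        };

        // The last argument (true) tells the parser to ignore any charset
        // declared inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Compared with the regular expressions, this leaves tag and entity handling to the parser, so you would only need a whitespace clean-up afterwards if the spacing matters.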
Jul 17 '05 #3
You might want to look at http://htmlparser.sourceforge.net

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Jul 17 '05 #4
Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces where there is
2 spaces or greater so that it doesn't all end up as one great long string
with any spaces which is what \\s+ would do.

--

Best Regards
Andrew Dixon

www.depictions.net - Sell your photographs online and set your own price.

Jul 17 '05 #5
Hi,
I have to assume you mean "without any spaces" rather than
"with any spaces".

If that is not what you meant, then I'm afraid I don't quite understand
the question.

If it is, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any string of
one space or more and replace it with one space - which makes sense.
Replacing one space with one space is still one space. I think you
thought I meant to replace it with "", which would indeed remove all
spaces. However, *your* example is also slightly flawed, assuming
your replacement string is "", in that where there are 2 or more
spaces, it will take all of them and remove them entirely, rather
than replacing them with one space. If, on the other hand, your
replacement string is " ", then both examples work equally well, but
my regexp is shorter than yours :) Largely academic at this point
though.
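
To make the comparison concrete, here is a small sketch with " " as the replacement string in both cases. The class and method names are made up for the example. As far as I can tell, the one place the two patterns differ is a single whitespace character that is not a space (a lone newline or tab): "\\s+" turns it into a space, while "\\s{2,}" leaves it alone.

```java
// Hypothetical demo class; the names are illustrative.
class WhitespaceCollapse {
    // Any run of one or more whitespace characters becomes one space.
    static String collapseRuns(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Only runs of two or more whitespace characters become one space.
    static String collapseTwoOrMore(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    public static void main(String[] args) {
        String sample = "one  two\nthree \n four";

        // Show newlines as \n so the difference is visible on one line.
        // prints: one two three four
        System.out.println(collapseRuns(sample).replace("\n", "\\n"));
        // prints: one two\nthree four
        System.out.println(collapseTwoOrMore(sample).replace("\n", "\\n"));
    }
}
```

So if keeping isolated line breaks matters, "\\s{2,}" does preserve them; if everything should end up on one line, "\\s+" is the simpler choice.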

--
Chris
Jul 17 '05 #6
