Combining Regular Expressions

Andrew Dixon - Depictions.net

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove any HTML-specific characters (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");

// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");

Is there a way I can combine all three patterns into one expression to
make the code more efficient? I've not really worked with RegEx, so any
advice would be most welcome. Can I do something like:

replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");

Thanks.
--

Best Regards

Andrew Dixon

Jul 17 '05 #1


Chris

Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:

(<(.|\n)+?>)|(&(.|\n)+?;)

to match either of the first two items in your list (tags and
entities). I would recommend keeping the whitespace replacement as a
separate operation, since you want to replace each block of whitespace
with " " but replace tags and entities with "". Also,
isn't it easier to use "\\s+" than "\\s{2,}" for whitespace? Finally,
another point to ponder: your whitespace replacement
will put the entire output on one line. I'd think about changing the
expression if you want to keep line breaks where they are instead of
turning them into spaces.
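
A minimal sketch of the two-pass approach described above, using java.util.regex. The class name StripHtml, the method strip and the sample input are made up for illustration; the first pattern is the alternation suggested above, and the whitespace pass is kept separate because it needs a different replacement string. Note that square brackets in a regex define a character class (a set of single characters), so they cannot be used to group whole sub-patterns.

```java
import java.util.regex.Pattern;

class StripHtml {
    // Tags OR entities, joined with the '|' alternation operator.
    private static final Pattern TAGS_AND_ENTITIES =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");

    // Whitespace runs stay in their own pattern: they are replaced
    // with " " rather than with "".
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String pageHTML) {
        // Pass 1: delete tags and entities.
        String text = TAGS_AND_ENTITIES.matcher(pageHTML).replaceAll("");
        // Pass 2: collapse each run of whitespace to a single space.
        return WHITESPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String html = "<p>Fish &amp; chips,\n   <b>to go</b></p>";
        System.out.println(strip(html)); // prints: Fish chips, to go
    }
}
```

Compiling the patterns once and reusing them avoids recompiling on every call, which is likely to matter more for efficiency than merging everything into one expression.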

--
Chris

Jul 17 '05 #2

hiwa

Java regular expressions can be combined with the '|' operator. But if
your objective is retrieving text from HTML documents, you can
use the HTMLEditorKit.ParserCallback#handleText() method instead.
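
A rough sketch of that parser-based approach, assuming the Swing HTML parser (javax.swing.text.html) is available in your JDK. The class name HtmlTextExtractor and the method extractText are hypothetical; joining the text chunks with a single space is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText() for each run of text it finds
        // between tags; tags themselves go to other callbacks, which
        // are left at their default (do-nothing) implementations.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate chunks with one space
                }
                text.append(data);
            }
        };

        try {
            // 'true' tells the parser to ignore any charset directive.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        System.out.println(extractText(
                "<html><body><p>Fish &amp; chips</p><p>to go</p></body></html>"));
    }
}
```

Unlike the regex version, this leaves the job of recognising tags and entities to a real HTML parser, so malformed or unusual markup is less likely to slip through.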

Jul 17 '05 #3

Tony Morris

You might want to look at http://htmlparser.sourceforge.net

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Jul 17 '05 #4

Andrew Dixon - Depictions.net

Hi Chris.

Thanks. The reason I used \\s{2,} is so that it only replaces runs of
2 or more spaces, so that it doesn't all end up as one great long string
with any spaces, which is what \\s+ would do.

--

Best Regards

Andrew Dixon

www.depictions.net - Sell your photographs online and set your own price.

Jul 17 '05 #5

Chris

Hi,
First, I have to assume you mean "without any spaces" rather than
"with any spaces".

If one of these is not true, then I'm afraid I don't quite understand
the question.

If they are, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any string of
one space or more and replace it with one space - which makes sense.
Replacing one space with one space is still one space. I think you
thought I meant to replace it with "", which would indeed remove all
spaces. However, *your* example is also slightly flawed, assuming
your replacement string is "", in that where there are 2 or more
spaces, it will take all of them and remove them entirely, rather
than replacing them with one space. If, on the other hand, your
replacement string is " ", then both examples work equally well, but
my regexp is shorter than yours :) Largely academic at this point
though.
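
The comparison above can be checked with a small sketch. The class name WhitespaceDemo, its method names and the sample strings are invented for illustration. It also shows one case where the two patterns do differ: a single tab or newline is matched by \\s+ but not by \\s{2,}.

```java
class WhitespaceDemo {
    // One or more whitespace characters become a single space.
    static String collapsePlus(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Two or more whitespace characters become a single space.
    static String collapseTwoOrMore(String s) {
        return s.replaceAll("\\s{2,}", " ");
    }

    // Two or more whitespace characters are removed entirely.
    static String removeTwoOrMore(String s) {
        return s.replaceAll("\\s{2,}", "");
    }

    public static void main(String[] args) {
        String sample = "one two   three\n\nfour";
        System.out.println(collapsePlus(sample));      // prints: one two three four
        System.out.println(collapseTwoOrMore(sample)); // prints: one two three four
        System.out.println(removeTwoOrMore(sample));   // prints: one twothreefour

        // A lone tab: \\s+ turns it into a space, \\s{2,} leaves it alone.
        System.out.println(collapsePlus("a\tb").equals("a b"));        // prints: true
        System.out.println(collapseTwoOrMore("a\tb").equals("a\tb"));  // prints: true
    }
}
```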

--
Chris

Jul 17 '05 #6
