Hi Everyone.
I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:
// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern, "");
// Remove any HTML-specific characters (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern, "");
// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern, " ");
Is there a way I can combine all four patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:
replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern, "");
Thanks.
--
Best Regards Andrew Dixon
Hi,
I'm not that familiar with regular expressions myself, but according
to the documentation, you should be able to use something like this:
(<(.|\n)+?>)|(&(.|\n)+?;)
to match either of the first two items in your list (tags and
entities). As for the whitespace, I would recommend keeping that as a
separate operation, since you want to replace each block of whitespace
with " ", but you want to replace tags and entities with "". Also,
isn't it easier to use "\\s+" than "\\s{2,}" for whitespace? Finally,
another point to ponder: your whitespace replacement will tend to put
the output on one line, because any run of two or more whitespace
characters, line breaks included, becomes a single space. I'd think
about changing your expression if you want to keep line breaks where
they are instead of turning them into spaces.
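As a rough sketch of the above, assuming java.util.regex: tags and entities are joined with '|' and removed in one pass, and whitespace is collapsed in a second pass because it needs a different replacement string. The class and method names (HtmlTextStripper, strip) are made up for illustration and do not come from the original code.

```java
import java.util.regex.Pattern;

class HtmlTextStripper {
    // Tags OR entities, joined with '|' (same shape as the pattern above).
    private static final Pattern TAGS_AND_ENTITIES =
            Pattern.compile("(<(.|\n)+?>)|(&(.|\n)+?;)");

    // Whitespace stays separate: it is replaced with " ", not "".
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: strips tags and entities, then collapses whitespace.
    static String strip(String pageHTML) {
        String text = TAGS_AND_ENTITIES.matcher(pageHTML).replaceAll("");
        return WHITESPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<p>Hello &amp; <b>world</b></p>\n\n<p>Bye</p>";
        System.out.println(strip(html)); // prints "Hello world Bye"
    }
}
```

Two caveats with this pattern: a stray "&" in the text will match lazily up to the next ";" and take the text in between with it, and "(.|\n)+?" can be slow or overflow the stack on very long matches, so something like "<[^>]+>" may be worth trying instead.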
--
Chris
Java regular expressions can be combined with the '|' operator. But if
your objective is retrieving text from HTML documents, you can
effectively use the HTMLEditorKit.ParserCallback#handleText() method.
You might want to look at http://htmlparser.sourceforge.net
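For the HTMLEditorKit.ParserCallback route, a minimal sketch might look like the following. It assumes the Swing HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) is available. The class name SwingTextExtractor and the choice to join text chunks with a single space are illustrative assumptions, not something from this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTextExtractor {
    // Hypothetical helper: collects every text chunk the parser reports.
    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                if (out.length() > 0) {
                    out.append(' '); // assumption: separate chunks with one space
                }
                out.append(data);
            }
        };
        try {
            // Third argument: ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

With this approach the text should arrive with character entities already resolved, so no separate entity pattern is needed. Whitespace handling is up to the parser, so a "\\s+" pass afterwards may still be useful.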
--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software
Hi Chris.
Thanks. The reason I used \\s{2,} is so that it only replaces where there are
two or more spaces, so that it doesn't all end up as one great long string
with any spaces, which is what \\s+ would do.
--
Best Regards Andrew Dixon www.depictions.net - Sell your photographs online and set your own price.
Hi,
First, I have to assume you mean "without any spaces" rather than
"with any spaces".
If one of these is not true, then I'm afraid I don't quite understand
the question.
If they are, however, then I think we misunderstand each other. My
idea was to replace "\\s+" with " ". This would take any string of
one space or more and replace it with one space - which makes sense.
Replacing one space with one space is still one space. I think you
thought I meant to replace it with "", which would indeed remove all
spaces. However, *your* example is also slightly flawed, assuming
your replacement string is "", in that where there are 2 or more
spaces, it will take all of them and remove them entirely, rather
than replacing them with one space. If, on the other hand, your
replacement string is " ", then both examples work equally well, but
my regexp is shorter than yours :) Largely academic at this point
though.
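To make the comparison concrete, here is a small sketch that runs both patterns with both replacement strings. The class and method names (WhitespaceComparison, replace) are made up for illustration.

```java
class WhitespaceComparison {
    // Hypothetical helper: thin wrapper around String.replaceAll.
    static String replace(String pattern, String input, String replacement) {
        return input.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        String sample = "a  b c"; // two spaces, then one space

        // Replacement " ": both patterns give the same result for plain spaces.
        System.out.println(replace("\\s+", sample, " "));    // a b c
        System.out.println(replace("\\s{2,}", sample, " ")); // a b c

        // Replacement "": \\s+ removes every space, \\s{2,} only the double one.
        System.out.println(replace("\\s+", sample, ""));     // abc
        System.out.println(replace("\\s{2,}", sample, ""));  // ab c
    }
}
```

One difference does remain with " " as the replacement: "\\s+" turns a single tab or line break into a space, while "\\s{2,}" leaves a lone whitespace character as it is.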
--
Chris