using a regular expression to match up to but not including html start/end tags

I need to create a regular expression that will match a 5-digit number, a
space and then anything up to but not including the next closing HTML tag.
Here is an example:

<startTag>55555 any text</aClosingTag>

I need a Regex that will get all of the text between the HTML tags above
(the HTML tags are random and I do not know them beforehand). The match
string always starts with at least 5 digits.

Oct 11 '08 #1
14 Replies


On Oct 11, 6:42 am, "Andy B" <a_bo...@sbcglobal.net> wrote:
> I need to create a regular expression that will match a 5 digit number, a
> space and then anything up to but not including the next closing html tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>
>
> I need a Regex that will get all of the text between the html tags above
> (the html tags are random and i do not know them before hand). The match
> string always starts with at least 5 digits.

Hi Andy,
There's a nice function at this link that retrieves the text between
the tags:
http://www.4guysfromrolla.com/demos/StripHTML1.asp

The page also lets you test the function online. When you enter the line
<startTag>55555 any text</aClosingTag> in the textbox on the page, it
returns "55555 any text", which is presumably what you want.

You can then use the same function in your own project.

Hope this helps,

Onur Güzel
Oct 11 '08 #2

On Fri, 10 Oct 2008 23:42:10 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> I need to create a regular expression that will match a 5 digit number, a
> space and then anything up to but not including the next closing html tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>

If you are just interested in a match, try this:

<(\w+)>\d{5} .*</\1>

Note the space above (copy as is).

> I need a Regex that will get all of the text between the html tags above
> (the html tags are random and i do not know them before hand). The match
> string always starts with at least 5 digits.

The above seems to imply you wish to capture the text that has matched
the expression. If this is the case, try this:

<(\w+)>(\d{5} .*)</\1>

Group two will contain the text you are after.
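
As a rough illustration of the capturing pattern above, here is a minimal sketch. It uses Python's `re` module rather than .NET's Regex class (an assumption made only so the snippet is self-contained; this particular pattern uses syntax both engines share), and the helper name `text_between` is hypothetical. One thing it makes visible: the backreference \1 requires the closing tag to have the same name as the opening tag, so the literal <startTag>...</aClosingTag> sample from the question does not match.

```python
import re

# Group 1 is the tag name; group 2 is the text between the tags.
# \1 forces the closing tag to repeat the opening tag's name.
BACKREF = re.compile(r"<(\w+)>(\d{5} .*)</\1>")


def text_between(line):
    """Return the text captured by group 2, or None if the line does not match."""
    match = BACKREF.search(line)
    return match.group(2) if match else None


print(text_between("<p>55555 any text</p>"))                   # 55555 any text
print(text_between("<startTag>55555 any text</aClosingTag>"))  # None
```

If the opening and closing names really can differ, replacing </\1> with </\w+> would be one way to relax the pattern, at the cost of no longer checking that the tags pair up.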
Oct 11 '08 #3

On Oct 11, 1:22 pm, kimiraikkonen <kimiraikkone...@gmail.com> wrote:
> On Oct 11, 6:42 am, "Andy B" <a_bo...@sbcglobal.net> wrote:
>> I need to create a regular expression that will match a 5 digit number, a
>> space and then anything up to but not including the next closing html tag.
>> Here is an example:
>> <startTag>55555 any text</aClosingTag>
>> I need a Regex that will get all of the text between the html tags above
>> (the html tags are random and i do not know them before hand). The match
>> string always starts with at least 5 digits.
>
> Hi Andy,
> There's a nice function on that link which retrieves the text between
> the tags: http://www.4guysfromrolla.com/demos/StripHTML1.asp
>
> It also provides to test the function online as you see, when you
> enter the line:
> <startTag>55555 any text</aClosingTag> in textbox on the page, it
> returns "55555 any text", presumably what you want.
>
> Then you can use the same function in your project to benefit.
>
> Hope this helps,
>
> Onur Güzel

Andy,
I revised the code a bit. Paste the following to get the text between the
HTML tags.

In the sample, strToSearch is the line from your post:

'-----------------------------------------
' Requires at the top of the file: Imports System.Text.RegularExpressions

Dim strToSearch As String
' Your HTML line, including its tags
strToSearch = "<startTag>55555 any text</aClosingTag>"

' Initialize the Regex with a pattern that matches any tag
Dim objRegExp As New Regex("<(.|\n)+?>")

' Define output variable
Dim strOutput As String

' Replace all HTML tag matches with the empty string
strOutput = objRegExp.Replace(strToSearch, "")

' Replace any remaining < and > with &lt; and &gt;
strOutput = Replace(strOutput, "<", "&lt;")
strOutput = Replace(strOutput, ">", "&gt;")

' Show the result in a MsgBox
' Displays "55555 any text"
MsgBox(strOutput)

objRegExp = Nothing
'----------------------------------------------

Hope it's better,

Onur Güzel
Oct 11 '08 #4

Two recommendations: 1)
http://msdn.microsoft.com/en-us/library/az24scfc.aspx and 2) a free product
named Expresso from www.ultrapico.com.

Also, having read some of the other replies, \d{5} matches exactly 5
digits, but since you said the string "always starts with at least 5
digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
may work better for you.
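
To make both points concrete, here is a small sketch using Python's `re` module as a stand-in for .NET (an assumption; the quantifier syntax shown is the same in both). The sample strings are invented for illustration. The first pair contrasts \d{5} with \d{5,} on a seven-digit number; the second pair contrasts greedy .* with lazy .*? when two items sit on one line.

```python
import re

# \d{5} followed by a space fails when the number has more than five digits;
# \d{5,} accepts five or more.
long_number = "<p>1234567 Seven-digit item</p>"
exactly_five = re.search(r"<p>(\d{5}) ", long_number)
five_or_more = re.search(r"<p>(\d{5,}) ", long_number)

# Two items on one line: greedy .* runs to the last </p>,
# lazy .*? stops at the first one.
two_items = "<p>11111 First</p><p>22222 Second</p>"
greedy = re.search(r"<p>(\d{5,} .*)</p>", two_items).group(1)
lazy = re.search(r"<p>(\d{5,} .*?)</p>", two_items).group(1)

print(exactly_five)           # None
print(five_or_more.group(1))  # 1234567
print(greedy)                 # 11111 First</p><p>22222 Second
print(lazy)                   # 11111 First
```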

Do some reading, get Expresso and experiment with the suggestions provided
in the other replies. Regular expressions are very useful. Learning
something about them will pay a high dividend.

I am concerned about the fact that the html tags are "random". Depending on
what else is in the file you may have problems avoiding stuff you do not
want.

Good luck, Bob

"Andy B" <a_*****@sbcglobal.net> wrote in message
news:OK**************@TK2MSFTNGP06.phx.gbl...
> I need to create a regular expression that will match a 5 digit number, a
> space and then anything up to but not including the next closing html tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>
>
> I need a Regex that will get all of the text between the html tags above
> (the html tags are random and i do not know them before hand). The match
> string always starts with at least 5 digits.

Oct 11 '08 #5

"I am concerned about the fact that the html tags are "random". Depending
on what else is in the file you may have problems avoiding stuff you do not
want."

Hi. The HTML I am searching is not malformed. What I have is a list of
items on a page, each starting with a number of at least 5 digits (\d{5,})
followed by the item title. Directions for each item may or may not be
given. Each item block (number, title and directions) is in a <p></p>
element. If there are directions for the item, there will be a <br /> after
the title. If there are no directions for the item, the title ends at </p>
[the end of the p element in question]. Here are a few examples:

<p>11111 This item has no directions</p>

<p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit
down</p>

This is what I want to do with the Regex object:
1. Return a Match collection containing all p elements starting with at
least a 5-digit number.
2. Test for the <br /> HTML tag. If it exists, split the title before
the <br /> and the directions after the <br /> into separate regex groups.
3. Drop the HTML tags from the output.

Can this be done with a single regular expression?

Oct 11 '08 #6

On Sat, 11 Oct 2008 13:22:31 -0400, "eBob.com"
<fa******@totallybogus.com> wrote:
> Two recommendations: 1)
> http://msdn.microsoft.com/en-us/library/az24scfc.aspx and a free product
> named Expresso from www.ultrapico.com.
>
> Also, having read some of the other replies, \d{5} matches exactly 5
> characters, but since you said the string "always starts with at least 5
> digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
> may work better for you.

I believe he will want greedy repetition. I'm fairly certain that
resorting to laziness will return undesired results. Consider the
following example, and try it with both greedy and lazy repetition:

<p>11111 ABC <p>123</p> DEF</p>
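
A quick way to try that comparison, sketched with Python's `re` module in place of .NET (an assumption; greedy and lazy quantifiers behave the same way in both). It assumes the sample reads <p>11111 ABC <p>123</p> DEF</p>, and the pattern is a simplified one written for this illustration, not the exact expression from the earlier reply.

```python
import re

nested = "<p>11111 ABC <p>123</p> DEF</p>"

# Greedy: .* first runs to the end of the text, then backs up to the last </p>.
greedy = re.search(r"<p>(\d{5,} .*)</p>", nested).group(1)

# Lazy: .*? grows one character at a time and stops at the first </p>.
lazy = re.search(r"<p>(\d{5,} .*?)</p>", nested).group(1)

print(greedy)  # 11111 ABC <p>123</p> DEF
print(lazy)    # 11111 ABC <p>123
```

With a nested element the lazy version cuts the text short, which is the undesired result described above. The trade-off reverses when several items share one line, where the greedy version runs across item boundaries.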

> Do some reading, get Expresso and experiment with the suggestions provided
> in the other replies. Regular expressions are very useful. Learning
> something about them will pay a high dividend.

Agreed. However, I do not use Expresso. I do have it installed, but
it unfortunately does not work correctly with many regular expressions
I use.

> I am concerned about the fact that the html tags are "random". Depending on
> what else is in the file you may have problems avoiding stuff you do no
> want.

I cannot imagine problems he may run into (unless he was to use lazy
repetition as you suggested). Of course, I have not created an
exhaustive set of test scenarios...

> Good luck, Bob
Oct 11 '08 #7

On Sat, 11 Oct 2008 15:09:13 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> Hi. The html I am searching in is not mal formed. What I have is a list of
> items on a page that start with at least a 5 digit number (\d{5,}) and then
> the item title. There are directions for each item that may or may not be
> given. Each item block (number, title and directions) are in a <p></p>
> element. If there are directions for the item, there will be a <br /> after
> the title. If there are no directions for the item, the title ends at </p>
> [the end of the p element in question]. Here are a few examples:
>
> <p>11111 This item has no directions</p>
>
> <p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit
> down</p>
>
> This is what I want to do with the Regex object:
> 1. Return a Match collection containing all p elements starting with at
> least a 5 digit number.
> 2. Test for the <br /> html tag. If it does exist, split the title before
> the <br /> and the directions after the <br /> into seperate regex groups.
> 3. Drop the html tags from the output.
>
> Can this be done with 1 Regex expression?

My suggestion for this is to take care of the html breaks later in
your code after you've captured the text you want. You're trying to
do too much in regular expressions, and it will become obnoxiously
complex.
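
A minimal sketch of that two-step approach, written with Python's `re` module as a stand-in for .NET (an assumption). The names `ITEM`, `BREAK` and `parse_items` are hypothetical, and the lazy .*? is a choice made here on the assumption that <p> elements are never nested: the regex only captures the number and the paragraph body, and ordinary code splits the body at the first <br />.

```python
import re

# Step 1: capture the number and everything else inside the paragraph.
ITEM = re.compile(r"<p>(\d{5,})\s+(.*?)</p>", re.DOTALL)
# Step 2 is plain string handling: split the body at the first <br />.
BREAK = re.compile(r"<br\s*/?>")


def parse_items(html):
    """Return a list of (number, title, steps) tuples; steps is None if absent."""
    items = []
    for number, body in ITEM.findall(html):
        parts = BREAK.split(body, maxsplit=1)
        title = parts[0].strip()
        steps = parts[1].strip() if len(parts) > 1 else None
        items.append((number, title, steps))
    return items


page = (
    "<p>11111 This item has no directions</p>\n"
    "<p>22222 This item has directions<br />"
    "1. Stand up. 2. Turn around. 3. Sit down</p>"
)
for item in parse_items(page):
    print(item)
```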
Oct 11 '08 #8

"My suggestion for this is to take care of the html breaks later in your
code after you've captured the text you want. You're trying to do too much
in regular expressions, and it will become obnoxiously complex."

After a little bit of homework, I came up with this so far:

<p>(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)</p>

The above works like a dream and I can get the text I need captured to the
Number, Title and Steps groups. Now I need to match the same exact thing but
without the Steps section. The example is: <p>11122 Title without steps</p>.
I need to take the results of both of these matches and put them all inside
of a single Match object. How do I do this?

Oct 12 '08 #9

On Sat, 11 Oct 2008 20:38:03 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> After a little bit of homework, I came up with this so far:
>
> <p>(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)</p>
>
> The above works like a dream and I can get the text I need captured to the
> Number, Title and Steps groups. Now I need to match the same exact thing but
> without the Steps section. The example is: <p>11122 Title without steps</p>.
> I need to take the results of both of these matches and put them all inside
> of a single Match object. How do I do this?

As I had already suggested, use a single group that captures
everything and handle splitting at the break later in your code. If,
for whatever reason, you don't want to do this, an alternative is to
create two regular expressions: one that matches text containing a
break and one that matches text without one.

A single regular expression that does what you want may be possible,
but I don't want to create it. Go with either of the two approaches I
posted above.
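
For readers curious what a single expression might look like, here is one possible sketch, offered as an illustration rather than as the approach recommended in this reply. It uses Python's `re` module, where named groups are written (?P<Name>...) instead of .NET's (?<Name>...); the optional non-capturing group around the <br /> part, the lazy quantifiers, and the added \s+ after the number are all assumptions made for this example.

```python
import re

# The (?:<br />...)? wrapper makes the Steps part optional.
# Lazy .*? keeps Title from swallowing the <br /> and the steps.
ITEM = re.compile(
    r"<p>(?P<Number>\d{5,})\s+(?P<Title>.*?)(?:<br />(?P<Steps>.*?))?</p>",
    re.DOTALL,
)

page = (
    "<p>11111 This item has no directions</p>\n"
    "<p>22222 This item has directions<br />"
    "1. Stand up. 2. Turn around. 3. Sit down</p>"
)

# Each match exposes all three groups; Steps is None when there is no <br />.
matches = [m.groupdict() for m in ITEM.finditer(page)]
for item in matches:
    print(item)
```

Whether this is easier to maintain than capturing the whole paragraph and splitting it afterwards is a judgment call; the pattern is already noticeably harder to read.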
Oct 12 '08 #10


<ja***@onepost.net> wrote in message
news:kr********************************@4ax.com...
> < snip >
> Agreed. However, I do not use Expresso. I do have it installed, but
> it unfortunately does not work correctly with many regular expressions
> I use.
> < snip >

Can you elaborate? I've always assumed that Expresso uses .Net
RegularExpressions and that it would therefore be impossible for Expresso to
get a result different from a program using the same regex and options. The
only problem I've experienced with Expresso is that when it reads "Sample
Text" some characters, such as ñ (n with a tilde over it), get changed.

Bob
Oct 12 '08 #11

On Sun, 12 Oct 2008 10:52:06 -0400, "eBob.com"
<fa******@totallybogus.com> wrote:
> Can you elaborate? I've always assumed that Expresso uses .Net
> RegularExpressions and that it would therefore be impossible for Expresso to
> get a result different from a program using the same regex and options. The
> only problem I've experienced with Expresso is that when it reads "Sample
> Text" some characters, such as ñ (n with a tilde over it), get changed.
>
> Bob

.NET's regular expression support works fine. What I was referring to
is a bug (or undesired feature) specific to the version of Expresso I
currently have installed (version 3.0.2766.13570).

Take the following regular expression:

^\$\d+(?:\.\d{1,2}|)$

This is valid and works fine under .NET. It will match text that
contains a dollar sign followed by digits with or without hundredths.
Now try this in Expresso with a couple of test scenarios and watch
what happens.

Here are some test cases that all match:

$10
$99.99
$5.50
$0

When I click 'Run Match', I get zero matches. This is a bug. The
expression matches all lines of text. To confirm this, click
'Validate'. You will see that all lines match in this case.
Oct 13 '08 #12


<ja***@onepost.net> wrote in message
news:kr********************************@4ax.com...
> <snip>
> I believe he will want greediness repetition. I'm fairly certain that
> resorting to laziness will return undesired results. Consider the
> following example, and try it with both greedy and lazy repetition:
>
> <p>11111 ABC <p>123</p> DEF</p>

I have to admit that I am not sure I fully understand the difference between
".*" and ".*?". I can almost recite what the doc says, but that's not the
same as fully understanding. I haven't played with the example you gave yet
but I hope to today.

BUT ... in general I have found that ".*?" works better for me than ".*". I
had an interesting experience just yesterday. I developed a regex (using
Expresso) with approximately a half dozen uses of ".*". Actually, given my
experience, I was going to use ".*?", but remembering your post I decided to
use ".*". The resulting expression worked, but was taking over 1.7 seconds
to find a relatively short string in a relatively small file! Since this
expression would be used against over a thousand files I could not tolerate
such poor performance. So, not having any better ideas, I just changed all
of the uses of ".*" to ".*?". The expression still worked and took so little
CPU that it was not measurable.

I am not disagreeing with you, I am just reporting my experience.

Bob
Oct 13 '08 #13


<ja***@onepost.net> wrote in message
news:77********************************@4ax.com...
> On Sun, 12 Oct 2008 10:52:06 -0400, "eBob.com"
> <fa******@totallybogus.com> wrote:
>> Can you elaborate? I've always assumed that Expresso uses .Net
>> RegularExpressions and that it would therefore be impossible for Expresso
>> to get a result different from a program using the same regex and options.
>> The only problem I've experienced with Expresso is that when it reads
>> "Sample Text" some characters, such as ñ (n with a tilde over it), get
>> changed.
>>
>> Bob
>
> .NET's regular expression support works fine. What I was referring to
> is a bug (or undesired feature) specific to the version of Expresso I
> currently have installed (version 3.0.2766.13570).
>
> Take the following regular expression:
>
> ^\$\d+(?:\.\d{1,2}|)$
>
> This is valid and works fine under .NET. It will match text that
> contains a dollar sign followed by digits with or without hundredths.
> Now try this in Expresso with a couple of test scenarios and watch
> what happens.
>
> Here's some test cases that all match:
>
> $10
> $99.99
> $5.50
> $0
>
> When I click 'Run Match', I get zero matches. This is a bug. The
> expression matches all lines of text. To confirm this, click
> 'Validate'. You will see that all lines match in this case.

I have the same level of Expresso (I think that we have the latest) and I
have the same experience with your expression and sample text. As I am sure
you know, but for the benefit of others who might be listening in, there's
no problem if you remove the $ at the end of your expression. Which I
understand may not be the expression which you need. I would agree that it
is an Expresso bug. But even so I can't imagine developing a non-trivial
regular expression without it. Have you reported this bug to Ultrapico? I
notice at the moment that the web site is down. I hope that doesn't mean
anything!

Thanks for making me aware of this.

Bob
Oct 13 '08 #14

>>> Bob
>>
>> .NET's regular expression support works fine. What I was referring to
>> is a bug (or undesired feature) specific to the version of Expresso I
>> currently have installed (version 3.0.2766.13570).
>>
>> Take the following regular expression:
>>
>> ^\$\d+(?:\.\d{1,2}|)$
>>
>> This is valid and works fine under .NET. It will match text that
>> contains a dollar sign followed by digits with or without hundredths.
>> Now try this in Expresso with a couple of test scenarios and watch
>> what happens.
>>
>> Here's some test cases that all match:
>>
>> $10
>> $99.99
>> $5.50
>> $0
>>
>> When I click 'Run Match', I get zero matches. This is a bug. The
>> expression matches all lines of text. To confirm this, click
>> 'Validate'. You will see that all lines match in this case.
>
> I have the same level of Expresso (I think that we have the latest) and I
> have the same experience with your expression and sample text. As I am sure
> you know, but for the benefit of others who might be listening in, there's
> no problem if you remove the $ at the end of your expression. Which I
> understand may not be the expression which you need. I would agree that it
> is an Expresso bug. But even so I can't imagine developing a non-trivial
> regular expression without it. Have you reported this bug to Ultrapico? I
> notice at the moment that the web site is down. I hope that doesn't mean
> anything!
>
> Thanks for making me aware of this.
>
> Bob

I admit that this is confusing, but it is not a bug in Expresso.
Regular expressions are very literal and you have to remember that a
Windows text file has line termination characters that have to be
matched properly. Specifically, each line ends with "\r\n" (carriage
return, line feed). The regular expression in your example properly
matches each of the examples if it is all by itself without any line
termination. (Try using any of the examples as the only text in the
"Sample Text" box, without a new line). If you use a number of
examples on separate lines, it will not work, just as it would not
work in code, unless you also match the carriage return character at
the end of each line. Try this regex, for example:

^\$\d+(?:\.\d{1,2}|)\r?$

This searches for your string, matches zero or one carriage returns,
then looks for the end of the string. (Be sure to turn OFF the
"Multiline" option, which has a confusing name). It will match every
line in your example.

The "Validate Line by Line" tool was designed specifically to avoid
this confusion. All it does is take each line individually, without
any line termination characters and apply the regex to that line,
showing whether it matches the whole line, part of it, or none of it.
If you are expecting your text to have no embedded line termination,
it is the ideal tool to use. If you want to know what will happen if
the text has carriage returns, you should use the "Run Match" tool.

This is definitely confusing, but the goal of Expresso's design is to
show you exactly what would happen if you used the regex in your code.
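
The line-termination effect described here can be reproduced outside Expresso. The sketch below uses Python's `re` module as a stand-in for .NET (an assumption; in both engines "$" in multiline mode matches just before "\n", not before "\r"). Note that in Python the per-line matches rely on `re.MULTILINE` being set; how Expresso's "Multiline" checkbox maps onto that flag is not something this sketch can confirm.

```python
import re

PATTERN = r"^\$\d+(?:\.\d{1,2}|)$"
PATTERN_CR = r"^\$\d+(?:\.\d{1,2}|)\r?$"

# Windows-style text: every line ends with "\r\n".
text = "$10\r\n$99.99\r\n$5.50\r\n$0\r\n"

# Whole-text search: "$" only matches just before "\n", and "\r" is in the way.
plain = re.findall(PATTERN, text, re.MULTILINE)

# Allowing an optional "\r" before the end of each line fixes that.
with_cr = [m.rstrip("\r") for m in re.findall(PATTERN_CR, text, re.MULTILINE)]

# Line-by-line check, roughly what the validation tool is described as doing:
# splitlines() drops the terminators, so the original pattern matches.
per_line = [line for line in text.splitlines() if re.search(PATTERN, line)]

print(plain)     # []
print(with_cr)   # ['$10', '$99.99', '$5.50', '$0']
print(per_line)  # ['$10', '$99.99', '$5.50', '$0']
```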
Oct 14 '08 #15
