using a regular expression to match up to but not including html start/end tags

Andy B

I need to create a regular expression that will match a 5 digit number, a
space and then anything up to but not including the next closing html tag.
Here is an example:

<startTag>55555 any text</aClosingTag>

I need a Regex that will get all of the text between the html tags above
(the html tags are random and i do not know them before hand). The match
string always starts with at least 5 digits.

Oct 11 '08 #1

kimiraikkonen

On Oct 11, 6:42 am, "Andy B" <a_bo...@sbcglobal.net> wrote:

>I need to create a regular expression that will match a 5 digit number, a
>space and then anything up to but not including the next closing html tag.
>Here is an example:
>
><startTag>55555 any text</aClosingTag>
>
>I need a Regex that will get all of the text between the html tags above
>(the html tags are random and i do not know them before hand). The match
>string always starts with at least 5 digits.

Hi Andy,
There's a nice function at this link which retrieves the text between
the tags:
http://www.4guysfromrolla.com/demos/StripHTML1.asp

The page also lets you test the function online. When you enter the line
<startTag>55555 any text</aClosingTag> in the textbox on the page, it
returns "55555 any text", which is presumably what you want.

You can then use the same function in your own project.

Hope this helps,

Onur Güzel

Oct 11 '08 #2

jamil

On Fri, 10 Oct 2008 23:42:10 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:

>I need to create a regular expression that will match a 5 digit number, a
>space and then anything up to but not including the next closing html tag.
>Here is an example:
>
><startTag>55555 any text</aClosingTag>

If you are just interested in a match, try this:

<(\w+)>\d{5} .*</\1>

Note the space above (copy as is).

>I need a Regex that will get all of the text between the html tags above
>(the html tags are random and i do not know them before hand). The match
>string always starts with at least 5 digits.

The above seems to imply you wish to capture the text that has matched
the expression. If this is the case, try this:

<(\w+)>(\d{5} .*)</\1>

Group two will contain the text you are after.
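
For anyone who wants to try the capturing pattern quickly, here is a minimal sketch. It uses Python's `re` module purely as a stand-in for the .NET Regex class (an assumption of this example; the pattern itself is copied unchanged), and the helper name `extract_item_text` is made up for illustration.

```python
import re

# Pattern from the post: group 1 is the tag name, group 2 is the text.
# The backreference \1 requires the closing tag to repeat the opening name.
ITEM = re.compile(r"<(\w+)>(\d{5} .*)</\1>")

def extract_item_text(html):
    """Return the text between matching tags, or None if nothing matches."""
    m = ITEM.search(html)
    return m.group(2) if m else None

print(extract_item_text("<p>55555 any text</p>"))                    # 55555 any text
print(extract_item_text("<startTag>55555 any text</aClosingTag>"))   # None
```

Note that the second call returns None: the sample line in the question uses different opening and closing tag names, which the backreference rejects.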

Oct 11 '08 #3

kimiraikkonen

Andy,
I revised the code a bit. Paste the code below to get the text between
the HTML tags.

In the sample, strToSearch holds the line from your post:

'-----------------------------------------
' Requires "Imports System.Text.RegularExpressions" at the top of the file

Dim strToSearch As String
' Your HTML line including its tags
strToSearch = "<startTag>55555 any text</aClosingTag>"

' Initialize the Regex with a pattern that matches any HTML tag
Dim objRegExp As New Regex("<(.|\n)+?>")

' Define output variable
Dim strOutput As String

' Replace all HTML tag matches with the empty string
strOutput = objRegExp.Replace(strToSearch, "")

' Replace any remaining < and > with &lt; and &gt;
strOutput = Replace(strOutput, "<", "&lt;")
strOutput = Replace(strOutput, ">", "&gt;")

' Show result in MsgBox
' Returns "55555 any text"
MsgBox(strOutput)

objRegExp = Nothing
'----------------------------------------------

Hope it's better,

Onur Güzel

Oct 11 '08 #4

eBob.com

Two recommendations: 1)
http://msdn.microsoft.com/en-us/library/az24scfc.aspx and 2) a free product
named Expresso from www.ultrapico.com.

Also, having read some of the other replies, \d{5} matches exactly 5
digits, but since you said the string "always starts with at least 5
digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
may work better for you.
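
A small sketch of the \d{5} versus \d{5,} point, written in Python's `re` as a stand-in for .NET (the sample strings are invented for this example). The `^` anchor is added here only so the digits must start the string; without some anchor, \d{5} followed by a space would happily match the last five digits of a longer number.

```python
import re

exactly_five = re.compile(r"^\d{5} ")   # five digits, then a space
five_or_more = re.compile(r"^\d{5,} ")  # at least five digits, then a space

for sample in ("55555 any text", "1234567 long number"):
    print(sample, bool(exactly_five.match(sample)), bool(five_or_more.match(sample)))
```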

Do some reading, get Expresso and experiment with the suggestions provided
in the other replies. Regular expressions are very useful. Learning
something about them will pay a high dividend.

I am concerned about the fact that the html tags are "random". Depending on
what else is in the file you may have problems avoiding stuff you do not
want.

Good luck, Bob

Oct 11 '08 #5

Andy B

"I am concerned about the fact that the html tags are "random". Depending
on what else is in the file you may have problems avoiding stuff you do not
want."

Hi. The html I am searching in is not malformed. What I have is a list of
items on a page that start with at least a 5 digit number (\d{5,}) and then
the item title. There are directions for each item that may or may not be
given. Each item block (number, title and directions) is in a <p>
element. If there are directions for the item, there will be a <br />
before them, and then the </p> [the end of the p element in question].
Here are a few examples:

<p>11111 This item has no directions</p>

<p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit down</p>

This is what I want to do with the Regex object:
1. Return a Match collection containing all p elements starting with at
least a 5 digit number.
2. Test for the <br /> html tag. If it does exist, split the title before
the <br /> and the directions after the <br /> into separate regex groups.
3. Drop the html tags from the output.

Can this be done with 1 Regex expression?

Oct 11 '08 #6

jamil

On Sat, 11 Oct 2008 13:22:31 -0400, "eBob.com"
<fa******@totallybogus.com> wrote:

>Two recommendations: 1)
>http://msdn.microsoft.com/en-us/library/az24scfc.aspx and a free product
>named Expresso from www.ultrapico.com.
>
>Also, having read some of the other replies, \d{5} matches exactly 5
>characters, but since you said the string "always starts with at least 5
>digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
>may work better for you.

I believe he will want greedy repetition. I'm fairly certain that
resorting to laziness will return undesired results. Consider the
following example, and try it with both greedy and lazy repetition:

11111 ABC 123</p>DEF

>Do some reading, get Expresso and experiment with the suggestions provided
>in the other replies. Regular expressions are very useful. Learning
>something about them will pay a high dividend.

Agreed. However, I do not use Expresso. I do have it installed, but
it unfortunately does not work correctly with many regular expressions
I use.

>I am concerned about the fact that the html tags are "random". Depending on
>what else is in the file you may have problems avoiding stuff you do not
>want.

I cannot imagine what problems he might run into (unless he were to use
lazy repetition as you suggested). Of course, I have not created an
exhaustive set of test scenarios...

>Good luck, Bob

Oct 11 '08 #7

jamil

My suggestion for this is to take care of the html breaks later in
your code after you've captured the text you want. You're trying to
do too much in regular expressions, and it will become obnoxiously
complex.
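
A minimal sketch of that "capture first, split later" idea, in Python's `re` as a stand-in for .NET. It assumes each item sits in a `<p>` element and that a `<br />` separates title from directions, as described earlier in the thread; the names `parse_items`, `ITEM` and `BREAK` are invented for this example.

```python
import re

# Capture the number and everything else up to the closing </p>.
ITEM = re.compile(r"<p>(\d{5,})\s+(.*?)</p>", re.DOTALL)
# Handle the break in ordinary code afterwards.
BREAK = re.compile(r"<br\s*/?>")

def parse_items(html):
    """Return (number, title, steps) tuples; steps is None when absent."""
    items = []
    for m in ITEM.finditer(html):
        parts = BREAK.split(m.group(2), maxsplit=1)
        title = parts[0].strip()
        steps = parts[1].strip() if len(parts) > 1 else None
        items.append((m.group(1), title, steps))
    return items

html = ("<p>11111 This item has no directions</p>\n"
        "<p>22222 This item has directions<br />1. Stand up. 2. Turn around.</p>")
for item in parse_items(html):
    print(item)
```

The regex stays short, and the title/directions logic lives in plain code where it is easier to change.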

Oct 11 '08 #8

Andy B

"My suggestion for this is to take care of the html breaks later in your
code after you've captured the text you want. You're trying to do too much
in regular expressions, and it will become obnoxiously complex."

After a little bit of homework, I came up with this so far:

(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)

The above works like a dream and I can get the text I need captured to the
Number, Title and Steps groups. Now I need to match the same exact thing but
without the Steps section. The example is: 11122 Title without steps.
I need to take the results of both of these matches and put them all inside
of a single Match object. How do I do this?

Oct 12 '08 #9

jamil

As I had already suggested, use a single group that captures
everything and handle splitting the break up later in your code. If,
for whatever reasons, you don't want to do this, an alternative is to
create two regular expressions-- one that matches text containing a
break and one that does not.

A single regular expression that does what you want may be possible,
but I don't want to create it. Go with either of the two approaches I
posted above.
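
For completeness, a single pattern is possible if the Steps part is wrapped in an optional group. The following is only a sketch under assumptions: items are in `<p>` elements, a `<br />` precedes the directions, and Python's `re` stands in for .NET (Python spells named groups `(?P<Name>...)` where .NET uses `(?<Name>...)`). The function name `find_items` is hypothetical.

```python
import re

ITEM = re.compile(
    r"<p>(?P<Number>\d{5,})\s+(?P<Title>.*?)"   # number, then a lazy title
    r"(?:<br\s*/?>(?P<Steps>.*?))?</p>",         # optional break plus steps
    re.DOTALL,
)

def find_items(html):
    """Return one dict per item; 'Steps' is None when there is no break."""
    return [m.groupdict() for m in ITEM.finditer(html)]

sample = ("<p>11122 Title without steps.</p>"
          "<p>22222 This item has directions<br />1. Stand up.</p>")
for item in find_items(sample):
    print(item)
```

Both kinds of item come back from the same set of matches, with the Steps group simply unset for items that have no directions.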

Oct 12 '08 #10

eBob.com

<ja***@onepost.net> wrote in message
news:kr********************************@4ax.com...

>< snip >
>Agreed. However, I do not use Expresso. I do have it installed, but
>it unfortunately does not work correctly with many regular expressions
>I use.
>< snip >

Can you elaborate? I've always assumed that Expresso uses .Net
RegularExpressions and that it would therefore be impossible for Expresso to
get a result different from a program using the same regex and options. The
only problem I've experienced with Expresso is that when it reads "Sample
Text" some characters, such as ñ (n with a tilde over it), get changed.

Bob

Oct 12 '08 #11

jamil

.NET's regular expression support works fine. What I was referring to
is a bug (or undesired feature) specific to the version of Expresso I
currently have installed (version 3.0.2766.13570).

Take the following regular expression:

^\$\d+(?:\.\d{1,2}|)$

This is valid and works fine under .NET. It will match text that
contains a dollar sign followed by digits with or without hundredths.
Now try this in Expresso with a couple of test scenarios and watch
what happens.

Here are some test cases that all match:

$10
$99.99
$5.50
$0

When I click 'Run Match', I get zero matches. This is a bug. The
expression matches all lines of text. To confirm this, click
'Validate'. You will see that all lines match in this case.

Oct 13 '08 #12

eBob.com

<ja***@onepost.net> wrote in message
news:kr********************************@4ax.com...

><snip>
>I believe he will want greediness repetition. I'm fairly certain that
>resorting to laziness will return undesired results. Consider the
>following example, and try it with both greedy and lazy repetition:
>
>11111 ABC 123</p>DEF

I have to admit that I am not sure I fully understand the difference between
".*" and ".*?". I can almost recite what the doc says, but that's not the
same as fully understanding. I haven't played with the example you gave yet
but I hope to today.

BUT ... in general I have found that ".*?" works better for me than ".*". I
had an interesting experience just yesterday. I developed a regex (using
Expresso) with approximately a half dozen uses of ".*". Actually, given my
experience, I was going to use ".*?", but remembering your post I decided to
use ".*". The resulting expression worked, but was taking over 1.7 seconds
to find a relatively short string in a relatively small file! Since this
expression would be used against over a thousand files I could not tolerate
such poor performance. So, not having any better ideas, I just changed all
of the uses of ".*" to ".*?". The expression still worked and took so little
CPU that it was not measurable.

I am not disagreeing with you, I am just reporting my experience.
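
To make the difference between ".*" and ".*?" concrete, here is a small sketch in Python's `re` (a stand-in for .NET; the sample text with two adjacent `<p>` items is invented for this example). Greedy ".*" grabs as much as it can and then backs off to the last closing tag, while lazy ".*?" stops at the first one.

```python
import re

text = "<p>11111 first</p><p>22222 second</p>"

greedy = re.findall(r"<p>(\d{5} .*)</p>", text)
lazy = re.findall(r"<p>(\d{5} .*?)</p>", text)

print(greedy)  # ['11111 first</p><p>22222 second']
print(lazy)    # ['11111 first', '22222 second']
```

Which one is wanted depends on the input: when several items can share a line, the lazy form keeps them apart.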

Bob

Oct 13 '08 #13

eBob.com

<ja***@onepost.net> wrote in message
news:77********************************@4ax.com...

I have the same level of Expresso (I think that we have the latest) and I
have the same experience with your expression and sample text. As I am sure
you know, but for the benefit of others who might be listening in, there's
no problem if you remove the $ at the end of your expression. Which I
understand may not be the expression which you need. I would agree that it
is an Expresso bug. But even so I can't imagine developing a non-trivial
regular expression without it. Have you reported this bug to Ultrapico? I
notice at the moment that the web site is down. I hope that doesn't mean
anything!

Thanks for making me aware of this.

Bob

Oct 13 '08 #14

kottekoe

I admit that this is confusing, but it is not a bug in Expresso.
Regular expressions are very literal and you have to remember that a
Windows text file has line termination characters that have to be
matched properly. Specifically, each line ends with "\r\n" (carriage
return, line feed). The regular expression in your example properly
matches each of the examples if it is all by itself without any line
termination. (Try using any of the examples as the only text in the
"Sample Text" box, without a new line). If you use a number of
examples on separate lines, it will not work, just as it would not
work in code, unless you also match the carriage return character at
the end of each line. Try this regex, for example:

^\$\d+(?:\.\d{1,2}|)\r?$

This searches for your string, matches zero or one carriage returns,
then looks for the end of the line. (Be sure to turn ON the
"Multiline" option, which has a confusing name: it only changes where ^
and $ match.) It will match every line in your example.
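
The effect can be reproduced outside Expresso. Below is a rough sketch in Python's `re`, assumed here to behave like .NET in the relevant respect: in multi-line mode `$` matches just before "\n", so a preceding "\r" gets in the way. The helper name `match_lines` is made up, and the sample text is given Windows line endings on every line.

```python
import re

text = "$10\r\n$99.99\r\n$5.50\r\n$0\r\n"  # Windows line endings

ORIGINAL = r"^\$\d+(?:\.\d{1,2}|)$"
WITH_CR = r"^\$\d+(?:\.\d{1,2}|)\r?$"

def match_lines(pattern, sample):
    """Run the pattern in multi-line mode and strip any matched '\\r'."""
    return [m.group(0).rstrip("\r")
            for m in re.finditer(pattern, sample, re.MULTILINE)]

print(match_lines(ORIGINAL, text))  # []
print(match_lines(WITH_CR, text))   # ['$10', '$99.99', '$5.50', '$0']
```

Without multi-line mode neither pattern matches this text at all, because ^ and $ then refer only to the start and end of the whole string.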

The "Validate Line by Line" tool was designed specifically to avoid
this confusion. All it does is take each line individually, without
any line termination characters and apply the regex to that line,
showing whether it matches the whole line, part of it, or none of it.
If you are expecting your text to have no embedded line termination,
it is the ideal tool to use. If you want to know what will happen if
the text has carriage returns, you should use the "Run Match" tool.

This is definitely confusing, but the goal of Expresso's design is to
show you exactly what would happen if you used the regex in your code.

Oct 14 '08 #15
