using a regular expression to match up to but not including html start/end tags

I need to create a regular expression that will match a 5-digit number, a
space and then anything up to but not including the next closing HTML tag.
Here is an example:

<startTag>55555 any text</aClosingTag>

I need a Regex that will get all of the text between the HTML tags above
(the HTML tags are random and I do not know them beforehand). The match
string always starts with at least 5 digits.

Oct 11 '08 #1
On Oct 11, 6:42 am, "Andy B" <a_bo...@sbcglobal.net> wrote:
> I need to create a regular expression that will match a 5-digit number, a
> space and then anything up to but not including the next closing HTML tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>
>
> I need a Regex that will get all of the text between the HTML tags above
> (the HTML tags are random and I do not know them beforehand). The match
> string always starts with at least 5 digits.

Hi Andy,
There's a nice function at this link that retrieves the text between
the tags:
http://www.4guysfromrolla.com/demos/StripHTML1.asp

The page also lets you test the function online. When you enter the line
<startTag>55555 any text</aClosingTag> in the textbox on the page, it
returns "55555 any text", presumably what you want.

You can then use the same function in your project.

Hope this helps,

Onur Güzel
Oct 11 '08 #2
On Fri, 10 Oct 2008 23:42:10 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> I need to create a regular expression that will match a 5-digit number, a
> space and then anything up to but not including the next closing HTML tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>

If you are just interested in a match, try this:

<(\w+)>\d{5} .*</\1>

Note the space above (copy as is).

> I need a Regex that will get all of the text between the HTML tags above
> (the HTML tags are random and I do not know them beforehand). The match
> string always starts with at least 5 digits.

The above seems to imply you wish to capture the text that has matched
the expression. If this is the case, try this:

<(\w+)>(\d{5} .*)</\1>

Group two will contain the text you are after.
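
To illustrate the capturing pattern above, here is a minimal sketch in Python (chosen only for illustration; the thread targets .NET, where the same pattern text should apply). The helper name capture_between_tags is hypothetical. One thing the sketch makes visible: the \1 backreference requires the closing tag name to equal the opening one, so the mismatched <startTag>...</aClosingTag> sample from the question would not match.

```python
import re

# Pattern from the post: group 1 is the tag name, group 2 the wanted text.
TAGGED_TEXT = re.compile(r"<(\w+)>(\d{5} .*)</\1>")


def capture_between_tags(line):
    """Return the text between matching tags, or None if nothing matches."""
    match = TAGGED_TEXT.search(line)
    return match.group(2) if match else None


# Opening and closing tag names agree, so group 2 is captured.
print(capture_between_tags("<p>55555 any text</p>"))  # 55555 any text

# Tag names differ, so the \1 backreference cannot match.
print(capture_between_tags("<startTag>55555 any text</aClosingTag>"))  # None
```

If the tag names really are unrelated, the backreference would have to be replaced by something looser, such as a second \w+.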
Oct 11 '08 #3
On Oct 11, 1:22 pm, kimiraikkonen <kimiraikkone...@gmail.com> wrote:
> On Oct 11, 6:42 am, "Andy B" <a_bo...@sbcglobal.net> wrote:
> > I need to create a regular expression that will match a 5-digit number, a
> > space and then anything up to but not including the next closing HTML tag.
> > Here is an example:
> > <startTag>55555 any text</aClosingTag>
> > I need a Regex that will get all of the text between the HTML tags above
> > (the HTML tags are random and I do not know them beforehand). The match
> > string always starts with at least 5 digits.
>
> Hi Andy,
> There's a nice function at this link that retrieves the text between
> the tags: http://www.4guysfromrolla.com/demos/StripHTML1.asp
>
> The page also lets you test the function online. When you enter the line
> <startTag>55555 any text</aClosingTag> in the textbox on the page, it
> returns "55555 any text", presumably what you want.
>
> You can then use the same function in your project.
>
> Hope this helps,
>
> Onur Güzel

Andy,
I revised the code a bit. Here is the code to get the text between
HTML tags:

In the sample, strToSearch is the line from your post:

'-----------------------------------------
' Requires "Imports System.Text.RegularExpressions" at the top of the file
Dim strToSearch As String
' Your HTML line including its tags
strToSearch = "<startTag>55555 any text</aClosingTag>"

' Initialize the Regex with a pattern that lazily matches any tag
Dim objRegExp As New Regex("<(.|\n)+?>")

' Define output variable
Dim strOutput As String

' Replace all HTML tag matches with the empty string
strOutput = objRegExp.Replace(strToSearch, "")

' Replace any remaining < and > with &lt; and &gt;
strOutput = Replace(strOutput, "<", "&lt;")
strOutput = Replace(strOutput, ">", "&gt;")

' Show the result in a MsgBox
' Shows "55555 any text"
MsgBox(strOutput)

objRegExp = Nothing
'----------------------------------------------

Hope it's better,

Onur Güzel
Oct 11 '08 #4
Two recommendations: 1)
http://msdn.microsoft.com/en-us/library/az24scfc.aspx and 2) a free product
named Expresso from www.ultrapico.com.

Also, having read some of the other replies, \d{5} matches exactly 5
digits, but since you said the string "always starts with at least 5
digits" maybe you will need \d{5,}. Also, beware the * as it is greedy; *?
may work better for you.
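
A small Python sketch of these two points (Python is used for illustration only; as far as I know the quantifiers behave the same way in .NET). The sample line and the variable names are made up for the example.

```python
import re

# Hypothetical sample: the first item has 7 digits, the second exactly 5.
line = "<p>1234567 first</p><p>22222 second</p>"

# \d{5} followed by a space: the 7-digit item is skipped, because after
# five digits the pattern expects a space and finds another digit.
exact_five = re.findall(r"<p>(\d{5} .*?)</p>", line)

# \d{5,} accepts five or more digits, so both items match.
at_least_five = re.findall(r"<p>(\d{5,} .*?)</p>", line)

# Greedy .* runs to the last </p> on the line and swallows both items.
greedy = re.findall(r"<p>(\d{5,} .*)</p>", line)

print(exact_five)     # ['22222 second']
print(at_least_five)  # ['1234567 first', '22222 second']
print(greedy)         # ['1234567 first</p><p>22222 second']
```

Whether lazy or greedy is the right choice depends on the input: lazy is what separates adjacent items here, but it behaves differently when tags are nested.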

Do some reading, get Expresso and experiment with the suggestions provided
in the other replies. Regular expressions are very useful. Learning
something about them will pay a high dividend.

I am concerned about the fact that the html tags are "random". Depending on
what else is in the file you may have problems avoiding stuff you do not
want.

Good luck, Bob

"Andy B" <a_*****@sbcglobal.net> wrote in message
news:OK**************@TK2MSFTNGP06.phx.gbl...
> I need to create a regular expression that will match a 5-digit number, a
> space and then anything up to but not including the next closing HTML tag.
> Here is an example:
>
> <startTag>55555 any text</aClosingTag>
>
> I need a Regex that will get all of the text between the HTML tags above
> (the HTML tags are random and I do not know them beforehand). The match
> string always starts with at least 5 digits.

Oct 11 '08 #5
"I am concerned about the fact that the html tags are "random". Depending
on what else is in the file you may have problems avoiding stuff you do not
want."

Hi. The HTML I am searching is not malformed. What I have is a list of
items on a page that start with at least a 5 digit number (\d{5,}) and then
the item title. There are directions for each item that may or may not be
given. Each item block (number, title and directions) is in a <p></p>
element. If there are directions for the item, there will be a <br /> after
the title. If there are no directions for the item, the title ends at </p>
[the end of the p element in question]. Here are a few examples:

<p>11111 This item has no directions</p>

<p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit
down</p>

This is what I want to do with the Regex object:
1. Return a Match collection containing all p elements starting with at
least a 5 digit number.
2. Test for the <br /> html tag. If it exists, split the title before
the <br /> and the directions after the <br /> into separate regex groups.
3. Drop the html tags from the output.

Can this be done with a single regular expression?

Oct 11 '08 #6
On Sat, 11 Oct 2008 13:22:31 -0400, "eBob.com"
<fa******@totallybogus.com> wrote:
> Two recommendations: 1)
> http://msdn.microsoft.com/en-us/library/az24scfc.aspx and a free product
> named Expresso from www.ultrapico.com.
>
> Also, having read some of the other replies, \d{5} matches exactly 5
> characters, but since you said the string "always starts with at least 5
> digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
> may work better for you.

I believe he will want greedy repetition. I'm fairly certain that
resorting to laziness will return undesired results. Consider the
following example, and try it with both greedy and lazy repetition:

<p>11111 ABC <p>123</p>DEF</p>

> Do some reading, get Expresso and experiment with the suggestions provided
> in the other replies. Regular expressions are very useful. Learning
> something about them will pay a high dividend.

Agreed. However, I do not use Expresso. I do have it installed, but
it unfortunately does not work correctly with many regular expressions
I use.

> I am concerned about the fact that the html tags are "random". Depending on
> what else is in the file you may have problems avoiding stuff you do not
> want.

I cannot imagine problems he may run into (unless he was to use lazy
repetition as you suggested). Of course, I have not created an
exhaustive set of test scenarios...
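
The nested example earlier in this reply can be tried directly. Below is a minimal Python sketch (illustration only; the thread itself is about .NET) that runs the capturing pattern from the earlier reply once with greedy .* and once with lazy .*? on that line. The variable names are made up.

```python
import re

# The nested sample from this reply.
nested = "<p>11111 ABC <p>123</p>DEF</p>"

# Greedy: .* runs to the end and backtracks to the LAST </p>.
greedy = re.search(r"<(\w+)>(\d{5} .*)</\1>", nested).group(2)

# Lazy: .*? stops at the FIRST </p>, cutting the outer element short.
lazy = re.search(r"<(\w+)>(\d{5} .*?)</\1>", nested).group(2)

print(greedy)  # 11111 ABC <p>123</p>DEF
print(lazy)    # 11111 ABC <p>123
```

On nested input the greedy form keeps the whole outer element. The trade-off is that greedy repetition also spans several sibling elements when they share one line, so neither form is safe for every input.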
>Good luck, Bob
Oct 11 '08 #7
On Sat, 11 Oct 2008 15:09:13 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> Hi. The HTML I am searching is not malformed. What I have is a list of
> items on a page that start with at least a 5 digit number (\d{5,}) and then
> the item title. There are directions for each item that may or may not be
> given. Each item block (number, title and directions) is in a <p></p>
> element. If there are directions for the item, there will be a <br /> after
> the title. If there are no directions for the item, the title ends at </p>
> [the end of the p element in question]. Here are a few examples:
>
> <p>11111 This item has no directions</p>
>
> <p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit
> down</p>
>
> This is what I want to do with the Regex object:
> 1. Return a Match collection containing all p elements starting with at
> least a 5 digit number.
> 2. Test for the <br /> html tag. If it exists, split the title before
> the <br /> and the directions after the <br /> into separate regex groups.
> 3. Drop the html tags from the output.
>
> Can this be done with a single regular expression?

My suggestion for this is to take care of the html breaks later in
your code after you've captured the text you want. You're trying to
do too much in regular expressions, and it will become obnoxiously
complex.
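
A minimal sketch of that suggestion, written in Python for illustration (the thread targets .NET): one simple pattern captures the number and everything else inside the p element, and ordinary string code splits on the break afterwards. The names ITEM and parse_items are hypothetical, and the sketch assumes the break is always spelled exactly <br /> and that p elements are not nested.

```python
import re

# Number (at least 5 digits), a space, then everything up to the next </p>.
# DOTALL lets an item wrap across lines, as in the second sample above.
ITEM = re.compile(r"<p>(\d{5,}) (.*?)</p>", re.DOTALL)


def parse_items(html):
    """Return (number, title, steps) tuples; steps is None when absent."""
    items = []
    for match in ITEM.finditer(html):
        number, body = match.group(1), match.group(2)
        # Split on the first break only; the tags never reach the output.
        title, _, steps = body.partition("<br />")
        items.append((number, title.strip(), steps.strip() or None))
    return items


sample = (
    "<p>11111 This item has no directions</p>\n"
    "<p>22222 This item has directions<br />1. Stand up. 2. Turn around. "
    "3. Sit down</p>"
)
for item in parse_items(sample):
    print(item)
```

Keeping the split outside the pattern means the regular expression stays the same whether or not an item has directions.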
Oct 11 '08 #8
"My suggestion for this is to take care of the html breaks later in your
code after you've captured the text you want. You're trying to do too much
in regular expressions, and it will become obnoxiously complex."

After a little bit of homework, I came up with this so far:

<p>(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)</p>

The above works like a dream and I can get the text I need captured to the
Number, Title and Steps groups. Now I need to match the same exact thing but
without the Steps section. The example is: <p>11122 Title without steps</p>.
I need to take the results of both of these matches and put them all inside
of a single Match object. How do I do this?

Oct 12 '08 #9
On Sat, 11 Oct 2008 20:38:03 -0400, "Andy B" <a_*****@sbcglobal.net>
wrote:
> After a little bit of homework, I came up with this so far:
>
> <p>(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)</p>
>
> The above works like a dream and I can get the text I need captured to the
> Number, Title and Steps groups. Now I need to match the same exact thing but
> without the Steps section. The example is: <p>11122 Title without steps</p>.
> I need to take the results of both of these matches and put them all inside
> of a single Match object. How do I do this?

As I had already suggested, use a single group that captures
everything and handle splitting the break up later in your code. If,
for whatever reasons, you don't want to do this, an alternative is to
create two regular expressions-- one that matches text containing a
break and one that does not.

A single regular expression that does what you want may be possible,
but I don't want to create it. Go with either of the two approaches I
posted above.
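
For readers curious about the single-expression route, here is one possible sketch. It is not from the thread, and it is in Python for illustration. It makes the break and the Steps group optional and switches the repetition to lazy. Python spells named groups (?P<Name>...) where .NET uses (?<Name>...). The name ITEM is hypothetical, and the sketch assumes non-nested p elements and a break spelled exactly <br />.

```python
import re

# Title is lazy so it stops at the first <br /> or </p>; the whole
# "<br />Steps" part is wrapped in an optional non-capturing group.
ITEM = re.compile(
    r"<p>(?P<Number>\d{5,})\s*(?P<Title>.*?)(?:<br />(?P<Steps>.*?))?</p>",
    re.DOTALL,
)

without_steps = ITEM.search("<p>11122 Title without steps</p>")
with_steps = ITEM.search("<p>22222 This item has directions<br />1. Stand up.</p>")

print(without_steps.group("Number"), without_steps.group("Title"))
print(without_steps.group("Steps"))  # None: the optional group did not take part
print(with_steps.group("Title"))
print(with_steps.group("Steps"))
```

The pattern is noticeably harder to read than the capture-then-split approach, which supports the advice given above.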
Oct 12 '08 #10
