Bytes | Software Development & Data Engineering Community

Need help in Regex.

I have got the following html:

"something in html ... etc.. city1... etc... <a class="font1"
href="city1.html" onclick="etc.">click for <b>info</b> on city1 </a>
... some html. city1.. can repeat lot of times here....

Requirement:
-------------------
I want to get the value of "href", i.e. "city1.html", by searching for "city1"
between the <a> and </a> tags. Please note that "city1" can repeat many
times in the html, but I have to search for the "city1" which lies between the
<a></a> tags. Please note that there can be other tags between the <a></a>
tags as well, like the <b></b> tag in the html above.

I want to do this in C# using Regex. I am new to Regex and I would
really appreciate any help on this.

Thanks
JM

Jul 8 '06 #1
JM,

Why use Regex? Why not use MSHTML through COM interop and just parse
the HTML? Then, you can access the object model and find the item that way.

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Jul 8 '06 #2
I am using this for a small .NET program and don't want to use
unmanaged code/COM. That's why I'm looking for a solution based on Regex.

Thanks
JM

Jul 8 '06 #3
Regex might not be the best solution for this problem, but it is
possible. Note that you should probably not use this if the HTML you're
parsing is arbitrary, but for an ad hoc search through a couple of files,
it should do:

@"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a"

It starts searching when it encounters an <a tag, then continues on
looking for anything that's not > and picks out the href attribute. It
captures the href value in a group and searches on for the end of the
opening tag. Once there it searches for city1, failing if it finds </a
before encountering that specific text.

And the code to extract the value would then be something like this (I haven't
compiled it, so it might contain a few small errors):

Regex rx = new Regex(
    @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    // Group 1 holds the value of the href attribute.
    string href = m.Groups[1].Value;
}
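
For anyone who wants to try the pattern end to end, here is a minimal, self-contained sketch. The sample HTML, the class name HrefFinder and the method name FindHref are made up for illustration. RegexOptions.Singleline is also an addition that is not in the post: it assumes the link text may span several lines, since . does not match a newline otherwise.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // Pattern from the post: capture the double-quoted href of an <a tag
    // whose content contains "city1" before the closing </a.
    // Singleline is an assumption so that . can also cross line breaks.
    private static readonly Regex LinkRx = new Regex(
        @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a",
        RegexOptions.Singleline);

    // Returns the href of the first matching link, or an empty string
    // when no link contains the text.
    public static string FindHref(string input)
    {
        Match m = LinkRx.Match(input);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample page: "city1" appears outside links as well.
        string html =
            "<p>city1 is mentioned here, outside any link.</p>\n" +
            "<a class=\"font1\" href=\"other.html\" onclick=\"etc.\">click for <b>info</b> on city2 </a>\n" +
            "<a class=\"font1\" href=\"city1.html\" onclick=\"etc.\">click for <b>info</b> on city1 </a>\n" +
            "<p>city1 again</p>";

        Console.WriteLine(FindHref(html));
    }
}
```

The occurrences of city1 outside the links are skipped because every match has to start at an <a tag and may not run past the next </a.
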
Jesse
Jul 8 '06 #4
@"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

might be even better, as using * instead of + means it will also find: <a href="...">city1</a>
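
The difference between the + and * versions of the pattern can be checked with a small sketch along these lines. The class name QuantifierDemo and the two sample inputs are invented for illustration; the patterns themselves are the ones from the posts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // First version: + requires at least one character on each side of "city1"
    // and at least one more attribute character after the href value.
    public static readonly Regex Strict = new Regex(
        @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a");

    // Relaxed version: * also allows nothing at all in those three places.
    public static readonly Regex Relaxed = new Regex(
        @"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a");

    public static void Main()
    {
        // href is the last attribute and the link text is exactly "city1".
        string bare = "<a href=\"city1.html\">city1</a>";
        // Extra attribute after href and text on both sides of "city1".
        string padded = "<a href=\"city1.html\" class=\"font1\">info on city1 here</a>";

        Console.WriteLine(Strict.IsMatch(bare));    // False
        Console.WriteLine(Relaxed.IsMatch(bare));   // True
        Console.WriteLine(Strict.IsMatch(padded));  // True
        Console.WriteLine(Relaxed.IsMatch(padded)); // True
    }
}
```
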
Jesse
Jul 8 '06 #5
Hi Jesse,

Thanks a lot for your help. It worked like a charm. I am searching this
for a particular page only, so I think it should be fine.

The only thing is that it works when the value of "href" is enclosed in
double quotes. But I need "href" to also be matched when:
(a) the value is within single quotes, like href='city1.html'
(b) there are no single or double quotes at all, i.e.
href=city1.html. I think the criterion here is that after
"city1.html" there is at least a single space.

Thanks & Regards
JM

Jul 8 '06 #6
Ok, can do that as well:

@"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

To make extracting the value easier I've named the capturing groups url (watch
the wrapping).

Regex rx = new Regex(
    @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    // All three alternatives capture into the same named group.
    string href = m.Groups["url"].Value;
}
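
As a rough check of the three quoting styles, a self-contained sketch could look like this. The class name QuotedHrefFinder, the method name FindHref and the sample links are hypothetical, and RegexOptions.Singleline is an added assumption (link text may contain line breaks). Note that inside a verbatim string every literal double quote has to be written as "", including the one inside the last character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedHrefFinder
{
    // Three alternatives for the href value: double-quoted, single-quoted,
    // or unquoted (which ends at whitespace, a quote or >).
    // .NET allows the same group name in several alternatives.
    private static readonly Regex LinkRx = new Regex(
        @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))" +
        @"[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
        RegexOptions.Singleline);

    // Returns the href of the first link containing "city1",
    // or an empty string when there is none.
    public static string FindHref(string input)
    {
        Match m = LinkRx.Match(input);
        return m.Success ? m.Groups["url"].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical samples, one per quoting style.
        string[] samples =
        {
            "<a class=\"font1\" href=\"city1.html\">info on city1</a>",
            "<a class='font1' href='city1.html'>info on city1</a>",
            "<a class=font1 href=city1.html >info on city1</a>",
        };

        foreach (string s in samples)
            Console.WriteLine(FindHref(s));
    }
}
```
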
Jul 8 '06 #7
Hi Jesse,

It works great. Thanks a lot for all your time. I really appreciate it.

btw, what are some good resources on the net where I can start learning
about Regex?

And as you were saying that Regex is not the best option for arbitrary
html, what is the best option in that case? Just curious to
know.

Thanks again for all your help
JM

Jul 8 '06 #8
jm******@gmail.com wrote:
> Hi Jesse,
>
> It works great. Thanks a lot for all your time. I really appreciate it.

You're welcome.

> btw, what are the good resources on net from where I can start learning
> about Regex ?

Download The Regulator and try experimenting (just disable intellisense
from the options window, it doesn't work).
http://regex.osherove.com/

Try some of the exercises here:
http://www.cs.princeton.edu/introcs/72regular/

There are a number of articles on MSDN that might also be of help:
http://msdn.microsoft.com/library/de...63906a7353.asp
http://msdn2.microsoft.com/en-us/library/az24scfc.aspx

And a general explanation on regular expressions (not specific to .Net):
http://www.regularexpressions.info/

Especially for .NET, buy the ebook by Dan Appleman:
http://www.amazon.com/gp/product/B00...lance&n=551440

And for general insight on the workings of Regular Expressions the
following book is one of the best resources available:
http://www.amazon.com/gp/product/059...lance&n=283155

And finally:
Keep trying! Keep Practising and don't be afraid to ask questions.

> And as you were saying that for arbitrary html Regex is not the best
> option, then what is the best option for the same. Just curious to
> know.

HTML isn't that strict and people write some funny stuff in their pages
once in a while. You can't predict these things, and you're better
equipped with a parser that knows most of these exceptions.

There's the .NET Html Agility Pack, which parses HTML and makes it
searchable. The MSHTML object can also be a great help.
http://smourier.blogspot.com/
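
For comparison, the parser route could look roughly like the sketch below. It assumes the third-party HtmlAgilityPack NuGet package is referenced in the project; the class name ParsedHrefFinder, the method name FindHref and the sample HTML are made up for illustration, and this is only one way to use the library.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class ParsedHrefFinder
{
    // Returns the href of the first <a> whose text contains searchText,
    // or an empty string when there is no such link.
    public static string FindHref(string html, string searchText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return string.Empty;

        foreach (HtmlNode link in links)
        {
            // InnerText is the link text with nested tags such as <b> stripped.
            if (link.InnerText.Contains(searchText))
                return link.GetAttributeValue("href", string.Empty);
        }
        return string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample: "city1" also appears outside the link.
        string html =
            "<p>city1</p>" +
            "<a class='font1' href='city1.html'>click for <b>info</b> on city1</a>";

        Console.WriteLine(FindHref(html, "city1"));
    }
}
```

Because the parser deals with the quoting of attributes and with nested tags itself, the search only has to look at the link text and read the attribute.
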

> Thanks again for all your help

Again, welcome!

Jesse
Jul 8 '06 #9
I will definitely go through the resources you have mentioned here. You
have solved a problem which I had been working on for the last week. Thanks for
your time and professional help.

Regards
JM

Jul 8 '06 #10
