Need help in Regex.

I have the following HTML:

"something in html ... etc.. city1... etc... <a class="font1"
href="city1.html" onclick="etc.">click for <b>info</b> on city1 </a>
... some html. city1.. can repeat a lot of times here....

Requirement:
-------------------
I want to get the value of "href", i.e. "city1.html", by searching for "city1"
between the <a></a> tags. Please note that "city1" can repeat many
times in the HTML, but I have to find the "city1" that lies between
<a></a> tags. Please note that there can be other tags between the <a></a>
tags too, like the <b></b> tag in the HTML above.

I want to do this in C# using Regex. I am new to Regex and I would
really appreciate any help on this.

Thanks
JM

Jul 8 '06 #1
JM,

Why use Regex? Why not use MSHTML through COM interop and just parse
the HTML? Then, you can access the object model and find the item that way.

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Jul 8 '06 #2
I am using this for a small .NET program and don't want to use
unmanaged code/COM. That's why I'm looking for a solution based on Regex.

Thanks
JM

Jul 8 '06 #3
Regex might not be the best solution for this problem, but it is
possible. Note that you should probably not use this if the HTML you're
parsing is arbitrary, but for an ad hoc search through a couple of files,
it should do:

@"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a"

It starts matching when it encounters an <a tag, then skips over
anything that's not a > until it reaches the href attribute. It
captures the href value in a group and searches on for the end of the
opening tag. Once there, it searches for city1, failing if it finds </a
before encountering that specific text.

And the code to extract the value would then be something like (haven't
compiled, so might contain a few small errors):

Regex rx = new Regex(
    @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    // Group 1 holds the captured href value.
    string href = m.Groups[1].Value;
}
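
For illustration only, here is a minimal sketch of how the same pattern could be wrapped in a reusable helper. The names HrefFinder and TryFindHref are made up for this example and are not part of any library. It assumes the search text should be escaped with Regex.Escape, and it swaps RegexOptions.None for RegexOptions.Singleline so that link text spread over several lines is still matched; switch back to RegexOptions.None if you want exactly the behaviour of the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // Hypothetical helper: looks for the first <a> element whose content
    // contains linkText and hands back its double-quoted href value.
    public static bool TryFindHref(string html, string linkText, out string href)
    {
        // Regex.Escape keeps characters such as '.' in linkText literal.
        string pattern =
            @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+"
            + Regex.Escape(linkText)
            + @"(?:(?!</a).)+</a";

        // Assumption: Singleline lets '.' match line breaks, so the text
        // between <a ...> and </a> may span more than one line.
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);

        href = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

Calling HrefFinder.TryFindHref(input, "city1", out string href) on the HTML from the question should leave city1.html in href. Anchors that do not contain the text are skipped, because the (?!</a) lookahead stops the search at the end of each link.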
Jesse
Jul 8 '06 #4
@"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

This version, with * in place of +, might be even better, as it will also find: <a href="...">city1</a>
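
To make the difference concrete, here is a small sketch that holds both patterns side by side. The class QuantifierDemo and its two methods are hypothetical names used only for this comparison.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // First pattern: '+' demands at least one character after the href
    // value and on both sides of city1.
    public const string PlusPattern =
        @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a";

    // Relaxed pattern: '*' also accepts nothing at all in those places.
    public const string StarPattern =
        @"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a";

    public static bool MatchesPlus(string html)
    {
        return Regex.IsMatch(html, PlusPattern);
    }

    public static bool MatchesStar(string html)
    {
        return Regex.IsMatch(html, StarPattern);
    }
}
```

For a bare link such as <a href="city1.html">city1</a>, MatchesPlus returns false and MatchesStar returns true. For the longer link from the original question, where other attributes follow href and text surrounds city1, both return true.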
Jesse
Jul 8 '06 #5
Hi Jesse,

Thanks a lot for your help. It worked like a charm. I am searching this
for a particular page only, so I think it should be fine.

The only thing is that it works only if the value of "href" is in double
quotes. But I need "href" to also allow:
(a) a value within single quotes, like href='city1.html'
(b) a value with no single or double quotes at all, i.e.
href=city1.html. I think the criterion here is that
"city1.html" is followed by at least one space.

Thanks & Regards
JM

Jul 8 '06 #6
jm******@gmail.com wrote:
> The only thing is that it works only if the value of "href" is in double
> quotes. But I need "href" to also allow:
> (a) a value within single quotes, like href='city1.html'
> (b) a value with no single or double quotes at all, i.e.
> href=city1.html. I think the criterion here is that
> "city1.html" is followed by at least one space.

Ok, can do that as well:

@"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

To make extracting the value easier, I've named the capturing groups url
(watch the wrapping).

Regex rx = new Regex(
    @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    // All three alternatives capture into the same named group.
    string href = m.Groups["url"].Value;
}
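
As a quick way to try the three href styles, something like the following sketch could be used. QuoteStyleDemo and ExtractHref are hypothetical names for this example, and returning an empty string on failure is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteStyleDemo
{
    // The pattern from this post as a verbatim string; every literal
    // double quote inside it is written as "".
    private static readonly Regex Rx = new Regex(
        @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
        RegexOptions.None);

    // Hypothetical helper: returns the captured href value, or an empty
    // string when no link containing city1 is found.
    public static string ExtractHref(string html)
    {
        Match m = Rx.Match(html);
        return m.Success ? m.Groups["url"].Value : string.Empty;
    }
}
```

Each of href="city1.html", href='city1.html' and href=city1.html should come back as city1.html. .NET allows the same group name in several alternatives, so whichever branch matched fills the url group.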
Jul 8 '06 #7
Hi Jesse,

It works great. Thanks a lot for all your time. I really appreciate it.

By the way, what are some good resources on the net for learning
about Regex?

And since you said that Regex is not the best option for arbitrary HTML,
what is the best option there? Just curious to
know.

Thanks again for all your help
JM

Jul 8 '06 #8
jm******@gmail.com wrote:
> Hi Jesse,
>
> It works great. Thanks a lot for all your time. I really appreciate it.

You're welcome.

> By the way, what are some good resources on the net for learning
> about Regex?

Download The Regulator and try experimenting (just disable IntelliSense
in the options window; it doesn't work).
http://regex.osherove.com/

Try some of the exercises here:
http://www.cs.princeton.edu/introcs/72regular/

There are a number of articles on MSDN that might also be of help:
http://msdn.microsoft.com/library/de...63906a7353.asp
http://msdn2.microsoft.com/en-us/library/az24scfc.aspx

And a general explanation of regular expressions (not specific to .NET):
http://www.regularexpressions.info/

Especially for .NET, buy the ebook from Dan Appleman:
http://www.amazon.com/gp/product/B00...lance&n=551440

And for general insight into the workings of regular expressions, the
following book is one of the best resources available:
http://www.amazon.com/gp/product/059...lance&n=283155

And finally:
Keep trying! Keep practising, and don't be afraid to ask questions.

> And since you said that Regex is not the best option for arbitrary HTML,
> what is the best option there? Just curious to know.

HTML isn't that strict, and people write some funny stuff in their pages
once in a while. You can't predict these things, and you're better
equipped with a parser that knows most of these exceptions.

There's the .NET Html Agility Pack, which parses HTML and makes it
searchable. The MSHTML object can also be a great help.
http://smourier.blogspot.com/

> Thanks again for all your help

Again, welcome!

Jesse
Jul 8 '06 #9
I will definitely go through the resources you have mentioned here. You
have solved a problem I had been struggling with for the past week. Thanks for
your time and professional help.

Regards
JM

Jul 8 '06 #10
