Need help with a Regular Expression

Luhar

After much scouring of information on Regular Expressions from books and the
web, I've come up with this handy little Regex to parse links from HTML:

<a\s+href(?:\s+)?=(?:\s+)?[""']+(.?[^'""]+)['""]+(?:\s+)?>(.*?)</a>

It works quite well at extracting the url and title of a link from an anchor
tag, with one major problem--if the anchor tag includes other attributes
after the HREF= attribute, such as TITLE= or TARGET=, it doesn't consider it
a match. Here are some examples:

This one matches:
<a href="/">Home</a>
Group 1: "/"
Group 2: "Home"

This one doesn't:
<a href="/" target="_blank">Home</a>

I can't figure out how to match just the href attribute and just the link
text. Any help would be appreciated.

Thanks.

Nov 20 '05 #1

Ken Tucker [MVP]

Hi,
Maybe this will help.
Dim wc As New System.Net.WebClient
Dim sr As New System.IO.StreamReader(wc.OpenRead("http://news.google.com/"))
Dim strHtml As String

' Matches every double-quoted string inside an anchor (the url and any other quoted attribute)
Dim regLink As New System.Text.RegularExpressions.Regex("\""(?<url>[^\""]*)\""")
' Matches the text between ">" and "<"
Dim regTitle As New System.Text.RegularExpressions.Regex(">(.*?)\<")
' Matches a whole anchor from <a href="..."> to </a>
Dim regHref As New System.Text.RegularExpressions.Regex("\<a href=""(.*?)""\>(.*?)\<\/a\>")

Dim m As System.Text.RegularExpressions.Match

strHtml = sr.ReadToEnd

Try
    For Each m In regHref.Matches(strHtml)
        Dim mLink As System.Text.RegularExpressions.Match

        For Each mLink In regLink.Matches(m.ToString())
            Trace.WriteLine(String.Format("Link {0}", mLink.ToString))
        Next

        For Each mLink In regTitle.Matches(m.ToString())
            ' Strip the surrounding ">" and "<" from the matched title
            Dim strTitle As String = mLink.ToString
            strTitle = strTitle.Replace(">", "")
            strTitle = strTitle.Replace("<", "")
            Trace.WriteLine(String.Format("Title {0}", strTitle))
        Next
    Next
Catch
    ' Errors are silently ignored here
End Try

sr.Close()
wc.Dispose()

Ken
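
For the original question (extra attributes after href), one possible approach is to allow any characters other than ">" on either side of the href attribute. The sketch below is an assumption, not something posted in this thread: the pattern, the named groups url and text, the module name LinkDemo, and the sample input are all made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LinkDemo
    Sub Main()
        ' Assumed pattern (not from the thread): [^>]* before and after href lets
        ' other attributes such as target or title appear on either side of it.
        ' "" inside a VB string literal is an escaped double quote.
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""'](?<url>[^""']*)[""'][^>]*>(?<text>.*?)</a>"
        Dim regAnchor As New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Hypothetical input: the two anchors from the question.
        Dim html As String = "<a href=""/"">Home</a> <a href=""/"" target=""_blank"">Home</a>"

        For Each m As Match In regAnchor.Matches(html)
            Console.WriteLine("Url: {0}  Text: {1}", m.Groups("url").Value, m.Groups("text").Value)
        Next
    End Sub
End Module
```

With this sample input, both anchors should come back with url / and text Home. A pattern like this can still trip over malformed markup or a ">" inside an attribute value, so treat it as a starting point.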

Nov 20 '05 #2

Luhar

Thanks Ken. I'll try that.

While I was waiting for somebody to reply, I refined my original regex a bit
and it seems to find the url and text of href tags, no matter how the tag is
formatted. Here it is:
<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?
Looks ugly, but it seems to do the job. Group 0 (the whole match) is the
entire tag from the opening <a... to the closing ...</a>. Capture group 1
is just the url, including the protocol, host, destination, fragments, and
queries if they exist. Capture group 2 is the text between the <a>
and </a> tags, including any embedded html tags.

If anybody can think of a situation where this regex doesn't match an anchor
tag in html code, please let me know.

Thanks again,

Luhar
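
One case that seems to trip up the refined pattern is two anchors on the same line. The greedy .* after <\s*a\s* runs to the end of the line and backtracks only as far as the last href, so the whole line is swallowed as a single match. Here is a small sketch to check this, assuming the pattern is used with the default RegexOptions; the module name GreedyDemo and the sample input are hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' The refined pattern from the post above, unchanged
        ' ("" inside a VB string literal is an escaped double quote).
        Dim pattern As String = "<\s*a\s*.*[^href]?href\s*={1}?[\s""']*([^\s'"">]*)[^>]*?>{1}(.*?\n*.*?)</a>+?"
        Dim regAnchor As New Regex(pattern)

        ' Hypothetical test input: two anchors on one line.
        Dim html As String = "<a href=""/a"">A</a> <a href=""/b"">B</a>"

        Dim found As MatchCollection = regAnchor.Matches(html)
        Console.WriteLine("Matches: {0}", found.Count)

        For Each m As Match In found
            Console.WriteLine("Whole: {0}", m.Groups(0).Value)  ' entire matched span
            Console.WriteLine("Url:   {0}", m.Groups(1).Value)  ' first capture group
            Console.WriteLine("Text:  {0}", m.Groups(2).Value)  ' second capture group
        Next
    End Sub
End Module
```

This should report one match spanning the whole line, with url /b and text B, so the first link is lost. Replacing the .* with [^>]* would likely keep the search inside a single tag.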

Nov 20 '05 #3
