
high performance hyperlink extraction

The following script is a high-performance link (<a
href="...">...</a>) extractor. I'm posting to this list in hopes that
anyone interested will offer constructive
criticism/suggestions/comments/etc. Mainly I'm curious what comments
folks have on my regular expressions. Hopefully someone finds this
kind of thing as interesting as I do! :)

My design goals were as follows:
* extract links from text (most likely valid HTML)
* work faster than BeautifulSoup, sgmllib, or other markup parsing
libraries
* return accurate results

The basic idea is to:
1. find anchor ('a') tags within some HTML text that contain 'href'
attributes (I assume these are hyperlinks)
2. extract all attributes from each 'a' tag found as name, value pairs

import re
import urllib

whiteout = re.compile(r'\s+')

# grabs hyperlinks from text
href_re = re.compile(r'''
    <a(?P<attrs>[^>]*       # start of tag
    href=(?P<delim>["'])    # delimiter
    (?P<link>[^"']*)        # link
    (?P=delim)              # delimiter
    [^>]*)>                 # rest of start tag
    (?P<content>.*?)        # link content
    </a>                    # end tag
    ''', re.VERBOSE | re.IGNORECASE)

# grabs attribute name, value pairs
attrs_re = re.compile(r'''
    (?P<name>\w+)=          # attribute name
    (?P<delim>["'])         # delimiter
    (?P<value>[^"']*)       # attribute value
    (?P=delim)              # delimiter
    ''', re.VERBOSE)

def getLinks(html_data):
    newdata = whiteout.sub(' ', html_data)
    matches = href_re.finditer(newdata)
    ancs = []
    for match in matches:
        d = match.groupdict()
        a = {}
        a['href'] = d.get('link', None)
        a['content'] = d.get('content', None)
        attr_matches = attrs_re.finditer(d.get('attrs', None))
        for match in attr_matches:
            da = match.groupdict()
            name = da.get('name', None)
            a[name] = da.get('value', None)
        ancs.append(a)
    return ancs

if __name__ == '__main__':
    opener = urllib.FancyURLopener()
    url = 'http://adammonsen.com/tut/libgladeTest.html'
    html_data = opener.open(url).read()
    for a in getLinks(html_data): print a
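
As a quick sanity check, here's the kind of thing it returns for a
tiny hand-written snippet (example.com is just a placeholder; key
order in the dict may vary):

>>> getLinks('<a href="http://example.com/" class="ext">Example</a>')
[{'href': 'http://example.com/', 'content': 'Example', 'class': 'ext'}]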
--
Adam Monsen
http://adammonsen.com/

Sep 13 '05 #1
2 Replies



Pretty nice. However, you won't capture the increasingly common
JavaScript-driven redirections, like

<b onclick='location.href="http://www.nowhere.com"'>click me</b>

nor

<form action="http://www.yahoo.com">
<input type=submit value="clickme">
</form>

nor

<form action="http://www.yahoo.com" name=x>
<input type=button value="clickme" onclick=document.x.submit()>
</form>

...and so on.

I'm guessing it also won't correctly handle things like:

<a href='javascript:alert("...")'>click</a>

(the inner double quotes are excluded by the [^"']* link pattern, so
the match fails). But you probably already knew all this stuff,
didn't you?

Well, anyway, my 2 cents: instead of parsing the HTML structure, you
could scan the text directly for URLs, like

http://XXXX.XXXXXX.XXX/XXX?xXXx=xXx#x

or something like that.
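
A rough sketch of that idea, in case it's useful (the pattern and the
getUrls name are just illustrative; the regex is deliberately loose
and nowhere near a full URL grammar):

import re

# loose URL scanner: finds http(s) URLs anywhere in the text,
# including ones buried in onclick handlers or form actions
url_re = re.compile(r'https?://[\w.-]+(?:/[\w.,;?%&#=+~/-]*)?', re.IGNORECASE)

def getUrls(text):
    return url_re.findall(text)

Run over the examples above, it pulls out http://www.nowhere.com and
http://www.yahoo.com, though it will also happily match URLs sitting
in plain text or comments.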
//f3l

Sep 13 '05 #2

Adam Monsen wrote:
The following script is a high-performance link (<a
href="...">...</a>) extractor. [...]
* extract links from text (most likely valid HTML) [...]

import re
import urllib

whiteout = re.compile(r'\s+')

# grabs hyperlinks from text
href_re = re.compile(r'''
    <a(?P<attrs>[^>]*       # start of tag
    href=(?P<delim>["'])    # delimiter
    (?P<link>[^"']*)        # link
    (?P=delim)              # delimiter
    [^>]*)>                 # rest of start tag
    (?P<content>.*?)        # link content
    </a>                    # end tag
    ''', re.VERBOSE | re.IGNORECASE)
A few notes:

The single or double quote delimiters are optional in some cases
(and frequently omitted even when required by the current
standard).

Where blank spaces may appear in HTML entities is not so clear.
To follow the standard, one would have to acquire the SGML
standard, which costs money. Popular browsers allow end tags
such as "</a >", which the RE above will reject.
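
For illustration, a variant of the posted pattern that tolerates
unquoted values, spaces around '=', and a space before the closing
'>' might look something like this (a sketch only; plenty of legal
forms still escape it):

href_re = re.compile(r'''
    <a\s(?P<attrs>[^>]*?                          # start of tag
    href\s*=\s*                                   # href, optional spaces
    (?:(?P<delim>["'])(?P<link>[^"']*)(?P=delim)  # quoted value
      |(?P<ulink>[^\s>]+))                        # unquoted value
    [^>]*)>                                       # rest of start tag
    (?P<content>.*?)                              # link content
    </a\s*>                                       # end tag, optional space
    ''', re.VERBOSE | re.IGNORECASE)

A caller then has to take whichever of 'link' or 'ulink' actually
matched, and the pattern gets correspondingly harder to read and to
reason about.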
I'm not good at reading REs, but it looks like the first line
will greedily match the entire start tag, and then back-track to
find the href attribute. There appear to be many other good
opportunities for a cleverly-constructed input to force big-time
backtracking; for example, a '>' will end the start tag, but
within the delimiters it's just another character. Can anyone
show a worst-case run-time?
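
One rough way to probe that (an experiment, not a proof; the input is
contrived, and the pattern is the posted one minus VERBOSE for
brevity):

import re, time

href_re = re.compile(r'<a(?P<attrs>[^>]*href=(?P<delim>["\'])'
                     r'(?P<link>[^"\']*)(?P=delim)[^>]*)>'
                     r'(?P<content>.*?)</a>', re.IGNORECASE)

# many opened tags, never an href or a closing '>': at each '<a'
# the engine runs [^>]* to the end of the string and back-tracks
for n in (1000, 2000, 4000):
    evil = '<a ' * n
    start = time.time()
    href_re.search(evil)
    print n, '%.2fs' % (time.time() - start)

On this input the time grows roughly quadratically with n, since
every candidate start rescans the rest of the string. Bad enough for
a crafted page, though it isn't the exponential blow-up that nested
quantifiers can produce.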

Writing a Python RE to match all and only legal anchor tags may
not be possible. Writing a regular expression to do so is
definitely not possible.
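
For comparison, the parser-based route the original post was trying
to out-run looks roughly like this, using the standard library's
HTMLParser (the AnchorExtractor name and the sample snippet are made
up; slower, but it follows the real tag grammar instead of
approximating it):

from HTMLParser import HTMLParser

class AnchorExtractor(HTMLParser):
    # collects an attribute dict for every <a> tag carrying an href
    def __init__(self):
        HTMLParser.__init__(self)
        self.ancs = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            attrs['content'] = ''
            self._current = attrs

    def handle_data(self, data):
        if self._current is not None:
            self._current['content'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self.ancs.append(self._current)
            self._current = None

p = AnchorExtractor()
p.feed('<a href="http://example.com/" class="ext">Example</a >')
p.close()
print p.ancs

Note that it accepts the "</a >" end tag above without complaint.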

[...]

def getLinks(html_data):
    newdata = whiteout.sub(' ', html_data)
    matches = href_re.finditer(newdata)
    ancs = []
    for match in matches:
        d = match.groupdict()
        a = {}
        a['href'] = d.get('link', None)

The statement above doesn't seem necessary. The 'href' just gets
over-written below, as just another attribute.

        a['content'] = d.get('content', None)
        attr_matches = attrs_re.finditer(d.get('attrs', None))
        for match in attr_matches:
            da = match.groupdict()
            name = da.get('name', None)
            a[name] = da.get('value', None)
        ancs.append(a)
    return ancs

--
--Bryan
Sep 14 '05 #3
