Connecting Tech Pros Worldwide Help | Site Map

high performance hyperlink extraction

Adam Monsen
Guest
 
Posts: n/a
#1: Sep 13 '05
The following script is a high-performance link (<a
href="...">...</a>) extractor. I'm posting to this list in hopes that
anyone interested will offer constructive
criticism/suggestions/comments/etc. Mainly I'm curious what comments
folks have on my regular expressions. Hopefully someone finds this
kind of thing as interesting like I do! :)

My design goals were as follows:
* extract links from text (most likey valid HTML)
* work faster than BeautifulSoup, sgmllib, or other markup parsing
libraries
* return accurate results

The basic idea is to:
1. find anchor ('a') tags within some HTML text that contain 'href'
attributes (I assume these are hyperlinks)
2. extract all attributes from each 'a' tag found as name, value pairs


import re
import urllib

whiteout = re.compile(r'\s+')

# grabs hyperlinks from text
href_re = re.compile(r'''
<a(?P<attrs>[^>]* # start of tag
href=(?P<delim>["']) # delimiter
(?P<link>[^"']*) # link
(?P=delim) # delimiter
[^>]*)> # rest of start tag
(?P<content>.*?) # link content
</a> # end tag
''', re.VERBOSE | re.IGNORECASE)

# grabs attribute name, value pairs
attrs_re = re.compile(r'''
(?P<name>\w+)= # attribute name
(?P<delim>["']) # delimiter
(?P<value>[^"']*) # attribute value
(?P=delim) # delimiter
''', re.VERBOSE)

def getLinks(html_data):
newdata = whiteout.sub(' ', html_data)
matches = href_re.finditer(newdata)
ancs = []
for match in matches:
d = match.groupdict()
a = {}
a['href'] = d.get('link', None)
a['content'] = d.get('content', None)
attr_matches = attrs_re.finditer(d.get('attrs', None))
for match in attr_matches:
da = match.groupdict()
name = da.get('name', None)
a[name] = da.get('value', None)
ancs.append(a)
return ancs

if __name__ == '__main__':
opener = urllib.FancyURLopener()
url = 'http://adammonsen.com/tut/libgladeTest.html'
html_data = opener.open(url).read()
for a in getLinks(html_data): print a


--
Adam Monsen
http://adammonsen.com/

felipevaldez
Guest
 
Posts: n/a
#2: Sep 13 '05

re: high performance hyperlink extraction





pretty nice, however, u wont capture the more and more common
javascripted redirections, like



<b onclick='location.href="http://www.nowhere.com"'>click me</b>

nor

<form action="http://www.yahoo.com">
<input type=submit value="clickme">
</form>

nor

<form action="http://www.yahoo.com" name=x>
<input type=button value="clickme" onclick=document.x.submit()>
</form>

..

im guessing it also wont handle correctly thing like:

<a href='javascript:alert("...")'>click</a>


but you probably already knew all this stuff, didnt you?


well, anyway, my 2 cents are, that instead of parsing the html looking
for
urls, like http://XXXX.XXXXXX.XXX/XXX?xXXx=xXx#x

or something like that.


//f3l

Bryan Olson
Guest
 
Posts: n/a
#3: Sep 14 '05

re: high performance hyperlink extraction


Adam Monsen wrote:[color=blue]
> The following script is a high-performance link (<a
> href="...">...</a>) extractor. [...]
> * extract links from text (most likey valid HTML)[/color]
[...][color=blue]
> import re
> import urllib
>
> whiteout = re.compile(r'\s+')
>
> # grabs hyperlinks from text
> href_re = re.compile(r'''
> <a(?P<attrs>[^>]* # start of tag
> href=(?P<delim>["']) # delimiter
> (?P<link>[^"']*) # link
> (?P=delim) # delimiter
> [^>]*)> # rest of start tag
> (?P<content>.*?) # link content
> </a> # end tag
> ''', re.VERBOSE | re.IGNORECASE)[/color]

A few notes:

The single or double quote delimiters are optional in some cases
(and frequently omitted even when required by the current
standard).

Where blank-spaces may appear is HTML entities is not so clear.
To follow the standard, one would have to acquire the SGML
standard, which costs money. Popular browsers allow end tags
such as "</a >" which the RE above will reject.


I'm not good at reading RE's, but it looks like the first line
will greedily match the entire start tag, and then back-track to
find the href attribute. There appear to many other good
opportunities for a cleverly-constructed input to force big-time
backtracking; for example, a '>' will end the start-tag, but in
within the delimiters, it's just another character. Can anyone
show a worst-case run-time?

Writing a Python RE to match all and only legal anchor tags may
not be possible. Writing a regular expression to do so is
definitely not possible.

[...][color=blue]
> def getLinks(html_data):
> newdata = whiteout.sub(' ', html_data)
> matches = href_re.finditer(newdata)
> ancs = []
> for match in matches:
> d = match.groupdict()
> a = {}
> a['href'] = d.get('link', None)[/color]

The statement above doesn't seem necessary. The 'href' just gets
over-written below, as just another attribute.
[color=blue]
> a['content'] = d.get('content', None)
> attr_matches = attrs_re.finditer(d.get('attrs', None))
> for match in attr_matches:
> da = match.groupdict()
> name = da.get('name', None)
> a[name] = da.get('value', None)
> ancs.append(a)
> return ancs[/color]


--
--Bryan
Closed Thread