Hi, I started using a Python-based screen scraper called newsscraper, which I
downloaded from SourceForge:
http://sourceforge.net/projects/newsscraper/. I have created many Python
templates from their examples that work just fine. However, I ran into a
roadblock with sites that use single quotes instead of double quotes when
specifying URLs in their web pages.
For example: <a href='http://www.foo/'>
instead of the usual
<a href="http://www.foo/">
I am a real newbie with this, but I think I found the code that parses
the href. It is in a file called parsefns.py.
The full excerpt is listed below, but here is the regex line that I believe
is not dealing with single quotes:
m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
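To make the symptom concrete, here is a minimal reproduction of that one line on the two example tags above. It is only a sketch: the names OLD_PATTERN, single, and double are made up for this illustration and are not part of newsscraper.

```python
import re

# The pattern exactly as it appears in parsefns.py (copied from the line above).
OLD_PATTERN = r'href\s*=\s*"?([^>" ]+)["> ]'

# The two example tags from this post.
single = re.search(OLD_PATTERN, "<a href='http://www.foo/'>", re.I).group(1)
double = re.search(OLD_PATTERN, '<a href="http://www.foo/">', re.I).group(1)

# With double quotes the optional '"?' eats the opening quote, so only the
# URL is captured. With single quotes nothing eats the quote, so both quote
# characters end up inside the captured group.
print(repr(double))
print(repr(single))
```

So the pattern does match the single-quoted tag, but the captured link still has the quotes around it. Because that string no longer starts with "http", get_href would also hand it to urljoin as if it were a relative link whenever base_url is set.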
I have tried many different variations with no luck, and I have had no luck
getting hold of the author either. Any ideas? Thx.
---------------------
def get_href(text, base_url=None):
    """get_href(text[, base_url]) -> href or None

    Extract the URL out of an HREF tag. If base_url is provided,
    will attempt to resolve relative links.
    """
    m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
    if not m:
        return None
    link = m.group(1)
    if base_url and not link.lower().startswith("http"):
        import urlparse
        link = urlparse.urljoin(base_url, link)
    return link
===============
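One possible direction, offered as a sketch and not as the author's intended fix: let the pattern accept either kind of quote as the optional opener and as a terminator, and keep both quote characters out of the captured group. The names HREF_RE and get_href_any_quote are hypothetical, and the sketch imports urljoin from urllib.parse (Python 3), whereas the excerpt above uses the Python 2 urlparse module.

```python
import re
from urllib.parse import urljoin  # Python 3 home of urljoin (assumption: not Python 2)

# Hypothetical replacement pattern: optional opening quote of either kind,
# then a run of characters that are not '>', a quote, or a space, then a
# closing quote, '>' or space.
HREF_RE = re.compile(r'''href\s*=\s*["']?([^>"' ]+)["'> ]''', re.I)


def get_href_any_quote(text, base_url=None):
    """Return the first href value in text, or None if there is none.

    Mirrors the structure of get_href above; only the pattern differs.
    """
    m = HREF_RE.search(text)
    if not m:
        return None
    link = m.group(1)
    # Same relative-link handling as the original excerpt.
    if base_url and not link.lower().startswith("http"):
        link = urljoin(base_url, link)
    return link


print(get_href_any_quote("<a href='http://www.foo/'>"))
print(get_href_any_quote('<a href="http://www.foo/">'))
print(get_href_any_quote("<a href='page.html'>", "http://www.foo/news/"))
```

Like the original, this still stops at the first space, and it now also stops at the first apostrophe, so a URL containing either character would be cut short. A real HTML parser would be more robust than any single regex here.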