"Fuzzyman" <fu******@gmail.com> writes:
Ajar wrote:
I want to write a program which will automatically log in to my ISP's
website, retrieve data and do some processing. Can this be done? Can
you point me to any example Python programs which do similar things?
Regards,
Ajar
Very easily. Have a look at my article on the ``urllib2`` module.
http://www.voidspace.org.uk/python/articles.shtml#http
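For a rough idea of what such a fetch looks like, here is a minimal sketch. It assumes Python 3, where the functionality of ``urllib2`` lives in ``urllib.request``. The function names, the User-Agent string and the URL are placeholders made up for illustration, not anything taken from the article.

```python
import urllib.request

def build_request(url, user_agent="example-fetcher/0.1"):
    """Build a GET request with an explicit User-Agent (placeholder value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, timeout=10):
    """Fetch a page and return its body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

# Usage (placeholder URL):
#     page = fetch("http://www.example.com/")
```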
You may need to use ClientCookie/cookielib to handle cookies and may
have to cope with BASIC authentication. There are also articles about
both of these as well.
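A sketch of how the two can be combined, again assuming Python 3, where ``cookielib`` is ``http.cookiejar`` and the ``urllib2`` handlers live in ``urllib.request``. The helper name ``make_opener`` and the URL and credentials in the usage note are hypothetical.

```python
import http.cookiejar
import urllib.request

def make_opener(site_url, username, password):
    """Return (opener, jar): the opener keeps cookies and answers BASIC auth."""
    jar = http.cookiejar.CookieJar()
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means: use these credentials for any realm under site_url.
    passwords.add_password(None, site_url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),
        urllib.request.HTTPBasicAuthHandler(passwords),
    )
    return opener, jar

# Usage (placeholder values):
#     opener, jar = make_opener("http://www.example.com/", "user", "secret")
#     page = opener.open("http://www.example.com/account").read()
```

Cookies set by one response are stored in the jar and sent back on later requests made through the same opener, which is what a login session usually needs.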
If you want to handle filling in forms programmatically then the module
ClientForm is useful (allegedly).
The last piece of the puzzle is BeautifulSoup. That's what you use to
extract data from the web page.
For instance a lot of web pages listing data have something like this
on it:
    <table>
    ....
    <tr><th>Item:</th><td>Value</td></tr>
    ....
    </table>
You can extract the value from such a table with BeautifulSoup by doing
something like:
    soup.fetchText('Item:')[0].findParent(['td', 'th']).nextSibling.string
This works whether the item is in a td or a th tag.
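If BeautifulSoup isn't at hand, the same "find the label cell, take the next cell" idea can be sketched with the standard library's ``html.parser``. To be clear, this is a stand-in and not BeautifulSoup's API; it assumes Python 3, and the names ``LabelValueParser`` and ``value_for`` are made up for the example.

```python
from html.parser import HTMLParser

class LabelValueParser(HTMLParser):
    """Collect the flat text of each <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

def value_for(html, label):
    """Return the text of the cell after the one equal to label, or None."""
    parser = LabelValueParser()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        for i, cell in enumerate(row[:-1]):
            if cell == label:
                return row[i + 1]
    return None

html = "<table><tr><th>Item:</th><td>Value</td></tr></table>"
print(value_for(html, "Item:"))  # prints: Value
```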
Of course, I recommend doing things a little bit more verbosely. In my
case, I'm writing code that's expected to work on a large number of
web pages with different formats, so I put in a lot of error checking,
along with informative errors.
    def find_value(table, name):  # function name is illustrative
        """Return the value found next to the cell labelled name."""
        links = table.fetchText(name)
        if not links:
            raise BadTableMatch("%s not found in table" % name)
        td = links[0].findParent(['td', 'th'])
        if not td:
            raise BadTableMatch("td/th not a parent of %s" % name)
        sibling = td.nextSibling
        if not sibling:
            raise BadTableMatch("td for %s has no sibling" % name)
        out = get_contents(sibling)
        if not out:
            raise BadTableMatch("no value string found for %s" % name)
        return out
BeautifulSoup would raise exceptions if the conditions I check for are
true and I didn't check them - but the error messages wouldn't be as
informative.
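BadTableMatch isn't defined anywhere in the code above; presumably it is just an Exception subclass. A minimal sketch under that assumption follows. The helper ``require`` is hypothetical and simply folds the repeated check-then-raise pattern into one call.

```python
class BadTableMatch(Exception):
    """Raised when a label or its value cannot be found in a table."""

def require(value, message):
    """Return value if it is truthy, otherwise raise BadTableMatch(message)."""
    if not value:
        raise BadTableMatch(message)
    return value

# Usage: each check becomes a single line, e.g.
#     links = require(table.fetchText(name), "%s not found in table" % name)
```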
Oh yeah - get_contents isn't from BeautifulSoup. I ran into cases
where the <td> tag held other tags, and wanted the flat text
extracted. Couldn't find a BeautifulSoup method to do that, so I wrote:
    def get_contents(ele):
        """Utility function to return all the text in a tag."""
        if ele.string:
            return ele.string  # We only have one string. Done
        return ''.join(get_contents(x) for x in ele)
<mike
--
Mike Meyer <mw*@mired.org>
http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.