
Instrumented web proxy

I would like to write a web (http) proxy which I can instrument to
automatically extract information from certain web sites as I browse
them. Specifically, I would want to process URLs that match a particular
regexp. For those URLs I would have code that parsed the content and
logged some of it.

Think of it as web scraping under manual control.
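
Roughly, what I have in mind is a table of URL patterns mapped to handler
functions, which the proxy would consult for every completed response.
A sketch of the idea (the pattern and the names are made up purely for
illustration):

import re

def log_listing(url, body):
    # Placeholder handler: in practice this would parse the page
    # (HTMLParser, regexps, whatever) and append the interesting
    # fields to a log file.
    print 'matched %s (%d bytes)' % (url, len(body))

# URL patterns I care about, mapped to the code that scrapes them.
HANDLERS = [
    (re.compile(r'^http://www\.example\.com/listings/'), log_listing),
]

def process(url, body):
    # The proxy would call this with each URL and response body.
    for pattern, handler in HANDLERS:
        if pattern.search(url):
            handler(url, body)

The question is where in an existing proxy to call process() from.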

I found this list of Python web proxies:

http://www.xhaus.com/alan/python/proxies.html

Tiny HTTP Proxy in Python looks promising, as it's nominally simple (not
many lines of code):

http://www.okisoft.co.jp/esc/python/proxy/

It does what it's supposed to, but I'm a bit at a loss as to where to
intercept the traffic. I suspect it should be quite straightforward, but
I'm finding the code a bit opaque.

Any suggestions?

Andrew
Mar 27 '08 #1


Hello Andrew,
> Tiny HTTP Proxy in Python looks promising, as it's nominally simple (not
> many lines of code):
>
> http://www.okisoft.co.jp/esc/python/proxy/
>
> It does what it's supposed to, but I'm a bit at a loss as to where to
> intercept the traffic. I suspect it should be quite straightforward, but
> I'm finding the code a bit opaque.
>
> Any suggestions?
From a quick look at the code, you can hook do_GET, which has the full
URL available (see the urlunparse line). If you want the actual content
of the page as well, you'll need to hook _read_write (the
data = i.recv(8192) line).
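
Untested, but roughly the shape I mean; the only assumptions are the two
method names from that script, and the regexp and helper names below are
placeholders:

import re

CAPTURE = re.compile(r'example\.com/listings/')  # placeholder pattern

def start_request(handler, url):
    # Call this near the top of do_GET, where the full URL is known
    # (around the urlunparse line).
    handler.capture = CAPTURE.search(url) is not None
    handler.chunks = []

def saw_data(handler, from_server, data):
    # Call this inside _read_write, right after  data = i.recv(8192),
    # with from_server=True only for data read off the remote socket,
    # so the browser's own request bytes don't get mixed in.
    if handler.capture and from_server and data:
        handler.chunks.append(data)

def end_request(handler, url):
    # Call this once _read_write has returned, back in do_GET.
    if handler.capture and handler.chunks:
        body = ''.join(handler.chunks)
        # Parse body and log whatever is interesting; placeholder:
        print 'captured %d bytes from %s' % (len(body), url)

One thing to watch: what you collect this way is the raw HTTP response,
headers included, and it may be chunked or gzip-compressed, so you'll
probably have to strip the headers and decode the body before parsing.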

HTH,
--
Miki <mi*********@gmail.com>
http://pythonwise.blogspot.com

Mar 27 '08 #2

Andrew McLean <an*********@andros.org.uk> writes:

> I would like to write a web (http) proxy which I can instrument to
> automatically extract information from certain web sites as I browse
> them. Specifically, I would want to process URLs that match a
> particular regexp. For those URLs I would have code that parsed the
> content and logged some of it.
>
> Think of it as web scraping under manual control.

I've used Proxy 3 for this; it's a very cool program with powerful
capabilities for on-the-fly HTML rewriting.

http://theory.stanford.edu/~amitp/proxy.html
Mar 27 '08 #3

Paul Rubin wrote:

> Andrew McLean <an*********@andros.org.uk> writes:
>
>> I would like to write a web (http) proxy which I can instrument to
>> automatically extract information from certain web sites as I browse
>> them. Specifically, I would want to process URLs that match a
>> particular regexp. For those URLs I would have code that parsed the
>> content and logged some of it.
>>
>> Think of it as web scraping under manual control.
>
> I've used Proxy 3 for this; it's a very cool program with powerful
> capabilities for on-the-fly HTML rewriting.
>
> http://theory.stanford.edu/~amitp/proxy.html

This looks very useful. Unfortunately I can't seem to get it to run
under Windows (specifically Vista) using Python 1.5.2, 2.2.3 or 2.5.2.
I'll try Linux if I get a chance.

Mar 28 '08 #4
