Bytes IT Community

pagecrawling websites with Python

Hi all,

We've got an application we wrote in Python called pagecrawler that
generates a list of URLs based on SQL queries. It then runs through
this list, 'browsing' each of those URLs on one of our staging servers.
We do this to build the site dynamically, but each page generated by
a URL is saved as a static HTML file. Anyway, the pagecrawler program
uses Python threads to try to build the pages as fast as it can. The
list of URLs is stored in a queue, and the thread objects get URLs
from the queue and fetch them until the queue is empty. This works
okay, but it still seems to take a long time to build the site this
way, even though the actual pages only take milliseconds to run (the
pages are generated with PHP on a separate server). Does anyone have
any insight into whether this is a reasonable approach to building web
pages, or whether we should look at another design?
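
For readers following along, the design described above can be sketched roughly as below. This is only a minimal illustration, not the actual pagecrawler code: the names `fetch`, `worker`, and `build_site`, the `(file_name, url)` pair format, and the thread count are all assumptions made for the example.

```python
import queue
import threading
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def worker(url_queue, out_dir, fetch_page):
    """Pull (file_name, url) pairs from the queue until it is empty."""
    while True:
        try:
            name, url = url_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to build, so this thread exits
        try:
            # Save the generated page as a static HTML file.
            (out_dir / name).write_bytes(fetch_page(url))
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"failed: {url}: {exc}")


def build_site(pages, out_dir, num_threads=8, fetch_page=fetch):
    """pages: iterable of (file_name, url) pairs (hypothetical format)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Fill the queue before starting any thread, so an empty queue
    # reliably means "all work has been handed out".
    url_queue = queue.Queue()
    for page in pages:
        url_queue.put(page)

    threads = [
        threading.Thread(target=worker, args=(url_queue, out_dir, fetch_page))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a layout like this, timing a run at a few different `num_threads` values is one way to see whether the crawler or the staging server is the limiting factor.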

Thanks in advance,
Doug

Jul 18 '05 #1
2 Replies


On 1 Apr 2005 11:58:11 -0800, writeson <wr******@charter.net> wrote:
> We've got an application we wrote in Python called pagecrawler that
> <snip />
> Does anyone have any insight if this is a reasonable approach to build
> web pages, or if we should look at another design?


I don't have an answer to your particular question, but maybe you can
have a look at how HarvestMan works:

http://freshmeat.net/projects/harvestman

Regards,
--
Swaroop C H
Blog: http://www.swaroopch.info
Book: http://www.byteofpython.info
Jul 18 '05 #2

Swaroop,

Thanks for the reply, I'll take a look at HarvestMan and see if we can
use it directly, or get some ideas from the source code. :)

Doug

Jul 18 '05 #3
