please critique my thread code

I wrote a Python program (103 lines, below) to download developer data
from SourceForge for research about social networks.

Please critique the code and let me know how to improve it.

An example use of the program:

prompt> python download.py 1 240000

The above command downloads data for the projects with IDs between 1
and 240000, inclusive. As it runs, it prints status messages, with a
plus sign meaning that the project ID exists. Else, it prints a minus
sign.

Questions:

--- Are my setup and use of threads, the queue, and "while True" loop
correct or conventional?

--- Should the program sleep sometimes, to be nice to the SourceForge
servers, and so they don't think this is a denial-of-service attack?

--- Someone told me that popen is not thread-safe, and to use
mechanize. I installed it and followed an example on the web site.
There wasn't a good description of it on the web site, or I didn't
find it. Could someone explain what mechanize does?

--- How do I choose the number of threads? I am using a MacBook Pro
2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
10.5.3.

Thank you.

Winston

#!/usr/bin/env python

# Winston C. Yang
# Created 2008-06-14

from __future__ import with_statement

import mechanize
import os
import Queue
import re
import sys
import threading
import time

lock = threading.RLock()

# Make the dot match even a newline.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def now():
    return time.strftime("%Y-%m-%d %H:%M:%S")

def worker():

    while True:

        try:
            id = queue.get()
        except Queue.Empty:
            continue

        request = mechanize.Request("http://sourceforge.net/project/"\
                                    "memberlist.php?group_id=%d" % id)
        response = mechanize.urlopen(request)
        text = response.read()

        valid_id = not error_pattern.match(text)

        if valid_id:
            f = open("%d.csv" % id, "w+")
            f.write(text)
            f.close()

        with lock:
            print "\t".join((str(id), now(), "+" if valid_id else "-"))

def fatal_error():
    print "usage: python application start_id end_id"
    print
    print "Get the usernames associated with each SourceForge project with"
    print "ID between start_id and end_id, inclusive."
    print
    print "start_id and end_id must be positive integers and satisfy"
    print "start_id <= end_id."
    sys.exit(1)

if __name__ == "__main__":

    if len(sys.argv) == 3:

        try:
            start_id = int(sys.argv[1])

            if start_id <= 0:
                raise Exception

            end_id = int(sys.argv[2])

            if end_id < start_id:
                raise Exception
        except:
            fatal_error()
    else:
        fatal_error()

    # Print the start time.
    start_time = now()
    print start_time

    # Create a directory whose name contains the start time.
    dir = start_time.replace(" ", "_").replace(":", "_")
    os.mkdir(dir)
    os.chdir(dir)

    queue = Queue.Queue(0)

    for i in xrange(32):
        t = threading.Thread(target=worker, name="worker %d" % (i + 1))
        t.setDaemon(True)
        t.start()

    for id in xrange(start_id, end_id + 1):
        queue.put(id)

    # When the queue has size zero, exit in three seconds.
    while True:
        if queue.qsize() == 0:
            time.sleep(3)
            break

    print now()
Jun 27 '08 #1
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:

error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
....
valid_id = not error_pattern.match(text)

with:

error_pattern = "\n<!--pageid login -->\n"
....
valid_id = error_pattern not in text
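
A minimal sketch of the two checks side by side, written for Python 3 (the code in this thread is Python 2). The two sample pages are made up for illustration and are not real SourceForge output. Because `match` only anchors at the start of the text, the original pattern needs the leading `.*` and `re.DOTALL` to reach the marker, whereas `in` simply scans the whole string.

```python
import re

MARKER = "\n<!--pageid login -->\n"

# Original approach: leading ".*" plus re.DOTALL lets match() reach the marker.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)


def is_valid_regex(text):
    """Return True when the login marker is absent (regex version)."""
    return not error_pattern.match(text)


def is_valid_substring(text):
    """Return True when the login marker is absent (string-method version)."""
    return MARKER not in text


# Hypothetical page bodies, used only to exercise both checks.
login_page = "<html>\n<!--pageid login -->\n<body>Please log in</body></html>"
member_page = "<html>\n<body>alice,bob</body></html>"

for page in (login_page, member_page):
    # Both versions should agree on every input.
    assert is_valid_regex(page) == is_valid_substring(page)
```

Both functions give the same answer here; the substring version just avoids compiling and running a pattern.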
Jun 27 '08 #2
MRAB wrote:
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
<SNIP>

Erm, shurely the bottleneck will be bandwidth not processor/memory?* If
it isn't then - yes, you run the risk of actually DOSing their servers!

Your Mac will run thousands of threads comfortably but your router may
not handle the thousands of TCP/IP connections you throw at it very
well, especially if it is a domestic model, and sure as hell SourceForge
isn't going to want more than a handful of concurrent connections from
you.
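
One way to keep to a handful of connections is a small, fixed pool of workers that pause between requests. The following is only a sketch in Python 3, under stated assumptions: `fetch` is a hypothetical stand-in that does no network I/O, and `NUM_WORKERS` and `DELAY_SECONDS` are made-up values, not figures from this thread. It also shows a common shutdown pattern: `Queue.get()` blocks by default (so it never raises `Empty` without a timeout), and an empty queue does not mean the workers have finished, so the sketch uses sentinels plus `task_done()`/`join()` instead of polling `qsize()`.

```python
import queue
import threading
import time

NUM_WORKERS = 4        # assumed "handful"; not a figure from the thread
DELAY_SECONDS = 0.01   # assumed pause per request; raise it for a real server


def fetch(project_id):
    """Hypothetical stand-in for the real download; does no network I/O."""
    return "page for project %d" % project_id


def worker(tasks, results, lock):
    while True:
        project_id = tasks.get()       # blocks until an item is available
        if project_id is None:         # sentinel: no more work for this worker
            tasks.task_done()
            break
        try:
            text = fetch(project_id)
            with lock:                 # protect the shared dict
                results[project_id] = text
            time.sleep(DELAY_SECONDS)  # be polite between requests
        finally:
            tasks.task_done()          # mark the item done even on error


def download(start_id, end_id):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for project_id in range(start_id, end_id + 1):
        tasks.put(project_id)
    for _ in threads:
        tasks.put(None)                # one sentinel per worker
    tasks.join()                       # wait until every item is processed
    for t in threads:
        t.join()
    return results


results = download(1, 20)
```

With this shape, the number of simultaneous requests never exceeds `NUM_WORKERS`, and the main thread returns only after every queued ID has actually been handled.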

Typical sourceforge page ~ 30K
Project pages to read = 240000

= ~6.8 Gigabytes

Maybe send their sysadmin a box of chocolates if you want to grab all
that in any less than a week and not get your IP blocked! :)
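
The estimate above can be reproduced with a few lines of Python 3. This is a back-of-the-envelope sketch that assumes "~30K" means 30 KiB per page; under that reading the total is a little under 7 GiB, and spreading 240000 pages over one week works out to well under one page per second.

```python
PAGE_BYTES = 30 * 1024   # "~30K" read as 30 KiB (an assumption)
PROJECTS = 240000        # project pages to read

total_bytes = PAGE_BYTES * PROJECTS
total_gib = total_bytes / 1024 ** 3   # binary gigabytes
total_gb = total_bytes / 1000 ** 3    # decimal gigabytes

seconds_per_week = 7 * 24 * 60 * 60
pages_per_second = PROJECTS / seconds_per_week

print("%.2f GiB (%.2f GB)" % (total_gib, total_gb))
print("%.2f pages per second over one week" % pages_per_second)
```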
Roger Heathcote

* Of course, stylistically, MRAB is perfectly right about not wasting
CPU on regexes where string methods will do, unless you are planning on
making your searches more elaborate in the future.

Jun 27 '08 #3
