I wrote a Python program (103 lines, below) to download developer data
from SourceForge for research about social networks.
Please critique the code and let me know how to improve it.
An example use of the program:
prompt> python download.py 1 240000
The above command downloads data for the projects with IDs between 1
and 240000, inclusive. As it runs, it prints status messages, with a
plus sign meaning that the project ID exists. Else, it prints a minus
sign.
Questions:
--- Are my setup and use of threads, the queue, and "while True" loop
correct or conventional?
--- Should the program sleep sometimes, to be nice to the SourceForge
servers, and so they don't think this is a denial-of-service attack?
--- Someone told me that popen is not thread-safe, and to use
mechanize. I installed it and followed an example on the web site.
There wasn't a good description of it on the web site, or I didn't
find it. Could someone explain what mechanize does?
--- How do I choose the number of threads? I am using a MacBook Pro
2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
10.5.3.
Thank you.
Winston
#!/usr/bin/env python
# Winston C. Yang
# Created 2008-06-14

from __future__ import with_statement

import mechanize
import os
import Queue
import re
import sys
import threading
import time

lock = threading.RLock()

# Make the dot match even a newline.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def now():
    return time.strftime("%Y-%m-%d %H:%M:%S")

def worker():
    while True:
        try:
            id = queue.get()
        except Queue.Empty:
            continue

        request = mechanize.Request("http://sourceforge.net/project/"\
                                    "memberlist.php?group_id=%d" % id)
        response = mechanize.urlopen(request)
        text = response.read()

        valid_id = not error_pattern.match(text)

        if valid_id:
            f = open("%d.csv" % id, "w+")
            f.write(text)
            f.close()

        with lock:
            print "\t".join((str(id), now(), "+" if valid_id else "-"))

def fatal_error():
    print "usage: python application start_id end_id"
    print
    print "Get the usernames associated with each SourceForge project with"
    print "ID between start_id and end_id, inclusive."
    print
    print "start_id and end_id must be positive integers and satisfy"
    print "start_id <= end_id."
    sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) == 3:
        try:
            start_id = int(sys.argv[1])
            if start_id <= 0:
                raise Exception
            end_id = int(sys.argv[2])
            if end_id < start_id:
                raise Exception
        except:
            fatal_error()
    else:
        fatal_error()

    # Print the start time.
    start_time = now()
    print start_time

    # Create a directory whose name contains the start time.
    dir = start_time.replace(" ", "_").replace(":", "_")
    os.mkdir(dir)
    os.chdir(dir)

    queue = Queue.Queue(0)

    for i in xrange(32):
        t = threading.Thread(target=worker, name="worker %d" % (i + 1))
        t.setDaemon(True)
        t.start()

    for id in xrange(start_id, end_id + 1):
        queue.put(id)

    # When the queue has size zero, exit in three seconds.
    while True:
        if queue.qsize() == 0:
            time.sleep(3)
            break

    print now()
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
....
valid_id = not error_pattern.match(text)
with:
error_pattern = "\n<!--pageid login -->\n"
....
valid_id = error_pattern not in text
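To check that the two tests agree, here is a small sketch that runs both on made-up page text. The marker string and the regular expression are the ones from the original post; the sample pages and the function names are invented for illustration, and the code is written so that it also runs under Python 3.

```python
import re

# Marker and pattern copied from the original post.
MARKER = "\n<!--pageid login -->\n"
error_regex = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def is_valid_regex(text):
    """Original approach: the page is valid if the regex does not match."""
    return not error_regex.match(text)

def is_valid_substring(text):
    """Suggested approach: the page is valid if the marker is absent."""
    return MARKER not in text

# Hypothetical sample pages, not real SourceForge output.
login_page = "<html>\n<!--pageid login -->\n<body>Please log in</body></html>"
member_page = "<html>\n<body>alice,bob</body></html>"

for page in (login_page, member_page):
    # Both tests should give the same answer for every page.
    assert is_valid_regex(page) == is_valid_substring(page)
```

Because the pattern starts and ends with ".*" and uses re.DOTALL, it matches exactly when the marker occurs somewhere in the text, which is what the "in" test checks directly.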
MRAB wrote:
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
<SNIP>
Erm, shurely the bottleneck will be bandwidth not processor/memory?* If
it isn't then - yes, you run the risk of actually DOSing their servers!
Your mac will run thousands of threads comfortably but your router may
not handle the thousands of TCP/IP connections you throw at it very
well, especially if it is a domestic model, and sure as hell sourceforge
aren't going to want more than a handful of concurrent connections from
you.
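A rough sketch of what "a handful of concurrent connections" might look like with the standard queue and threading modules follows. The worker count, the delay, and the fetch() function are assumptions made for illustration (fetch() is a stand-in that does no network access), and the module is spelled queue as in Python 3; in Python 2 it is Queue. Note that Queue.get() with no timeout blocks until an item arrives rather than raising Empty, so the sketch uses a sentinel value plus task_done() and join() to shut down instead of polling qsize().

```python
import queue
import threading
import time

NUM_WORKERS = 4        # assumption: a handful of concurrent connections
DELAY_SECONDS = 0.01   # assumption: short pause for the demo; use longer for real servers

def fetch(project_id):
    """Hypothetical stand-in for the real download; no network access."""
    return "page for project %d" % project_id

def worker(tasks, results, lock):
    while True:
        project_id = tasks.get()      # blocks until an item is available
        if project_id is None:        # sentinel: no more work for this thread
            tasks.task_done()
            break
        try:
            text = fetch(project_id)
            with lock:
                results[project_id] = text
        finally:
            tasks.task_done()         # mark the item done even if fetch() failed
        time.sleep(DELAY_SECONDS)     # pause between requests to be polite

def download_all(start_id, end_id):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for project_id in range(start_id, end_id + 1):
        tasks.put(project_id)
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()                      # wait until every queued item is processed
    for t in threads:
        t.join()
    return results

results = download_all(1, 20)
```

With join() the main thread waits for the work to actually finish, so there is no need to guess at a sleep before exiting.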
Typical sourceforge page  ~ 30K
Project pages to read     = 240000
Total                     = ~6.8 Gigabytes
Maybe send their sysadmin a box of chocolates if you want to grab all
that in any less than a week and not get your IP blocked! :)
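The back-of-the-envelope figures can be reproduced with a few lines of Python. The page size and page count are the ones quoted in this post (taking 30K as 30 * 1024 bytes); the rate of one request per second is purely an assumption for illustration, not a limit published by SourceForge.

```python
PAGE_BYTES = 30 * 1024      # ~30K per page, from the estimate in this post
NUM_PAGES = 240000          # project IDs 1 to 240000

total_bytes = PAGE_BYTES * NUM_PAGES
total_gib = total_bytes / float(1024 ** 3)

# Assumption: one request per second overall, chosen only as an example.
REQUESTS_PER_SECOND = 1.0
days = NUM_PAGES / REQUESTS_PER_SECOND / 86400.0

print("%.2f GiB in total, about %.1f days at %.0f request(s) per second"
      % (total_gib, days, REQUESTS_PER_SECOND))
```

At that assumed rate the whole run takes just under three days, so anything much faster means several requests per second sustained for a long time.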
Roger Heathcote
* Of course, stylistically, MRAB is perfectly right about not wasting
CPU on regexes where string methods will do, unless you are planning on
making your searches more elaborate in the future.