please critique my thread code

I wrote a Python program (103 lines, below) to download developer data
from SourceForge for research about social networks.

Please critique the code and let me know how to improve it.

An example use of the program:

prompt> python download.py 1 240000

The above command downloads data for the projects with IDs between 1
and 240000, inclusive. As it runs, it prints status messages, with a
plus sign meaning that the project ID exists. Else, it prints a minus
sign.

Questions:

--- Are my setup and use of threads, the queue, and "while True" loop
correct or conventional?

--- Should the program sleep sometimes, to be nice to the SourceForge
servers, and so they don't think this is a denial-of-service attack?

--- Someone told me that popen is not thread-safe, and to use
mechanize. I installed it and followed an example on the web site.
There wasn't a good description of it on the web site, or I didn't
find it. Could someone explain what mechanize does?

--- How do I choose the number of threads? I am using a MacBook Pro
2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
10.5.3.

Thank you.

Winston

#!/usr/bin/env python

# Winston C. Yang
# Created 2008-06-14

from __future__ import with_statement

import mechanize
import os
import Queue
import re
import sys
import threading
import time

lock = threading.RLock()

# Make the dot match even a newline.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def now():
    return time.strftime("%Y-%m-%d %H:%M:%S")

def worker():

    while True:

        try:
            id = queue.get()
        except Queue.Empty:
            continue

        request = mechanize.Request("http://sourceforge.net/project/"
                                    "memberlist.php?group_id=%d" % id)
        response = mechanize.urlopen(request)
        text = response.read()

        valid_id = not error_pattern.match(text)

        if valid_id:
            f = open("%d.csv" % id, "w+")
            f.write(text)
            f.close()

        with lock:
            print "\t".join((str(id), now(), "+" if valid_id else "-"))

def fatal_error():
    print "usage: python application start_id end_id"
    print
    print "Get the usernames associated with each SourceForge project with"
    print "ID between start_id and end_id, inclusive."
    print
    print "start_id and end_id must be positive integers and satisfy"
    print "start_id <= end_id."
    sys.exit(1)

if __name__ == "__main__":

    if len(sys.argv) == 3:

        try:
            start_id = int(sys.argv[1])

            if start_id <= 0:
                raise Exception

            end_id = int(sys.argv[2])

            if end_id < start_id:
                raise Exception
        except:
            fatal_error()
    else:
        fatal_error()

    # Print the start time.
    start_time = now()
    print start_time

    # Create a directory whose name contains the start time.
    dir = start_time.replace(" ", "_").replace(":", "_")
    os.mkdir(dir)
    os.chdir(dir)

    queue = Queue.Queue(0)

    for i in xrange(32):
        t = threading.Thread(target=worker, name="worker %d" % (i + 1))
        t.setDaemon(True)
        t.start()

    for id in xrange(start_id, end_id + 1):
        queue.put(id)

    # When the queue has size zero, exit in three seconds.
    while True:
        if queue.qsize() == 0:
            time.sleep(3)
            break

    print now()
Jun 27 '08 #1
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:

error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
....
valid_id = not error_pattern.match(text)

with:

error_pattern = "\n<!--pageid login -->\n"
....
valid_id = error_pattern not in text
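
To check that the two tests agree, here is a minimal sketch. It is written for Python 3 (the original script is Python 2), the sample page bodies are made up for illustration, and the helper names `is_valid_regex` and `is_valid_substring` are hypothetical, not part of the original program.

```python
import re

# Marker copied from the original post; a page containing it is treated
# as the login/error page, i.e. the project ID does not exist.
ERROR_MARKER = "\n<!--pageid login -->\n"
ERROR_REGEX = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)


def is_valid_regex(text):
    """Original approach: DOTALL regex applied with match()."""
    return not ERROR_REGEX.match(text)


def is_valid_substring(text):
    """Suggested approach: plain substring test."""
    return ERROR_MARKER not in text


# Hypothetical page bodies, invented for this example only.
samples = [
    "<html>\n<!--pageid login -->\n<body>Please log in</body></html>",
    "<html>\n<body>member list</body></html>",
    "",
]

# Both functions should classify every sample the same way.
for page in samples:
    assert is_valid_regex(page) == is_valid_substring(page)
```

Because the pattern starts and ends with `.*` under `re.DOTALL`, `match()` succeeds exactly when the marker appears somewhere in the text, which is what the `in` test checks directly.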
Jun 27 '08 #2
MRAB wrote:
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
<SNIP>

Erm, shurely the bottleneck will be bandwidth not processor/memory?* If
it isn't then - yes, you run the risk of actually DOSing their servers!

Your mac will run thousands of threads comfortably but your router may
not handle the thousands of TCP/IP connections you throw at it very
well, especially if it is a domestic model, and sure as hell sourceforge
aren't going to want more than a handful of concurrent connections from
you.
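
One way to keep the connection count small is sketched below, with no claim that it is the only conventional layout. It uses a small worker pool, a per-worker pause, sentinel values to stop the workers, and `Queue.join()` in place of polling `qsize()`. It is written for Python 3 (module `queue` rather than `Queue`). `fetch` is a hypothetical stand-in that does no network access, and the values of `NUM_WORKERS` and `DELAY_SECONDS` are assumptions to be tuned.

```python
import queue
import threading
import time

NUM_WORKERS = 4        # assumption: "a handful" of concurrent connections
DELAY_SECONDS = 0.01   # assumption: pause per request; raise it for real use


def fetch(project_id):
    """Hypothetical stand-in for the real download; returns fake page text."""
    return "page for project %d" % project_id


def worker(tasks, results, lock):
    while True:
        project_id = tasks.get()        # blocks until an item is available
        if project_id is None:          # sentinel: no more work
            tasks.task_done()
            break
        try:
            text = fetch(project_id)
            with lock:
                results[project_id] = text
        finally:
            tasks.task_done()           # always mark the item as processed
        time.sleep(DELAY_SECONDS)       # throttle this worker's request rate


def download_range(start_id, end_id):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for project_id in range(start_id, end_id + 1):
        tasks.put(project_id)
    for _ in threads:
        tasks.put(None)                 # one sentinel per worker
    tasks.join()                        # returns once every item is task_done()
    for t in threads:
        t.join()
    return results


results = download_range(1, 20)
print(len(results))
```

With `join()` the main thread waits until every queued item has actually been processed, so there is no need for a fixed three-second grace period after the queue looks empty.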

Typical sourceforge page ~ 30K
Project pages to read = 240000

= ~6.8 Gigabytes

Maybe send their sysadmin a box of chocolates if you want to grab all
that in any less than a week and not get your IP blocked! :)
Roger Heathcote
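
The estimate above can be reproduced with a few lines of Python 3. This is only a back-of-the-envelope sketch: it assumes "30K" means 30 KiB per page and that the download is spread evenly over exactly one week.

```python
PAGE_SIZE_KIB = 30     # "Typical sourceforge page ~ 30K" (assumed to be KiB)
PAGE_COUNT = 240000    # project IDs 1 through 240000

total_bytes = PAGE_SIZE_KIB * 1024 * PAGE_COUNT
total_gib = total_bytes / 1024 ** 3

SECONDS_PER_WEEK = 7 * 24 * 60 * 60
requests_per_second = PAGE_COUNT / SECONDS_PER_WEEK
seconds_per_request = SECONDS_PER_WEEK / PAGE_COUNT

print("total: %.2f GiB" % total_gib)
print("one week: %.2f requests/s" % requests_per_second)
print("one week: one request every %.2f s" % seconds_per_request)
```

Under those assumptions the total comes to a little under 7 GiB, and finishing in a week means roughly one request every two and a half seconds.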

* Of course, stylistically, MRAB is perfectly right about not wasting
CPU on regexes where string methods will do, unless you are planning on
making your searches more elaborate in the future.

Jun 27 '08 #3
