I wrote a Python program (103 lines, below) to download developer data
from SourceForge for research about social networks.
Please critique the code and let me know how to improve it.
An example use of the program:
prompt> python download.py 1 240000
The above command downloads data for the projects with IDs between 1
and 240000, inclusive. As it runs, it prints status messages, with a
plus sign meaning that the project ID exists. Else, it prints a minus
sign.
Questions:
--- Are my setup and use of threads, the queue, and "while True" loop
correct or conventional?
--- Should the program sleep sometimes, to be nice to the SourceForge
servers, and so they don't think this is a denial-of-service attack?
--- Someone told me that popen is not thread-safe, and to use
mechanize. I installed it and followed an example on the web site.
There wasn't a good description of it on the web site, or I didn't
find it. Could someone explain what mechanize does?
--- How do I choose the number of threads? I am using a MacBook Pro
2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
10.5.3.
Thank you.
Winston
#!/usr/bin/env python
# Winston C. Yang
# Created 2008-06-14

from __future__ import with_statement

import mechanize
import os
import Queue
import re
import sys
import threading
import time

lock = threading.RLock()

# Make the dot match even a newline.
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def now():
    return time.strftime("%Y-%m-%d %H:%M:%S")

def worker():
    while True:
        try:
            id = queue.get()
        except Queue.Empty:
            continue
        request = mechanize.Request("http://sourceforge.net/project/"\
                                    "memberlist.php?group_id=%d" % id)
        response = mechanize.urlopen(request)
        text = response.read()
        valid_id = not error_pattern.match(text)
        if valid_id:
            f = open("%d.csv" % id, "w+")
            f.write(text)
            f.close()
        with lock:
            print "\t".join((str(id), now(), "+" if valid_id else "-"))

def fatal_error():
    print "usage: python application start_id end_id"
    print
    print "Get the usernames associated with each SourceForge project with"
    print "ID between start_id and end_id, inclusive."
    print
    print "start_id and end_id must be positive integers and satisfy"
    print "start_id <= end_id."
    sys.exit(1)

if __name__ == "__main__":
    if len(sys.argv) == 3:
        try:
            start_id = int(sys.argv[1])
            if start_id <= 0:
                raise Exception
            end_id = int(sys.argv[2])
            if end_id < start_id:
                raise Exception
        except:
            fatal_error()
    else:
        fatal_error()

    # Print the start time.
    start_time = now()
    print start_time

    # Create a directory whose name contains the start time.
    dir = start_time.replace(" ", "_").replace(":", "_")
    os.mkdir(dir)
    os.chdir(dir)

    queue = Queue.Queue(0)

    for i in xrange(32):
        t = threading.Thread(target=worker, name="worker %d" % (i + 1))
        t.setDaemon(True)
        t.start()

    for id in xrange(start_id, end_id + 1):
        queue.put(id)

    # When the queue has size zero, exit in three seconds.
    while True:
        if queue.qsize() == 0:
            time.sleep(3)
            break

    print now()
On Jun 15, 2:29 pm, wins...@cs.wisc.edu wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
....
valid_id = not error_pattern.match(text)
with:
error_pattern = "\n<!--pageid login -->\n"
....
valid_id = error_pattern not in text
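A quick way to check that the two tests agree is to run both on a few sample pages. This is only a sketch: the sample strings and the function names are made up for illustration, and only the marker string comes from the original program. Because the regex starts and ends with ".*" under re.DOTALL, match() succeeds exactly when the marker appears somewhere in the text, which is what the "in" test checks.

```python
import re

MARKER = "\n<!--pageid login -->\n"
regex = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)

def valid_by_regex(text):
    # Original approach: the ID is valid if the login-page regex does NOT match.
    return not regex.match(text)

def valid_by_substring(text):
    # Suggested approach: the ID is valid if the marker is NOT in the text.
    return MARKER not in text

# Hypothetical page bodies, not real SourceForge output.
samples = [
    "<html>\n<!--pageid login -->\n</html>",       # login page, so invalid ID
    "<html>\n<!--pageid memberlist -->\n</html>",  # some other page, so valid ID
    "",                                            # empty response
]

# One (regex result, substring result) pair per sample.
results = [(valid_by_regex(s), valid_by_substring(s)) for s in samples]
```

Each pair in results holds the same value twice, so on these samples the substring test can stand in for the regex.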
MRAB wrote:
[snip]
String methods are quicker than regular expressions, so don't use
regular expressions if string methods are perfectly adequate. For
example, you can replace:
<SNIP>
Erm, shurely the bottleneck will be bandwidth not processor/memory?* If
it isn't then - yes, you run the risk of actually DOSing their servers!
Your mac will run thousands of threads comfortably but your router may
not handle the thousands of TCP/IP connections you throw at it very
well, especially if it is a domestic model, and sure as hell SourceForge
aren't going to want more than a handful of concurrent connections from
you.
Typical SourceForge page = ~30K
Project pages to read    = 240000
Total (30K x 240000)     = ~6.8 Gigabytes
Maybe send their sysadmin a box of chocolates if you want to grab all
that in any less than a week and not get your IP blocked! :)
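To put rough numbers on that estimate, here is the arithmetic spelled out. The 30K page size and the one-week window are the figures from this post, not measurements, and the variable names are only illustrative.

```python
PAGE_KB = 30                           # estimated size of one page, in kilobytes
PAGES = 240000                         # project IDs 1..240000
SECONDS_PER_WEEK = 7 * 24 * 60 * 60    # 604800

total_kb = PAGE_KB * PAGES                     # total download, in KB
total_gib = total_kb / (1024.0 * 1024.0)       # same figure in binary gigabytes

# Spreading the whole job evenly over one week:
requests_per_second = PAGES / float(SECONDS_PER_WEEK)
seconds_between_requests = SECONDS_PER_WEEK / float(PAGES)
kb_per_second = total_kb / float(SECONDS_PER_WEEK)
```

That works out to about 6.87 GB in total. Spread over a week it is roughly one request every 2.5 seconds, at about 12 KB per second on average, so a single polite connection is already enough and 32 threads are not needed for throughput.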
Roger Heathcote
* Of course, stylistically, MRAB is perfectly right about not wasting
CPU on regexes where string methods will do, unless you are planning on
making your searches more elaborate in the future.
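On the original questions about the queue and the "while True" loop, two things are worth noting. Queue.get() blocks by default, so the "except Queue.Empty" branch in the posted worker never runs. Polling qsize() and then sleeping three seconds can also exit while a slow download is still in flight. The standard library's task_done() and join() cover both. Below is a minimal sketch, not a drop-in replacement: fetch() is a hypothetical stand-in for the real mechanize download, and NUM_WORKERS and DELAY_SECONDS are assumed values chosen to keep the demo fast.

```python
import threading
import time

try:
    import queue              # Python 3 name
except ImportError:
    import Queue as queue     # Python 2 name, as in the original program

NUM_WORKERS = 4        # assumption: "a handful" of concurrent connections
DELAY_SECONDS = 0.01   # assumption: tiny for the demo; use whole seconds for real runs

def fetch(project_id):
    """Hypothetical stand-in for the real download; returns fake page text."""
    return "page for project %d" % project_id

def worker(jobs, results, lock):
    while True:
        project_id = jobs.get()          # blocks until an ID is available
        try:
            text = fetch(project_id)
            with lock:                   # protect the shared dictionary
                results[project_id] = text
            time.sleep(DELAY_SECONDS)    # pause between requests to be polite
        finally:
            jobs.task_done()             # tell jobs.join() this ID is finished

def run(start_id, end_id):
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()
    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker, args=(jobs, results, lock))
        t.daemon = True                  # let the program exit while workers wait
        t.start()
    for project_id in range(start_id, end_id + 1):
        jobs.put(project_id)
    jobs.join()                          # returns once every queued ID is processed
    return results

downloaded = run(1, 20)
```

With join() the main thread waits exactly as long as the work takes, and the sleep in each worker caps the request rate at roughly NUM_WORKERS / DELAY_SECONDS requests per second.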