Bytes | Software Development & Data Engineering Community

Oh what a twisted thread we weave....

Hi

First off I'm not using anything from Twisted. I just liked the subject
line :)

The folks of this list have been most helpful before, and I'm hoping
you'll take pity on the dazed and confused. I've read stuff on this
group and various websites and books until my head is spinning...

Here is a brief summary of what I'm trying to do, with an example below.
I have the code below in a single-threaded version and use it to test a
list of roughly 6000 URLs to ensure that they "work". If they fail, I
track the kind of failures and then generate a report. Currently it
takes about 7-9 hours to run through the entire list. I basically create
a list from a file containing URLs and then iterate over the list,
checking each page as I go. I get all sorts of flack because it takes so
long, so I thought I could speed it up by using a Queue and X number of
threads. Seems easier said than done.

However, in my test below I can't even get it to catch a single error in
the if statement in run(). I'm stumped as to why. Any help would be
greatly appreciated, and if you're so inclined, pointers on how to limit
the number of threads to a given number.

Thank you in advance! I really do appreciate it

Here is what I have so far... Yes, there are some things that are unused
from previous tests. Oh, and to give proper credit, this is based on some
code from http://starship.python.net/crew/aahz...00/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string
#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()
MAX_THREADS = 10

timeout = 90 # sets timeout for urllib2.urlopen()
failedlinks = [] # list for failed urls
zeromatch = [] # list for 0 result searches
t = 0 # used to store starting time for getting a page.
pagetime = 0 # time it took to load page
slowestpage = 0 # slowest page time
fastestpage = 10 # fastest page time
cumulative = 0 # total time to load all pages (used to calc. avg)
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__:self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - self.t  # was "time() - t", which read the global t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the If should be true.
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(self.URL)  # was "url", which isn't defined here
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.Com Urls should trigger the zeromatch condition
sites = ['http://www.foo.com/',
'http://www.shop.com',
'http://www.shop.com/op/aprod-~zzsome+thing',
'http://www.shop.com/op/aprod-~xyzzy'
#'http://www.yahoo.com/ThisPageDoesntExist'
]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'

for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that should end up here, nothing ever
# gets appended to the list.

Oct 29 '05 #1
On Sat, 28 Oct 2005, GregM wrote:
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

# why does this always drop through even though the If should be true.
if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:


This code - I do not think it means what you think it means. Specifically,
it doesn't mean "is either of ST_zeroMatch or ST_zeroMatch2 in
self.webpg"; what it means is "apply the 'or' operator to ST_zeroMatch
and ST_zeroMatch2, then check if the result is in self.webpg". The result
of applying the or operator to two nonempty strings is the left-hand
string; your code is thus equivalent to

if ST_zeroMatch in self.webpg:

Which will work in cases where your page says 'You found 0 products', but
not in cases where it says 'There are no products matching your
selection'.

What you want is:

if (ST_zeroMatch in self.webpg) or (ST_zeroMatch2 in self.webpg):

Or something like that.
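To make that concrete, here's a small demonstration (written in modern
Python 3 syntax; the two match strings are the ones from your script, and
the sample page text is made up for illustration):

```python
m1 = 'You found 0 products'
m2 = 'There are no products matching your selection'
page = 'Sorry. There are no products matching your selection.'

# 'or' is evaluated first, and it returns its left operand whenever that
# operand is truthy -- and any non-empty string is truthy:
print((m1 or m2) == m1)                   # True

# So the original test only ever looks for m1:
print((m1 or m2) in page)                 # False

# Testing each string separately works:
print((m1 in page) or (m2 in page))       # True

# With a longer list of search terms, any() reads better:
print(any(m in page for m in (m1, m2)))   # True
```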

You say that you have a single-threaded version of this that works;
presumably, you have a working version of this logic in there. Did you
write the threaded version from scratch? Often a bad move!

tom

--
It's the 21st century, man - we rue _minutes_. -- Benjamin Rosenbaum
Oct 29 '05 #2
Tom,

Thanks for the reply and sorry for the delay in getting back to you.
Thanks for pointing out my logic problem. I had added the 2nd part of
the if statement at the last minute...

Yes, I have a single-threaded version; it's several hundred lines and
uses COM to write the results out to an Excel spreadsheet. I was trying
to better understand threading and queues before I started hacking on my
current code... maybe that was a mistake... hey, I'm still learning, and
I learn a lot just by reading stuff posted to this group. I hope at some
point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff)

here is a summary:
# see if url exists
# if exists then
# hit page
# get text of page
# see if text of page contains search terms
# if it does then
# update appropriate counters and lists
# else update static line and do the next one
# when done with Links list
# - calculate totals and times
# - write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
# import excelled
import urllib2
import time
import socket
import os
#import msvcrt # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

#initialize Globals
timeout = 90 # sets timeout for urllib2.urlopen()
failedlinks = [] # list for failed urls
zeromatch = [] # list for 0 result searches
pseudo404 = [] # list for shop.com 404 pages
t = 0 # used to store starting time for getting a page.
count = 0 # number of tests so far
pagetime = 0 # time it took to load page
slowestpage = 0 # slowest page time
fastestpage = 10 # fastest page time
cumulative = 0 # total time to load all pages (used to calc. avg)

#version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
    """ checks url for shop.com 404 url
    shop.com 404 url -- returns status 200
    http://www.shop.com/amos/cc/main/404/ccsyn/260
    """
    if '404' in testUrl:
        return True
    else:
        return False

##### main program #####

try:
    links = open(testfile).readlines()
except:
    exc, err, tb = exc_info()
    print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
    #print str(exc)
    print str(err)
    print
    exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)
totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()
for url in links:
    count = count + 1
    # HttpExists2 - checks to see if URL exists and detects redirection.
    # handles 404's and exceptions better. Returns tuple depending on results:
    # if found: true and final url. if not found: false and attempted url
    pgChk = HttpExists2(url)
    if pgChk[0] == False:
        # failed url Exists
        failedlinks.append(pgChk[1])
    elif ShopCom404(pgChk[1]):
        # Our version of a 404
        pseudo404.append(url)
    if pgChk[0] and not ShopCom404(url):
        # if valid page not a 404 then get the page and check it.
        try:
            t = time.time()
            urlObj = urllib2.urlopen(url)
            pagetime = time.time() - t
            webpg = urlObj.read()
            if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                zeromatch.append(url)
            elif ST_errorConnecting in webpg:
                # for some reason we got the error page
                # so add it to the failed urls
                failmsg = 'Error Connecting Page with: ' + url
                failedlinks.append(failmsg)
        except:
            print 'exception with: ' + url
    # figure page times
    cumulative += pagetime
    if pagetime > slowestpage:
        slowestpage = pagetime, url.strip()
    elif pagetime < fastestpage:
        fastestpage = pagetime, url.strip()
    msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
          '. Current runtime: ' + str(datetime.today() - start)
    # status message that updates the same line.
    #PrintStatic(msg)

### Now write out results
end = datetime.today()
finished = datetime.today()
finishedTimeStr = time.asctime()
avg = cumulative/totalNumberTests
failed = len(failedlinks)
nomatches = len(zeromatch)

#setup COM connection to Excel and write the spreadsheet.

If I understand what I've read about threading, I need to convert much
of the above into a function and then call threading.Thread's start()
to fire off each thread, but where and how, and how to limit it to X
number of threads, is the part I get lost on. The examples I've seen
using queues and threads never show using a list (sequence) for the
source data, and I'm not sure where I'd use the Queue stuff or, for
that matter, if I'm just complicating the issue.
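For reference, the bounded worker-pool pattern being asked about can be
sketched roughly like this (in modern Python 3 names: `queue` instead of
`Queue`; the `check` callable is a hypothetical stand-in for the
urllib2/page-matching logic, not part of the original script). A fixed
number of threads each pull URLs off a shared Queue until it is empty,
so the thread count stays capped no matter how long the URL list is:

```python
import queue
import threading

MAX_THREADS = 10

def run_pool(urls, check, num_threads=MAX_THREADS):
    """Run check(url) for every URL using at most num_threads threads."""
    work = queue.Queue()
    for url in urls:
        work.put(url)            # the list becomes the queue's contents

    results = []                 # shared result list
    lock = threading.Lock()      # guards appends to the shared list

    def worker():
        while True:
            try:
                url = work.get_nowait()   # grab the next URL...
            except queue.Empty:
                return                    # ...or stop when none are left
            outcome = check(url)          # do the actual page test
            with lock:
                results.append((url, outcome))
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # wait for all workers to finish
    return results
```

The same shape would drop into the script above by having `check` do the
urlopen()/read()/match work and by appending to failedlinks/zeromatch
inside the lock instead of returning a result.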

Once again thanks for the help.
Greg.

Oct 31 '05 #3
