
Oh what a twisted thread we weave....

Hi

First off, I'm not using anything from Twisted. I just liked the subject
line :)

The folks of this list have been most helpful before, and I'm hoping
you'll take pity on the dazed and confused. I've read stuff on this
group and various websites and books until my head is spinning...

Here is a brief summary of what I'm trying to do, with an example below.
I have the code below in a single-threaded version and use it to test a
list of roughly 6000 URLs to ensure that they "work". If they fail, I
track the kinds of failures and then generate a report. Currently it
takes about 7-9 hours to run through the entire list. I basically create
a list from a file containing a list of URLs, then iterate over the list
and check each page as I go. I get all sorts of flak because it takes so
long, so I thought I could speed it up by using a Queue and X number of
threads. Seems easier said than done.

However, in my test below I can't even get it to catch a single error in
the if statement in the run() function. I'm stumped as to why. Any help
would be greatly appreciated, and, if you're so inclined, pointers on
how to limit the threads to a given number.

Thank you in advance! I really do appreciate it.

Here is what I have so far... Yes, there are some things that are unused
from previous tests. Oh, and to give proper credit, this is based on
some code from http://starship.python.net/crew/aahz...00/SCRIPT2.HTM

import threading, Queue
from time import sleep, time
import urllib2
import formatter
import string

#toscan = Queue.Queue
#scanned = Queue.Queue
#workQueue = Queue.Queue()
MAX_THREADS = 10

timeout = 90        # sets timeout for urllib2.urlopen()
failedlinks = []    # list for failed urls
zeromatch = []      # list for 0 result searches
t = 0               # used to store starting time for getting a page.
pagetime = 0        # time it took to load page
slowestpage = 0     # slowest page time
fastestpage = 10    # fastest page time
cumulative = 0      # total time to load all pages (used to calc. avg)
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

class Retriever(threading.Thread):
    def __init__(self, URL):
        self.done = 0
        self.URL = URL
        self.urlObj = ''
        self.ST_zeroMatch = ST_zeroMatch
        print '__init__:self.URL', self.URL
        threading.Thread.__init__(self)

    def run(self):
        print 'In run()'
        print "Retrieving:", self.URL
        #self.page = urllib.urlopen(self.URL)
        #self.body = self.page.read()
        #self.page.close()
        self.t = time()
        self.urlObj = urllib2.urlopen(self.URL)
        self.pagetime = time() - self.t
        self.webpg = self.urlObj.read()
        print 'Retriever.run: before if'
        print 'matching', self.ST_zeroMatch
        print ST_zeroMatch
        # why does this always drop through even though the if should be true?
        if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:
            # I don't think I want to use self.zeromatch, do I?
            print '** Found zeromatch'
            zeromatch.append(self.URL)
        #self.parse()
        print 'Retriever.run: past if'
        print 'exiting run()'
        self.done = 1

# the last 2 Shop.Com URLs should trigger the zeromatch condition
sites = ['http://www.foo.com/',
         'http://www.shop.com',
         'http://www.shop.com/op/aprod-~zzsome+thing',
         'http://www.shop.com/op/aprod-~xyzzy'
         #'http://www.yahoo.com/ThisPageDoesntExist'
         ]

threadList = []
URLs = []
workQueue = Queue.Queue()

for item in sites:
    workQueue.put(item)

print workQueue
print
print 'b4 test in sites'

for test in sites:
    retriever = Retriever(test)
    retriever.start()
    threadList.append(retriever)

print 'threadList:'
print threadList
print 'past for test in sites:'

while threading.activeCount() > 1:
    print 'Zzz...'
    sleep(1)

print 'entering retriever for loop'
for retriever in threadList:
    #URLs.extend(retriever.run())
    retriever.run()

print 'zeromatch:', zeromatch
# even though there are two URLs that should trigger it, nothing ever
# gets appended to the list.

Oct 29 '05 #1
On Sat, 28 Oct 2005, GregM wrote:
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

# why does this always drop through even though the If should be true.
if (ST_zeroMatch or ST_zeroMatch2) in self.webpg:


This code - I do not think it means what you think it means. Specifically,
it doesn't mean "is either of ST_zeroMatch or ST_zeroMatch2 in
self.webpg"; what it means is "apply the 'or' operator to ST_zeroMatch
and ST_zeroMatch2, then check if the result is in self.webpg". The result
of applying the or operator to two nonempty strings is the left-hand
string; your code is thus equivalent to

if ST_zeroMatch in self.webpg:

Which will work in cases where your page says 'You found 0 products', but
not in cases where it says 'There are no products matching your
selection'.

What you want is:

if (ST_zeroMatch in self.webpg) or (ST_zeroMatch2 in self.webpg):

Or something like that.
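
You can see the difference at the interactive prompt:

>>> 'You found 0 products' or 'There are no products matching your selection'
'You found 0 products'
>>> page = 'There are no products matching your selection'
>>> ('You found 0 products' or 'There are no products matching your selection') in page
False
>>> ('You found 0 products' in page) or ('There are no products matching your selection' in page)
True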

You say that you have a single-threaded version of this that works;
presumably, you have a working version of this logic in there. Did you
write the threaded version from scratch? Often a bad move!

tom

--
It's the 21st century, man - we rue _minutes_. -- Benjamin Rosenbaum
Oct 29 '05 #2
Tom,

Thanks for the reply, and sorry for the delay in getting back to you.
Thanks for pointing out my logic problem. I had added the 2nd part of
the if statement at the last minute...

Yes, I have a single-threaded version; it's several hundred lines and
uses COM to write the results out to an Excel spreadsheet. I was trying
to better understand threading and queues before I started hacking on my
current code... maybe that was a mistake... hey, I'm still learning, and
I learn a lot just by reading stuff posted to this group. I hope at some
point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff).

here is a summary:
# see if url exists
# if exists then
#   hit page
#   get text of page
#   see if text of page contains search terms
#   if it does then
#     update appropriate counters and lists
#   else update static line and do the next one
# when done with Links list
#   - calculate totals and times
#   - write info to xls file
# end.

# utils are functions and classes that I wrote
# from utils import PrintStatic, HttpExists2
#
# My version of 'easyExcel' with extensions and improvements.
# import excelled
import urllib2
import time
import socket
import os
#import msvcrt # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

#initialize Globals
timeout = 90 # sets timeout for urllib2.urlopen()
failedlinks = [] # list for failed urls
zeromatch = [] # list for 0 result searches
pseudo404 = [] # list for shop.com 404 pages
t = 0 # used to store starting time for getting a page.
count = 0 # number of tests so far
pagetime = 0 # time it took to load page
slowestpage = 0 # slowest page time
fastestpage = 10 # fastest page time
cumulative = 0 # total time to load all pages (used to calc. avg)

#version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
    """ checks url for shop.com 404 url
    shop.com 404 url -- returns status 200
    http://www.shop.com/amos/cc/main/404/ccsyn/260
    """
    if '404' in testUrl:
        return True
    else:
        return False

##### main program #####

try:
    links = open(testfile).readlines()
except:
    exc, err, tb = exc_info()
    print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
    #print str(exc)
    print str(err)
    print
    exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)
totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()
for url in links:
    count = count + 1
    # HttpExists2 - checks to see if URL exists and detects redirection.
    # handles 404's and exceptions better. Returns tuple depending on results:
    # if found: true and final url. if not found: false and attempted url
    pgChk = HttpExists2(url)
    if pgChk[0] == False:
        # failed url Exists
        failedlinks.append(pgChk[1])
    elif ShopCom404(pgChk[1]):
        # Our version of a 404
        pseudo404.append(url)
    if pgChk[0] and not ShopCom404(url):
        # if valid page and not a 404, then get the page and check it.
        try:
            t = time.time()
            urlObj = urllib2.urlopen(url)
            pagetime = time.time() - t
            webpg = urlObj.read()
            if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                zeromatch.append(url)
            elif ST_errorConnecting in webpg:
                # for some reason we got the error page
                # so add it to the failed urls
                failmsg = 'Error Connecting Page with: ' + url
                failedlinks.append(failmsg)
        except:
            print 'exception with: ' + url
        # figure page times
        cumulative += pagetime
        if pagetime > slowestpage:
            slowestpage = pagetime, url.strip()
        elif pagetime < fastestpage:
            fastestpage = pagetime, url.strip()
    msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
          '. Current runtime: ' + str(datetime.today() - start)
    # status message that updates the same line.
    #PrintStatic(msg)

### Now write out results
end = datetime.today()
finished = datetime.today()
finishedTimeStr = time.asctime()
avg = cumulative/totalNumberTests
failed = len(failedlinks)
nomatches = len(zeromatch)

#setup COM connection to Excel and write the spreadsheet.

If I understand what I've read about threading, I need to convert much
of the above into a function and then call threading.Thread's start()
(or run()) to fire off each thread, but where and how, and how to limit
it to X number of threads, is the part I get lost on. The examples I've
seen using queues and threads never show using a list (sequence) for the
source data, and I'm not sure where I'd use the Queue stuff, or, for
that matter, whether I'm just complicating the issue. Based on my
reading, I think the pattern I'm after looks something like the sketch
below.
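
Here's a minimal sketch of that pattern as I understand it (untested;
check_url here is a hypothetical stand-in for the per-URL logic in the
loop above):

import threading, Queue

NUM_WORKERS = 10                 # cap the number of simultaneous threads
workQueue = Queue.Queue()

def worker():
    while True:
        try:
            url = workQueue.get_nowait()   # grab the next url
        except Queue.Empty:
            break                          # queue drained, this worker is done
        check_url(url)   # hypothetical: the exists/fetch/match logic above

# load the queue from the list first, then start a fixed pool of threads
for url in links:
    workQueue.put(url)

workers = []
for i in range(NUM_WORKERS):
    th = threading.Thread(target=worker)
    th.start()
    workers.append(th)

for th in workers:
    th.join()    # wait for every worker to finish before reporting

Appending to shared lists like zeromatch and failedlinks from the
workers should be safe enough, since list.append is atomic in CPython.
Am I on the right track?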

Once again thanks for the help.
Greg.

Oct 31 '05 #3


