On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes on to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>> import urllib
>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...
foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
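For example, a minimal sketch of that loop (here '</head>' is just a
stand-in for whatever string you're after; accumulating the chunks into
one buffer avoids missing a match that straddles two chunks):

import urllib

CHUNKSIZE = 512
TARGET = '</head>'  # stand-in for the string you're looking for

f = urllib.urlopen('http://google.com')
data = ''
while True:
    chunk = f.read(CHUNKSIZE)
    if not chunk:       # empty string: server finished sending
        break
    data += chunk
    if TARGET in data:  # string found, stop downloading
        break
f.close()
print TARGET in data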
Thanks for your answer :)
I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page chunk by chunk
        chunk = f.read(CHUNKSIZE)
        if not chunk:  # empty string: nothing left to read
            break
        m = re.search('<html>', chunk)
        if m is not None:  # found the string, stop reading this page
            break
print time.clock()
print
print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m is not None:
        break  # NB: this break leaves the for loop, so only one page is fetched
print time.clock()
It prints this:
f.read(CHUNK)
0.1
0.31
f.read()
0.31
0.32
It seems to take more time when I use read(size) than plain read().
I think in both cases urllib2.urlopen retrieves the whole page.
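A quick way to probe that (a sketch; note that time.clock() measures CPU
time rather than wall time on Unix, so time.time() is safer for timing
network I/O) is to time urlopen() itself, the first small read(), and the
remainder separately. For a page as small as google.com the whole body may
arrive in one packet, so a larger page makes a more convincing test:

import time
import urllib2

t0 = time.time()
f = urllib2.urlopen('http://google.com')
t1 = time.time()     # connection made, headers read
first = f.read(512)  # first chunk of the body
t2 = time.time()
rest = f.read()      # everything else
t3 = time.time()
print 'urlopen: %.3fs  first 512 bytes: %.3fs  rest: %.3fs' % (
    t1 - t0, t2 - t1, t3 - t2)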