begin to parse a web page not entirely downloaded

Hi,

Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so stop
the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program moves on to the next statement. I suppose I could do what
I want with the socket module, but I'm sure there's a simpler way.

Feb 8 '07 #1
k0mp wrote:
> Is there a way to retrieve a web page and, before it is entirely
> downloaded, test whether a specific string is present, and if so stop
> the download?
> I believe that urllib.urlopen(url) will retrieve the whole page before
> the program moves on to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
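
For instance, a minimal sketch of that loop (not from the original
post; the URL and target string are just examples):

import urllib

TARGET = '</head>'  # hypothetical string to look for
f = urllib.urlopen('http://google.com')
found = False
while True:
    chunk = f.read(512)
    if not chunk:
        break  # whole page read without finding the string
    if TARGET in chunk:  # caveat: misses a match split across two chunks
        found = True
        break  # stop reading early; the rest of the page is not fetched
f.close()
print 'Found:', found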
Feb 8 '07 #2
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
> [snip]
> foo.read(512) will return as soon as 512 bytes have been received. You
> can keep calling it until it returns an empty string, indicating that
> there's no more data to be read.
Thanks for your answer :)

I'm not sure that read() works the way you say.
Here is a test I ran:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk:
            break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break

print time.clock()
It prints this:

f.read(CHUNK)
0.1
0.31

f.read()
0.31
0.32

It seems to take more time when I use read(size) than just read().
I think in both cases urllib2.urlopen retrieves the whole page.

Feb 8 '07 #3
k0mp wrote:
> It seems to take more time when I use read(size) than just read().
> I think in both cases urllib2.urlopen retrieves the whole page.
Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):

$ python -m timeit -n1 -r1 "import urllib"
"urllib.urlopen('http://ubuntu.cs.utah.edu/releases/6.06/ubuntu-6.06.1-desktop-i386.iso').read(512)"
1 loops, best of 1: 596 msec per loop
Feb 8 '07 #4
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
> [snip]
>
> print 'f.read()'
> print time.clock()
> for i in range(30):
>     f = urllib2.urlopen('http://google.com')
>     m = re.search('<html>', f.read())
>     if m != None:
>         break
A fair comparison would use "pass" here, or a while loop as in the
other case. As it is, it compares thirty runs of read(CHUNKSIZE)
against a single read().

Björn
Feb 8 '07 #5
On Feb 8, 8:06 pm, Björn Steinbrink <B.Steinbr...@gmx.de> wrote:
> On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
> [snip]
>
> A fair comparison would use "pass" here, or a while loop as in the
> other case. As it is, it compares thirty runs of read(CHUNKSIZE)
> against a single read().
>
> Björn
You're right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because when the string is found I don't want the rest of the page to
be parsed).

I now obtain this:

f.read(CHUNK)
0.1
0.17

f.read()
0.17
0.23

f.read() is still faster than f.read(CHUNK).
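
For reference, here is roughly what the corrected second loop looks
like after that change (a sketch based on the script above; all 30
requests now run to completion):

for i in range(30):
    f = urllib2.urlopen('http://aol.com')
    m = re.search('<html>', f.read())
    if m != None:
        continue  # match found; move on to the next request instead of leaving the loop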
Feb 8 '07 #6
On Feb 8, 8:02 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
> k0mp wrote:
> > It seems to take more time when I use read(size) than just read().
> > I think in both cases urllib2.urlopen retrieves the whole page.
>
> Google's home page is very small, so it's not really a great test of
> that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
> (beware of wrap):
> [snip]
> 1 loops, best of 1: 596 msec per loop
OK, you've convinced me. The fact that I didn't get better results in
my test with read(512) must be because most of the time goes to the
server's response time, not to the data transfer over the network.
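
One rough way to check that hypothesis (a sketch, not from the thread;
the URL is just an example) is to time the connection and the first
read separately:

import time
import urllib2

t0 = time.time()
f = urllib2.urlopen('http://aol.com')  # connect and wait for the response headers
t1 = time.time()
f.read(512)                            # transfer the first 512 bytes of the body
t2 = time.time()
f.close()
print 'response: %.3fs, first 512 bytes: %.3fs' % (t1 - t0, t2 - t1)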

Feb 8 '07 #7
On Feb 8, 6:20 pm, "k0mp" <Michel....@gmail.com> wrote:
> [snip]
> for i in range(30):
>     f = urllib2.urlopen('http://google.com')
>     while True:  # read the page using a loop
>         chunk = f.read(CHUNKSIZE)
>         if not chunk:
>             break
>         m = re.search('<html>', chunk)
>         if m != None:
>             break
> [snip]
I'd just like to point out that the above code assumes that '<html>'
falls entirely within one chunk; it could in fact be split across two
chunks.
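
A common fix (a sketch, not from the thread) is to carry a small
overlap from one chunk into the next, so a match that straddles a
chunk boundary is still seen:

import re
import urllib2

CHUNKSIZE = 1024
PATTERN = '<html>'
overlap = len(PATTERN) - 1  # longest partial match that can end a chunk

f = urllib2.urlopen('http://google.com')
tail = ''
found = False
while True:
    chunk = f.read(CHUNKSIZE)
    if not chunk:
        break
    window = tail + chunk  # search across the chunk boundary
    if re.search(PATTERN, window):
        found = True
        break  # stop downloading early
    tail = window[-overlap:]  # keep the last len(PATTERN)-1 bytes
f.close()
print 'Found:', found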

Feb 8 '07 #8
