
begin to parse a web page not entirely downloaded

Hi,

Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement. I suppose I would be able to
do what I want by using the socket module, but I'm sure there's a
simpler way to do it.

Feb 8 '07 #1
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
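
For example, a minimal sketch of that early-stop loop (the target
string here is a hypothetical choice, and the per-chunk search has a
caveat raised later in the thread):

import urllib

TARGET = '</head>'  # hypothetical string to stop at

f = urllib.urlopen('http://google.com')
while True:
    chunk = f.read(512)
    if not chunk:
        break          # page fully downloaded, string never seen
    if TARGET in chunk:
        break          # found it; stop reading the rest of the page
f.close()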
Feb 8 '07 #2
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break

print time.clock()
It prints this:
f.read(CHUNK)
0.1
0.31

f.read()
0.31
0.32
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Feb 8 '07 #3
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.
Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):

$ python -m timeit -n1 -r1 "import urllib"
"urllib.urlopen ('http://ubuntu.cs.utah. edu/releases/6.06/ubuntu-6.06.1-desktop-i386.iso').read (512)"
1 loops, best of 1: 596 msec per loop
Feb 8 '07 #4
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.

Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break
A fair comparison would use "pass" here. Or a while loop as in the
other case. The way it is, it compares 30 times read(CHUNKSIZE)
against one time read().

Björn
Feb 8 '07 #5
On Feb 8, 8:06 pm, Björn Steinbrink <B.Steinbr...@gmx.de> wrote:
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
Thanks for your answer :)
I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break

A fair comparison would use "pass" here. Or a while loop as in the
other case. The way it is, it compares 30 times read(CHUNKSIZE)
against one time read().

Björn
That's right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because when the string is found I don't want the rest of the page to
be parsed).

I obtain this:
f.read(CHUNK)
0.1
0.17

f.read()
0.17
0.23
f.read() is still faster than f.read(CHUNK).
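
For reference, a version of the benchmark where both loops complete all
30 requests, along the lines Björn suggested, might look like this (a
sketch, not the exact code used above):

import urllib2
import re
import time

CHUNKSIZE = 1024
URL = 'http://aol.com'

print 'f.read(CHUNK)'
start = time.clock()
for i in range(30):
    f = urllib2.urlopen(URL)
    while True:
        chunk = f.read(CHUNKSIZE)
        if not chunk:
            break
        if re.search('<html>', chunk):
            break    # stop this download, but keep benchmarking
print time.clock() - start

print 'f.read()'
start = time.clock()
for i in range(30):
    f = urllib2.urlopen(URL)
    re.search('<html>', f.read())
    # no early exit from the for loop: all 30 runs complete here too
print time.clock() - start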
Feb 8 '07 #6
On Feb 8, 8:02 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):

$ python -m timeit -n1 -r1 "import urllib"
"urllib.urlopen ('http://ubuntu.cs.utah. edu/releases/6.06/ubuntu-6.06.1-desktop-i386.is...)"
1 loops, best of 1: 596 msec per loop
OK, you've convinced me. The fact that I didn't get better results in
my test with read(512) must be because most of the time is spent
waiting for the server's response, not transferring the data over the
network.

Feb 8 '07 #7
On Feb 8, 6:20 pm, "k0mp" <Michel....@gmail.com> wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.

Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break
[snip]
I'd just like to point out that the above code assumes that the
'<html>' is entirely within one chunk; it could in fact be split
across chunks.
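
One way around that, sketched below, is to keep a short tail of the
previous chunk: len(target) - 1 bytes is always enough to catch a match
that straddles the boundary. The function name and URL here are just
for illustration:

import urllib2

def page_contains(url, target, chunksize=1024):
    """Sketch: return True as soon as target is seen in the stream."""
    f = urllib2.urlopen(url)
    tail = ''
    try:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                return False   # whole page read, no match found
            if target in tail + chunk:
                return True    # match found; stop downloading
            # keep just enough of this chunk to catch a straddling match
            tail = chunk[-(len(target) - 1):]
    finally:
        f.close()

print page_contains('http://google.com', '<html>')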

Feb 8 '07 #8
