Hi,
Is there a way to retrieve a web page and, before it is entirely
downloaded, begin to test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes on to the next statement. I suppose I could do what I
want using the socket module directly, but I'm sure there's a simpler
way to do it.
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, begin to test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes on to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...
foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
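[In today's Python 3 terms, where urllib.urlopen became urllib.request.urlopen, the loop Leif describes can be wrapped up as below. This is only a sketch: the name found_in_stream is invented for it, and note the caveat, raised later in this thread, that this naive version misses a marker split across two chunks.]

```python
import io

def found_in_stream(stream, needle, chunksize=512):
    """Read `stream` in chunks and return True as soon as `needle`
    appears, without consuming the rest of the stream (and so, for a
    network response, without downloading the rest of the page)."""
    while True:
        chunk = stream.read(chunksize)
        if not chunk:        # empty bytes: no more data to be read
            return False
        if needle in chunk:  # naive: misses a needle split across chunks
            return True

# Network-free demonstration with an in-memory stream:
page = io.BytesIO(b'<html><head><title>hi</title>' + b'x' * 4096)
print(found_in_stream(page, b'<html>'))  # True
```

[Against a real URL one would pass the object returned by urllib.request.urlopen(url) as `stream`; closing the response early is what actually aborts the transfer.]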
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, begin to test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes on to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...
foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
Thanks for your answer :)
I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break
print time.clock()

print
print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break
print time.clock()
It prints this:

f.read(CHUNK)
0.1
0.31

f.read()
0.31
0.32

It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):
$ python -m timeit -n1 -r1 "import urllib" "urllib.urlopen('http://ubuntu.cs.utah.edu/releases/6.06/ubuntu-6.06.1-desktop-i386.iso').read(512)"
1 loops, best of 1: 596 msec per loop
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
[snip]
print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break
A fair comparison would use "pass" here, or a while loop as in the
other case. As written, it compares 30 runs of read(CHUNKSIZE)
against a single run of read().
Björn
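[Björn's objection can be made concrete without touching the network. In the sketch below, the `fetch` callable standing in for urllib2.urlopen is an assumption of the example; it shows that with the early `break`, the unchunked variant performs only one fetch out of the intended 30.]

```python
import io

def run_trials(fetch, n=30, chunksize=1024, chunked=True):
    """Run up to `n` fetch-and-search trials; return how many fetches ran.

    `fetch()` must return a file-like object yielding bytes.
    """
    fetches = 0
    for _ in range(n):
        f = fetch()
        fetches += 1
        if chunked:
            while True:                # first loop: chunked read
                chunk = f.read(chunksize)
                if not chunk:
                    break
                if b'<html>' in chunk:
                    break
        else:
            if b'<html>' in f.read():  # second loop: whole-page read
                break                  # the unfair early exit Björn flags
    return fetches

page = b'<html><head>' + b'x' * 4096
print(run_trials(lambda: io.BytesIO(page), chunked=True))   # 30
print(run_trials(lambda: io.BytesIO(page), chunked=False))  # 1
```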
On Feb 8, 8:06 pm, Björn Steinbrink <B.Steinbr...@gmx.de> wrote:
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
[snip]
A fair comparison would use "pass" here, or a while loop as in the
other case. As written, it compares 30 runs of read(CHUNKSIZE)
against a single run of read().

Björn
That's right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because when the string is found I don't want the rest of the page to
be parsed). I obtain this:

f.read(CHUNK)
0.1
0.17

f.read()
0.17
0.23

f.read() is still faster than f.read(CHUNK).
On Feb 8, 8:02 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):
$ python -m timeit -n1 -r1 "import urllib" "urllib.urlopen('http://ubuntu.cs.utah.edu/releases/6.06/ubuntu-6.06.1-desktop-i386.is...)"
1 loops, best of 1: 596 msec per loop

OK, you've convinced me. The fact that I didn't get better results in
my test with read(512) must be because most of the time goes to the
server's response latency, not to the data transfer on the network.
On Feb 8, 6:20 pm, "k0mp" <Michel....@gmail.com> wrote:
[snip]
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break
[snip]
I'd just like to point out that the above code assumes that the
'<html>' string falls entirely within one chunk; it could in fact be
split across two chunks.
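[One standard fix for that boundary problem is to carry an overlap of len(needle) - 1 bytes from each chunk into the next search. A sketch, with the function name invented for illustration:]

```python
import io

def stream_contains(stream, needle, chunksize=1024):
    """Chunked search that also finds `needle` when it straddles a
    chunk boundary, by rescanning a small overlap each time."""
    overlap = b''
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            return False
        if needle in overlap + chunk:
            return True
        # keep the last len(needle) - 1 bytes for the next round
        overlap = chunk[-(len(needle) - 1):] if len(needle) > 1 else b''

# '<html>' split exactly across an 8-byte chunk boundary is still found:
print(stream_contains(io.BytesIO(b'xxxxx<ht' + b'ml>yyyyy'),
                      b'<html>', chunksize=8))  # True
```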