
begin to parse a web page not entirely downloaded

Hi,

Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement. I suppose I would be able to
do what I want by using the socket module, but I'm sure there's a
simpler way to do it.

Feb 8 '07 #1
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
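
For example, a minimal sketch of that early-stop loop (the target
string here is a hypothetical choice, and the per-chunk search has a
caveat raised later in the thread):

import urllib

TARGET = '</head>'  # hypothetical string to stop at

f = urllib.urlopen('http://google.com')
while True:
    chunk = f.read(512)
    if not chunk:
        break          # page fully downloaded, string never seen
    if TARGET in chunk:
        break          # found it; stop reading the rest of the page
f.close()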
Feb 8 '07 #2
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break

print time.clock()
It prints this:
f.read(CHUNK)
0.1
0.31

f.read()
0.31
0.32
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Feb 8 '07 #3
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.
Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):

$ python -m timeit -n1 -r1 "import urllib"
"urllib.urlopen ('http://ubuntu.cs.utah. edu/releases/6.06/ubuntu-6.06.1-desktop-i386.iso').read (512)"
1 loops, best of 1: 596 msec per loop
Feb 8 '07 #4
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.

Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.

Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break
A fair comparison would use "pass" here. Or a while loop as in the
other case. The way it is, it compares 30 times read(CHUNKSIZE)
against one time read().

Björn
Feb 8 '07 #5
On Feb 8, 8:06 pm, Björn Steinbrink <B.Steinbr...@gmx.de> wrote:
On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.
Thanks for your answer :)
I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break

print time.clock()

print

print 'f.read()'
print time.clock()
for i in range(30):
    f = urllib2.urlopen('http://google.com')
    m = re.search('<html>', f.read())
    if m != None:
        break

A fair comparison would use "pass" here. Or a while loop as in the
other case. The way it is, it compares 30 times read(CHUNKSIZE)
against one time read().

Björn
That's right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because when the string is found I don't want the rest of the page to
be parsed).

I obtain this:
f.read(CHUNK)
0.1
0.17

f.read()
0.17
0.23
f.read() is still faster than f.read(CHUNK).
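
For reference, a version of the benchmark where both loops complete all
30 requests, along the lines Björn suggested, might look like this (a
sketch, not the exact code used above):

import urllib2
import re
import time

CHUNKSIZE = 1024
URL = 'http://aol.com'

print 'f.read(CHUNK)'
start = time.clock()
for i in range(30):
    f = urllib2.urlopen(URL)
    while True:
        chunk = f.read(CHUNKSIZE)
        if not chunk:
            break
        if re.search('<html>', chunk):
            break    # stop this download, but keep benchmarking
print time.clock() - start

print 'f.read()'
start = time.clock()
for i in range(30):
    f = urllib2.urlopen(URL)
    re.search('<html>', f.read())
    # no early exit from the for loop: all 30 runs complete here too
print time.clock() - start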
Feb 8 '07 #6
On Feb 8, 8:02 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
It seems to take more time when I use read(size) than just read().
I think in both cases urllib.urlopen retrieves the whole page.

Google's home page is very small, so it's not really a great test of
that. Here's a test downloading the first 512 bytes of an Ubuntu ISO
(beware of wrap):

$ python -m timeit -n1 -r1 "import urllib"
"urllib.urlopen ('http://ubuntu.cs.utah. edu/releases/6.06/ubuntu-6.06.1-desktop-i386.is...)"
1 loops, best of 1: 596 msec per loop
OK, you've convinced me. The fact that I didn't get better results in
my test with read(512) must be because most of the time is spent
waiting for the server's response, not transferring the data over the
network.

Feb 8 '07 #7
On Feb 8, 6:20 pm, "k0mp" <Michel....@gmail.com> wrote:
On Feb 8, 6:54 pm, Leif K-Brooks <eurl...@ecritters.biz> wrote:
k0mp wrote:
Is there a way to retrieve a web page and, before it is entirely
downloaded, test whether a specific string is present, and if so
stop the download?
I believe that urllib.urlopen(url) will retrieve the whole page before
the program goes to the next statement.
Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:

>>> foo = urllib.urlopen('http://google.com')
>>> foo.read(512)
'<html><head...

foo.read(512) will return as soon as 512 bytes have been received. You
can keep calling it until it returns an empty string, indicating that
there's no more data to be read.

Thanks for your answer :)

I'm not sure that read() works as you say.
Here is a test I've done:

import urllib2
import re
import time

CHUNKSIZE = 1024

print 'f.read(CHUNK)'
print time.clock()

for i in range(30):
    f = urllib2.urlopen('http://google.com')
    while True:  # read the page using a loop
        chunk = f.read(CHUNKSIZE)
        if not chunk: break
        m = re.search('<html>', chunk)
        if m != None:
            break
[snip]
I'd just like to point out that the above code assumes that the
'<html>' is entirely within one chunk; it could in fact be split
across chunks.
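
One way around that, sketched below, is to keep a short tail of the
previous chunk: len(target) - 1 bytes is always enough to catch a match
that straddles the boundary. The function name and URL here are just
for illustration:

import urllib2

def page_contains(url, target, chunksize=1024):
    """Sketch: return True as soon as target is seen in the stream."""
    f = urllib2.urlopen(url)
    tail = ''
    try:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                return False   # whole page read, no match found
            if target in tail + chunk:
                return True    # match found; stop downloading
            # keep just enough of this chunk to catch a straddling match
            tail = chunk[-(len(target) - 1):]
    finally:
        f.close()

print page_contains('http://google.com', '<html>')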

Feb 8 '07 #8
