Is it possible to download only the <head> of a web page?

Rex
I am writing a script that executes a bunch of queries through a form
on a website and reads the results. I am only interested in the
<title> section in the <head> of each web page. Currently, each page
the server returns is about 100 KB and contains a bunch of HTML and
JavaScript, all of which I don't need; I don't want to waste bandwidth
or consume too much of the server's resources. I just need the <title>
string.

Is there any way to download less than the entire web page?
Sep 4 '08 #1
Rex wrote:
> I am writing a script that executes a bunch of queries through a form
> on a website and reads the results. I am only interested in the
> <title> section in the <head> of each web page. Currently, each page
> the server returns is about 100 KB and contains a bunch of HTML and
> JavaScript, all of which I don't need; I don't want to waste bandwidth
> or consume too much of the server's resources. I just need the <title>
> string.

You need to issue a GET request to get the HTML head section, which
almost always means that the server will build the entire page before
sending it to you (so that it can set Content-Length, etc.).

You can save on network traffic by parsing the data as it arrives and
stopping once you have the TITLE element:

http://effbot.org/librarybook/sgmllib.htm
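
A minimal sketch of the parse-as-it-arrives idea, under some assumptions: the linked page covers sgmllib, which was removed in Python 3, so this swaps in the standard html.parser module instead. The names TitleParser, read_title and fetch_title, the chunk size, and the UTF-8 default are all made up for illustration; none of them come from the thread.

```python
import codecs
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def read_title(stream, chunk_size=1024, encoding="utf-8"):
    """Read `stream` chunk by chunk and stop once </title> has been seen.

    Returns (title or None, number of bytes read). The encoding is an
    assumption; a real page may declare a different one.
    """
    parser = TitleParser()
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    bytes_read = 0
    while not parser.done:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of data without finding a complete title
        bytes_read += len(chunk)
        parser.feed(decoder.decode(chunk))
    title = "".join(parser.parts).strip() if parser.done else None
    return title, bytes_read


def fetch_title(url):
    """Fetch only as much of `url` as is needed to read its title."""
    # Leaving the with-block closes the connection, so the rest of the
    # body is not read by this script.
    with urlopen(url) as response:
        return read_title(response)[0]
```

Calling fetch_title(url) should return the title string, or None if the page has no complete title element. The server may still have sent some extra data that was already in flight when the connection was closed, so the saving is approximate.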

</F>

Sep 4 '08 #2
On Thu, 04 Sep 2008 18:53:33 -0300, Fredrik Lundh <fr*****@pythonware.com>
wrote:
> Rex wrote:
>> I am writing a script that executes a bunch of queries through a form
>> on a website and reads the results. I am only interested in the
>> <title> section in the <head> of each web page. Currently, each page
>> the server returns is about 100 KB and contains a bunch of HTML and
>> JavaScript, all of which I don't need; I don't want to waste bandwidth
>> or consume too much of the server's resources. I just need the <title>
>> string.
>
> You need to issue a GET request to get the HTML head section, which
> almost always means that the server will build the entire page before
> sending it to you (so that it can set Content-Length, etc.).
>
> You can save on network traffic by parsing the data as it arrives and
> stopping once you have the TITLE element:
>
> http://effbot.org/librarybook/sgmllib.htm

Another alternative would be to estimate how many bytes it takes to reach
the <title> tag, and issue a GET with a Range header. The server will, very
likely, still have to build the entire page, but it won't attempt to send
more bytes than requested. (If the requested size is not enough, one can
issue another GET asking for more data.)

http://www.w3.org/Protocols/rfc2616/....html#sec14.35
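
A rough sketch of the Range approach using urllib.request, again under assumptions: the function names, the 2048-byte first guess, the 64 KB cap, and the UTF-8 decoding are all hypothetical choices, and a server is free to ignore the Range header and answer with a full 200 response, which the code checks for.

```python
import re
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Simple pattern for illustration; a real HTML parser is more robust.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def build_range_request(url, start, end):
    """Build a GET request asking for bytes start..end (inclusive)."""
    return Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})


def extract_title(data):
    """Return the title text found in `data` (bytes), or None."""
    match = TITLE_RE.search(data)
    if match is None:
        return None
    # UTF-8 is an assumption; the page may use another encoding.
    return match.group(1).decode("utf-8", errors="replace").strip()


def fetch_title_by_range(url, first_guess=2048, max_bytes=65536):
    """Request growing byte ranges until the title shows up or the cap is hit."""
    data = b""
    size = first_guess
    while len(data) < max_bytes:
        request = build_range_request(url, len(data), len(data) + size - 1)
        try:
            with urlopen(request) as response:
                if response.status != 206:
                    # Range was ignored: the server is sending the whole
                    # page, so read a bounded prefix and stop there.
                    return extract_title(response.read(max_bytes))
                chunk = response.read()
        except HTTPError as error:
            if error.code == 416:  # requested range starts past the end
                break
            raise
        if not chunk:
            break
        data += chunk
        title = extract_title(data)
        if title is not None:
            return title
        size *= 2  # the estimate was too small, ask for more next time
    return None
```

Doubling the range on each retry keeps the number of extra requests small when the first estimate is too low, at the cost of sometimes fetching more than strictly needed.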

--
Gabriel Genellina

Sep 5 '08 #3
