Inconsistent result from urllib.urlopen

Here's the problem: using Netscape 7.1, I use the View Page Source
command (the URL is http://en.wikipedia.org/wiki/Cain) and save the
raw HTML file. It's 67 KB and has the addresses of all the images
in it. I want the exact same thing from my Python script, but I'm not
getting it. Instead, I get a file of only 21 KB that has no image
addresses. Here's the code I use:

import urllib
f = urllib.urlopen('http://en.wikipedia.org/wiki/Cain')
data = f.read(9999999)
f.close()
f1 = open('junk.txt', 'w')
f1.write(data)
f1.close()

Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.

Apr 12 '07 #1
4 Replies
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.
Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer', 'opera', etc.

Do you want to know how to change your user agent string? Google for
it.... :-)

Laszlo
Apr 12 '07 #2
On Thu, 12 Apr 2007 15:25:03 -0300, <ju**********@hotmail.com> wrote:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.
The server (that is, Wikipedia) may choose to send a different response
based on the User-Agent header you provide.
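
To see which User-Agent a script sends by default, a minimal sketch follows. It assumes Python 3, where the relevant module is urllib.request rather than the Python 2 urllib used in this thread; the helper name default_user_agent is made up for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value a plain urllib.request opener sends."""
    # build_opener() creates an opener with the library's default headers.
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    headers = dict(opener.addheaders)
    return headers.get("User-agent")


if __name__ == "__main__":
    # No network access is needed; this only inspects the opener.
    print(default_user_agent())
```

The printed value starts with "Python-urllib/", which is nothing like a browser string, so a server that filters on this header can easily tell the script apart from Netscape.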

--
Gabriel Genellina
Apr 12 '07 #3
ju**********@hotmail.com wrote:
import urllib
f = urllib.urlopen('http://en.wikipedia.org/wiki/Cain')
data = f.read(9999999)
f.close()
f1 = open('junk.txt', 'w')
f1.write(data)
f1.close()
Did you see the file "junk.txt"? It's an error page from Wikipedia, not
the actual content page...
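
One quick way to check what actually landed in junk.txt is to pull out the page's <title>. This is only a rough sketch, assuming Python 3 and assuming the error page carries a descriptive title; page_title is a hypothetical helper, not part of any library.

```python
import re


def page_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and runs of whitespace inside the title.
    return " ".join(match.group(1).split())


if __name__ == "__main__":
    # junk.txt is the file written by the script quoted above.
    with open("junk.txt", encoding="utf-8", errors="replace") as f:
        print(page_title(f.read()))
```

If the title mentions an error rather than the article name, the script saved an error page, not the article.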

Regards,

--
.. Facundo
..
Blog: http://www.taniquetil.com.ar/plog/
PyAr: http://www.python.org/ar/
Apr 12 '07 #4

Laszlo Nagy wrote:
Any ideas why I don't get the same result from the python script as I
do from a web browser? This problem seems to be a recent
development. The scripts I wrote like this worked fine for a while
and then stopped working within the past couple of weeks.
Maybe it has something to do with your user agent string. The server
side can decide to return different content when your user agent is
not 'mozilla', 'internet explorer', 'opera', etc.

Do you want to know how to change your user agent string? Google for
it.... :-)

Laszlo
Thanks. That is the fix I needed. I added

urllib.URLopener.version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; '
                            'en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)')

as the second line of code and now it is actually getting content, not
just an error message. It's not the exact same format as you get from
saving the page from the web browser, but all the links and image
addresses are in place.
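
For readers on Python 3, where urllib.urlopen no longer exists, the same idea can be sketched with urllib.request by setting the header per request. The helper names build_request and fetch and the example user agent string are assumptions for illustration; a string that identifies your own script may be a better choice than copying a browser's.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent):
    """Download the page body as bytes using the given User-Agent."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical user agent string; replace with one describing your script.
    data = fetch("http://en.wikipedia.org/wiki/Cain", "ExampleFetcher/0.1")
    # Write bytes unchanged so the saved file matches what the server sent.
    with open("junk.txt", "wb") as f:
        f.write(data)
```

Unlike the Python 2 urllib.urlopen call in the original script, urllib.request.urlopen raises urllib.error.HTTPError on an error status, so a rejected request is harder to mistake for real content.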

Apr 13 '07 #5
