urllib behaves strangely

Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 12 '06 #1
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page fine
in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.


The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological';
perhaps urllib is stricter than your browser (browsers are known to accept
every b******t you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
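
For illustration, a percent-quoted version of the URL can be built with
urllib.quote, which escapes ':' by default; this is only a sketch of the
suggestion above, the exact calls are not part of the original post:

import urllib

# Sketch only: build the URL with the ':' percent-encoded.
# urllib.quote() leaves '/' alone but escapes ':' by default.
path = "/wiki/Commons:Featured_pictures/chronological"
url = "http://commons.wikimedia.org" + urllib.quote(path)
print url   # .../Commons%3AFeatured_pictures/chronological
print urllib.urlopen(url).read()[:200]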

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #2
Benjamin Niemann wrote:
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page fine
in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.
The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.


Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',


You may try this anyway...
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #3

Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/


I think the problem might be with the Wikimedia Commons website itself,
rather than urllib. Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved,
and might consider your program a bot. I've had a similar error message
from www.wikipedia.org and had no problems with a couple of other
websites I've tried. Also, the HTML the program returns seems to be a
standard "ACCESS DENIED" page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
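
One way to check whether the server is actually refusing the request, rather
than urllib mangling the URL, is to fetch the page with urllib2, which raises
HTTPError on an error status instead of silently handing back the error page.
This is only a sketch; how this particular site responds is an assumption:

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
try:
    # urllib2.urlopen() raises HTTPError for 4xx/5xx responses,
    # unlike urllib.urlopen(), which just returns the error page body.
    page = urllib2.urlopen(url)
    print page.read()[:200]
except urllib2.HTTPError, e:
    print "server refused the request:", e.code, e.msg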

John Hicken

Jun 12 '06 #4
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url =
"http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronologi
cal"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page
fine in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

# The dict goes in the headers argument; Request's second positional argument is POST data.
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
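
If you would rather stay with plain urllib instead of switching to urllib2,
its User-Agent comes from the version attribute of FancyURLopener, so
overriding that should have the same effect. This is a sketch under that
assumption, not something from the original reply:

import urllib

class BrowserLikeOpener(urllib.FancyURLopener):
    # urllib takes its User-Agent header from this class attribute.
    version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
               'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

# Make the module-level urllib.urlopen() use the customized opener.
urllib._urlopener = BrowserLikeOpener()

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print urllib.urlopen(url).read()[:200]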
Jun 12 '06 #5
Duncan Booth <du**********@invalid.invalid> writes:
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib: [...]
"http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url ) [...]
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing, though, is that I can view the page
fine in my browser, and I can download it fine using curl. [...]
On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

[...]

If Wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what Wikipedia's policies
are, though.)
John
Jun 12 '06 #6
John J. Lee wrote:
It looks like wikipedia checks the User-Agent header and refuses to
send pages to browsers it doesn't like. Try: [...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)


They have a general policy against unapproved bots, which is
understandable, since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots which modify Wikipedia
articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says: "This policy in a nutshell:
Programs that update pages automatically in a useful and harmless way
may be welcome if their owners seek approval first and go to great
lengths to stop them running amok or being a drain on resources."


On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right. They
even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/

Jun 13 '06 #7
> On the other hand, something which is simply retrieving one or two fixed
> pages doesn't fit that definition of a bot, so it is probably all right.

I think so, too.

> They even provide a link to some frameworks for writing bots, e.g.
> http://sourceforge.net/projects/pywikipediabot/

Ah, that looks nice ..

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #8
> headers = {}
> headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
>                          'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
>
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)

Ah, thanks a lot, that works!

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #9

