
urllib behaves strangely

Here is a very simple Python script utilizing urllib:

import urllib

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen(url)
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 12 '06 #1
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> import urllib
>
> url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> print url
> print
> file = urllib.urlopen(url)
> mime = file.info()
> print mime
> print file.read()
> print file.geturl()
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine
> in my browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
>
> Any ideas?
> I would very much appreciate any hints or suggestions.


The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'. Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological';
perhaps urllib is stricter than your browser (browsers are known to accept
every b******t you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
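
For what it's worth, a minimal sketch (Python 2) of %-quoting the path with
urllib.quote rather than by hand; the path string is just the one from your
URL:

import urllib

# With the default safe='/', everything except letters, digits, '_.-'
# and '/' is %-escaped, so ':' becomes '%3A'
path = urllib.quote("Commons:Featured_pictures/chronological")
url = "http://commons.wikimedia.org/wiki/" + path
print url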

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #2
Benjamin Niemann wrote:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib:
>>
>> [...]
>>
>> However, when I execute it, I get an HTML error ("access denied").
>> [...]
>
> The ':' in '..Commons:Feat..' is not a legal character in this part of the
> URI and has to be %-quoted as '%3a'.

Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...

> Try the URI
> 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',


You may try this anyway...
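
Regarding the retraction above: a quick check (Python 2) with the standard
urlparse module confirms that the ':' survives URL parsing untouched:

import urlparse

# parts[2] is the path component of the parsed URL
parts = urlparse.urlparse("http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological")
print parts[2]  # '/wiki/Commons:Featured_pictures/chronological'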
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #3

Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> [...]
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine in my
> browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
>
> Any ideas?
> I would very much appreciate any hints or suggestions.


I think the problem might be with the Wikimedia Commons website itself,
rather than with urllib. Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved and
considers your program a bot. I've had a similar error message from
www.wikipedia.org and no problems with a couple of other websites I've
tried. Also, the HTML the program returns seems to be a standard
"ACCESS DENIED" page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
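
One quick check you can do from Python itself is to see what the site's
robots.txt permits, using the standard robotparser module. A minimal sketch
(Python 2); the agent name is just an example approximating urllib's default,
and robots.txt only covers part of any bot policy:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://commons.wikimedia.org/robots.txt")
rp.read()
# True means robots.txt does not forbid this fetch for that agent
print rp.can_fetch("Python-urllib", "/wiki/Commons:Featured_pictures/chronological")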

John Hicken

Jun 12 '06 #4
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> [...]
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine
> in my browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...

It looks like Wikipedia checks the User-Agent header and refuses to send
pages to clients it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

# headers must go in as a keyword argument; the second positional
# argument of urllib2.Request is the POST data
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
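
For reference, a self-contained sketch (Python 2) of the same workaround,
reusing the URL from the original post; any common browser User-Agent string
should behave similarly:

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
print response.info()    # MIME headers
print response.read()    # page body
print response.geturl()  # final URL after any redirects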
Jun 12 '06 #5
Duncan Booth <du**********@invalid.invalid> writes:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib: [...]
>>
>> However, when I execute it, I get an HTML error ("access denied").
>>
>> On the one hand, the funny thing is that I can view the page fine
>> in my browser, and I can download it fine using curl. [...]
>> On the other hand, it must have something to do with the URL, because
>> urllib works fine with any other URL I have tried ...
>
> It looks like Wikipedia checks the User-Agent header and refuses to send
> pages to clients it doesn't like. Try:
>
> [...]

If Wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what Wikipedia's policies
are, though.)
John
Jun 12 '06 #6
John J. Lee wrote:
>> It looks like Wikipedia checks the User-Agent header and refuses to
>> send pages to clients it doesn't like. Try: [...]
>
> If Wikipedia is trying to discourage this kind of scraping, it's
> probably not polite to do it. (I don't know what Wikipedia's policies
> are, though.)


They have a general policy against unapproved bots, which is
understandable, since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots that modify Wikipedia
articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says:

> This policy in a nutshell: Programs that update pages automatically in
> a useful and harmless way may be welcome if their owners seek approval
> first and go to great lengths to stop them running amok or being a
> drain on resources.


On the other hand, something which simply retrieves one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right.
They even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/
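
If you do fetch a page or two this way, a politer variant (my assumption,
not something the policy spells out, and whether the server accepts a
non-browser agent is something you'd have to test) is to identify your
script honestly and pause between requests:

import time
import urllib2

# Hypothetical script name and contact address, so the site operators
# can see who is fetching and reach you if needed
headers = {'User-Agent': 'FeaturedPicturesFetcher/0.1 (contact: you@example.org)'}

for path in ["Commons:Featured_pictures/chronological"]:
    url = "http://commons.wikimedia.org/wiki/" + path
    page = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    time.sleep(2)  # be gentle with the server between requests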

Jun 13 '06 #7
> On the other hand, something which simply retrieves one or two fixed
> pages doesn't fit that definition of a bot, so it is probably all right.

I think so, too.

> They even provide a link to some frameworks for writing bots, e.g.
> http://sourceforge.net/projects/pywikipediabot/

Ah, that looks nice ...

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #8
> import urllib2
>
> headers = {}
> headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
>                          'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
>
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)

Ah, thanks a lot, that works!

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #9
