
urllib behaves strangely

Here is a very simple Python script utilizing urllib:

import urllib

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen(url)
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL, because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 12 '06 #1
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> import urllib
>
> url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
> print url
> print
> file = urllib.urlopen(url)
> mime = file.info()
> print mime
> print file.read()
> print file.geturl()
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine
> in my browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
>
> Any ideas?
> I would very much appreciate any hints or suggestions.


The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'. Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological';
perhaps urllib is stricter than your browser (browsers are known to accept
every b******t you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
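
For what it's worth, a minimal sketch (Python 2) of %-quoting the path with
urllib.quote rather than by hand; the path string is just the one from your
URL:

import urllib

# With the default safe='/', everything except letters, digits, '_.-'
# and '/' is %-escaped, so ':' becomes '%3A'
path = urllib.quote("Commons:Featured_pictures/chronological")
url = "http://commons.wikimedia.org/wiki/" + path
print url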

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #2
Benjamin Niemann wrote:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib:
>>
>> [...]
>>
>> However, when I execute it, I get an HTML error ("access denied").
>> [...]
>
> The ':' in '..Commons:Feat..' is not a legal character in this part of the
> URI and has to be %-quoted as '%3a'.

Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...

> Try the URI
> 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',


You may try this anyway...
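
Regarding the retraction above: a quick check (Python 2) with the standard
urlparse module confirms that the ':' survives URL parsing untouched:

import urlparse

# parts[2] is the path component of the parsed URL
parts = urlparse.urlparse("http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological")
print parts[2]  # '/wiki/Commons:Featured_pictures/chronological'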
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #3

Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> [...]
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine in my
> browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...
>
> Any ideas?
> I would very much appreciate any hints or suggestions.


I think the problem might be with the Wikimedia Commons website itself,
rather than with urllib. Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved and
considers your program a bot. I've had a similar error message from
www.wikipedia.org and no problems with a couple of other websites I've
tried. Also, the HTML the program returns seems to be a standard
"ACCESS DENIED" page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
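
One quick check you can do from Python itself is to see what the site's
robots.txt permits, using the standard robotparser module. A minimal sketch
(Python 2); the agent name is just an example approximating urllib's default,
and robots.txt only covers part of any bot policy:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://commons.wikimedia.org/robots.txt")
rp.read()
# True means robots.txt does not forbid this fetch for that agent
print rp.can_fetch("Python-urllib", "/wiki/Commons:Featured_pictures/chronological")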

John Hicken

Jun 12 '06 #4
Gabriel Zachmann wrote:
> Here is a very simple Python script utilizing urllib:
>
> [...]
>
> However, when I execute it, I get an HTML error ("access denied").
>
> On the one hand, the funny thing is that I can view the page fine
> in my browser, and I can download it fine using curl.
>
> On the other hand, it must have something to do with the URL, because
> urllib works fine with any other URL I have tried ...

It looks like Wikipedia checks the User-Agent header and refuses to send
pages to clients it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

# headers must go in as a keyword argument; the second positional
# argument of urllib2.Request is the POST data
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
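
For reference, a self-contained sketch (Python 2) of the same workaround,
reusing the URL from the original post; any common browser User-Agent string
should behave similarly:

import urllib2

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'}

request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
print response.info()    # MIME headers
print response.read()    # page body
print response.geturl()  # final URL after any redirects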
Jun 12 '06 #5
Duncan Booth <du**********@invalid.invalid> writes:
> Gabriel Zachmann wrote:
>> Here is a very simple Python script utilizing urllib: [...]
>>
>> However, when I execute it, I get an HTML error ("access denied").
>>
>> On the one hand, the funny thing is that I can view the page fine
>> in my browser, and I can download it fine using curl. [...]
>> On the other hand, it must have something to do with the URL, because
>> urllib works fine with any other URL I have tried ...
>
> It looks like Wikipedia checks the User-Agent header and refuses to send
> pages to clients it doesn't like. Try:
>
> [...]

If Wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what Wikipedia's policies
are, though.)
John
Jun 12 '06 #6
John J. Lee wrote:
>> It looks like Wikipedia checks the User-Agent header and refuses to
>> send pages to clients it doesn't like. Try: [...]
>
> If Wikipedia is trying to discourage this kind of scraping, it's
> probably not polite to do it. (I don't know what Wikipedia's policies
> are, though.)


They have a general policy against unapproved bots, which is
understandable, since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots that modify Wikipedia
articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says:

> This policy in a nutshell: Programs that update pages automatically in
> a useful and harmless way may be welcome if their owners seek approval
> first and go to great lengths to stop them running amok or being a
> drain on resources.


On the other hand, something which simply retrieves one or two fixed
pages doesn't fit that definition of a bot, so it is probably all right.
They even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/
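
If you do fetch a page or two this way, a politer variant (my assumption,
not something the policy spells out, and whether the server accepts a
non-browser agent is something you'd have to test) is to identify your
script honestly and pause between requests:

import time
import urllib2

# Hypothetical script name and contact address, so the site operators
# can see who is fetching and reach you if needed
headers = {'User-Agent': 'FeaturedPicturesFetcher/0.1 (contact: you@example.org)'}

for path in ["Commons:Featured_pictures/chronological"]:
    url = "http://commons.wikimedia.org/wiki/" + path
    page = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
    time.sleep(2)  # be gentle with the server between requests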

Jun 13 '06 #7
> On the other hand, something which simply retrieves one or two fixed
> pages doesn't fit that definition of a bot, so it is probably all right.

I think so, too.

> They even provide a link to some frameworks for writing bots, e.g.
> http://sourceforge.net/projects/pywikipediabot/

Ah, that looks nice ...

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #8
> import urllib2
>
> headers = {}
> headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
>                          'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
>
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)

Ah, thanks a lot, that works!

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #9
