urllib behaves strangely

Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 12 '06 #1
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine
in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.


The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',
perhaps urllib is stricter than your browsers (which are known to accept
every b******t you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
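
If you don't want to build the escaped form by hand, urllib.quote can
produce it (a small sketch; ':' gets quoted because it is not in quote()'s
default safe set, while '/' is left alone):

import urllib

# Percent-encode the path; '/' stays as-is, ':' becomes '%3A'.
path = "/wiki/Commons:Featured_pictures/chronological"
print "http://commons.wikimedia.org" + urllib.quote(path, safe="/")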

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #2
Benjamin Niemann wrote:
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine
in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL because
urllib works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.
The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.


Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',


You may try this anyway...
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
Jun 12 '06 #3

Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL because urllib
works fine with any other URL I have tried ...

Any ideas?
I would very much appreciate any hints or suggestions.

Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/


I think the problem might be with the Wikimedia Commons website itself,
rather than urllib. Wikipedia has a policy against unapproved bots:
http://en.wikipedia.org/wiki/Wikipedia:Bots

It might be that Wikimedia Commons blocks bots that aren't approved,
and considers your program a bot. I've had a similar error message
from www.wikipedia.org and no problems with a couple of other
websites I've tried. Also, the HTML the program returns seems to be a
standard "ACCESS DENIED" page.

It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
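
A quick way to check is to look at what the server actually sends back.
A rough Python 2 sketch (it tests for the "access denied" wording reported
above; the exact text of the block page is an assumption):

import urllib

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
resp = urllib.urlopen(url)
body = resp.read()

# Show the Content-Type and see whether the body looks like a block page.
print resp.info().gettype()
if "access denied" in body.lower():
    print "The server returned a block page rather than the article."
else:
    print "Got %d bytes of HTML." % len(body)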

John Hicken

Jun 12 '06 #4
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:

import urllib
url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page
fine in my browser, and I can download it fine using curl.

On the other hand, it must have something to do with the URL because
urllib works fine with any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
...

That (or code very like it) worked when I tried it.
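
If you would rather stay with plain urllib, the same effect can be had by
overriding the opener's version attribute, which urllib sends as the
User-Agent header (a sketch; the class name is just an example, and it
reuses the User-Agent string above):

import urllib

class BrowserLikeOpener(urllib.FancyURLopener):
    # urllib sends this string as the User-Agent header
    version = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
               'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
page = BrowserLikeOpener().open(url)
print page.read()[:200]
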
Jun 12 '06 #5
Duncan Booth <du**********@invalid.invalid> writes:
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:
[...]
"http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print
file = urllib.urlopen( url )
[...]
However, when I execute it, I get an HTML error ("access denied").

On the one hand, the funny thing is that I can view the page
fine in my browser, and I can download it fine using curl.
[...]
On the other hand, it must have something to do with the URL because
urllib works fine with any other URL I have tried ...

It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

[...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)
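
One way to see what the site itself permits is to ask its robots.txt; the
standard robotparser module can do that (a sketch; whether the Commons
robots.txt actually covers this path is not established here):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://commons.wikimedia.org/robots.txt")
rp.read()

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print rp.can_fetch("*", url)   # True if anonymous crawlers may fetch this URL
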
John
Jun 12 '06 #6
John J. Lee wrote:
It looks like wikipedia checks the User-Agent header and refuses to
send pages to browsers it doesn't like. Try: [...]

If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)


They have a general policy against unapproved bots, which is
understandable since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots which modify wikipedia
articles automatically.

http://en.wikipedia.org/wiki/Wikipedia:Bots says:

"This policy in a nutshell: Programs that update pages automatically in a
useful and harmless way may be welcome if their owners seek approval first
and go to great lengths to stop them running amok or being a drain on
resources."


On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably alright. They
even provide a link to some frameworks for writing bots, e.g.

http://sourceforge.net/projects/pywikipediabot/

Jun 13 '06 #7
> On the other hand, something which is simply retrieving one or two fixed
> pages doesn't fit that definition of a bot, so it is probably alright.

I think so, too.

> They even provide a link to some frameworks for writing bots, e.g.
> http://sourceforge.net/projects/pywikipediabot/

Ah, that looks nice ..

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #8
> headers = {}
> headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
>                          'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
>
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)

Ah, thanks a lot, that works!

Best regards,
Gabriel.

--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
Jun 13 '06 #9
