Here is a very simple Python script utilizing urllib:
import urllib

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
print url
print

file = urllib.urlopen( url )
mime = file.info()
print mime
print file.read()
print file.geturl()
However, when I execute it, I get an html error ("access denied").
On the one hand, the funny thing is that I can view the page fine in my
browser, and I can download it fine using curl.
On the other hand, it must have something to do with the URL, because urllib
works fine with any other URL I have tried ...
Any ideas?
I would very much appreciate any hints or suggestions.
Best regards,
Gabriel.
--
/-----------------------------------------------------------------------\
| If you know exactly what you will do -- |
| why would you want to do it? |
| (Picasso) |
\-----------------------------------------------------------------------/
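For anyone reproducing this today: Python 2's urllib.urlopen quietly returns the body of the error page on a 4xx response, which is why the script above prints HTML instead of raising an exception. In Python 3, urllib.request.urlopen raises HTTPError instead. A minimal sketch of the equivalent Python 3 fetch (the helper name fetch is just for illustration):

```python
# Python 3 equivalent of the script above. urllib.request.urlopen raises
# HTTPError on 4xx/5xx responses instead of silently returning the error
# page body the way Python 2's urllib.urlopen did.
import urllib.error
import urllib.request

def fetch(url):
    try:
        with urllib.request.urlopen(url) as f:
            print(f.info())   # MIME headers, like file.info() above
            return f.read()
    except urllib.error.HTTPError as e:
        print("HTTP error:", e.code)
        return None
```

With the Wikimedia URL above, this would report the HTTP status of the "access denied" response rather than dumping its HTML body.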
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:
[...]
However, when I execute it, I get an html error ("access denied").
[...] it must have something to do with the URL because urllib works fine
with any other URL I have tried ...
The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Try the URI
'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological'.
Perhaps urllib is stricter than your browsers (which are known to accept
almost anything you feed into them, sometimes with very confusing results)
and gets confused when it tries to parse the malformed URI.
--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://pink.odahoda.de/
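For what it's worth, percent-quoting a path can be done with urllib itself: urllib.quote in Python 2, urllib.parse.quote in Python 3. A quick sketch of the Python 3 form applied to the path in question:

```python
from urllib.parse import quote  # urllib.quote in Python 2

path = "/wiki/Commons:Featured_pictures/chronological"
# quote() percent-encodes reserved characters; '/' is in the default
# "safe" set, so path separators stay intact while ':' becomes %3A.
print(quote(path))
```

Note that quote() encodes ':' even though, as it turns out below, ':' is actually legal in a path segment; servers generally accept both forms.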
Benjamin Niemann wrote:
[...]
The ':' in '..Commons:Feat..' is not a legal character in this part of the
URI and has to be %-quoted as '%3a'.
Oops, I was wrong... ':' *is* allowed in path segments. I should eat
something, my vision starts to get blurry...
> Try the URI
> 'http://commons.wikimedia.org/wiki/Commons%3aFeatured_pictures/chronological',
You may try this anyway...
--
Benjamin Niemann
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:
[...]
However, when I execute it, I get an html error ("access denied").
[...]
I think the problem might be with the Wikimedia Commons website itself,
rather than with urllib. Wikipedia has a policy against unapproved bots: http://en.wikipedia.org/wiki/Wikipedia:Bots
It might be that Wikimedia Commons blocks bots that aren't approved, and
might consider your program a bot. I've had a similar error message from
www.wikipedia.org and had no problems with a couple of other websites
I've tried. Also, the html the program returns seems to be a standard
"ACCESS DENIED" page.
It might be worth asking at the Wikimedia Commons website, at least to
eliminate this possibility.
John Hicken
Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:
[...]
However, when I execute it, I get an html error ("access denied").
[...]
It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:

import urllib2

headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                         'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4')
request = urllib2.Request(url, headers=headers)
file = urllib2.urlopen(request)
....

(Note that headers has to be passed by keyword: the second positional
parameter of urllib2.Request is the POST data, not the headers.)
That (or code very like it) worked when I tried it.
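In Python 3, urllib2 was merged into urllib.request, and the same trick looks like this (a sketch along the same lines, not a verbatim port):

```python
import urllib.request

url = "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; "
                         "rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4"}
# As with urllib2, headers is not the second positional parameter of
# Request (data is), so pass it by keyword.
request = urllib.request.Request(url, headers=headers)
# response = urllib.request.urlopen(request)  # the actual network fetch
print(request.get_header("User-agent"))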
Duncan Booth <du**********@invalid.invalid> writes: Gabriel Zachmann wrote:
Here is a very simple Python script utilizing urllib:
[...] "http://commons.wikimedia.org/wiki/Commons:Featured_pictures/chronological" [...]
However, when I execute it, I get an html error ("access denied").
[...]
It looks like wikipedia checks the User-Agent header and refuses to send
pages to browsers it doesn't like. Try:
[...]
If wikipedia is trying to discourage this kind of scraping, it's
probably not polite to do it. (I don't know what wikipedia's policies
are, though)
John
John J. Lee wrote: It looks like wikipedia checks the User-Agent header and refuses to send pages to browsers it doesn't like. Try: [...]
If wikipedia is trying to discourage this kind of scraping, it's probably not polite to do it. (I don't know what wikipedia's policies are, though)
They have a general policy against unapproved bots, which is
understandable since badly behaved bots could mess up or delete pages.
If you read the policy, it is aimed at bots which modify wikipedia
articles automatically. http://en.wikipedia.org/wiki/Wikipedia:Bots says:

    This policy in a nutshell: Programs that update pages automatically
    in a useful and harmless way may be welcome if their owners seek
    approval first and go to great lengths to stop them running amok or
    being a drain on resources.

On the other hand, something which is simply retrieving one or two fixed
pages doesn't fit that definition of a bot, so it is probably alright. They
even provide a link to some frameworks for writing bots, e.g. http://sourceforge.net/projects/pywikipediabot/
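For anyone writing such a retrieval script, a polite option is to check the site's robots.txt first; the standard library ships urllib.robotparser for this (the module is called robotparser in Python 2). A minimal sketch against an inline robots.txt, with hypothetical agent name and paths:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally you would point it at the live file:
#   rp.set_url("http://commons.wikimedia.org/robots.txt"); rp.read()
# Here a small robots.txt is parsed inline to keep the example offline.
rp.parse([
    "User-agent: *",
    "Disallow: /w/",
])
print(rp.can_fetch("MyScript/1.0", "http://example.org/wiki/Some_page"))  # True
print(rp.can_fetch("MyScript/1.0", "http://example.org/w/index.php"))     # False
```

can_fetch() answers whether the named agent may retrieve the given URL under the rules parsed so far.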
> On the other hand something which is simply retrieving one or two fixed
> pages doesn't fit that definition of a bot so is probably alright.
I think so, too.
> They even provide a link to some frameworks for writing bots e.g.
> http://sourceforge.net/projects/pywikipediabot/
ah, that looks nice ..
Best regards,
Gabriel.
> headers = {}
> headers['User-Agent'] = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB;
> rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'
> request = urllib2.Request(url, headers=headers)
> file = urllib2.urlopen(request)
ah, thanks a lot, that works!
Best regards,
Gabriel.