Bytes IT Community

get wikipedia source failed (urllib2)

Hi,
I'm trying to get the Wikipedia page source with urllib2:

    usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
    data = usock.read()
    usock.close()
    return data

I got an exception because of an HTTP 403 error. Why? With my browser I can
access the page without any problem.

Thanks,
Shahar.

Aug 7 '07 #1
3 Replies


On Aug 7, 11:54, shaha...@gmail.com wrote:
> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
>
> I got an exception because of an HTTP 403 error. Why? With my browser I can
> access the page without any problem.
>
> Thanks,
> Shahar.
This code works fine for other sites; the problem is only with Wikipedia. Does
anyone know of a solution to this problem?

Aug 7 '07 #2

<sh******@gmail.com> wrote:
> This code works fine for other sites; the problem is only with Wikipedia.
> Does anyone know of a solution to this problem?

Wikipedia, AFAIK, bans requests that lack a User-Agent header.
http://www.voidspace.org.uk/python/a....shtml#headers
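For reference, a minimal sketch of the same idea in current Python, where urllib2's functionality lives in urllib.request; the "Mozilla/5.0" string is only an illustrative browser-style value, and the actual fetch is left commented out so the snippet does not depend on network access:

```python
import urllib.request

# Wikipedia rejects the default Python user-agent string with HTTP 403,
# so attach a browser-style User-Agent header to the request explicitly.
url = "http://en.wikipedia.org/wiki/Albert_Einstein"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# The header is stored on the Request object before anything is sent
# (urllib.request capitalizes header names as "User-agent" internally).
print(req.get_header("User-agent"))

# Performing the fetch would then be:
#   with urllib.request.urlopen(req) as resp:
#       data = resp.read()
```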

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair
Aug 7 '07 #3

In article <11**********************@o61g2000hsh.googlegroups.com>,
 sh******@gmail.com wrote:

> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
>
> I got an exception because of an HTTP 403 error. Why? With my browser I can
> access the page without any problem.
>
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:

    import urllib

    class CustomURLopener(urllib.FancyURLopener):
        version = 'Mozilla/5.0'

    urllib._urlopener = CustomURLopener()

    u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
    data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates the request came from
somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
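A sketch of the Referer case just described, using Python 3's urllib.request (urllib2's successor) with purely illustrative header values; it only shows how both headers are attached to a request, without performing the fetch:

```python
import urllib.request

url = "http://en.wikipedia.org/wiki/Albert_Einstein"

# Supply both headers a site might check before serving a page:
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0",            # browser-style agent string
    "Referer": "http://en.wikipedia.org/",  # claim an on-site origin
})

# Both headers now travel with the request
# (urllib.request stores header names capitalized, e.g. "Referer"):
print(req.get_header("Referer"))
```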

Cheers,
-M

--
Michael J. Fromberger | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA
Aug 7 '07 #4

This discussion thread is closed; replies have been disabled.