
get wikipedia source failed (urllib2)

Hi,
I'm trying to get a Wikipedia page's source with urllib2:
    usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
    data = usock.read()
    usock.close()
    return data
I get an exception because of an HTTP 403 error. Why? With my browser I can
access the page without any problem.

Thanks,
Shahar.

Aug 7 '07 #1
This code works fine for other sites; the problem is only with Wikipedia.
Does anyone know a solution to this problem?

Aug 7 '07 #2
<sh******@gmail.com> wrote:
This code works fine for other sites; the problem is only with Wikipedia.
Does anyone know a solution to this problem?
Wikipedia, AFAIK, bans requests without a User Agent.
http://www.voidspace.org.uk/python/a....shtml#headers
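
For what it's worth, here is a minimal sketch of sending an explicit User-Agent with urllib2. The agent string and the helper names build_request and fetch are made up for illustration, and whether Wikipedia accepts any particular string is an assumption, not something verified here. The import fallback is only there so the same code also runs on Python 3, where urllib2 was renamed urllib.request.

```python
# Python 2's urllib2 became urllib.request in Python 3; this import covers both.
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

# Hypothetical identifying string; substitute your own tool name and contact.
USER_AGENT = "ExampleFetcher/0.1 (contact: you@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body; an HTTP error status raises HTTPError."""
    response = urllib2.urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch("http://en.wikipedia.org/wiki/Albert_Einstein") would then send the custom header instead of the default Python-urllib one.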

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair
Aug 7 '07 #3
It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:

import urllib

# FancyURLopener sends its `version` attribute as the User-Agent header.
class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

# urllib.urlopen() uses this module-level opener once it is set.
urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates coming from
somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
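
As a rough sketch of what the urllib2 version of that trick might look like: an opener built with build_opener() keeps a list of default headers in its addheaders attribute, so replacing that list changes the User-Agent and can add a Referer for every request made through it. The helper name make_opener and the header values below are hypothetical, and nothing here has been checked against a particular site's filtering. The import fallback just lets the code run on Python 3 as well.

```python
# Python 2's urllib2 became urllib.request in Python 3; this import covers both.
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2


def make_opener(user_agent, referer=None):
    """Build an opener that sends the given headers on every request."""
    opener = urllib2.build_opener()
    headers = [("User-Agent", user_agent)]
    if referer is not None:
        # Only useful for sites that check where a request claims to come from.
        headers.append(("Referer", referer))
    # Replaces the opener's default ("User-agent", "Python-urllib/x.y") entry.
    opener.addheaders = headers
    return opener
```

Usage would be along the lines of make_opener("Mozilla/5.0").open(url).read(); passing referer is only needed for the redirect case described above.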

Cheers,
-M

-- 
Michael J. Fromberger | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA
Aug 7 '07 #4
