In article <11**********************@o61g2000hsh.googlegroups.com>,
 sh******@gmail.com wrote:
> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
>
> I get an exception because of an HTTP 403 error. Why? With my browser I can
> access it without any problem.
>
> Thanks,
> Shahar.
It appears that Wikipedia inspects the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:
import urllib

class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()
I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates the request came from
somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
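For what it's worth, the urllib2 version might look like the sketch below. It
sets both headers discussed above on a Request object; I haven't actually run
this against Wikipedia, and the header values and the Referer URL are just
placeholder guesses. (On newer Pythons the same classes live in
urllib.request, which the try/except covers.)

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # same classes in Python 3

# A browser-like User-Agent, to get past servers that reject
# urllib's default string. The exact value is arbitrary.
req = urllib2.Request(
    'http://en.wikipedia.org/wiki/Albert_Einstein',
    headers={'User-Agent': 'Mozilla/5.0'})

# For sites that also check the Referer field, a plausible value
# can be attached the same way (placeholder URL here).
req.add_header('Referer', 'http://en.wikipedia.org/')

# The actual fetch would then be:
# data = urllib2.urlopen(req).read()
```

Constructing the Request separately, rather than passing the URL straight to
urlopen, is what makes room for the extra headers.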
Cheers,
-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA