In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:
| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)
|
| This almost always works remarkably well, but when I tried it recently
| with a Wikipedia article
|
| <http://en.wikipedia.org/wiki/Albrecht_Durer>
| and also
| <http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>
|
| I got an instant error message saying just "General Error".
|
| But if I use the "Save Page as Complete Web Site" command in Netscape
| 7.2 to capture the same web site, I can then use the same Acrobat
| command to make a PDF copy of the downloaded web site on my hard disk,
| without any error problems.
|
| I'm just curious as to what's going on here: Why doesn't my usual
| approach work? Does Wikipedia have special commands in its HTML to block
| Acrobat copying? (And I've found a way to bypass it?) Or is Acrobat
| being specially fussy about this particular site? Or...?
After capturing the exact User-Agent string Acrobat sends (via a test
access of a page on my own server), I did a fetch from Wikipedia using that
same string and found they were refusing access:
=============================================================================
phil@canopus:/home/phil/user-agent-test 114> wget --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' --timestamping --no-directories 'http://en.wikipedia.org/wiki/User_agent'
--10:52:15--  http://en.wikipedia.org/wiki/User_agent
           => `User_agent'
Resolving en.wikipedia.org... 207.142.131.246, 207.142.131.247, 207.142.131.248, ...
Connecting to en.wikipedia.org|207.142.131.246|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
10:52:16 ERROR 403: Forbidden.
=============================================================================
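For reference, the first step above -- discovering the exact User-Agent
string by fetching a test page from a server you control -- can be sketched
roughly like this (a minimal Python 3 stdlib example; the port number and
page body are arbitrary choices, not anything Acrobat requires):

```python
# Minimal User-Agent logger: serve one trivial page and print the
# User-Agent header of every client that fetches it.  Point the tool
# under test (here, Acrobat's "Create PDF From Web Page") at
# http://yourhost:8000/ and read the string off the console.
from http.server import BaseHTTPRequestHandler, HTTPServer

class UALogger(BaseHTTPRequestHandler):
    def do_GET(self):
        # The header we are after; "(none)" if the client omits it.
        print('User-Agent:', self.headers.get('User-Agent', '(none)'))
        body = b'<html><body>User-Agent test page</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    # Serve until interrupted; each hit logs one User-Agent line.
    HTTPServer(('', port), UALogger).serve_forever()
```

Running `serve()` and capturing one page with Acrobat is what produced the
`Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)` string used in the
wget test above.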
I can understand some of this. Wikipedia has frequently been hit by some
nasty spidering programs. I ran into this myself several years ago when
the issue was lack of a User-Agent header. There was some discussion on
this somewhere on the site (I don't recall where right now).
The resolution will obviously be to alter the User-Agent somehow, making
sure the new string doesn't mimic anything that has a history of abusing
Wikipedia, or that looks like it might.
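In wget that's just the --user-agent option shown above; from a script, the
same thing looks roughly like this (a sketch using Python 3's urllib; the
"OfflineReader/1.0" string is a made-up example, not a recommendation):

```python
# Fetch a URL while sending a chosen User-Agent header -- the same knob
# wget's --user-agent option turns.  Pick an honest, distinctive string
# that doesn't imitate any client with a history of abusing the site.
import urllib.request

def fetch(url, user_agent):
    # Attach the chosen User-Agent to the request before opening it.
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For example: fetch('http://en.wikipedia.org/wiki/User_agent',
'OfflineReader/1.0 (personal offline copies)').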
--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN       | http://linuxhomepage.com/  http://ham.org/     |
| (first name) at ipal.net | http://phil.ipal.org/  http://ka9wgn.ham.org/  |
-----------------------------------------------------------------------------