
Downloading PDF copies of Wikipedia pages?

AES
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)

This almost always works remarkably well, but when I tried it recently
with a Wikipedia article

<http://en.wikipedia.org/wiki/Albrecht_Durer>
and also
<http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>

I got an instant error message saying just "General Error".

But if I use the "Save Page as Complete Web Site" command in Netscape
7.2 to capture the same web site, I can then use the same Acrobat
command to make a PDF copy of the downloaded web site on my hard disk,
without any error problems.

I'm just curious as to what's going on here: Why doesn't my usual approach
work? Does Wikipedia have special commands in its HTML to block Acrobat
copying? (And I've found a way to bypass it?) Or is Acrobat being
specially fussy about this particular site? Or. . .?
May 13 '06 #1
9 Replies


In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.
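
A quick way to test this hypothesis from the command line, assuming curl is
available (the two User-Agent strings below are only illustrative stand-ins,
not necessarily what Acrobat or Netscape really send):

  # Fetch the same URL twice, reporting only the HTTP status code:
  # once with a browser-like User-Agent, once with an Acrobat-like one.
  curl -s -o /dev/null -w '%{http_code}\n' \
       -A 'Mozilla/5.0 (Macintosh; Netscape 7.2)' \
       'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
  curl -s -o /dev/null -w '%{http_code}\n' \
       -A 'Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' \
       'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'

If the two runs print different status codes (say, 200 versus 403),
User-Agent sniffing is the likely culprit.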

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 13 '06 #2

ph**************@ipal.net wrote:
> In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:
>
> | I'm just curious as to what's going on here: Why doesn't my usual approach
> | work? Does Wikipedia have special commands in its HTML to block Acrobat
> | copying? (And I've found a way to bypass it?) Or is Acrobat being
> | specially fussy about this particular site? Or. . .?
>
> Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
> not handling Acrobat well. If Acrobat has a way to set what User-Agent it
> uses, you may have success making it pretend to be Netscape completely.


HTMLDOC chokes on those links also.
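
For reference, a minimal HTMLDOC invocation for a single page looks something
like this (a sketch; the output filename is arbitrary, and this is presumably
the kind of command that fails here):

  # Convert one web page straight to PDF with HTMLDOC.
  htmldoc --webpage -f durer.pdf 'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'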
May 13 '06 #3

Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
> I fairly often make PDF copies of web pages or sites by copying the web
> page link from the web page itself and pasting it into the Acrobat 7.0
> Standard "Create PDF From Web Page" command.
>
> (Not trying to steal material; usually just want to make a temporary
> copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
May 13 '06 #4

Stan Brown wrote:
> Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
>> I fairly often make PDF copies of web pages or sites by copying the web
>> page link from the web page itself and pasting it into the Acrobat 7.0
>> Standard "Create PDF From Web Page" command.
>>
>> (Not trying to steal material; usually just want to make a temporary
>> copy to read offline, e.g. on a plane flight.)
>
> So why not do "Save As complete web page" -- then you should see the
> same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.
May 13 '06 #5

AES
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
> Stan Brown wrote:
>> Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
>>> I fairly often make PDF copies of web pages or sites by copying the web
>>> page link from the web page itself and pasting it into the Acrobat 7.0
>>> Standard "Create PDF From Web Page" command.
>>>
>>> (Not trying to steal material; usually just want to make a temporary
>>> copy to read offline, e.g. on a plane flight.)
>>
>> So why not do "Save As complete web page" -- then you should see the
>> same stuff in your browser that you would see live.
>
> And could, further, navigate to that saved page, using Acrobat, to turn
> it into a PDF.


Perhaps there's a misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.
May 14 '06 #6

+ AES <si*****@stanford.edu>:

| Perhaps there's a misunderstanding here (or I didn't post clearly).

No, Phil Howard explained it, but maybe his explanation was too brief.

| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat [...]
| But it doesn't seem to want to work with the Wikipedia page I noted.
|
| "Save Page As Complete Web Page" using Netscape _does_ work fine
| with that web page, however, producing an index page and a folder of
| files on my HD, which then _can_ be grabbed into a PDF file by
| Acrobat.
|
| I'm just curious as to why the first approach fails with that web
| page, while the second approach still works.

Because the web server at wikipedia adapts its output to whatever user
agent is at the other end. Netscape and Acrobat are not the same user
agent, therefore what they get from the web server may be different.

This is at least a likely explanation.
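
A sketch of how to observe that difference directly, assuming curl is
installed (the two User-Agent strings are arbitrary examples):

  # Request the same page as two different user agents and compare what
  # the server sends back (headers only, via HEAD requests).
  for ua in 'Mozilla/5.0 (Macintosh)' 'SomeOtherAgent/1.0'; do
      echo "== $ua =="
      curl -s -I -A "$ua" 'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
  done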

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
May 14 '06 #7

AES wrote:
> In article <12*************@news.supernews.com>,
> Dick Margulis <ma*******@comcast.net> wrote:
>> Stan Brown wrote:
>>> Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
>>>> I fairly often make PDF copies of web pages or sites by copying the web
>>>> page link from the web page itself and pasting it into the Acrobat 7.0
>>>> Standard "Create PDF From Web Page" command.
>>>>
>>>> (Not trying to steal material; usually just want to make a temporary
>>>> copy to read offline, e.g. on a plane flight.)
>>> So why not do "Save As complete web page" -- then you should see the
>>> same stuff in your browser that you would see live.
>>
>> And could, further, navigate to that saved page, using Acrobat, to turn
>> it into a PDF.
>
> Perhaps there's a misunderstanding here (or I didn't post clearly).
>
> I'd prefer to grab the Wikipedia article as a PDF direct from the web
> page using Acrobat because
> --I use PDFs and Acrobat all the time;
> --The result is a single PDF file; and
> --The Acrobat grab is a single step and usually works well.
> But it doesn't seem to want to work with the Wikipedia page I noted.


I understood your question the first time. What we've proposed is an
ugly workaround, which you were apparently capable of figuring out
yourself. No harm, no foul.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.


There are times when it doesn't pay to be curious. That is, if the
answer is unknowable (because you are not privy to the code running on
wikipedia's servers) or if knowing the answer won't solve the problem
(because you are privy to their code but you can't change it), then
asking why is futile.

I generally try to persuade people that What and How--as in What is
happening? What can I do to change what is happening? How does this
work? How do I use it? How can I work around it?--are more useful
questions than Why, where software is concerned. Why questions are
mostly unanswerable.
May 14 '06 #8

AES wrote:
> Perhaps there's a misunderstanding here (or I didn't post clearly).
There always is. Communication fails, except by accident (Wiio's law).
By posting to three groups, you reduce the probability of exceptions.
Followups now trimmed.
> I'd prefer to grab the Wikipedia article as a PDF direct from the web
> page using Acrobat because
> --I use PDFs and Acrobat all the time;
> --The result is a single PDF file; and
> --The Acrobat grab is a single step and usually works well.
Then you apparently underestimate the power of the Force, oops I mean
HTML. You view the pages in your browser, so why would you view them
differently and in a clumsier program offline? Or didn't you simply know
how to save a page with its images and all? I expect so in my
Followup-To header. Depending on the browser and its version, such a
save operation can be invoked in various ways, including keyboard
shortcuts. There are also various special features for "offline
browsing" in browsers. For more massive operations, a separate program
such as httrack might be more suitable, since it can grab an entire
_site_ when desired.
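
For the single-page case, a command-line equivalent of that kind of save
might look like this (a sketch built on standard wget options; the target
directory name is arbitrary):

  # Save one page plus the images/CSS it needs, rewriting links so the
  # local copy opens cleanly in a browser; --span-hosts lets wget fetch
  # requisites (e.g. images) served from a different host.
  wget --page-requisites --span-hosts --convert-links \
       --no-directories --directory-prefix=durer_offline \
       'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
  # For an entire site, a dedicated mirroring tool such as httrack
  # (e.g. httrack 'http://example.org/' -O ./mirror) may be more suitable.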
> But it doesn't seem to want to work with the Wikipedia page I noted.
My guess is that it chokes on non-ASCII characters in the URL. There are
basically two ways to represent non-ASCII characters in a URL, and
here it does not really matter which one is right; what matters here is
that Acrobat might fail to work properly with the way used on the
wikipedia page. (BTW, have you really got problems with wikipedia pages
in general, as the Subject line says, or with the particular page you
mentioned?)
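
One way to check whether the two spellings of the address behave differently,
assuming curl (here %C3%BC is simply the letter "ü" percent-encoded as UTF-8):

  # Request the article under both forms of its title and report only the
  # HTTP status code; a client that mishandles the encoding would stumble
  # on requests like the second one.
  curl -s -o /dev/null -w '%{http_code}\n' 'http://en.wikipedia.org/wiki/Albrecht_Durer'
  curl -s -o /dev/null -w '%{http_code}\n' 'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'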
"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.


Apparently the version of Netscape you're using can deal with non-ASCII
characters in URLs as represented in wikipedia.
May 14 '06 #9

In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)
|
| This almost always works remarkably well, but when I tried it recently
| with a Wikipedia article
|
| <http://en.wikipedia.org/wiki/Albrecht_Durer>
| and also
| <http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>
|
| I got an instant error message saying just "General Error".
|
| But if I use the "Save Page as Complete Web Site" command in Netscape
| 7.2 to capture the same web site, I can then use the same Acrobat
| command to make a PDF copy of the downloaded web site on my hard disk,
| without any error problems.
|
| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

After getting the exact User-Agent string through a test access of a page
on my own server, I did a fetch from Wikipedia using that exact string and
found they were refusing access.

=============================================================================
phil@canopus:/home/phil/user-agent-test 114> wget --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' --timestamping --no-directories 'http://en.wikipedia.org/wiki/User_agent'
--10:52:15-- http://en.wikipedia.org/wiki/User_agent
=> `User_agent'
Resolving en.wikipedia.org... 207.142.131.246, 207.142.131.247, 207.142.131.248, ...
Connecting to en.wikipedia.org|207.142.131.246|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
10:52:16 ERROR 403: Forbidden.

=============================================================================

I can understand some of this. Wikipedia has frequently been hit by some
nasty spidering programs. I ran into this myself several years ago when
the issue was lack of a User-Agent header. There was some discussion on
this somewhere on the site (I don't recall where right now).

The resolution will obviously be to alter the User-Agent somehow. And be
sure it is something that doesn't mimic anything that has a history of
abusing Wikipedia, or looks like it might.
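
As a sanity check on that resolution, the same fetch can be repeated with a
plain, honestly descriptive User-Agent (a sketch; the string below is made up
for illustration):

  # Same request as above, but without the WebCapture User-Agent string.
  wget --user-agent='OfflineReadingTest/1.0 (personal use)' \
       --timestamping --no-directories \
       'http://en.wikipedia.org/wiki/User_agent'
  # If the 403 was tied to the WebCapture string, this request should come
  # back 200 OK and save the page as ./User_agent.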

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 15 '06 #10
