Downloading PDF copies of Wikipedia pages?

AES
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)

This almost always works remarkably well, but when I tried it recently
with a Wikipedia article

<http://en.wikipedia.org/wiki/Albrecht_Durer>
and also
<http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>

I got an instant error message saying just "General Error".

But if I use the "Save Page as Complete Web Site" command in Netscape
7.2 to capture the same web site, I can then use the same Acrobat
command to make a PDF copy of the downloaded web site on my hard disk,
without any error problems.

I'm just curious as to what's going on here: Why doesn't my usual approach
work? Does Wikipedia have special commands in its HTML to block Acrobat
copying? (And have I found a way to bypass it?) Or is Acrobat being
specially fussy about this particular site? Or...?
May 13 '06 #1
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And have I found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or...?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.
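
If you want to test that guess from a command line, something like this
would do it (a rough sketch; both User-Agent strings below are just
placeholders, not the exact strings Netscape or Acrobat really send):

# Fetch the same article twice with different User-Agent strings, then
# compare the HTTP responses and the saved files.
wget --server-response -O as-netscape.html \
     --user-agent='Mozilla/5.0 (Macintosh; pretend-Netscape)' \
     'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
wget --server-response -O as-acrobat.html \
     --user-agent='pretend-Acrobat-WebCapture' \
     'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
diff as-netscape.html as-acrobat.html | head

If the server really is sniffing the User-Agent, the status codes or the
two saved files should differ.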

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 13 '06 #2
ph**************@ipal.net wrote:
| In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:
|
| | I'm just curious as to what's going on here: Why doesn't my usual approach
| | work? Does Wikipedia have special commands in its HTML to block Acrobat
| | copying? (And have I found a way to bypass it?) Or is Acrobat being
| | specially fussy about this particular site? Or...?
|
| Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
| not handling Acrobat well. If Acrobat has a way to set what User-Agent it
| uses, you may have success making it pretend to be Netscape completely.


HTMLDOC chokes on those links also.
May 13 '06 #3
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
May 13 '06 #4
Stan Brown wrote:
| Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
| | I fairly often make PDF copies of web pages or sites by copying the web
| | page link from the web page itself and pasting it into the Acrobat 7.0
| | Standard "Create PDF From Web Page" command.
| |
| | (Not trying to steal material; usually just want to make a temporary
| | copy to read offline, e.g. on a plane flight.)
|
| So why not do "Save As complete web page" -- then you should see the
| same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.
May 13 '06 #5
AES
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
| Stan Brown wrote:
| | Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
| | | I fairly often make PDF copies of web pages or sites by copying the web
| | | page link from the web page itself and pasting it into the Acrobat 7.0
| | | Standard "Create PDF From Web Page" command.
| | |
| | | (Not trying to steal material; usually just want to make a temporary
| | | copy to read offline, e.g. on a plane flight.)
| |
| | So why not do "Save As complete web page" -- then you should see the
| | same stuff in your browser that you would see live.
|
| And could, further, navigate to that saved page, using Acrobat, to turn
| it into a PDF.


Perhaps there's misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.
May 14 '06 #6
+ AES <si*****@stanford.edu>:

| Perhaps there's misunderstanding here (or I didn't post clearly).

No, Phil Howard explained it, but maybe his explanation was too brief.

| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat [...]
| But it doesn't seem to want to work with the Wikipedia page I noted.
|
| "Save Page As Complete Web Page" using Netscape _does_ work fine
| with that web page, however, producing an index page and a folder of
| files on my HD, which then _can_ be grabbed into a PDF file by
| Acrobat.
|
| I'm just curious as to why the first approach fails with that web
| page, while the second approach still works.

Because the web server at wikipedia adapts its output to whatever user
agent is at the other end. Netscape and Acrobat are not the same user
agent, therefore what they get from the web server may be different.

This is at least a likely explanation.

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
May 14 '06 #7
AES wrote:
| In article <12*************@news.supernews.com>,
| Dick Margulis <ma*******@comcast.net> wrote:
| | Stan Brown wrote:
| | | Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
| | | | I fairly often make PDF copies of web pages or sites by copying the web
| | | | page link from the web page itself and pasting it into the Acrobat 7.0
| | | | Standard "Create PDF From Web Page" command.
| | | |
| | | | (Not trying to steal material; usually just want to make a temporary
| | | | copy to read offline, e.g. on a plane flight.)
| | | So why not do "Save As complete web page" -- then you should see the
| | | same stuff in your browser that you would see live.
| |
| | And could, further, navigate to that saved page, using Acrobat, to turn
| | it into a PDF.
|
| Perhaps there's misunderstanding here (or I didn't post clearly).
|
| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat because
| --I use PDFs and Acrobat all the time;
| --The result is a single PDF file; and
| --The Acrobat grab is a single step and usually works well.
| But it doesn't seem to want to work with the Wikipedia page I noted.


I understood your question the first time. What we've proposed is an
ugly workaround, which you were apparently capable of figuring out
yourself. No harm, no foul.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.


There are times when it doesn't pay to be curious. That is, if the
answer is unknowable (because you are not privy to the code running on
wikipedia's servers) or if knowing the answer won't solve the problem
(because you are privy to their code but you can't change it), then
asking why is futile.

I generally try to persuade people that What and How--as in What is
happening? What can I do to change what is happening? How does this
work? How do I use it? How can I work around it?--are more useful
questions than Why, where software is concerned. Why questions are
mostly unanswerable.
May 14 '06 #8
AES wrote:
| Perhaps there's misunderstanding here (or I didn't post clearly).
There always is. Communication fails, except by accident (Wiio's law).
By posting to three groups, you reduce the probability of exceptions.
Followups now trimmed.
| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat because
| --I use PDFs and Acrobat all the time;
| --The result is a single PDF file; and
| --The Acrobat grab is a single step and usually works well.
Then you apparently underestimate the power of the Force, oops I mean
HTML. You view the pages on your browser, so why would you view them
differently and in a clumsier program offline? Or didn't you simply know
how to save a page with its images and all? I expect so in my
Followup-To header. Depending on the browser and its version, such a
save operation can be invoked in various ways, including keyboard
shortcuts. There are also various special features for "offline
browsing" in browsers. For more massive operations, a separate program
such as httrack might be more suitable, since it can grab an entire
_site_ when desired.
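
If a command-line tool is acceptable, wget can do the single-page
equivalent of the browser's "save the page with its images" operation
(a sketch only, not tested against Wikipedia; wget is a different tool
from the httrack mentioned above):

# Fetch the article plus the images and stylesheets it needs, and rewrite
# the links so the saved copy works offline.
wget --page-requisites --convert-links --no-directories \
     --directory-prefix=durer_offline \
     'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'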
| But it doesn't seem to want to work with the Wikipedia page I noted.
My guess is that it chokes on non-ASCII characters in the URL. There are
basically two ways to represent non-ASCII characters in a URL, and
here it does not really matter which one is right; what matters here is
that Acrobat might fail to work properly with the way used on the
wikipedia page. (BTW, have you really got problems with wikipedia pages
in general, as the Subject line says, or with the particular page you
mentioned?)
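
Roughly, the two forms look like this (only a sketch of the distinction;
whether this is what Acrobat stumbles on is a guess):

# A non-ASCII letter like the u-umlaut can reach the server either as a raw
# (unencoded) character or percent-encoded, and the percent-encoding itself
# can be built from different byte sequences, e.g.
#   UTF-8 bytes 0xC3 0xBC  ->  /wiki/Albrecht_D%C3%BCrer
#   Latin-1 byte 0xFC      ->  /wiki/Albrecht_D%FCrer
# Checking how the server answers each form, without downloading anything:
wget --spider --server-response 'http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer'
wget --spider --server-response 'http://en.wikipedia.org/wiki/Albrecht_D%FCrer'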
"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.


Apparently the version of Netscape you're using can deal with non-ASCII
characters in URLs as represented in wikipedia.
May 14 '06 #9
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)
|
| This almost always works remarkably well, but when I tried it recently
| with a Wikipedia article
|
| <http://en.wikipedia.org/wiki/Albrecht_Durer>
| and also
| <http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>
|
| I got an instant error message saying just "General Error".
|
| But if I use the "Save Page as Complete Web Site" command in Netscape
| 7.2 to capture the same web site, I can then use the same Acrobat
| command to make a PDF copy of the downloaded web site on my hard disk,
| without any error problems.
|
| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And have I found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or...?

After getting the exact User-Agent string through a test access of a page
on my own server, I did a fetch from Wikipedia using that exact string and
found they were refusing access.

==============================================================================
phil@canopus:/home/phil/user-agent-test 114> wget --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' --timestamping --no-directories 'http://en.wikipedia.org/wiki/User_agent'
--10:52:15-- http://en.wikipedia.org/wiki/User_agent
=> `User_agent'
Resolving en.wikipedia.org... 207.142.131.246, 207.142.131.247, 207.142.131.248, ...
Connecting to en.wikipedia.org|207.142.131.246|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
10:52:16 ERROR 403: Forbidden.

==============================================================================

I can understand some of this. Wikipedia has frequently been hit by some
nasty spidering programs. I ran into this myself several years ago when
the issue was lack of a User-Agent header. There was some discussion on
this somewhere on the site (I don't recall where right now).

The resolution will obviously be to alter the User-Agent somehow. And be
sure it is something that doesn't mimic anything that has a history of
abusing Wikipedia, or looks like it might.
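
For instance, repeating the fetch above with a distinct, descriptive string
in place of the WebCapture one (the string below is only an example; use
something that identifies your own use):

wget --user-agent='PersonalOfflineCopy/1.0 (occasional manual fetches for offline reading)' \
     --timestamping --no-directories \
     'http://en.wikipedia.org/wiki/User_agent'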

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 15 '06 #10
