
Downloading PDF copies of Wikipedia pages?

AES
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)

This almost always works remarkably well, but when I tried it recently
with a Wikipedia article

<http://en.wikipedia.org/wiki/Albrecht_Durer>
and also
<http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>

I got an instant error message saying just "General Error".

But if I use the "Save Page as Complete Web Site" command in Netscape
7.2 to capture the same web site, I can then use the same Acrobat
command to make a PDF copy of the downloaded web site on my hard disk,
without any error problems.

I'm just curious as to what's going on here: Why doesn't my usual approach
work? Does Wikipedia have special commands in its HTML to block Acrobat
copying? (And I've found a way to bypass it?) Or is Acrobat being
specially fussy about this particular site? Or. . .?
May 13 '06 #1
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.
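
One quick way to test that guess from a command line (just a sketch; it
assumes wget is installed, and both User-Agent strings below are
illustrative stand-ins for whatever Netscape 7.2 and Acrobat's WebCapture
actually send):

# Fetch the same article while pretending to be each client in turn.
wget -O as-netscape.html \
  --user-agent='Mozilla/5.0 (Macintosh; U; PPC Mac OS X; rv:1.7.2) Gecko/20040804 Netscape/7.2' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
wget -O as-acrobat.html \
  --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
# If the second fetch is refused (e.g. 403) while the first succeeds,
# the server is keying on the User-Agent header.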

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 13 '06 #2
ph**************@ipal.net wrote:
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.


HTMLDOC chokes on those links also.
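
For reference, the sort of invocation meant here (a sketch; this assumes
HTMLDOC's usual command-line form and is not verified against that exact
version):

# Hypothetical HTMLDOC run of the kind that fails on those URLs:
htmldoc --webpage -f durer.pdf 'http://en.wikipedia.org/wiki/Albrecht_Durer'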
May 13 '06 #3
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
May 13 '06 #4
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.
May 13 '06 #5
AES
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.


Perhaps there's a misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.
May 14 '06 #6
+ AES <si*****@stanford.edu>:

| Perhaps there's a misunderstanding here (or I didn't post clearly)

No, Phil Howard explained it, but maybe his explanation was too brief.

| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat [...]
| But it doesn't seem to want to work with the Wikipedia page I noted.
|
| "Save Page As Complete Web Page" using Netscape _does_ work fine
| with that web page, however, producing an index page and a folder of
| files on my HD, which then _can_ be grabbed into a PDF file by
| Acrobat.
|
| I'm just curious as to why the first approach fails with that web
| page, while the second approach still works.

Because the web server at wikipedia adapts its output to whatever user
agent is at the other end. Netscape and Acrobat are not the same user
agent, therefore what they get from the web server may be different.

This is at least a likely explanation.
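
A minimal way to check it (assuming curl is available; the User-Agent
values below are placeholders for what the two programs really send):

# Compare the status codes the server returns to two different User-Agents.
curl -s -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/5.0 (Macintosh) Netscape/7.2' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
curl -s -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
# Different codes (say 200 vs 403) mean the response depends on the client.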

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
May 14 '06 #7
AES wrote:
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)
So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.


Perhaps there's a misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.


I understood your question the first time. What we've proposed is an
ugly workaround, which you were apparently capable of figuring out
yourself. No harm, no foul.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.


There are times when it doesn't pay to be curious. That is, if the
answer is unknowable (because you are not privy to the code running on
wikipedia's servers) or if knowing the answer won't solve the problem
(because you are privy to their code but you can't change it), then
asking why is futile.

I generally try to persuade people that What and How--as in What is
happening? What can I do to change what is happening? How does this
work? How do I use it? How can I work around it?--are more useful
questions than Why, where software is concerned. Why questions are
mostly unanswerable.
May 14 '06 #8
AES wrote:
Perhaps there's a misunderstanding here (or I didn't post clearly).
There always is. Communication fails, except by accident (Wiio's law).
By posting to three groups, you reduce the probability of exceptions.
Followups now trimmed.
I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
Then you apparently underestimate the power of the Force, oops I mean
HTML. You view the pages on your browser, so why would you view them
differently and in a clumsier program offline? Or didn't you simply know
how to save a page with its images and all? I expect so in my
Followup-To header. Depending on the browser and its version, such a
save operation can be invoked in various ways, including keyboard
shortcuts. There are also various special features for "offline
browsing" in browsers. For more massive operations, a separate program
such as httrack might be more suitable, since it can grab an entire
_site_ when desired.
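
For example, a minimal httrack run of that kind might look like this (a
sketch only; check httrack's documentation for the exact options you want):

# Mirror one article, and the files it needs, into a local folder:
httrack 'http://en.wikipedia.org/wiki/Albrecht_Durer' -O ./durer-offline
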
But it doesn't seem to want to work with the Wikipedia page I noted.
My guess is that it chokes on non-ASCII characters in the URL. There are
basically two ways to represent non-ASCII characters in a URL, and
here it does not really matter which one is right; what matters here is
that Acrobat might fail to work properly with the way used on the
wikipedia page. (BTW, have you really got problems with wikipedia pages
in general, as the Subject line says, or with the particular page you
mentioned?)
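
To make the two representations concrete (a rough illustration; behaviour
will vary by client):

# The "ü" in Dürer can travel in a URL either as raw UTF-8 bytes or in
# percent-encoded form; Wikipedia's links use the percent-encoded form.
printf 'raw UTF-8:        Albrecht_D\303\274rer\n'
printf 'percent-encoded:  Albrecht_D%%C3%%BCrer\n'
# A client that mishandles one of these forms will fail on such URLs.
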
"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.


Apparently the version of Netscape you're using can deal with non-ASCII
characters in URLs as represented in wikipedia.
May 14 '06 #9
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)
|
| This almost always works remarkably well, but when I tried it recently
| with a Wikipedia article
|
| <http://en.wikipedia.org/wiki/Albrecht_Durer>
| and also
| <http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>
|
| I got an instant error message saying just "General Error".
|
| But if I use the "Save Page as Complete Web Site" command in Netscape
| 7.2 to capture the same web site, I can then use the same Acrobat
| command to make a PDF copy of the downloaded web site on my hard disk,
| without any error problems.
|
| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

After getting the exact User-Agent string through a test access of a page
on my own server, I did a fetch from Wikipedia using that exact string and
found they were refusing access.

===============================================================================
phil@canopus:/home/phil/user-agent-test 114> wget --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' --timestamping --no-directories 'http://en.wikipedia.org/wiki/User_agent'
--10:52:15-- http://en.wikipedia.org/wiki/User_agent
=> `User_agent'
Resolving en.wikipedia.org... 207.142.131.246, 207.142.131.247, 207.142.131.248, ...
Connecting to en.wikipedia.org|207.142.131.246|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
10:52:16 ERROR 403: Forbidden.

===============================================================================

I can understand some of this. Wikipedia has frequently been hit by some
nasty spidering programs. I ran into this myself several years ago when
the issue was lack of a User-Agent header. There was some discussion on
this somewhere on the site (I don't recall where right now).

The resolution will obviously be to alter the User-Agent somehow. And be
sure it is something that doesn't mimic anything that has a history of
abusing Wikipedia, or looks like it might.
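
In wget terms, that might look something like this (a sketch; the User-Agent
string here is made up, and whether Wikipedia accepts it is not verified):

# Retry the same fetch with a modest, descriptive User-Agent of your own:
wget --user-agent='PersonalOfflineReader/1.0 (one-off copy for offline reading)' \
  --timestamping --no-directories \
  'http://en.wikipedia.org/wiki/User_agent'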

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 15 '06 #10
