
Downloading PDF copies of Wikipedia pages?

AES
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)

This almost always works remarkably well, but when I tried it recently
with a Wikipedia article

<http://en.wikipedia.org/wiki/Albrecht_Durer>
and also
<http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>

I got an instant error message saying just "General Error".

But if I use the "Save Page as Complete Web Site" command in Netscape
7.2 to capture the same web site, I can then use the same Acrobat
command to make a PDF copy of the downloaded web site on my hard disk,
without any error problems.

I'm just curious as to what's going on here: Why doesn't my usual approach
work? Does Wikipedia have special commands in its HTML to block Acrobat
copying? (And I've found a way to bypass it?) Or is Acrobat being
specially fussy about this particular site? Or. . .?
May 13 '06 #1
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.
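
One quick way to test that guess from a command line (just a sketch; it
assumes wget is installed, and both User-Agent strings below are
illustrative stand-ins for whatever Netscape 7.2 and Acrobat's WebCapture
actually send):

# Fetch the same article while pretending to be each client in turn.
wget -O as-netscape.html \
  --user-agent='Mozilla/5.0 (Macintosh; U; PPC Mac OS X; rv:1.7.2) Gecko/20040804 Netscape/7.2' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
wget -O as-acrobat.html \
  --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
# If the second fetch is refused (e.g. 403) while the first succeeds,
# the server is keying on the User-Agent header.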

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 13 '06 #2
ph**************@ipal.net wrote:
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

Perhaps Wikipedia is producing tweaks based on User-Agent and it just is
not handling Acrobat well. If Acrobat has a way to set what User-Agent it
uses, you may have success making it pretend to be Netscape completely.


HTMLDOC chokes on those links also.
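
For reference, the sort of invocation meant here (a sketch; this assumes
HTMLDOC's usual command-line form and is not verified against that exact
version):

# Hypothetical HTMLDOC run of the kind that fails on those URLs:
htmldoc --webpage -f durer.pdf 'http://en.wikipedia.org/wiki/Albrecht_Durer'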
May 13 '06 #3
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
May 13 '06 #4
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.
May 13 '06 #5
AES
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)


So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.


Perhaps there's a misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.
May 14 '06 #6
+ AES <si*****@stanford.edu>:

| Perhaps there's a misunderstanding here (or I didn't post clearly)

No, Phil Howard explained it, but maybe his explanation was too brief.

| I'd prefer to grab the Wikipedia article as a PDF direct from the web
| page using Acrobat [...]
| But it doesn't seem to want to work with the Wikipedia page I noted.
|
| "Save Page As Complete Web Page" using Netscape _does_ work fine
| with that web page, however, producing an index page and a folder of
| files on my HD, which then _can_ be grabbed into a PDF file by
| Acrobat.
|
| I'm just curious as to why the first approach fails with that web
| page, while the second approach still works.

Because the web server at wikipedia adapts its output to whatever user
agent is at the other end. Netscape and Acrobat are not the same user
agent, therefore what they get from the web server may be different.

This is at least a likely explanation.
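
A minimal way to check it (assuming curl is available; the User-Agent
values below are placeholders for what the two programs really send):

# Compare the status codes the server returns to two different User-Agents.
curl -s -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/5.0 (Macintosh) Netscape/7.2' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
curl -s -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' \
  'http://en.wikipedia.org/wiki/Albrecht_Durer'
# Different codes (say 200 vs 403) mean the response depends on the client.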

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell
May 14 '06 #7
AES wrote:
In article <12*************@news.supernews.com>,
Dick Margulis <ma*******@comcast.net> wrote:
Stan Brown wrote:
Sat, 13 May 2006 09:39:54 -0700 from AES <si*****@stanford.edu>:
I fairly often make PDF copies of web pages or sites by copying the web
page link from the web page itself and pasting it into the Acrobat 7.0
Standard "Create PDF From Web Page" command.

(Not trying to steal material; usually just want to make a temporary
copy to read offline, e.g. on a plane flight.)
So why not do "Save As complete web page" -- then you should see the
same stuff in your browser that you would see live.

And could, further, navigate to that saved page, using Acrobat, to turn
it into a PDF.


Perhaps there's a misunderstanding here (or I didn't post clearly).

I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
But it doesn't seem to want to work with the Wikipedia page I noted.


I understood your question the first time. What we've proposed is an
ugly workaround, which you were apparently capable of figuring out
yourself. No harm, no foul.

"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.

I'm just curious as to why the first approach fails with that web page,
while the second approach still works.


There are times when it doesn't pay to be curious. That is, if the
answer is unknowable (because you are not privy to the code running on
wikipedia's servers) or if knowing the answer won't solve the problem
(because you are privy to their code but you can't change it), then
asking why is futile.

I generally try to persuade people that What and How--as in What is
happening? What can I do to change what is happening? How does this
work? How do I use it? How can I work around it?--are more useful
questions than Why, where software is concerned. Why questions are
mostly unanswerable.
May 14 '06 #8
AES wrote:
Perhaps there's a misunderstanding here (or I didn't post clearly).
There always is. Communication fails, except by accident (Wiio's law).
By posting to three groups, you reduce the probability of exceptions.
Followups now trimmed.
I'd prefer to grab the Wikipedia article as a PDF direct from the web
page using Acrobat because
--I use PDFs and Acrobat all the time;
--The result is a single PDF file; and
--The Acrobat grab is a single step and usually works well.
Then you apparently underestimate the power of the Force, oops I mean
HTML. You view the pages on your browser, so why would you view them
differently and in a clumsier program offline? Or didn't you simply know
how to save a page with its images and all? I expect so in my
Followup-To header. Depending on the browser and its version, such a
save operation can be invoked in various ways, including keyboard
shortcuts. There are also various special features for "offline
browsing" in browsers. For more massive operations, a separate program
such as httrack might be more suitable, since it can grab an entire
_site_ when desired.
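
For example, a minimal httrack run of that kind might look like this (a
sketch only; check httrack's documentation for the exact options you want):

# Mirror one article, and the files it needs, into a local folder:
httrack 'http://en.wikipedia.org/wiki/Albrecht_Durer' -O ./durer-offline
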
But it doesn't seem to want to work with the Wikipedia page I noted.
My guess is that it chokes on non-ASCII characters in the URL. There are
basically two ways to represent non-ASCII characters in a URL, and
here it does not really matter which one is right; what matters here is
that Acrobat might fail to work properly with the way used on the
wikipedia page. (BTW, have you really got problems with wikipedia pages
in general, as the Subject line says, or with the particular page you
mentioned?)
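
To make the two representations concrete (a rough illustration; behaviour
will vary by client):

# The "ü" in Dürer can travel in a URL either as raw UTF-8 bytes or in
# percent-encoded form; Wikipedia's links use the percent-encoded form.
printf 'raw UTF-8:        Albrecht_D\303\274rer\n'
printf 'percent-encoded:  Albrecht_D%%C3%%BCrer\n'
# A client that mishandles one of these forms will fail on such URLs.
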
"Save Page As Complete Web Page" using Netscape _does_ work fine with
that web page, however, producing an index page and a folder of files on
my HD, which then _can_ be grabbed into a PDF file by Acrobat.


Apparently the version of Netscape you're using can deal with non-ASCII
characters in URLs as represented in wikipedia.
May 14 '06 #9
In comp.infosystems.www.authoring.html AES <si*****@stanford.edu> wrote:

| I fairly often make PDF copies of web pages or sites by copying the web
| page link from the web page itself and pasting it into the Acrobat 7.0
| Standard "Create PDF From Web Page" command.
|
| (Not trying to steal material; usually just want to make a temporary
| copy to read offline, e.g. on a plane flight.)
|
| This almost always works remarkably well, but when I tried it recently
| with a Wikipedia article
|
| <http://en.wikipedia.org/wiki/Albrecht_Durer>
| and also
| <http://en.wikipedia.org/wiki/Albrecht_D%C3%BCrer>
|
| I got an instant error message saying just "General Error".
|
| But if I use the "Save Page as Complete Web Site" command in Netscape
| 7.2 to capture the same web site, I can then use the same Acrobat
| command to make a PDF copy of the downloaded web site on my hard disk,
| without any error problems.
|
| I'm just curious as to what's going on here: Why doesn't my usual approach
| work? Does Wikipedia have special commands in its HTML to block Acrobat
| copying? (And I've found a way to bypass it?) Or is Acrobat being
| specially fussy about this particular site? Or. . .?

After getting the exact User-Agent string through a test access of a page
on my own server, I did a fetch from Wikipedia using that exact string and
found they were refusing access.

===============================================================================
phil@canopus:/home/phil/user-agent-test 114> wget --user-agent='Mozilla/4.0 (compatible; WebCapture 3.0; Macintosh)' --timestamping --no-directories 'http://en.wikipedia.org/wiki/User_agent'
--10:52:15-- http://en.wikipedia.org/wiki/User_agent
=> `User_agent'
Resolving en.wikipedia.org... 207.142.131.246, 207.142.131.247, 207.142.131.248, ...
Connecting to en.wikipedia.org|207.142.131.246|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
10:52:16 ERROR 403: Forbidden.

===============================================================================

I can understand some of this. Wikipedia has frequently been hit by some
nasty spidering programs. I ran into this myself several years ago when
the issue was lack of a User-Agent header. There was some discussion on
this somewhere on the site (I don't recall where right now).

The resolution will obviously be to alter the User-Agent somehow. And be
sure it is something that doesn't mimic anything that has a history of
abusing Wikipedia, or looks like it might.
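
In wget terms, that might look something like this (a sketch; the User-Agent
string here is made up, and whether Wikipedia accepts it is not verified):

# Retry the same fetch with a modest, descriptive User-Agent of your own:
wget --user-agent='PersonalOfflineReader/1.0 (one-off copy for offline reading)' \
  --timestamping --no-directories \
  'http://en.wikipedia.org/wiki/User_agent'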

--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
May 15 '06 #10
