Chasing PDF links while crawling the Web?

I am trying to crawl the Web using HttpWebRequest and HttpWebResponse for
each of the links I discover. For content types like application/pdf, I
am able to convert the PDF file to text using third-party converters.
However, I see that some PDF files internally link each page to another URL
(normally by appending &PGN=pageNumber to the URL). Opening the PDF file in
Notepad, I see tags like GoToR and some binary information, which presumably
points to the URL. In the browser, each page of the PDF is loaded on
demand; the URL is constructed when the page gets loaded.
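
To make the GoToR observation concrete, here is a minimal sketch of pulling link targets out of the raw PDF bytes. It is written in Python for brevity rather than .NET, and it rests on assumptions: the function name `find_link_targets` and the sample bytes are hypothetical, and a plain byte scan like this only sees action dictionaries that are stored uncompressed with literal (parenthesized) strings. In the PDF format, a remote go-to action carries `/S /GoToR` with its target under `/F`, and a URI action carries its target under `/URI`.

```python
import re

# Literal strings that follow an /F or /URI key, e.g. /F (http://host/doc?PGN=2).
# Escaped characters such as \( and \) are allowed inside the string.
_TARGET = re.compile(rb"/(?:F|URI)\s*\(((?:\\.|[^\\()])*)\)")


def find_link_targets(pdf_bytes: bytes) -> list[str]:
    """Return the distinct /F and /URI string targets found in raw PDF bytes.

    Assumption: the action dictionaries are not inside compressed streams.
    """
    targets: list[str] = []
    for match in _TARGET.finditer(pdf_bytes):
        # Undo the PDF escapes for parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
        text = raw.decode("latin-1")
        if text not in targets:
            targets.append(text)
    return targets


# Hypothetical fragment of a PDF body, not taken from the real document.
sample = (
    b"<< /Type /Action /S /GoToR "
    b"/F (http://example.com/pdfdoc?ID=DOC1&PGN=2) /D [0 /Fit] >>"
)
print(find_link_targets(sample))
```

If the targets turn out to be compressed or encoded differently, a real PDF parsing library would be needed instead of a byte scan.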

Here is an example URL:

http://v3.espacenet.com/pdfdoc?DB=EP...N=WO2005028634

When I scroll to page 3, the URL changes to

http://v3.espacenet.com/pdfdoc?DB=EP...05028634&PGN=3
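
The URL pattern above can be sketched as a small helper that derives the per-page URLs from the first-page URL. This is an illustration in Python under stated assumptions, not a confirmed description of how the server works: the names `page_url` and `page_urls` are hypothetical, the `example.com` URL stands in for the truncated one above, and the page count has to be known from somewhere else.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(first_page_url: str, page: int) -> str:
    """Return the URL for one page by setting the PGN query parameter."""
    parts = urlsplit(first_page_url)
    # Drop any existing PGN value so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "PGN"
    ]
    query.append(("PGN", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(first_page_url: str, page_count: int) -> list[str]:
    """First-page URL unchanged, then one URL per page from 2 to page_count."""
    return [first_page_url] + [
        page_url(first_page_url, n) for n in range(2, page_count + 1)
    ]


# Hypothetical first-page URL; the real query string is elided in the post.
for url in page_urls("http://example.com/pdfdoc?ID=DOC1", 3):
    print(url)
```

Each generated URL could then be requested separately, the same way the first page is.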

HttpWebRequest.GetResponse() and HttpWebResponse.GetResponseStream() return
just the first page for the link. Once downloaded, the PDF file is obviously
truncated. But in the browser, each page is loaded.

How do I get the entire file with GetResponseStream()?
Nov 17 '05 #1