Chasing PDF links while crawling the Web?

I am trying to crawl the Web using HttpWebRequest and HttpWebResponse for
each of the links I discover. For content types like application/pdf, I
am able to convert the PDF file to text using third-party converters.
However, I see that some PDF files internally link each page to another URL
(normally by appending &PGN=pageNumber to the URL). Opening the PDF file in
Notepad, I see tags like GoToR and some binary information, which must be
pointing to the URL. In the browser, each page of the PDF is loaded on
demand; the URL is constructed when the page gets loaded.

Here is an example URL:

http://v3.espacenet.com/pdfdoc?DB=EP...N=WO2005028634

When I scroll to page 3, the URL changes to

http://v3.espacenet.com/pdfdoc?DB=EP...05028634&PGN=3
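
To make the URL pattern concrete, here is a minimal sketch of how the per-page URL could be built. It is written in Python rather than .NET purely for illustration, and it assumes the server only needs a PGN query parameter set to the page number. The function name page_url and the example.com URL are hypothetical; the real query string above is elided, so it is not reproduced here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page: int) -> str:
    """Return base_url with its PGN query parameter set to `page`.

    Assumption: the server selects a page purely from PGN, as the
    browser URLs suggest. Any existing PGN value is replaced.
    """
    parts = urlsplit(base_url)
    # Keep every existing parameter except a stale PGN.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "PGN"]
    query.append(("PGN", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL, standing in for the elided one in the post.
print(page_url("http://example.com/pdfdoc?DB=X&IDX=Y", 3))
```

Replacing an existing PGN instead of appending a second one keeps the URL valid when the function is applied to a link that already points at a specific page.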

HttpWebRequest.GetResponse() and HttpWebResponse.GetResponseStream() return
just the first page for the link, so the downloaded PDF file is truncated. In
the browser, however, every page is loaded.

How do I get the entire file with GetResponseStream()?
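
One possible approach, sketched here in Python rather than .NET, is to stop expecting a single response to contain the whole document and instead request each page by its PGN value until the server no longer returns a PDF. This is only a sketch under several assumptions: pages are numbered from 1, the first page can also be requested with PGN=1, and an out-of-range page yields an error or a non-PDF body. The names http_fetch and fetch_all_pages and the max_pages cap are hypothetical. The same loop could be written with one HttpWebRequest per page.

```python
from typing import Callable, List, Optional
import urllib.request

PDF_MAGIC = b"%PDF"  # PDF files begin with these bytes


def http_fetch(url: str) -> Optional[bytes]:
    """Return the response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return None


def fetch_all_pages(base_url: str,
                    fetch: Callable[[str], Optional[bytes]] = http_fetch,
                    max_pages: int = 500) -> List[bytes]:
    """Collect one PDF body per page by requesting PGN=1, 2, 3, ...

    Stops at the first failed request or non-PDF body. max_pages is a
    safety cap in case the server never signals the end of the document.
    """
    sep = "&" if "?" in base_url else "?"
    pages: List[bytes] = []
    for n in range(1, max_pages + 1):
        body = fetch(f"{base_url}{sep}PGN={n}")
        if body is None or not body.startswith(PDF_MAGIC):
            break
        pages.append(body)
    return pages
```

The result is a list of separate single-page PDF bodies, not one merged file, so each one would be passed to the text converter in turn. If the server answers an out-of-range PGN by repeating a valid page, the loop only ends at max_pages, so a check for repeated bodies may be needed.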
Nov 17 '05 #1