I am trying to crawl the Web using HttpWebRequest and HttpWebResponse for
each of the links I discover. For content types like application/pdf, I can
convert the PDF file to text using third-party converters.
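The fetch-and-convert step looks roughly like this (a sketch only; the output
file name and the hand-off to the converter are illustrative, not my exact code):

```csharp
using System;
using System.IO;
using System.Net;

class CrawlerSketch
{
    // Fetch one discovered link and, if it is a PDF, save the raw bytes
    // so a third-party converter can turn them into text.
    static void Fetch(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // Branch on the Content-Type header; PDFs go to the converter.
            if (response.ContentType.StartsWith("application/pdf",
                    StringComparison.OrdinalIgnoreCase))
            {
                using (Stream body = response.GetResponseStream())
                using (FileStream file = File.Create("page.pdf"))
                {
                    body.CopyTo(file); // save bytes, then run the converter
                }
            }
        }
    }
}
```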
However, I see that some PDF files internally link each page to another URL
(normally by appending &PGN=pageNumber to the URL). Opening the PDF file in
Notepad, I see tags like GoToR along with some binary data, which must be
pointing to the URL. In the browser, each page of the PDF is loaded on
demand; the URL is constructed as the page loads.
Here is an example URL:
http://v3.espacenet.com/pdfdoc?DB=EP...N=WO2005028634
Scrolling to page 3, the URL changes to
http://v3.espacenet.com/pdfdoc?DB=EP...05028634&PGN=3
HttpWebRequest.GetResponse() and HttpWebResponse.GetResponseStream() return
just the first page for the link. Once downloaded, the PDF file is obviously
truncated, yet in the browser every page loads.
How do I get the entire file with GetResponseStream()?
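For reference, here is roughly the download code that produces the truncated
file (a sketch; url stands for the pdfdoc link quoted above, and the output
path is illustrative):

```csharp
using System.IO;
using System.Net;

class PdfDownloadSketch
{
    // Downloads whatever the server sends for 'url' into 'path'.
    // For the pdfdoc URL above, this turns out to be only the first page.
    static void Download(string url, string path)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream input = response.GetResponseStream())
        using (FileStream output = File.Create(path))
        {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copies every byte of the response body -- so the
                // truncation is in what the server returns, not in
                // how the stream is read.
                output.Write(buffer, 0, read);
            }
        }
    }
}
```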