Bytes IT Community

Difference between page source and what's being displayed

P: 15
Hi All,

(I originally posted this in .NET Programming Languages, but I realised that the fact I'm using C# is irrelevant, as I think the problem I'm having is with DHTML / JavaScript.)

So here goes again:

I'm writing an app in C# that will be doing a bit of web scraping. I've got a fair bit of experience with this, but I've come across an issue with the HTML I'm getting back from the pages.

When I view the page I'm trying to scrape in a browser all is well, and I can see a 'Services' section at the bottom of the page. However, when I do a view source on the page, the corresponding code for that part of the page is missing. I've realised that the reason may be a JavaScript function on the page where the 'Services' section should be. Here is the code:

<script type='text/javascript'>initTabs('dhtmlgoodies_tabView',Array('Current Services','Services Overview','Notes<img src="temp_files/rosettes/notes_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Photos','Attachments','Services Tree','Services Detail','Organisation Tree','Sales Tasks','Incidents<img src="temp_files/rosettes/openincidents_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Orders<img src="temp_files/rosettes/openorders_1.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Ceases','POs'),0,'100%','');</script>
I've found the corresponding function in an included .js file and it comes from www.dhtmlgoodies.com.

So my problem is this: when viewing in a browser (and doing a view source) I can't see the actual HTML that creates this 'Services' section, and so when I do my HttpWebRequest I don't get that HTML either.

In summary, how can I get ALL the source information from a WebRequest? Do I somehow get the WebRequest to run the JavaScript function and return the data? If so, how?

Very much appreciate any help!

Regards,

Andy
Sep 14 '08 #1
5 Replies


acoder
Expert Mod 15k+
P: 16,027
For JavaScript, you would have to eval it, or you could add it to the main web page using dynamic script tags.

Depending on add-ons, some browsers can show you the generated source rather than the original source.
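As a sketch of the eval route (the page source, the `result` object and the regex below are invented for illustration; on the real page the inline script would be the initTabs call):

```javascript
// Hypothetical fetched page source containing an inline script (illustrative only).
var result = {};
var pageSource =
  '<html><head><script type="text/javascript">' +
  'result.tabs = ["Current Services", "Photos", "Orders"];' +
  '<\/script></head><body></body></html>';

// Extract the first inline script body with a regex, then eval it.
var match = pageSource.match(/<script[^>]*>([\s\S]*?)<\/script>/i);
if (match) {
  eval(match[1]); // runs in this scope, so the script can populate `result`
}
console.log(result.tabs.length); // 3
```

This only works when the script you need is inline and self-contained; anything that touches the DOM will fail outside a browser, which is the catch discussed below.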
Sep 14 '08 #2

P: 15
OK, thanks.

Can you provide an example of how I can evaluate it in my context?

I can view the source in my browser, but that's not the problem: I'm missing some! The items displayed in the browser differ from what's in the source code.

I need the output of the JavaScript functions to show up in the source I retrieve.
Sep 15 '08 #3

rnd me
Expert 100+
P: 427
if you want to display content that is provided by running javascript, you will have to run the javascript.

this will prove difficult in C#.

you might be able to use a webbrowser control to do your scraping.

you might also be able to evaluate the code as jscript, but you wouldn't have any DOM methods in that case, and your dynamic content would probably not work.
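the DOM caveat can be seen directly: outside a browser there is no document object, so evaluating page script that builds dynamic content just throws. a minimal sketch (the script string is illustrative):

```javascript
// script text as it might appear in a scraped page (illustrative).
var pageScript = 'document.getElementById("message").innerHTML = "Hello World";';

var failed = false;
try {
  eval(pageScript); // no browser here: `document` does not exist
} catch (e) {
  failed = true;    // ReferenceError: document is not defined
}
console.log(failed); // true
```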

in short, you really can't do this.

you may be able to engineer a workaround, but it would likely take a lot longer than implementing another route to your solution.
Sep 16 '08 #4

P: 15
OK, thanks.

I'll try explaining what I'm trying to do and see if someone can suggest an alternative method!

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
    <title>Test</title>
    <script type="text/javascript">
    var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
    function initialize() {
        for (var i = 0; i < word.length; i++) {
            document.getElementById("message").innerHTML += word[i];
        }
    }
    </script>
  </head>

  <body onload="initialize()">
    <div id="message"></div>
  </body>
</html>
Imagine the source code above saved in test.html.

Now if you view it in a browser you can clearly see "Hello World" on the page.
But if you click on view source then all you see is the above code.
Now here's my problem: if you searched the scraped source code from this page for "Hello World", it would not be found.
We can clearly see it on the web page, but not in the source code, as "Hello World" is only created at runtime.
How can I scrape "Hello World" from this page? (I need the dynamic output!)

[The above is just a simplification of what I'm trying to do. Obviously I'm using regex to scrape data that changes all the time and has nothing to do with the above, but the principle is identical.]
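The gap can be reproduced outside the browser by running the page's script against a minimal stand-in for the DOM (the element stub below is an illustration, not part of any real page): a regex over the raw source finds nothing, while the stub ends up holding the rendered text.

```javascript
// The raw source, as an HttpWebRequest would see it: the div is empty.
var rawSource = '<body onload="initialize()"><div id="message"></div></body>';
var foundInSource = /Hello World/.test(rawSource);

// Minimal stand-in for the DOM: one element with an innerHTML property.
var message = { innerHTML: "" };
var document = { getElementById: function (id) { return message; } };

// The page's script, run against the stub instead of a real browser.
var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
function initialize() {
  for (var i = 0; i < word.length; i++) {
    document.getElementById("message").innerHTML += word[i];
  }
}
initialize();

console.log(foundInSource);     // false: the text is not in the raw source
console.log(message.innerHTML); // "Hello World": it only exists after the script runs
```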

Hope this is clear enough for someone to suggest a solution. I don't mind if it uses a different approach / technology altogether!

Thanks

Andy
Sep 16 '08 #5

rnd me
Expert 100+
P: 427
if my proposals sound complicated, that's because they are.
screen-scraping has always been complex, hence the push toward rss.
dynamic content only serves to greatly complicate things further.

it sounds as though you want to access the data you can see in a browser.
it would only make sense then to use a browser in your solution.

unfortunately, browsers don't like sharing web-pages with javascript code.
if your solution is to be coded in javascript (the natural language of browsers), you will run into the "same-origin policy".

this means that you will not only need a browser, but a browser navigated to the site you want to scrape.

you would then run your scraping code in the context of that site.

how the heck can you run your code on someone else's site?

two suggestions:
1. greasemonkey, which is also capable of cross-domain transfers. you code a greasemonkey script file, and greasemonkey automatically injects the code into an associated webpage every time the page is visited. coupled with firebug, this makes a simple yet robust screen-scraping solution. the main downside is that running your application means navigating to the page in firefox, rather than simply executing an exe.

2. MS's webbrowser control. using a browser as an object in another language like VB or C# allows injection of code into any particular site through a series of public API properties. the current source can be read and manipulated with this interface. the main advantage of this approach is that you can do the data processing in either javascript or the host language of your project. the disadvantage is the complexity and rigidity of a hard-coded solution tailored to a single site/design.
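to make option 1 concrete, here is a minimal userscript sketch (the @include pattern, the "services" div id, and the collector url are illustrative assumptions, not details of the real page):

```javascript
// ==UserScript==
// @name        scrape-services
// @include     http://example.com/services/*
// ==/UserScript==

// Pure extraction step: runs against the *rendered* markup, where the
// content generated by initTabs() already exists in the DOM.
function extractServices(renderedHtml) {
  var match = renderedHtml.match(/<div id="services">([\s\S]*?)<\/div>/i);
  return match ? match[1].trim() : null;
}

// In the userscript, the rendered markup comes from the live page, e.g.:
//   var services = extractServices(document.body.innerHTML);
//   GM_xmlhttpRequest({ method: "POST",
//                       url: "http://localhost/collect",  // your own collector
//                       data: services });
```

the extraction is kept as a plain function so it can be tested outside the browser; only the two commented lines at the end depend on greasemonkey.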

there may be other solutions as well:

- john resig recently emulated a browser's DOM entirely in javascript running on rhino. you may be able to harness the updates using a model like his.

- you might be able to download and reverse-engineer the javascript files to grab the apropos data from the script block. this would hinge largely on the complexity of the update script.

- you could use ASP jscript and a msHTMLDocument object to meld the changes to the document and capture the result. this avoids the same-origin problems that plague browser solutions.

in conclusion, it can be done, but it won't be easy. i like the greasemonkey scenario for its simplicity and available documentation. if you can live with running firefox to grab your updates, this is certainly the path of least resistance.
Sep 16 '08 #6
