Bytes IT Community

Difference between page source and what's being displayed

P: 15
Hi All,

(I originally posted this in .NET Programming Languages, but I realised that the fact I'm using C# is irrelevant, as I think the problem I'm having is with DHTML / JavaScript.)

So here goes again:

I'm writing an app in C# that will be doing a bit of web scraping. I've got a fair bit of experience with this, but I've come across an issue with the HTML I'm getting back from the pages.

When I view the page I'm trying to scrape in a browser all is well, and I can see a 'Services' section at the bottom of the page. However, when I do a view source on the page, the corresponding code for that part of the page is missing. I've realised that the reason may be a JavaScript function on the page where the 'Services' section should be. Here is the code:

<script type='text/javascript'>initTabs('dhtmlgoodies_tabView',Array('Current Services','Services Overview','Notes<img src="temp_files/rosettes/notes_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Photos','Attachments','Services Tree','Services Detail','Organisation Tree','Sales Tasks','Incidents<img src="temp_files/rosettes/openincidents_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Orders<img src="temp_files/rosettes/openorders_1.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Ceases','POs'),0,'100%','');</script>
I've found the corresponding function in an included .js file and it comes from www.dhtmlgoodies.com.

So my problem is this: when viewing in a browser (and doing a view source) I can't see the actual HTML that creates this 'Services' section, and so when I do my HttpWebRequest I don't get that HTML either.

In summary, how can I get ALL the source information from a WebRequest? Do I somehow get the WebRequest to run the JavaScript function and return the data? If so, how?

Very much appreciate any help!

Regards,

Andy
Sep 14 '08 #1
5 Replies


acoder
Expert Mod 15k+
P: 16,027
For JavaScript, you would have to eval it, or you could add it to the main web page using dynamic script tags.

Depending on add-ons, some browsers can show you the generated source rather than the original source.
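As a sketch of the eval route (the page source, the `result` object and the regex below are invented for illustration; on the real page the inline script would be the initTabs call):

```javascript
// Hypothetical fetched page source containing an inline script (illustrative only).
var result = {};
var pageSource =
  '<html><head><script type="text/javascript">' +
  'result.tabs = ["Current Services", "Photos", "Orders"];' +
  '<\/script></head><body></body></html>';

// Extract the first inline script body with a regex, then eval it.
var match = pageSource.match(/<script[^>]*>([\s\S]*?)<\/script>/i);
if (match) {
  eval(match[1]); // runs in this scope, so the script can populate `result`
}
console.log(result.tabs.length); // 3
```

This only works when the script you need is inline and self-contained; anything that touches the DOM will fail outside a browser, which is the catch discussed below.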
Sep 14 '08 #2

P: 15
OK, thanks.

Can you provide an example of how I can evaluate it in my context?

I can view the source in my browser, but that's not the problem: I'm missing some! The items displayed in the browser differ from what's in the source code.

I need the output of the JavaScript functions to show up in the source I retrieve.
Sep 15 '08 #3

rnd me
Expert 100+
P: 427
if you want to display content that is provided by running javascript, you will have to run the javascript.

this will prove difficult in C#.

you might be able to use a webbrowser control to do your scraping.

you might also be able to evaluate the code as jscript, but you wouldn't have any DOM methods in that case, and your dynamic content would probably not work.
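the DOM caveat can be seen directly: outside a browser there is no document object, so evaluating page script that builds dynamic content just throws. a minimal sketch (the script string is illustrative):

```javascript
// script text as it might appear in a scraped page (illustrative).
var pageScript = 'document.getElementById("message").innerHTML = "Hello World";';

var failed = false;
try {
  eval(pageScript); // no browser here: `document` does not exist
} catch (e) {
  failed = true;    // ReferenceError: document is not defined
}
console.log(failed); // true
```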

in short, you really can't do this.

you may be able to engineer a workaround, but it would likely take a lot longer than implementing another route to your solution.
Sep 16 '08 #4

P: 15
OK, thanks.

I'll try explaining what I'm trying to do and see if someone can suggest an alternative method!

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
    <title>Test</title>
    <script type="text/javascript">
    var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
    function initialize() {
        for (var i = 0; i < word.length; i++) {
            document.getElementById("message").innerHTML += word[i];
        }
    }
    </script>
  </head>

  <body onload="initialize()">
    <div id="message"></div>
  </body>
</html>
Imagine the source code above saved in test.html.

Now if you view it in a browser you can clearly see "Hello World" on the page.
But if you click on view source then all you see is the above code.
Now here's my problem: if you searched the scraped source code from this page for "Hello World", it would not be found.
We can clearly see it on the web page, but not in the source code, as "Hello World" is only created at runtime.
How can I scrape "Hello World" from this page? (I need the dynamic output!)

[The above is just a simplification of what I'm trying to do. Obviously I'm using regex to scrape data that changes all the time and has nothing to do with the above, but the principle is identical.]
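The gap can be reproduced outside the browser by running the page's script against a minimal stand-in for the DOM (the element stub below is an illustration, not part of any real page): a regex over the raw source finds nothing, while the stub ends up holding the rendered text.

```javascript
// The raw source, as an HttpWebRequest would see it: the div is empty.
var rawSource = '<body onload="initialize()"><div id="message"></div></body>';
var foundInSource = /Hello World/.test(rawSource);

// Minimal stand-in for the DOM: one element with an innerHTML property.
var message = { innerHTML: "" };
var document = { getElementById: function (id) { return message; } };

// The page's script, run against the stub instead of a real browser.
var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
function initialize() {
  for (var i = 0; i < word.length; i++) {
    document.getElementById("message").innerHTML += word[i];
  }
}
initialize();

console.log(foundInSource);     // false: the text is not in the raw source
console.log(message.innerHTML); // "Hello World": it only exists after the script runs
```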

Hope this is clear enough for someone to suggest a solution. I don't mind if it uses a different approach / technology altogether!

Thanks

Andy
Sep 16 '08 #5

rnd me
Expert 100+
P: 427
if my proposals sound complicated, that's because they are.
screen-scraping has always been complex, hence the push toward rss.
dynamic content only serves to greatly complicate things further.

it sounds as though you want to access the data you can see in a browser.
it would only make sense then to use a browser in your solution.

unfortunately, browsers don't like sharing web-pages with javascript code.
if your solution is to be coded in javascript (the natural language of browsers), you will run into the "same-origin policy".

this means that you will not only need a browser, but a browser navigated to the site you want to scrape.

you would then run your scraping code in the context of that site.

how the heck can you run your code on someone else's site?

two suggestions:
1. greasemonkey, which is also capable of cross-domain transfers. you code a greasemonkey script file, and greasemonkey automatically injects the code into an associated webpage every time the page is visited. coupled with firebug, this makes a simple yet robust screen-scraping solution. the main downside is that running your application means navigating to the page in firefox, rather than simply executing an exe.

2. MS's webbrowser control. using a browser as an object in another language like VB or C# allows injection of code into any particular site through a series of public API properties. the current source can be read and manipulated with this interface. the main advantage of this approach is that you can do the data processing in either javascript or the host language of your project. the disadvantage is the complexity and rigidity of a hard-coded solution tailored to a single site/design.
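to make option 1 concrete, here is a minimal userscript sketch (the @include pattern, the "services" div id, and the collector url are illustrative assumptions, not details of the real page):

```javascript
// ==UserScript==
// @name        scrape-services
// @include     http://example.com/services/*
// ==/UserScript==

// Pure extraction step: runs against the *rendered* markup, where the
// content generated by initTabs() already exists in the DOM.
function extractServices(renderedHtml) {
  var match = renderedHtml.match(/<div id="services">([\s\S]*?)<\/div>/i);
  return match ? match[1].trim() : null;
}

// In the userscript, the rendered markup comes from the live page, e.g.:
//   var services = extractServices(document.body.innerHTML);
//   GM_xmlhttpRequest({ method: "POST",
//                       url: "http://localhost/collect",  // your own collector
//                       data: services });
```

the extraction is kept as a plain function so it can be tested outside the browser; only the two commented lines at the end depend on greasemonkey.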

there may be other solutions as well:

- john resig recently emulated a browser's DOM entirely in javascript running on rhino. you may be able to harness the updates using a model like his.

- you might be able to download and reverse-engineer the javascript files to grab the apropos data from the script block. this would hinge largely on the complexity of the update script.

- you could use ASP jscript and a msHTMLDocument object to meld the changes to the document and capture the result. this avoids the same-origin problems that plague browser solutions.

in conclusion, it can be done, but it won't be easy. i like the greasemonkey scenario for its simplicity and available documentation. if you can live with running firefox to grab your updates, this is certainly the path of least resistance.
Sep 16 '08 #6
