Difference between page source and what's being displayed

Hi All,

(I originally posted this in the .NET Programming Languages forum, but I realised that my using C# is irrelevant, as I think the problem I'm having is with DHTML / JavaScript.)

So here goes again:

I'm writing an app in C# that will be doing a bit of web scraping. I've got a fair bit of experience with this, but I've come across an issue with the HTML that is returned from the pages.

When I view the page I'm trying to scrape in a browser all is well, and I can see a 'Services' section at the bottom of the page. However, when I do a view source on the page, the corresponding code for that part of the page is missing. I've realised that the reason may be that the page has a JavaScript function call where the 'Services' section should be. Here is the code:

<script type='text/javascript'>initTabs('dhtmlgoodies_tabView',Array('Current Services','Services Overview','Notes<img src="temp_files/rosettes/notes_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Photos','Attachments','Services Tree','Services Detail','Organisation Tree','Sales Tasks','Incidents<img src="temp_files/rosettes/openincidents_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Orders<img src="temp_files/rosettes/openorders_1.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Ceases','POs'),0,'100%','');</script>
I've found the corresponding function in an included .js file and it comes from www.dhtmlgoodies.com.

So my problem is this: when viewing the page in a browser (and doing a view source) I can't see the actual HTML that creates this 'Services' section, and thus when I do my HttpWebRequest I don't get that HTML either.

In summary, how can I get ALL the source information from a WebRequest? Do I somehow get the WebRequest to run the JavaScript function and return the data? If so, how?

Very much appreciate any help!

Regards,

Andy
Sep 14 '08 #1
acoder
16,027 Expert Mod 8TB
For JavaScript, you would have to eval it or you could add it to the main web page using dynamic script tags.
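
As a rough sketch of the dynamic script tag idea: the helper below appends a `<script>` element to the page's head so that the browser fetches and runs the file. The name `loadScript` and the example URL are made up for illustration, and this only helps when your code is already running inside a browser page, not from a plain HTTP request.

```javascript
// Append a <script> element to <head> so the browser fetches and runs `url`.
// `doc` is passed in (rather than using the global) to keep the helper testable.
function loadScript(doc, url, onLoaded) {
  var script = doc.createElement('script');
  script.type = 'text/javascript';
  script.src = url;
  script.onload = onLoaded; // called once the script has been fetched and run
  doc.getElementsByTagName('head')[0].appendChild(script);
  return script;
}

// Only attempt this when a real browser document is available.
if (typeof document !== 'undefined') {
  // Hypothetical URL, for illustration only.
  loadScript(document, 'http://example.com/hypothetical.js', function () {
    console.log('script loaded');
  });
}
```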

Depending on browser/add-ons, you can see the generated source in some browsers.
Sep 14 '08 #2
Ok Thanks,

Can you provide an example of how I can evaluate it in my context?

I can view the source in my browser, but that's not the problem: some of it is missing! The items displayed in the browser differ from what's in the source code.

I need the output of the JavaScript functions to show up in the view source.
Sep 15 '08 #3
rnd me
427 Expert 256MB
if you want to display content that is provided by running javascript, you will have to run the javascript.

this will prove difficult in C#.

you might be able to use a webbrowser control to do your scraping.

you might also be able to evaluate the code as jscript, but you wouldn't have any DOM methods in that case, and your dynamic content would probably not work.
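
to illustrate the "no DOM methods" problem, here is a minimal sketch in plain JavaScript. it assumes a small page script that writes into an element with id `message`; the `Function` constructor stands in for whatever script engine you would host, and `runPageScript` and `createStubDocument` are made-up names. the stub only implements `getElementById`, so anything a real page script needs beyond that would still fail.

```javascript
// A small page script of the kind that builds its content at runtime.
var pageScript = [
  'var word = ["H", "e", "l", "l", "o"];',
  'function initialize() {',
  '  for (var i = 0; i < word.length; i++) {',
  '    document.getElementById("message").innerHTML += word[i];',
  '  }',
  '}',
  'initialize();'
].join('\n');

// Run the script source with a caller-supplied `document` (possibly none).
function runPageScript(source, fakeDocument) {
  var run = new Function('document', source);
  run(fakeDocument);
}

// A deliberately tiny stand-in for the DOM: it only knows getElementById.
function createStubDocument() {
  var elements = {};
  return {
    getElementById: function (id) {
      if (!elements[id]) {
        elements[id] = { innerHTML: '' };
      }
      return elements[id];
    }
  };
}

// 1. No DOM: the script fails as soon as it touches `document`.
var errorWithoutDom = null;
try {
  runPageScript(pageScript, undefined);
} catch (e) {
  errorWithoutDom = e;
}

// 2. Stub DOM: the script runs and its output can be read back.
var stub = createStubDocument();
runPageScript(pageScript, stub);
var scraped = stub.getElementById('message').innerHTML;
console.log(scraped);
```

a real page (tabs, images, event handlers) touches far more of the DOM than this, which is why a hand-written stub rarely gets you very far.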

in short, you really can't do this.

you may be able to engineer a workaround, but it would likely take a lot longer than implementing another route to your solution.
Sep 16 '08 #4
OK, thanks.

I'll try explaining what I'm trying to do and see if someone can suggest an alternative method!

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
    <title>Test</title>
    <script type="text/javascript">
    // The visible text only exists once this script has run.
    var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
    function initialize() {
        for (var i = 0; i < word.length; i++) {
            document.getElementById("message").innerHTML += word[i];
        }
    }
    </script>
  </head>

  <body onload="initialize()" >
    <div id="message"></div>
  </body>
</html>
Imagine the source code above saved in test.html.

Now if you view it in a browser you can clearly see "Hello World" on the page.
But if you click on view source then all you see is the above code.
Now here's my problem: if you searched the scraped source code from this website for "Hello World", it would not be found.
We can clearly see it on the web page, but not in the source code, as "Hello World" is only created at runtime.
How can I scrape "Hello World" from this website? (I need the dynamic output!)

[The above is just a simplification of what I'm trying to do. Obviously I'm using regex to scrape data that changes all the time and has nothing to do with the above, but the principle is identical.]

Hope this is clear enough for someone to suggest a solution. I don't mind if it uses a different approach / technology altogether!

Thanks

Andy
Sep 16 '08 #5
rnd me
427 Expert 256MB
if my proposals sound complicated, that's because they are.
screen-scraping has always been complex, hence the push toward rss.
dynamic content only serves to greatly complicate things further.

it sounds as though you want to access the data you can see in a browser.
it would only make sense then to use a browser in your solution.

unfortunately, browsers don't let javascript code from one site read web pages that belong to another site.
if your solution is to be coded in javascript (the natural language of browsers), you will run into the "same-origin policy".

this means that you will not only need a browser, but a browser navigated to the site you want to scrape.

you would then run your scraping code in the context of that site.

how the heck can you run your code on someone else's site?

two suggestions:
1. greasemonkey, which is also capable of cross-domain transfers. you code a greasemonkey script file, and greasemonkey automatically injects the code into an associated webpage every time the page is visited. coupled with firebug, this makes a simple yet robust screen-scraping solution. the main downside is that running your application means navigating to the page in firefox, rather than simply executing an exe.

2. MS's webbrowser control. using a browser as an object in another language like VB or C# allows injection of code into any particular site through a series of public API properties. the current source can be read and manipulated with this interface. the main advantage to this approach is that you can do the data processing in either javascript or the host language of your project. the disadvantage is the complexity and rigidity of a hard-coded solution tailored to a single site/design.
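
a minimal sketch of the greasemonkey route, assuming the target page fills an element with id `message` during its own onload handler. the `@include` URL, the `@namespace`, and the helper name `readRenderedText` are placeholders, and what you do with the text afterwards (log it, post it to your own server) is left open.

```javascript
// ==UserScript==
// @name        Scrape rendered text (sketch)
// @namespace   http://example.com/hypothetical
// @include     http://example.com/test.html
// ==/UserScript==

// Read the content of an element after the page's own scripts have filled it in.
// Returns null when the element does not exist.
function readRenderedText(doc, elementId) {
  var node = doc.getElementById(elementId);
  return node ? node.innerHTML : null;
}

// Only hook into the page when a real browser window is present.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
  // Wait for the load event so the page's onload work has a chance to run first.
  window.addEventListener('load', function () {
    var text = readRenderedText(document, 'message');
    console.log('scraped: ' + text);
  }, false);
}
```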

there may be other solutions as well:

- john resig recently emulated a browser's DOM entirely in javascript running on rhino. you may be able to harness the updates using a model like his.

- you might be able to download and reverse-engineer the javascript files to grab the relevant data from the script block. this would hinge largely on the complexity of the update script.

- you could use ASP jscript and a msHTMLDocument object to meld the changes to the document and capture the result. this avoids the same origin problems that plague browser solutions.
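
the reverse-engineering option can be sketched like this, assuming the data you want sits in the raw page source as a simple array literal assigned to a known variable. the sample HTML, the variable name `word`, and the helper `extractArrayLiteral` are illustrative only; a real update script will usually need a more careful parser than one regular expression.

```javascript
// Raw HTML as an HTTP request would return it (the script has not been executed).
var rawHtml = [
  '<script type="text/javascript">',
  'var word = ["H", "e", "l", "l", "o"];',
  '</script>',
  '<div id="message"></div>'
].join('\n');

// Pull the array literal assigned to `variableName` out of the page source.
// Returns the parsed array, or null when no such assignment is found.
function extractArrayLiteral(html, variableName) {
  var pattern = new RegExp('var\\s+' + variableName + '\\s*=\\s*(\\[[^\\]]*\\])');
  var match = pattern.exec(html);
  // JSON.parse works here only because the literal uses double-quoted strings.
  return match ? JSON.parse(match[1]) : null;
}

// Rebuild the text that the page script would have produced at runtime.
var letters = extractArrayLiteral(rawHtml, 'word');
var text = letters ? letters.join('') : null;
console.log(text);
```

the same idea ports directly to a regex in C#: you never run the page script, you just read the data it would have used.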

in conclusion, it can be done, but it won't be easy. i like the greasemonkey scenario for its simplicity and available documentation. if you can live with running firefox to grab your updates, this is certainly the path of least resistance.
Sep 16 '08 #6
