Difference between page source and what's being displayed

15 New Member
Hi All,

(I originally posted this in the .NET Programming Languages forum, but i realised that the fact i'm using C# is irrelevant, as i think the problem i'm having is with DHTML / JavaScript.)

So here goes again:

I'm writing an app in C# that will be doing a bit of web scraping. I've got a fair bit of experience with this, but i've come across an issue with the HTML i'm getting back from the pages.

When i view the page i'm trying to scrape in a browser all is well, and i can see a 'Services' section on the bottom of the page, however when i do a view source on the page, the corresponding code is missing for that part of the page. I've realised that the reason may be that on the page there is a JavaScript function where the 'Services' section should be. Here is the code:

    <script type='text/javascript'>initTabs('dhtmlgoodies_tabView',Array('Current Services','Services Overview','Notes<img src="temp_files/rosettes/notes_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Photos','Attachments','Services Tree','Services Detail','Organisation Tree','Sales Tasks','Incidents<img src="temp_files/rosettes/openincidents_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Orders<img src="temp_files/rosettes/openorders_1.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Ceases','POs'),0,'100%','');</script>

I've found the corresponding function in an included .js file and it comes from www.dhtmlgoodies.com.

So my problem is this: when viewing in a browser (and doing a view source) i can't see the actual HTML that creates this 'Services' section, and thus when i do my HttpWebRequest i don't get the HTML either.

In summary, how can i get ALL the source information from a WebRequest? Do i somehow get the WebRequest to run the JavaScript function and return the data? If so how?

Very much appreciate any help!

Regards,

Andy
Sep 14 '08 #1
acoder
16,027 Recognized Expert Moderator MVP
For JavaScript, you would have to eval it or you could add it to the main web page using dynamic script tags.

Depending on browser/add-ons, you can see the generated source in some browsers.
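
As a rough illustration of the dynamic script tag approach, here is a minimal sketch. The function name `addScript` and the file name `tabs.js` are made up for the example, and the `onload` callback is assumed to fire the way it does in current browsers.

```javascript
// Sketch: load an external script by appending a <script> element.
// `doc` is the page's document; `src` is a hypothetical script URL.
function addScript(doc, src, onLoaded) {
  var script = doc.createElement("script");
  script.type = "text/javascript";
  script.src = src;
  if (onLoaded) {
    // Assumed to be called by the browser once the file has been run.
    script.onload = onLoaded;
  }
  // Appending the element is what makes the browser fetch and execute it.
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}

// In a page: addScript(document, "tabs.js", function () { /* script is ready */ });
```

Note that this only helps when your own code is already running inside a page in a browser; it does not make a plain HTTP request execute anything.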
Sep 14 '08 #2
vegetable21
15 New Member
Ok Thanks,

Can you provide an example of how i can evaluate it in my context?

I can view the source in my browser but that's not the problem, i'm missing some! The items displayed in the browser differ from what's in the source code.

I need the output of the JavaScript functions displayed in the view source.
Sep 15 '08 #3
rnd me
427 Recognized Expert Contributor
if you want to display content that is provided by running javascript, you will have to run the javascript.

this will prove difficult in C#.

you might be able to use a webbrowser control to do your scraping.

you might also be able to evaluate the code as jscript, but you wouldn't have any DOM methods in that case, and your dynamic content would probably not work.
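
to make the missing-DOM point concrete, here is a minimal sketch. it assumes a plain javascript engine standing in for jscript, and the page script, the "services" id, and the helper names are all hypothetical. the same script is run twice: once with no `document` at all, and once against a hand-written stand-in that supports nothing but `getElementById`.

```javascript
// Hypothetical page script: it writes markup into an element with id "services".
var pageScript =
  'document.getElementById("services").innerHTML = "<ul><li>Current Services</li></ul>";';

// Run script source with `document` bound to whatever we pass in.
function runScript(source, doc) {
  var run = new Function("document", source);
  run(doc);
}

// 1. No DOM at all: the script fails on its first DOM call.
var failure = null;
try {
  runScript(pageScript, undefined);
} catch (err) {
  failure = err.name; // a TypeError, because `document` is undefined
}

// 2. A hand-written stand-in that supports only getElementById.
function makeStubDocument(ids) {
  var elements = {};
  ids.forEach(function (id) {
    elements[id] = { id: id, innerHTML: "" };
  });
  return {
    getElementById: function (id) {
      return elements[id] || null;
    }
  };
}

var stub = makeStubDocument(["services"]);
runScript(pageScript, stub);
var generated = stub.getElementById("services").innerHTML;
```

the stub is enough for a one-line script, but a real page script would typically need far more of the DOM (createElement, events, styles) than this provides.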

in short, you really can't do this.

you may be able to engineer a workaround, but it would likely take a lot longer than implementing another route to your solution.
Sep 16 '08 #4
vegetable21
15 New Member
OK, thanks.

I'll try explaining what i'm trying to do and see if someone can suggest an alternate method!

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml">
      <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
        <title>Test</title>
        <script type="text/javascript">
        // Characters appended to the page one at a time once it has loaded.
        var word = ["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"];
        function initialize() {
            for (var i = 0; i < word.length; i++) {
                document.getElementById("message").innerHTML += word[i];
            }
        }
        </script>
      </head>

      <body onload="initialize()">
        <div id="message"></div>
      </body>
    </html>

Imagine the source code above saved in test.html.

Now if you view it in a browser you can clearly see "Hello World" on the page.
But if you click on view source then all you see is the above code.
Now here's my problem: if you searched the scraped source code of this page for "Hello World", it would not be found.
We can clearly see it on the web page, but not in the source code, as "Hello World" is only created at runtime.
How can i scrape "Hello World" from this page? (I need the dynamic output!)

[The above is just a simplification of what i'm trying to do, obviously i'm using regex to scrape data that changes all the time and has nothing to do with the above, but the principle is identical]

Hope this is clear enough for someone to suggest a solution, i don't mind if it uses a different approach / technology altogether!

Thanks

Andy
Sep 16 '08 #5
rnd me
427 Recognized Expert Contributor
if my proposals sound complicated, that's because they are.
screen-scraping has always been complex, hence the push toward rss.
dynamic content only serves to greatly complicate things further.

it sounds as though you want to access the data you can see in a browser.
it would only make sense then to use a browser in your solution.

unfortunately, browsers don't let javascript code from one site read pages from another.
if your solution is to be coded in javascript (the natural language of browsers), you will run into the "same-origin policy".

this means that you will not only need a browser, but a browser navigated to the site you want to scrape.

you would then run your scraping code in the context of that site.

how the heck can you run your code on someone else's site?

two suggestions:
1. greasemonkey, which is also capable of cross-domain transfers. you code a greasemonkey script file, and greasemonkey automatically injects the code into an associated webpage every time the page is visited. coupled with firebug, this makes a simple yet robust screen-scraping solution. main downside is that running your application means navigating to the page in firefox, rather than simply executing an exe.

2. MS's webbrowser control. using a browser as an object in another language like VB or C# allows injection of code into any particular site through a series of public API properties. the current source can be read and manipulated with this interface. the main advantage to this approach is that you can do the data processing in either javascript, or the host language of your project. the disadvantage is the complexity and rigidity of a hard-coded solution tailored to a single site/design.
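
for the first suggestion, a user script might look roughly like the sketch below. the `@include` URL, the `@namespace`, and the function name `scrapeMessage` are placeholders, and the `message` id is borrowed from the test page posted above. it waits for the page's `load` event because that is when the test page fills in its content.

```javascript
// ==UserScript==
// @name        Scrape generated message (sketch)
// @namespace   http://example.com/hypothetical
// @include     http://example.com/test.html
// ==/UserScript==

// Read the markup as it exists after the page's own scripts have run.
function scrapeMessage(doc) {
  var node = doc.getElementById("message");
  return node ? node.innerHTML : null;
}

// Only hook the page when running inside a browser.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", function () {
    // Defer one tick so the page's own onload handler has finished.
    setTimeout(function () {
      var text = scrapeMessage(document);
      // What happens next is up to you; logging is just a placeholder.
      console.log("scraped:", text);
    }, 0);
  }, false);
}
```

keeping the extraction in its own function that takes the document as an argument means the same logic can be tried against a fake document before it is pointed at the real site.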

there may be other solutions as well:

- john resig recently emulated a browser's DOM entirely in javascript running on rhino. you may be able to harness the updates using a model like his.

- you might be able to download and reverse-engineer the javascript files to grab the relevant data from the script block. this would hinge largely on the complexity of the update script.

- you could use ASP jscript and a msHTMLDocument object to meld the changes to the document and capture the result. this avoids the same origin problems that plague browser solutions.
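
as a sketch of the reverse-engineering idea, the tab labels passed to `initTabs` in the first post can be pulled straight out of the raw source without running anything. the sample string below is a shortened copy of that script block, and `extractTabLabels` is a made-up helper name. this only recovers what is literally written in the script block, so whether it is enough depends on where the page keeps the data you need.

```javascript
// Shortened sample of the kind of script block quoted in the first post.
var html =
  "<script type='text/javascript'>initTabs('dhtmlgoodies_tabView'," +
  "Array('Current Services','Services Overview'," +
  "'Notes<img src=\"temp_files/rosettes/notes_0.png\">','Photos')," +
  "0,'100%','');</script>";

function extractTabLabels(source) {
  // Capture everything between "Array(" and the next ")".
  // Assumes the labels are single-quoted and contain no quotes or ")" themselves.
  var call = /initTabs\(\s*'[^']*'\s*,\s*Array\(([^)]*)\)/.exec(source);
  if (!call) {
    return [];
  }
  var labels = [];
  var literal = /'([^']*)'/g;
  var match;
  while ((match = literal.exec(call[1])) !== null) {
    // Drop any inline markup such as the <img> badges.
    labels.push(match[1].replace(/<[^>]*>/g, ""));
  }
  return labels;
}

var labels = extractTabLabels(html);
// labels: ["Current Services", "Services Overview", "Notes", "Photos"]
```

the same pattern-matching could be done with a regex in the host language; the point is only that data embedded in a script call is still plain text in the response.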

in conclusion, it can be done, but it won't be easy. i like the greasemonkey scenario for its simplicity and available documentation. if you can live with running firefox to grab your updates, this is certainly the path of least resistance.
Sep 16 '08 #6
