Hi All,
(I originally posted this in the .NET Programming Languages, but i realised the fact that i'm using C# is irrelevant as i think the problem i'm having is with DHTML / JavaScript.)
So here goes again:
I'm writing an app in C# that will be doing a bit of web scraping. I've got a fair bit of experience with this but i've come across an issue with the returned HTML i'm getting from the pages.
When i view the page i'm trying to scrape in a browser all is well, and i can see a 'Services' section at the bottom of the page. However, when i do a view source on the page, the corresponding code for that part of the page is missing. I've realised that the reason may be that on the page there is a JavaScript function call where the 'Services' section should be. Here is the code:
<script type='text/javascript'>initTabs('dhtmlgoodies_tabView',Array('Current Services','Services Overview','Notes<img src="temp_files/rosettes/notes_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Photos','Attachments','Services Tree','Services Detail','Organisation Tree','Sales Tasks','Incidents<img src="temp_files/rosettes/openincidents_0.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Orders<img src="temp_files/rosettes/openorders_1.png" style="vertical-align: text-bottom;margin-top:4px;padding-top:0;margin-left:5px;float:none">','Ceases','POs'),0,'100%','');</script>
I've found the corresponding function in an included .js file and it comes from www.dhtmlgoodies.com.
So my problem is this, when viewing in a browser (and doing a view source) i can't see the actual HTML that creates this 'Services' section and thus when i do my httpWebRequest i don't get the HTML either.
In summary, how can i get ALL the source information from a WebRequest? Do i somehow get the WebRequest to run the JavaScript function and return the data? If so how?
Very much appreciate any help!
Regards,
Andy
acoder 16,027
Recognized Expert Moderator MVP
For JavaScript, you would have to eval it or you could add it to the main web page using dynamic script tags.
Depending on browser/add-ons, you can see the generated source in some browsers.
Ok Thanks,
Can you provide an example of how i can evaluate it in my context?
I can view the source in my browser but that's not the problem, i'm missing some! The items displayed in the browser differ to what's in the source code.
I need the output of the JavaScript functions displayed in the view source.
rnd me 427
Recognized Expert Contributor
if you want to display content that is provided by running javascript, you will have to run the javascript.
this will prove difficult in C#.
you might be able to use a webbrowser control to do your scraping.
you might also be able to evaluate the code as jscript, but you wouldn't have any DOM methods in that case, and your dynamic content would probably not work.
in short, you really can't do this.
you may be able to engineer a workaround, but it would likely take a lot longer than implementing another route to your solution.
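a rough sketch of the missing-DOM problem, assuming a modern JavaScript runtime such as Node.js rather than jscript: a page script evaluated outside a browser fails as soon as it touches `document`, unless you hand it a stand-in. the `pageScript` string, the `services` id and the `createFakeDocument` helper are all made up for illustration, and the stand-in only supports `getElementById` and `innerHTML`, which is nowhere near a real DOM.

```javascript
// Hypothetical page script: it writes its output into the DOM at runtime.
const pageScript = `
  function render() {
    document.getElementById("services").innerHTML = "<b>Current Services</b>";
  }
`;

// Minimal stand-in for the browser's document object (an assumption for
// illustration; it only supports getElementById and innerHTML).
function createFakeDocument() {
  const elements = {};
  return {
    getElementById(id) {
      if (!elements[id]) {
        elements[id] = { innerHTML: "" };
      }
      return elements[id];
    },
  };
}

// Evaluate the page script with `document` bound to whatever we pass in,
// call its render() function, and return the markup it generated.
function runPageScript(source, doc) {
  const load = new Function("document", source + "\nreturn render;");
  const render = load(doc);
  render();
  return doc.getElementById("services").innerHTML;
}

// Without a document the script throws as soon as it touches the DOM.
let errorWithoutDom = null;
try {
  runPageScript(pageScript, undefined);
} catch (err) {
  errorWithoutDom = err;
}

// With the stand-in, the generated markup can be read back afterwards.
const generated = runPageScript(pageScript, createFakeDocument());
console.log(errorWithoutDom instanceof TypeError);
console.log(generated);
```

a real page will call far more of the DOM than this, which is why the stand-in approach gets expensive quickly.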
OK, thanks.
I'll try explaining what i'm trying to do and see if someone can suggest an alternate method!
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title>Test</title>
<script type="text/javascript">
var word = ["H", "e", "l", "l", "o", "W", "o", "r", "l", "d"];
function initialize() {
    for (i = 0; i < word.length; i++) {
        document.getElementById("message").innerHTML += word[i];
    }
}
</script>
</head>
<body onload="initialize()" >
<div id="message"></div>
</body>
</html>
Imagine the source code above saved in test.html.
Now if you view it in a browser you can clearly see "Hello World" on the page.
But if you click on view source then all you see is the above code.
Now here's my problem: if you did a search of the scraped source code from this website for "Hello World" it would not be found.
We can clearly see it on the web page, but not the source code as "Hello World" is only created at runtime.
How can i scrape "Hello World" from this website? (I need the dynamic output!)
[The above is just a simplification of what i'm trying to do, obviously i'm using regex to scrape data that changes all the time and has nothing to do with the above, but the principle is identical]
Hope this is clear enough for someone to suggest a solution, i don't mind if it uses a different approach / technology altogether!
Thanks
Andy
rnd me 427
Recognized Expert Contributor
if my proposals sound complicated, that's because they are.
screen-scraping has always been complex, hence the push toward rss.
dynamic content only serves to greatly complicate things further.
it sounds as though you want to access the data you can see in a browser.
it would only make sense then to use a browser in your solution.
unfortunately, browsers don't let javascript code from one site read pages from another.
if your solution is to be coded in javascript (the natural language of browsers), you will run into the "same-origin policy".
this means that you will not only need a browser, but a browser navigated to the site you want to scrape.
you would then run your scraping code in the context of that site.
how the heck can you run your code on someone else's site?
two suggestions:
1. greasemonkey, which is also capable of cross-domain transfers. you code a greasemonkey script file, and greasemonkey automatically injects the code into an associated webpage every time the page is visited. coupled with firebug, this makes a simple yet robust screen-scraping solution. main downside is that running your application means navigating to the page in firefox, rather than simply executing an exe.
2. MS's webbrowser control. using a browser as an object in another language like VB or C# allows injection of code into any particular site through a series of public API properties. the current source can be read and manipulated with this interface. the main advantage to this approach is that you can do the data processing in either javascript, or the host language of your project. the disadvantage is the complexity and rigidness of a hard-coded solution tailored to a single site/design.
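as a rough illustration of option 1, a greasemonkey-style user script for the test.html example from earlier in the thread might look something like this. the `@include` URL is a placeholder, the `message` id comes from that example, `readRenderedHtml` is a made-up helper, and the script only logs what it finds; getting the data back into your own application is left out.

```javascript
// ==UserScript==
// @name        Scrape rendered message (sketch)
// @include     http://example.com/test.html
// ==/UserScript==

// Read the markup an element holds *after* the page's own scripts have run.
// `doc` is the page's document; `id` is the element to read.
function readRenderedHtml(doc, id) {
  const element = doc.getElementById(id);
  return element ? element.innerHTML : null;
}

// Only hook into the page when actually running inside a browser.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  // Wait for the load event so the page's onload handler has had a chance
  // to fill in the dynamic content first.
  window.addEventListener("load", function () {
    console.log(readRenderedHtml(document, "message"));
  });
}
```

the point is that the script reads the live document rather than the downloaded source, so it sees whatever the page's javascript has generated by that time.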
there may be other solutions as well:
- john resig recently emulated a browser's DOM entirely in javascript running on rhino. you may be able to harness the updates using a model like his.
- you might be able to download and reverse-engineer the javascript files to grab the apropos data from the script block. this would hinge largely on the complexity of the update script.
- you could use ASP jscript and a msHTMLDocument object to meld the changes to the document and capture the result. this avoids the same origin problems that plague browser solutions.
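for the reverse-engineering idea, here's a sketch against the test.html example from earlier in the thread: instead of running the script, pull the `word` array literal out of the downloaded source and rebuild the text yourself. it assumes the data sits in the script block as a simple array literal of double-quoted strings, which holds for that example but may well not hold for a real page. `extractWordArray` is a made-up name.

```javascript
// The relevant part of the downloaded page source (from the test.html example).
const pageSource = `
<script type="text/javascript">
var word = ["H", "e", "l", "l", "o", "W", "o", "r", "l", "d"];
function initialize() { /* ... */ }
</script>
`;

// Find `var word = [ ... ];` and parse the array literal. JSON.parse only
// works here because the literal happens to be valid JSON (double-quoted
// strings, no trailing comma); anything fancier needs a real JS parser.
function extractWordArray(source) {
  const match = source.match(/var\s+word\s*=\s*(\[[^\]]*\])\s*;/);
  if (!match) {
    return null;
  }
  return JSON.parse(match[1]);
}

const letters = extractWordArray(pageSource);
// Mimic what initialize() does: append each entry in order. The array has
// no space entry, so the rebuilt text has none either.
const message = letters ? letters.join("") : null;
console.log(message);
```

this only pays off when the update script is simple enough to imitate by hand; the more logic sits between the raw data and what ends up on the page, the less attractive it gets.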
in conclusion, it can be done, but it won't be easy. i like the greasemonkey scenario for its simplicity and available documentation. if you can live with running firefox to grab your updates, this is certainly the path of least resistance.