
capturing and using html from website

Larry Asher
Guest
 
Posts: n/a
#1: Jul 23 '05
Hi all. I'm a bit of a novice in this arena so please forgive if this
question reflects that. I am trying to grab the html from a website and
display it within another webpage (once I get this to work I am going to
manipulate the html in other ways - this isn't the end purpose of this
effort). To do this I am trying to open another window containing the
source html from a URL and then capture the html from that window. I
can open the window fine but get an "access denied" error when trying to
assign the html to a variable. The basic code follows. Basically any
way that I can assign the html that results from an entered URL to a
javascript variable or object that I can then manipulate should work for
me. Suggestions?
Thanks in advance
Larry



<html>
<form>
Paste URL here: <input name="url" value="http://www.yahoo.com">
<input type="button" onclick="grabHtml()" value="Go">
<input type="reset">
</form>
<p id="here"></p>
<script>
// Renamed from try(): "try" is a reserved word and cannot be a function name.
function grabHtml() {
  var url = document.forms[0].url.value;
  if (url == '') { return; }

  // Open a new window with the URL from the user.
  var window2 = window.open(url, "", "height=0,width=0");

  // Get the content of the new page.
  // NEXT IS THE LINE THAT GETS THE ACCESS DENIED ERROR.
  var t = window2.document.body.innerHTML;

  // Display the content in this page.
  document.getElementById("here").innerHTML = t;

  // Close the new page.
  window2.close();
}
</script>
</html>

David Dorward
Guest
 
Posts: n/a
#2: Jul 23 '05

re: capturing and using html from website


Larry Asher wrote:
> Hi all. I'm a bit of a novice in this arena so please forgive if this
> question reflects that. I am trying to grab the html from a website and
> display it within another webpage

http://jibbering.com/faq/#FAQ4_19

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
VK
Guest
 
Posts: n/a
#3: Jul 23 '05

re: capturing and using html from website


It's called "content stealing", the oldest and the worst sin of WWW.
This group is not a new-thief info source. Try www.astalavista.com or
so.

Larry Asher
Guest
 
Posts: n/a
#4: Jul 23 '05

re: capturing and using html from website


VK wrote:
> It's called "content stealing", the oldest and the worst sin of WWW.
> This group is not a new-thief info source. Try www.astalavista.com or
> so.

Thank you for your reply. However, just because "content stealing" is
one application of this doesn't mean it is the only one. If you are
interested we are applying AI algorithms and genetic search techniques
to analyze semantic representations and navigation paths of large
complex corporate intranets - WITH permission. The hard part (the AI
algorithms and such) is done. We just need to automate the process of
navigating through the links.
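
For what it's worth, the link-navigation step described above can be sketched as a breadth-first walk over the link graph. This is only an illustration under stated assumptions: `crawl`, `getLinks`, and `maxPages` are hypothetical names that do not come from the thread, and how `getLinks` actually obtains a page's links is deliberately left open.

```javascript
// Sketch only: breadth-first walk over a site's link graph.
// ASSUMPTION: `getLinks` is a hypothetical async function supplied by the
// caller. Given a URL, it resolves to an array of absolute URLs on that page.
async function crawl(startUrl, getLinks, maxPages) {
  const discovered = new Set([startUrl]); // every URL seen so far
  const queue = [startUrl];               // URLs waiting to be visited
  const edges = [];                       // [from, to] pairs: the navigation paths
  let visitedCount = 0;

  while (queue.length > 0 && visitedCount < maxPages) {
    const url = queue.shift();
    visitedCount += 1;
    const links = await getLinks(url);
    for (const link of links) {
      edges.push([url, link]);
      if (!discovered.has(link)) {
        discovered.add(link);
        queue.push(link);
      }
    }
  }
  return { discovered: Array.from(discovered), edges: edges };
}
```

The `maxPages` cap is there so a walk over a large intranet stops at some point; the `edges` array is the part that would feed a navigation-path analysis.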
Larry Asher
Guest
 
Posts: n/a
#5: Jul 23 '05

re: capturing and using html from website


David Dorward wrote:
> Larry Asher wrote:
>
>> Hi all. I'm a bit of a novice in this arena so please forgive if this
>> question reflects that. I am trying to grab the html from a website and
>> display it within another webpage
>
> http://jibbering.com/faq/#FAQ4_19

Thank you for the link. That is quite useful. Anyone know a way around
this?
David Dorward
Guest
 
Posts: n/a
#6: Jul 23 '05

re: capturing and using html from website


Larry Asher wrote:
>> http://jibbering.com/faq/#FAQ4_19

> Thank you for the link. That is quite useful. Anyone know a way around
> this?

It tells you the ways around it at that link

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
David Dorward
Guest
 
Posts: n/a
#7: Jul 23 '05

re: capturing and using html from website


Larry Asher wrote:
> Thank you for your reply. However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission.

... and you're writing the application in JavaScript that runs in a
web browser? Blimey.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Larry Asher
Guest
 
Posts: n/a
#8: Jul 23 '05

re: capturing and using html from website


David Dorward wrote:
> Larry Asher wrote:
>
>> Thank you for your reply. However, just because "content stealing" is
>> one application of this doesn't mean it is the only one. If you are
>> interested we are applying AI algorithms and genetic search techniques
>> to analyze semantic representations and navigation paths of large
>> complex corporate intranets - WITH permission.
>
> ... and you're writing the application in JavaScript that runs in a
> web browser? Blimey.

We are only using javascript to collect and archive the html
(alternative suggestions appreciated). The algorithms have been written
in C++. I'm a mathematician with some programming skills not the other
way around. I am particularly inexperienced at web based programming,
thus my question.

On the workaround, I somehow missed the link - mental fatigue no doubt.
Thanks again for that information.
VK
Guest
 
Posts: n/a
#9: Jul 23 '05

re: capturing and using html from website


> However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission. The hard part (the AI
> algorithms and such) is done. We just need to automate the process of
> navigating through the links.

It's like "I need full local drive access w/o prompts. I don't do
anything malicious, I just want to provide more convenience to our
users" (this pops up from time to time in this group). Unfortunately,
the browser has its security policy and it doesn't accept any oaths,
whether verbal, written or signed in blood.

If the sites involved are *really* cooperating, they have to change their
pages accordingly (onload="report the content to the parent").
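
A minimal sketch of that "report the content to the parent" idea, assuming a browser that supports `window.postMessage` (an API that is not mentioned in this thread). The function names, the `"page-html"` message type, and the example origins are all hypothetical, and both pages would have to include their half of the code.

```javascript
// Sketch only. ASSUMPTION: window.postMessage is available; all names and
// origins below are made up for illustration.

// Runs in the cooperating page, e.g. from its onload handler.
// `win` is that page's window; `collectorOrigin` is the only origin
// allowed to receive the HTML.
function reportContentToParent(win, collectorOrigin) {
  const html = win.document.documentElement.outerHTML;
  // A window created with window.open() reports to its opener;
  // a page loaded in a frame reports to its parent.
  const target = win.opener || win.parent;
  target.postMessage({ type: "page-html", html: html }, collectorOrigin);
}

// Runs in the collecting page. Returns a handler for the "message" event
// that accepts HTML only from the one origin it trusts.
function makeHtmlReceiver(trustedOrigin, onHtml) {
  return function (event) {
    if (event.origin !== trustedOrigin) { return; }                   // unknown sender
    if (!event.data || event.data.type !== "page-html") { return; }   // unrelated message
    onHtml(event.data.html);
  };
}

// Browser wiring (hypothetical origins), left as comments:
//   cooperating page: window.onload = function () {
//     reportContentToParent(window, "https://collector.example");
//   };
//   collecting page:  window.addEventListener("message",
//     makeHtmlReceiver("https://intranet.example", archiveHtml));
```

Passing `win` in as a parameter is just a design choice so the two halves can be exercised with stand-in objects outside a browser.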

Or, if your project is as serious as stated, you can surely find an
extra $199 for a code signing certificate and read what you want
from wherever you want (putting your good name on it).
<http://www.thawte.com/codesign/index.html>

Jim Ley
Guest
 
Posts: n/a
#10: Jul 23 '05

re: capturing and using html from website


On Sun, 17 Jul 2005 14:16:35 GMT, Larry Asher <larry@nowhere.com>
wrote:
>We are only using javascript to collect and archive the html
>(alternative suggestions appreciated).

Just automate a browser in C++, JavaScript, or whatever. The
options are Zeepe or HTA-type constructs to do it in script with IE,
or IWebBrowser2 automation in Windows C++. For Mozilla, automate
it using a Mozilla plugin. It's all simple; there are lots of ways of
collecting HTML for such purposes. Pure JavaScript is probably not
the best choice, unless you want a quick knock-up to automate sites in IE.

Jim.
David Dorward
Guest
 
Posts: n/a
#11: Jul 23 '05

re: capturing and using html from website


Larry Asher wrote:
>> ... and you're writing the application in JavaScript that runs in a
>> web browser? Blimey.

> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

http://search.cpan.org/~petdance/WWW...W/Mechanize.pm
or ... if you need to deal with JavaScript dependent pages:
http://search.cpan.org/~abeltje/Win3...E/Mechanize.pm

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
J. B. Moreno
Guest
 
Posts: n/a
#12: Jul 23 '05

re: capturing and using html from website


Larry Asher <larry@nowhere.com> wrote:
> David Dorward wrote:
>> Larry Asher wrote:
>>
>>> we are applying AI algorithms and genetic search techniques
>>> to analyze semantic representations and navigation paths of large
>>> complex corporate intranets - WITH permission.
>>
>> ... and you're writing the application in JavaScript that runs in a
>> web browser? Blimey.
>
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

Use a unix machine, a shell script, and curl?

--
J.B.Moreno
Larry Asher
Guest
 
Posts: n/a
#13: Jul 23 '05

re: capturing and using html from website


I do realize the security is there for a reason. I guess I did not
anticipate that simply accessing the html for other than browser
rendering would be so restricted since this code is already fully
downloaded to a client and can be easily viewed/copied/etc manually. We
are not trying to access server side scripts or anything else that I
would normally consider a security risk. Basically all we are
attempting to do is create a webcrawler - no more malicious than google
- less so actually since we have permission to access the sites in
question (yes, I know I have offered no proof of this - you'll just have
to take my word for it...or not).

Thank you for the information on code signing. This research is being
conducted in a university setting so the $s are not free flowing but we
should be able to scrape up $199 so that is definitely an option.


VK wrote:
>> However, just because "content stealing" is
>> one application of this doesn't mean it is the only one. If you are
>> interested we are applying AI algorithms and genetic search techniques
>> to analyze semantic representations and navigation paths of large
>> complex corporate intranets - WITH permission. The hard part (the AI
>> algorithms and such) is done. We just need to automate the process of
>> navigating through the links.
>
> It's like "I need full local drive access w/o prompts. I don't do
> anything malicious, I just want to provide more convenience to our
> users" (this pops up from time to time in this group). Unfortunately,
> the browser has its security policy and it doesn't accept any oaths,
> whether verbal, written or signed in blood.
>
> If the sites involved are *really* cooperating, they have to change their
> pages accordingly (onload="report the content to the parent").
>
> Or, if your project is as serious as stated, you can surely find an
> extra $199 for a code signing certificate and read what you want
> from wherever you want (putting your good name on it).
> <http://www.thawte.com/codesign/index.html>
David Dorward
Guest
 
Posts: n/a
#14: Jul 23 '05

re: capturing and using html from website


Larry Asher wrote:

Please don't top post.
> I do realize the security is there for a reason. I guess I did not
> anticipate that simply accessing the html for other than browser
> rendering would be so restricted since this code is already fully
> downloaded to a client and can be easily viewed/copied/etc manually.

If I'm browsing my bank's website, then the code is already fully
downloaded to a client and can be easily viewed/copied/etc manually.

If a /script/ on another domain could do it, then it could copy my banking
records to some hidden form fields then submit the form - thus sending
those records to the author of a script.
> We are not trying to access server side scripts or anything else that I
> would normally consider a security risk.

An HTTP resource is an HTTP resource. The client has no idea how the server
decides what data to put in it.
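
To make the restriction concrete, here is a rough sketch of the comparison a browser makes before letting a script on one page read another page's document. It is a simplification, not the browser's actual implementation, and `sameOrigin` is a hypothetical helper name: two URLs count as the same origin only when scheme, host, and port all match.

```javascript
// Sketch only: approximates the same-origin comparison.
// Real browsers apply further rules that are not modelled here.
function sameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  // Scheme, host, and port must all be identical.
  // (URL normalizes a default port such as :80 for http to "".)
  return a.protocol === b.protocol &&
         a.hostname === b.hostname &&
         a.port === b.port;
}
```

By this test, a page saved on your own machine or served from your own site never shares an origin with http://www.yahoo.com/, which is why reading `window2.document` in the original script is refused.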

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Lasse Reichstein Nielsen
Guest
 
Posts: n/a
#15: Jul 23 '05

re: capturing and using html from website


J. B. Moreno <planb@newsreaders.com> writes:
> Larry Asher <larry@nowhere.com> wrote:
>
>> We are only using javascript to collect and archive the html
>> (alternative suggestions appreciated).
>
> Use a unix machine, a shell script, and curl?

Or just use wget. It exists for Windows too:
<URL:http://www.interlog.com/~tcharron/wgetwin.html>

/L
--
Lasse Reichstein Nielsen - lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Shaun
Guest
 
Posts: n/a
#16: Jul 23 '05

re: capturing and using html from website


I've recently used Ruby to do something similar. I suggest Ruby (or
some other scripting language) above javascript because this really
isn't the best task for a browser -- however unintuitive that statement
may be.

Basically, have Ruby download the page and then use a regex to parse
out the data (good old-fashioned screen scraping). It worked wonderfully
for what I needed.
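
The same download-then-regex approach can be sketched in JavaScript run outside the browser, where the same-origin restriction does not apply. This is not the Ruby code referred to above. It assumes a Node.js version that provides a global `fetch`; `fetchHtml` and `extractLinks` are hypothetical names, and the regular expression is deliberately crude rather than a real HTML parser.

```javascript
// Sketch only. ASSUMPTION: run under Node.js with a global fetch().

// Download a page as text.
async function fetchHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("HTTP " + response.status + " for " + url);
  }
  return response.text();
}

// Pull href values out of <a> tags with a regex and resolve them against
// the URL of the page they came from. Crude: it can be fooled by unusual
// markup, which is the usual price of screen scraping.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    // Exactly one of the three groups is set: double-quoted,
    // single-quoted, or unquoted attribute value.
    const raw = match[1] ?? match[2] ?? match[3];
    try {
      links.push(new URL(raw, baseUrl).href);
    } catch (e) {
      // Skip hrefs that cannot be resolved to a URL.
    }
  }
  return links;
}
```

A typical use would be `fetchHtml(someUrl).then(html => extractLinks(html, someUrl))`, with the original HTML archived alongside the extracted links.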
