capturing and using html from website

Larry Asher

Hi all. I'm a bit of a novice in this arena so please forgive if this
question reflects that. I am trying to grab the html from a website and
display it within another webpage (once I get this to work I am going to
manipulate the html in other ways - this isn't the end purpose of this
effort). To do this I am trying to open another window containing the
source html from a URL and then capture the html from that window. I
can open the window fine but get an "access denied" error when trying to
assign the html to a variable. The basic code follows. Basically any
way that I can assign the html that results from an entered URL to a
javascript variable or object that I can then manipulate should work for
me. Suggestions?
Thanks in advance
Larry

<html>
<form>
Paste URL here: <input name="url" value="http://www.yahoo.com">
<input type="button" onclick="grabHtml()" value="Go">
<input type="reset">
</form>
<p id="here"></p>
<script>
function grabHtml() {
  var url = document.forms[0].url.value;
  if (url == '') { return; }

  // Open a new window with the URL from the user.
  var window2 = window.open(url, "", "height=0,width=0");

  // Get the content of the new page.
  // NEXT IS THE LINE THAT GETS THE ACCESS DENIED ERROR.
  var t = window2.document.body.innerHTML;

  // Display the content in this page.
  document.getElementById("here").innerHTML = t;

  // Close the new page.
  window2.close();
}
</script>
</html>

Jul 23 '05 #1

David Dorward

Larry Asher wrote:

> Hi all. I'm a bit of a novice in this arena so please forgive if this
> question reflects that. I am trying to grab the html from a website and
> display it within another webpage

http://jibbering.com/faq/#FAQ4_19

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Jul 23 '05 #2

VK

It's called "content stealing", the oldest and the worst sin of WWW.
This group is not a new-thief info source. Try www.astalavista.com or
so.

Jul 23 '05 #3

Larry Asher

VK wrote:

> It's called "content stealing", the oldest and the worst sin of WWW.
> This group is not a new-thief info source. Try www.astalavista.com or
> so.

Thank you for your reply. However, just because "content stealing" is
one application of this doesn't mean it is the only one. If you are
interested we are applying AI algorithms and genetic search techniques
to analyze semantic representations and navigation paths of large
complex corporate intranets - WITH permission. The hard part (the AI
algorithms and such) is done. We just need to automate the process of
navigating through the links.
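
The link-navigation step can be pictured as a queue of pages still to visit plus a set of pages already seen. What follows is only a minimal sketch under assumptions: the name `createFrontier`, the `maxPages` cap, and the host `intranet.example.com` are hypothetical and do not come from this thread. It tracks which URLs to visit and stays on the starting host; the actual downloading is left to whatever tool ends up being used.

```javascript
// Hypothetical helper: keeps track of which URLs a crawl should visit next.
// Only URLs on the same host as startUrl are accepted, and each URL is
// handed out at most once.
function createFrontier(startUrl, maxPages) {
  const allowedHost = new URL(startUrl).host;
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  let handedOut = 0;

  return {
    // Returns the next URL to visit, or null when the crawl should stop.
    next() {
      if (queue.length === 0 || handedOut >= maxPages) {
        return null;
      }
      handedOut += 1;
      return queue.shift();
    },

    // Adds newly discovered absolute URLs, skipping duplicates,
    // malformed entries, and anything on a different host.
    add(urls) {
      for (const url of urls) {
        let host;
        try {
          host = new URL(url).host;
        } catch (e) {
          continue; // not a valid absolute URL
        }
        if (host === allowedHost && !seen.has(url)) {
          seen.add(url);
          queue.push(url);
        }
      }
    }
  };
}
```

A crawl loop would call `next()`, download that page, pass the links it finds to `add()`, and repeat until `next()` returns null.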

Jul 23 '05 #4

Larry Asher

David Dorward wrote:

> Larry Asher wrote:
>
> > Hi all. I'm a bit of a novice in this arena so please forgive if this
> > question reflects that. I am trying to grab the html from a website and
> > display it within another webpage
>
> http://jibbering.com/faq/#FAQ4_19

Thank you for the link. That is quite useful. Anyone know a way around
this?

Jul 23 '05 #5

David Dorward

Larry Asher wrote:

> > http://jibbering.com/faq/#FAQ4_19
>
> Thank you for the link. That is quite useful. Anyone know a way around
> this?

It tells you the ways around it at that link.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Jul 23 '05 #6

David Dorward

Larry Asher wrote:

> Thank you for your reply. However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission.

... and you're writing the application in JavaScript that runs in a
web browser? Blimey.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Jul 23 '05 #7

Larry Asher

David Dorward wrote:

> Larry Asher wrote:
>
> > Thank you for your reply. However, just because "content stealing" is
> > one application of this doesn't mean it is the only one. If you are
> > interested we are applying AI algorithms and genetic search techniques
> > to analyze semantic representations and navigation paths of large
> > complex corporate intranets - WITH permission.
>
> ... and you're writing the application in JavaScript that runs in a
> web browser? Blimey.

We are only using javascript to collect and archive the html
(alternative suggestions appreciated). The algorithms have been written
in C++. I'm a mathematician with some programming skills, not the other
way around. I am particularly inexperienced at web-based programming,
thus my question.

On the workaround, I somehow missed the link - mental fatigue no doubt.
Thanks again for that information.

Jul 23 '05 #8

VK

> However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission. The hard part (the AI
> algorithms and such) is done. We just need to automate the process of
> navigating through the links.

It's like "I need full local drive access w/o prompts. I don't do
anything malicious, I just want to provide more convenience to our
users" (this pops up from time to time in this group). Unfortunately, the
browser has its security policy and it doesn't accept any oaths, whether
verbal, written or signed in blood.

If the involved sites are *really* involved, they have to change their pages
accordingly (onload="report the content to the parent").

Or, if your project is as serious as stated, you can definitely
find an extra $199 for a code signing certificate and read what you want
from wherever you want (putting your good name on it).
<http://www.thawte.com/codesign/index.html>
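
The "report the content to the parent" idea can be sketched with `window.postMessage`, assuming browsers that support it (it is a later standard API, not something named in this thread). The function name `makeContentReceiver` and both example origins are hypothetical. The receiving side only accepts content from origins it has been told to trust, because any window can send a message.

```javascript
// Builds a "message" event handler that accepts page content only from
// origins the collecting page has explicitly been told to trust.
function makeContentReceiver(trustedOrigins, onContent) {
  return function handleMessage(event) {
    if (trustedOrigins.indexOf(event.origin) === -1) {
      return false; // ignore anything from an unexpected origin
    }
    onContent(event.origin, String(event.data));
    return true;
  };
}

// Browser wiring; skipped when the file is run outside a browser.
if (typeof window !== "undefined" && typeof window.addEventListener === "function") {
  window.addEventListener("message", makeContentReceiver(
    ["http://intranet.example.com"], // hypothetical cooperating site
    function (origin, html) {
      console.log("Received " + html.length + " characters from " + origin);
    }
  ));
}

// The cooperating page would opt in with something like this
// (the target origin is a hypothetical address of the collecting page):
//
//   <body onload="if (window.opener) {
//     window.opener.postMessage(document.documentElement.outerHTML,
//                               'http://crawler.example.com');
//   }">
```

This only works for pages whose owners add the reporting code, which is the point being made above: the browser will not hand over another site's content unless that site takes part.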

Jul 23 '05 #9

Jim Ley

On Sun, 17 Jul 2005 14:16:35 GMT, Larry Asher <la***@nowhere.com>
wrote:

> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

Just automate a browser in C++ or JavaScript or whatever. The
solutions are Zeepe or HTA type constructs to do it in script with IE,
or IWebBrowser2 automation in Windows C++. Or for Mozilla, automate
it using a Mozilla plugin. It's all simple; there are lots of ways of
collecting HTML for such purposes. Pure JavaScript is probably not
the best, unless you want a quick knock-up to automate sites in IE.

Jim.

Jul 23 '05 #10

David Dorward

Larry Asher wrote:

> > ... and you're writing the application in JavaScript that runs in a
> > web browser? Blimey.
>
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

http://search.cpan.org/~petdance/WWW...W/Mechanize.pm
or ... if you need to deal with JavaScript dependent pages:
http://search.cpan.org/~abeltje/Win3...E/Mechanize.pm

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Jul 23 '05 #11

J. B. Moreno

Larry Asher <la***@nowhere.com> wrote:

> David Dorward wrote:
> > Larry Asher wrote:
> >
> > > we are applying AI algorithms and genetic search techniques
> > > to analyze semantic representations and navigation paths of large
> > > complex corporate intranets - WITH permission.
> >
> > ... and you're writing the application in JavaScript that runs in a
> > web browser? Blimey.
>
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

Use a unix machine, a shell script, and curl?
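
The same idea as curl in a shell script can be sketched in JavaScript run outside the browser, where the browser's cross-domain restriction does not apply. This assumes Node.js 18 or later, which provides a global `fetch`; the function name `downloadHtml` and the optional `fetchImpl` parameter are hypothetical and exist only so the sketch can be tried without a network.

```javascript
// Downloads one page and resolves with its HTML as a string.
// fetchImpl is optional; when omitted, the global fetch (Node.js 18+) is used.
async function downloadHtml(url, fetchImpl) {
  const doFetch = fetchImpl || fetch;
  const response = await doFetch(url);
  if (!response.ok) {
    throw new Error("Request for " + url + " failed with status " + response.status);
  }
  return response.text();
}
```

Calling `downloadHtml("http://intranet.example.com/")` (a placeholder address) returns a promise for the page source, which could then be written to disk for the C++ analysis to pick up.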

--
J.B.Moreno

Jul 23 '05 #12

Larry Asher

I do realize the security is there for a reason. I guess I did not
anticipate that simply accessing the html for other than browser
rendering would be so restricted since this code is already fully
downloaded to a client and can be easily viewed/copied/etc manually. We
are not trying to access server side scripts or anything else that I
would normally consider a security risk. Basically all we are
attempting to do is create a webcrawler - no more malicious than google
- less so actually since we have permission to access the sites in
question (yes, I know I have offered no proof of this - you'll just have
to take my word for it...or not).

Thank you for the information on code signing. This research is being
conducted in a university setting so the $s are not free flowing but we
should be able to scrape up $199 so that is definitely an option.

VK wrote:

> > However, just because "content stealing" is
> > one application of this doesn't mean it is the only one. If you are
> > interested we are applying AI algorithms and genetic search techniques
> > to analyze semantic representations and navigation paths of large
> > complex corporate intranets - WITH permission. The hard part (the AI
> > algorithms and such) is done. We just need to automate the process of
> > navigating through the links.
>
> It's like "I need full local drive access w/o prompts. I don't do
> anything malicious, I just want to provide more convenience to our
> users" (this pops up from time to time in this group). Unfortunately, the
> browser has its security policy and it doesn't accept any oaths, whether
> verbal, written or signed in blood.
>
> If the involved sites are *really* involved, they have to change their pages
> accordingly (onload="report the content to the parent").
>
> Or, if your project is as serious as stated, you can definitely
> find an extra $199 for a code signing certificate and read what you want
> from wherever you want (putting your good name on it).
> <http://www.thawte.com/codesign/index.html>

Jul 23 '05 #13

David Dorward

Larry Asher wrote:

Please don't top post.

> I do realize the security is there for a reason. I guess I did not
> anticipate that simply accessing the html for other than browser
> rendering would be so restricted since this code is already fully
> downloaded to a client and can be easily viewed/copied/etc manually.

If I'm browsing my bank's website, then the code is already fully
downloaded to a client and can be easily viewed/copied/etc manually.

If a /script/ on another domain could do it, then it could copy my banking
records to some hidden form fields then submit the form - thus sending
those records to the author of the script.

> We are not trying to access server side scripts or anything else that I
> would normally consider a security risk.

An HTTP resource is an HTTP resource. The client has no idea how the server
decides what data to put in it.
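
The check behind the "access denied" error can be illustrated roughly as follows: a script may read another window's document only when both pages share the same scheme, host, and port. This is a sketch of that comparison as commonly specified, not of any particular browser's implementation; the function name `isSameOrigin` and the example addresses are hypothetical.

```javascript
// Returns true when two absolute URLs share scheme, host, and port,
// which is the comparison the same-origin rule is based on.
// The URL class normalizes default ports (80 for http, 443 for https)
// to an empty string, so "http://example.com:80/" matches "http://example.com/".
function isSameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  return a.protocol === b.protocol &&
         a.hostname === b.hostname &&
         a.port === b.port;
}
```

Under this rule a page served from one site and a window showing `http://www.yahoo.com` are different origins, so reading `window2.document` from the first is refused regardless of what the script intends to do with the content.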

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Jul 23 '05 #14

Lasse Reichstein Nielsen

J. B. Moreno <pl***@newsreaders.com> writes:

> Larry Asher <la***@nowhere.com> wrote:
> > We are only using javascript to collect and archive the html
> > (alternative suggestions appreciated).
>
> Use a unix machine, a shell script, and curl?

Or just use wget. It exists for Windows too:
<URL:http://www.interlog.com/~tcharron/wgetwin.html>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 23 '05 #15

Shaun

I've recently used Ruby to do something similar. I suggest Ruby (or
some other scripting language) above javascript because this really
isn't the best task for a browser -- however unintuitive that statement
may be.

Basically, have ruby download the page and then use a regex to parse
out the data (good old-fashioned screen scraping). It worked wonderfully
for what I needed.
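
The approach above was done in Ruby; the sketch below shows the same regex-based scraping idea in JavaScript instead, applied to pulling links out of a downloaded page. The function name `extractLinks` and the host `intranet.example.com` are hypothetical. A regular expression is a rough tool for HTML and will miss unusual markup, so treat this as a starting point rather than a robust parser.

```javascript
// Extracts the targets of <a href=...> tags from an HTML string and
// resolves them against baseUrl. Returns unique http/https URLs with
// any #fragment removed, in order of first appearance.
function extractLinks(html, baseUrl) {
  // Handles double-quoted, single-quoted, and unquoted href values.
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  const links = [];
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    const raw = match[1] !== undefined ? match[1]
              : match[2] !== undefined ? match[2]
              : match[3];
    let resolved;
    try {
      resolved = new URL(raw, baseUrl);
    } catch (e) {
      continue; // skip values that cannot be resolved to a URL
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      continue; // skip mailto:, javascript:, and similar
    }
    resolved.hash = ""; // the fragment does not identify a different page
    if (links.indexOf(resolved.href) === -1) {
      links.push(resolved.href);
    }
  }
  return links;
}
```

For example, calling it with a page downloaded from `http://intranet.example.com/docs/index.html` turns a relative link such as `b.html#top` into `http://intranet.example.com/docs/b.html`.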

Jul 23 '05 #16
