
capturing and using html from website

Hi all. I'm a bit of a novice in this arena so please forgive if this
question reflects that. I am trying to grab the html from a website and
display it within another webpage (once I get this to work I am going to
manipulate the html in other ways - this isn't the end purpose of this
effort). To do this I am trying to open another window containing the
source html from a URL and then capture the html from that window. I
can open the window fine but get an "access denied" error when trying to
assign the html to a variable. The basic code follows. Basically any
way that I can assign the html that results from an entered URL to a
javascript variable or object that I can then manipulate should work for
me. Suggestions?
Thanks in advance
Larry

<html>
<form>
Paste URL here: <input name="url" value="http://www.yahoo.com">
<input type="button" onclick="grabHtml()" value="Go">
<input type="reset">
</form>
<p id="here"></p>
<script>
function grabHtml() {
  var url = document.forms[0].url.value;
  if (url == '') { return; }

  // Open a new window with the URL from the user.
  var window2 = window.open(url, "", "height=0,width=0");

  // Get the content of the new page.
  // NEXT IS THE LINE THAT GETS THE ACCESS DENIED ERROR.
  var t = window2.document.body.innerHTML;

  // Display the content in this page.
  document.getElementById("here").innerHTML = t;

  // Close the new page.
  window2.close();
}
</script>
</html>
Jul 23 '05 #1
Larry Asher wrote:
> Hi all. I'm a bit of a novice in this arena so please forgive if this
> question reflects that. I am trying to grab the html from a website and
> display it within another webpage


http://jibbering.com/faq/#FAQ4_19

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #2
VK
It's called "content stealing", the oldest and the worst sin of WWW.
This group is not a new-thief info source. Try www.astalavista.com or
so.

Jul 23 '05 #3
VK wrote:
> It's called "content stealing", the oldest and the worst sin of WWW.
> This group is not a new-thief info source. Try www.astalavista.com or
> so.


Thank you for your reply. However, just because "content stealing" is
one application of this doesn't mean it is the only one. If you are
interested we are applying AI algorithms and genetic search techniques
to analyze semantic representations and navigation paths of large
complex corporate intranets - WITH permission. The hard part (the AI
algorithms and such) is done. We just need to automate the process of
navigating through the links.
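
As a rough sketch of what "navigating through the links" involves, independent of how the HTML is fetched: a crawler keeps a queue of pages still to visit and a set of pages already seen. The names `crawl` and `getLinks`, the page limit, and the in-memory sample site are all assumptions made for illustration; none of this comes from the project described above.

```javascript
// Breadth-first traversal of a link graph, restricted to the starting host.
// getLinks(url) is a hypothetical callback that returns the absolute URLs
// linked from a page; how it obtains them is deliberately left open.
function crawl(startUrl, getLinks, maxPages = 100) {
  const host = new URL(startUrl).host;
  const seen = new Set([startUrl]);
  const queue = [startUrl];
  const visited = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift();
    visited.push(url);
    for (const link of getLinks(url)) {
      // Stay on the starting host and never queue a page twice.
      if (!seen.has(link) && new URL(link).host === host) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}

// Usage with an in-memory stand-in for a small intranet (made-up URLs).
const sampleSite = {
  "http://intranet.example/": ["http://intranet.example/a", "http://intranet.example/b"],
  "http://intranet.example/a": ["http://intranet.example/b", "http://other.example/x"],
  "http://intranet.example/b": ["http://intranet.example/"],
};
console.log(crawl("http://intranet.example/", (url) => sampleSite[url] || []));
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and `maxPages` is a safety cap for large sites.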
Jul 23 '05 #4
David Dorward wrote:
> Larry Asher wrote:
>
>> Hi all. I'm a bit of a novice in this arena so please forgive if this
>> question reflects that. I am trying to grab the html from a website and
>> display it within another webpage
>
> http://jibbering.com/faq/#FAQ4_19


Thank you for the link. That is quite useful. Anyone know a way around
this?
Jul 23 '05 #5
Larry Asher wrote:
>> http://jibbering.com/faq/#FAQ4_19
> Thank you for the link. That is quite useful. Anyone know a way around
> this?

It tells you the ways around it at that link.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #6
Larry Asher wrote:
> Thank you for your reply. However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission.

... and you're writing the application in JavaScript that runs in a
web browser? Blimey.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #7
David Dorward wrote:
> Larry Asher wrote:
>
>> Thank you for your reply. However, just because "content stealing" is
>> one application of this doesn't mean it is the only one. If you are
>> interested we are applying AI algorithms and genetic search techniques
>> to analyze semantic representations and navigation paths of large
>> complex corporate intranets - WITH permission.
>
> ... and you're writing the application in JavaScript that runs in a
> web browser? Blimey.

We are only using JavaScript to collect and archive the html
(alternative suggestions appreciated). The algorithms have been written
in C++. I'm a mathematician with some programming skills, not the other
way around. I am particularly inexperienced at web-based programming,
thus my question.

On the workaround, I somehow missed the link - mental fatigue no doubt.
Thanks again for that information.
Jul 23 '05 #8
VK
> However, just because "content stealing" is
> one application of this doesn't mean it is the only one. If you are
> interested we are applying AI algorithms and genetic search techniques
> to analyze semantic representations and navigation paths of large
> complex corporate intranets - WITH permission. The hard part (the AI
> algorithms and such) is done. We just need to automate the process of
> navigating through the links.

It's like "I need full local drive access w/o prompts. I'm not doing
anything malicious, I just want to provide more convenience to our
users" (this pops up from time to time in this group). Unfortunately the
browser has its security policies, and it doesn't accept any oaths,
whether verbal, written or signed in blood.

If the sites involved are *really* involved, they have to change their
pages accordingly (onload="report the content to the parent").

Or, if your project is as serious as stated, you can definitely find an
extra $199 for a code signing certificate and read what you want from
wherever you want (putting your good name on it).
<http://www.thawte.com/codesign/index.html>
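
One way the "report the content to the parent" idea can be written today is with `window.postMessage`. That API reached browsers after this thread was written, so it is a substitute for whatever mechanism was meant here, not the poster's method. In this sketch the function names, the `"page-html"` message type, and the example origins are assumptions. The cooperating page would call `reportContentToOpener` from its `onload`; the collecting page would register the handler returned by `makeContentReceiver`.

```javascript
// Runs in the cooperating page (for example from <body onload="...">).
// Sends the page's own HTML to the window that opened it.
// collectorOrigin is the only origin allowed to receive the message.
function reportContentToOpener(win, collectorOrigin) {
  if (!win.opener) {
    return false; // this page was not opened by a collector window
  }
  win.opener.postMessage(
    {
      type: "page-html", // hypothetical message type
      url: win.location.href,
      html: win.document.documentElement.outerHTML,
    },
    collectorOrigin
  );
  return true;
}

// Runs in the collecting page. Returns a "message" event handler that
// accepts reports only from origins on the allow list.
function makeContentReceiver(allowedOrigins, onContent) {
  return function (event) {
    if (!allowedOrigins.includes(event.origin)) {
      return; // ignore messages from pages that were not expected
    }
    if (event.data && event.data.type === "page-html") {
      onContent(event.data.url, event.data.html);
    }
  };
}

// In the collecting page (browser only):
// window.addEventListener("message",
//   makeContentReceiver(["http://intranet.example"], function (url, html) {
//     /* archive html here */
//   }));
```

Both sides name the other's origin explicitly, so the content goes only to the expected collector and the collector ignores pages it did not ask for.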

Jul 23 '05 #9
On Sun, 17 Jul 2005 14:16:35 GMT, Larry Asher <la***@nowhere.com>
wrote:
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).

Just automate a browser in C++ or JavaScript or whatever. The
solutions are Zeepe or HTA-type constructs to do it in script with IE,
or IWebBrowser2 automation in Windows C++. For Mozilla, automate
it using a Mozilla plugin. It's all simple; there are lots of ways of
collecting HTML for such purposes. Pure JavaScript is probably not
the best, unless you want a quick knock-up to automate sites in IE.

Jim.
Jul 23 '05 #10
Larry Asher wrote:
>> ... and you're writing the application in JavaScript that runs in a
>> web browser? Blimey.
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).


http://search.cpan.org/~petdance/WWW...W/Mechanize.pm
or ... if you need to deal with JavaScript dependent pages:
http://search.cpan.org/~abeltje/Win3...E/Mechanize.pm

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #11
Larry Asher <la***@nowhere.com> wrote:
> David Dorward wrote:
>> Larry Asher wrote:
>>
>>> we are applying AI algorithms and genetic search techniques
>>> to analyze semantic representations and navigation paths of large
>>> complex corporate intranets - WITH permission.
>>
>> ... and you're writing the application in JavaScript that runs in a
>> web browser? Blimey.
>
> We are only using javascript to collect and archive the html
> (alternative suggestions appreciated).


Use a unix machine, a shell script, and curl?
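
The same job can also be done in JavaScript, provided the script runs outside the browser, where the same-origin restriction does not apply. A minimal sketch, assuming Node.js 18 or later (which provides a global `fetch`); `fetchHtml` and `fileNameFor` are hypothetical helper names, and the file-naming scheme is just one arbitrary choice.

```javascript
// Turn a URL into a file name that is safe to use on disk.
function fileNameFor(url) {
  const u = new URL(url);
  const raw = u.hostname + u.pathname + u.search;
  return raw.replace(/[^A-Za-z0-9.-]+/g, "_").replace(/^_+|_+$/g, "") + ".html";
}

// Download one page and resolve to its HTML as a string.
// fetchImpl is injectable so the function can be tried without a network.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error("HTTP " + response.status + " for " + url);
  }
  return response.text();
}

// Example of archiving one page (not run here because it needs a network):
// const fs = require("fs");
// fetchHtml("http://www.yahoo.com")
//   .then((html) => fs.writeFileSync(fileNameFor("http://www.yahoo.com"), html));
```

Checking `response.ok` matters for an archive: without it, error pages would be saved as if they were real content.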

--
J.B.Moreno
Jul 23 '05 #12
I do realize the security is there for a reason. I guess I did not
anticipate that simply accessing the html for other than browser
rendering would be so restricted since this code is already fully
downloaded to a client and can be easily viewed/copied/etc manually. We
are not trying to access server side scripts or anything else that I
would normally consider a security risk. Basically all we are
attempting to do is create a webcrawler - no more malicious than google
- less so actually since we have permission to access the sites in
question (yes, I know I have offered no proof of this - you'll just have
to take my word for it...or not).

Thank you for the information on code signing. This research is being
conducted in a university setting so the $s are not free flowing but we
should be able to scrape up $199 so that is definitely an option.
VK wrote:
>> However, just because "content stealing" is
>> one application of this doesn't mean it is the only one. If you are
>> interested we are applying AI algorithms and genetic search techniques
>> to analyze semantic representations and navigation paths of large
>> complex corporate intranets - WITH permission. The hard part (the AI
>> algorithms and such) is done. We just need to automate the process of
>> navigating through the links.
>
> It's like "I need full local drive access w/o prompts. I'm not doing
> anything malicious, I just want to provide more convenience to our
> users" (this pops up from time to time in this group). Unfortunately the
> browser has its security policies, and it doesn't accept any oaths,
> whether verbal, written or signed in blood.
>
> If the sites involved are *really* involved, they have to change their
> pages accordingly (onload="report the content to the parent").
>
> Or, if your project is as serious as stated, you can definitely find an
> extra $199 for a code signing certificate and read what you want from
> wherever you want (putting your good name on it).
> <http://www.thawte.com/codesign/index.html>

Jul 23 '05 #13
Larry Asher wrote:

Please don't top post.

> I do realize the security is there for a reason. I guess I did not
> anticipate that simply accessing the html for other than browser
> rendering would be so restricted since this code is already fully
> downloaded to a client and can be easily viewed/copied/etc manually.

If I'm browsing my bank's website, then the code is already fully
downloaded to a client and can be easily viewed/copied/etc manually.

If a /script/ on another domain could do it, then it could copy my banking
records to some hidden form fields then submit the form - thus sending
those records to the author of the script.

> We are not trying to access server side scripts or anything else that I
> would normally consider a security risk.


An HTTP resource is an HTTP resource. The client has no idea how the server
decides what data to put in it.
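
The rule behind the "access denied" error compares the origin of the page running the script with the origin of the document it tries to read. A minimal sketch of that comparison, using the standard `URL` class; the function name `sameOrigin` and the sample URLs are made up for illustration, and real browsers apply further rules that are left out here.

```javascript
// Two URLs share an origin when scheme, host, and port all match.
function sameOrigin(a, b) {
  const ua = new URL(a);
  const ub = new URL(b);
  return ua.protocol === ub.protocol &&
         ua.hostname === ub.hostname &&
         ua.port === ub.port;
}

console.log(sameOrigin("http://example.com/page.html", "http://example.com/other/")); // true
console.log(sameOrigin("http://example.com/", "http://www.yahoo.com/"));              // false: host differs
console.log(sameOrigin("http://example.com/", "https://example.com/"));               // false: scheme differs
console.log(sameOrigin("http://example.com/", "http://example.com:8080/"));           // false: port differs
```

The path plays no part in the comparison, which is why the check cannot tell a harmless public page from a bank statement served by the same host.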

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #14
J. B. Moreno <pl***@newsreaders.com> writes:
> Larry Asher <la***@nowhere.com> wrote:
>> We are only using javascript to collect and archive the html
>> (alternative suggestions appreciated).
>
> Use a unix machine, a shell script, and curl?


Or just use wget. It exists for Windows too:
<URL:http://www.interlog.com/~tcharron/wgetwin.html>

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 23 '05 #15
I've recently used Ruby to do something similar. I suggest Ruby (or
some other scripting language) above javascript because this really
isn't the best task for a browser -- however unintuitive that statement
may be.

Basically, have Ruby download the page and then use a regex to parse
out the data (good old-fashioned screen scraping). It worked wonderfully
for what I needed.
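
The same regex approach, sketched in JavaScript rather than Ruby and applied to the link-following task from this thread. The function name `extractLinks` and the pattern itself are assumptions for illustration, not the code described above. A regex is not a real HTML parser, so unusual markup (links inside comments, attributes such as `data-href`) can fool it.

```javascript
// Pull href values out of <a> tags with a regular expression and resolve
// them against the URL of the page they came from.
function extractLinks(html, baseUrl) {
  // Matches href="...", href='...', or an unquoted href value.
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  const links = new Set();
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    const raw = match[1] ?? match[2] ?? match[3];
    try {
      const resolved = new URL(raw, baseUrl);
      // Keep only web pages; skip mailto:, javascript:, and similar schemes.
      if (resolved.protocol === "http:" || resolved.protocol === "https:") {
        resolved.hash = ""; // fragments point into the same page
        links.add(resolved.href);
      }
    } catch {
      // Not a resolvable URL: skip it.
    }
  }
  return [...links];
}

// Usage with a made-up page:
const samplePage = '<a href="/about">About</a> <a href="news.html#top">News</a>';
console.log(extractLinks(samplePage, "http://intranet.example/home/index.html"));
```

Resolving each link against the page URL and dropping duplicates gives a list that can be fed straight back into the next round of downloads.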

Jul 23 '05 #16
