Bytes IT Community

web spider using mozilla and javascript

I am trying to create a simple custom web spider using mozilla and
javascript, the basic functionality is to open a website and then
manipulate it using DOM (possibly opening links etc.).

it seems like a pretty easy task, the only problem is that I
can't figure out how to wait for the website to load - I use mozilla
security settings to be able to open the window and manipulate it:

  var w = window.open("http://google.com", "google window");
  var treeWalker = w.document.createTreeWalker(
    w.document, NodeFilter.SHOW_ALL, null, false
  );
  // now use treeWalker

the problem is that window.open returns immediately after the window
is opened, without waiting for the page to load, and I can't figure out
how to wait for the site to load (as far as I can tell IE has readyState
for this).

any ideas? (searched the web, found the same questions but no answers) it's OK
if it's mozilla/firefox specific and if it needs funky security settings,
it's for my own use so I can set up mozilla in any way necessary (i.e.
no need for portability)

would it be better to use XUL for a project like this?

TIA,

erik
Aug 24 '05 #1
6 Replies


Shouldn't you use xmlhttp instead? If you want something that parses
results from Google, Google has a search API, and examples of making
requests using xmlhttp, IIRC.
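
To make the xmlhttp suggestion concrete, here is a minimal sketch of fetching a page's source and being called back once all of it has arrived. The names `fetchPage` and `XhrCtor` are made up for illustration; in a browser the constructor would just be the built-in XMLHttpRequest. It assumes the script is allowed to request the target URL at all, which for another site depends on the security settings mentioned in the question.

```javascript
// Sketch only: fetch the source of `url` and call `onLoaded(err, text)` once
// the whole response has arrived. `fetchPage` and `XhrCtor` are hypothetical
// names; XhrCtor exists so a stand-in constructor can be supplied.
function fetchPage(url, onLoaded, XhrCtor) {
  var Ctor = XhrCtor || XMLHttpRequest; // browser default
  var xhr = new Ctor();
  xhr.open("GET", url, true); // true = asynchronous request
  xhr.onreadystatechange = function () {
    // readyState 4 means the complete response has been received.
    if (xhr.readyState !== 4) {
      return;
    }
    if (xhr.status === 200) {
      onLoaded(null, xhr.responseText);
    } else {
      onLoaded(new Error("HTTP status " + xhr.status), null);
    }
  };
  xhr.send(null);
  return xhr;
}
```

The callback receives the page source as text, so nothing on the page runs; any DOM work has to happen on whatever the caller builds from that text.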

Aug 24 '05 #2


Erik Steffl wrote:

it seems like a pretty easy task, the only problem is that I
can't figure out how to wait for the website to load - I use mozilla
security settings to be able to open the window and manipulate it:

  var w = window.open("http://google.com", "google window");
  var treeWalker = w.document.createTreeWalker(
    w.document, NodeFilter.SHOW_ALL, null, false
  );
  // now use treeWalker

the problem is that the window.open returns immediately after window
is open, does not wait for page to be loaded and I can't figure out how
to wait for the site to load (as far as I can tell IE has readyState for
this).

You can try to attach an event listener to the returned window, e.g.

  var w = window.open('url');
  w.addEventListener(
    'load',
    function (evt) {
      // handle load event here e.g. access w.document
    },
    true
  );
Of course it is a bit of a gamble: in the above sequence the document in
the popup could finish loading before the script in the other window
manages to add its event listener.
As far as I have tested with Mozilla 1.7, the event listener is fired
when a page from e.g. http://localhost/ loads another page from
localhost, while attempts to load a page from a different server did not
fire the event listener, although the JavaScript console shows no
security error or warning. I am not sure whether the event listener is
not fired because Mozilla treats that attempt as a security problem or
because Mozilla has a bug there.
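
One way to hedge against the race described above is to check whether the document has already finished loading before attaching the listener. The sketch below assumes `document.readyState` is available on the target window, which holds for current browsers but was not the case in the Mozilla versions discussed in this thread. `isFullyLoaded` and `whenLoaded` are made-up helper names, and the script is assumed to be permitted to read `win.document` and `win.location`.

```javascript
// Sketch only. A freshly opened window may briefly hold an empty about:blank
// document that already reports "complete", so that case is treated as
// "not loaded yet".
function isFullyLoaded(win) {
  return win.location.href !== "about:blank" &&
    win.document.readyState === "complete";
}

// Run `callback(win)` exactly once: immediately if the page is already
// loaded, otherwise when the window's load event fires.
function whenLoaded(win, callback) {
  if (isFullyLoaded(win)) {
    callback(win);
    return;
  }
  function onLoad() {
    win.removeEventListener("load", onLoad, false);
    callback(win);
  }
  win.addEventListener("load", onLoad, false);
}
```

Usage would be along the lines of `whenLoaded(window.open('url'), function (w) { /* access w.document */ });`. This does not address the cross-server case above, where the load listener did not fire at all.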

Do you need to use a popup window? Perhaps an iframe to load the page
you are interested in is enough?
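
As a rough illustration of the iframe idea, the sketch below creates a hidden iframe, attaches the load listener before the URL is set (so the event cannot be missed), and hands the frame's document to a callback. `loadInFrame` is a made-up name and `doc` stands for the host page's `document`. For a page from another server, `contentDocument` is normally not accessible unless the browser's security settings have been relaxed as described in the question.

```javascript
// Sketch only: load `url` in a hidden iframe and call
// `onLoad(frameDocument, frame)` when the frame's load event fires.
function loadInFrame(doc, url, onLoad) {
  var frame = doc.createElement("iframe");
  frame.style.display = "none";
  // Listener first, src second: the load event cannot slip past us.
  frame.addEventListener("load", function () {
    // contentDocument may be null if access to the frame is not permitted.
    onLoad(frame.contentDocument, frame);
  }, false);
  frame.src = url;
  doc.body.appendChild(frame);
  return frame;
}
```

In a browser this would be called as `loadInFrame(document, 'url', function (frameDoc) { /* walk frameDoc */ });`.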
Erik Steffl wrote:
would it be better to use XUL for a project like this?


XUL is just another markup language, so as long as you simply load
a XUL document from an HTTP server or from the local file system, your
script has no additional rights. But if you write a Mozilla extension and
install it into chrome, then your script has full access to the XPCOM
components the browser exposes to JavaScript, and in the end you will
probably find more reliable ways to control/monitor the opening of
windows and access their contents. Of course it is a steep learning
curve: the scripting language is still JavaScript, but the objects you
need to deal with are mostly new.
Check out http://www.xulplanet.com/ for introductions and reference
documentation. As with HTML and JavaScript where people often learn by
example there are then lots of Mozilla and Firefox extensions around to
learn from.
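
One of the XPCOM interfaces that privileged chrome code can use to monitor loading is nsIWebProgressListener, whose `onStateChange` method receives state flags as a page loads. The sketch below shows only the flag test. `makeLoadListener` is a made-up helper, and `flags` is assumed to be the object carrying the `STATE_*` constants (in an extension that would be `Components.interfaces.nsIWebProgressListener`). Registering the listener with the browser and the `QueryInterface` method a real listener needs are deliberately left out.

```javascript
// Sketch only: build a listener object whose onStateChange calls
// `onDone(domWindow)` when a window as a whole has stopped loading.
function makeLoadListener(flags, onDone) {
  var wanted = flags.STATE_STOP | flags.STATE_IS_WINDOW;
  return {
    onStateChange: function (webProgress, request, stateFlags, status) {
      // Both bits must be set: a "stop" notification for the whole window.
      if ((stateFlags & wanted) === wanted) {
        onDone(webProgress.DOMWindow);
      }
    },
    // The remaining nsIWebProgressListener callbacks are no-ops here.
    onProgressChange: function () {},
    onLocationChange: function () {},
    onStatusChange: function () {},
    onSecurityChange: function () {}
  };
}
```

The flag values are taken from the interface object rather than hard-coded, so the sketch makes no claim about their numeric values.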

--

Martin Honnen
http://JavaScript.FAQTs.com/
Aug 24 '05 #3


Erik Steffl wrote:
I am trying to create a simple custom web spider using mozilla and
javascript, the basic functionality is to open a website and then
manipulate it using DOM (possibly opening links etc.).

[snip]

Hi Erik,

I'm not exactly sure what you mean by manipulating another website
using javascript. But if you meant manipulating by traversing the
document tree of another website, then you should know that you're not
allowed to do cross domain scripting.

Aug 24 '05 #4

web.dev wrote:
Erik Steffl wrote:
I am trying to create a simple custom web spider using mozilla and
javascript, the basic functionality is to open a website and then
manipulate it using DOM (possibly opening links etc.).


[snip]

Hi Erik,

I'm not exactly sure what you mean by manipulating another website
using javascript. But if you meant manipulating by traversing the
document tree of another website, then you should know that you're not
allowed to do cross domain scripting.


basically yes, traversing DOM

of course I am allowed, as I said I don't mind if I need to change
security settings or if it's mozilla specific (because I want to use it
in _my_ browser that I have full control of).

the traversing works OK, the only problem I have is that I can't
figure out how to wait for page to be fully loaded (i.e. equivalent of
readyState in IE).

unfortunately it looks like there's no way to reliably tell the state
of page loading in mozilla (based on responses here and my search on the
net).

erik
Aug 25 '05 #5

Martin Honnen wrote:


Erik Steffl wrote:

it seems like a pretty easy task, the only problem is that I
can't figure out how to wait for the website to load - I use mozilla
security settings to be able to open the window and manipulate it:

  var w = window.open("http://google.com", "google window");
  var treeWalker = w.document.createTreeWalker(
    w.document, NodeFilter.SHOW_ALL, null, false
  );
  // now use treeWalker

the problem is that the window.open returns immediately after window
is open, does not wait for page to be loaded and I can't figure out
how to wait for the site to load (as far as I can tell IE has
readyState for this).

You can try to attach an event listener to the returned window, e.g.

  var w = window.open('url');
  w.addEventListener(
    'load',
    function (evt) {
      // handle load event here e.g. access w.document
    },
    true
  );
Of course it is a bit of a gamble as in the above sequence it would be
possible that the document in the popup is loaded before the script in
the other window manages to add its event listener.
But as far as I have tested with Mozilla 1.7 the event listener is fired
when a page from e.g. http://localhost/ loads another page from
localhost while attempts to load a page from a different server did not
fire the event listener although the JavaScript console shows no
security error or warning. I am not sure whether the event listener is
not fired as Mozilla treats that attempt as a security problem or as
Mozilla has a bug there.

Do you need to use a popup window? Perhaps an iframe to load the page
you are interested in is enough?


yes, iframe would be enough, I don't care how exactly it is loaded as
long as I have access to DOM - would it make any difference if it's an
iframe?

would it be better to use XUL for a project like this?

XUL is just another markup language, so as long as you simply load
a XUL document from an HTTP server or from the local file system, your
script has no additional rights. But if you write a Mozilla extension and

.... Check out http://www.xulplanet.com/ for introductions and reference

....

yeah, I guess that's what I'll try next (thought javascript would be
easier:-)

erik
Aug 25 '05 #6

Erik Steffl wrote:
yes, iframe would be enough, I don't care how exactly it is loaded as
long as I have access to DOM - would it make any difference if it's an
iframe?


Well, if the page being spidered does not want to be framed,
it could easily break your spider by popping out of frames.

I think the other fellow who suggested using xmlhttp was giving you
the better advice, as it will call your function after it has received
all the data, and its behavior is fairly immune to the content of the
page, allowing you to choose when and if the page "plays", or if you
simply want to manipulate the source.
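
As an illustration of working on the source alone, the sketch below pulls link targets out of raw HTML text and resolves them against the page's own address, which is roughly what a spider needs in order to decide where to go next. `extractLinks` is a made-up name, and the regular expression is deliberately crude (quoted `href` attributes on `<a>` tags only); it is not a substitute for a real HTML parser. It relies on the standard `URL` constructor, which is an assumption about the environment and did not exist at the time of this thread.

```javascript
// Sketch only: return absolute URLs for the quoted href attributes of <a>
// tags found in `html`, resolved against `baseUrl`.
function extractLinks(html, baseUrl) {
  var pattern = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gi;
  var links = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      links.push(new URL(match[2], baseUrl).href);
    } catch (e) {
      // Skip href values that cannot be resolved to a URL.
    }
  }
  return links;
}
```

Because only text is inspected, none of the page's scripts run, so a page cannot break the spider by popping out of a frame.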

In the bad old days (1994), I used a .cgi parsing technique to place my
content on every web page a user visited after leaving my site, but I'm
not going to inflict that code on the population.

--
--.
--=<> Dr. Clue (A.K.A. Ian A. Storms) <>=-- C++,HTML/CSS,Javascript,TCP ...
--`
Sep 5 '05 #7

This discussion thread is closed
