Re: parsing javascript


On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
> I have to parse web pages and fetch URLs. My problem is that many of the
> URLs I need to parse are loaded dynamically by a JavaScript function
> (onload()). How can I fetch those links from Python? Thanks in advance.

Selvam,
You can try to find them yourself using string parsing, but that's
difficult. The closer you want to get to "perfect" at finding URLs
expressed in JS, the closer you'll get to rewriting a JS interpreter.
For instance, this is not so hard to understand:
"http://example.com/"
but this is:
"http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
the_domain_variable)
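
A minimal sketch of the string-parsing approach, to show where it stops working. The regular expression and the sample script are assumptions made up for illustration; the pattern only looks for quoted literals that begin with http:// or https://.

```python
import re

# Matches quoted string literals that start with http:// or https://.
URL_LITERAL = re.compile(r"""["'](https?://[^"'\s]+)["']""")

def find_literal_urls(js_source):
    """Return URL-looking string literals found in a chunk of JavaScript."""
    return URL_LITERAL.findall(js_source)

# Sample script built from the two cases above.
script = '''
var home = "http://example.com/";
var page = "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
    the_domain_variable);
'''

urls = find_literal_urls(script)
print(urls)
```

The first URL comes back intact. The second comes back with the ZZZ_DOMAIN_ZZZ placeholder still in it, because the real host only exists once the JavaScript has run.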

This is a long-standing problem for any program that parses Web pages.
You either have to embed a JS interpreter in your application or just
ignore the JavaScript. Most Web parsing robots take the latter route.

Good luck
Philip
Oct 12 '08 #1
On Oct 12, 2:28 pm, Philip Semanchuk <phi...@semanchuk.com> wrote:
> On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
>> I have to parse web pages and fetch URLs. My problem is that many of the
>> URLs I need to parse are loaded dynamically by a JavaScript function
>> (onload()). How can I fetch those links from Python? Thanks in advance.
>
> Selvam,
> You can try to find them yourself using string parsing, but that's
> difficult. The closer you want to get to "perfect" at finding URLs
> expressed in JS, the closer you'll get to rewriting a JS interpreter.
> For instance, this is not so hard to understand:
> "http://example.com/"
> but this is:
> "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
> the_domain_variable)
>
> This is a long-standing problem for any program that parses Web pages.

yep :)

> You either have to embed a JS interpreter in your application or

yep.

there are several.

pyv8 is the newest addition: http://advogato.org/article/985.html

it's a python wrapper around google's v8 javascript execution
library.

then there's pykhtml: http://paul.giannaros.org/pykhtml/

it's a python wrapper around KHTML, providing very convenient access
to KDE's HTML capabilities: what pykhtml does is "pretend" that the
GUI part of KDE doesn't exist, so you can run your program as a
command-line program; it will execute the javascript, which you will
have to wait a bit for of course; then you can walk the DOM tree
(using the pykhtml bindings) with pykhtml.DOM.getElementById() and
getElementsByTagName("a") etc. etc., looking for the URLs.
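
a rough sketch of that last step, walking a DOM for "a" elements and collecting their URLs. this does not use pykhtml: it assumes python's standard xml.dom.minidom as a stand-in, purely because it offers the same getElementsByTagName() call. it only accepts well-formed markup and runs no javascript. the sample page is made up.

```python
from xml.dom import minidom

def collect_links(xhtml_text):
    """Return the href of every <a> element, in document order."""
    document = minidom.parseString(xhtml_text)
    links = []
    for anchor in document.getElementsByTagName("a"):
        # Named anchors without an href are skipped.
        if anchor.hasAttribute("href"):
            links.append(anchor.getAttribute("href"))
    return links

# Made-up, well-formed sample page (minidom is an XML parser).
page = """<html><body>
<a href="http://example.com/">home</a>
<a name="top">no href here</a>
<p><a href="/about.html">about</a></p>
</body></html>"""

links = collect_links(page)
print(links)
```

with a browser engine behind the DOM the loop would look much the same. the difference is that the tree would already contain whatever the page's javascript added.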

there's even an AJAX example included which does 1-second polling of
the DOM model, waiting for a spell-checking web site to deliver the
answer.

then there's webkit, with the new glib bindings:
https://bugs.webkit.org/show_bug.cgi?id=16401

which are then followed up by python bindings to _those_ bindings:
http://code.google.com/p/pywebkitgtk...s/detail?id=13

this will also allow you to execute arbitrary javascript - again, it's
similar to KHTML and in fact webkit really _is_ the KDE KHTML code
(JavaScriptCore, KJS etc) but forked, improved, etc. etc.

unfortunately, the glib bindings are tied - at three key and strategic
locations - to gtk at the moment. it would take _very_ little work
to "un"tie them [pay me and i'll do the work], but until then you would
need to create a blank gtk window - just like is done with pykhtml,
behind the scenes.

it would be a very simple task to create a "dummy" - console-based -
port of webkit, providing an array of callbacks which you must hand to
the library. at the moment, the design of webkit is not particularly
good in this respect: there are three ports, gtk, wx and qt, which are
heavily tied in to webkit. it would be a _far_ better design to be
passing in a struct containing function callbacks (rather a lot of
them - about eighty!) and then what you could do is have a "console"-
based port of webkit, which would do the job you needed.
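
to make the "struct of callbacks" idea concrete, here is a toy sketch in python. nothing in it is webkit's actual interface: PortCallbacks, Engine and the three hook names are hypothetical, and a real port would need far more hooks than this.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PortCallbacks:
    # Hypothetical hooks; a real port would need many more than three.
    repaint: Callable[[int, int, int, int], None]
    set_title: Callable[[str], None]
    show_alert: Callable[[str], None]

class Engine:
    """Toy stand-in for an engine that reaches the outside world only
    through the callbacks it was handed."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def load(self, title):
        self.callbacks.set_title(title)
        self.callbacks.repaint(0, 0, 800, 600)

# A "console" port: drawing is a no-op, everything else is just recorded.
log = []
console_port = PortCallbacks(
    repaint=lambda x, y, w, h: None,
    set_title=lambda title: log.append(title),
    show_alert=lambda message: log.append("alert: " + message),
)

Engine(console_port).load("Example page")
print(log)
```

the point of the design is that the engine never imports a GUI toolkit itself, so a port with no GUI is just another set of callbacks.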

alternatively, if you don't mind wrapping a binary application with
e.g. Popen3 (subprocess.Popen in current python), then look at the
webkit DumpRenderTree application, paying particular attention to the
--html option. you won't have any control over how long the javascript
is executed for: after an arbitrary and small period of time,
DumpRenderTree _stops_ executing the javascript and prints out the
HTML DOM model (in a non-html-layout fashion - it's used for debugging
and testing purposes, but will suffice for yours).
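
a sketch of the "wrap a binary" route using python's subprocess module. the binary name and the --html option in dump_page() are taken from the paragraph above and are assumptions to check against your own webkit build. the demo at the bottom runs the python interpreter as a stand-in, so the sketch works without webkit installed.

```python
import subprocess
import sys

def run_and_capture(command, timeout=30):
    """Run an external program and return its standard output as text."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        raise RuntimeError("command failed: %s" % completed.stderr.strip())
    return completed.stdout

def dump_page(url, binary="DumpRenderTree", options=("--html",)):
    # Assumption: binary name and option come from the post, not from
    # verified documentation; adjust both for your build.
    return run_and_capture([binary] + list(options) + [url])

# Stand-in command so the sketch runs without a WebKit build installed.
demo = run_and_capture([sys.executable, "-c", "print('<a href=\"/x\">x</a>')"])
print(demo)
```

the timeout here only limits how long python waits for the process. it gives no control over how long the page's javascript is allowed to run inside it.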

so, as it stands, pywebkitgtk is _no worse_ than pykhtml, but with a
little bit of tweaking, the "gtk" could be removed from "pywebkitgtk"
and you'd end up with... ohh... call it "pywebkitglib" ... which would
be much better as a stand-alone library, for your purposes.

then there's also "spidermonkey", which is mozilla's javascript
engine. i haven't investigated this option: haven't had a need to.

then there's also PyXPCOMExt, which is embedding python into mozilla,
and from there you have PyDOM, which allows you access to the DOM
model of the mozilla "thing". so, if you don't mind embedding your
application into XULRunner, you've got a home for executing your app
and obtaining the urls, post-javascript-execution.

the neat thing about PyXPCOMExt is that you have complete and full
access to python - so your app can make external TCP and UDP sockets,
you can embed an entire _server_ in the damn thing if you want (you
could embed... python-twisted if you wanted!) you can access the
filesystem - anything. absolutely anything. reason: the _entire_
python suite is embedded into the browser. every single bit of it.

that's about all i've been able to find, so far. there might be more
options out there. not that there aren't enough already :)

all of them give you complete and full access to javascript execution,
including AJAX - which is why you'll need to do that "polling" trick
in many instances.
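
the polling trick itself is independent of which engine you pick. a minimal sketch: poll_until() and fake_get_links() are hypothetical names, and the fake function stands in for "ask the DOM for its links", returning nothing until the third call, as if an AJAX request had only just completed.

```python
import time

def poll_until(check, interval=1.0, timeout=30.0):
    """Call check() every `interval` seconds until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Hypothetical stand-in for a DOM query: empty on the first two calls,
# as if the page's JavaScript had not finished yet.
_responses = iter([[], [], ["http://example.com/late.html"]])

def fake_get_links():
    return next(_responses)

links = poll_until(fake_get_links, interval=0.01, timeout=1.0)
print(links)
```

in real use check() would query the live DOM and interval would be closer to the 1 second used in the pykhtml example. the timeout matters because some pages never produce the links you are waiting for.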

l.
Oct 13 '08 #2
