Thomas SMETS wrote:
Dear,
I need to parse XHTML/HTML files in several ways:
  - Removing comments and JavaScript is the first issue
  - Retrieving the list of fields to submit is my next item (todo)
Any idea where I could find something ready-made for this?
You could try XIST (http://www.livinglogic.de/Python/xist).
Removing comments and JavaScript works like this:
---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Replace comments and JavaScript <script> elements with the null node
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and \
        (unicode(node["type"]) == u"text/javascript" or \
         unicode(node["language"]) == u"Javascript" \
        ):
        node = xsc.Null
    return node

# Apply removestuff to the whole tree
e = e.mapped(removestuff)
print e.asBytes()
---
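If XIST is not an option, roughly the same filtering can be sketched with nothing but the standard library. This is not XIST and not the code above: it assumes Python 3 and html.parser, the names CommentAndScriptStripper and strip_comments_and_scripts are made up for illustration, and unlike the XIST version it drops every script element regardless of its type or language attribute. It also does no tidying, so badly broken markup passes through as it is.

```python
from html.parser import HTMLParser


class CommentAndScriptStripper(HTMLParser):
    """Re-emit HTML while dropping comments and <script> elements.

    Hypothetical helper for illustration; not part of XIST.
    """

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through
        super().__init__(convert_charrefs=False)
        self.out = []           # pieces of the filtered document
        self.in_script = False  # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Tags written as <br/>; a <script/> written this way is dropped
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    # handle_comment is not overridden: the default does nothing,
    # so comments never reach self.out.


def strip_comments_and_scripts(markup):
    """Return markup without comments and script elements."""
    parser = CommentAndScriptStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling strip_comments_and_scripts(text) on a page's source returns the filtered markup as a string. Fetching the page and decoding it is left to the caller.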
Retrieving the list of fields from all forms on a page might look like this:
---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        # Prefer the id attribute, fall back to name
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---
This prints:
Fields for http://www.google.com/search
	q
	domains
	sitesearch
	sourceid
	submit
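For comparison, the same kind of listing can be sketched without XIST, using only html.parser from the Python 3 standard library. The names FormFieldCollector and list_form_fields are hypothetical. The sketch assumes that fields sit between a form's start and end tags, and it silently skips fields that have neither an id nor a name.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect, per form, the id (or else the name) of each input/textarea.

    Hypothetical helper for illustration; not part of XIST.
    """

    FIELD_TAGS = ("input", "textarea")

    def __init__(self):
        super().__init__()
        self.forms = []       # list of (action, [field labels]) tuples
        self._current = None  # field list of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
            self.forms.append((attrs.get("action", ""), self._current))
        elif tag in self.FIELD_TAGS and self._current is not None:
            # Prefer id, fall back to name, as in the XIST example
            label = attrs.get("id") or attrs.get("name")
            if label:
                self._current.append(label)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_form_fields(markup):
    """Return [(action, [field labels]), ...] for every form in markup."""
    parser = FormFieldCollector()
    parser.feed(markup)
    parser.close()
    return parser.forms
```

Looping over the result of list_form_fields(text) and printing each action followed by its labels gives a listing in the same shape as the one above.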
Hope that helps!
Bye,
Walter Dörwald