
Parsing HTML?

I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
Apr 3 '08 #1
7 Replies


> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
Apr 3 '08 #2

BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
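
For comparison, here is a minimal sketch of what the event-driven route looks like. It assumes current Python 3, where the standard-library module is `html.parser` (in the Python 2 of this thread it was called `HTMLParser`). The `TagTextCollector` class and `text_inside` helper are hypothetical names made up for this illustration. It selects by tag name only, not by XPath, which is part of why a tree-based parser is more convenient here.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.chunks = []    # text pieces seen while inside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text events arrive one at a time; keep only those inside the tag.
        if self.depth:
            self.chunks.append(data)


def text_inside(html, tag):
    """Return the concatenated text inside all <tag> elements of html."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


result = text_inside("<p>skip</p><div>Hello <b>world</b>!</div>", "div")
```

The parser never builds a tree, so the caller has to track state (here, `depth`) by hand.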

On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
<fe********@googlemail.com> wrote:
>> I'm trying to parse an HTML file. I want to retrieve all of the text
>> inside a certain tag that I find with XPath. The DOM seems to make
>> this available with the innerHTML element, but I haven't found a way
>> to do it in Python.
>
> Have you tried http://www.google.com/search?q=python+html+parser ?
>
> HTH,
> Daniel
Apr 3 '08 #3

On Apr 3, 12:39 am, ben...@gmail.com wrote:
> BeautifulSoup does what I need it to. Though, I was hoping to find
> something that would let me work with the DOM the way JavaScript can
> work with web browsers' implementations of the DOM. Specifically, I'd
> like to be able to access the innerHTML element of a DOM element.
> Python's built-in HTMLParser is SAX-based, so I don't want to use
> that, and the minidom doesn't appear to implement this part of the
> DOM.

innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.
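
If the goal is just the markup between an element's opening and closing tags, one way to approximate innerHTML with the standard-library minidom is to serialize the element's children. This is only a sketch: `inner_html` is a hypothetical helper name, and minidom is an XML parser, so it assumes the input is well-formed XHTML rather than arbitrary HTML.

```python
from xml.dom import minidom


def inner_html(node):
    """Serialize all child nodes of node, roughly like a browser's innerHTML."""
    return "".join(child.toxml() for child in node.childNodes)


# Assumed sample input; real-world HTML is often not well-formed XML.
doc = minidom.parseString('<div id="a">Hello <b>world</b>!</div>')
markup = inner_html(doc.documentElement)
```

Here `markup` holds the serialized children of the `div`, including the nested `b` element.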
Apr 4 '08 #4

Benjamin wrote:
> I'm trying to parse an HTML file. I want to retrieve all of the text
> inside a certain tag that I find with XPath. The DOM seems to make
> this available with the innerHTML element, but I haven't found a way
> to do it in Python.

    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")
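
The XPath `string()` call above returns the concatenated text of the first matching element. For readers without lxml, a rough standard-library analogue can be sketched with `xml.etree.ElementTree`, which is a different technique: it supports only a small XPath subset and needs well-formed XHTML. The helper name `text_of_first_match`, the sample markup, and the `class` value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def text_of_first_match(xml_source, path):
    """Return all text inside the first element matching path, or ''."""
    root = ET.fromstring(xml_source)
    element = root.find(path)  # limited XPath subset, not full XPath 1.0
    if element is None:
        return ""
    # itertext() walks the element's own text, its descendants' text,
    # and their tails, in document order.
    return "".join(element.itertext())


page = (
    "<html><body>"
    '<div class="x">skip</div>'
    '<div class="main">Hello <b>world</b>!</div>'
    "</body></html>"
)
found = text_of_first_match(page, ".//div[@class='main']")
```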

http://codespeak.net/lxml

Stefan
Apr 7 '08 #5

On Apr 3, 9:10 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
> On Apr 3, 12:39 am, ben...@gmail.com wrote:
>> BeautifulSoup does what I need it to. Though, I was hoping to find
>> something that would let me work with the DOM the way JavaScript can
>> work with web browsers' implementations of the DOM. Specifically, I'd
>> like to be able to access the innerHTML element of a DOM element.
>> Python's built-in HTMLParser is SAX-based, so I don't want to use
>> that, and the minidom doesn't appear to implement this part of the
>> DOM.
>
> innerHTML has never been part of the DOM. It is however a defacto
> browser standard. That's probably why you aren't having any luck
> using a python module that implements the DOM.

That makes sense.
Jun 27 '08 #6

On Apr 6, 11:03 pm, Stefan Behnel <stefan...@behnel.de> wrote:
> Benjamin wrote:
>> I'm trying to parse an HTML file. I want to retrieve all of the text
>> inside a certain tag that I find with XPath. The DOM seems to make
>> this available with the innerHTML element, but I haven't found a way
>> to do it in Python.
>
>     import lxml.html as h
>     tree = h.parse("somefile.html")
>     text = tree.xpath("string( some/element[@condition] )")
>
> http://codespeak.net/lxml
>
> Stefan

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.
Jun 27 '08 #7

Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <stefan...@behnel.de> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file. I want to retrieve all of the text
>>> inside a certain tag that I find with XPath. The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>>
>>     import lxml.html as h
>>     tree = h.parse("somefile.html")
>>     text = tree.xpath("string( some/element[@condition] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan
>
> I actually had trouble getting this to work. I guess only new version
> of lxml have the html module, and I couldn't get it installed. lxml
> does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

    from lxml import etree
    parser = etree.HTMLParser()
    tree = etree.parse("somefile.html", parser)
    text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
Jun 27 '08 #8
