Parsing HTML?

Benjamin
#1: Apr 3 '08
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
Daniel Fetchinson
#2: Apr 3 '08

Quote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
benash@gmail.com
#3: Apr 3 '08

BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
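
To illustrate what "SAX-based" means in practice, here is a minimal sketch of pulling the text out of one element with the event-driven parser from the standard library (html.parser in Python 3; the module was named HTMLParser in the Python 2 of this thread). The class name TextInsideTag, the lookup by id, and the sample markup are made up for illustration. There is no XPath here, and the nesting depth has to be tracked by hand, which is a large part of why a tree-based tool is more comfortable for this job.

```python
from html.parser import HTMLParser


class TextInsideTag(HTMLParser):
    """Collect the text inside the first element with a given id (illustrative)."""

    # Void elements never get an end tag, so they must not affect the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # greater than 0 while inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = TextInsideTag("main")
collector.feed('<div id="main">Hello <b>bold</b><br> world</div><p>after</p>')
collector.close()
collected = "".join(collector.chunks)
print(collected)
```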

On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
<fetchinson@googlemail.com> wrote:
Quote:
Quote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
7stud
#4: Apr 4 '08

On Apr 3, 12:39 am, ben...@gmail.com wrote:
Quote:
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.
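
Even without an innerHTML attribute, a DOM module can get close by serializing each child node and joining the results. The following is a minimal sketch with xml.dom.minidom from the standard library. The helper name inner_markup and the sample markup are made up for illustration, and minidom only accepts well-formed XML, so this is no substitute for a real HTML parser.

```python
from xml.dom import minidom


def inner_markup(node):
    """Serialize the children of `node`, roughly what browsers call innerHTML."""
    return "".join(child.toxml() for child in node.childNodes)


# Assumption: the input is well-formed XML (XHTML-style), not tag soup.
doc = minidom.parseString('<div id="main">Hello <b>bold</b> world</div>')
div = doc.documentElement
inner = inner_markup(div)
print(inner)
```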
Stefan Behnel
#5: Apr 7 '08

Benjamin wrote:
Quote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan
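
For readers unsure what the XPath string() function returns: it is the text of the first matching element with all tags stripped, that is, every descendant text node joined in document order. A rough standard-library equivalent, assuming well-formed markup and using only the XPath subset that xml.etree.ElementTree supports, might look like the sketch below. The sample document and the id values are made up for illustration; this is not lxml.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed markup; ElementTree is an XML parser, not an HTML one.
root = ET.fromstring(
    "<html><body>"
    '<div id="main">Hello <b>bold</b> world</div>'
    '<div id="other">ignored</div>'
    "</body></html>"
)

# ElementTree implements a limited XPath; attribute predicates are part of it.
div = root.find(".//div[@id='main']")

# Join every descendant text node in document order, like XPath string().
text = "".join(div.itertext())
print(text)
```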
Benjamin
#6: Jun 27 '08

On Apr 3, 9:10 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
Quote:
On Apr 3, 12:39 am, ben...@gmail.com wrote:

Quote:
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.

That makes sense.
Benjamin
#7: Jun 27 '08

On Apr 6, 11:03 pm, Stefan Behnel <stefan...@behnel.de> wrote:
Quote:
Benjamin wrote:
Quote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.
Stefan Behnel
#8: Jun 27 '08

Benjamin wrote:
Quote:
On Apr 6, 11:03 pm, Stefan Behnel <stefan...@behnel.de> wrote:
Quote:
Benjamin wrote:
Quote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

    import lxml.html as h
    tree = h.parse("somefile.html")
    text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

    import lxml.etree as et
    # Pass the HTML parser explicitly; the default parser expects XML.
    parser = et.HTMLParser()
    tree = et.parse("somefile.html", parser)
    text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan