473,326 Members | 2,023 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,326 software developers and data experts.

Parsing HTML?

I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
Apr 3 '08 #1
7 5576
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
Apr 3 '08 #2
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
<fe********@googlemail.comwrote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
Apr 3 '08 #3
On Apr 3, 12:39*am, ben...@gmail.com wrote:
BeautifulSoup does what I need it to. *Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. *Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
innerHTML has never been part of the DOM. It is however a defacto
browser standard. That's probably why you aren't having any luck
using a python module that implements the DOM.
Apr 4 '08 #4
Benjamin wrote:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan
Apr 7 '08 #5
On Apr 3, 9:10*pm, 7stud <bbxx789_0...@yahoo.comwrote:
On Apr 3, 12:39*am, ben...@gmail.com wrote:
BeautifulSoup does what I need it to. *Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. *Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

innerHTML has never been part of the DOM. *It is however a defacto
browser standard. *That's probably why you aren't having any luck
using a python module that implements the DOM.
That makes sense.
Jun 27 '08 #6
On Apr 6, 11:03*pm, Stefan Behnel <stefan...@behnel.dewrote:
Benjamin wrote:
I'm trying to parse an HTML file. *I want to retrieve all of the text
inside a certain tag that I find with XPath. *The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

* * import lxml.html as h
* * tree = h.parse("somefile.html")
* * text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan
I actually had trouble getting this to work. I guess only new version
of lxml have the html module, and I couldn't get it installed. lxml
does look pretty cool, though.
Jun 27 '08 #7
Benjamin wrote:
On Apr 6, 11:03 pm, Stefan Behnel <stefan...@behnel.dewrote:
>Benjamin wrote:
>>I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan

I actually had trouble getting this to work. I guess only new version
of lxml have the html module, and I couldn't get it installed. lxml
does look pretty cool, though.
Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

import lxml.etree as et
parser = etree.HTMLParser()
tree = h.parse("somefile.html", parser)
text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
Jun 27 '08 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Gerrit Holl | last post by:
Posted with permission from the author. I have some comments on this PEP, see the (coming) followup to this message. PEP: 321 Title: Date/Time Parsing and Formatting Version: $Revision: 1.3 $...
14
by: Viktor Rosenfeld | last post by:
Hi, I need to create a parser for a Python project, and I'd like to use process kinda like lex/yacc. I've looked at various parsing packages online, but didn't find anything useful for me: -...
9
by: RiGGa | last post by:
Hi, I want to parse a web page in Python and have it write certain values out to a mysql database. I really dont know where to start with parsing the html code ( I can work out the database...
0
by: Fuzzyman | last post by:
I am trying to parse an HTML page an only modify URLs within tags - e.g. inside IMG, A, SCRIPT, FRAME tags etc... I have built one that works fine using the HTMLParser.HTMLParser and it works...
3
by: Willem Ligtenberg | last post by:
I decided to use SAX to parse my xml file. But the parser crashes on: File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception...
16
by: Terry | last post by:
Hi, This is a newbie's question. I want to preload 4 images and only when all 4 images has been loaded into browser's cache, I want to start a slideshow() function. If images are not completed...
1
by: yonido | last post by:
hello, my goal is to get patterns out of email files - say "message forwarding" patterns (message forwarded from: xx to: yy subject: zz) now lets say there are tons of these patterns (by gmail,...
4
by: Rick Walsh | last post by:
I have an HTML table in the following format: <table> <tr><td>Header 1</td><td>Header 2</td></tr> <tr><td>1</td><td>2</td></tr> <tr><td>3</td><td>4</td></tr> <tr><td>5</td><td>6</td></tr>...
9
by: ankitdesai | last post by:
I would like to parse a couple of tables within an individual player's SHTML page. For example, I would like to get the "Actual Pitching Statistics" and the "Translated Pitching Statistics"...
4
by: Neil.Smith | last post by:
I can't seem to find any references to this, but here goes: In there anyway to parse an html/aspx file within an asp.net application to gather a collection of controls in the file. For instance...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.