HTML Parsing

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

    for srvrtokens in printCount.searchString(printerListHTML):
        print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike

Feb 10 '07 #1
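
For readers on a current Python 3, here is a minimal sketch of how the standard-library html.parser module (the successor to the HTMLParser module mentioned above) could pull the value out. The class name SpanTextCollector, the SAMPLE string, and the assumption that the headers value "col2_1" identifies the wanted cell are illustrative choices, not something taken from the original page.

```python
from html.parser import HTMLParser

# Sample row copied from the question, including the unclosed fourth <td>.
SAMPLE = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
"""


class SpanTextCollector(HTMLParser):
    """Hypothetical helper: map each <td headers=...> to its span text."""

    def __init__(self):
        super().__init__()
        self.current_headers = None  # headers value of the <td> being read
        self.in_span = False         # True while inside a hpPageText span
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.current_headers = attrs.get("headers")
        elif tag == "span" and attrs.get("class") == "hpPageText":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching span within a <td>.
        if self.in_span and self.current_headers is not None:
            self.values[self.current_headers] = data.strip()


parser = SpanTextCollector()
parser.feed(SAMPLE)
parser.close()

# Assumption: "col2_1" is the cell holding the count we want.
page_count = int(parser.values["col2_1"].replace(",", ""))
print(page_count)
```

If the real page reuses the same headers value in several rows, the dictionary would keep only the last one, so the key would need to include something that identifies the row.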
On Sat, 10 Feb 2007 20:07:43 -0300, mtuller <mi******@gmail.com> wrote:
> <tr >
> <td headers="col1_1" style="width:21%" >
> <span class="hpPageText" >LETTER</span></td>
> <td headers="col2_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >33,699</span></td>
> <td headers="col3_1" style="width:13%; text-align:right" >
> <span class="hpPageText" >1.0</span></td>
> <td headers="col4_1" style="width:13%; text-align:right" >
> </tr>
>
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database. I have tried parsing
> [...]
> I have also tried Beautiful Soup, but had trouble understanding the
> documentation, and HTMLParser doesn't seem to do what I want. Can [...]

Just try harder with BeautifulSoup; it should work OK for your use case.
Unfortunately I can't give you an example right now.

--
Gabriel Genellina

Feb 11 '07 #2
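
As a rough illustration of the BeautifulSoup suggestion, the sketch below finds the row whose first cell reads LETTER and takes the next cell as the count. It assumes the current bs4 package rather than the BeautifulSoup 3 module used elsewhere in this thread, and the function name find_count and the SAMPLE string are made up for the example.

```python
from bs4 import BeautifulSoup

# Sample row copied from the question, including the unclosed fourth <td>.
SAMPLE = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
"""


def find_count(html, label):
    """Hypothetical helper: return the number next to `label`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        # Text of every hpPageText span in this row, in document order.
        cells = [span.get_text(strip=True)
                 for span in row.find_all("span", class_="hpPageText")]
        if len(cells) >= 2 and cells[0] == label:
            # Drop the thousands separator before converting.
            return int(cells[1].replace(",", ""))
    return None


letter_count = find_count(SAMPLE, "LETTER")
print(letter_count)
```

Matching on the label text rather than on a line number means the lookup keeps working if rows are added or reordered, which seems to be what the original question is after.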
"mtuller" typed:
> I have also tried Beautiful Soup, but had trouble understanding the
> documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
>>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
>>> title
u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

--
Ayaz Ahmed Khan

A witty saying proves nothing, but saying something pointless gets
people's attention.
Feb 11 '07 #3
On Feb 11, 6:05 pm, Ayaz Ahmed Khan <a...@dev.slash.null> wrote:
> "mtuller" typed:
> > I have also tried Beautiful Soup, but had trouble understanding the
> > documentation
>
> As Gabriel has suggested, spend a little more time going through the
> documentation of BeautifulSoup. It is pretty easy to grasp.
>
> I'll give you an example: I want to extract the text between the
> following span tags in a large HTML source file.
>
> <span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>
>
> >>> import re
> >>> from BeautifulSoup import BeautifulSoup
> >>> from urllib2 import urlopen
> >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
> >>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
> >>> title
> u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""

tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    key = elem.get('headers')
    assert elem[0].tag == 'span'
    value = elem[0].text
    print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'

HTH,
John

Feb 11 '07 #4
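
The ElementTree script above targets Python 2.5. A minimal sketch of the same idea for Python 3 follows; it uses xml.etree.ElementTree and Element.iter() in place of cElementTree and getiterator(), and adds a conversion of the count to an int. The names ROW, values and page_count are illustrative, and the markup still has to be well-formed, so the unclosed fourth td is left out as in the original script.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the sample row (fourth <td> omitted).
ROW = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""

root = ET.fromstring(ROW)

# Map each cell's headers attribute to the text of its span.
values = {}
for td in root.iter("td"):
    span = td.find("span")
    if span is not None:
        values[td.get("headers")] = span.text

# Assumption: "col2_1" is the cell holding the count we want.
page_count = int(values["col2_1"].replace(",", ""))
print(page_count)
```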
John Machin wrote:
> One can even use ElementTree, if the HTML is well-formed. See below.
> However if it is as ill-formed as the sample (4th "td" element not
> closed; I've omitted it below), then the OP would be better off
> sticking with Beautiful Soup :-)

or get the best of both worlds:

http://effbot.org/zone/element-soup.htm

</F>

Feb 11 '07 #5
John Machin wrote:
> One can even use ElementTree, if the HTML is well-formed. See below.
> However if it is as ill-formed as the sample (4th "td" element not
> closed; I've omitted it below), then the OP would be better off
> sticking with Beautiful Soup :-)

Or (as we were talking about the best of both worlds already) use lxml's HTML
parser, which is also capable of parsing pretty disgusting HTML-like tag soup.

Stefan
Feb 25 '07 #6
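
A short sketch of the lxml route, for comparison. It assumes the third-party lxml package is installed and wraps the sample row in a table element so the fragment resembles a real page; that wrapper, the SAMPLE string, and the choice of "col2_1" as the target cell are assumptions made for the example. The unclosed fourth td is left in to show that the HTML parser recovers from it.

```python
import lxml.html

# Sample row from the question, wrapped in <table> (an assumption) and
# still containing the unclosed fourth <td>.
SAMPLE = """<table>
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
</table>"""

doc = lxml.html.fromstring(SAMPLE)

# XPath: the text of the hpPageText span inside the cell we care about.
texts = doc.xpath('//td[@headers="col2_1"]/span[@class="hpPageText"]/text()')

# Convert to an int if the cell was found, otherwise leave it as None.
page_count = int(texts[0].replace(",", "")) if texts else None
print(page_count)
```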
