HTML Parsing - Python

mtuller

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I change the last line to print srvrtokens[3] I get the values, but I
don't know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?

Thanks,

Mike

Feb 10 '07 #1


Gabriel Genellina

On Sat, 10 Feb 2007 20:07:43 -0300, mtuller <mi******@gmail.com> wrote:

<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
[...]
I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can[...]

Just try harder with BeautifulSoup, should work OK for your use case.
Unfortunately I can't give you an example right now.
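
For a concrete starting point in the meantime, here is a rough sketch of the same lookup using only the standard library, since the question mentions that HTMLParser did not seem to fit. It is written for Python 3, where the module is `html.parser`. The class name `CellGrabber`, the `SAMPLE` string, and the variable `page_count` are made up for illustration, and the sketch assumes the cell you want is the one whose `headers` attribute is `col2_1` and that this value occurs only once in the page.

```python
from html.parser import HTMLParser

# The fragment from the question, including the unclosed fourth <td>.
SAMPLE = """
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
"""


class CellGrabber(HTMLParser):
    """Collect the text of every <td>, keyed by its headers attribute."""

    def __init__(self):
        super().__init__()
        self.cells = {}        # headers value -> accumulated cell text
        self._current = None   # headers value of the <td> being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends the previous one, so an
            # unclosed cell does not swallow its neighbours.
            self._current = dict(attrs).get("headers")
            if self._current is not None:
                self.cells[self._current] = ""

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.cells[self._current] += data


parser = CellGrabber()
parser.feed(SAMPLE)
parser.close()

raw = parser.cells["col2_1"].strip()      # text of the wanted cell
page_count = int(raw.replace(",", ""))    # drop the thousands separator
print(page_count)
```

If the real page repeats the same `headers` value in several rows, this dictionary keeps only the last one; collecting a list per key would be the obvious change.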

--
Gabriel Genellina

Feb 11 '07 #2

Ayaz Ahmed Khan

"mtuller" typed:

I have also tried Beautiful Soup, but had trouble understanding the
documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
>>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
>>> title

u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

--
Ayaz Ahmed Khan

A witty saying proves nothing, but saying something pointless gets
people's attention.

Feb 11 '07 #3

John Machin

On Feb 11, 6:05 pm, Ayaz Ahmed Khan <a...@dev.slash.null> wrote:

"mtuller" typed:

I have also tried Beautiful Soup, but had trouble understanding the
documentation

As Gabriel has suggested, spend a little more time going through the
documentation of BeautifulSoup. It is pretty easy to grasp.

I'll give you an example: I want to extract the text between the
following span tags in a large HTML source file.

<span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability</span>

>>> import re
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
>>> title = soup.find(name='span', attrs={'class':'title'}, text=re.compile(r'^Linux \w+'))
>>> title

u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1" style="width:21%" >
<span>LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span>33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span>1.0</span></td>
</tr>
"""

# Parse the fragment from an in-memory file object.
tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    key = elem.get('headers')
    # Each cell is expected to wrap its text in a single <span> child.
    assert elem[0].tag == 'span'
    value = elem[0].text
    print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'
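
The script above targets Python 2.5. On Python 3 the same idea might look like the sketch below, which uses `xml.etree.ElementTree` from the standard library, builds a dictionary keyed by each cell's `headers` attribute, and converts the count to an integer so it can be stored. The names `cells` and `letter_count` are illustrative only, and the fragment is assumed to be well-formed, as noted above.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment: the unclosed fourth <td> from the original
# sample is left out, because ElementTree only accepts valid XML.
guff = """
<tr >
<td headers="col1_1" style="width:21%" >
LETTER</td>
<td headers="col2_1" style="width:13%; text-align:right" >
33,699</td>
<td headers="col3_1" style="width:13%; text-align:right" >
1.0</td>
</tr>
"""

row = ET.fromstring(guff)

# Map each cell's headers attribute to its text. itertext() also picks
# up text nested inside child elements such as <span>.
cells = {
    td.get("headers"): "".join(td.itertext()).strip()
    for td in row.iter("td")
}

# Remove the thousands separator before converting to an integer.
letter_count = int(cells["col2_1"].replace(",", ""))
print(cells)
print(letter_count)
```

Looking the value up by its `headers` key rather than by position means the lookup should keep working if columns are added or reordered, provided the attribute values stay the same.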

HTH,
John

Feb 11 '07 #4

Fredrik Lundh

John Machin wrote:

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

or get the best of both worlds:

http://effbot.org/zone/element-soup.htm

</F>

Feb 11 '07 #5

Stefan Behnel

John Machin wrote:

One can even use ElementTree, if the HTML is well-formed. See below.
However if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

Or (as we were talking about the best of both worlds already) use lxml's HTML
parser, which is also capable of parsing pretty disgusting HTML-like tag soup.

Stefan

Feb 25 '07 #6
