Parsing HTML [solved using the re module]

Hello hello, I'm very much a beginner. I've done one task successfully (with help), and now I want to deviate just a little and I'm stumped. Here's what I've done...

In a previous task I needed to get a specific number out of this source code:
<TD HEIGHT="24" CLASS="bubblemiddle" ALIGN="right" id="homeindexvolume" name="homeindexvolume">2,017,798,400</TD>

so I used:
re.compile('<TD>.*name="homeindexvolume">(.*?)</TD>',re.M|re.DOTALL)

Now, from a different piece of source code, I need a specific number, but the original line contains a lot more.
Here's the source code:

<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,508,577,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">51,073,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,966,371,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">2,125,754,373</font></td></tr>

Now all I want is 1,508,577,000.

How would I grab just that number?

How about if I wanted a different number in there, say 51,073,000?

Thanks
May 21 '07 #1
bvdet
2,851 Expert Mod 2GB
This will extract the numbers from the string:
    import re

    s = '<tr><td bgcolor="EEEEEE"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000"><b>Total</b></font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,508,577,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">51,073,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">1,966,371,000</font></td><td bgcolor="EEEEEE" align="right"><FONT FACE="Arial,Helvetica,sans-serif" SIZE="2" COLOR="#000000">2,125,754,373</font></td></tr>'

    # Match every run of digits and commas that sits directly between '>' and '<'
    patt = r'>([0-9,]+)<'
    dataList = re.findall(patt, s)
    print(dataList)

    # Output:
    # ['1,508,577,000', '51,073,000', '1,966,371,000', '2,125,754,373']
Use the list index to get individual items:
    >>> number = dataList[0]
    >>> number
    '1,508,577,000'
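If the page contains other rows with numbers in them, the plain findall will pick those up too. One possible way around that, as a rough sketch only: first cut out the row that starts at the "Total" label, then run the same pattern on just that piece. The helper names row_numbers and to_int are made up for this example, and the row string is a shortened stand-in for the real source (same structure, fewer attributes). It also assumes the label and its numbers always sit in the same <tr>.

```python
import re

# Shortened stand-in for the table row from the question
# (same structure, attributes trimmed for readability).
row = (
    '<tr><td><FONT SIZE="2"><b>Total</b></font></td>'
    '<td align="right"><FONT SIZE="2">1,508,577,000</font></td>'
    '<td align="right"><FONT SIZE="2">51,073,000</font></td>'
    '<td align="right"><FONT SIZE="2">1,966,371,000</font></td>'
    '<td align="right"><FONT SIZE="2">2,125,754,373</font></td></tr>'
)


def row_numbers(html, label):
    """Return the comma-grouped numbers that follow `label` in its table row."""
    # Keep only the text from the label up to the end of that row, so
    # numbers from other rows on the page are ignored.
    match = re.search(re.escape(label) + r'.*?</tr>', html,
                      re.DOTALL | re.IGNORECASE)
    if match is None:
        return []
    return re.findall(r'>([0-9,]+)<', match.group(0))


def to_int(text):
    """Convert a string such as '51,073,000' to the integer 51073000."""
    return int(text.replace(',', ''))


totals = row_numbers(row, 'Total')
first = to_int(totals[0])    # the 1,508,577,000 column
second = to_int(totals[1])   # the 51,073,000 column
print(first, second)
```

The to_int step is only needed if you want to do arithmetic with the values; the strings from findall still carry their commas. Regular expressions stay fragile against layout changes in the page, so treat this as a quick fix rather than a general HTML parser.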
May 21 '07 #2
