Python Regex Question

I need to extract the number in each <td> tag of an HTML file.

E.g. 49.950 from the following:

<td align=right width=80><font size=2 face="New Times
Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>

The actual number between the &nbsp; entities (49.950 here) can have any number of
digits before and after the decimal point.

<td align=right width=80><font size=2 face="New Times
Roman,Times,Serif">&nbsp;######.####&nbsp;</font></td>

How can I just extract the real/integer number using regex?

Sep 20 '07 #1
jo***********@gmail.com wrote:
> How can I just extract the real/integer number using regex?

'[0-9]*\.[0-9]*'
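A quick check of this pattern against the sample cell, as a sketch in Python 3 syntax (the thread itself predates it). The names SAMPLE, LOOSE and STRICT are made up for illustration. Because both digit runs are optional, the pattern also matches a lone "." and never matches an integer without a decimal point; the stricter variant is one possible alternative, not something proposed in the thread.

```python
import re

# The cell from the question; the line break inside the tag is kept as posted.
SAMPLE = ('<td align=right width=80><font size=2 face="New Times\n'
          'Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>')

LOOSE = re.compile(r'[0-9]*\.[0-9]*')        # pattern from the reply above
# Hypothetical stricter variant: digits required, fractional part optional.
STRICT = re.compile(r'[0-9]+(?:\.[0-9]+)?')

loose_hits = LOOSE.findall(SAMPLE)            # matches in the whole cell
stray_dot = LOOSE.findall('v. 3')             # text with a dot but no decimal
strict_hits = STRICT.findall('&nbsp;49.950&nbsp; and &nbsp;17&nbsp;')

print(loose_hits)
print(stray_dot)
print(strict_hits)
```

Run against the whole cell, the stricter variant would also pick up the digits in width=80 and size=2, so it only helps once the text between the &nbsp; entities has been isolated.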

--
Posted via a free Usenet account from http://www.teranews.com

Sep 20 '07 #2
jo***********@gmail.com wrote:
> How can I just extract the real/integer number using regex?

If all the td contents follow the &nbsp;[value_to_extract]&nbsp; pattern,
things are simplest:

[untested]

/<td.*&nbsp;([^&]*)&nbsp;/

The parentheses form a group, so group(1) on the match returns just
the part you really want.
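In Python the slashes are dropped and the captured text is read with group(1). The sketch below assumes Python 3 and that the line break inside the <td> tag is really present in the file (it may only be mail wrapping); if it is, "." needs re.DOTALL to cross it. The non-greedy ".*?" is an adjustment of mine so that input with several cells stops at the first value in each; SAMPLE and CELL are illustrative names.

```python
import re

SAMPLE = ('<td align=right width=80><font size=2 face="New Times\n'
          'Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>')

# The pattern as posted, minus the Perl-style slashes.
as_posted = re.search(r'<td.*&nbsp;([^&]*)&nbsp;', SAMPLE)

# Adjusted: DOTALL lets "." cross the newline, ".*?" stops at the first &nbsp;.
CELL = re.compile(r'<td.*?&nbsp;([^&]*)&nbsp;', re.DOTALL)
adjusted = CELL.search(SAMPLE)
value = adjusted.group(1)          # text captured by the parentheses

# With one group in the pattern, findall returns the captured text per cell.
two_cells = SAMPLE + '\n' + SAMPLE.replace('49.950', '7.25')
values = CELL.findall(two_cells)

print(as_posted, value, values)
```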

Cheers
Gerardo
Sep 20 '07 #3
On Sep 20, 4:12 pm, Tobiah <t...@tobiah.org> wrote:

> '[0-9]*\.[0-9]*'

I am trying to use BeautifulSoup:

soup = BeautifulSoup(page)

td_tags = soup.findAll('td')
i = 0
for td in td_tags:
    i = i + 1
    print "td: ", td
    # re.search('[0-9]*\.[0-9]*', td)
    price = re.compile('[0-9]*\.[0-9]*').search(td)

I am getting an error:

price= re.compile('[0-9]*\.[0-9]*').search(td)
TypeError: expected string or buffer

Does BeautifulSoup return an array of objects? If so, how do I pass the
"td" instance as a string to re.search? What is the difference between
re.search and re.compile().search?

Sep 20 '07 #4
Ivo
crybaby wrote:
> Does BeautifulSoup return an array of objects? If so, how do I pass the
> "td" instance as a string to re.search? What is the difference between
> re.search and re.compile().search?

I don't know anything about BeautifulSoup, but as for the other questions:

var = re.compile(regexpr) compiles the expression; after that you can
use var as a reference to the compiled expression (costs less).

re.search(expr, string) compiles and searches every time. This can
potentially cost more computing power, especially if you
have to use the expression many times.

The way you use it, it doesn't matter.

Do:
pattern = re.compile(r'[0-9]*\.[0-9]*')
result = pattern.findall(your_text_here)

Now you can reuse pattern.
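Tying this back to the TypeError in the earlier post: the search methods want a string, so each cell has to be turned into text first. A rough sketch in Python 3; the list of strings stands in for the BeautifulSoup tags (calling str() on a tag gives its markup), which is an assumption about how the cells would be obtained, and extract_numbers is a made-up helper name.

```python
import re

PATTERN = re.compile(r'[0-9]*\.[0-9]*')   # compiled once, reused below


def extract_numbers(cells):
    """Return every match of PATTERN found in each cell's markup."""
    results = []
    for cell in cells:
        # With BeautifulSoup, `cell` would be a Tag; str() gives its markup.
        results.extend(PATTERN.findall(str(cell)))
    return results


# Stand-in data: plain strings instead of parsed tags.
cells = [
    '<td><font size=2>&nbsp;49.950&nbsp;</font></td>',
    '<td><font size=2>&nbsp;123.4567&nbsp;</font></td>',
]
numbers = extract_numbers(cells)
print(numbers)
```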

Cheers,
Ivo.
Sep 21 '07 #5
> re.search(expr, string) compiles and searches every time. This can
> potentially be more expensive in calculating power. especially if you
> have to use the expression a lot of times.

The re module-level helper functions cache expressions and their
compiled form in a dict. They are only compiled once. The main
overhead would be for repeated dict lookups.

See sre.py (included from re.py) for more details. /usr/lib/python2.4/sre.py
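A small check of the point above, as a sketch in Python 3: the module-level function and a precompiled pattern give the same result, so the difference is only the per-call cache lookup. Whether the cache hands back the very same pattern object is an implementation detail, so that result is printed rather than assumed; EXPR and TEXT are illustrative names.

```python
import re

EXPR = r'[0-9]*\.[0-9]*'
TEXT = '&nbsp;49.950&nbsp;'

compiled = re.compile(EXPR)

via_module = re.search(EXPR, TEXT).group()     # goes through the re cache
via_compiled = compiled.search(TEXT).group()   # uses the pattern directly

# Implementation detail: a cached pattern is typically handed back as-is.
same_object = re.compile(EXPR) is compiled

print(via_module, via_compiled, same_object)
```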
Sep 21 '07 #6
