
Python Regex Question

P: n/a
I need to extract the number in each <td> tag from an HTML file.

e.g. 49.950 from the following:

<td align=right width=80><font size=2 face="New Times
Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>

The number between the two &nbsp; entities can have any number of
digits before and after the decimal point.

<td align=right width=80><font size=2 face="New Times
Roman,Times,Serif">&nbsp;######.####&nbsp;</font></td>

How can I extract just the real or integer number using a regex?

Sep 20 '07 #1
5 Replies


P: n/a

'[0-9]*\.[0-9]*'
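
A minimal sketch of how this pattern behaves in Python on the sample markup. The variable names are illustrative, and the stricter pattern is an assumption added here, not part of the reply above. Note that '[0-9]*\.[0-9]*' requires a dot, so it skips plain integers, and it also accepts a lone '.'.

```python
import re

# Sample cell from the question, including its line break.
html = ('<td align=right width=80><font size=2 face="New Times\n'
        'Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>')

# Pattern from the reply: optional digits, a literal dot, optional digits.
loose = re.compile(r'[0-9]*\.[0-9]*')
loose_matches = loose.findall(html)

# Assumed stricter variant: at least one digit on each side of the dot.
strict = re.compile(r'[0-9]+\.[0-9]+')
strict_matches = strict.findall(html)

# The loose pattern also accepts a bare dot, which is rarely wanted.
bare_dot = loose.findall('end of sentence.')

print(loose_matches, strict_matches, bare_dot)
```

If the cells can also hold whole numbers, neither pattern will find them, because both insist on a decimal point.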

--
Posted via a free Usenet account from http://www.teranews.com

Sep 20 '07 #2

P: n/a
If every td's content follows the &nbsp;[value_to_extract]&nbsp; pattern,
things are simple:

[untested]

/<td.*&nbsp;([^&]*)&nbsp;/

The parentheses form a group, so calling group(1) on the match returns
just the value you really want.
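
The same idea as a Python sketch, under a few assumptions: Python's re module takes no /.../ delimiters, the sample cell spans two lines so re.DOTALL is needed for '.' to cross the newline, and a non-greedy '.*?' is swapped in so each match stops at the first &nbsp; it meets. The names cell_re and values and the second cell are made up for illustration.

```python
import re

# Two cells in the layout from the question; the second value is invented.
html = ('<td align=right width=80><font size=2 face="New Times\n'
        'Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>\n'
        '<td align=right width=80><font size=2 face="New Times\n'
        'Roman,Times,Serif">&nbsp;7.5&nbsp;</font></td>')

# Group 1 captures whatever sits between the two &nbsp; entities.
cell_re = re.compile(r'<td.*?&nbsp;([^&]*)&nbsp;', re.DOTALL)

# findall returns the captured group for every match.
values = cell_re.findall(html)

# search plus group(1) returns the first captured value only.
first = cell_re.search(html).group(1)

print(values, first)
```

This only works while every cell really contains the &nbsp;value&nbsp; layout; a cell without it would let the match run on into the next cell.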

Cheers
Gerardo
Sep 20 '07 #3

P: n/a
I am trying to use BeautifulSoup:

soup = BeautifulSoup(page)

td_tags = soup.findAll('td')
i = 0
for td in td_tags:
    i = i + 1
    print "td: ", td
    # re.search('[0-9]*\.[0-9]*', td)
    price = re.compile('[0-9]*\.[0-9]*').search(td)

I am getting an error:

price= re.compile('[0-9]*\.[0-9]*').search(td)
TypeError: expected string or buffer

Does BeautifulSoup return an array of objects? If so, how do I pass the
"td" instance as a string to re.search? What is the difference between
re.search and re.compile().search?
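
On the TypeError: the re functions expect a string, while findAll yields tag objects, so each tag has to be turned into text first (with BeautifulSoup, str(td) is the usual way). The sketch below shows the same idea without any third-party dependency by using the standard library's html.parser as a stand-in for BeautifulSoup. The class name CellText and the number pattern are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text inside each <td> as a plain string."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._parts = None  # None while we are outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._parts is not None:
            self.cells.append(''.join(self._parts))
            self._parts = None


page = ('<td align=right width=80><font size=2 face="New Times\n'
        'Roman,Times,Serif">&nbsp;49.950&nbsp;</font></td>')

parser = CellText()
parser.feed(page)
parser.close()

# Assumed pattern: digits with an optional fractional part.
number = re.compile(r'[0-9]+(?:\.[0-9]+)?')

prices = []
for cell in parser.cells:
    match = number.search(cell)  # cell is a str, so re accepts it
    if match:
        prices.append(match.group())

print(prices)
```

Searching only the cell text, rather than the whole tag, also keeps attribute values such as width=80 out of the results.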

Sep 20 '07 #4

P: n/a
Ivo
I don't know anything about BeautifulSoup, but to the other questions:

var = re.compile(regexpr) compiles the expression; after that you can
use var as a reference to the compiled expression, which costs less.

re.search(expr, string) compiles and searches every time. This can
potentially be more expensive in computing power, especially if you
have to use the expression many times.

The way you use it, it doesn't matter.

do:
pattern = re.compile(r'[0-9]*\.[0-9]*')
result = pattern.findall(text)  # text is the string you want to search

Now you can reuse pattern.
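
A short sketch of that reuse, assuming the cell contents have already been pulled out as strings. The rows list is made-up sample data.

```python
import re

# Compile once, outside the loop.
pattern = re.compile(r'[0-9]*\.[0-9]*')

# Hypothetical cell contents; the last one holds no number.
rows = ['&nbsp;49.950&nbsp;', '&nbsp;1234.5&nbsp;', '&nbsp;n/a&nbsp;']

# The same compiled object is applied to every row.
results = [pattern.findall(row) for row in rows]

print(results)
```

findall returns a list per row, so a row without a match simply yields an empty list.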

Cheers,
Ivo.
Sep 21 '07 #5

P: n/a
> re.search(expr, string) compiles and searches every time. This can
> potentially be more expensive in computing power, especially if you
> have to use the expression many times.
The re module-level helper functions cache expressions and their
compiled form in a dict. They are only compiled once. The main
overhead would be for repeated dict lookups.

See sre.py (included from re.py) for more details. /usr/lib/python2.4/sre.py
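
One way to observe the cache, as a sketch: on CPython, compiling the same pattern twice through re.compile is expected to hand back the same cached object, and re.purge() clears the cache. The identity check relies on an implementation detail rather than a documented guarantee, so treat it as a demonstration only.

```python
import re

PATTERN = r'[0-9]*\.[0-9]*'

# Two compile calls with the same pattern and flags.
first = re.compile(PATTERN)
second = re.compile(PATTERN)
same_object = first is second  # a cache hit returns the stored object

# Clearing the cache forces the next call to compile again.
re.purge()
third = re.compile(PATTERN)
recompiled = third is not first

print(same_object, recompiled)
```

The module-level helpers such as re.search go through the same cache, which is why the repeated cost is a lookup rather than a recompile.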
Sep 21 '07 #6
