
How to do this in python with regular expressions

Hi all

I'm trying to parse HTML with the re module.

html = """
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
"""

I want to get DATA1 through DATA8 from that string. (The data may not be English words.)
Can anyone tell me how to do it with a regular expression in Python?

Thank you very much.

May 25 '07 #1
4 Replies


* Jia Lu (25 May 2007 04:51:35 -0700)
> I'm trying to parse HTML with the re module.
> [...]
> Can anyone tell me how to do it with a regular expression in Python?

Just don't. Use an HTML parser like BeautifulSoup.
May 25 '07 #2

Thorsten Kampe wrote:
>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup.

Or HTMLParser/htmllib.

--
|\/|55: Mattia Gentilini e 55 = log2(che_palle_sta_storia) (by mezzo)
|/_| ETICS project at CNAF, INFN, Bologna, Italy
|\/| www.getfirefox.com www.getthunderbird.com
* Using Mac OS X 10.4.9 powered by Cerebros (Core 2 Duo) *
May 25 '07 #3

Thorsten Kampe wrote:
>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup.

Or HTMLParser/htmllib. Of course, you can mix those with re; it will be
easier than using re alone.

May 25 '07 #4

On May 25, 6:51 am, Jia Lu <Roka...@gmail.com> wrote:
> I'm trying to parse HTML with the re module.
> [...]
> I want to get DATA1 through DATA8 from that string.
> Can anyone tell me how to do it with a regular expression in Python?

# example1.py
# This example prints more than just the contents of the HTML table: it would
# also print any text between <body></body> tags, and so on.
# (Python 2; in Python 3 the HTMLParser module is named html.parser.)

import HTMLParser

class DataParser(HTMLParser.HTMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            print data

html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example1.py output:

$ python example1.py
DATA1
DATA2
DATA3
DATA4
DATA5
DATA6
DATA7
DATA8
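
For anyone reading this on Python 3, here is a rough equivalent of example1.py. It is only a sketch: it assumes the standard-library html.parser module, it collects the text into a list instead of printing each item, and the names DataCollector, sample_html and cells are made up for this illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class DataCollector(HTMLParser):
    """Collect every non-blank run of text between tags (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_data(self, data):
        # Whitespace-only runs (newlines between tags) are skipped.
        data = data.strip()
        if data:
            self.cells.append(data)


# Same table as in the question, including the mistyped </HT> end tags.
sample_html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

parser = DataCollector()
parser.feed(sample_html)
parser.close()
cells = parser.cells
print(cells)
```

The mistyped </HT> tags do no harm here, because only handle_data is overridden and the end tags are simply ignored.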

# example2.py
# This example uses the re module to pull out only the table portions of the
# HTML, so it should print only the data between <table></table> tags. Notice
# that the data between the <body></body> tags but outside the tables is not
# present in the output.
# (Python 2; in Python 3 the HTMLParser module is named html.parser.)

import HTMLParser
import re

class DataParser(HTMLParser.HTMLParser):
    def handle_data(self, data):
        data = data.strip()
        if data:
            print data

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

# The non-greedy .*? captures each <table>...</table> pair separately.
tables_list = re.findall('<table>.*?</table>', html, re.DOTALL | re.IGNORECASE)
tables_html = ' '.join(tables_list)

parser = DataParser()
parser.feed(tables_html)
parser.close()

example2.py output:

$ python example2.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2
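
Since the original question asked for a regular expression, here is a regex-only sketch (Python 3) for the sample table from the question. It is not a general solution: it assumes each cell holds plain text with no nested tags and that the cell text is followed by some closing tag. The names CELL_RE, sample_html and cells are made up for this illustration.

```python
import re

# Same table as in the question, including the mistyped </HT> end tags.
sample_html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

# <th ...> or <td ...>, then the shortest possible text up to the next "</".
# \b keeps the pattern from matching longer tag names such as <thead>.
CELL_RE = re.compile(r'<t[hd]\b[^>]*>(.*?)</', re.IGNORECASE | re.DOTALL)

cells = [text.strip() for text in CELL_RE.findall(sample_html)]
print(cells)
```

Because the pattern stops at the next "</" rather than at a matching </th> or </td>, it tolerates the mistyped </HT> tags, but a cell containing markup such as <b>...</b> would be cut short. That fragility is the reason the replies above recommend a parser.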

# example3.py
# This example does basically the same thing as example2.py, but it uses
# HTMLParser itself to keep track of whether the data is between
# <table></table> tags.
# (Python 2; in Python 3 the HTMLParser module is named html.parser.)

import HTMLParser

class DataParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.table_count = 0  # number of <table> tags currently open
    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_count += 1
    def handle_endtag(self, tag):
        if tag == 'table':
            self.table_count -= 1
    def handle_data(self, data):
        data = data.strip()
        # Print only text that appears inside at least one open table.
        if data and self.table_count > 0:
            print data

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example3.py output:

$ python example3.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2
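
The state-tracking idea in example3.py can be taken one step further to group the cell text by row. The following is a Python 3 sketch under my own assumptions: the class name RowParser, its helper methods, and the variables sample_html and rows are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Group <th>/<td> text into one list per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text pieces of the open cell, None outside a cell

    def _close_cell(self):
        if self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._close_row()
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            # A new cell implicitly ends the previous one, which covers
            # cells whose end tag is missing or mistyped.
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._close_cell()
        elif tag == 'tr':
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Same table as in the question, including the mistyped </HT> end tags.
sample_html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

parser = RowParser()
parser.feed(sample_html)
parser.close()
rows = parser.rows
print(rows)
```

HTMLParser lowercases tag names before calling the handlers, so the uppercase tags in the sample need no special treatment.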

May 27 '07 #5
