How to do this in python with regular expressions

Hi all

I'm trying to parse HTML with the re module.

html = """
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
"""

I want to get DATA1 through DATA8 from that string. (The data may not be English words.)
Can anyone tell me how to do this with regular expressions in Python?

Thank you very much.

May 25 '07 #1
* Jia Lu (25 May 2007 04:51:35 -0700)
> I'm trying to parse HTML with the re module.
> [...]
> Can anyone tell me how to do it with regular expression in python?
Just don't. Use an HTML parser like BeautifulSoup.
May 25 '07 #2
Thorsten Kampe wrote:
>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup
Or HTMLParser/htmllib.

--
|\/|55: Mattia Gentilini e 55 = log2(che_palle_sta_storia) (by mezzo)
|/_| ETICS project at CNAF, INFN, Bologna, Italy
|\/| www.getfirefox.com www.getthunderbird.com
* Using Mac OS X 10.4.9 powered by Cerebros (Core 2 Duo) *
May 25 '07 #3
Thorsten Kampe wrote:
>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup
Or HTMLParser/htmllib. Of course, you can mix those with re; it'll be
easier than using re alone.

May 25 '07 #4
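
For comparison with the advice above, here is a rough regex-only sketch for the exact sample in the question. It is written for Python 3, and the names CELL_RE and extract_cells are made up for illustration. It assumes every cell holds plain text with no nested tags, which is exactly the kind of assumption that makes a real HTML parser the safer choice.

```python
import re

# Sample markup from the original question, including its malformed </HT> tags.
html = """
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>
<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>
<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>
</TABLE>
"""

# Match an opening <th ...> or <td ...> tag, then capture everything up to
# the next "<". Assumption: cells contain plain text only (no nested tags).
CELL_RE = re.compile(r"<t[hd]\b[^>]*>([^<]*)", re.IGNORECASE)


def extract_cells(markup):
    """Return the stripped, non-empty text of every th/td cell in markup."""
    return [text.strip() for text in CELL_RE.findall(markup) if text.strip()]


cells = extract_cells(html)
print(cells)
```

Because the pattern never looks at the closing tag, the mistyped </HT> tags in the sample do not matter here. Nested markup inside a cell, comments, or a ">" inside an attribute value would all break it.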
On May 25, 6:51 am, Jia Lu <Roka...@gmail.com> wrote:
> I'm trying to parse HTML with the re module.
> [...]
> Can anyone tell me how to do it with regular expression in python?


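# example1.py
# This example prints more than what is in the HTML table: it would also
# print text between <body></body> tags, and so on.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the original post

class DataParser(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            print(data)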

html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example1.py output:

$ python example1.py
DATA1
DATA2
DATA3
DATA4
DATA5
DATA6
DATA7
DATA8
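
If the values are needed for further processing rather than just printed, the same idea can collect them into a list. The following is only a sketch for Python 3 (where the module is html.parser); CollectingParser and extract_text are hypothetical names, not part of example1.py.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class CollectingParser(HTMLParser):
    """Hypothetical variant of DataParser that stores text instead of printing it."""

    def __init__(self):
        super().__init__()
        self.items = []  # non-empty text runs, in document order

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.items.append(data)


def extract_text(markup):
    """Feed markup through CollectingParser and return the collected strings."""
    parser = CollectingParser()
    parser.feed(markup)
    parser.close()
    return parser.items


# Shortened version of the table from the question, mistyped </HT> included.
sample = "<TABLE><TR><TH nowrap>DATA1</TH><TH>DATA2</HT></TR><TR><TD>DATA5</TD></TR></TABLE>"
print(extract_text(sample))
```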

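# example2.py
# This example uses the re module to pull out only the table portions of
# the HTML, so it should print only data between <table></table> tags.
# Notice that the data that sits between the <body></body> tags but outside
# the tables is not present in the output.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the original post
import re

class DataParser(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            print(data)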

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

# Non-greedy match, so each <table>...</table> block is captured separately.
tables_list = re.findall('<table>.*?</table>', html, re.DOTALL | re.IGNORECASE)
tables_html = ' '.join(tables_list)

parser = DataParser()
parser.feed(tables_html)
parser.close()

example2.py output:

$ python example2.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2

# example3.py
# This example does basically the same thing as example2.py, but it uses
# HTMLParser to keep track of whether the data is between <table></table> tags.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the original post

class DataParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.table_count = 0  # current <table> nesting depth

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_count += 1

    def handle_endtag(self, tag):
        if tag == 'table':
            self.table_count -= 1

    def handle_data(self, data):
        data = data.strip()
        # Print only text that appears inside at least one open <table>.
        if data and self.table_count > 0:
            print(data)

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example3.py output:

$ python example3.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2
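
The tag-tracking approach of example3.py can also be pushed one step further to keep the row structure. What follows is a sketch for Python 3 under the assumption that the markup is well formed (every cell and row is closed properly, unlike the </HT> typos in the original question). TableRowParser and table_rows are hypothetical names.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical parser that groups th/td cell text by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup

    def handle_data(self, data):
        # Keep text only while inside an open cell.
        if self._cell is not None:
            self._cell.append(data)


def table_rows(markup):
    """Return a list of rows, each row being a list of cell strings."""
    parser = TableRowParser()
    parser.feed(markup)
    parser.close()
    return parser.rows


sample = """
<table>
<tr><td>table 1 data 1</td><td>table 1 data 2</td></tr>
<tr><td>table 2 data 1</td></tr>
</table>
body data
"""
print(table_rows(sample))
```

Text outside a cell, such as "body data" in the sample, is ignored because no cell is open when it arrives.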

May 27 '07 #5
