How to do this in python with regular expressions

Jia Lu

Hi all

I'm trying to parse HTML with the re module.

html = """
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
"""

I want to get DATA1 through DATA8 from that string. (The data may not be English words.)
Can anyone tell me how to do it with regular expressions in Python?

Thank you very much.

May 25 '07 #1


Thorsten Kampe

* Jia Lu (25 May 2007 04:51:35 -0700)

> I'm trying to parse HTML with the re module.
> [...]
> Can anyone tell me how to do it with regular expressions in Python?

Just don't. Use an HTML parser like BeautifulSoup.
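
For reference, here is what a regex-only attempt on the sample table might look like. This is a minimal sketch, assuming Python 3 and assuming every cell contains plain text with no nested tags; the name CELL is made up for illustration. It works on the sample, but the last lines show how easily it breaks, which is the reason for the advice above.

```python
import re

html = """
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>
<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>
<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>
</TABLE>
"""

# Match an opening <th ...> or <td ...> tag (any case, optional attributes),
# then capture everything up to the next "<". Assumption: cells hold plain text.
CELL = re.compile(r"<t[hd]\b[^>]*>([^<]*)", re.IGNORECASE)

cells = [text.strip() for text in CELL.findall(html)]
print(cells)

# The same pattern captures nothing useful once a cell contains markup:
# the capture stops at the "<" of the nested <b> tag.
broken = CELL.findall("<td><b>bold</b></td>")
print(broken)
```

Because [^<]* matches any character except "<", the capture is not limited to English text.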

May 25 '07 #2

Mattia Gentilini

Thorsten Kampe wrote:

>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup.

Or HTMLParser/htmllib.

--
|\/|55: Mattia Gentilini e 55 = log2(che_palle_sta_storia) (by mezzo)
|/_| ETICS project at CNAF, INFN, Bologna, Italy
|\/| www.getfirefox.com www.getthunderbird.com
* Using Mac OS X 10.4.9 powered by Cerebros (Core 2 Duo) *

May 25 '07 #3

Mattia Gentilini

Thorsten Kampe wrote:

>> I'm trying to parse HTML with the re module.
> Just don't. Use an HTML parser like BeautifulSoup.

Or HTMLParser/htmllib. Of course, you can mix those with re; it will be
easier than using re alone.
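
One way to read "mix those with re" is to let the parser deal with the markup and apply regular expressions only to the extracted text. The following is a rough sketch under that reading, assuming Python 3 (where the module is html.parser); the class name TextCollector, the sample HTML, and the NUMBER pattern are all made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node into a list."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.texts.append(data)


# Hypothetical sample input, not taken from the thread.
html = (
    "<p>Prices</p>"
    "<table><tr><td>apple</td><td>1.50</td></tr>"
    "<tr><td>pear</td><td>n/a</td></tr></table>"
)

collector = TextCollector()
collector.feed(html)
collector.close()

# The parser has already removed the tags, so the regex only ever
# sees plain text and can stay simple.
NUMBER = re.compile(r"\d+\.\d+")
prices = [text for text in collector.texts if NUMBER.fullmatch(text)]
print(prices)
```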


May 25 '07 #4

snorble

On May 25, 6:51 am, Jia Lu <Roka...@gmail.com> wrote:

> Hi all
>
> I'm trying to parse HTML with the re module.
>
> html = """
> <TABLE BORDER=1 cellspacing=0 cellpadding=2>
> <TR>
>
> <TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
> </TR>
>
> <TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>
>
> </TABLE>
> """
>
> I want to get DATA1 through DATA8 from that string. (The data may not be English words.)
> Can anyone tell me how to do it with regular expressions in Python?
>
> Thank you very much.

# example1.py
# This example will print out more than what's in the HTML table. It would
# also print out text between <body></body> tags, and so on.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as originally posted

class DataParser(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            print(data)

html = '''
<TABLE BORDER=1 cellspacing=0 cellpadding=2>
<TR>

<TH nowrap>DATA1</TH><TH nowrap>DATA2</HT><TH nowrap>DATA3</HT><TH>DATA4</TH>
</TR>

<TR><TD>DATA5</TD><TD>DATA6</TD><TD>DATA7</TD><TD>DATA8</TD></TR>

</TABLE>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example1.py output:

$ python example1.py
DATA1
DATA2
DATA3
DATA4
DATA5
DATA6
DATA7
DATA8

# example2.py
# This example uses the re module to pull out only the table portions of
# the HTML. It should only print out data between <table></table> tags.
# Notice that there is some data between the <body></body> tags that is
# not present in the output.

import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as originally posted

class DataParser(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        data = data.strip()
        if data:
            print(data)

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

# Non-greedy match of each <table>...</table> block; DOTALL lets "." span newlines.
tables_list = re.findall('<table>.*?</table>', html, re.DOTALL | re.IGNORECASE)
tables_html = ' '.join(tables_list)

parser = DataParser()
parser.feed(tables_html)
parser.close()

example2.py output:

$ python example2.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2
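
A caveat on the pattern in example2.py: '<table>' only matches an opening tag with no attributes, so it would not find a table written like the one in the original question (<TABLE BORDER=1 ...>). The sketch below compares it with a more tolerant pattern. It assumes Python 3, the sample HTML is made up for illustration, and the tolerant pattern still does not handle nested tables or a ">" inside an attribute value.

```python
import re

# Hypothetical sample: one table with unquoted attributes, one with a quoted one.
html = """
<body>
outside
<TABLE BORDER=1 cellspacing=0>
<TR><TD>inside 1</TD></TR>
</TABLE>
<table class="x"><tr><td>inside 2</td></tr></table>
</body>
"""

flags = re.DOTALL | re.IGNORECASE

# Pattern from example2.py: requires the opening tag to be exactly <table>.
strict = re.findall(r"<table>.*?</table>", html, flags)

# Tolerant variant: \b ends the tag name, [^>]* skips any attributes,
# and \s* allows whitespace before the closing ">".
tolerant = re.findall(r"<table\b[^>]*>.*?</table\s*>", html, flags)

print(len(strict), len(tolerant))
```

Counting tags in the parser, as example3.py does, avoids this problem because the parser reports the tag name separately from its attributes.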

# example3.py
# This example does basically the same thing as example2.py, but it uses
# HTMLParser to keep track of whether the data is between <table></table> tags.

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as originally posted

class DataParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Number of <table> elements currently open (handles nesting).
        self.table_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_count += 1

    def handle_endtag(self, tag):
        if tag == 'table':
            self.table_count -= 1

    def handle_data(self, data):
        data = data.strip()
        # Only print text that appears inside at least one open table.
        if data and self.table_count > 0:
            print(data)

html = '''
<html>
<head></head>
<body>
body data 1
<table>
<tr><td>table 1 data 1</td></tr>
<tr><td>table 1 data 2</td></tr>
</table>

<table>
<tr><td>table 2 data 1</td></tr>
<tr><td>table 2 data 2</td></tr>
</table>
body data 2
</body>
</html>
'''

parser = DataParser()
parser.feed(html)
parser.close()

example3.py output:

$ python example3.py
table 1 data 1
table 1 data 2
table 2 data 1
table 2 data 2
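
If the values are needed as data instead of printed lines, the approach in example3.py can be extended to collect cells grouped by row. This is only a sketch, assuming Python 3 and a table without nested tables; the class name TableRows and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table cell text as a list of rows (each row is a list of cells)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")  # start a new, empty cell
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Text outside <td>/<th> is ignored; text inside is appended to the cell.
        if self.in_cell:
            self.rows[-1][-1] += data


# Hypothetical sample input, not taken from the thread.
html = (
    "<p>intro</p>"
    "<table><tr><th>name</th><th>qty</th></tr>"
    "<tr><td>apple</td><td> 3 </td></tr></table>"
)

parser = TableRows()
parser.feed(html)
parser.close()

rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```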

May 27 '07 #5
