Regular Expression help

RunLevelZero

I have some data and I need to put it in a list in a particular way. I
have that figured out but there is " stuff " in the data that I don't
want.

Example:

10:00am - 11:00am: <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a>)')

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

TIA

Apr 27 '06 #1

Edward Elliott

RunLevelZero wrote:

10:00am - 11:00am: <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
Price Is Right</a>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a>)')

1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
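
The points above can be tried out together in a short script. This is only a sketch: the sample text is the thread's example line plus two made-up entries with shortened hrefs, and the names link_text, time_marker and any_tag are chosen here for illustration.

```python
import re

# Sample input: the thread's example plus two made-up entries.
html = (
    '10:00am - 11:00am: <a href="/tvpdb?id=1">The Price Is Right</a>\n'
    '11:00am - 12:00pm: <a href="/tvpdb?id=2">Guiding <b>Light</b></a>\n'
    '12:00pm - 1:00pm: <a href="/tvpdb?id=3">Another show</a>\n'
)

link_text = re.compile(r'<a[^>]*>(.*?)</a>')   # point 3
time_marker = re.compile(r'[\d:]{4,5}[ap]m')   # point 2
any_tag = re.compile(r'<[^>]+>')               # used for point 4

# Point 1: group() with no argument returns the whole match,
# so the pattern needs no enclosing parentheses.
first = time_marker.search(html)
print(first.group())

# Points 3 and 5: with exactly one capturing group, findall returns
# a list containing just the captured link text of every match.
raw_titles = link_text.findall(html)

# Point 4: tags nested inside the link text are stripped separately.
titles = [any_tag.sub('', title) for title in raw_titles]
print(titles)
```

If the link text can wrap across lines, as it does in the original example, compiling the link pattern with re.DOTALL lets the dot match the newline as well.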

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

No one can help with that unless you show us how you're building your list.
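
Since the list-building code was never posted, the following is only a guess at the kind of problem described. It assumes the page has already been split into one chunk of HTML per channel (the channel_chunks list is made up for illustration) and contrasts extend, which flattens everything into one list, with append, which keeps one inner list per channel.

```python
import re

link_text = re.compile(r'<a[^>]*>(.*?)</a>', re.DOTALL)

# Hypothetical input: one chunk of HTML per channel.
channel_chunks = [
    '9:00am - 10:00am: <a href="/x?id=1">Government Access</a>',
    '10:00am - 11:00am: <a href="/x?id=2">The Price Is Right</a>\n'
    '11:00am - 12:00pm: <a href="/x?id=3">Guiding Light</a>',
]

flat = []
nested = []
for chunk in channel_chunks:
    shows = link_text.findall(chunk)
    flat.extend(shows)    # merges every show into one flat list
    nested.append(shows)  # keeps a separate inner list per channel

print(flat)
print(nested)
```

Here nested has the list-in-a-list shape asked for, while flat shows how the grouping is lost when the per-channel results are merged.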

Apr 27 '06 #2

RunLevelZero

Great I will test this out once I have the time... thanks for the quick
response

Apr 27 '06 #3

johnzenger

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

Apr 27 '06 #4

RunLevelZero

I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)

Apr 27 '06 #5

johnzenger

If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:

>>> from BeautifulSoup import BeautifulSoup
>>> html = '10:00am - 11:00am: <a href="/tvpdb?d=tvp&id=167540528&[snip]>The Price Is Right</a>'
>>> soup = BeautifulSoup(html)
>>> soup('a')
[<a href=""/tvpdb?d=tvp&id=167540528&">The Price Is Right</a>]
>>> for show in soup('a'):
...     print show.contents[0]
...
The Price Is Right

RunLevelZero wrote:

I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)

Apr 27 '06 #6

RunLevelZero

r'<a[^>]*>(.*?)</a>'

With a slight modification that did exactly what I wanted. And yes,
findall was the only way to get everything I needed, since I had read
the whole document into a buffer.

Thanks a bunch.

Apr 27 '06 #7

RunLevelZero

Interesting... thank you.

Apr 27 '06 #8

Edward Elliott

jo********@gmail.com wrote:

If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.

Apr 27 '06 #9

John Bokma

Edward Elliott <no****@127.0.0.1> wrote:

jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak
havoc. Regexes can also help if you only want elements
preceded/followed by a certain sibling or cousin in the parse tree.
It all depends on what you're trying to accomplish. In general
though, yes parsers are better suited to extracting from markup.

A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (And that's coming from a Perl programmer ;-) )
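
As a rough illustration of a parser that tolerates an omitted closing tag, here is a sketch built on the standard library's html.parser module in Python 3. It is not anything posted in the thread: the LinkTextCollector class and the malformed sample are invented for this example, and the rule that a new <a> ends an unclosed one is an assumption made here.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collects the text found inside each <a> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not inside a link"

    def _flush(self):
        # Store the text gathered so far, with whitespace normalized.
        if self._buffer is not None:
            text = ' '.join(''.join(self._buffer).split())
            if text:
                self.titles.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._flush()  # a new <a> implicitly ends an unclosed one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'a':
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep text from a link left open at end of input


malformed = (
    '<a href="/x?id=1">The <b>Price</b> Is Right</a>\n'
    '<a href="/x?id=2">Guiding Light\n'  # closing </a> omitted
    '<a href="/x?id=3">Another show</a>\n'
)

parser = LinkTextCollector()
parser.feed(malformed)
parser.close()
print(parser.titles)
```

Nested tags such as <b> are skipped because only text data is collected. On the same input, the pattern r'<a[^>]*>(.*?)</a>' without re.DOTALL would miss the unclosed second link altogether.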

--
John MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

Apr 27 '06 #10

Kent Johnson

Edward Elliott wrote:

jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html.

Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.

Kent

Apr 28 '06 #11
