Regular Expression help

I have some data and I need to put it in a list in a particular way. I
have that figured out but there is " stuff " in the data that I don't
want.

Example:

10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a><em>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list within
a list that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

TIA

Apr 27 '06 #1
RunLevelZero wrote:
10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
Price Is Right</a><em>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
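
A rough sketch of points 1 through 5 put together, written for Python 3. The sample markup is modeled on the snippet in the question with shortened placeholder URLs, and the `re.DOTALL` flag and the whitespace cleanup are assumptions added here so that link text broken across lines still matches; they are not part of the advice above.

```python
import re

# Sample markup modeled on the question; the hrefs are shortened placeholders.
html = (
    '10:00am - 11:00am:</b> <a href="/tvpdb?id=1">The\n'
    'Price Is Right</a><em>\n'
    '11:00am - 12:00pm:</b> <a href="/tvpdb?id=2">Guiding <b>Light</b></a><em>'
)

# Point 1: group() returns the whole match, no outer parens needed.
time_re = re.compile(r'[\d:]{4,5}[ap]m')
first_time = time_re.search(html).group()

# Points 3 and 5: with one capturing group, findall returns only the link text.
# re.DOTALL (an assumption here) lets '.' cross the line break inside a link.
link_re = re.compile(r'<a[^>]*>(.*?)</a>', re.DOTALL)
raw_titles = link_re.findall(html)

# Point 4: strip tags left inside the link text, then collapse whitespace.
tag_re = re.compile(r'<[^>]+>')
titles = [' '.join(tag_re.sub('', t).split()) for t in raw_titles]

print(first_time)
print(titles)
```

The second pass with `tag_re` only removes simple tags from text that has already been extracted; it is not a substitute for a parser on arbitrary markup.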

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list within
a list that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.


No one can help with that unless you show us how you're building your list.
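
Without seeing the loop, one guess is that the results are being added to a single flat list. The sketch below (Python 3) shows one way to keep a list per group, assuming the markup can be split into one chunk per channel. The name `channel_chunks` and its contents are hypothetical, since the original post does not show how the data is divided.

```python
import re

link_re = re.compile(r'<a[^>]*>(.*?)</a>', re.DOTALL)

# Hypothetical input: one chunk of markup per channel. This split is an
# assumption; the original post does not show how the data is organized.
channel_chunks = [
    '<a href="/a">Government Access</a>',
    '<a href="/b">Price Is Right</a> <a href="/c">Guiding Light</a> '
    '<a href="/d">Another show</a>',
]

# Append one inner list per channel instead of extending a single flat list.
shows = []
for chunk in channel_chunks:
    shows.append(link_re.findall(chunk))

print(shows)
```

The key detail is `append` rather than `extend` or string joining: each call to `findall` already returns a list, so appending it preserves the nesting.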
Apr 27 '06 #2
Great I will test this out once I have the time... thanks for the quick
response

Apr 27 '06 #3
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

Apr 27 '06 #4
I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)

Apr 27 '06 #5
If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:

>>> from BeautifulSoup import BeautifulSoup
>>> html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The Price Is Right</a><em>'
>>> soup = BeautifulSoup(html)
>>> soup('a')
[<a href=""/tvpdb?d=tvp&id=167540528&">The Price Is Right</a>]
>>> for show in soup('a'):
...     print show.contents[0]
...
The Price Is Right
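
For anyone who would rather not install a package, a similar extraction can be sketched with the standard library's `html.parser` module in Python 3. This is only an illustration, not a replacement for Beautiful Soup: `LinkTextParser` is a made-up class name, and the sample markup uses a shortened placeholder URL.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._parts = None  # None means "not currently inside a link"

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            # Join the collected pieces and collapse runs of whitespace.
            self.titles.append(' '.join(''.join(self._parts).split()))
            self._parts = None


# Sample markup modeled on the question; the href is a placeholder.
html = '10:00am - 11:00am:</b> <a href="/tvpdb?id=1">The\nPrice Is Right</a><em>'
parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.titles)
```

Text inside nested tags within the link is collected as well, because `handle_data` fires for every text node while a link is open.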

RunLevelZero wrote:
I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)


Apr 27 '06 #6
r'<a[^>]*>(.*?)</a>'

With a slight modification that did exactly what I wanted, and yes,
findall was the only way to get everything I needed, since I had read
the whole document into a buffer.

Thanks a bunch.

Apr 27 '06 #7
Interesting... thank you.

Apr 27 '06 #8
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.

Apr 27 '06 #9
Edward Elliott <no****@127.0.0.1> wrote:
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak
havoc. Regexes can also help if you only want elements
preceded/followed by a certain sibling or cousin in the parse tree.
It all depends on what you're trying to accomplish. In general
though, yes parsers are better suited to extracting from markup.


A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (and that coming from a Perl programmer ;-) )

--
John
MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Apr 27 '06 #10
Edward Elliott wrote:
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html.


Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.

Kent
Apr 28 '06 #11
