Regular Expression help

I have some data and I need to put it in a list in a particular way. I
have that figured out but there is " stuff " in the data that I don't
want.

Example:

10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a><em>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.

TIA

Apr 27 '06 #1
RunLevelZero wrote:
10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The
Price Is Right</a><em>

All I want is " Price Is Right "

Here is the re.

findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')

1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.

2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'

3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'

4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.

5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
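
Tips 3 and 5 can be combined in a few lines. This is a minimal sketch, assuming the whole page has already been read into one string. The sample `page` below is made up from the fragment in the question (with shortened, hypothetical hrefs), and `re.DOTALL` is an addition not mentioned above: the link text in the question wraps across a line, and `.` does not match a newline without that flag.

```python
import re

# Hypothetical sample built from the fragment in the question; the real
# page and its full hrefs are not shown in the thread.
page = '''10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528">The
Price Is Right</a><em>
11:00am - 12:00pm:</b> <a
href="/tvpdb?d=tvp&id=1">Guiding Light</a><em>'''

# Tip 3: capture only what sits between <a ...> and </a>.
# re.DOTALL lets '.' cross the line break inside the link text.
link_text = re.compile(r'<a[^>]*>(.*?)</a>', re.DOTALL)

# Tip 5: findall returns every captured group in document order.
# split/join collapses the embedded newline into a single space.
shows = [' '.join(title.split()) for title in link_text.findall(page)]

print(shows)
```

Because the pattern has exactly one group, `findall` returns plain strings rather than match objects, so no `group()` call is needed.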

I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.

[[Government Access], [Price Is Right, Guiding Light, Another show]]

The for loop just comma-delimits all of them, so I lose the list-in-a-list
structure that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.


No one can help with that unless you show us how you're building your list.
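
Since the list-building code was never posted, the following is only a guess at the symptom. A nested list usually gets flattened when every title is added to one shared list (or when `extend` is used) instead of appending one inner list per group. A sketch, with made-up `chunks` standing in for however the page is split per channel:

```python
import re

link_text = re.compile(r'<a[^>]*>(.*?)</a>', re.DOTALL)

# Hypothetical input: one HTML chunk per channel. How the real page is
# split into chunks is not shown in the thread.
chunks = [
    '<a href="/a">Government Access</a>',
    '<a href="/b">Price Is Right</a> <a href="/c">Guiding Light</a>',
]

nested = []
flat = []
for chunk in chunks:
    titles = link_text.findall(chunk)
    nested.append(titles)  # keeps one inner list per chunk
    flat.extend(titles)    # merges everything into a single list

print(nested)
print(flat)
```

Here `nested` keeps the list-in-a-list shape asked for, while `flat` shows the flattened result described in the question.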
Apr 27 '06 #2
Great I will test this out once I have the time... thanks for the quick
response

Apr 27 '06 #3
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.

Apr 27 '06 #4
I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)

Apr 27 '06 #5
If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)

BeautifulSoup isn't all that hard. Observe:
>>> from BeautifulSoup import BeautifulSoup
>>> html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The Price Is Right</a><em>'
>>> soup = BeautifulSoup(html)
>>> soup('a')
[<a href=""/tvpdb?d=tvp&id=167540528&">The Price Is Right</a>]
>>> for show in soup('a'):
...     print show.contents[0]
...
The Price Is Right
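
For readers who would rather not install anything, a similar result is possible with the standard library's `html.parser` module in Python 3. This is a sketch and not the Beautiful Soup API: the `LinkTextParser` class name is invented here, the href is shortened, and the `<b>` tag inside the link text is added purely to show that nested tags are handled.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.titles = []    # finished link texts
        self._parts = None  # text pieces of the link being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a link.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            # Join the pieces and normalise whitespace, including line breaks.
            self.titles.append(' '.join(''.join(self._parts).split()))
            self._parts = None


# Hypothetical input: the thread's sample with a shortened href and an
# extra <b> tag inside the link text.
html = ('10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528">'
        'The <b>Price</b> Is Right</a><em>')

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.titles)
```

The stray `</b>` before the link and the unclosed `<em>` after it are simply ignored, since the class only reacts to `a` tags.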

RunLevelZero wrote:
I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)


Apr 27 '06 #6
r'<a[^>]*>(.*?)</a>'

With a slight modification that did exactly what I wanted, and yes,
findall was the only way to get everything I needed, since I read the
whole document into a buffer.

Thanks a bunch.

Apr 27 '06 #7
Interesting... thank you.

Apr 27 '06 #8
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.
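
As an illustration of the "preceded by a certain sibling" case, a regex can require a time range directly before the link, so that unrelated links are skipped. This is a sketch under assumptions: the `page` string is invented (only its first timed entry resembles the sample in the question), and it reuses the looser time pattern suggested earlier in the thread.

```python
import re

# Hypothetical listing; the "Help" link has no time range in front of it.
page = '''<a href="/help">Help</a>
10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528">The
Price Is Right</a><em>
11:00am - 12:00pm:</b> <a href="/x">Guiding Light</a><em>'''

# A time such as 10:00am, using the looser pattern from earlier in the thread.
clock = r'[\d:]{4,5}[ap]m'

# Match a link only when a "start - end:" time range comes right before it.
timed_show = re.compile(
    r'(' + clock + r')\s*-\s*(' + clock + r'):\s*</b>\s*<a[^>]*>(.*?)</a>',
    re.DOTALL,
)

# With three groups, findall yields (start, end, title) tuples.
rows = [(start, end, ' '.join(title.split()))
        for start, end, title in timed_show.findall(page)]

for row in rows:
    print(row)
```

The pattern depends on the exact markup between the time and the link (`</b>` followed by whitespace), so it would need adjusting if the page layout differs.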

Apr 27 '06 #9
Edward Elliott <no****@127.0.0.1> wrote:
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak
havoc. Regexes can also help if you only want elements
preceded/followed by a certain sibling or cousin in the parse tree.
It all depends on what you're trying to accomplish. In general
though, yes parsers are better suited to extracting from markup.


A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (and that coming from a Perl programmer ;-) )

--
John MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Apr 27 '06 #10
Edward Elliott wrote:
jo********@gmail.com wrote:
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.


I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html.


Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.

Kent
Apr 28 '06 #11
