I have some data that I need to put into a list in a particular way. I
have that part figured out, but there is extra markup in the data that I
don't want.
Example:
10:00am - 11:00am:</b> <a
href="/tvpdb?d=tvp&id=167540528&cf=0&lineup=us_KS57836d&channels=us_KCTV&chspid=166030466&chname=CBS&progutn=1146150000&.intl=us">The
Price Is Right</a><em>
All I want is " Price Is Right "
Here is the re.
findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')
I have used a for loop to remove the extra data but then it ruins the
list that I am building. Basically I want the list to be something
like this.
[[Government Access], [Price Is Right, Guiding Light, Another show]]
The for loop just comma-delimits all of them, so I lose the list in a
list that I need. I hope I have explained this well enough. Any help
or ideas would be appreciated.
TIA
RunLevelZero wrote: 10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The Price Is Right</a><em>
All I want is " Price Is Right "
Here is the re.
findshows = re.compile(r'(\d\d:\d\d\D\D\s-\s\d\d:\d\d\D\D:*.*</a><em>)')
1. A regex remembers everything it matches -- no need to wrap the entire
thing in parens. Just call group() on the returned MatchObject.
2. If all you want is the link text, you don't need to do so much matching.
If you don't need the time, don't match it in the first place. If you're
using it as a marker, try matching each time with r'[\d:]{4,5}[ap]m'. Not
as exact but a bit simpler. Or just r'[\d:apm]{6,7}'
3. To grab what's inside the link: r'<a[^>]*>(.*?)</a>'
4. If the link text itself contains html tags, you'll have to strip those
off separately. Extracting the text from arbitrarily nested html tags in
one shot requires a parser, not a regex.
5. If you're just going to run this regex repeatedly on an html doc and make
a list of the results, it's easier to read the whole doc into a string and
then use re.findall.
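As a rough illustration of points 3 to 5, here is a minimal sketch in Python 3. The `html` string is a hypothetical fragment modeled on the snippet in the question (hrefs shortened, second and third entries made up), and the names `link_text`, `any_tag`, and `shows` are only for illustration.

```python
import re

# Hypothetical page fragment modeled on the question's snippet.
# The hrefs are shortened and the last two entries are invented.
html = (
    '10:00am - 11:00am:</b> <a href="/tvpdb?id=1">The Price Is Right</a><em>\n'
    '11:00am - 12:00pm:</b> <a href="/tvpdb?id=2">Guiding Light</a><em>\n'
    '12:00pm - 1:00pm:</b> <a href="/tvpdb?id=3"><b>Another</b> show</a><em>\n'
)

# Point 3: capture only what sits between <a ...> and </a> (non-greedy).
link_text = re.compile(r'<a[^>]*>(.*?)</a>')
# Point 4: a separate pass to strip any tags left inside the link text.
any_tag = re.compile(r'<[^>]+>')

# Point 5: run findall once over the whole document string.
raw = link_text.findall(html)
shows = [any_tag.sub('', text) for text in raw]

print(shows)  # ['The Price Is Right', 'Guiding Light', 'Another show']
```

Because the pattern has exactly one group, findall returns just the captured link text for each match rather than the whole match.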
I have used a for loop to remove the extra data but then it ruins the list that I am building. Basically I want the list to be something like this.
[[Government Access], [Price Is Right, Guiding Light, Another show]]
The for loop just comma-delimits all of them, so I lose the list in a list that I need. I hope I have explained this well enough. Any help or ideas would be appreciated.
No one can help with that unless you show us how you're building your list.
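Since the list-building code was never posted, the following is only a guess at the shape of the problem: a sketch in Python 3 that keeps one inner list per block of HTML instead of flattening everything into a single list. The `blocks` input, the way the page is split into blocks, and the helper name `shows_by_block` are all assumptions.

```python
import re

LINK_TEXT = re.compile(r'<a[^>]*>(.*?)</a>')

def shows_by_block(blocks):
    """Return one inner list of link texts per HTML block.

    Building a new list for every block (rather than extending one
    shared list) is what preserves the list-in-a-list structure.
    """
    return [LINK_TEXT.findall(block) for block in blocks]

# Hypothetical input: one string per channel, already split out of the page.
blocks = [
    '<a href="/x?id=1">Government Access</a>',
    '<a href="/x?id=2">Price Is Right</a> <a href="/x?id=3">Guiding Light</a>',
]

nested = shows_by_block(blocks)
print(nested)  # [['Government Access'], ['Price Is Right', 'Guiding Light']]
```

If the original loop used something like `extend` or string joining on a single list, that would explain the flattened result, but that is speculation without the actual code.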
Great, I will test this out once I have the time. Thanks for the quick
response.
If you are parsing HTML, it may make more sense to use a package
designed especially for that purpose, like Beautiful Soup.
I considered that but what I need is simple and I don't want to use
another library for something so simple but thank you. Plus I don't
understand them all that well :)
If what you need is "simple," regular expressions are almost never the
answer. And how simple can it be if you are posting here? :)
BeautifulSoup isn't all that hard. Observe:

    >>> from BeautifulSoup import BeautifulSoup
    >>> html = '10:00am - 11:00am:</b> <a href="/tvpdb?d=tvp&id=167540528&[snip]>The Price Is Right</a><em>'
    >>> soup = BeautifulSoup(html)
    >>> soup('a')
    [<a href=""/tvpdb?d=tvp&id=167540528&">The Price Is Right</a>]
    >>> for show in soup('a'):
    ...     print show.contents[0]
    ...
    The Price Is Right
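For anyone who wants a parser but would rather not install another library, roughly the same extraction can be sketched with the standard library's `html.parser` in Python 3. This is a swapped-in technique, not the Beautiful Soup session above; the class name `LinkTextParser` and the shortened href are made up for the example.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self._in_link = False   # True while we are between <a> and </a>
        self._parts = []        # text chunks of the link being read
        self.link_texts = []    # finished link texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.link_texts.append(''.join(self._parts).strip())
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self._parts.append(data)

# Hypothetical input modeled on the question's snippet (href shortened).
html = ('10:00am - 11:00am:</b> <a href="/tvpdb?id=167540528">'
        'The Price Is Right</a><em>')

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.link_texts)  # ['The Price Is Right']
```

The stray `</b>` and the unclosed `<em>` in the input are simply ignored here, because the handlers only react to `a` tags.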
RunLevelZero wrote: I considered that but what I need is simple and I don't want to use another library for something so simple but thank you. Plus I don't understand them all that well :)
r'<a[^>]*>(.*?)</a>'
With a slight modification, that did exactly what I wanted. And yes,
findall was the only way to get everything I needed, since I had read
the whole page into a buffer.
Thanks a bunch.
Interesting... thank you.
jo********@gmail.com wrote: If you are parsing HTML, it may make more sense to use a package designed especially for that purpose, like Beautiful Soup.
I don't know Beautiful Soup, but one advantage regexes have over some
parsers is handling malformed html. Omitted closing tags can wreak havoc.
Regexes can also help if you only want elements preceded/followed by a
certain sibling or cousin in the parse tree. It all depends on what you're
trying to accomplish. In general though, yes parsers are better suited to
extracting from markup.
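One way to read the point about neighbouring elements: a regex can require that a link be preceded by a particular marker, here the time range from the original question. The sketch below is Python 3, reuses the looser time pattern suggested earlier in the thread, and runs on a hypothetical two-line input; the names `slot_and_show` and `matches` are illustrative only.

```python
import re

# A time range, then ":</b>", then the link whose text we want.
# Only links that directly follow such a time range will match.
slot_and_show = re.compile(
    r'([\d:]{4,5}[ap]m)\s*-\s*([\d:]{4,5}[ap]m):</b>\s*<a[^>]*>(.*?)</a>'
)

# Hypothetical input: one listing line and one unrelated link.
html = (
    '10:00am - 11:00am:</b> <a href="/tvpdb?id=1">The Price Is Right</a><em>\n'
    '<a href="/help">Help</a>\n'
)

matches = slot_and_show.findall(html)
print(matches)  # [('10:00am', '11:00am', 'The Price Is Right')]
```

The unrelated `Help` link is skipped because no time range precedes it, which is the kind of context check that is awkward to express with a simple "find all `a` tags" call.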
Edward Elliott <no****@127.0.0.1> wrote: jo********@gmail.com wrote: If you are parsing HTML, it may make more sense to use a package designed especially for that purpose, like Beautiful Soup.
I don't know Beautiful Soup, but one advantage regexes have over some parsers is handling malformed html. Omitted closing tags can wreak havoc. Regexes can also help if you only want elements preceded/followed by a certain sibling or cousin in the parse tree. It all depends on what you're trying to accomplish. In general though, yes parsers are better suited to extracting from markup.
A parser can be written in such a way that it doesn't give up on malformed
HTML. Probably less hard than coming up with regexes that handle HTML
that's well-formed. (And that's coming from a Perl programmer ;-) )
--
John MexIT: http://johnbokma.com/mexit/
personal page: http://johnbokma.com/
Experienced programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Edward Elliott wrote: jo********@gmail.com wrote: If you are parsing HTML, it may make more sense to use a package designed especially for that purpose, like Beautiful Soup.
I don't know Beautiful Soup, but one advantage regexes have over some parsers is handling malformed html.
Beautiful Soup is intended to handle malformed HTML and seems to do
pretty well.
Kent