
Regex Question

Curtis Rutland
Expert 2.5K+
P: 3,256
I know almost nothing about Regular Expressions other than that they exist. However, I know that they're probably the answer to my co-worker's problem.

We need to strip some HTML out of some data. The biggest problem we have is the <p> tags. But they include some other attributes. For example:
  <p class="asdf1234">Some Text</p>
  <p class="qwer567890">Some Other Text</p>
Our end goal is
  Some Text
  Some Other Text
So far we've gotten a regex to remove the closing </p> and to get rid of an empty opening <p>, but if the tag includes any attributes, the regex won't match it.

Can anyone suggest a regex that will match "<p" + any number of characters/symbols + ">" for me? I'd appreciate it.
Jan 5 '10 #1

✓ answered by NeoPa

13 Replies


tlhintoq
Expert 2.5K+
P: 3,525
I know nothing of RegEx either, meaning my way is probably more brute force.
Get the indexes of the "<" and ">" characters
Discard everything before and including the first ">" and after and including the last "<"
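The two steps above could be sketched in Python roughly as follows. This is only an illustration: the function name `strip_outer_tag` is made up for this example, and it assumes each string holds exactly one element, such as a single line of the sample data.

```python
def strip_outer_tag(line):
    """Keep only the text between the first '>' and the last '<'.

    Hypothetical helper for illustration; assumes one element per string.
    """
    start = line.find(">")   # index of the first '>', or -1 if absent
    end = line.rfind("<")    # index of the last '<', or -1 if absent
    if start == -1 or end == -1 or end <= start:
        return line          # nothing that looks like a wrapping tag pair
    return line[start + 1:end]


if __name__ == "__main__":
    for row in ('<p class="asdf1234">Some Text</p>',
                '<p class="qwer567890">Some Other Text</p>'):
        print(strip_outer_tag(row))
```

Note that this only trims the outermost pair, so any tags nested inside the text would survive.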
Jan 5 '10 #2

bvdet
Expert Mod 2.5K+
P: 2,851
insertAlias,

In Python:
  "<p[^>]*>([^<>]+?)</p>"
Example:
  import re

  # [^>]* keeps the match inside a single opening tag,
  # even when several tags share one line
  patt = re.compile(r"<p[^>]*>([^<>]+?)</p>")

  def tag_text(s):
      """Return the text inside each <p ...>...</p> element of s."""
      output = []
      while True:
          m = patt.search(s)
          if m:
              output.append(m.group(1))
              s = s[m.end():]  # resume right after the match
          else:
              return output

  s = '''
  <p class="asdf1234">Some Text</p>
  <p class="qwer567890">Some Other Text</p>'''

  textList = tag_text(s)

  print(textList)
Output:
  ['Some Text', 'Some Other Text']
Jan 5 '10 #3

Markus
Expert 5K+
P: 6,050
@bvdet
The regex will largely be the same throughout languages.
Jan 5 '10 #4

NeoPa
Expert Mod 15k+
P: 31,419
  </?p[^>]*>
will match either the opening or the closing <p> tag.

If you go through your text replacing each match with an empty string, then you should have what you need.

< ==> Find string starting with <.
/? ==> Next it may, or may not, have a /.
p ==> A p must follow.
[^>] ==> Any character other than >.
* ==> Match any number of the preceding specification.
> ==> A > must follow.
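
For anyone who wants to try this pattern from code, here is a minimal Python sketch of the "replace with blank" step using `re.sub`. The names `P_TAG`, `P_TAG_STRICT` and `strip_p_tags` are invented for this example. The stricter variant with `\b` is an addition, not part of the pattern above: since the original only requires a `p` after the `<`, it would also match tags such as `<pre>`.

```python
import re

# Pattern from the post: an opening or closing tag whose name starts with "p".
P_TAG = re.compile(r"</?p[^>]*>")

# Assumed stricter variant (not from the thread): \b requires the tag name to
# stop right after "p", so <pre> or <param> are left alone. IGNORECASE also
# accepts <P>.
P_TAG_STRICT = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)


def strip_p_tags(html, pattern=P_TAG):
    """Replace every match of pattern with an empty string."""
    return pattern.sub("", html)


if __name__ == "__main__":
    data = ('<p class="asdf1234">Some Text</p>\n'
            '<p class="qwer567890">Some Other Text</p>')
    print(strip_p_tags(data))
```

Either way this is plain text substitution, so it is only suitable for simple, predictable markup like the sample data.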
Jan 6 '10 #5

Curtis Rutland
Expert 2.5K+
P: 3,256
I think Ade's is going to do it for me. I'll give it to my co-worker tomorrow and see if it works.

Thanks for the response. I've been meaning to learn regex, but I'm a big slacker.
Jan 6 '10 #6

MMcCarthy
Expert Mod 10K+
P: 14,534
IA if this answers the question (after consulting with your colleague) can you move this into misc. Otherwise it won't be public in searches.

Thanks

Mary
Jan 6 '10 #7

NeoPa
Expert Mod 15k+
P: 31,419
Sorry Mary. I should have done this before. There's no need to wait for the answer to be confirmed. That's where it should be anyway.
Jan 6 '10 #8

NeoPa
Expert Mod 15k+
P: 31,419
BTW. I learnt all I know about RegExes from the Help section of a utility called TextPad. A pretty powerful text editor (I'm sure there are various other good ones out there too) which supports them. If you click Help while in the Search or Replace dialog boxes it takes you to three pages of info and details. Go through that and practice a bit (I found it so powerful I didn't need to try to practice) and you'll be making them dance in no time. It also acts as a reference when you can sort of remember what you need but need a memory jog.

There are probably other resources more readily available on the web, but this is the only one I know well, as I learned from it and it did a good job for me.
Jan 6 '10 #9

bvdet
Expert Mod 2.5K+
P: 2,851
I learned about regular expressions a little at a time. I found this page to be very informative and easy to understand. It is helpful having a way to easily test a regular expression. When you cannot figure out why an expression won't work, editing in a regex debugger makes it almost tolerable. I have been using Kodos.
Jan 6 '10 #10

Markus
Expert 5K+
P: 6,050
The only problem is... where the hell is misc? It's disappeared from the navigation again.
Jan 6 '10 #11

NeoPa
Expert Mod 15k+
P: 31,419
You could try the Ask Question link at the top.

It's not there either, but you could try just for fun :D

Otherwise the breadcrumbs is good from here :S
Jan 6 '10 #12

dgreenhouse
Expert 100+
P: 250
I know this is a 3-week-old thread, but I'd like to note that the following book is the RegEx bible:

Mastering Regular Expressions by Jeffrey E. F. Friedl
Publisher: O'Reilly (it's currently in its 3rd edition).

It sits on my desk to the right; I've barely touched its depths. It hurts your head! :-)
Feb 2 '10 #13

Markus
Expert 5K+
P: 6,050
That's going on my wishlist. Thanks, dgreenhouse.
Feb 2 '10 #14
