Bytes IT Community

Regular Expression Help, getting over the newline \n

P: 3
Hello all,

I am trying to parse an HTML file, but every time I bump into the newline character my regex stops. How do I hit the newline, skip it, and then continue grabbing text until the next paragraph starts? When I try re.DOTALL it is too greedy and grabs the paragraph dividers as well. Thanks so much!
import re

# Sample HTML text:
text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'

# My regex:
results = open("results.txt", "a")
speechPattern = re.compile(r'''
<p>&nbsp;&nbsp;&nbsp;
(.*)
''', re.VERBOSE)
test = speechPattern.findall(text)
results.writelines(test)
results.close()

Thanks again!

Law
May 7 '07 #1
5 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
If you must have a regex solution, this will not help:
>>> text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'
>>> [s.strip() for s in text.replace('\n', '').split('<p>&nbsp;&nbsp;&nbsp;') if s != '']
['We operate forever. We will become Representatives.', 'Any conference']
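For readers who do want a regex in the mix, here is a minimal sketch of the same split-and-clean idea using re.split. It assumes the paragraph marker is always <p> followed by one or more &nbsp; entities (an assumption on my part; the sample only shows exactly three), and the names chunks and paragraphs are just illustrative.

```python
import re

text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

# Split on the paragraph marker. Assumption: the marker is "<p>" followed
# by one or more "&nbsp;" entities; the marker is dropped from the output.
chunks = re.split(r'<p>(?:&nbsp;)+', text)

# Collapse newlines and runs of whitespace into single spaces, and skip
# the empty chunk that appears before the first marker.
paragraphs = [' '.join(chunk.split()) for chunk in chunks if chunk.strip()]

print(paragraphs)
```

Because the newline is treated as ordinary whitespace during clean-up, no DOTALL flag is needed with this approach.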
May 7 '07 #2

P: 3
A great example of my beginner's eyes not seeing a better way; thanks so much!
May 7 '07 #3

Expert 100+
P: 511
To match across newlines, add re.DOTALL | re.M to your re.compile() call, e.g.:
re.compile("regexp", re.DOTALL | re.M)
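A minimal sketch of how this could be applied to the sample text from the question, with one adjustment for the greediness problem mentioned there: the quantifier is made lazy (.*?) and a lookahead stops each match at the next <p> or at the end of the string. The variable names and the whitespace clean-up are my own assumptions, not part of the original posts. re.M is left out because it only changes how ^ and $ behave, and this pattern uses neither.

```python
import re

text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

# Greedy: with DOTALL, "." also matches "\n", so (.*) runs to the end of
# the string and swallows the second paragraph marker.
greedy = re.compile(r'<p>&nbsp;&nbsp;&nbsp;(.*)', re.DOTALL)
greedy_result = greedy.findall(text)

# Lazy: (.*?) takes as little as possible, and the lookahead ends each
# match just before the next "<p>" or at the end of the string (\Z).
lazy = re.compile(r'<p>&nbsp;&nbsp;&nbsp;(.*?)(?=<p>|\Z)', re.DOTALL)

# Collapse the embedded newlines and extra spaces in each captured paragraph.
paragraphs = [' '.join(match.split()) for match in lazy.findall(text)]

print(len(greedy_result))  # a single match containing both paragraphs
print(paragraphs)
```

The lookahead does not consume the next marker, so findall can start its next match right at that <p>.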
May 8 '07 #4

bartonc
Expert 5K+
P: 6,596
This is worth a second look, so I'm bumping the thread. I'm studying regular expressions at the moment, and it bugged me that I didn't know the answer. Then, out of the blue, while reading Mastering Regular Expressions, it came to me {Python has a DOTALL flag}. Thanks, GD, for your succinct expertise on this matter.
May 25 '07 #5

Expert 100+
P: 511
hey bc no prob...:) yup that book is good.
May 26 '07 #6
