Bytes | Software Development & Data Engineering Community

Regular Expression Help, getting over the newline \n

3
Hello all,

I am trying to parse an HTML file, but every time I bump into the newline character my regex stops. How do I hit the newline, skip it, and then continue grabbing text until the next paragraph starts? When I try re.DOTALL it is too greedy and grabs the paragraph dividers as well. Thanks so much!
import re

# Sample HTML text:
text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'

# My regex:
results = open("results.txt","a")
speechPattern = re.compile(r'''
<p>&nbsp;&nbsp;&nbsp;
(.*)
''', re.VERBOSE)
test = speechPattern.findall(text)
results.writelines(test)
results.close()

Thanks again!

Law
May 7 '07 #1
bvdet
2,851 Expert Mod 2GB
If you must have a regex solution, this will not help:
>>> text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'
>>> [s.strip() for s in text.replace('\n', '').split('<p>&nbsp;&nbsp;&nbsp;') if s != '']
['We operate forever. We will become Representatives.', 'Any conference']
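The same split-based idea can be wrapped in a small helper for reuse. This is only a sketch: the function name `split_paragraphs` and its `marker` parameter are made up for illustration, and it filters on the stripped chunk rather than on `s != ''`, so whitespace-only chunks are dropped too.

```python
def split_paragraphs(text, marker='<p>&nbsp;&nbsp;&nbsp;'):
    """Return the text between successive markers, newlines removed.

    Hypothetical helper, not part of the original reply.
    """
    # Drop the newlines first so each paragraph becomes a single line.
    flattened = text.replace('\n', '')
    # Splitting on the marker leaves an empty first chunk; skip blank chunks.
    return [chunk.strip() for chunk in flattened.split(marker) if chunk.strip()]


text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

paragraphs = split_paragraphs(text)
print(paragraphs)
```

If the results go to a file with `writelines()`, note that it does not add line endings, so each paragraph would need its own `'\n'` appended.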
May 7 '07 #2
BLaw
3
A great example of my beginner's eyes not seeing a better way; thanks so much!
May 7 '07 #3
ghostdog74
511 Expert 256MB
To match across newlines over multiple lines, add re.DOTALL | re.M to your re.compile() statement, e.g.:
re.compile("regexp", re.DOTALL|re.M)
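For the greediness mentioned in the original question, one possible way to use the flag is to pair it with a lazy quantifier and a lookahead. The pattern below is a sketch under that assumption, not the poster's exact regex; the name `speech_pattern` and the choice to stop at the next `<p>` or at the end of the text are illustrative.

```python
import re

text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

# re.DOTALL lets "." match "\n" as well.
# "(.*?)" is lazy, so it takes as little text as possible, and the lookahead
# "(?=<p>|\Z)" ends each match just before the next "<p>" or at end of text.
# re.M is left out here: it only changes how "^" and "$" behave.
speech_pattern = re.compile(r'<p>(?:&nbsp;)+\s*(.*?)\s*(?=<p>|\Z)', re.DOTALL)

paragraphs = speech_pattern.findall(text)
print(paragraphs)
```

The captured paragraphs keep their internal newlines; something like `p.replace('\n', '')` could be applied afterwards if single-line text is wanted.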
May 8 '07 #4
bartonc
6,596 Expert 4TB
This is worth a second look, so I'm bumping the thread. I'm studying regular expressions at the moment, and it bugged me that I didn't know the answer. Then, out of the blue, while reading Mastering Regular Expressions, it came to me {Python has a DOTALL flag}. Thanks, GD, for your succinct expertise on this matter.
May 25 '07 #5
ghostdog74
511 Expert 256MB
hey bc no prob...:) yup that book is good.
May 26 '07 #6
