Hello all,
I am trying to parse an HTML file, but every time I hit a newline character my regex stops matching. How do I match across the newline and keep grabbing text until the next paragraph starts? When I try re.DOTALL it is too greedy and grabs the paragraph dividers as well. Thanks so much!
import re

# Sample HTML text:
text = '<p> We operate forever. \nWe will become Representatives. \n<p> Any conference'

# My regex:
results = open("results.txt","a")
speechPattern = re.compile(r'''
<p>
(.*)
''', re.VERBOSE)
test = speechPattern.findall(text)
results.writelines(test)
results.close()
Thanks again!
Law
bvdet (Expert Mod):
If you must have a regex solution, this will not help:
>>> text = '<p> We operate forever. \nWe will become Representatives. \n<p> Any conference'
>>> [s.strip() for s in text.replace('\n', '').split('<p> ') if s != '']
['We operate forever. We will become Representatives.', 'Any conference']
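The string-method approach above can be wrapped in a small function. This is only a sketch: the name split_paragraphs is made up for illustration, and it differs from the one-liner in two ways that are assumptions about what is wanted. It splits on '<p>' without requiring a trailing space, and it turns each newline into a single space instead of deleting it.

```python
def split_paragraphs(text):
    """Split on the literal '<p>' tag and normalise whitespace in each piece.

    Hypothetical helper, not part of any library.
    """
    pieces = text.split('<p>')
    # p.split() with no argument splits on any run of whitespace,
    # including '\n', so joining with ' ' collapses it to one space.
    return [' '.join(p.split()) for p in pieces if p.strip()]


text = '<p> We operate forever. \nWe will become Representatives. \n<p> Any conference'
paragraphs = split_paragraphs(text)
print(paragraphs)
# ['We operate forever. We will become Representatives.', 'Any conference']
```

Like the original one-liner, this only handles a bare '<p>' tag; tags with attributes or different capitalisation would need a real HTML parser.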
A great example of my beginner's eyes not seeing a better way; thanks so much!
To match across newlines, add re.DOTALL | re.M to the flags in your re.compile() call, e.g.:
re.compile("regexp", re.DOTALL|re.M)
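For the original question, here is a sketch of how the flags could be combined with the poster's verbose pattern. Two things are worth noting: it is re.DOTALL that lets '.' match a newline (re.M only changes how ^ and $ behave, so it is left out here), and the "too greedy" problem is handled by the non-greedy '.*?' plus a lookahead. The lookahead and the whitespace clean-up are my additions, not something from the posts above.

```python
import re

text = '<p> We operate forever. \nWe will become Representatives. \n<p> Any conference'

speech_pattern = re.compile(r'''
    <p>            # opening paragraph tag
    (.*?)          # non-greedy: as little text as possible, newlines included
    (?=<p>|\Z)     # stop just before the next <p> or at the end of the string
''', re.VERBOSE | re.DOTALL)

# findall returns the captured group for each match; collapse the
# leftover newlines and extra spaces into single spaces.
paragraphs = [' '.join(m.split()) for m in speech_pattern.findall(text)]
print(paragraphs)
# ['We operate forever. We will become Representatives.', 'Any conference']
```

With a plain greedy '(.*)' and re.DOTALL, the single match would run to the end of the string and swallow the second '<p>', which is the behaviour described in the question.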
This is worth a second look, so I'm bumping the thread. I'm studying regular expressions at the moment, and it bugged me that I didn't know the answer. Then, out of the blue, while reading Mastering Regular Expressions, it came to me: Python has a DOTALL flag. Thanks, GD, for your succinct expertise on this matter.
hey bc no prob...:) yup that book is good.