Regular Expression Help, getting over the newline \n

Hello all,

I am trying to parse an HTML file, but every time I bump into the newline character my regex stops. How do I hit the newline, skip it, and then continue grabbing text until the next paragraph starts? When I try re.DOTALL it is too greedy and grabs the paragraph dividers as well. Thanks so much!

# Sample HTML text:
text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'

# My regex:
import re

results = open("results.txt", "a")

speechPattern = re.compile(r'''
    <p>&nbsp;&nbsp;&nbsp;
    (.*)
''', re.VERBOSE)

test = speechPattern.findall(text)
results.writelines(test)
results.close()

Thanks again!

Law

May 7 '07 #1


bvdet

If you must have a regex solution, this will not help:

>>> text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'
>>> [s.strip() for s in text.replace('\n', '').split('<p>&nbsp;&nbsp;&nbsp;') if s != '']
['We operate forever. We will become Representatives.', 'Any conference']

>>>
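For readers who do want a regex, here is a minimal sketch rather than a definitive answer. It assumes every paragraph starts with <p> followed by one or more &nbsp; entities, as in the sample text. The name speech_pattern and the lookahead for the next <p> are illustrative choices, not something from the original post. The idea is to keep re.DOTALL so that "." crosses newlines, but make the quantifier non-greedy so the match stops before the next paragraph divider.

```python
import re

text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

# Assumption: paragraphs are introduced by <p> plus one or more &nbsp; entities.
speech_pattern = re.compile(r'''
    <p>(?:&nbsp;)+     # paragraph divider
    (.*?)              # paragraph body; non-greedy, and DOTALL lets it cross newlines
    (?=<p>|\Z)         # stop just before the next <p> or at the end of the text
''', re.VERBOSE | re.DOTALL)

# Collapse newlines and runs of spaces inside each captured paragraph.
paragraphs = [" ".join(body.split()) for body in speech_pattern.findall(text)]
print(paragraphs)
# ['We operate forever. We will become Representatives.', 'Any conference']
```

This gives the same list as the split-based answer above. For real-world HTML, a proper HTML parser is usually more robust than either approach.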

May 7 '07 #2

BLaw

A great example of my beginner's eyes not seeing a better way; thanks so much!

May 7 '07 #3

ghostdog74


To match newlines across multiple lines, add re.DOTALL | re.M to your re.compile() statement, e.g.:
re.compile("regexp", re.DOTALL|re.M)
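As a small illustration of what the two flags do (the sample string and variable names here are made up for the demo): re.DOTALL is the flag that lets "." match a newline, while re.M (re.MULTILINE) only changes where ^ and $ can match. Note that with re.DOTALL a greedy ".*" or ".+" will run to the end of the text, which is likely the "too greedy" behaviour described in the first post; a non-greedy ".*?" is the usual way to rein it in.

```python
import re

sample = "first line\nsecond line"

# No flags: "." stops at the newline, so each line is matched separately.
plain = re.findall(r".+", sample)

# re.DOTALL: "." also matches "\n", so the greedy ".+" swallows everything.
dotall = re.findall(r".+", sample, re.DOTALL)

# re.M does not affect "."; it lets ^ match at the start of every line.
no_multiline = re.findall(r"^\w+", sample)
multiline = re.findall(r"^\w+", sample, re.M)

print(plain)         # ['first line', 'second line']
print(dotall)        # ['first line\nsecond line']
print(no_multiline)  # ['first']
print(multiline)     # ['first', 'second']
```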

May 8 '07 #4

bartonc


This is worth a second look, so I'm bumping the thread. I'm studying regular expressions at the moment, and it bugged me that I didn't know the answer. Then, out of the blue, while reading Mastering Regular Expressions, it came to me {Python has a DOTALL flag}. Thanks, GD, for your succinct expertise on this matter.

May 25 '07 #5

ghostdog74


hey bc no prob...:) yup that book is good.

May 26 '07 #6
