matching a sentence, greedy up!

Christian Buck

Hi,

I'm writing a regexp that matches complete sentences in a German text
and correctly ignores abbreviations. Here is a very simplified version of
it; as soon as it works I could post the complete regexp if anyone is
interested (actually 11 kB):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you can see, I use [] for character sets because I don't want to depend
on locales, and speed doesn't matter. (I removed the German characters in
the above example.) I also allow - and _ within a sentence.

OK, this is what I think it should do:
[A-Z] - start with an uppercase character
(?: - non-capturing group
[^\.\?\!]+ - eat everything that does not look like an end
| - OR
[^a-zA-Z0-9\-_] - accept a non-word character
(?: - followed by ...
[a-zA-Z0-9\-_]\. - a character and a dot like 'i.', '1.' (doesn't work!)
| - OR
\d+\. - a number and a dot
| - OR
a\.[\s\-]?A\. - some common abbreviations (only one here)
)){3,} - several times, at least 3
[\.\?\!]+ - this is the end, and should also match '...'
(?!\s[a-z]) - not followed by whitespace and a lowercase character
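
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of the simplified one-line pattern shown above; the name SENTENCE is made up for illustration, and the unescaped [.?!] is assumed to be equivalent to [\.\?\!].

```python
import re

# The simplified pattern from the post, laid out with re.VERBOSE so that
# each part carries its own comment. The name SENTENCE is illustrative.
SENTENCE = re.compile(r"""
    [A-Z]                       # start with an uppercase character
    (?:                         # non-capturing group
        [^.?!]+                 #   anything that does not look like an end
      |                         #   OR
        [^a-zA-Z0-9\-_]         #   a non-word character, followed by ...
        (?:
            [a-zA-Z0-9\-_]\.    #     one character and a dot ('i.', '1.')
          | \d+\.               #     a number and a dot
          | a\.[\s\-]?A\.       #     a sample abbreviation
        )
    ){3,}                       # repeated at least 3 times
    [.?!]+                      # the end; also matches '...'
    (?!\s[a-z])                 # not followed by whitespace + lowercase
""", re.VERBOSE)

sample = 'My text may i. E. look like this: This is the end.'
matches = SENTENCE.findall(sample)
```

In verbose mode whitespace outside character classes is ignored, so the layout does not change what the pattern matches.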

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
# Adjacent raw-string literals are concatenated into one pattern.
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+('
                     r'?:(?!\s[a-z]))')
mo = re_satz.search(s)
if mo:
    print("found:")
    # findall returns every non-overlapping match as a string
    for sentence in re_satz.findall(s):
        print("Sentence: " + sentence)
else:
    print("not found :-(")
- snip -

Output:
found:
Sentence: My text may i.
Sentence: This is the end.

Why isn't the above regexp greedier, and why doesn't it match the whole sentence?

thx in advance

Christian

Jul 18 '05 #1

Helmut Jarausch

Christian Buck wrote:

Why isn't the above regexp greedier, and why doesn't it match the whole sentence?

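First, you don't need to escape '.', '?' or '!' within a character class [].
(A few characters, such as ']', '\' and a '-' between two other characters,
still need care there.)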
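
A quick way to check this is to compare the escaped and unescaped character classes on the same text. This is a minimal sketch; the variable names and the test string are made up for illustration.

```python
import re

# Inside [], the characters . ? ! are literal, so the backslashes are optional.
escaped = re.compile(r'[\.\?\!]+')
plain = re.compile(r'[.?!]+')

text = 'Really?! Yes... No.'
escaped_hits = escaped.findall(text)
plain_hits = plain.findall(text)
```

Both patterns should find the same runs of end-of-sentence punctuation.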

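The very first part, r'[A-Z](?:[^\.\?\!]+, cannot be greedier since
you exclude the '.', so it matches up to but not including the first dot.
The abbreviation branch would have to start at the space before 'i', but
the first alternative has already consumed that space, and the engine stops
at the first overall match it finds while backtracking. After that, as far
as I can see, nothing else fits, so the output is just what I expected.
How do you think you can differentiate between the end of a sentence and
(the first part of) an abbreviation?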
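
One possible experiment, not a general fix: since alternatives are tried from left to right, putting the abbreviation branch before the catch-all branch lets the engine try ' i.' and ' E.' as abbreviations before it settles for the short match. The sketch below only reorders the two alternatives of the simplified pattern; the name SENTENCE_ABBR_FIRST is made up, and it is only checked against the sample string from the post.

```python
import re

# Same building blocks as the posted pattern, but the abbreviation branch
# comes first so it is tried before the catch-all [^.?!]+ branch.
SENTENCE_ABBR_FIRST = re.compile(
    r'[A-Z](?:'
    r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)'  # abbreviation
    r'|[^.?!]+'                                                # anything else
    r'){3,}[.?!]+(?!\s[a-z])'
)

sample = 'My text may i. E. look like this: This is the end.'
reordered_matches = SENTENCE_ABBR_FIRST.findall(sample)
```

On the sample this yields the whole string as a single match. The question above still stands, though: a sentence that really ends in a single letter and a dot would now be glued to the one that follows, so the pattern cannot tell such an ending from an abbreviation.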
--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Jul 18 '05 #2
