matching a sentence, greedy up!

Hi,

I'm writing a regexp that matches complete sentences in a German text
and correctly ignores abbreviations. Here is a very simplified version of
it; as soon as it works I could post the complete regexp if anyone is
interested (actually 11 kB):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you see, I use [] for character sets because I don't want to depend on
locales, and speed doesn't matter. (I removed the German characters in the
above example.) I also allow - and _ within a sentence.
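
A minimal sketch of what such explicit, locale-independent character sets might look like once German characters are put back. The umlauts and 'ß' are an assumption about which characters were removed, and the names UPPER, LOWER and german_words are only illustrative:

```python
import re

# Explicit character ranges, so nothing depends on locale settings.
# Assumption: the removed German characters are the umlauts and 'ß'.
UPPER = 'A-ZÄÖÜ'
LOWER = 'a-zäöüß'

# Words may also contain digits, '_' and '-' (hyphen last, so it is literal).
german_words = re.findall('[' + UPPER + LOWER + '0-9_-]+',
                          'Über-Größe, schön_gut 42')
print(german_words)
```

The comma and the spaces are not in the set, so they separate the words.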

OK, this is what I think it should do:
[A-Z] - start with an uppercase char
(?: - open a non-capturing group
[^\.\?\!]+ - eat everything that does not look like an end
| - OR
[^a-zA-Z0-9\-_] - accept a non-word character
(?: - followed by ...
[a-zA-Z0-9\-_]\. - a char and a dot like 'i.', '1.' (doesn't work!!!)
| - OR
\d+\. - a number and a dot
| - OR
a\.[\s\-]?A\. - some common abbreviations (only one here)
)){3,} - repeated, at least 3 times
[\.\?\!]+ - this is the end, and should also match '...'
(?!\s[a-z]) - not followed by whitespace and a lowercase char
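
The breakdown above can also be kept inside the pattern itself. A minimal sketch, assuming Python's re.VERBOSE flag and the simplified pattern from this post; the names RE_SATZ_VERBOSE, sample and verbose_matches are only illustrative:

```python
import re

# Same simplified pattern, written with re.VERBOSE so that each part
# carries its own comment. Whitespace in the pattern is ignored.
RE_SATZ_VERBOSE = re.compile(r"""
    [A-Z]                      # start with an uppercase char
    (?:                        # non-capturing group
        [^.?!]+                # anything that does not look like an end
      |                        # OR
        [^a-zA-Z0-9_-]         # a non-word character
        (?:                    # followed by ...
            [a-zA-Z0-9_-]\.    # a char and a dot like 'i.' or '1.'
          | \d+\.              # a number and a dot
          | a\.[\s-]?A\.       # one sample abbreviation
        )
    ){3,}                      # repeated, at least 3 times
    [.?!]+                     # the end; also matches '...'
    (?!\s[a-z])                # not followed by whitespace + lowercase
""", re.VERBOSE)

sample = 'My text may i. E. look like this: This is the end.'
verbose_matches = RE_SATZ_VERBOSE.findall(sample)
print(verbose_matches)
```

The order of the alternatives is unchanged, so this should behave exactly like the single-line pattern.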

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
# The simplified pattern from above, split across adjacent raw strings.
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+('
                     r'?:(?!\s[a-z]))')
mo = re_satz.search(s)
if mo:
    print("found!")
    # No capturing groups, so findall returns the full matches.
    for sentence in re_satz.findall(s):
        print("Sentence: " + sentence)
else:
    print("not found :-(")

- snip -

Output:
found!
Sentence: My text may i.
Sentence: This is the end.

Why isn't the above regexp greedier, so that it matches the whole sentence?

thx in advance

Christian
Jul 18 '05 #1
Christian Buck wrote:
Why isn't the above regexp greedier, so that it matches the whole sentence?


First, you don't need to escape '.', '?' or '!' within a character class []; only '\', ']', a leading '^' and a '-' between two characters are special there.
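
A quick check of that point, as a minimal sketch using only the standard re module (the variable names are illustrative):

```python
import re

# Inside [], the characters '.', '?' and '!' are literal with or
# without a backslash, so both classes find the same characters.
escaped = re.findall(r'[\.\?\!]', 'a.b?c!')
plain = re.findall(r'[.?!]', 'a.b?c!')
print(escaped == plain)

# A '-' between two characters is different: it forms a range.
literal_hyphen = re.findall(r'[a\-z]', 'a-m-z')  # 'a', '-' or 'z'
char_range = re.findall(r'[a-z]', 'a-m-z')       # any letter a..z
print(literal_hyphen, char_range)
```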

The very first part, r'[A-Z](?:[^\.\?\!]+', cannot be greedier since
you exclude the '.', so it matches up to but not including the first dot.
Now, as far as I can see, nothing else fits, so the output is just what
I expected. How do you think you can differentiate between the end of a
sentence and (the first part of) an abbreviation?
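
To make this concrete, the sketch below first shows where the catch-all part stops. It then tries one possible variation: putting the abbreviation branch before the catch-all, so that ' i.' and ' E.' are consumed as abbreviations before the end-of-sentence part gets a chance. This is only a guess at what is wanted, not a tested solution for real German text, and the names WORD, ABBREV and re_reordered are illustrative:

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# The catch-all excludes '.', so on its own it stops before the first dot.
first_part = re.match(r'[A-Z](?:[^.?!]+)', s).group()
print(first_part)

# Illustrative building blocks (these names are not from the original post).
WORD = r'[a-zA-Z0-9_-]'
ABBREV = r'[^a-zA-Z0-9_-](?:' + WORD + r'\.|\d+\.|a\.[\s-]?A\.)'

# Same pieces as the original pattern, but the abbreviation branch is
# tried before the catch-all branch.
re_reordered = re.compile(
    r'[A-Z](?:' + ABBREV + r'|[^.?!]+){3,}[.?!]+(?!\s[a-z])'
)
reordered = re_reordered.findall(s)
print(reordered)
```

On the sample string the reordered pattern should return the whole text as a single match, since ':' is not treated as a sentence end. Whether the {3,} bound and the lookahead still behave as wanted on real text would need checking.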
--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Jul 18 '05 #2
