matching a sentence, greedy up!

Hi,

I'm writing a regexp that matches complete sentences in a German text
and correctly ignores abbreviations. Here is a very simplified version of
it; as soon as it works I could post the complete regexp if anyone is
interested (actually 11 kB):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you see, I use [] for character sets because I don't want to depend on
locales, and speed doesn't matter. (I removed the German characters in the
above example.) I also allow - and _ within a sentence.

OK, this is what I think it should do:
[A-Z]             - start with an uppercase char
(?:               - non-capturing group
[^\.\?\!]+        - eat everything that does not look like an end
|                 - OR
[^a-zA-Z0-9\-_]   - accept a non-word character
(?:               - followed by ...
[a-zA-Z0-9\-_]\.  - a char and a dot like 'i.', '1.' (doesn't work!)
|                 - OR
\d+\.             - a number and a dot
|                 - OR
a\.[\s\-]?A\.     - some common abbreviations (only one here)
)){3,}            - repeated, at least 3 times
[\.\?\!]+         - this is the end, and should also match '...'
(?!\s[a-z])       - not followed by whitespace and a lowercase char
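
For readers who want to experiment, the breakdown above can be written straight into the pattern. This is only a sketch, assuming Python's re module and its re.VERBOSE flag (which ignores whitespace and #-comments outside character classes). The names SENTENCE and SAMPLE are made up for illustration; the pattern itself is the simplified one from the post.

```python
import re

# The simplified sentence pattern from the post, one piece per line.
# SENTENCE and SAMPLE are illustrative names, not from the original script.
SENTENCE = re.compile(r"""
    [A-Z]                        # start with an uppercase char
    (?:                          # non-capturing group
        [^\.\?\!]+               #   anything that does not look like an end
      |                          #   OR
        [^a-zA-Z0-9\-_]          #   a non-word character ...
        (?:                      #   ... followed by
            [a-zA-Z0-9\-_]\.     #     a char and a dot, like 'i.' or '1.'
          | \d+\.                #     a number and a dot
          | a\.[\s\-]?A\.        #     an abbreviation (only one here)
        )
    ){3,}                        # repeated, at least 3 times
    [\.\?\!]+                    # the end; also matches '...'
    (?!\s[a-z])                  # not followed by whitespace + lowercase char
""", re.VERBOSE)

SAMPLE = 'My text may i. E. look like this: This is the end.'

# There are no capturing groups, so findall() returns the whole matches.
print(SENTENCE.findall(SAMPLE))
```

With the sample text this should reproduce the two fragments reported under "Output" below rather than the single sentence that was hoped for.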

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
# Same simplified pattern as above, split across adjacent raw strings.
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
                     r'(?:(?!\s[a-z]))')
mo = re_satz.search(s)
if mo:
    print("found:")
    # No capturing groups, so findall() returns the full matches.
    for sentence in re_satz.findall(s):
        print("Sentence: " + sentence)
else:
    print("not found :-(")

- snip -

Output:
found:
Sentence: My text may i.
Sentence: This is the end.

Why isn't the above regexp greedier, so that it matches the whole sentence?

Thanks in advance

Christian
Jul 18 '05 #1
Christian Buck wrote:
Why isn't the above regexp greedier, so that it matches the whole sentence?


First, you don't need to escape '.', '?' or '!' within a character class []; they have no special meaning there.
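
A quick way to check this is to compare both spellings on the sample text. The snippet below is a sketch using Python's re module; the variable names are made up for illustration. It also shows one common way to keep the '-' literal without a backslash, namely putting it last in the class.

```python
import re

sample = 'My text may i. E. look like this: This is the end.'

# '.', '?' and '!' are ordinary characters inside [], escaped or not.
escaped = re.compile(r'[^\.\?\!]+')
plain = re.compile(r'[^.?!]+')
print(escaped.findall(sample) == plain.findall(sample))

# The hyphen is the one to watch: escaped, or placed last, it is a literal '-'.
word_escaped = re.compile(r'[a-zA-Z0-9\-_]+')
word_last = re.compile(r'[a-zA-Z0-9_-]+')
print(word_escaped.findall('a-b_c 1.') == word_last.findall('a-b_c 1.'))
```

Both comparisons should print True.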

The very first part, r'[A-Z](?:[^\.\?\!]+', cannot be greedier since
you exclude the '.'. So it matches up to but not including the first dot.
Now, as far as I can see, nothing else fits, so the output is just what
I expected. How do you think you can differentiate between the end of a
sentence and (the first part of) an abbreviation?
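
The pieces can be tried one at a time to see where the match stops. This is a rough sketch with Python's re module; the variable names are illustrative, and the sub-patterns are cut out of the simplified regexp by hand, so they only approximate what the engine does inside the full pattern.

```python
import re

text = 'My text may i. E. look like this: This is the end.'

# First alternative alone: a run of characters that are not '.', '?' or '!'.
run = re.match(r'[A-Z][^.?!]+', text)
first_run = run.group()        # stops right before the first dot
next_char = text[run.end()]    # the character that stopped it

# The lookahead only rejects whitespace followed by a lowercase letter.
# After the first dot comes ' E', so the lookahead raises no objection.
lookahead_ok = re.match(r'(?!\s[a-z])', text[run.end() + 1:]) is not None

# In isolation, the "char and a dot" branch does find ' i.' in the text.
abbrev = re.search(r'[^a-zA-Z0-9\-_][a-zA-Z0-9\-_]\.', text)

print(repr(first_run), repr(next_char), lookahead_ok, repr(abbrev.group()))
```

So the abbreviation branch is not broken as such. The first alternative simply comes first and has already consumed ' i', and a backtracking engine returns the first path that succeeds rather than the longest possible match. Since 'E' is uppercase, the final lookahead accepts the dot after 'i' as a sentence end.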
--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Jul 18 '05 #2
