matching a sentence, greedy up!

Hi,

I'm writing a regexp that matches complete sentences in a German text
and correctly ignores abbreviations. Here is a very simplified version of
it; as soon as it works I could post the complete regexp if anyone is
interested (actually 11 kb):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you see, I use [] for charsets because I don't want to depend on
locales, and speed doesn't matter. (I removed the German chars in the above
example.) I also allow - and _ within a sentence.

OK, this is what I think it should do:
[A-Z] - start with an uppercase char
(?: - non-capturing group
[^\.\?\!]+ - eat everything that does not look like an end
| - OR
[^a-zA-Z0-9\-_] - accept a non-word character
(?: - followed by ...
[a-zA-Z0-9\-_]\. - a char and a dot like 'i.', '1.' (doesn't work!)
| - OR
\d+\. - a number and a dot
| - OR
a\.[\s\-]?A\. - some common abbreviations (only one here)
)){3,} - several times, at least 3
[\.\?\!]+ - this is the end, and should also match '...'
(?!\s[a-z]) - not followed by lowercase chars
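
The same breakdown can also live inside the pattern itself. This is only a minimal sketch, assuming Python 3 and the re.VERBOSE flag; the name RE_SATZ is illustrative, and the escapes for '.', '?' and '!' inside the character classes are dropped because they are not required there.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments outside character classes,
# so each part of the pattern can carry its own explanation.
RE_SATZ = re.compile(r"""
    [A-Z]                       # start with an uppercase char
    (?:                         # non-capturing group
        [^.?!]+                 #   everything that does not look like an end
      |                         #   OR
        [^a-zA-Z0-9\-_]         #   a non-word character ...
        (?:                     #   ... followed by
            [a-zA-Z0-9\-_]\.    #     a char and a dot, like 'i.' or '1.'
          | \d+\.               #     a number and a dot
          | a\.[\s\-]?A\.       #     a known abbreviation
        )
    ){3,}                       # at least three repetitions
    [.?!]+                      # the sentence end, including '...'
    (?!\s[a-z])                 # not followed by a lowercase char
""", re.VERBOSE)

s = 'My text may i. E. look like this: This is the end.'
result = RE_SATZ.findall(s)
print(result)  # ['My text may i.', 'This is the end.']
```

The behaviour is unchanged from the one-line pattern, so it still stops after 'i.'.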

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
# Adjacent raw-string literals are concatenated into a single pattern.
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
                     r'(?!\s[a-z])')
mo = re_satz.search(s)
if mo:
    print("found:")
    # findall returns every non-overlapping match, left to right
    for sentence in re_satz.findall(s):
        print("Sentence:", sentence)
else:
    print("not found :-(")

- snip -

Output:
found:
Sentence: My text may i.
Sentence: This is the end.

Why isn't the above regexp greedier, so that it matches the whole sentence?

thx in advance

Christian
Jul 18 '05 #1
Christian Buck wrote:
Why isn't the above regexp greedier, so that it matches the whole sentence?


First, you don't need to escape '.', '?' or '!' within a character group [].
Only ']', '\', a leading '^' and a '-' between two characters are special there.
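
A quick check of that point, as a small sketch in Python 3 (the sample string and variable names are made up for illustration):

```python
import re

text = 'Is it? Yes! Done.'

# The escaped and unescaped character classes describe the same set.
escaped = re.findall(r'[^\.\?\!]+', text)
plain = re.findall(r'[^.?!]+', text)

print(escaped)           # ['Is it', ' Yes', ' Done']
print(escaped == plain)  # True
```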

The very first part, r'[A-Z](?:[^\.\?\!]+, cannot be greedier since
you exclude the '.'. So it matches up to but not including the first dot.
Now, as far as I can see, nothing else fits, so the output is just what
I expected. How do you think you can differentiate between the end of a
sentence and (the first part of) an abbreviation?
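
The stopping point can be checked step by step. The following is a sketch in Python 3, with illustrative variable names, that applies pieces of the posted pattern to the sample string. The {3,} requirement does not prevent the short match, because the engine can split the run before the dot into three pieces.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# The first alternative excludes '.', so it can never run past the first dot.
head = re.match(r'[A-Z][^.?!]+', s)
print(repr(head.group()))   # 'My text may i'

# What follows is a dot, a space and an uppercase letter ...
rest = s[head.end():]
print(repr(rest[:3]))       # '. E'

# ... so the terminator plus the negative lookahead succeed right here:
# the lookahead only rejects a following lowercase letter.
ends_here = re.match(r'[.?!]+(?!\s[a-z])', rest) is not None
print(ends_here)            # True
```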
--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Jul 18 '05 #2
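
One possible way to approach the closing question in the reply is to try the abbreviation rule before the ordinary-character rule at every position, so that a dot belonging to an abbreviation is consumed as part of the sentence body. This is a rough sketch and not the poster's full pattern. ABBREV and SENTENCE are hypothetical names, and the abbreviation rule (a single letter or a number, followed by a dot and not preceded by a word character) is an assumption that will misjoin sentences which really do end in a single letter or a number.

```python
import re

# Hypothetical, deliberately tiny abbreviation rule: one letter or a run of
# digits plus a dot, not preceded by another word character.
ABBREV = r'(?<![A-Za-z0-9_-])(?:[A-Za-z]|\d+)\.'

SENTENCE = re.compile(
    r'[A-Z]'                         # uppercase start
    r'(?:' + ABBREV + r'|[^.?!])*'   # abbreviation first, else one ordinary char
    r'[.?!]+'                        # terminator(s), also matches '...'
    r'(?!\s[a-z])'                   # not followed by a lowercase char
)

s = 'My text may i. E. look like this: This is the end.'
whole = SENTENCE.findall(s)
print(whole)   # ['My text may i. E. look like this: This is the end.']

german = SENTENCE.findall('Das ist z. B. ein Test. Hier ist der zweite Satz!')
print(german)  # ['Das ist z. B. ein Test.', 'Hier ist der zweite Satz!']
```

Because the alternation is ordered, the body only falls back to the single-character branch where the abbreviation branch does not apply, and the sentence can then end only at a dot that was not consumed as part of an abbreviation.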
