Bytes IT Community

NLP, Python and period

Hi,

are you aware of any nlp packages or algorithms in Python to spot
whether a '.' represents an end of sentence or rather something else (eg
Mr., fo*@home.co.uk, etc)?

Thanks

F.
Aug 4 '08 #1

On 4 Aug, 11:59, Fred Mangusta <a...@bbb.it> wrote:
> Hi,
>
> are you aware of any nlp packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (eg
> Mr., f...@home.co.uk, etc)?

I wouldn't mind finding out about such packages, either. I see that
NLTK offers a few options, with the following tokeniser being
interesting if you don't mind training the software:

http://nltk.org/doc/guides/tokenize....unkt-tokenizer

There was also discussion of this topic on Ned Batchelder's blog a
while back:

http://nedbatchelder.com/blog/200804...sentences.html

My comment on there (that I'm using a regular expression with some
postprocessing) still stands.

Paul
Aug 4 '08 #2

On Aug 4, 7:59 pm, Fred Mangusta <a...@bbb.it> wrote:
> Hi,
>
> are you aware of any nlp packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (eg
> Mr., f...@home.co.uk, etc)?

google("python nltk") ... it may do what you want.
Aug 4 '08 #3

Hi Paul,

thanks for replying. I'm interested in knowing more about your regex
approach, but as you point out in your comment, it seems that access to
the SourceForge mail archive is restricted. Is there any way I can read
about it? Would you be so kind as to paste it here, for instance?

Thanks!
F.

Paul Boddie wrote:
> There was also discussion of this topic on Ned Batchelder's blog a
> while back:
>
> http://nedbatchelder.com/blog/200804...sentences.html
>
> My comment on there (that I'm using a regular expression with some
> postprocessing) still stands.
>
> Paul
Aug 4 '08 #4

On 4 Aug, 12:34, Fred Mangusta <a...@bbb.it> wrote:
>
> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, it seems that access to
> the SourceForge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind as to paste it here, for instance?

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

import re

sentence_pattern = re.compile(
    r'(' +
        r'[\(\"\[]*' +   # Opening quotes or brackets (optional)
        r'[A-Za-z0-9]' + # First character of the sentence
        r'.+?' +         # Sentence content ("?" makes it non-greedy)
        r'[\.\!\?]' +    # End of sentence
        r'[\)\"\]]*' +   # Closing quotes or brackets
    r')' +
    r'(\s+)' +           # Spaces
    r'[\(\"\[]*' +       # Opening quotes or brackets (optional)
    r'[A-Z0-9]'          # First character of the next sentence
    )

This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations, and I maintain a list of these
against which I test the supposed ends of sentences that the regular
expression provides. In addition, I try to detect initials (e.g.
G. van Rossum), which the regular expression may regard as the end of a
sentence.
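
The postprocessing described above could be sketched roughly as follows.
This is only an illustration, not the code posted to SourceForge: the
abbreviation list, the sample text and the names ABBREVIATIONS, BOUNDARY,
INITIAL and split_sentences are all made up here, and the boundary regex
is a simplified one that uses a lookahead so that the first character of
the next sentence is not consumed.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "eg.", "e.g.", "etc."}

# Candidate boundary: ".", "!" or "?" plus optional closing quotes or
# brackets, then whitespace, then something that looks like the start
# of a sentence (checked with a lookahead, so it is not consumed).
BOUNDARY = re.compile(r'([.!?][)"\]]*)\s+(?=[(\["]*[A-Z0-9])')

# A single capital letter followed by a period, e.g. "G."
INITIAL = re.compile(r'^[A-Z]\.$')

def split_sentences(text):
    """Split text at candidate boundaries, skipping abbreviations and initials."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)
        # Last whitespace-delimited token before the candidate boundary.
        last_word = text[start:end].split()[-1].strip('()[]"')
        if last_word.lower() in ABBREVIATIONS or INITIAL.match(last_word):
            continue  # Probably not a real end of sentence; keep scanning.
        sentences.append(text[start:end])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = ('Mr. Smith met G. van Rossum and T. Peters. '
          'They wrote to fred@home.co.uk today! Was it useful?')
for sentence in split_sentences(sample):
    print(sentence)
```

The periods inside the e-mail address are never candidates because no
whitespace follows them. The initials check is crude: a sentence that
genuinely ends in a single capital letter would be merged with the next
one.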

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul
Aug 4 '08 #5
