Bytes IT Community

find and replace with regular expressions

I am using regular expressions to search a string (always full
sentences, maybe more than one sentence) for common abbreviations and
remove the periods. I need to break the string into different
sentences but split('.') doesn't solve the whole problem because of
possible periods in the middle of a sentence.

So I have...

----------------

import re

middle_abbr = re.compile('[A-Za-z0-9]\.[A-Za-z0-9]\.')

# this will find abbreviations like e.g. or i.e. in the middle of a
# sentence.
# then I want to remove the periods.

----------------

I want to keep the ie or eg but just take out the periods. Any
ideas? Of course, newString = middle_abbr.sub('', txt), where txt is the
string, will take out the entire abbreviation, alphanumeric characters
included.
Jul 31 '08 #1
6 Replies


>>> import re
>>> middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')
>>> s = 'A test, i.e., an example.'
>>> a = middle_abbr.search(s)    # find the abbreviation
>>> b = re.compile(r'\.')        # period pattern
>>> c = b.sub('', a.group(0))    # remove periods from abbreviation
>>> d = middle_abbr.sub(c, s)    # substitute new abbr for old
>>> d
'A test, ie, an example.'
Jul 31 '08 #2


A more versatile version:

import re

middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')
s = 'A test, i.e., an example.'
a = middle_abbr.search(s)        # find the abbreviation
b = re.compile(r'\.')            # period pattern
c = b.sub('', a.group(0))        # remove periods from abbreviation
d = middle_abbr.sub(c, s)        # substitute new abbr for old

print(d)
print()
print()

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr."""

a = middle_abbr.search(s)        # find the abbreviation
c = b.sub('', a.group(0))        # remove periods from abbreviation
d = middle_abbr.sub(c, s)        # substitute new abbr for old

print(d)
print()
print()

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr.
A multi-test, e.g., one with different abbr."""

done = False

while not done:
    a = middle_abbr.search(s)            # find the abbreviation
    if a:
        c = b.sub('', a.group(0))        # remove periods from abbreviation
        s = middle_abbr.sub(c, s, 1)     # substitute new abbr for old ONCE
    else:                                # repeat until all removed
        done = True

print(s)

## A test, ie, an example.
##
##
## A test, ie, an example.
## Yet another test, ie, example with 2 abbr.
##
##
## A test, ie, an example.
## Yet another test, ie, example with 2 abbr.
## A multi-test, eg, one with different abbr.
Jul 31 '08 #3
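
For what it's worth, the search-and-substitute-once loop can probably be collapsed into a single pass, because re.sub also accepts a function as the replacement and calls it once per match. A minimal sketch, assuming the same pattern as above; the helper name strip_periods is made up for illustration:

```python
import re

# Same pattern as in the thread: two single alphanumerics, each followed by a period.
middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

def strip_periods(match):
    """Return the matched abbreviation with its periods removed (hypothetical helper)."""
    return match.group(0).replace('.', '')

s = """A test, i.e., an example.
Yet another test, i.e., example with 2 abbr.
A multi-test, e.g., one with different abbr."""

# sub() calls strip_periods once for every non-overlapping match,
# so each abbreviation is rewritten from its own text.
result = middle_abbr.sub(strip_periods, s)
print(result)
```

Since each match is rewritten from its own text, mixed abbreviations (i.e. and e.g.) should not need the while loop.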

On Jul 31, 3:07 pm, chrispoliq...@gmail.com wrote:
> middle_abbr = re.compile('[A-Za-z0-9]\.[A-Za-z0-9]\.')
When defining re's with string literals, it is good practice to use
the raw string literal format (precede with an 'r'):
middle_abbr = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

What abbreviations have numeric digits in them?

I hope your input string doesn't include something like this:
For a good approximation of pi, use 3.1.

-- Paul
Jul 31 '08 #4
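
Both points above can be checked quickly. The sketch below first shows why raw strings matter (in a normal literal '\b' is a backspace character, not a word boundary), then shows the original pattern matching the number in the pi sentence. The letters_only pattern is only an assumption about what the original poster might want, not something proposed in the thread:

```python
import re

# In a normal string literal '\b' is one backspace character;
# in a raw string it stays as backslash + 'b', which re reads as a word boundary.
print(len('\b'), len(r'\b'))

original = re.compile(r'[A-Za-z0-9]\.[A-Za-z0-9]\.')

# Hypothetical tightening: letters only, starting at a word boundary.
letters_only = re.compile(r'\b[A-Za-z]\.[A-Za-z]\.')

text = 'For a good approximation of pi, use 3.1.'

print(original.search(text).group(0))   # the digits-and-periods false positive
print(letters_only.search(text))        # no match in this sentence
print(letters_only.search('A test, i.e., an example.').group(0))
```

Dropping the digits is a trade-off: it avoids treating 3.1. as an abbreviation, but it would also skip any abbreviation that really does contain a digit.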

It's recommended that you use raw strings for regular
expressions.

Capture the letters using parentheses:

middle_abbr = re.compile(r'([A-Za-z0-9])\.([A-Za-z0-9])\.')

and replace what was found with what was captured:

newString = middle_abbr.sub(r'\1\2', txt)

HTH
Jul 31 '08 #5
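
Put together as a runnable snippet, the capture-and-replace suggestion looks roughly like this. The sample strings are invented for illustration, and the second one shows a limit of the two-letter pattern rather than anything claimed in the thread:

```python
import re

# Each parenthesised group captures one character; the periods stay outside the groups.
middle_abbr = re.compile(r'([A-Za-z0-9])\.([A-Za-z0-9])\.')

txt = 'A test, i.e., an example. A multi-test, e.g., one with different abbr.'

# \1 and \2 refer back to the two captured characters, so only the periods are dropped.
new_string = middle_abbr.sub(r'\1\2', txt)
print(new_string)

# The pattern only covers two-letter abbreviations: in a three-letter one,
# the first two letters match and the last period is left in place.
print(middle_abbr.sub(r'\1\2', 'made in the U.S.A. today'))
```

The sentence-ending periods survive because each one is followed by a space or the end of the string, so the second character class has nothing to match.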

It's impossible with regex. You could try it with a statistical analysis;
and even this would give you a good split.
Aug 1 '08 #6

"and even this won't* give you a good split." :P
Aug 1 '08 #7
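
To make the limits concrete, here is a rough sketch of the regex-only route: strip the periods from two-letter abbreviations, then split after sentence-ending punctuation. The function name split_sentences, the letters-only pattern and the sample strings are all assumptions for illustration. It handles the easy cases and visibly breaks on a title like "Dr.", which is the kind of case the replies above are warning about:

```python
import re

# Assumed pattern: two single letters, each followed by a period (e.g. "i.e.", "e.g.").
middle_abbr = re.compile(r'\b([A-Za-z])\.([A-Za-z])\.')

def split_sentences(text):
    """Naive splitter (hypothetical): remove abbreviation periods, then split
    on whitespace that follows '.', '!' or '?'."""
    cleaned = middle_abbr.sub(r'\1\2', text)
    parts = re.split(r'(?<=[.!?])\s+', cleaned)
    return [p for p in parts if p]

text = 'A test, i.e., an example. Yet another test, e.g., one more. Done.'
print(split_sentences(text))

# A case the heuristic gets wrong: the period after "Dr" is treated as a sentence end.
print(split_sentences('Dr. Smith arrived. He left.'))
```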
