Counting elements in a list wildcard

If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed

Apr 24 '06 #1
"Ryan Ginstrom" <ry***@gol.com> writes:
If there are specific spellings you want to allow, you could just
create a list of them and see if your Suzy is in there:
>>> possible_suzys = ['Susy', 'Susi', 'Susie']
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane']
>>> for line in my_strings:
...     if line in possible_suzys:
...         print(line)
...
Susi


If you wanted to do something later, rather than only during the scan
over the list, getting a list of suzies would probably be more useful:
>>> possible_suzys = ['Susy', 'Susi', 'Susie']
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
>>> found_suzys = [s for s in my_strings if s in possible_suzys]
>>> found_suzys
['Susi', 'Susy']
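
Since the original question was about counting, the same membership test can feed a count directly. This is only a small sketch in current Python; using a set instead of a list for the accepted spellings is my own choice here, not something from the post above.

```python
# Accepted spellings, stored as a set for fast membership tests.
possible_suzys = {'Susy', 'Susi', 'Susie'}
my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# Count every name that is one of the accepted spellings.
suzy_count = sum(1 for s in my_strings if s in possible_suzys)
print(suzy_count)  # 2
```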

--
\ "The number of UNIX installations has grown to 10, with more |
`\ expected." -- Unix Programmer's Manual, 2nd Ed., 12-Jun-1972 |
_o__) |
Ben Finney

Apr 25 '06 #2
hawkesed wrote:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed


You might want to check out the SoundEx and MetaPhone algorithms which
provide approximations of the "sound" of a word based on spelling
(assuming English pronunciations).
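
To give a feel for what such a phonetic key looks like, here is a minimal sketch of the classic American Soundex rules, written from the commonly published description and not taken from any of the modules linked in this thread. The function name `soundex` and the sample list are illustrative only.

```python
def soundex(name):
    """Return a 4-character American Soundex key (sketch, ASCII names assumed)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    key = letters[0]                    # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # code of the last "significant" letter
    for ch in letters[1:]:
        code = codes.get(ch, "")        # vowels, H, W, Y have no code
        if code and code != prev:
            key += code
        if ch not in "HW":              # H and W do not separate equal codes
            prev = code
    return (key + "000")[:4]            # pad or truncate to 4 characters


my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
target = soundex('Susie')
print(target)                                              # S200
print(sum(1 for s in my_strings if soundex(s) == target))  # 2
```

Comparing keys instead of raw strings is what lets 'Susi', 'Susy' and 'Susie' count as the same name.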

Apparently a soundex module used to be built into Python but was
removed in 2.0. You can find several implementations on the 'net, for
example:

http://orca.mojam.com/~skip/python/soundex.py
http://aspn.activestate.com/ASPN/Coo...n/Recipe/52213

MetaPhone is generally considered better than SoundEx for "sounds-like"
matching, although it's considerably more complex (IIRC, although it's
been a long time since I wrote an implementation of either in any
language). A Python MetaPhone implementation (there must be more than
this one?):

http://joelspeters.com/awesomecode/

Another algorithm that might interest you isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:

http://trific.ath.cx/resources/python/levenshtein/

Whichever algorithm you go with, you'll wind up with some sort of
"similar" function which could be applied in a similar manner to Ben's
example (I've just mocked up the following -- it's not an actual
session):
>>> import soundex
>>> import metaphone
>>> import levenshtein
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if soundex.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if metaphone.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if levenshtein.distance(s, 'Susy') < 4]
>>> found_suzys

['Susi', 'Susy'] (one hopes anyway!)
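
For anyone who wants to try the edit-distance idea without installing anything, below is a plain-Python sketch of the Levenshtein distance (the standard dynamic-programming version, not the C module linked above). The function name and the cutoff of 2 are my own assumptions. Note that with a four-letter target, a cutoff of "< 4" would also let 'Sally' through, since it is only 3 edits away from 'Susy'.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
found_suzys = [s for s in my_strings if levenshtein(s, 'Susy') <= 2]
print(found_suzys)                    # ['Susi', 'Susy']
print(levenshtein('Sally', 'Susy'))   # 3
```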
HTH,

Dave.
--

Apr 25 '06 #3
Dave Hughes wrote:
Another algorithm that might interest isn't based on "sounds-like" but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:


I don't know what algorithm it uses, but the difflib module looks similar.
I've had good results using the get_close_matches function to locate
similarly-named mp3 files.
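
As a rough illustration of that approach, difflib.get_close_matches is in the standard library and takes a word, a list of candidates, a maximum number of results `n`, and a similarity `cutoff` between 0 and 1. The sample list and the choice of cutoff=0.6 (the documented default) are just for demonstration.

```python
import difflib

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# n defaults to 3, so raise it to the list length to avoid truncating
# the result when the goal is to count every close match.
close = difflib.get_close_matches('Susie', my_strings,
                                  n=len(my_strings), cutoff=0.6)
print(len(close), sorted(close))  # 2 ['Susi', 'Susy']
```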

However I don't think "close enough" is well suited for this application.
The sequences are short and non-distinct. Difference matching needs longer
sequences to be effective. Phoneme matching seems overly complex and might
grab things like Tsu-zi. I'd just use a list of alternate spellings like
Ben suggested.
Apr 25 '06 #4

hawkesed wrote:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed


Dare I suggest using REs? This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))

Some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")
john = re.compile("Jo(h)?n")
Iain
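
One detail worth knowing about the regex approach: pattern.match() only anchors at the start of the string, so a longer name that merely begins with a Susie-like prefix also counts. A hedged sketch for current Python (3.4 or later) using fullmatch() is below; the snake_case function name and the extra test name 'Susila' are my own additions for illustration.

```python
import re

susie = re.compile(r"Su(s|z)(i|ie|y)")

def count_matches(names, name_pattern):
    """Count names that match the pattern in their entirety."""
    return sum(1 for name in names if name_pattern.fullmatch(name))

names = ["John", "Suzy", "Peter", "Steven", "Susie", "Susi", "Susila"]

print(count_matches(names, susie))                # 3
# With match(), 'Susila' is counted too because it starts with 'Susi'.
print(sum(1 for n in names if susie.match(n)))    # 4
```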

Apr 25 '06 #5
On 25/04/2006 3:15 PM, Edward Elliott wrote:
Phoneme matching seems overly complex and might
grab things like Tsu-zi.


It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following; Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.

Apr 25 '06 #6
On 25/04/2006 6:26 PM, Iain King wrote:
hawkesed wrote:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed
Dare I suggest using REs? This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))

Some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")


What about Steffan, Etienne, Esteban, István, ... ?
john = re.compile("Jo(h)?n")


IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.
Apr 25 '06 #7

John Machin wrote:
On 25/04/2006 6:26 PM, Iain King wrote:
hawkesed wrote:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.

snip
steven = re.compile("Ste(v|ph|f)(e|a)n")


What about Steffan, Etienne, Esteban, István, ... ?


well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language. He just wanted to
catch variable spellings.
john = re.compile("Jo(h)?n")


IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.


Iain

Apr 25 '06 #8
On 25/04/2006 8:51 PM, Iain King wrote:
John Machin wrote:
On 25/04/2006 6:26 PM, Iain King wrote:
hawkesed wrote:
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
snip
steven = re.compile("Ste(v|ph|f)(e|a)n")

What about Steffan, Etienne, Esteban, István, ... ?


well, obviously these could be included:
"(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban)", but the OP never said he
wanted to translate anything into another language.


Neither did I. But if you have to cope with a practical situation like
where the birth certificate says István and the job application says
Steven and the foreman calls him Steve, you won't be stuffing about with
hand-crafted REs, one per popular given name. Could be worse: the punter
could have looked up a dictionary and changed his surname from Kovács to
Smith; believe me -- it happens.

Oh and if you cast your net as wide as the Pacific islands, chuck in
Sitiveni. That's enough examples. We won't go near Benjamin :-)

Apr 25 '06 #9
John Machin wrote:
On 25/04/2006 6:26 PM, Iain King wrote:
iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")


IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.


Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]
Apr 25 '06 #10
John Machin wrote:
On 25/04/2006 3:15 PM, Edward Elliott wrote:
Phoneme matching seems overly complex and might
grab things like Tsu-zi.


It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key "algorithm". Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following; Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.


Spelling isn't phonetic. The 't' character doesn't necessarily affect
pronunciation. Or it may affect pronunciation in a way the soundex
doesn't understand (think tonal languages). Latinizing foreign languages
raises all sorts of problems.

A soundex is only as good as its pronunciation database. It may work well
in many situations, but it isn't fool-proof.

Apr 25 '06 #11

Edward Elliott wrote:
John Machin wrote:
On 25/04/2006 6:26 PM, Iain King wrote:
iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")


IMHO, the amount of hand-crafting that goes into a *general-purpose*
phonetic matching algorithm is already bordering on overkill. Your
method using REs would not appear to scale well at all.


Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]


Somehow I'm the advocate for REs here, which: erg. But you have some
mighty convenient ellipses there...
compare:

steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...
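
A possible middle ground, if you want the compact notation to type but the plain list to read: build the list from its parts. This is only a sketch using itertools.product from the standard library; the variable names are made up for the example.

```python
import itertools
import re

middles = ["v", "ph", "f", "ff"]
endings = ["e", "a"]

# Every combination of middle and ending, e.g. 'Ste' + 'ph' + 'a' + 'n'.
steven_names = ["Ste" + m + e + "n"
                for m, e in itertools.product(middles, endings)]
print(steven_names)

# Sanity check: the regex from above accepts exactly these spellings.
steven_re = re.compile("Ste(v|ph|f|ff)(e|a)n")
print(all(steven_re.fullmatch(n) for n in steven_names))  # True
```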

Iain

Apr 26 '06 #12
Iain King wrote:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
"Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built
list of names...


Oh I agree, I'd rather *type* the former, but I'd rather *read* the
latter. :)
Apr 26 '06 #13
Iain King wrote:
steven = re.compile("Ste(v|ph|f|ff)(e|a)n")


Also you can expand the RE a bit to improve readability:

re.compile("(Stev|Steph|Stef|Steff)(en|an)")
Apr 26 '06 #14
