469,323 Members | 1,575 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,323 developers. It's quick & easy.

Help: Creating condensed expressions

Here's the problem: Given a list of item names like:

apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM

which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as:

apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM

The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list.

I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax), and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.

Thanks,
-David

--
Presenting:
mediocre nebula.

Mar 24 '06 #1
7 1283
This is a first try, is something like this enough for you?
data = """apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM""".split()

headtails = {}
for word in data:
head = word[:-1]
if head in headtails:
headtails[head].append(word[-1])
else:
headtails[head] = [word[-1]]
for head, tails in sorted(headtails.iteritems()):
if len(tails) == 1:
print head + tails[0]
else:
print head + "[%s]" % "".join(tails)
Output:
apple[12]
apple3_SD
form[ABC]
kca_MM
kla_M[MB]

It looks only the last letters. It modifies the original order (it
sorts the sequence on the root-word).
If you don't need sorted results you can remove the sorted().

Bye,
bearophile

Mar 24 '06 #2
David Hirschfield a écrit :
Here's the problem: Given a list of item names like:

apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM

which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as:

apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM

The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list.

I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax),
Looks like a very restricted subset of regular expressions.
and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.

import re

lines = """
apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM
""".strip().split()

patterns = [
r'apple[12]',
r'apple3_SD',
r'form[ABC]',
r'kla_M[MB]',
r'kca_MM',
]

for pat in patterns:
for line in lines:
m = re.match(pat, line)
print "%s match %s : %s" % (pat, line, m and "Yes" or 'No')
HTH
Mar 24 '06 #3
Bruno Desthuilliers a écrit :
David Hirschfield a écrit :
Here's the problem: (snip)
is there an efficient algorithm to create condensed forms that match
those items,


Sorry, answered to fast... Well, at least you have a way to test the
algo !-)
Mar 24 '06 #4
Nevermind, I didn't understand the problem/question... Sorry.

Bye,
bearophile

Mar 24 '06 #5
be************@lycos.com a écrit :
Nevermind, I didn't understand the problem/question... Sorry.


actually, I think *you* got it right.

Mar 24 '06 #6
David Hirschfield <da****@ilm.com> writes:
Here's the problem: Given a list of item names like: apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as: apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list. I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax), and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.


If you used regular expressions instead of globs then I think what you want is
to optimize the regular expression for:

'apple1|apple2|apple3_SD|formA|formB|formC|kla_MM| kla_MB|kca_MM'

then spit the optimised finite automaton back out. Of course if you're just
searching Python might do a decent job of optimising it anyway.

Looking at:

http://en.wikipedia.org/wiki/Finite_...e#Optimization

they make it sound so easy!. There's also a whole bunch of tools on that
page. Maybe there's an optimiser you can use in one of them.

Failing that I guess you build a tree and try to merge nodes where they fork.
I suspect an optimal solution would be hard but if you knew your input did
have lots of redundancy a simple algorithm might work.

Or I could be talking crap as usual.

Eddie
Mar 24 '06 #7
be************@lycos.com wrote:
Nevermind, I didn't understand the problem/question... Sorry.

Bye,
bearophile

Really? Your solution looks fine to me.

In any case, here's an alternative approach to the (based on the same
understanding of the problem as bearophile's, but with the additional
requirement that the input terms be sorted)
from itertools import groupby
from operator import itemgetter
src_iter = ((i[:-1],i[-1]) for i in src.splitlines()) #src must be sorted
grouped = groupby(src_iter, itemgetter(0))
stemmed = ((stem, "".join(i[1] for i in values)) for stem, values in grouped)
[(len(s[1])>1 and "%s[%s]" or "%s%s") % s for s in stemmed] ['apple[12]', 'apple3_SD', 'form[ABC]', 'kla_M[MB]', 'kca_MM']

Michael

Mar 24 '06 #8

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

10 posts views Thread by Jacek Generowicz | last post: by
45 posts views Thread by Joh | last post: by
10 posts views Thread by Tjerk Wolterink | last post: by
1 post views Thread by Rahul | last post: by
3 posts views Thread by Zach | last post: by
23 posts views Thread by codefire | last post: by
4 posts views Thread by =?Utf-8?B?anAybXNmdA==?= | last post: by
1 post views Thread by CARIGAR | last post: by
reply views Thread by zhoujie | last post: by
reply views Thread by Gurmeet2796 | last post: by
reply views Thread by mdpf | last post: by
reply views Thread by harlem98 | last post: by
reply views Thread by listenups61195 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.