On Sep 9, 9:12 am, "metaperl.com" <metap...@gmail.comwrote:

Pyparsing has a really nice feature that I want in PLY. I want to

specify a list of strings and have them converted to a regular

expression.

A Perl module which does an aggressively optimizing job of this is

Regexp::List -http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Lis...

I really dont care if the expression is optimal. So the goal is

something like:

vowel_regexp = oneOf("a aa i ii u uu".split()) # yielding r'(aa|a|uu|

u|ii|i)'

Is there a public module available for this purpose?

I was about to start a thread about a very similar problem; hope I'm

not hijacking this thread. However I am interested in a solution that

(1) scales to a potentially large number of alternate strings

(hundreds or thousands), which will hit, sooner or later, some max

limit in the builtin regexps and (2) is not limited to strings but can

be used for arbitrary objects. Therefore I am looking for a different

approach, perhaps a specialization of the general regexp search

algorithm.

More formally, given (a) an input sequence I and (b) a set of

'censored' sequences S, find and remove all the sequences in S that

appear in I. For instance, if

S = set([(a,), (a,b), (a,c), (c,), (d,a)])

and

I = (x, a, b, c, y, d, a, b), the filtered sequence would be:

O = (x, y, b)

i.e. the censored subsequences are (a,b), (c,), and (d,a).

As with regexps, if sequence `x` is a prefix of `y`, the longer

sequence `y` wins when both match (e.g. if (a,b) matches, it

supersedes (a,)).

For extra bonus, the algorithm should remove overlapping subsequences

too, i.e. for the input I above, the output would be (x, y) since both

(d,a) and (a,b) would match for (d,a,b).

With respect to complexity, I am mainly interested in len(S); len(I)

is small for my application, typically no more than 10. Of course, an

algorithm that scales decently in both len(S) and len(I) would be even

better. Any ideas or relevant links ?

George

PS: This is not a homework ;)