On Sat, 27 Dec 2003 15:23:34 -0500, "Terry Reedy" <tjreedy@udel.edu> wrote:
[color=blue]
>
>"Stephan Diehl" <stephan.diehlNOSPAM@gmx.net> wrote in message
>news:bskcb0$36n$00$1@news.t-online.com...[color=green]
>> All of this can be found at
>>
http://aspn.activestate.com/ASPN/Coo.../Recipe/252177
>>
>> What I was most surprised at was the inefficency of the trivial solutions
>> (and that the right algorithm makes indeed a difference).[/color]
>
>A common surprise. The science of algorithms (including empirical testing)
>gives real benefits.
>[color=green]
>> --[/color]
>os.path.commonprefix --------------------------------------------------[color=green]
>>
>> def f1(m):
>> "Given a list of pathnames, returns the longest common leading
>> component"
>> if not m: return ''
>> prefix = m[0][/color]
>
> prefix = m.pop() # avoids comparing prefix to itself as first item
>[color=green]
>> for item in m:
>> for i in range(len(prefix)):
>> if prefix[:i+1] != item[:i+1]:[/color]
>
>I am 99% sure that above is a typo and performance bug that should read
> if prefix[i:i+1] != item[i:i+1]:
>since all previous chars have been found equal in previous iteration.
>[color=green]
>> prefix = prefix[:i]
>> if i == 0:
>> return ''
>> break
>> return prefix[/color]
>
>Perhaps you could retest this with the two suggested changes. It will be
>somewhat faster.
>[color=green]
>> The problem with this algorithm is the copying of all those small[/color]
>strings.
>
>Since Python does not have pointers, comparing characters within strings
>requires pulling out the chars as separate strings. This is why using
>C-coded comparison functions may win even though more comparisons are done.
>
>The reason f1uses slicing (of len 1 after my correction) instead of
>indexing is to avoid exceptions when len(item) < len(prefix). However, all
>the +1s have a cost (and I suspect slicing does also), so it may pay to
>truncate prefix to the length of item first. The simplest fix for this
>(untested) gives
>
>def f9(m): # modified f1 == os.path.commonprefix
> "Given a list of strings, returns the longest common prefix"
> if not m: return ''
> prefix = m.pop()
> for item in m:
> prefix = prefix[:len(item)]
> for i in range(len(prefix)):
> if prefix[i] != item[i]:
> if not i: return ''
> prefix = prefix[:i]
> break
> return prefix
>[color=green]
>> This can be easily fixed
>> --- optimized[/color]
>os.path.commonprefix ----------------------------------------[color=green]
>>
>> def f2(m):
>> "Given a list of pathnames, returns the longest common leading
>> component"
>> if not m: return ''
>> if len(m) == 1: return m[0]
>> prefix = m[0]
>> for i in xrange(len(prefix)):
>> for item in m[1:]:
>> if prefix[i] != item[i]:
>> return prefix[:i]
>> return prefix[:i][/color]
>
>and easily messed up;-) If len(item) < len(prefix), item[i] throws
>exception. For this approach to work, prefix should be set as shortest
>string of m in preliminary loop. Also, you reslice m several times. Do it
>once before outer loop.
>[color=green]
>> It is just not nessesary to compare all strings in the list.[/color]
>
>Every string has to be compared to something at least once.
>[color=green]
>> It is enough to sort the list first
>> and then compare the first and the last element.[/color]
>
>Sorting compares all strings in the list to something at least once, and
>most more than once.
>[color=green]
>> Even though the 'sort' algorithm is coded in C and is therefore quite[/color]
>fast,[color=green]
>> the order of runtime has changed.[/color]
>
>The C part is what makes f3 faster. In your timings, 128 is not large
>enough for the nlogn component to be noticeable.
>[color=green]
>> Michael Dyck then pointed out that instead of using 'sort', 'min' and[/color]
>'max'[color=green]
>> should be used. While tests suggest that this is true, I have no idea why
>> that should be, since finding a minimum or maximum uses some sorting[/color]
>anyway
>
>No. max and min each do a linear scan. No sorting. But each does at
>least as many character comparisons as modified f1 or f2. The speedup is
>from looping and comparing in C, even though at least twice as many
>compares are done.
>[color=green]
>> You might have realized that the optimization so far was done one the[/color]
>number[color=green]
>> of strings. There is still another dimension to optimize in and that is[/color]
>the[color=green]
>> actual string comparing.
>> Raymond Hettinger suggests using a binary search:[/color]
>
>Since this only affects the final comparison of min and max, and not the n
>comparisons done to calculate each, the effect is minimal and constant,
>independent of number of strings.
>
>Since this compares slices rather than chars in each loop, I wonder whether
>this is really faster than linear scan anyway. I would like to see timing
>of f5 with min/max of f4 combined with linear scan of f3. (Basically, f3
>with sort removed and min/max added.) Since you changed two parts of f3 to
>get f4, we cannot be sure that both changes are each an improvement even
>though the combination of two is.
>
>def f5(seq):
> if not seq: return ''
> s1 = min(seq)
> s2 = max(seq)
> n = min(len(s1), len(s2))
> if not n: return '' # not required since s1[0:0] == ''
> for i in xrange(n) :
> if s1[i] != s2[i] :
> return s1[0:i]
> return s1[0:n]
>[/color]
I wonder about this version for speed (not very tested ;-):
[color=blue][color=green][color=darkred]
>>> def f10(m):[/color][/color][/color]
... "return longest common prefix of strings in seq"
... if not m: return ''
... prefix = m.pop()
... ssw = str.startswith
... for item in m:
... while not ssw(item, prefix): prefix = prefix[:-1]
... if not prefix: return ''
... return prefix
...[color=blue][color=green][color=darkred]
>>> f10('abcd abcd'.split())[/color][/color][/color]
'abcd'[color=blue][color=green][color=darkred]
>>> f10('abcd abce'.split())[/color][/color][/color]
'abc'[color=blue][color=green][color=darkred]
>>> f10('abcd abcd a'.split())[/color][/color][/color]
'a'[color=blue][color=green][color=darkred]
>>> f10('abcd abcd a x'.split())[/color][/color][/color]
''
Regards,
Bengt Richter