Some optimization tale

A while ago, I posted a recipe about finding the common prefix of a list
of strings. The recipe itself is quite bad (I have to admit), and I
didn't know at the time that this problem was already solved in the
os.path module.
The interesting part can be found in the commentaries, as this turned out to
be a quest for the most efficient algorithm to solve this particular
problem.
All of this can be found at
http://aspn.activestate.com/ASPN/Coo.../Recipe/252177
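
For reference, the stdlib function that the recipe reinvents can be called
directly (note that it matches character by character, not by path
component):

>>> import os.path
>>> os.path.commonprefix(['/usr/lib/python', '/usr/local/lib', '/usr/bin'])
'/usr/'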

What I was most surprised at was the inefficiency of the trivial solutions
(and that the right algorithm indeed makes a difference).

If you have read this far, you might be interested in the actual algorithms.
(If you are interested in the one-liner versions, please have a look at the
recipe.)
Here they are:

-- os.path.commonprefix --------------------------------------------------

def f1(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    prefix = m[0]
    for item in m:
        for i in range(len(prefix)):
            if prefix[:i+1] != item[:i+1]:
                prefix = prefix[:i]
                if i == 0:
                    return ''
                break
    return prefix

---------------------------------------------------------------------------

The problem with this algorithm is the copying of all those small strings.
This can easily be fixed:

--- optimized os.path.commonprefix ----------------------------------------

def f2(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    if len(m) == 1: return m[0]
    prefix = m[0]
    for i in xrange(len(prefix)):
        for item in m[1:]:
            if prefix[i] != item[i]:
                return prefix[:i]
    return prefix[:i]

---------------------------------------------------------------------------

Now it gets interesting. It turns out that the above algorithm doesn't
scale well.

Some anonymous submitter suggested the following

---- by anonymous ---------------------------------------------------------

def f3(seq):
    if not seq: return ""
    seq.sort()
    s1, s2 = seq[0], seq[-1]
    l = min(len(s1), len(s2))
    if l == 0:
        return ""
    for i in xrange(l):
        if s1[i] != s2[i]:
            return s1[0:i]
    return s1[0:l]

---------------------------------------------------------------------------

It is just not necessary to compare all strings in the list. It is enough to
sort the list first and then compare the first and the last element. Even
though the 'sort' algorithm is coded in C and is therefore quite fast, the
order of runtime has changed.
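
To see why the two endpoints suffice: after sorting, every string lies
lexicographically between the first and the last, so any prefix those two
share is shared by the whole list. A small example:

>>> seq = ['interval', 'internet', 'interstellar', 'internal']
>>> seq.sort()
>>> seq[0], seq[-1]
('internal', 'interval')

'internal' and 'interval' share 'inter', so 'internet' and 'interstellar'
in between necessarily share it too.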

Michael Dyck then pointed out that instead of using 'sort', 'min' and 'max'
should be used. While tests suggest that this is true, I have no idea why
that should be, since finding a minimum or maximum involves some sorting
anyway (if we don't have a quantum computer at hand), so my reasoning
would be that sorting once should be faster than computing both the maximum
and the minimum.

You might have realized that the optimization so far was done on the number
of strings. There is still another dimension to optimize in, and that is the
actual string comparison.
Raymond Hettinger suggests using a binary search:

---------------------------------------------------------------------------
def f4(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    a, b = min(m), max(m)
    lo, hi = 0, min(len(a), len(b))
    while lo < hi:
        mid = (lo+hi)//2 + 1
        if a[lo:mid] == b[lo:mid]:
            lo = mid
        else:
            hi = mid - 1
    return a[:hi]
----------------------------------------------------------------------------
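To see how the loop homes in on the answer, here is a worked trace (my own
example, not from the thread): with m = ['interstate', 'interstellar'], we
get a = 'interstate', b = 'interstellar' and lo, hi = 0, 10.

    lo=0, hi=10: mid=6   a[0:6] 'inters' == b[0:6]       -> lo = 6
    lo=6, hi=10: mid=9   a[6:9] 'tat'    != b[6:9] 'tel'  -> hi = 8
    lo=6, hi=8:  mid=8   a[6:8] 'ta'     != b[6:8] 'te'   -> hi = 7
    lo=6, hi=7:  mid=7   a[6:7] 't'      == b[6:7]        -> lo = 7

The loop exits with lo == hi == 7 and returns a[:7] == 'interst'.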
To give you some idea of the running times:

      f1          f3          f4      # of strings

['0.131058', '0.021471', '0.012050']        2
['0.214896', '0.041648', '0.012963']        4
['0.401236', '0.020444', '0.014707']        8
['0.841738', '0.026415', '0.018589']       16
['1.670606', '0.039348', '0.029020']       32
['3.184446', '0.065657', '0.044247']       64
['6.257635', '0.123510', '0.072568']      128
Every calculation was done 200 times.
Furthermore, the test set consists of only two distinct strings, so the
binary-search part of Raymond's solution only comes in as a constant factor.

Anyway, the fastest solution is up to 100 times faster than the trivial
one.
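
If you want to reproduce numbers of this kind, a rough harness along these
lines works (my sketch, not the script used above; the two-string test set
is an assumption based on the description, and under Python 3 the xrange
calls in f2/f3/f5 would need to become range):

import timeit

def make_testset(n):
    # two distinct strings with a long common prefix, repeated
    return ['/usr/local/lib/python/string.py',
            '/usr/local/lib/python/os.py'] * (n // 2)

for n in (2, 4, 8, 16, 32, 64, 128):
    data = make_testset(n)
    for f in (f1, f3, f4):
        # f3 sorts its argument in place, so hand each call a fresh copy
        t = timeit.timeit(lambda: f(list(data)), number=200)
        print('%s  n=%3d  %.6f' % (f.__name__, n, t))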

Cheers

Stephan

Jul 18 '05 #1

"Stephan Diehl" <st*****************@gmx.net> wrote in message
news:bs*************@news.t-online.com...
> All of this can be found at
> http://aspn.activestate.com/ASPN/Coo.../Recipe/252177
>
> What I was most surprised at was the inefficiency of the trivial solutions
> (and that the right algorithm indeed makes a difference).
A common surprise. The science of algorithms (including empirical testing)
gives real benefits.
> -- os.path.commonprefix ------------------------------------------------
>
> def f1(m):
>     "Given a list of pathnames, returns the longest common leading component"
>     if not m: return ''
>     prefix = m[0]

prefix = m.pop() # avoids comparing prefix to itself as first item

>     for item in m:
>         for i in range(len(prefix)):
>             if prefix[:i+1] != item[:i+1]:

I am 99% sure that the above is a typo and performance bug that should read

    if prefix[i:i+1] != item[i:i+1]:

since all previous chars have been found equal in previous iterations.

>                 prefix = prefix[:i]
>                 if i == 0:
>                     return ''
>                 break
>     return prefix

Perhaps you could retest this with the two suggested changes. It will be
somewhat faster.
> The problem with this algorithm is the copying of all those small strings.

Since Python does not have pointers, comparing characters within strings
requires pulling out the chars as separate strings. This is why using
C-coded comparison functions may win even though more comparisons are done.
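
There is no separate character type: each subscript is itself a string, so a
pure-Python character loop pays string-object overhead on every comparison.
A quick check (Python 2, to match the thread):

>>> s = 'spam'
>>> s[0], type(s[0])
('s', <type 'str'>)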

The reason f1 uses slicing (of length 1, after my correction) instead of
indexing is to avoid exceptions when len(item) < len(prefix). However, all
the +1s have a cost (and I suspect slicing does also), so it may pay to
truncate prefix to the length of item first. The simplest fix for this
(untested) gives

def f9(m):  # modified f1 == os.path.commonprefix
    "Given a list of strings, returns the longest common prefix"
    if not m: return ''
    prefix = m.pop()
    for item in m:
        prefix = prefix[:len(item)]
        for i in range(len(prefix)):
            if prefix[i] != item[i]:
                if not i: return ''
                prefix = prefix[:i]
                break
    return prefix
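
A quick check at the prompt (my examples; note that f9 pops an element, so
it mutates the list you pass in):

>>> f9(['/usr/lib', '/usr/local/lib'])
'/usr/l'
>>> f9(['abc'])
'abc'
>>> f9([])
''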
> This can be easily fixed
>
> --- optimized os.path.commonprefix --------------------------------------
>
> def f2(m):
>     "Given a list of pathnames, returns the longest common leading component"
>     if not m: return ''
>     if len(m) == 1: return m[0]
>     prefix = m[0]
>     for i in xrange(len(prefix)):
>         for item in m[1:]:
>             if prefix[i] != item[i]:
>                 return prefix[:i]
>     return prefix[:i]

and easily messed up ;-) If len(item) < len(prefix), item[i] throws an
exception. For this approach to work, prefix should be set as the shortest
string of m in a preliminary loop. Also, you reslice m several times. Do it
once before the outer loop.
> It is just not necessary to compare all strings in the list.

Every string has to be compared to something at least once.

> It is enough to sort the list first
> and then compare the first and the last element.

Sorting compares all strings in the list to something at least once, and
most more than once.

> Even though the 'sort' algorithm is coded in C and is therefore quite
> fast, the order of runtime has changed.

The C part is what makes f3 faster. In your timings, 128 is not large
enough for the n log n component to be noticeable.

> Michael Dyck then pointed out that instead of using 'sort', 'min' and
> 'max' should be used. While tests suggest that this is true, I have no
> idea why that should be, since finding a minimum or maximum involves some
> sorting anyway

No. max and min each do a linear scan. No sorting. But each does at
least as many character comparisons as modified f1 or f2. The speedup is
from looping and comparing in C, even though at least twice as many
compares are done.
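
For the record, a single linear pass is all it takes; conceptually, min does
something like this (a sketch of the idea, not CPython's actual
implementation):

def linear_min(seq):
    it = iter(seq)
    smallest = next(it)      # the real min raises ValueError on empty input
    for item in it:
        if item < smallest:  # one comparison per remaining element
            smallest = item
    return smallest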
> You might have realized that the optimization so far was done on the
> number of strings. There is still another dimension to optimize in, and
> that is the actual string comparison.
> Raymond Hettinger suggests using a binary search:


Since this only affects the final comparison of min and max, and not the n
comparisons done to calculate each, the effect is minimal and constant,
independent of the number of strings.

Since this compares slices rather than chars in each loop, I wonder whether
this is really faster than a linear scan anyway. I would like to see timing
of f5, with the min/max of f4 combined with the linear scan of f3.
(Basically, f3 with the sort removed and min/max added.) Since you changed
two parts of f3 to get f4, we cannot be sure that both changes are each an
improvement, even though the combination of the two is.

def f5(seq):
    if not seq: return ''
    s1 = min(seq)
    s2 = max(seq)
    n = min(len(s1), len(s2))
    if not n: return ''  # not required since s1[0:0] == ''
    for i in xrange(n):
        if s1[i] != s2[i]:
            return s1[0:i]
    return s1[0:n]

Terry J. Reedy
Jul 18 '05 #2
Terry Reedy wrote:

[...]

> and easily messed up ;-) If len(item) < len(prefix), item[i] throws an
> exception. For this approach to work, prefix should be set as the shortest
> string of m in a preliminary loop. Also, you reslice m several times. Do
> it once before the outer loop.
>
>> It is just not necessary to compare all strings in the list.
>
> Every string has to be compared to something at least once.

You are right, of course. I have to admit that I've been too sloppy in my
descriptions (and too sloppy in my thinking).

[...]

>> Michael Dyck then pointed out that instead of using 'sort', 'min' and
>> 'max' should be used. While tests suggest that this is true, I have no
>> idea why that should be, since finding a minimum or maximum involves
>> some sorting anyway
>
> No. max and min each do a linear scan. No sorting. But each does at
> least as many character comparisons as modified f1 or f2. The speedup is
> from looping and comparing in C, even though at least twice as many
> compares are done.

Makes sense.

>> You might have realized that the optimization so far was done on the
>> number of strings. There is still another dimension to optimize in, and
>> that is the actual string comparison.
>> Raymond Hettinger suggests using a binary search:
>
> Since this only affects the final comparison of min and max, and not the
> n comparisons done to calculate each, the effect is minimal and constant,
> independent of the number of strings.
>
> Since this compares slices rather than chars in each loop, I wonder
> whether this is really faster than a linear scan anyway. I would like to
> see timing of f5, with the min/max of f4 combined with the linear scan of
> f3. [...]

Your f5 function runs at virtually the same speed as Raymond's version.
Even with growing string length, there is no discernible difference.

Cheers

Stephan


Jul 18 '05 #3
On Sat, 27 Dec 2003 17:36:46 +0100, rumours say that Stephan Diehl
<st*****************@gmx.net> might have written:
> A while ago, I posted a recipe about finding the common prefix of a list
> of strings. The recipe itself is quite bad (I have to admit), and I
> didn't know at the time that this problem was already solved in the
> os.path module.
> The interesting part can be found in the commentaries, as this turned out
> to be a quest for the most efficient algorithm to solve this particular
> problem.


You might also want to read:

http://www.python.org/sf/681780
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #4
Christos TZOTZIOY Georgiou wrote:
> On Sat, 27 Dec 2003 17:36:46 +0100, rumours say that Stephan Diehl
> <st*****************@gmx.net> might have written:
>
>> A while ago, I posted a recipe about finding the common prefix of a list
>> of strings. [...]
>
> You might also want to read:
>
> http://www.python.org/sf/681780


Terry's solution is much faster (at least on my test set) and, as an
additional benefit, is the easiest to understand.
Jul 18 '05 #5
On Sat, 27 Dec 2003 15:23:34 -0500, "Terry Reedy" <tj*****@udel.edu> wrote:

"Stephan Diehl" <st*****************@gmx.net> wrote in message
news:bs*************@news.t-online.com...
All of this can be found at
http://aspn.activestate.com/ASPN/Coo.../Recipe/252177

What I was most surprised at was the inefficency of the trivial solutions
(and that the right algorithm makes indeed a difference).


A common surprise. The science of algorithms (including empirical testing)
gives real benefits.
--

os.path.commonprefix --------------------------------------------------

def f1(m):
"Given a list of pathnames, returns the longest common leading
component"
if not m: return ''
prefix = m[0]


prefix = m.pop() # avoids comparing prefix to itself as first item
for item in m:
for i in range(len(prefix)):
if prefix[:i+1] != item[:i+1]:


I am 99% sure that above is a typo and performance bug that should read
if prefix[i:i+1] != item[i:i+1]:
since all previous chars have been found equal in previous iteration.
prefix = prefix[:i]
if i == 0:
return ''
break
return prefix


Perhaps you could retest this with the two suggested changes. It will be
somewhat faster.
The problem with this algorithm is the copying of all those small

strings.

Since Python does not have pointers, comparing characters within strings
requires pulling out the chars as separate strings. This is why using
C-coded comparison functions may win even though more comparisons are done.

The reason f1uses slicing (of len 1 after my correction) instead of
indexing is to avoid exceptions when len(item) < len(prefix). However, all
the +1s have a cost (and I suspect slicing does also), so it may pay to
truncate prefix to the length of item first. The simplest fix for this
(untested) gives

def f9(m): # modified f1 == os.path.commonprefix
"Given a list of strings, returns the longest common prefix"
if not m: return ''
prefix = m.pop()
for item in m:
prefix = prefix[:len(item)]
for i in range(len(prefix)):
if prefix[i] != item[i]:
if not i: return ''
prefix = prefix[:i]
break
return prefix
This can be easily fixed
--- optimized

os.path.commonprefix ----------------------------------------

def f2(m):
"Given a list of pathnames, returns the longest common leading
component"
if not m: return ''
if len(m) == 1: return m[0]
prefix = m[0]
for i in xrange(len(prefix)):
for item in m[1:]:
if prefix[i] != item[i]:
return prefix[:i]
return prefix[:i]


and easily messed up;-) If len(item) < len(prefix), item[i] throws
exception. For this approach to work, prefix should be set as shortest
string of m in preliminary loop. Also, you reslice m several times. Do it
once before outer loop.
It is just not nessesary to compare all strings in the list.


Every string has to be compared to something at least once.
It is enough to sort the list first
and then compare the first and the last element.


Sorting compares all strings in the list to something at least once, and
most more than once.
Even though the 'sort' algorithm is coded in C and is therefore quite

fast,
the order of runtime has changed.


The C part is what makes f3 faster. In your timings, 128 is not large
enough for the nlogn component to be noticeable.
Michael Dyck then pointed out that instead of using 'sort', 'min' and

'max'
should be used. While tests suggest that this is true, I have no idea why
that should be, since finding a minimum or maximum uses some sorting

anyway

No. max and min each do a linear scan. No sorting. But each does at
least as many character comparisons as modified f1 or f2. The speedup is
from looping and comparing in C, even though at least twice as many
compares are done.
You might have realized that the optimization so far was done one the

number
of strings. There is still another dimension to optimize in and that is

the
actual string comparing.
Raymond Hettinger suggests using a binary search:


Since this only affects the final comparison of min and max, and not the n
comparisons done to calculate each, the effect is minimal and constant,
independent of number of strings.

Since this compares slices rather than chars in each loop, I wonder whether
this is really faster than linear scan anyway. I would like to see timing
of f5 with min/max of f4 combined with linear scan of f3. (Basically, f3
with sort removed and min/max added.) Since you changed two parts of f3 to
get f4, we cannot be sure that both changes are each an improvement even
though the combination of two is.

def f5(seq):
if not seq: return ''
s1 = min(seq)
s2 = max(seq)
n = min(len(s1), len(s2))
if not n: return '' # not required since s1[0:0] == ''
for i in xrange(n) :
if s1[i] != s2[i] :
return s1[0:i]
return s1[0:n]

I wonder about this version for speed (not very tested ;-):

>>> def f10(m):
...     "return longest common prefix of strings in seq"
...     if not m: return ''
...     prefix = m.pop()
...     ssw = str.startswith
...     for item in m:
...         while not ssw(item, prefix): prefix = prefix[:-1]
...         if not prefix: return ''
...     return prefix
...
>>> f10('abcd abcd'.split())
'abcd'
>>> f10('abcd abce'.split())
'abc'
>>> f10('abcd abcd a'.split())
'a'
>>> f10('abcd abcd a x'.split())
''

Regards,
Bengt Richter
Jul 18 '05 #6
On Mon, 29 Dec 2003 17:08:39 +0100, rumours say that Stephan Diehl
<st*****************@gmx.net> might have written:
>> You might also want to read:
>>
>> http://www.python.org/sf/681780
>
> Terry's solution is much faster (at least on my test set) and, as an
> additional benefit, is the easiest to understand.


Yep, I believe clarity is essential in the library (and that is why my
patch was obviously not accepted :) Actually, IIRC (it's been a long time
since), that code never compares prefixes a second time once they have been
found equal, and it does not create slices (it used buffer() and then
switched to startswith() for future compatibility); that is why it's a
little complicated. The main loop runs math.ceil(math.log(N, 2)) times,
where N is min([len(x) for x in argument_list]).
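
In other words, the loop count grows only logarithmically with the length
of the shortest string; for instance (my arithmetic, not from the patch):

>>> import math
>>> int(math.ceil(math.log(1000, 2)))  # shortest string of 1000 chars
10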

Anyway, perhaps Terry or you should update the SF patch so that
os.path.commonprefix becomes faster... the point is to benefit the whole
Python community, right? :)
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #7
Christos TZOTZIOY Georgiou wrote:
> On Mon, 29 Dec 2003 17:08:39 +0100, rumours say that Stephan Diehl
> <st*****************@gmx.net> might have written:
>
>> Terry's solution is much faster (at least on my test set) and, as an
>> additional benefit, is the easiest to understand.
>
> [...]
>
> Anyway, perhaps Terry or you should update the SF patch so that
> os.path.commonprefix becomes faster... the point is to benefit the whole
> Python community, right? :)


In principle, yes :-)
I'm not too sure that commonprefix is used often enough, or in a way that
would really make it worth the effort to patch the existing code. (I guess
that if there had been a real need, it would have been done a long time
ago.)
Another thing is that the speed improvement would be largely due to using
C-implemented functions instead of pure Python code (o.k., the lines are
blurry here). In order to understand what's going on, one needs to know
which kinds of functions are C-implemented and why, in that particular
case, it is a good idea to use them.
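
A tiny experiment makes that point (my sketch; absolute numbers are
machine-dependent, and the names are mine):

import timeit

a = 'x' * 1000
b = 'x' * 999 + 'y'

def py_compare():
    # pure-Python character loop: one pair of subscripts per step
    for i in range(len(a)):
        if a[i] != b[i]:
            return i

t_py = timeit.timeit(py_compare, number=1000)
t_c = timeit.timeit(lambda: a == b, number=1000)  # one C-level comparison
print(t_py, t_c)  # the C-level comparison is typically far faster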
Jul 18 '05 #8
