Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Thanks,
Tom
"tshad" <ts**********@ftsolutions.comwrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Then BinarySearch is better. Contains will be an O(n) operation always
(since it doesn't know whether the data is sorted), while BinarySearch is an
O(lg(n)) operation. Just make sure that the data really is sorted!
cd
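Carl's distinction can be made concrete with a small sketch. This is an illustrative Python version of the two strategies (not the actual .NET implementations) that counts comparisons on a short sorted list of strings:

```python
def linear_search(items, target):
    """O(n): examine items one by one, as Contains must on unsorted data."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons

def binary_search(sorted_items, target):
    """O(lg(n)): halve the search space each step, as a binary search does."""
    comparisons = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        comparisons += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return True, comparisons
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, comparisons

words = sorted("alpha bravo charlie delta echo foxtrot golf hotel".split())
print(linear_search(words, "hotel"))   # (True, 8) -- scanned the whole list
print(binary_search(words, "hotel"))   # (True, 4) -- a handful of halvings
```

Even on eight items the halving search does half the work; the gap widens rapidly as the list grows.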
"tshad" <ts**********@ftsolutions.comwrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Contains is O(n).
BinarySearch is O(log(n)).
For small collections, Contains is probably faster.
For large collections, BinarySearch is faster.
If you are frequently looking for data near the front of the collection,
Contains might have an edge.
In short, it depends on the number of items and normal usage.
Hope this helps
Bill
"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:O9**************@TK2MSFTNGP03.phx.gbl...
"tshad" <ts**********@ftsolutions.comwrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
>Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Then BinarySearch is better. Contains will be an O(n) operation always
(since it doesn't know whether the data is sorted), while BinarySearch is
an O(lg(n)) operation. Just make sure that the data really is sorted!
What do O(n) and O(lg(n)) mean?
Thanks,
Tom
"tshad" <ts**********@ftsolutions.comwrote in message
news:eF**************@TK2MSFTNGP03.phx.gbl...
What do O(n) and O(lg(n)) mean?
This question popped up just this week.
Here was my answer
If memory serves correctly, the O stands for 'Order'.
Thus O(n) would be called "Order n" and O(n^2) would be "Order n
squared".
This is a measure of how efficiently a given algorithm performs.
Technically it is a measure of the asymptotic behavior as the number of
elements gets very large, but it is often appropriate even for fairly
small collections.
O(1) means that this operation takes a constant amount of time,
independent of the number of items involved. A hashtable retrieval is an
O(1) operation.
O(n) means that an operation takes time proportional to the number of
elements (n). If a collection has twice as many elements, the operation
takes twice as long. A linear search is an O(n) operation.
Others include:
O(n^2): grows as n*n (not very efficient for large collections)
O(log(n)): grows as the logarithm of the number of elements
O(n*log(n)): you get the idea.
Hope this helps
Bill
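Bill's categories map directly onto familiar operations. A rough Python illustration (the data and the `has_duplicates` helper are made up for the example):

```python
# O(1): hash-based lookup -- cost does not grow with the number of items
ages = {"ann": 34, "bob": 29, "carol": 41}
print("bob" in ages)        # one hash computation, regardless of dict size

# O(n): linear scan -- cost grows in step with the number of items
names = ["ann", "bob", "carol"]
print("carol" in names)     # may have to touch every element

# O(n^2): nested loops over the same data, e.g. naive duplicate detection
def has_duplicates(items):
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

print(has_duplicates(["a", "b", "a"]))  # True
```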
"tshad" <ts**********@ftsolutions.comwrote in message
news:eF**************@TK2MSFTNGP03.phx.gbl...
What do O(n) and O(lg(n)) mean?
It's a Computer Science way of describing the complexity or cost of an
algorithm. The "n" relates to the size of the data, the "O" is read "order",
and the function inside the parentheses describes how the running time of
the operation varies with the size of the data.
In this particular case, O(n) refers to the fact that a straight linear
search through the data will always take an amount of time that is directly
proportional to the size of the data, and increases linearly as the size of
the data increases. On the other hand, the BinarySearch method takes
advantage of the fact that the data is sorted, allowing a search for a
particular data item to complete in time proportional to log(n). Since
log(n) grows far more slowly than n itself, this means BinarySearch is much
more efficient, on average.
If you intend to do any serious programming, especially where performance is
an issue, you would do well to find a good textbook or other reference on
algorithms and learn not only about the concept of "order", but also what
common algorithms have what order and how the order is affected by the data
(for example, the average case for a binary search is O(log(n)), but
depending on how the data is stored the worst case might wind up being O(n)
anyway).
Pete
Hi Tom,
"Big O Notation" http://en.wikipedia.org/wiki/Big_o_notation
Basically, you can think of it as describing the number of operations in the
worst-case scenario to find the specified item in the list.
e.g., if you have 10 items in the list then the worst case in a linear search
would be O(n). This means that the worst-case scenario would be examining
every item in the list (10 operations). The worst case in a binary search
would be O(lg(n)): the binary logarithm of 10 operations. Binary log is the
worst-case scenario because a binary search cuts the list in half, then
performs the same operation again on the half that contains the search item.
The pattern is continued until only one or zero items remain. If the list
isn't sorted properly then the search may fail, because the algorithm might
choose the "wrong" half in which to continue searching after comparing the
current item to the search item.
"Binary Logarithm" http://en.wikipedia.org/wiki/Binary_logarithm
"Binary Search" http://en.wikipedia.org/wiki/Binary_search

Dave Sexton
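Dave's halving description can be sketched directly. This illustrative Python version (the `probe_count` name is made up for the example) counts how many halving steps the search takes on a 10-item list:

```python
import math

def probe_count(sorted_items, target):
    """Count how many times the halving step runs before the search ends."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        probes += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes

items = list(range(10))                      # 10 sorted items, as in the example
worst = max(probe_count(items, t) for t in items)
print(worst)                                 # 4
print(math.ceil(math.log2(len(items))))      # lg(10) rounded up is also 4
```

Even the unluckiest search needs only four probes on ten items, matching the lg(n) description.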
"tshad" <ts**********@ftsolutions.comwrote in message
news:eF**************@TK2MSFTNGP03.phx.gbl...
"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam >
wrote in message news:O9**************@TK2MSFTNGP03.phx.gbl...
>"tshad" <ts**********@ftsolutions.comwrote in message news:u8**************@TK2MSFTNGP04.phx.gbl...
>>Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Then BinarySearch is better. Contains will be an O(n) operation always (since it doesn't know whether the data is sorted), while BinarySearch is an O(lg(n)) operation. Just make sure that the data really is sorted!
What do O(n) and O(lg(n)) mean?
Thanks,
Tom
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
e.g., if you have 10 items in the list then the worst case in a linear
search would be O(n). This means that the worst-case scenario would be
examining every item in the list (10 operations). The worst case in a
binary search would be O(lg(n)).
Note:
"Order" stated by itself is almost always for *average* performance, not
worst-case. One would normally specify explicitly if describing the
worst-case order for an algorithm.
For example, it is NOT true that "the worst case in a binary search would be
O(log(n))" for all implementations of a binary search. If you have a sorted
array, it's true that the worst case is O(log(n)). However, if you have a
binary tree, a search on that tree could be as bad as O(n).
Pete
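Pete's binary-tree caveat is easy to demonstrate: inserting already-sorted keys into a plain, unbalanced binary search tree produces a chain, so searching it degrades to O(n). A hedged Python sketch (not any particular library's tree):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Plain unbalanced BST insert -- no rebalancing whatsoever."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def depth(root):
    """Longest root-to-leaf path; a search may walk this many nodes."""
    if root is None:
        return 0
    return 1 + max(depth(root.left), depth(root.right))

root = None
for key in range(100):      # inserting keys in sorted order...
    root = insert(root, key)
print(depth(root))          # ...yields a chain of depth 100: searches are O(n)
```

A balanced tree (or a sorted array) would keep the depth near lg(100), which is why the worst case depends on the implementation.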
Hi Peter,
My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting me.
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst cases :)

Dave Sexton
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
>e.g., if you have 10 items in the list then the worst case in a linear search would be O(n). This means that the worst-case scenario would be examining every item in the list (10 operations). The worst case in a binary search would be O(lg(n)).
Note:
"Order" stated by itself is almost always for *average* performance, not
worst-case. One would normally specify explicitly if describing the
worst-case order for an algorithm.
For example, it is NOT true that "the worst case in a binary search would be
O(log(n))" for all implementations of a binary search. If you have a sorted
array, it's true that the worst case is O(log(n)). However, if you have a
binary tree, a search on that tree could be as bad as O(n).
Pete
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:Ot**************@TK2MSFTNGP02.phx.gbl...
My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting
me.
I just took a look at the article you referenced. Wow. They sure didn't
cover it in that kind of detail in my CS classes way back when. :) Anyway, I
can see where the confusion comes from, since the phrase "asymptotic upper
bound" could be read to imply "worst case" (the "upper bound" certainly has
that implication in daily usage). I can't say as I blame you for reading it
that way.
That's what we lowly programmers get for using the same concept that those
highfalutin' "algorithm scientists" use. :)
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst cases :)
I guess. Though, to be fair, in those two cases the order of the average
case is the same as the order of the worst case (if one assumes a binary
search on an array, which seems reasonable given this thread). So the
distinction is entirely academic. :)
Pete
Hi Peter,
I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p
Luckily, I've never been tasked with the responsibility of authoring
high-performance search algorithms on complex data structures. Although if I
was, I probably would have learned the correct meaning of the notation ;)
Thanks again.

Dave Sexton
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:Ot**************@TK2MSFTNGP02.phx.gbl...
>My interpretation of the wikipedia articles is obviously off a little. I thought O notation always represented worst-case, so thanks for correcting me.
I just took a look at the article you referenced. Wow. They sure didn't
cover it in that kind of detail in my CS classes way back when. :) Anyway, I
can see where the confusion comes from, since the phrase "asymptotic upper
bound" could be read to imply "worst case" (the "upper bound" certainly has
that implication in daily usage). I can't say as I blame you for reading it
that way.
That's what we lowly programmers get for using the same concept that those
highfalutin' "algorithm scientists" use. :)
>I guess it's just a fluke that I analyzed the two cases that actually are expressing the worst cases :)
I guess. Though, to be fair, in those two cases the order of the average
case is the same as the order of the worst case (if one assumes a binary
search on an array, which seems reasonable given this thread). So the
distinction is entirely academic. :)
Pete
Dave Sexton <dave@jwa[remove.this]online.com> wrote:
I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.

Jon Skeet  <sk***@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
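Jon's QuickSort figures can be reproduced by counting comparisons. The sketch below is a deliberately naive Python Quicksort (first element as pivot, extra lists for the partitions), which is enough to show the average-vs-worst gap; a production sort would pick pivots more carefully:

```python
import random

def quicksort_comparisons(items):
    """Naive Quicksort, first element as pivot; returns a pivot-comparison count."""
    if len(items) <= 1:
        return 0
    pivot = items[0]
    rest = items[1:]
    less = [x for x in rest if x < pivot]
    more = [x for x in rest if x >= pivot]
    # each element of `rest` is checked against the pivot (counted once here)
    return len(rest) + quicksort_comparisons(less) + quicksort_comparisons(more)

n = 200
random.seed(1)
shuffled = random.sample(range(n), n)
print(quicksort_comparisons(shuffled))        # average case: on the order of n*log2(n)
print(quicksort_comparisons(list(range(n))))  # already sorted: n*(n-1)/2 = 19900
```

With a first-element pivot, already-sorted input puts every remaining element on one side of each partition, so the count collapses to the O(n^2) worst case Jon mentions.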
Hi Jon,
>I must admit that my incomplete understanding of the O notation wasn't based solely on wikipedia. I've always believed it to represent "worst-case", not average, for many years now. It's the math that makes it confusing, yet I thought I had it right :p
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
That's a good idea to cover all bases.
Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?
What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time, but it's hard for me to compare both, so which
aspect has more relevance? Of course, I'd agree that it's better to have all
the information on hand when making the decision (even though ignorance is
bliss, as they say ;)
BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.
"Quicksort" http://en.wikipedia.org/wiki/QuickSort
(The Quicksort article is particularly interesting with its visual
representations of the sort in action)
"Array.Sort Method" http://msdn.microsoft.com/library/de...ssorttopic.asp

Dave Sexton
Contains is an O(n) operation while BinarySearch is an O(log n) operation. With
big list sizes and under intense load, this will make a huge difference, so use
BinarySearch to locate items in sorted lists.
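For illustration, here is the same pattern in Python using the standard bisect module (the `contains_sorted` name is made up for the example); it mirrors a binary-search membership test on a sorted list of strings:

```python
import bisect

def contains_sorted(sorted_strings, value):
    """Membership test on a sorted list via binary search (bisect)."""
    i = bisect.bisect_left(sorted_strings, value)
    return i < len(sorted_strings) and sorted_strings[i] == value

names = sorted(["delta", "alpha", "charlie", "bravo"])
print(contains_sorted(names, "charlie"))  # True
print(contains_sorted(names, "echo"))     # False
```

As with the .NET method, the list must actually be sorted or the result is unreliable.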
"tshad" <ts**********@ftsolutions.comwrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
Which is better to use with an ArrayList: BinarySearch or Contains?
The list is only going to have strings in it and it will be sorted.
Thanks,
Tom
Dave Sexton <dave@jwa[remove.this]online.com> wrote:
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
That's a good idea to cover all bases.
Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?
They can only be compared with as much data as is available.
What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time, but it's hard for me to compare both, so which
aspect has more relevance? Of course, I'd agree that it's better to have all
the information on hand when making the decision (even though ignorance is
bliss, as they say ;)
You'd need to have more information such as how likely the worst case
scenario is, and how easy it might be to avoid.
Note that just knowing the order isn't always enough either - two things
can both be the same order, but with a large factor involved, e.g. one
will always be five times slower than another. IIRC (again!) a merge
sort can *always* be O(n log n) but it has other costs both in memory
and time factors which are higher than QuickSort, which is why
QuickSort is used more often.
BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.
And to think people claim miracles don't happen in the 21st century ;)

Jon Skeet  <sk***@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
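Jon's merge sort remark can be sketched too: the comparison pattern is always O(n log n), but notice the helper lists allocated at every level, which is part of the extra cost he mentions. An illustrative Python version:

```python
def merge_sort(items):
    """Merge sort: O(n log n) even on adversarial input, at the price of
    allocating O(n) working storage at every level of the recursion."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    merged = []                       # extra working storage Jon alludes to
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # sorted output, even for reversed input
```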
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
<snip>
If you have a sorted array, it's true that the worst case is
O(log(n)).
Everyone is making wonderful points on this topic, but I wanted to
underline a few key points.
As Peter said, a BinarySearch on a sorted ArrayList (or a balanced
Binary tree) will have a Worst case performance of O(log(n)).
Another important point of Big O notation is that it describes
Asymptotic behavior. Thus for sufficiently large n an O(log(n))
algorithm will outperform an O(n) algorithm. The problem is determining
what constitutes "sufficiently large n". For small values of n,
O(log(n)) algorithms often perform WORSE than linear algorithms due to
the additional setup complexity. Of course, for small collections, ANY
search will be fast, so it doesn't really matter.
Bill
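Bill's asymptotic point can be tabulated. Comparing worst-case probe counts for a linear scan versus a binary search (illustrative arithmetic only; real timings also include the constant factors Bill mentions):

```python
import math

def linear_worst(n):
    return n                               # may have to examine every element

def binary_worst(n):
    return math.floor(math.log2(n)) + 1    # halvings until one element remains

for n in (10, 1_000, 1_000_000):
    print(f"n={n:>9}: linear worst {linear_worst(n):>9}, binary worst {binary_worst(n):>2}")
```

At n=10 the gap (10 vs. 4) is small enough for setup overhead to dominate; at a million elements (1,000,000 vs. 20) it cannot.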
Hi Jon,
So it seems that choosing an appropriate algorithm depends not only on what
the O notation says about performance, but other factors that the notation
doesn't account for such as scale, depth and distribution of the data,
perhaps, which also may affect the performance of the algorithm.
I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick one
and see if it performs better than any other algorithm that I decide to test.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable criteria
for performance and memory consumption.
I'll be outsourcing any project in the future where this is required ;)
Thanks.

Dave Sexton
"Jon Skeet [C# MVP]" <sk***@pobox.comwrote in message
news:MP************************@msnews.microsoft.com...
Dave Sexton <dave@jwa[remove.this]online.com> wrote:
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
That's a good idea to cover all bases.
Then the difficulty comes in comparing two algorithms that have asymmetric average to worst case ratios. How can the algorithms be compared when they don't have any common denominator?
They can only be compared with as much data as is available.
>What I mean is that it's much easier to choose an algorithm based on either average time or worst-case time, but it's hard for me to compare both, so which aspect has more relevance? Of course, I'd agree that it's better to have all the information on hand when making the decision (even though ignorance is bliss, as they say ;)
You'd need to have more information such as how likely the worst case
scenario is, and how easy it might be to avoid.
Note that just knowing the order isn't always enough either - two things
can both be the same order, but with a large factor involved, e.g. one
will always be five times slower than another. IIRC (again!) a merge
sort can *always* be O(n log n) but it has other costs both in memory
and time factors which are higher than QuickSort, which is why
QuickSort is used more often.
>BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN, so it seems you did remember correctly.
And to think people claim miracles don't happen in the 21st century ;)

Jon Skeet  <sk***@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Hi Bill,
It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?
Since I'm not one for micro-optimization either, I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there may
be substantial amounts of data at times and just swallow the performance hit
on those smaller collections. I guess if there is a diametric aspect to the
data then maybe switching between linear and binary algorithms at runtime
might be acceptable. I've never had a project that required this much thought
on search algorithms, however :)
Thanks.

Dave Sexton
"Bill Butler" <qw****@asdf.comwrote in message
news:_4q5h.358$5P2.166@trnddc02...
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
<snip>
>If you have a sorted array, it's true that the worst case is O(log(n)).
Everyone is making wonderful points on this topic, but I wanted to underline
a few key points.
As Peter said, a BinarySearch on a sorted ArrayList (or a balanced Binary
tree) will have a Worst case performance of O(log(n)).
Another important point of Big O notation is that it describes Asymptotic
behavior. Thus for sufficiently large n an O(log(n)) algorithm will
outperform an O(n) algorithm. The problem is determining what constitutes
"sufficiently large n". For small values of n, O(log(n)) algorithms often
perform WORSE than linear algorithms due to the additional setup complexity.
Of course, for small collections, ANY search will be fast, so it doesn't
really matter.
Bill
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:ux**************@TK2MSFTNGP02.phx.gbl...
Hi Jon,
So it seems that choosing an appropriate algorithm depends not only on
what the O notation says about performance, but other factors that the
notation doesn't account for such as scale, depth and distribution of the
data, perhaps, which also may affect the performance of the algorithm.
Yup. Welcome to the world of optimization. :) The "order" of an algorithm
is an important analytical aspect, but as you've noted one should not
consider it to be the last word in which algorithm is "best". There are
so many other factors that come into play, the order by itself just doesn't
tell you everything you need to know.
I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick
one and see if it performs better than any other algorithm that I decide
to test.
Well, yes and no. I mean, it's useful to use the order as a starting point.
After all, you can pretty much rule out even testing a bubble sort when
you've got a Quicksort implementation to use. You don't literally need to
test all possible solutions, thank goodness.
Also note that testing your potential solutions does not necessarily tell
you worst-case performance. It depends on how you pick your data for the
tests, and different algorithms may have different data that results in the
worst-case performance. So, testing can be useful, but it's not necessarily
an improvement over just looking at the order. To improve upon the order
analysis, you have to do more analysis. Testing alone won't cut it.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable
criteria for performance and memory consumption.
I'll be outsourcing any project in the future where this is required ;)
That works. :)
That said, I'll note that much of the time, absolutely *optimal* performance
is not required, while at the same time the very worst cases are not
necessarily that hard to avoid. For example, even in the average case,
Quicksort is dependent on the distribution of the data and how you pick the
pivot element for each iteration. Even when the magnitude is reasonably
proportional to n log(n), there can be wide variations in real-time
performance.
This can be addressed either by playing some tricks when picking the pivot,
or by simply ignoring the problem and accepting that occasionally you'll get
data organized in a way that hinders the sorting performance. The very
worst case comes when the data is already sorted, and this condition can be
detected fairly easily (and of course, if you detect it, you need not do any
more work :) ).
I'm just using Quicksort as an example. The same ideas apply to pretty much
any algorithm. Knowing the order is useful, but it's not the end-all,
be-all for picking an algorithm or implementation. You can use the order to
make some broad generalizations, and then refine the implementation to
address specifics that apply in one's particular situation. And in the end,
most coders simply accept that sometimes the universe will throw some really
hard data at the algorithm, and it won't get the very best performance it
might have gotten. :)
Pete
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
It seems you are suggesting that the scale of the data can affect whether
a required sorting operation will actually reduce the overall performance
of an algorithm so that it becomes worse than a linear search.
His point (I think) is that some algorithms only work after some initial
setup, and that setup can be costly. I'm not sure that all sorting and
searching algorithms are a good example of this. For example, depending on
the implementation of a Quicksort, it can perform just as well on small data
sets as large ones. Likewise binary search (though for very small data
sets, the difference between n and log n is very small).
But it is certainly true that the order of an algorithm considers only a
data set that's basically infinite. There can be wide variations in
performance, well away from the theoretical performance, when the data set
is very small. To make matters worse, what constitutes "very small" or
"large enough" varies depending on the algorithm
I think this is what Bill was saying...I don't think I'm adding anything
new, but hopefully am just restating it in a slightly different way, which
might make it clearer to some people (just as it may obfuscate it to others
:) ).
Can this be true even when the data is sorted as it's being entered into
the structure?
Ack! And I mean that in the Bill the Cat way, not the i/o handshaking way.
:)
The analysis becomes very different when you've got an incremental situation
like that. The order analysis for these different algorithms assumes you
start with all the data and run through it once to produce the end results.
What is best in that case may very well not be best for a case where you
want to perform the same operation after each data element is added.
For example, resorting an array would be a silly way to sort data as it's
being added to a data structure. After all, if the data is already sorted,
sorting it again just to add one element is very inefficient. Even worse, as
I noted elsewhere, the worst-case performance for some sorting algorithms
comes when the data starts out sorted. This is a classic example of why
knowing the exact scenario the algorithm is to be used in is at least as
important as knowing the order of the algorithm. :)
I would never use Quicksort to sort data as it's being entered into a data
structure. It's one of the worst choices in that case, and you practically
guarantee minimal performance, even though the order of the algorithm is
superior on average.
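One common alternative for the incremental case Pete describes is to insert each new element directly into its sorted position rather than re-sorting after every addition. A Python sketch using the standard bisect module:

```python
import bisect

# Keeping a list sorted incrementally: find each item's slot with a binary
# search (O(log n)) and shift the tail (O(n)), instead of re-running a full
# O(n log n) sort after every single addition.
sorted_names = []
for name in ["delta", "alpha", "charlie", "bravo"]:
    bisect.insort(sorted_names, name)
print(sorted_names)   # ['alpha', 'bravo', 'charlie', 'delta']
```

This also sidesteps the already-sorted worst case entirely, since the list is never handed to a full sort.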
Since I'm not one for micro-optimization either, I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there
may be substantial amounts of data at times and just swallow the
performance hit on those smaller collections.
As Bill notes, when the data size is small, the order of the algorithm is
usually not relevant since any algorithm will be fast. So yes, swallowing
the performance hit is fine in that case.
However, that's not the same as ignoring the exact scenario being solved.
You need to take into account how the algorithm will be used, and not create
a situation where you choose an algorithm that has a good average case
order, but will always be used where the worst case order applies.
I hope that makes sense...I'm typing quickly because I'm hungry and want to
eat lunch. :)
Pete
Hi Peter,
I have a lot of experience optimizing code, just not for search and sorting
algorithms. They appear now to require similar amounts of knowledge and
effort. The main difference appears to be that search and sort algorithms
have a mathematical base, but general performance tweaking in code just
requires an understanding of the coding language, the framework and expected
use of the application, without the need for any math. Testing for
performance and sampling data has always been good enough on my projects so
far. The mathematical aspects of the algorithms are confusing so I just
always leaned on my original understanding of the O notation, expecting it to
tell the whole story.
Thanks for your comprehensive response.

Dave Sexton
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:ux**************@TK2MSFTNGP02.phx.gbl...
>Hi Jon,
So it seems that choosing an appropriate algorithm depends not only on what the O notation says about performance, but other factors that the notation doesn't account for such as scale, depth and distribution of the data, perhaps, which also may affect the performance of the algorithm.
Yup. Welcome to the world of optimization. :) The "order" of an algorithm
is an important analytical aspect, but as you've noted one should not
consider it to be the last word in which algorithm is "best". There are
so many other factors that come into play, the order by itself just doesn't
tell you everything you need to know.
>I guess the best way to choose an appropriate algorithm for an application when dealing with complex data structures, sorts and searches, is to pick one and see if it performs better than any other algorithm that I decide to test.
Well, yes and no. I mean, it's useful to use the order as a starting point.
After all, you can pretty much rule out even testing a bubble sort when
you've got a Quicksort implementation to use. You don't literally need to
test all possible solutions, thank goodness.
Also note that testing your potential solutions does not necessarily tell
you worst-case performance. It depends on how you pick your data for the
tests, and different algorithms may have different data that results in the
worst-case performance. So, testing can be useful, but it's not necessarily
an improvement over just looking at the order. To improve upon the order
analysis, you have to do more analysis. Testing alone won't cut it.
>I can choose the algorithms to be initially tested based on all of the information that is available, ahead of time, about the data and its structure, the noted performance of the algorithms and the acceptable criteria for performance and memory consumption.
I'll be outsourcing any project in the future where this is required ;)
That works. :)
That said, I'll note that much of the time, absolutely *optimal* performance
is not required, while at the same time the very worst cases are not
necessarily that hard to avoid. For example, even in the average case,
Quicksort is dependent on the distribution of the data and how you pick the
pivot element for each iteration. Even when the magnitude is reasonably
proportional to n log(n), there can be wide variations in real-time
performance.
This can be addressed either by playing some tricks when picking the pivot,
or by simply ignoring the problem and accepting that occasionally you'll get
data organized in a way that hinders the sorting performance. The very
worst case comes when the data is already sorted, and this condition can be
detected fairly easily (and of course, if you detect it, you need not do any
more work :) ).
I'm just using Quicksort as an example. The same ideas apply to pretty much
any algorithm. Knowing the order is useful, but it's not the end-all,
be-all for picking an algorithm or implementation. You can use the order to
make some broad generalizations, and then refine the implementation to
address specifics that apply in one's particular situation. And in the end,
most coders simply accept that sometimes the universe will throw some really
hard data at the algorithm, and it won't get the very best performance it
might have gotten. :)
Pete
Hi Peter,
Informative and entertaining ;)
I thought that "setup" might have been referring to the prerequisite some
algorithms have that the data already be sorted. If not, what exactly does
"setup" mean?
I'm stepping out to eat too (and go see Borat). We can reconvene at 8:30 :p

Dave Sexton
"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
>It seems you are suggesting that the scale of the data can affect whether a required sorting operation will actually reduce the overall performance of an algorithm so that it becomes worse than a linear search.
His point (I think) is that some algorithms only work after some initial
setup, and that setup can be costly. I'm not sure that all sorting and
searching algorithms are a good example of this. For example, depending on
the implementation of a Quicksort, it can perform just as well on small data
sets as large ones. Likewise binary search (though for very small data
sets, the difference between n and log n is very small).
But it is certainly true that the order of an algorithm considers only a
data set that's basically infinite. There can be wide variations in
performance, well away from the theoretical performance, when the data set
is very small. To make matters worse, what constitutes "very small" or
"large enough" varies depending on the algorithm
I think this is what Bill was saying...I don't think I'm adding anything
new, but hopefully am just restating it in a slightly different way, which
might make it clearer to some people (just as it may obfuscate it to others
:) ).
>Can this be true even when the data is sorted as it's being entered into the structure?
Ack! And I mean that in the Bill the Cat way, not the i/o handshaking way.
:)
The analysis becomes very different when you've got an incremental situation
like that. The order analysis for these different algorithms assumes you
start with all the data and run through it once to produce the end results.
What is best in that case may very well not be best for a case where you
want to perform the same operation after each data element is added.
For example, re-sorting an array would be a silly way to sort data as it's
being added to a data structure. After all, if the data is already sorted,
sorting it again just to add one element is very inefficient. Even worse, as
I noted elsewhere, the worst-case performance for some sorting algorithms
comes when the data starts out sorted. This is a classic example of why
knowing the exact scenario in which the algorithm is to be used is at least
as important as knowing the order of the algorithm. :)
I would never use Quicksort to sort data as it's being entered into a data
structure. It's one of the worst choices in that case, and you practically
guarantee minimal performance, even though the order of the algorithm is
superior on average.
>Since I'm not one for micro-optimization either I'd rather just choose an algorithm with an average performance of O(lg(n)) if I'm sure that there may be substantial amounts of data at times and just swallow the performance hit on those smaller collections.
As Bill notes, when the data size is small, the order of the algorithm is
usually not relevant since any algorithm will be fast. So yes, swallowing
the performance hit is fine in that case.
However, that's not the same as ignoring the exact scenario being solved.
You need to take into account how the algorithm will be used, and not create
a situation where you choose an algorithm that has a good average case
order, but will always be used where the worst case order applies.
I hope that makes sense...I'm typing quickly because I'm hungry and want to
eat lunch. :)
Pete
Dave Sexton wrote:
Hi Bill,
It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-O notation anyway,
that a Hashtable would always be faster, but what Big-O doesn't reveal
is the constant time factor, which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
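The strategy Brian describes can be sketched in a few lines. This Python class is a hypothetical analogue of HybridDictionary, not its actual implementation; the cutover value of 10 is just the rule-of-thumb from the discussion.

```python
class HybridDict:
    """Sketch of the HybridDictionary strategy: a flat association list while
    small, converting to a real hash table once it grows past CUTOVER items."""
    CUTOVER = 10  # illustrative threshold, per the "less than 10" rule of thumb

    def __init__(self):
        self._pairs = []    # small: linear scan over (key, value) pairs
        self._dict = None   # large: hash-based dict, created on cutover

    def __setitem__(self, key, value):
        if self._dict is not None:
            self._dict[key] = value
            return
        for i, (k, _) in enumerate(self._pairs):
            if k == key:                      # linear scan for an existing key
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))
        if len(self._pairs) > self.CUTOVER:   # convert once, then stay a dict
            self._dict = dict(self._pairs)
            self._pairs = None

    def __getitem__(self, key):
        if self._dict is not None:
            return self._dict[key]            # O(1) average
        for k, v in self._pairs:              # O(n), cheap for tiny n
            if k == key:
                return v
        raise KeyError(key)
```

For a handful of items the linear scan avoids the hashing overhead entirely; past the cutover, lookups get the constant-time behavior of the hash table.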
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:OP*************@TK2MSFTNGP03.phx.gbl...
Hi Peter,
I have a lot of experience optimizing code, just not for search and
sorting algorithms. They appear now to require similar amounts of
knowledge and effort. The main difference appears to be that search and
sort algorithms have a mathematical base, but general performance tweaking
in code just requires an understanding of the coding language, the
framework and expected use of the application, without the need for any
math.
Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topicrelated comment, than specifically
directed at you.
I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.
The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)
Pete
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:Ob**************@TK2MSFTNGP03.phx.gbl...
Hi Peter,
Informative and entertaining ;)
I thought that "setup" might have been referring to the prerequisite that
some algorithms have on sorted data. If not, what exactly does "setup"
mean?
It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm.
For smaller data sets, where the worse-order algorithm's step count is still
less than twice the better-order algorithm's, the worse-order algorithm will
still perform better.
The takeaway isn't so much the specific reasons that a better-order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.
Pete
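Pete's twice-as-long example can be put into numbers. This sketch uses a purely hypothetical cost model (unit-cost linear steps versus double-cost logarithmic steps) to show where the crossover lies:

```python
import math

# Hypothetical cost model: the O(n) algorithm does n steps of cost 1;
# the O(log n) algorithm does log2(n) steps, each costing twice as much.
def linear_cost(n):
    return n

def log_cost(n):
    return 2 * math.log2(n)

for n in (2, 4, 8, 64, 1024):
    winner = "O(n)" if linear_cost(n) < log_cost(n) else "O(log n)"
    print(f"n={n:5}: linear={linear_cost(n):7.1f}  log={log_cost(n):5.1f}  cheaper: {winner}")
```

Below a handful of elements the two are neck and neck; by n = 1024 the linear algorithm costs 1024 units against 20, which is the "order only matters for large data sets" point in numeric form.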
"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. [...] Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.
Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.
Pete
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
Hi Bill,
It seems you are suggesting that the scale of the data can affect
whether a required sorting operation will actually reduce the overall
performance of an algorithm so that it becomes worse than a linear
search. Can this be true even when the data is sorted as it's being
entered into the structure?
Absolutely.
Search algorithms have two distinct pieces to them:
- Comparisons: Is this the right element? Is it greater than this?
Less than this?
- Iteration: What is the next element in the collection that I should
compare against?
A linear search does a fast compare and a fast iteration to the next
element, but it doesn't apply any logic to reduce the number of
comparisons required. Therefore, for small data sets, a linear search
can be the best option. For large sets, you pay the price of having to
compare against each and every element until you find the correct one.
A binary search against a sorted dataset has a fairly fast compare with
a slower iteration to the next element, since it needs to calculate
where it should do the next compare. This additional logic in the
algorithm means that each step in the algorithm is slower than in the
linear algorithm, but it drastically reduces the number of steps
involved. Therefore the binary search is faster when the number of
elements becomes the dominant factor.
The original question asked about the best way to find elements in a
sorted ArrayList, but I would like to bring up an alternative: try
using a Hashtable instead. It has a higher memory footprint, but the
performance is exceptional. For completeness in this discussion I would
like to point out that Hashtable retrieval is a constant time operation,
O(1). I ran a quick test comparing Hashtable vs linear search vs
BinarySearch and the results are impressive. Hashtable wins hands down
when more than a handful of elements are involved.
Bill
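In .NET terms the choice is ArrayList.BinarySearch versus Contains. The same trade-off can be sketched in Python, where `bisect` plays the role of BinarySearch and the `in` operator is the linear Contains (the helper name is mine, not a library function):

```python
import bisect

def contains_sorted(sorted_list, item):
    """Binary-search membership test on an already-sorted list: the BinarySearch route."""
    i = bisect.bisect_left(sorted_list, item)   # O(log n) comparisons
    return i < len(sorted_list) and sorted_list[i] == item

words = sorted(["delta", "alpha", "echo", "charlie", "bravo"])
print(contains_sorted(words, "charlie"))   # True, via binary search
print("charlie" in words)                  # True as well, but via an O(n) linear scan
print(contains_sorted(words, "foxtrot"))   # False
```

A dict (the Hashtable analogue) would make the membership test O(1) on average, at the cost of extra memory, which is the alternative Bill proposes.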
Hi Peter,
Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topicrelated comment, than specifically
directed at you.
No need to cater to my narcissism  I understood what you meant and I
appreciate the help ;)
I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.
The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)
Yes, I agree that refinement is critical in applications that rely on
top-notch performance. It's rare to get it right the first time. I wouldn't
doubt that it's the same when coding search and sort algorithms. It does make
sense to me as well that being able to choose an appropriate algorithm before
implementing the framework in which it's executed is ideal, but that's not
always possible, as you mentioned. I guess a lot of that depends on how many
of the variables that govern which algorithm is appropriate will continue to
change, and how much. I.e., the less things change, the easier it is to
choose an appropriate algorithm before choosing its code base.

Dave Sexton
Hi Peter,
>I thought that "setup" might have been referring to the prerequisite that some algorithms have on sorted data. If not, what exactly does "setup" mean?
It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm. For
smaller data sets, where the worse-order algorithm's step count is still less
than twice the better-order algorithm's, the worse-order algorithm will still
perform better.
I was just speaking about this thread to a friend of mine who was quick to
jump in and say, "Use linear searches with 30 items or less and BinarySearch
otherwise", immediately after I said the words, "newsgroup thread on the big O
notation and sorting algorithms". I think this type of thought process
relating to search and sort algorithms is quite common among programmers.
It's really easy to understand and people just seem to love "rules" since it
makes things easy to remember.
From what has been discussed so far it sounds like this line of thought is
incorrect; there are just too many variables. Choosing an appropriate pivot
element in a quicksort based on the actual data being sorted, for example,
seems to prevent any constant threshold such as "30" from being appropriate,
in general.
So is there some practical guideline such as 30 or is this truly an arbitrary
number?
The takeaway isn't so much the specific reasons that a better-order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.
Big O notation seems to be like expressing algorithms in terms of probability,
which becomes more accurate as the number of "attempts" increases, so to
speak. When a collection is small, there is more of a chance that the data
will coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can increase
the effectiveness of an algorithm much like how larger data sets provide more
accurate results, probabilistically.
Have I understood you correctly?

Dave Sexton
HI Brian,
Thanks for the reply.
<snip>
This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-O notation anyway,
that a Hashtable would always be faster, but what Big-O doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable
Doesn't O(1) indicate the constant factor?
Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
I'd debate that if I had more knowledge on the subject ;)

Dave Sexton
"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
Dave Sexton wrote:
>Hi Bill,
It seems you are suggesting that the scale of the data can affect whether a required sorting operation will actually reduce the overall performance of an algorithm so that it becomes worse than a linear search. Can this be true even when the data is sorted as it's being entered into the structure?
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-O notation anyway,
that a Hashtable would always be faster, but what Big-O doesn't reveal
is the constant time factor, which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
Hi Peter,
>Yep, that's basically what he's saying. A good example of this is the HybridDictionary. The HybridDictionary starts out as a ListDictionary and then converts to a Hashtable when the number of items is large enough. [...] Typically, a ListDictionary will outperform a Hashtable if the number of items is less than 10.
Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.
Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.
I was thinking the same thing (in my last post to you which was on a related
topic), but I think there is still use for the HybridDictionary:
If an application requires the use of many dictionaries at once, each
containing a small number of distinct elements to begin with, with the number
of elements increasing over time, and searched consistently and asynchronously,
and if the best possible performance is required, then HybridDictionary might
make sense.
I'm not sure that it deserved to be in the FCL before LinkedList, however ;)

Dave Sexton
Hi Bill,
The results of your test are interesting, but I'm sure it depends on the
distribution of data in the Hashtable, does it not?
If all of the data is in a single bucket doesn't a search operation become
O(n)?

Dave Sexton
"Bill Butler" <qw****@asdf.comwrote in message
news:T8w5h.196$5F2.84@trnddc04...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
>Hi Bill,
It seems you are suggesting that the scale of the data can affect whether a required sorting operation will actually reduce the overall performance of an algorithm so that it becomes worse than a linear search. Can this be true even when the data is sorted as it's being entered into the structure?
Absolutely.
Search algorithms have two distinct pieces to them:
- Comparisons: Is this the right element? Is it greater than this? Less
than this?
- Iteration: What is the next element in the collection that I should
compare against?
A linear search does a fast compare and a fast iteration to the next
element, but it doesn't apply any logic to reduce the number of comparisons
required. Therefore, for small data sets, a linear search can be the best
option. For large sets, you pay the price of having to compare against each
and every element until you find the correct one.
A binary search against a sorted dataset has a fairly fast compare with a
slower iteration to the next element, since it needs to calculate where it
should do the next compare. This additional logic in the algorithm means
that each step in the algorithm is slower than in the linear algorithm, but it
drastically reduces the number of steps involved. Therefore the binary search
is faster when the number of elements becomes the dominant factor.
The original question asked about the best way to find elements in a sorted
ArrayList, but I would like to bring up an alternative: try using a
Hashtable instead. It has a higher memory footprint, but the performance is
exceptional. For completeness in this discussion I would like to point out
that Hashtable retrieval is a constant time operation, O(1). I ran a quick
test comparing Hashtable vs linear search vs BinarySearch and the results
are impressive. Hashtable wins hands down when more than a handful of
elements are involved.
Bill
Hi Brian,
Actually, I thought more about what you said and it's clear to me now that you
meant O(1) doesn't reveal the actual time it takes, it just shows that the
time is constant, whatever value it may be.
10 still sounds arbitrary to me, however :)

Dave Sexton
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP02.phx.gbl...
HI Brian,
Thanks for the reply.
<snip>
>This works well despite the fact that the ListDictionary uses an O(n) implementation as opposed to the Hashtable's O(1) implementation. It would seem, based on the Big-O notation anyway, that a Hashtable would always be faster, but what Big-O doesn't reveal is the constant time factor which is significant in the case of a Hashtable
Doesn't O(1) indicate the constant factor?
>Typically, a ListDictionary will outperform a Hashtable if the number of items is less than 10.
I'd debate that if I had more knowledge on the subject ;)

Dave Sexton
"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
>Dave Sexton wrote:
>>Hi Bill,
It seems you are suggesting that the scale of the data can affect whether a required sorting operation will actually reduce the overall performance of an algorithm so that it becomes worse than a linear search. Can this be true even when the data is sorted as it's being entered into the structure?
Yep, that's basically what he's saying. A good example of this is the HybridDictionary. The HybridDictionary starts out as a ListDictionary and then converts to a Hashtable when the number of items is large enough. This works well despite the fact that the ListDictionary uses an O(n) implementation as opposed to the Hashtable's O(1) implementation. It would seem, based on the Big-O notation anyway, that a Hashtable would always be faster, but what Big-O doesn't reveal is the constant time factor which is significant in the case of a Hashtable. Typically, a ListDictionary will outperform a Hashtable if the number of items is less than 10.
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
[...]
From what has been discussed so far it sounds like this line of thought is
incorrect; there are just too many variables. Choosing an appropriate
pivot element in a quicksort based on the actual data being sorted, for
example, seems to prevent any constant threshold such as "30" from being
appropriate, in general.
So is there some practical guideline such as 30 or is this truly an
arbitrary number?
As far as I know, your understanding is correct. That is, I don't have any
particular expertise in the area, but I would agree with you that the number
30 (or the number 10, as stated in a different message) is pretty much
arbitrary.
I believe that if you know the exact algorithms, implementation details, and
have some representative data to consider, you could come up with an exact
number for the break-even point. But since that number does depend on the
exact implementation as well as the nature of the data (affecting whether
actual performance is closer to best, average, or worst case), I can't
imagine that one could come up with a single number that applies in all
situations.
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak.
I understand why it looks like that, but that's not my understanding of the
nature of the question.
When a collection is small, there is more of a chance that the data will
coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can
increase the effectiveness of an algorithm much like how larger data sets
provide more accurate results, probabilistically.
Have I understood you correctly?
I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.
As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contributions of those costs increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.
If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.
I'm pretty sure that's right. But don't take my word for it. :)
Pete
Hi Peter,
<snip>
>When a collection is small, there is more of a chance that the data will coalesce in a way that undermines the effectiveness of an algorithm. Larger collections may have a better distribution of data, which can increase the effectiveness of an algorithm much like how larger data sets provide more accurate results, probabilistically.
Have I understood you correctly?
I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.
Yes, I see that.
I jumped to conclusions when I read "you can't rely on the order when the data
set is small".
That sounded to me just like how probability works (expressed inversely)  the
larger the sample, the more accurate the prediction. I saw the big O notation
as a prediction of an algorithm's effectiveness on any given set when I wrote
that response.
After much mulling, I see now (again) that the notation describes the number
of steps that an algorithm might require in a given case, but not the chance
of that case actually happening. Cases such as "best", "worst" and "average"
can be expressed by the big O notation. The chances that any of those cases
will occur depends on other factors that simply aren't expressed by the
notation, such as the amount of data, its structure or its distribution within
the set.
As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contributions of those costs increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.
Great response, thanks.
If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.
Got it.
<snip>

Dave Sexton
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:OD**************@TK2MSFTNGP04.phx.gbl...
Hi Bill,
The results of your test are interesting, but I'm sure it depends on
the distribution of data in the Hashtable, does it not?
If all of the data is in a single bucket doesn't a search operation
become O(n)?
A Hash Table can indeed have linear performance if your hashing function
does not suit your data. There is a pretty good explanation of this
topic at http://en.wikipedia.org/wiki/Hash_function
I don't know enough about the implementation behind the dotnet Hashtable
class to know if it actively avoids long chains in a single bucket.
Theoretically, a hash table implementation can alter its algorithm to
suit the data and minimize collisions.
Assuming that you have a quality Hashing function, it can be a
frustrating experience just attempting to come up with a data set that
will "Defeat" the hash (make it linear). The reason for this is that
good hash functions tend to be "One Way". By that I mean that it is hard
to find a valid input that will produce a given output except through
trial and error.
Bill
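Dave's single-bucket scenario is easy to manufacture deliberately. This Python sketch uses a contrived key type (nothing from the FCL) whose hash forces every key to collide, so lookups degrade from O(1) toward the linear scan:

```python
class BadKey:
    """Key type whose hash sends every instance to the same slot, deliberately
    defeating the hash table: each lookup must probe past colliding entries."""
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 0            # every key collides

    def __eq__(self, other):
        return self.n == other.n

# Still gives correct answers, but lookup cost is now O(n), not O(1).
table = {BadKey(i): i for i in range(1000)}
print(table[BadKey(999)])   # 999, found only after walking the collision run
```

With a quality hash function, constructing such a colliding set by accident is as hard as Bill describes; here we only get it by sabotaging `__hash__` ourselves.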
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
<snip>
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak. When a collection is small, there is more of
a chance that the data will coalesce in a way that undermines the
effectiveness of an algorithm. Larger collections may have a better
distribution of data, which can increase the effectiveness of an
algorithm much like how larger data sets provide more accurate
results, probabilistically.
Have I understood you correctly?
It is not a probabilistic issue. It is an asymptotic issue. Let me give
a simple example to demonstrate.
Suppose I have some algorithm that performs as follows:
T = A + B*N
where
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection
The average time per element is then
t = T/N = A/N + B
As N gets large, the first term goes to zero since the cost is averaged
across all elements in the collection. The net effect is that for large
N you can simply ignore the A term altogether and simply assume that T =
B*N.
Here is an example with more factors
T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)
for large values of N the third term dominates the equation and
therefore this algorithm is O(N^2).
For small values of N it is quite likely that the N^2 term is
negligible.
Let's pick some real values to demonstrate.
Say A=200, B=20, C=1
N=4: The constant term dominates
T = 200 + 80 + 16
N=10: The constant term and Linear term dominate but the N^2 term is
growing
T = 200 + 200 + 100
N= 20 : The constant term is not a big factor anymore, but the N^2 term
is now important
T = 200 + 400 + 400
N = 40 : The N^2 term is starting to dominate
T = 200 + 800 + 1600
N = 100 : No contest
T = 200 + 2000 + 10000
As you can see, the N^2 behavior is only important as the number of
elements grows. For smaller collections, the lower order terms can
dominate the total time. This is the reason why a linear search can
actually be faster than a binary search for small collections. Once your
collection reaches a critical mass however, it is no contest.
Hope this helps
Bill
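Bill's arithmetic above can be reproduced directly; using his constants A=200, B=20, C=1:

```python
A, B, C = 200, 20, 1  # constants from the example above

def total_time(n):
    """T = A + B*N + C*N^2: constant, linear, and quadratic cost terms."""
    return A + B * n + C * n * n

for n in (4, 10, 20, 40, 100):
    # Print each term separately to show which one dominates at each size.
    print(f"N={n:3}: {A} + {B * n} + {C * n * n} = {total_time(n)}")
    # N=  4: 200 + 80 + 16 = 296       ... constant term dominates
    # N=100: 200 + 2000 + 10000 = 12200 ... N^2 term dominates
```

Plotting or tabulating the three terms like this is an easy way to estimate the break-even point for a given pair of algorithms, once you have measured their A, B, and C.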
Hi Bill,
>The results of your test are interesting, but I'm sure it depends on the distribution of data in the Hashtable, does it not?
If all of the data is in a single bucket doesn't a search operation become O(n)?
A Hash Table can indeed have linear performance if your hashing function
does not suit your data. There is a pretty good explanation of this topic at http://en.wikipedia.org/wiki/Hash_function
I don't know enough about the implementation behind the dotnet Hashtable
class to know if it actively avoids long chains in a single bucket.
Theoretically, a hash table implementation can alter its algorithm to suit
the data and minimize collisions.
Well, I was asking because your test assumes that the keys were all hashed
properly, and it made me realize that there must also be a worst case too. So
I understood O(1) to be best-case and O(n) to be worst-case for hash tables.
I'm aware of hash functions and that a hash table works only as well as the
hashing function used to generate its keys. I just wanted to be absolutely
certain that my understanding of what the O(n) notation represents was
accurate, in the context of a hash table.
I found an article that verifies O(n) to be the worst case:
"Hash Table" http://en.wikipedia.org/wiki/Hash_table
Assuming that you have a quality Hashing function, it can be a frustrating
experience just attempting to come up with a data set that will "Defeat" the
hash (make it linear). The reason for this is that good hash functions tend
to be "One Way". By that I mean that it is hard to find a valid input that
will produce a given output except through trial and error.
You make a good point, but there is still the possibility of a worst case no
matter how well the hashing function performs, because there is a finite number
of valid hashes. I understand that the worst case might be difficult to produce
and is probably not realistic in most real-world scenarios.
So do you think it's better to just ignore the worst case when speaking of
hash tables?
(I'm leaning towards yes myself, although Jon's point in another post about
supplying as much information about an algorithm as possible seems to apply
here as well)

Dave Sexton
Hi Bill,
<snip>
Suppose I have some algorithm that performs as follows:
T = A + B*N
where
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection
The average time per element is then
t = T/N = A/N + B
As N gets large, the first term goes to zero since the cost is averaged
across all elements in the collection. The net effect is that for large N
you can simply ignore the A term altogether and simply assume that T = B*N.
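Bill's point about the A term vanishing can be checked numerically. A small sketch (in Python, with A and B chosen arbitrarily since the original gives no units): the per-element cost t = A/N + B settles at B as N grows.

```python
# T = A + B*N  =>  per-element cost t = T/N = A/N + B
A, B = 200.0, 20.0

def per_element_cost(n):
    """Average cost per element for a collection of n elements."""
    return A / n + B

# As n grows, the A/n term shrinks toward zero and t approaches B.
for n in (1, 10, 100, 10_000):
    print(f"n={n:6}: t = {per_element_cost(n):.2f}")
```

For n = 1 the constant dominates (t = 220); by n = 10,000 the average is 20.02, indistinguishable from B alone, which is why the constant term drops out of the O notation.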
Very nice proof, thank you.
Algebra sure does bring back memories ;)
Here is an example with more factors
T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)
for large values of N the third term dominates the equation and therefore
this algorithm is O(N^2).
For small values of N it is quite likely that the N^2 term is negligible.
Interesting.
Let's pick some real values to demonstrate.
Say A=200, B=20, C=1
N=4: The constant term dominates
T = 200 + 80 + 16
N=10: The constant term and Linear term dominate but the N^2 term is growing
T = 200 + 200 + 100
N= 20 : The constant term is not a big factor anymore, but the N^2 term is
now important
T = 200 + 400 + 400
N = 40 : The N^2 term is starting to dominate
T = 200 + 800 + 1600
N = 100 : No contest
T = 200 + 2000 + 10000
As you can see, the N^2 behavior is only important as the number of elements
grows. For smaller collections, the lower order terms can dominate the total
time. This is the reason why a linear search can actually be faster than a
binary search for small collections. Once your collection reaches a critical
mass however, it is no contest.
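The table Bill walks through above can be reproduced mechanically. A short Python sketch that evaluates each term of T = A + B*N + C*N^2 separately, so you can watch the quadratic term overtake the others:

```python
# Bill's example: T = A + B*N + C*N^2 with A=200, B=20, C=1
A, B, C = 200, 20, 1

def terms(n):
    """Return the constant, linear, and quadratic terms for n elements."""
    return A, B * n, C * n * n

for n in (4, 10, 20, 40, 100):
    const, linear, quadratic = terms(n)
    print(f"N={n:3}: T = {const} + {linear} + {quadratic}"
          f" = {const + linear + quadratic}")
```

At N=4 the constant term is the largest; at N=20 the linear and quadratic terms tie; by N=100 the quadratic term is five times the other two combined, matching the "no contest" row above.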
Great, thank you.
I believe you've illustrated what Peter has written about choosing the
dominant factor to be represented by the O notation and how each factor
changes unevenly as the number of elements increases.
Here, O(N^2) notates the worst case?
O(A) is the best case (only one element)?
I assume the average case is closer to O(N^2), but how can it be notated?

Dave Sexton
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...
[...]
Here, O(N^2) notates the worst case?
Not necessarily. It only represents the worst case if you are specifically
looking at the algorithm's behavior in the worst case. Here, O(N^2) could
very well be the average case. Bill's example didn't actually specify the
algorithm, so there's no way to know in this particular example. But when
calculating order for an actual algorithm, you would have that information
and would be able to say whether the calculated order represented a best,
average, or worst case scenario.
Pete
Hi Peter,
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection
What information is missing?
I thought the example:
T = A + B*N + C*N*N
was representative of the behavior of any algorithm that exhibits an average
time of:
t = A/N + B + C*N
It seems to me that for any algorithm with an average time of t, governed by
this formula, there must be some way to express that using the O notation;
However, O(A/N + B + C*N) just doesn't seem correct to me.
I don't understand why the average time may vary depending upon the algorithm
because as soon as another variable (or constant) is introduced into an
algorithm this formula would no longer apply.
Conceptually, it seems reasonable that t will be close to N^2, given an
infinite result set, but I lack in the math skills to prove it. I suspect
that "asymptotic" is the keyword here, again, but that's currently outside the
scope of my understanding. I should spend some time and RTFM :)
(Thanks for all the help everyone)

Dave Sexton
"Peter Duniho" <Np*********@NnOwSlPiAnMk.com> wrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...
>[...] Here, O(N^2) notates the worst case?
Not necessarily. It only represents the worst case if you are specifically
looking at the algorithm's behavior in the worst case. Here, O(N^2) could
very well be the average case. Bill's example didn't actually specify the
algorithm, so there's no way to know in this particular example. But when
calculating order for an actual algorithm, you would have that information
and would be able to say whether the calculated order represented a best,
average, or worst case scenario.
Pete
Dave Sexton wrote:
Hi Brian,
Actually, I thought more about what you said and it's clear to me now that you
meant O(1) doesn't reveal the actual time it takes, it just shows that the
time is constant, whatever value it may be.
10 still sounds arbitrary to me, however :)
It comes from the MSDN documentation. But, yeah, it's very arbitrary.
I think if you ran your own tests you'd observe a break-even point that
was close to 10, but probably not exactly 10.
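The break-even idea can be made concrete by counting comparisons instead of timing anything (a Python sketch, not a measurement of the actual .NET ArrayList; the cutoff of 10 comes from the MSDN remark above). Linear search costs about i+1 comparisons to find the element at index i, while binary search costs roughly log2(n) regardless of position, so for small lists, or items near the front, linear can win.

```python
import math

def linear_comparisons(data, target):
    """Comparisons a linear search makes before finding target."""
    for count, item in enumerate(data, start=1):
        if item == target:
            return count
    return len(data)  # not found: every element was examined

def binary_comparisons(n):
    """Worst-case comparisons for binary search over n sorted elements."""
    return max(1, math.ceil(math.log2(n + 1)))

data = list(range(100))
print(linear_comparisons(data, 2))    # near the front: 3 comparisons
print(binary_comparisons(len(data)))  # binary search: 7 comparisons
print(linear_comparisons(data, 90))   # near the back: 91 comparisons
```

For a 100-element sorted list, binary search's 7 comparisons beat linear search everywhere except the first few positions, which is why the crossover sits at such a small n.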
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...
Hi Peter,
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection
What information is missing?
I can't tell you unless you tell me what algorithm you're asking about.
Perhaps none. Perhaps all sorts of information. It depends.
I thought the example:
T = A + B*N + C*N*N
was representative of the behavior of any algorithm that exhibits an
average time of:
t = A/N + B + C*N
It is, by definition. The latter equation is simply the former, divided by
the number of elements. It's by definition the mathematical average of the
time for a given element.
It seems to me that for any algorithm with an average time of t, governed
by this formula, there must be some way to express that using the O
notation; However, O(A/N + B + C*N) just doesn't seem correct to me.
Good...because it's not. :)
I don't understand why the average time may vary depending upon the
algorithm because as soon as another variable (or constant) is introduced
into an algorithm this formula would no longer apply.
The formula Bill gave is just an example. It applies only to a theoretical
example algorithm, and can't be used generally. He's just trying to show
how an algorithm with theoretically worse order can still be better for
small numbers of elements.
Pete
Hi Peter,
I understood what Bill was trying to show me  that part was clear.
My confusion was in the fact that I thought the formula WAS the algorithm
after I completed the article, but I think I got it now.
My own inattentive mistake. Thanks for clearing that up.

Dave Sexton