Bytes | Developer Community
ArrayList BinarySearch vs Contains

Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.

Thanks,

Tom
Nov 10 '06 #1
"tshad" <ts**********@ftsolutions.com> wrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.
Then BinarySearch is better. Contains will always be an O(n) operation
(since it doesn't know whether the data is sorted), while BinarySearch is an
O(lg(n)) operation. Just make sure that the data really is sorted!
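
The difference is easy to see in a small sketch. The thread is about .NET's
ArrayList, but the idea carries over directly; here is a Python illustration,
where a linear scan stands in for Contains and `bisect` stands in for
BinarySearch on a sorted list (names below are made up for the example):

```python
import bisect

def contains_linear(items, target):
    # O(n): walks the list front to back, like ArrayList.Contains
    for item in items:
        if item == target:
            return True
    return False

def contains_binary(sorted_items, target):
    # O(log n): repeatedly halves the search range, like
    # ArrayList.BinarySearch -- valid only on sorted data
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

names = sorted(["ada", "grace", "linus", "tom"])
print(contains_linear(names, "grace"))   # True
print(contains_binary(names, "grace"))   # True
print(contains_binary(names, "zoe"))     # False
```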

-cd
Nov 10 '06 #2
"tshad" <ts**********@ftsolutions.com> wrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.
Contains is O(n).
BinarySearch is O(log(n)).

For small collections, Contains is probably faster.
For large collections, BinarySearch is faster.

If you are frequently looking for data near the front of the collection,
Contains might have an edge.

In short, it depends on the number of items and typical usage.

Hope this helps
Bill
Nov 10 '06 #3
"Carl Daniel [VC++ MVP]" <cp*****************************@mvps.org.nospam>
wrote in message news:O9**************@TK2MSFTNGP03.phx.gbl...
"tshad" <ts**********@ftsolutions.com> wrote in message
news:u8**************@TK2MSFTNGP04.phx.gbl...
>Which is better to use with an ArrayList: BinarySearch or Contains?

The list is only going to have strings in it and it will be sorted.

Then BinarySearch is better. Contains will be an O(n) operation always
(since it doesn't know whether the data is sorted), while BinarySearch is
an O(lg(n)) operation. Just make sure that the data really is sorted!
What do O(n) and O(lg(n)) mean?

Thanks,

Tom
Nov 10 '06 #4
"tshad" <ts**********@ftsolutions.com> wrote in message
news:eF**************@TK2MSFTNGP03.phx.gbl...
What do O(n) and O(lg(n)) mean?
This question popped up just this week.
Here was my answer
If memory serves correctly, the O stands for "Order".
Thus O(n) would be read "order n" and O(n^2) would be "order n
squared".
This is a measure of how efficiently a given algorithm performs.
Technically it is a measure of the asymptotic behavior as the number of
elements gets very large, but it is often appropriate even for fairly
small collections.

O(1) means that an operation takes a constant amount of time,
independent of the number of items involved. A hashtable retrieval is an
O(1) operation.

O(n) means that an operation takes time proportional to the number of
elements (n). If a collection has twice as many elements, the operation
takes twice as long. A linear search is an O(n) operation.

Others include:
O(n^2): grows as n*n (not very efficient for large collections)
O(log(n)): grows as the logarithm of the number of elements
O(n*log(n)): you get the idea.
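
The growth rates above can be made concrete by counting worst-case steps for
the two searches in this thread (a Python sketch; the step formulas are
back-of-the-envelope, not a benchmark):

```python
import math

def linear_steps(n):
    # Worst case for a linear search: examine every element
    return n

def binary_steps(n):
    # Worst case for a binary search: halve the range until a single
    # element remains, then one final comparison
    return max(1, math.ceil(math.log2(n)) + 1)

for n in (10, 1_000, 1_000_000):
    print(f"n={n:>9,}: linear worst case {linear_steps(n):>9,} steps, "
          f"binary worst case {binary_steps(n):>2} steps")
```

For a million items the linear scan may need a million comparisons; the
binary search needs about twenty.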

Hope this helps
Bill

Nov 10 '06 #5
"tshad" <ts**********@ftsolutions.com> wrote in message
news:eF**************@TK2MSFTNGP03.phx.gbl...
What do O(n) and O(lg(n)) mean?
It's a Computer Science way of describing the complexity, or cost, of an
algorithm. The "n" refers to the size of the data, the "O" is read "order",
and the function inside the parentheses describes how the running time of
the operation grows with the size of the data.

In this particular case, O(n) refers to the fact that a straight linear
search through the data will always take an amount of time that is directly
proportional to the size of the data, and increases linearly as the size of
the data increases. On the other hand, the BinarySearch method takes
advantage of the fact that the data is sorted, allowing a search for a
particular data item to increase in time proportional to log(n). Since
log(n) grows much more slowly than n, this means BinarySearch is much
more efficient, on average.

If you intend to do any serious programming, especially where performance is
an issue, you would do well to find a good textbook or other reference on
algorithms and learn not only about the concept of "order", but also what
common algorithms have what order and how the order is affected by the data
(for example, the average case for a binary search is O(log(n)), but
depending on how the data is stored the worst case might wind up being O(n)
anyway).

Pete
Nov 10 '06 #6
Hi Tom,

"Big O Notation"
http://en.wikipedia.org/wiki/Big_o_notation

Loosely, you can read O(...) as a bound on the number of operations needed,
in the worst case, to find the specified item in the list.

e.g., if you have 10 items in the list, then a linear search is O(n): the
worst case is examining every item in the list (10 operations). A binary
search is O(lg(n)): roughly the binary logarithm of 10 operations. That is
the worst case because a binary search cuts the list in half, then
performs the same operation again on the half that contains the search item.
The pattern continues until only one or zero items remain. If the list
isn't sorted properly, then the search may fail, because the algorithm might
choose the "wrong" half in which to continue searching after comparing the
current item to the search item.

"Binary Logarithm"
http://en.wikipedia.org/wiki/Binary_logarithm

"Binary Search"
http://en.wikipedia.org/wiki/Binary_search
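
The halving described above can be sketched directly (Python for
illustration; this mirrors what a sorted-array BinarySearch does, including
how it can miss an item when the input isn't sorted):

```python
def binary_search(sorted_items, target):
    # Repeatedly halve the range [lo, hi) until the target is found or
    # the range is empty. Requires sorted input -- on unsorted data the
    # "wrong" half may be discarded and the search can miss the item.
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1       # target can only be in the right half
        else:
            hi = mid           # target can only be in the left half
    return -1                  # not found

names = ["ada", "grace", "linus", "tom"]      # already sorted
print(binary_search(names, "linus"))          # 2
print(binary_search(["tom", "ada"], "tom"))   # -1: unsorted, so it misses
```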

--
Dave Sexton

Nov 10 '06 #7
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
e.g., if you have 10 items in the list then the worst case in a linear
search would be O(n). This means that the worst case scenario would be
examining every item in the list (10 operations). The worst-case in a
binary search would be O(lg(n)).
Note:

"Order" stated by itself is almost always for *average* performance, not
worst-case. One would normally specify explicitly if describing the
worst-case order for an algorithm.

For example, it is NOT true that "the worst-case in a binary search would be
O(log(n))" for all implementations of a binary search. If you have a sorted
array, it's true that the worst case is O(log(n)). However, if you have a
binary tree, a search on that tree could be as bad as O(n).
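
The binary-tree caveat is easy to demonstrate: inserting already-sorted keys
into an unbalanced binary search tree produces a degenerate, linked-list-
shaped tree, so a search walks every node (a Python sketch of a naive,
non-rebalancing BST; real tree containers rebalance to avoid this):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Naive (unbalanced) BST insert -- no rebalancing at all
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search_depth(root, key):
    # Count how many nodes a search visits before finding the key
    steps, node = 0, root
    while node is not None:
        steps += 1
        if key == node.key:
            return steps
        node = node.left if key < node.key else node.right
    return steps

root = None
for k in range(1, 101):        # sorted input: worst case for a naive BST
    root = insert(root, k)
print(search_depth(root, 100)) # visits all 100 nodes: O(n), not O(log n)
```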

Pete
Nov 11 '06 #8
Hi Peter,

My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting me.
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst-cases :)

--
Dave Sexton

Nov 11 '06 #9
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:Ot**************@TK2MSFTNGP02.phx.gbl...
My interpretation of the wikipedia articles is obviously off a little. I
thought O notation always represented worst-case, so thanks for correcting
me.
I just took a look at the article you referenced. Wow. They sure didn't
cover it in that kind of detail in my CS classes way back when. :) Anyway, I
can see where the confusion comes from, since the phrase "asymptotic upper
bound" could be read to imply "worst case" (the "upper bound" certainly has
that implication in daily usage). I can't say as I blame you for reading it
that way.

That's what we lowly programmers get for using the same concept that those
high-falutin' "algorithm scientists" use. :)
I guess it's just a fluke that I analyzed the two cases that actually are
expressing the worst-cases :)
I guess. Though, to be fair, in those two cases the order of the average
case is the same as the order of the worst case (if one assumes a binary
search on an array, which seems reasonable given this thread). So the
distinction is entirely academic. :)

Pete
Nov 11 '06 #10
Hi Peter,

I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p

Luckily, I've never been tasked with the responsibility of authoring
high-performance search algorithms on complex data structures. Although if I
was, I probably would have learned the correct meaning of the notation ;)

Thanks again.

--
Dave Sexton

Nov 11 '06 #11
Dave Sexton <dave@jwa[remove.this]online.com> wrote:
I must admit that my incomplete understanding of the O notation wasn't based
solely on wikipedia. I've always believed it to represent "worst-case", not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
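
The QuickSort gap is easy to observe by counting comparisons with a
deliberately naive first-element pivot (a Python sketch meant to expose the
O(n^2) worst case, not how library sorts are actually written):

```python
import random

def quicksort(items, counter):
    # Naive quicksort with the first element as the pivot. On
    # already-sorted input every partition is maximally lopsided,
    # so the comparison count degenerates to n*(n-1)/2, i.e. O(n^2).
    if len(items) <= 1:
        return items
    pivot, rest = items[0], items[1:]
    counter[0] += len(rest)                  # pivot compared to each element
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

n = 200
shuffled = list(range(n))
random.shuffle(shuffled)

avg = [0]
quicksort(shuffled, avg)                 # average case: roughly n*log2(n)
worst = [0]
quicksort(list(range(n)), worst)         # sorted input: n*(n-1)/2 = 19900
print(avg[0], worst[0])
```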

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Nov 11 '06 #12
Hi Jon,
>I must admit that my incomplete understanding of the O notation wasn't
based
solely on wikipedia. I've always believed it to represent "worst-case",
not
average, for many years now. It's the math that makes it confusing, yet I
thought I had it right :p

I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.
That's a good idea to cover all bases.

Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?

What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time, but it's hard for me to compare both, so
which aspect has more relevance? Of course, I'd agree that it's better to
have all the information on hand when making the decision (even though
ignorance is bliss, as they say ;)

BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.

"Quicksort"
http://en.wikipedia.org/wiki/QuickSort

(The Quicksort article is particularly interesting with its visual
representations of the sort in action)

"Array.Sort Method"
http://msdn.microsoft.com/library/de...ssorttopic.asp

--
Dave Sexton
Nov 11 '06 #13
Contains is an O(n) operation while BinarySearch is an O(log n) operation.
With big lists and under intense load, this will make a huge difference, so
use BinarySearch to locate items in sorted lists.

Nov 11 '06 #14
Dave Sexton <dave@jwa[remove.this]online.com> wrote:
I think that when the two are significantly different, it's best to
give both. For instance, QuickSort has an average O(n log n)
complexity, but its worst case is O(n^2) IIRC.

That's a good idea to cover all bases.

Then the difficulty comes in comparing two algorithms that have asymmetric
average to worst case ratios. How can the algorithms be compared when they
don't have any common denominator?
They can only be compared with as much data as is available.
What I mean is that it's much easier to choose an algorithm based on either
average time or worst-case time but it's hard for me to compare both, so which
aspect has more relevance? Of course, I'd agree that it's better to have all
the information on hand when making the decision (even though ignorance is
bliss, as they say ;)
You'd need to have more information such as how likely the worst case
scenario is, and how easy it might be to avoid.

Note that just knowing the order isn't always enough either - two things
can be the same order, but with a large constant factor involved, eg one
will always be five times slower than the other. IIRC (again!) a merge
sort can *always* be O(n log n) but it has other costs both in memory
and time factors which are higher than QuickSort, which is why
QuickSort is used more often.
BTW, wikipedia agrees with your math and so does the Array.Sort doc on MSDN,
so it seems you did remember correctly.
And to think people claim miracles don't happen in the 21st century ;)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Nov 11 '06 #15
"Peter Duniho" <Np*********@NnOwSlPiAnMk.com> wrote in message
news:12*************@corp.supernews.com...
<snip>
If you have a sorted array, it's true that the worst case is
O(log(n)).
Everyone is making wonderful points on this topic, but I wanted to
underline a few key points.
As Peter said, a BinarySearch on a sorted ArrayList (or a balanced
binary tree) will have a worst-case performance of O(log(n)).

Another important point about Big O notation is that it describes
asymptotic behavior. Thus, for sufficiently large n, an O(log(n))
algorithm will outperform an O(n) algorithm. The problem is determining
what constitutes "sufficiently large n". For small values of n,
O(log(n)) algorithms often perform WORSE than linear algorithms due to
the additional setup complexity. Of course, for small collections, ANY
search will be fast, so it doesn't really matter.
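
The "sufficiently large n" crossover can be probed with a quick timing
sketch (Python here; the actual crossover point varies by machine, runtime
and implementation, so treat the numbers as illustrative only):

```python
import bisect
import timeit

def crossover_demo(n):
    # Time a membership test near the end of the list: the worst case
    # for the linear scan, an ordinary case for the binary search.
    data = list(range(n))
    target = n - 1
    linear = timeit.timeit(lambda: target in data, number=500)
    binary = timeit.timeit(
        lambda: bisect.bisect_left(data, target) < n, number=500)
    return linear, binary

for n in (4, 10_000):
    linear, binary = crossover_demo(n)
    print(f"n={n}: linear={linear:.6f}s  binary={binary:.6f}s")
```

For tiny lists the two are close (and either is fast); for large lists the
binary search wins decisively.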

Bill
Nov 11 '06 #16
Hi Jon,

So it seems that choosing an appropriate algorithm depends not only on what
the O notation says about performance, but also on factors that the notation
doesn't account for, such as the scale, depth and distribution of the data,
which may also affect the performance of the algorithm.

I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick one
and see if it performs better than any other algorithm that I decide to test.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable criteria
for performance and memory consumption.

I'll be outsourcing any project in the future where this is required ;)

Thanks.

--
Dave Sexton

Nov 11 '06 #17
Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?

Since I'm not one for micro-optimization either I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there may
be substantial amounts of data at times and just swallow the performance hit
on those smaller collections. I guess if there is a diametric aspect to the
data then maybe switching between linear and binary algorithms at runtime
might be acceptable. I've never had a project that required this much thought
on search algorithms, however :)

Thanks.

--
Dave Sexton

Nov 11 '06 #18
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:ux**************@TK2MSFTNGP02.phx.gbl...
Hi Jon,

So it seems that choosing an appropriate algorithm depends not only on
what the O notation says about performance, but other factors that the
notation doesn't account for such as scale, depth and distribution of the
data, perhaps, which also may affect the performance of the algorithm.
Yup. Welcome to the world of optimization. :) The "order" of an algorithm
is an important analytical aspect, but as you've noted one should not
consider it to be the last word in which algorithm is "best". There are
so many other factors that come into play, the order by itself just doesn't
tell you everything you need to know.
I guess the best way to choose an appropriate algorithm for an application
when dealing with complex data structures, sorts and searches, is to pick
one and see if it performs better than any other algorithm that I decide
to test.
Well, yes and no. I mean, it's useful to use the order as a starting point.
After all, you can pretty much rule out even testing a bubble sort when
you've got a Quicksort implementation to use. You don't literally need to
test all possible solutions, thank goodness.

Also note that testing your potential solutions does not necessarily tell
you worst-case performance. It depends on how you pick your data for the
tests, and different algorithms may have different data that results in the
worst-case performance. So, testing can be useful, but it's not necessarily
an improvement over just looking at the order. To improve upon the order
analysis, you have to do more analysis. Testing alone won't cut it.
I can choose the algorithms to be initially tested based on all of the
information that is available, ahead of time, about the data and its
structure, the noted performance of the algorithms and the acceptable
criteria for performance and memory consumption.

I'll be outsourcing any project in the future where this is required ;)
That works. :)

That said, I'll note that much of the time, absolutely *optimal* performance
is not required, while at the same time the very worst cases are not
necessarily that hard to avoid. For example, even in the average case,
Quicksort is dependent on the distribution of the data and how you pick the
pivot element for each iteration. Even when the magnitude is reasonably
proportional to n log(n), there can be wide variations in real-time
performance.

This can be addressed either by playing some tricks when picking the pivot,
or by simply ignoring the problem and accepting that occasionally you'll get
data organized in a way that hinders the sorting performance. The very
worst case comes when the data is already sorted, and this condition can be
detected fairly easily (and of course, if you detect it, you need not do any
more work :) ).

I'm just using Quicksort as an example. The same ideas apply to pretty much
any algorithm. Knowing the order is useful, but it's not the end-all,
be-all for picking an algorithm or implementation. You can use the order to
make some broad generalizations, and then refine the implementation to
address specifics that apply in one's particular situation. And in the end,
most coders simply accept that sometimes the universe will throw some really
hard data at the algorithm, and it won't get the very best performance it
might have gotten. :)

Pete
Nov 11 '06 #19
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
It seems you are suggesting that the scale of the data can affect whether
a required sorting operation will actually reduce the overall performance
of an algorithm so that it becomes worse than a linear search.
His point (I think) is that some algorithms only work after some initial
setup, and that setup can be costly. I'm not sure that all sorting and
searching algorithms are a good example of this. For example, depending on
the implementation of a Quicksort, it can perform just as well on small data
sets as large ones. Likewise binary search (though for very small data
sets, the difference between n and log n is very small).

But it is certainly true that the order of an algorithm considers only a
data set that's basically infinite. There can be wide variations in
performance, well away from the theoretical performance, when the data set
is very small. To make matters worse, what constitutes "very small" or
"large enough" varies depending on the algorithm.

I think this is what Bill was saying...I don't think I'm adding anything
new, but hopefully am just restating it in a slightly different way, which
might make it clearer to some people (just as it may obfuscate it to others
:) ).
Can this be true even when the data is sorted as it's being entered into
the structure?
Ack! And I mean that in the Bill the Cat way, not the i/o handshaking way.
:)

The analysis becomes very different when you've got an incremental situation
like that. The order analysis for these different algorithms assumes you
start with all the data and run through it once to produce the end results.
What is best in that case may very well not be best for a case where you
want to perform the same operation after each data element is added.

For example, resorting an array would be a silly way to sort data as it's
being added to a data structure. After all, if the data is already sorted,
sorting it again just to add one element is very inefficient. Even worse, as
I noted elsewhere, the worst-case performance for some sorting algorithms
comes when the data starts out sorted. This is a classic example of why
knowing the exact scenario the algorithm is to be used in is at least as
important as knowing the order of the algorithm. :)

I would never use Quicksort to sort data as it's being entered into a data
structure. It's one of the worst choices in that case, and you practically
guarantee minimal performance, even though the order of the algorithm is
superior on average.
Since I'm not one for micro-optimization either I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there
may be substantial amounts of data at times and just swallow the
performance hit on those smaller collections.
As Bill notes, when the data size is small, the order of the algorithm is
usually not relevant since any algorithm will be fast. So yes, swallowing
the performance hit is fine in that case.

However, that's not the same as ignoring the exact scenario being solved.
You need to take into account how the algorithm will be used, and not create
a situation where you choose an algorithm that has a good average case
order, but will always be used where the worst case order applies.

I hope that makes sense...I'm typing quickly because I'm hungry and want to
eat lunch. :)

Pete
Nov 11 '06 #20
Hi Peter,

I have a lot of experience optimizing code, just not for search and sorting
algorithms. They appear now to require similar amounts of knowledge and
effort. The main difference appears to be that search and sort algorithms
have a mathematical basis, whereas general performance tweaking in code just
requires an understanding of the coding language, the framework and the
expected use of the application, without the need for any math. Testing for
performance and sampling data has always been good enough on my projects so
far. The mathematical aspects of the algorithms are confusing, so I just
always leaned on my original understanding of the O notation, expecting it
to tell the whole story.

Thanks for your comprehensive response.

--
Dave Sexton

Nov 11 '06 #21
Hi Peter,

Informative and entertaining ;)

I thought that "setup" might have been referring to the prerequisite that some
algorithms have for sorted data. If not, what exactly does "setup" mean?

I'm stepping out to eat too (and go see Borat). We can reconvene at 8:30 :p

--
Dave Sexton

"Peter Duniho" <Np*********@NnOwSlPiAnMk.comwrote in message
news:12*************@corp.supernews.com...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
>It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of
an algorithm so that it becomes worse than a linear search.

His point (I think) is that some algorithms only work after some initial
setup, and that setup can be costly. I'm not sure that all sorting and
searching algorithms are a good example of this. For example, depending on
the implementation of a Quicksort, it can perform just as well on small data
sets as large ones. Likewise binary search (though for very small data
sets, the difference between n and log n is very small).

But it is certainly true that the order of an algorithm considers only a
data set that's basically infinite. There can be wide variations in
performance, well away from the theoretical performance, when the data set
is very small. To make matters worse, what constitutes "very small" or
"large enough" varies depending on the algorithm.

I think this is what Bill was saying...I don't think I'm adding anything
new, but hopefully am just restating it in a slightly different way, which
might make it clearer to some people (just as it may obfuscate it to others
:) ).
>Can this be true even when the data is sorted as it's being entered into
the structure?

Ack! And I mean that in the Bill the Cat way, not the i/o handshaking way.
:)

The analysis becomes very different when you've got an incremental situation
like that. The order analysis for these different algorithms assumes you
start with all the data and run through it once to produce the end results.
What is best in that case may very well not be best for a case where you
want to perform the same operation after each data element is added.

For example, resorting an array would be a silly way to sort data as it's
being added to a data structure. After all, if the data is already sorted,
sorting it again just to add one element is very inefficient. Even worse, as
I noted elsewhere, the worst-case performance for some sorting algorithms
comes when the data starts out sorted. This is a classic example of why
knowing the exact scenario the algorithm is to be used is at least as
important as knowing the order of the algorithm. :)

I would never use Quicksort to sort data as it's being entered into a data
structure. It's one of the worst choices in that case, and you practically
guarantee minimal performance, even though the order of the algorithm is
superior on average.
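The better incremental approach is to binary-search for the insertion point and insert there, so the list stays sorted without ever re-sorting. A minimal Python sketch (the helper name is made up for illustration; `bisect` does the binary search):

```python
import bisect

def insert_sorted(items, value):
    """Insert value into an already-sorted list, keeping it sorted.
    bisect finds the insertion point in O(log n); the list shift is
    O(n), but no full re-sort ever happens."""
    bisect.insort(items, value)

names = []
for name in ["delta", "alpha", "charlie", "bravo"]:
    insert_sorted(names, name)
# names is sorted after every single insertion
```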
>Since I'm not one for micro-optimization either I'd rather just choose an
algorithm with an average performance of O(lg(n)) if I'm sure that there
may be substantial amounts of data at times and just swallow the
performance hit on those smaller collections.

As Bill notes, when the data size is small, the order of the algorithm is
usually not relevant since any algorithm will be fast. So yes, swallowing
the performance hit is fine in that case.

However, that's not the same as ignoring the exact scenario being solved.
You need to take into account how the algorithm will be used, and not create
a situation where you choose an algorithm that has a good average case
order, but will always be used where the worst case order applies.

I hope that makes sense...I'm typing quickly because I'm hungry and want to
eat lunch. :)

Pete

Nov 11 '06 #22
Dave Sexton wrote:
Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
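The idea behind HybridDictionary can be sketched in a few lines of Python (for illustration only; the threshold of 10 below is just the figure quoted above, not the class's actual internals):

```python
class HybridDict:
    """Sketch of a HybridDictionary-style container: a linear list of
    pairs while small, switching to a hash-based dict past a threshold."""
    THRESHOLD = 10  # assumed crossover point, per the discussion above

    def __init__(self):
        self._pairs = []    # small case: list of (key, value), O(n) lookup
        self._table = None  # large case: dict, O(1) average lookup

    def __setitem__(self, key, value):
        if self._table is not None:
            self._table[key] = value
            return
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        self._pairs.append((key, value))
        if len(self._pairs) > self.THRESHOLD:
            self._table = dict(self._pairs)  # one-time conversion
            self._pairs = None

    def __getitem__(self, key):
        if self._table is not None:
            return self._table[key]
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)
```

The payoff is that small dictionaries skip the hashing overhead entirely, while large ones still get constant-time lookup.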

Nov 11 '06 #23
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:OP*************@TK2MSFTNGP03.phx.gbl...
Hi Peter,

I have a lot of experience optimizing code, just not for search and
sorting algorithms. They appear now to require similar amounts of
knowledge and effort. The main difference appears to be that search and
sort algorithms have a mathematical base, but general performance tweaking
in code just requires an understanding of the coding language, the
framework and expected use of the application, without the need for any
math.
Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topic-related comment, than specifically
directed at you.

I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.

The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)

Pete
Nov 11 '06 #24
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:Ob**************@TK2MSFTNGP03.phx.gbl...
Hi Peter,

Informative and entertaining ;)

I thought that "setup" might have been referring to the prerequisite that
some algorithms have for sorted data. If not, what exactly does "setup"
mean?
It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm.
For smaller data sets, where the worse order is still not twice the better
order, the worse order algorithm will still perform better.

The take-away isn't so much the specific reasons that a better order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.

Pete
Nov 11 '06 #25
"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. [...] Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.

Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.

Pete
Nov 11 '06 #26
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
Hi Bill,

It seems you are suggesting that the scale of the data can affect
whether a required sorting operation will actually reduce the overall
performance of an algorithm so that it becomes worse than a linear
search. Can this be true even when the data is sorted as it's being
entered into the structure?
Absolutely.
Search algorithms have two distinct pieces to them:
- Comparisons: Is this the right element? Is it greater than this?
Less than this?
- Iteration: What is the next element in the collection that I should
compare against?

A Linear search does a Fast Compare and a fast iteration to the next
element, but it doesn't apply any logic to reduce the number of
comparisons required. Therefore, for small data sets, a linear search
can be the best option. For large sets, you pay the price of having to
compare against each and every element until you find the correct one.

A Binary search against a sorted dataset has a fairly fast compare with
a slower iteration to the next element, since it needs to calculate
where it should do the next compare. This additional logic in the
algorithm means that each step in the algorithm is slower than the
linear algorithm, but it drastically reduces the number of steps
involved. Therefore the Binary search is faster when the number of
elements becomes the dominant factor.
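The two pieces can be seen side by side in a rough Python sketch that counts probes (each loop pass counted as one comparison step, for illustration):

```python
def linear_search(items, target):
    """Linear search: cheap per step, but no logic to skip elements.
    Returns (index, probes), or (-1, probes) if absent."""
    probes = 0
    for i, item in enumerate(items):
        probes += 1
        if item == target:
            return i, probes
    return -1, probes

def binary_search(items, target):
    """Binary search on a sorted list: each step does extra work
    (computing the next midpoint) but halves the remaining range."""
    lo, hi = 0, len(items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if items[mid] == target:
            return mid, probes
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

data = list(range(1024))
_, linear_steps = linear_search(data, 1000)  # 1001 probes
_, binary_steps = binary_search(data, 1000)  # at most log2(1024)+1 = 11 probes
```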
The original question asked about the best way to find elements in a
sorted ArrayList, but I would like to bring up an alternative. Try
using a Hashtable instead. It has a higher memory footprint, but the
performance is exceptional. For completeness in this discussion I would
like to point out that Hashtable retrieval is a constant time operation
O(1). I ran a quick test comparing Hashtable vs Linear search vs
BinarySearch and the results are impressive. Hashtable wins hands down
when more than a handful of elements are involved.
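Bill's actual test isn't shown; a rough analogue looks like this (Python's dict standing in for Hashtable, purely for illustration; absolute timings will vary by machine):

```python
import bisect
import timeit

data = [str(i) for i in range(10_000)]
sorted_data = sorted(data)           # prerequisite for the binary search
table = {s: None for s in data}      # dict standing in for Hashtable

target = "9999"                      # near the end: worst case for the scan

def linear():
    # O(n): scan every element until found
    return target in data

def binary():
    # O(log n): probe midpoints of the sorted list
    i = bisect.bisect_left(sorted_data, target)
    return i < len(sorted_data) and sorted_data[i] == target

def hashed():
    # O(1) average: hash once, jump straight to the bucket
    return target in table

for fn in (linear, binary, hashed):
    elapsed = timeit.timeit(fn, number=1000)
    print("%-7s %.4fs for 1000 lookups" % (fn.__name__, elapsed))
```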
Bill
Nov 12 '06 #27
Hi Peter,
Yeah, I was being a bit careless. :) I actually would have been surprised,
given other posts you've written, had you NOT ever had to optimize code. My
"welcome" was more a general topic-related comment, than specifically
directed at you.
No need to cater to my narcissism - I understood what you meant and I
appreciate the help ;)
I guess the point is (as you say) simply that there are two parts to
optimization. One has to do with making sure a given implementation
performs well, and the other has to do with making sure the algorithm itself
performs well.

The latter is usually the first step, since if you have a bad algorithm, no
amount of code-level optimization can fix that. However, I also admit that
in many situations, it makes more sense to code *some* algorithm, *any*
algorithm, and then to look harder at the algorithm if it turns out that
that portion of the code is a bottleneck. As with other aspects of
theoretical coding practices, in reality there's a cycle in which the
results from a first pass through the theoretical process are used to
improve on subsequent passes through that process. :)
Yes, I agree that refinement is critical in applications that rely on
top-notch performance. It's rare to get it right the first time. I wouldn't
doubt that it's the same when coding search and sort algorithms. It does make
sense to me as well that being able to choose an appropriate algorithm before
implementing the framework in which it's executed is ideal, but that's not
always possible as you mentioned. I guess a lot of that depends on how many
of the variables that govern which algorithm is appropriate will continue to
change and how much. I.e., the less things will change, the easier it is to
choose an appropriate algorithm before choosing its code base.

--
Dave Sexton
Nov 12 '06 #28
Hi Peter,
>I thought that "setup" might have been referring to the prerequisite that
some algorithms have for sorted data. If not, what exactly does "setup"
mean?

It just depends on the algorithm. It could refer to a prerequisite, or it
could be some state initialization. Furthermore, this isn't the *only*
reason an algorithm with better order would perform worse on smaller data
sets. It's just an example. Another example would be an algorithm in which
each individual step takes twice as long for the better-order algorithm. For
smaller data sets, where the worse order is still not twice the better
order, the worse order algorithm will still perform better.
I was just speaking about this thread to a friend of mine who was quick to
jump in and say, "Use linear searches with 30 items or less and BinarySearch
otherwise", immediately after I said the words, "newsgroup thread on the big O
notation and sorting algorithms". I think this type of thought process
relating to search and sort algorithms is quite common among programmers.
It's really easy to understand and people just seem to love "rules" since it
makes things easy to remember.

From what has been discussed so far it sounds like this line of thought is
incorrect - there are just too many variables. Choosing an appropriate pivot
element in a quicksort based on the actual data being sorted, for example,
seems to prevent any constant threshold such as "30" from being appropriate,
in general.

So is there some practical guideline such as 30 or is this truly an arbitrary
number?
The take-away isn't so much the specific reasons that a better order
algorithm might be slower, but just that you can't rely on the order when
the data set is small. You have to have large data sets for the order to
come into play, because order is *defined* according to what are essentially
infinitely large data sets.
Big O notation seems to be like expressing algorithms in terms of probability,
which becomes more accurate as the number of "attempts" increases, so to
speak. When a collection is small, there is more of a chance that the data
will coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can increase
the effectiveness of an algorithm much like how larger data sets provide more
accurate results, probabilistically.

Have I understood you correctly?

--
Dave Sexton
Nov 12 '06 #29
HI Brian,

Thanks for the reply.

<snip>
This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable
Doesn't O(1) indicate the constant factor?
Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.
I'd debate that if I had more knowledge on the subject ;)

--
Dave Sexton

"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
Dave Sexton wrote:
>Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of
an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?

Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

Nov 12 '06 #30
Hi Peter,
>Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. [...] Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

Interestingly, that's also a good example of the other comment Bill made.
That is, if the break-even point is just 10 elements, there's not really any
point in bothering with the better-performing-but-worse-order algorithm,
since even the poor performance of the better-order algorithm is going to be
practically instantaneous as far as any user is concerned. There just
aren't enough data elements for ANY algorithm to take any significant amount
of time.

Assuming 10 is truly the break-even point, then I question the need for the
HybridDictionary class at all, if the only difference is that it uses one
method for data sets fewer than 10 and another for data sets larger than
that.
I was thinking the same thing (in my last post to you which was on a related
topic), but I think there is still use for the HybridDictionary:

If an application requires the use of many dictionaries at once, each
containing a small number of distinct elements to begin with, with the number
of elements increasing over time, and each searched consistently and
asynchronously, and if the best possible performance is required, then
HybridDictionary might make sense.

I'm not sure that it deserved to be in the FCL before LinkedList, however ;)

--
Dave Sexton
Nov 12 '06 #31
Hi Bill,

The results of your test are interesting, but I'm sure it depends on the
distribution of data in the Hashtable, does it not?

If all of the data is in a single bucket doesn't a search operation become
O(n)?

--
Dave Sexton

"Bill Butler" <qw****@asdf.comwrote in message
news:T8w5h.196$5F2.84@trnddc04...
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:u%****************@TK2MSFTNGP02.phx.gbl...
>Hi Bill,

It seems you are suggesting that the scale of the data can affect whether a
required sorting operation will actually reduce the overall performance of
an algorithm so that it becomes worse than a linear search. Can this be
true even when the data is sorted as it's being entered into the structure?

Absolutely.
Search algorithms have two distinct pieces to them:
- Comparisons: Is this the right element? Is it greater than this? Less
than this?
- Iteration: What is the next element in the collection that I should
compare against?

A Linear search does a Fast Compare and a fast iteration to the next
element, but it doesn't apply any logic to reduce the number of comparisons
required. Therefore, for small data sets, a linear search can be the best
option. For large sets, you pay the price of having to compare against each
and every element until you find the correct one.

A Binary search against a sorted dataset has a fairly fast compare with a
slower iteration to the next element, since it needs to calculate where it
should do the next compare. This additional logic in the algorithm means
that each step in the algorithm is slower than the linear algorithm, but it
drastically reduces the number of steps involved. Therefore the Binary search
is faster when the number of elements becomes the dominant factor.
The original question asked about the best way to find elements in a sorted
ArrayList, but I would like to bring up an alternative. Try using a
Hashtable instead. It has a higher memory footprint, but the performance is
exceptional. For completeness in this discussion I would like to point out
that Hashtable retrieval is a constant time operation O(1). I ran a quick
test comparing Hashtable vs Linear search vs BinarySearch and the results
are impressive. Hashtable wins hands down when more than a handful of
elements are involved.
Bill

Nov 12 '06 #32
Hi Brian,

Actually, I thought more about what you said and it's clear to me now that you
meant O(1) doesn't reveal the actual time it takes, it just shows that the
time is constant, whatever value it may be.

10 still sounds arbitrary to me, however :)

--
Dave Sexton

"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP02.phx.gbl...
HI Brian,

Thanks for the reply.

<snip>
>This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable

Doesn't O(1) indicate the constant factor?
>Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.

I'd debate that if I had more knowledge on the subject ;)

--
Dave Sexton

"Brian Gideon" <br*********@yahoo.comwrote in message
news:11*********************@i42g2000cwa.googlegroups.com...
>Dave Sexton wrote:
>>Hi Bill,

It seems you are suggesting that the scale of the data can affect whether
a
required sorting operation will actually reduce the overall performance of
an
algorithm so that it becomes worse than a linear search. Can this be true
even when the data is sorted as it's being entered into the structure?

Yep, that's basically what he's saying. A good example of this is the
HybridDictionary. The HybridDictionary starts out as a ListDictionary
and then converts to a Hashtable when the number of items is large
enough. This works well despite the fact that the ListDictionary uses
an O(n) implementation as opposed to the Hashtable's O(1)
implementation. It would seem, based on the Big-Oh notation anyway,
that a Hashtable would always be faster, but what Big-Oh doesn't reveal
is the constant time factor which is significant in the case of a
Hashtable. Typically, a ListDictionary will outperform a Hashtable if
the number of items is less than 10.


Nov 12 '06 #33
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
[...]
From what has been discussed so far it sounds like this line of thought is
incorrect - there's just too many variables. Choosing an appropriate
pivot element in a quicksort based on the actual data being sorted, for
example, seems to prevent any constant threshold such as "30" from being
appropriate, in general.

So is there some practical guideline such as 30 or is this truly an
arbitrary number?
As far as I know, your understanding is correct. That is, I don't have any
particular expertise in the area, but I would agree with you that the number
30 (or the number 10, as stated in a different message) is pretty much
arbitrary.

I believe that if you know the exact algorithms, implementation details, and
have some representative data to consider, you could come up with an exact
number for the break-even point. But since that number does depend on the
exact implementation as well as the nature of the data (affecting whether
actual performance is closer to best, average, or worst case), I can't
imagine that one could come up with a single number that applies in all
situations.
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak.
I understand why it looks like that, but that's not my understanding of the
nature of the question.
When a collection is small, there is more of a chance that the data will
coalesce in a way that undermines the effectiveness of an algorithm.
Larger collections may have a better distribution of data, which can
increase the effectiveness of an algorithm much like how larger data sets
provide more accurate results, probabilistically.

Have I understood you correctly?
I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.

As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contribution of those costs increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.

If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.

I'm pretty sure that's right. But don't take my word for it. :)

Pete
Nov 12 '06 #34
Hi Peter,

<snip>
>When a collection is small, there is more of a chance that the data will
coalesce in a way that undermines the effectiveness of an algorithm. Larger
collections may have a better distribution of data, which can increase the
effectiveness of an algorithm much like how larger data sets provide more
accurate results, probabilistically.

Have I understood you correctly?

I don't think so. :) I don't recall writing anything that tried to address
*why* the order of an algorithm is accurate only for large data sets. Only
that it is.
Yes, I see that.

I jumped to conclusions when I read "you can't rely on the order when the data
set is small".

That sounded to me just like how probability works (expressed inversely) - the
larger the sample, the more accurate the prediction. I saw the big O notation
as a prediction of an algorithm's effectiveness on any given set when I wrote
that response.

After much mulling, I see now (again) that the notation describes the number
of steps that an algorithm might require in a given case, but not the chance
of that case actually happening. Cases such as "best", "worst" and "average"
can be expressed by the big O notation. The chances that any of those cases
will occur depends on other factors that simply aren't expressed by the
notation, such as the amount of data, its structure or its distribution within
the set.
As for the whys, I admit once again to not being an expert on the topic.
However, if I recall my studies correctly the reason that the order applies
to infinitely large data sets is not because of a statistical convergence.
It's because when you analyze the cost of an algorithm, there are a variety
of components to that cost, but as the data set gets infinitely large, the
contribution of those costs increase unevenly. One eventually dominates,
and that's the one that is expressed as the order of the algorithm.
Great response, thanks.
If it were an issue of probabilities, then (for example) one would not talk
about the "average" or "worst-case" order of an algorithm. The "worst-case"
wouldn't exist, since you'd be considering only random distribution of the
input data, and an infinitely large data set. But as we've already seen,
for Quicksort even with an infinitely large data set, one can still have an
order that is different from a statistically random and large sample.
Got it.

<snip>

--
Dave Sexton
Nov 12 '06 #35
"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:OD**************@TK2MSFTNGP04.phx.gbl...
Hi Bill,

The results of your test are interesting, but I'm sure it depends on
the distribution of data in the Hashtable, does it not?

If all of the data is in a single bucket doesn't a search operation
become O(n)?
A Hash Table can indeed have linear performance if your hashing function
does not suit your data. There is a pretty good explanation of this
topic at http://en.wikipedia.org/wiki/Hash_function

I don't know enough about the implementation behind the dotnet Hashtable
class to know if it actively avoids long chains in a single bucket.
Theoretically, a hash table implementation can alter its algorithm to
suit the data and minimize collisions.

Assuming that you have a quality Hashing function, it can be a
frustrating experience just attempting to come up with a data set that
will "Defeat" the hash (make it linear). The reason for this is that
good hash functions tend to be "One Way". By that I mean that it is hard
to find a valid input that will produce a given output except through
trial and error.
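For illustration, here is a deliberately pathological hash in Python (a contrived toy class, not something a good real-world hash function makes easy to construct): with every key in one bucket, lookups degrade toward the O(n) worst case discussed above.

```python
class BadKey:
    """Key whose hash is constant: every instance lands in the same
    bucket, degrading dict lookups from O(1) toward O(n)."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42  # pathological: all keys collide
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

d = {BadKey(i): i for i in range(100)}
# Every lookup now walks the single collision chain, comparing keys
# one by one, instead of jumping straight to the right entry.
assert d[BadKey(99)] == 99
```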

Bill
Nov 12 '06 #36

"Dave Sexton" <dave@jwa[remove.this]online.comwrote in message
news:%2****************@TK2MSFTNGP03.phx.gbl...
<snip>
Big O notation seems to be like expressing algorithms in terms of
probability, which becomes more accurate as the number of "attempts"
increases, so to speak. When a collection is small, there is more of
a chance that the data will coalesce in a way that undermines the
effectiveness of an algorithm. Larger collections may have a better
distribution of data, which can increase the effectiveness of an
algorithm much like how larger data sets provide more accurate
results, probabilistically.

Have I understood you correctly?
It is not a probabilistic issue. It is an asymptotic issue. Let me give
a simple example to demonstrate.

Suppose I have some algorithm that performs as follows:

T = A + B*N

where
T = total time
A = (constant) Initialization
B = (constant) The time to test one element in a collection
N = (Variable) The number of elements in a collection

The average time per element is then

t = T/N = A/N + B

As N gets large, the first term goes to zero since the cost is averaged
across all elements in the collection. The net effect is that for large
N you can simply ignore the A term altogether and simply assume that T =
B*N.

Here is an example with more factors

T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)

for large values of N the third term dominates the equation and
therefore this algorithm is O(N^2).
For small values of N it is quite likely that the N^2 term is
negligible.

Lets pick some real value to demonstrate.

Say A=200, B=20, C=1

N=4: The constant term dominates
T = 200 + 80 + 16

N=10: The constant term and Linear term dominate but the N^2 term is
growing
T = 200 + 200 + 100

N= 20 : The constant term is not a big factor anymore, but the N^2 term
is now important
T = 200 + 400 + 400

N = 40 : The N^2 term is starting to dominate
T = 200 + 800 + 1600

N = 100 : No contest
T = 200 + 2000 + 10000

As you can see, the N^2 behavior is only important as the number of
elements grows. For smaller collections, the lower order terms can
dominate the total time. This is the reason why a linear search can
actually be faster than a binary search for small collections. Once your
collection reaches a critical mass however, it is no contest.
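The table above can be reproduced mechanically; a trivial Python sketch of the same arithmetic:

```python
A, B, C = 200, 20, 1  # constants from the example above

def total_time(n):
    """Return the three terms of T = A + B*N + C*N^2 separately,
    so their relative growth is visible."""
    return A, B * n, C * n * n

for n in (4, 10, 20, 40, 100):
    const, linear, quadratic = total_time(n)
    print("N=%3d: T = %d + %d + %d" % (n, const, linear, quadratic))
```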
Hope this helps
Bill
Nov 12 '06 #37
Hi Bill,
>>The results of your test are interesting, but I'm sure it depends on the
>>distribution of data in the Hashtable, does it not?
>>
>>If all of the data is in a single bucket doesn't a search operation become
>>O(n)?
>
>A Hash Table can indeed have linear performance if your hashing function
>does not suit your data. There is a pretty good explanation of this topic at
>http://en.wikipedia.org/wiki/Hash_function
>
>I don't know enough about the implementation behind the dotnet Hashtable
>class to know if it actively avoids long chains in a single bucket.
>Theoretically, a hash table implementation can alter its algorithm to suit
>the data and minimize collisions.
Well, I was asking because your test assumes that the keys were all hashed
properly and it made me realize that there must also be a worst case too. So
I understood O(1) to be best-case and O(n) to be worst-case for hash tables.

I'm aware of hash functions and that a hash table works only as well as the
hashing function used to generate its keys. I just wanted to be absolutely
certain that my understanding of what the O(n) notation represents was
accurate, in the context of a hash table.

I found an article that verifies O(n) to be worst-case:

"Hash Table"
http://en.wikipedia.org/wiki/Hash_table
>Assuming that you have a quality Hashing function, it can be a frustrating
>experience just attempting to come up with a data set that will "Defeat" the
>hash (make it linear). The reason for this is that good hash functions tend
>to be "One Way". By that I mean that it is hard to find a valid input that
>will produce a given output except through trial and error.
You make a good point, but there is still the possibility for worst-case no
matter how well the hashing function performs because there is a finite number
of valid hashes. I understand that worst-case might be difficult to produce
and is probably not realistic in most real-world scenarios.
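Out of curiosity, the worst case is easy to provoke on purpose. This sketch is Java (HashMap rather than the dotnet Hashtable, and the BadKey class is made up for illustration), but the principle carries over: a constant hash code puts every entry in one bucket, so lookups degrade toward O(n) even though the results stay correct:

```java
import java.util.HashMap;
import java.util.Map;

public class BadHashDemo {
    // Hypothetical key type with a constant hashCode: every instance
    // collides, so all entries land in a single bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; } // degenerate on purpose
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    static String buildAndLookup(int size, int probe) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < size; i++) {
            map.put(new BadKey(i), "value" + i);
        }
        // Still correct; only the lookup cost suffers, since every get()
        // must search one long chain instead of an (ideally) short bucket.
        return map.get(new BadKey(probe));
    }

    public static void main(String[] args) {
        System.out.println(buildAndLookup(1000, 500)); // prints value500
    }
}
```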

So do you think it's better to just ignore the worst-case when speaking of
hash tables?

(I'm leaning towards yes myself, although Jon's point in another post about
supplying as much information about an algorithm as possible seems to apply
here as well)

--
Dave Sexton
Nov 12 '06 #38
Hi Bill,

<snip>
>Suppose I have some algorithm that performs as follows:
>
>T = A + B*N
>
>where
>T = total time
>A = (constant) Initialization
>B = (constant) The time to test one element in a collection
>N = (Variable) The number of elements in a collection
>
>The average time per element is then
>
>t = T/N = A/N + B
>
>As N gets large, the first term goes to zero since the cost is averaged
>across all elements in the collection. The net effect is that for large N
>you can simply ignore the A term altogether and simply assume that T = B*N.
Very nice proof, thank you.

Algebra sure does bring back memories ;)
>Here is an example with more factors
>
>T = A + B*N + C*N*N (Maybe an inefficient sorting algorithm)
>
>for large values of N the third term dominates the equation and therefore
>this algorithm is O(N^2).
>For small values of N it is quite likely that the N^2 term is negligible.
Interesting.
>Let's pick some real values to demonstrate.
>
>Say A=200, B=20, C=1
>
>N=4: The constant term dominates
>T = 200 + 80 + 16
>
>N=10: The constant term and Linear term dominate but the N^2 term is growing
>T = 200 + 200 + 100
>
>N = 20: The constant term is not a big factor anymore, but the N^2 term is
>now important
>T = 200 + 400 + 400
>
>N = 40: The N^2 term is starting to dominate
>T = 200 + 800 + 1600
>
>N = 100: No contest
>T = 200 + 2000 + 10000
>
>As you can see, the N^2 behavior is only important as the number of elements
>grows. For smaller collections, the lower order terms can dominate the total
>time. This is the reason why a linear search can actually be faster than a
>binary search for small collections. Once your collection reaches a critical
>mass however, it is no contest.
Great, thank you.

I believe you've illustrated what Peter has written about choosing the
dominant factor to be represented by the O notation and how each factor
changes unevenly as the number of elements increases.

Here, O(N^2) notates the worst case?

O(A) is the best case (only one element)?

I assume the average case is closer to O(N^2), but how can it be notated?

--
Dave Sexton
Nov 12 '06 #39
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...
>[...]
>Here, O(N^2) notates the worst case?
Not necessarily. It only represents worst-case if you are specifically
looking at the algorithm's behavior in the worst-case. Here, O(N^2) could
very well be the average case. Bill's example didn't actually specify the
algorithm, so there's no way to know in this particular example. But when
calculating order for an actual algorithm, you would have that information
and would be able to say whether the calculated order represented a best,
average, or worst case scenario.

Pete
Nov 12 '06 #40
Hi Peter,

>T = total time
>A = (constant) Initialization
>B = (constant) The time to test one element in a collection
>N = (Variable) The number of elements in a collection

What information is missing?

I thought the example:

T = A + B*N + C*N*N

was representative of the behavior of any algorithm that exhibits an average
time of:

t = A/N + B + C*N

It seems to me that for any algorithm with an average time of t, governed by
this formula, there must be some way to express that using the O notation;
However, O(A/N + B + C*N) just doesn't seem correct to me.

I don't understand why the average time may vary depending upon the algorithm
because as soon as another variable (or constant) is introduced into an
algorithm this formula would no longer apply.

Conceptually, it seems reasonable that t will be close to N^2, given an
infinite result set, but I lack in the math skills to prove it. I suspect
that "asymptotic" is the keyword here, again, but that's currently outside the
scope of my understanding. I should spend some time and RTFM :)

(Thanks for all the help everyone)

--
Dave Sexton

Nov 12 '06 #41

Dave Sexton wrote:
>Hi Brian,
>
>Actually, I thought more about what you said and it's clear to me now that
>you meant O(1) doesn't reveal the actual time it takes, it just shows that
>the time is constant, whatever value it may be.
>
>10 still sounds arbitrary to me, however :)
It comes from the MSDN documentation. But, yeah, it's very arbitrary.
I think if you ran your own tests you'd observe a break even point that
was close to 10, but probably not exactly 10.
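One way to see why the break-even point sits at such a small n is to count comparisons instead of clock time. The sketch below is Java, and both search routines are written out by hand purely for counting (an illustration, not the framework's actual implementation); for small collections the worst-case counts barely differ, which is why the linear scan's cheaper per-step work can win:

```java
public class BreakEvenDemo {
    // Comparisons a straight linear scan makes before finding target.
    static int linearComparisons(int[] sorted, int target) {
        int count = 0;
        for (int value : sorted) {
            count++;
            if (value == target) {
                break;
            }
        }
        return count;
    }

    // Comparisons a classic binary search makes on the same sorted array.
    static int binaryComparisons(int[] sorted, int target) {
        int lo = 0;
        int hi = sorted.length - 1;
        int count = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            count++;
            if (sorted[mid] == target) {
                break;
            } else if (sorted[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 10, 100, 1000}) {
            int[] data = new int[n];
            for (int i = 0; i < n; i++) {
                data[i] = i;
            }
            int target = n - 1; // worst case for the linear scan
            System.out.printf("n=%4d  linear=%4d  binary=%2d%n",
                    n, linearComparisons(data, target),
                    binaryComparisons(data, target));
        }
    }
}
```

At n=4 the counts are 4 vs 3; by n=1000 it is 1000 vs 10, so the constant factors stop mattering.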

Nov 12 '06 #42
"Dave Sexton" <dave@jwa[remove.this]online.com> wrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...
>Hi Peter,
>
>>T = total time
>>A = (constant) Initialization
>>B = (constant) The time to test one element in a collection
>>N = (Variable) The number of elements in a collection
>
>What information is missing?
I can't tell you unless you tell me what algorithm you're asking about.
Perhaps none. Perhaps all sorts of information. It depends.
>I thought the example:
>
>T = A + B*N + C*N*N
>
>was representative of the behavior of any algorithm that exhibits an
>average time of:
>
>t = A/N + B + C*N
It is, by definition. The latter equation is simply the former, divided by
the number of elements. It's by definition the mathematical average of the
time for a given element.
>It seems to me that for any algorithm with an average time of t, governed
>by this formula, there must be some way to express that using the O
>notation; However, O(A/N + B + C*N) just doesn't seem correct to me.
Good...because it's not. :)
>I don't understand why the average time may vary depending upon the
>algorithm because as soon as another variable (or constant) is introduced
>into an algorithm this formula would no longer apply.
The formula Bill gave is just an example. It applies only to a theoretical
example algorithm, and can't be used generally. He's just trying to show
how an algorithm with theoretically worse order can still be better for
small numbers of elements.

Pete
Nov 12 '06 #43
Hi Peter,

I understood what Bill was trying to show me - that part was clear.

My confusion was in the fact that I thought the formula WAS the algorithm
after I completed the article, but I think I got it now.

My own inattentive mistake. Thanks for clearing that up.

--
Dave Sexton
Nov 13 '06 #44
