P: n/a

I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the
quick method. My array size is huge.
what I am doing is
1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one
Are there are more efficient implementations avaiable?
Thanks  
Share this Question
P: n/a

Den 20060314 skrev Imran <ab***************@de.bosch.com>: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
[snip] Are there are more efficient implementations avaiable? Thanks
Check out: http://www.jjj.de/fxt/demo/sort/
For some good implementations.
//Peter

My REAL email address is:
vim c ":%s/^/Cr************@tznvy.pbz/:normal ggVGg?"
"Rainbows are pretty. I don't know why I shoot at them."  
P: n/a

On 20060314, Imran <ab***************@de.bosch.com> wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
what I am doing is
1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Are there are more efficient implementations avaiable? Thanks
If the range of the integers is limited to something "reasonable" like
65535 then you could consider creating an array indexed by the
integer in question and incrementing the count. Without writing pure C
code completely, it might approximate to something like:
while(not finished) begin
nextInt = inputIntegers[readIndex++];
countArray[nextInt]++;
end
Very fast. You could keep track of the most common number in the loop
or do a quick scan at the end.
The creation of the count array and the loop details are in your hands...  
P: n/a

On Tuesday 14 March 2006 08:50, Imran opined (in
<dv**********@ns2.fe.internet.bosch.com>): I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
what I am doing is
1. find out the maximum value N
I guess you mean "find the size of the array"? I'd expect that to be
known upfront.
2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Are there are more efficient implementations avaiable? Thanks
Let's make this slightly Cspecific:
You can use `qsort()` to sort the array. Once sorted, you need just one
pass to determine the most frequent number (and it's frequency),
without the need to keep track of more than one count at the time
(well, two if you include current maximum). What you've proposed above,
would have required a separate count for every unique number you
encounter.

BR, Vladimir
He who laughs, lasts.  
P: n/a

On 20060314, Imran <ab***************@de.bosch.com> wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
what I am doing is
1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Are there are more efficient implementations avaiable? Thanks
Reading between the lines a bit, you're making another vector of counts
1 .. N long. It might not need to be that long if not every number is
represented; so you could use a sort of hash table where the hashing
function is just n % n_buckets. This could save memory, but wouldn't
be any faster.
But if every time you update one of the counts you keep track of the
maximum and minimum count so far, i.e.:
...
counts[number]++;
if (counts[number] > *max) max = counts + number;
if (counts[number] < *min) min = counts + number;
...
Then as soon as the number of elements left in the original vector is
less than *max  *min, you can stop counting and break out of the loop,
because none of them can get bigger than the max you've found.
Would this be worth it? Well, statistically, I would have thought quite
possibly, but you'd have to just try it and do some tests.  
P: n/a

On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Imran <ab***************@de.bosch.com> wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
what I am doing is
1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Are there are more efficient implementations avaiable? Thanks Reading between the lines a bit, you're making another vector of counts 1 .. N long. It might not need to be that long if not every number is represented; so you could use a sort of hash table where the hashing function is just n % n_buckets. This could save memory, but wouldn't be any faster.
But if every time you update one of the counts you keep track of the maximum and minimum count so far, i.e.:
...
counts[number]++; if (counts[number] > *max) max = counts + number; if (counts[number] < *min) min = counts + number;
...
Then as soon as the number of elements left in the original vector is less than *max  *min, you can stop counting and break out of the loop, because none of them can get bigger than the max you've found.
The break condition is a nice optimization alright.
A little simpler might be (no need to keep pointers to maximum and
minum integer locations and have overhead of pointer addition).
count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < maxsecondHighest)
break; Would this be worth it? Well, statistically, I would have thought quite possibly, but you'd have to just try it and do some tests.  
P: n/a

On 20060314, Richard G. Riley <rg****@gmail.com> wrote: A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition).
count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < maxsecondHighest) break;
Also what I posted was wrong: max  secondHighest is correct. max  min
is not.
The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.
It means then you may need to get back from max to the corresponding
number, since you need to return the number not the count. That was why
I had max as a pointer. There are other ways of course, like keep the
corresponding number in a variable as you go along etc...  
P: n/a

On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition).
count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < maxsecondHighest) /**** break;
Also what I posted was wrong: max  secondHighest is correct. max  min is not.
The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough.
I thought I did that? /**** above  
P: n/a

On 20060314, Richard G. Riley <rg****@gmail.com> wrote: On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition).
count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < maxsecondHighest) /**** break;
Also what I posted was wrong: max  secondHighest is correct. max  min is not.
The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough.
I thought I did that? /**** above
You do break out in the right place, the question is, having broken out
what do you have? The frequency of the highestfrequency number (but not
the highestfrequency number itself because that's not necessarily
"number", which was all I was saying). How to get from the max count
back to the corresponding number is the question. That was why I was
using pointers. *max is the highest count, max  counts would then be
the highestfrequency number, assuming the counts array is just
one entry for each possible value of number.
But never mind, this is just details, and not really as complicated as
I'm making it sound, and there are plenty of other ways of doing it that
are just as good or better. It would be pretty obvious what to do when
one actually implemented it I think.  
P: n/a

On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote: On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition).
count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < maxsecondHighest) /**** break;
Also what I posted was wrong: max  secondHighest is correct. max  min is not.
The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. I thought I did that? /**** above
You do break out in the right place, the question is, having broken out what do you have? The frequency of the highestfrequency number (but not the highestfrequency number itself because that's not necessarily
"number", which was all I was saying). How to get from the max count back to the corresponding number is the question. That was why I was using pointers. *max is the highest count, max  counts would then be the highestfrequency number, assuming the counts array is just one entry for each possible value of number.
But never mind, this is just details, and not really as complicated as I'm making it sound, and there are plenty of other ways of doing it that are just as good or better. It would be pretty obvious what to do when one actually implemented it I think.
You're absolutely right!
a one liner after the if statement makes it all good
mostCommonNumber=number;  
P: n/a

"Richard G. Riley" <rg****@gmail.com> wrote in message
news:cp************@quark.hadron... On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote: On 20060314, Ben C <sp******@spam.eggs> wrote: On 20060314, Richard G. Riley <rg****@gmail.com> wrote:
> A little simpler might be (no need to keep pointers to maximum and > minum integer locations and have overhead of pointer addition). > > count = ++counts[number]; > if (count>=max){ > secondHighest=max; > max=count; > } > //numLeft is number left to process in vector. > if(numLeft < maxsecondHighest) /**** > break;
Also what I posted was wrong: max  secondHighest is correct. max  min is not.
The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough.
I thought I did that? /**** above
You do break out in the right place, the question is, having broken out what do you have? The frequency of the highestfrequency number (but not the highestfrequency number itself because that's not necessarily
"number", which was all I was saying). How to get from the max count back to the corresponding number is the question. That was why I was using pointers. *max is the highest count, max  counts would then be the highestfrequency number, assuming the counts array is just one entry for each possible value of number.
But never mind, this is just details, and not really as complicated as I'm making it sound, and there are plenty of other ways of doing it that are just as good or better. It would be pretty obvious what to do when one actually implemented it I think.
You're absolutely right!
a one liner after the if statement makes it all good
mostCommonNumber=number;
The OP also has to consider that the distribution of values might be
multimodal:
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
The above set has 6 equally populated modes.

Fred L. Kleinschmidt
Boeing Associate Technical Fellow
Technical Architect, Software Reuse Project  
P: n/a

"Imran" <ab***************@de.bosch.com> wrote in message
news:dv**********@ns2.fe.internet.bosch.com... I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
Forget about sorting the data, the time it takes to sort will be much more
than the time it takes to count it.
Try to only loop through the data once.
Factors which heavily affect how this could be implemented are:
1) the range of the integers
2) the data source of the vector
3) the size of the vector
If the range of the integers is small, say zero to 127 (ASCII) for a text
analysis of William Shakespaere, the implementation is simple:
1) create a counter array of 127 unsigned long or unsigned long long's
2) increment from the first to last element of the vector
3) increment the appropriate counter
If the range is very large, you may want to create multiple arrays to help
you keep track of the data to reduce memory usage:
1) one array of bits where each bit is a seen or notseen indicator of
values in the vector
2) another array or binary tree which keeps track of the count for each of
the seen integers
3) you may need to normalize the data first, by say mapping one set of
numbers to another.
This could allow you to reduce the range of integers used:
1,6,30,32,60,200 to 1,2,3,4,5,6
If the data source of the vector is say from some function, forget about
generating the vector. Just feed the data into your tabulation method.
If the "huge" vector is truly huge, it becomes an issue of finding a method
to balance the memory available for tabulation and the speed of execution.
Rod Pemberton  
P: n/a

Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
There two common "types" of complexity (of an algorithm):
 expected complexity
 worstcase complexity
For calculating expected complexity you need to know something about the
probabilities of different vectors.
Moreover, good expected complexity is not good enough for some applications
(e.g. realtime)
This leaves you with worstcase complexity.
what I am doing is
1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Viewing the worstcase complexity none of the approaches are better than
yours.
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).
Faster sorting algorithms usually work better with linked lists then with
arrays, so you would need some extra memory ;)
My suggestion would be to use a (heightbalanced) binarysearchtree.
At the nodes you keep pairs of (value,frequency). For each array element
search the value in the tree. If it exists increment frequency; else add a
new node (value,1). After you are done with the array look for the node
with highest frequency.
This algorithm is especially fast, if you don't have many different values.
There should be BST implementations available in C.

If your integer type is small compared to the size of the array, sorting the
array with bucket sort could be even faster than using a BST.
Have fun
/JanHinnerk  
P: n/a

JanHinnerk Dumjahn <hi***@despammed.com> writes:
[...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has
worstcase complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;)
I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

Keith Thompson wrote: JanHinnerk Dumjahn <hi***@despammed.com> writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has worstcase complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort.
I have taken a quick look on google but wasn't been able to find something
on Quicksort with O(N log n). However, I came across heapsort which seems
to fit quite nicely. (Wish we had done more fun stuff like that at
university...)
qsort() need not have a poor implementation, but it could ;) Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;)
I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array.
Complexity should be same for both (if you don't need random access). From a
theoretical approach I still believe that e.g. bucketsort would be slower
using arrays. However, if I consider real world effects like processor
cache using arrays could be a lot faster ;(  
P: n/a

JanHinnerk Dumjahn <hi***@despammed.com> writes: Keith Thompson wrote: JanHinnerk Dumjahn <hi***@despammed.com> writes:
[...] Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;)
I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array.
Complexity should be same for both (if you don't need random access). From a theoretical approach I still believe that e.g. bucketsort would be slower using arrays. However, if I consider real world effects like processor cache using arrays could be a lot faster ;(
Most decent (O(n log n)) sorting algorithms do require random access.

Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.  
P: n/a

On 20060314, Keith Thompson <ks***@mib.org> wrote: JanHinnerk Dumjahn <hi***@despammed.com> writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has worstcase complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;)
I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array.
I'd argue that the fastest way to sort a linked list is to keep it
sorted in the first place.  
P: n/a

Jordan Abel wrote: On 20060314, Keith Thompson <ks***@mib.org> wrote: JanHinnerk Dumjahn <hi***@despammed.com> writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has worstcase complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;)
I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array.
I'd argue that the fastest way to sort a linked list is to keep it sorted in the first place.
I find linked lists to be especially well suited to mergesort.
struct list_node {
struct list_node *next;
void *data;
};
typedef struct list_node list_type;
list_type *list_sort(list_type *head,
int (*compar)(const list_type *, const list_type *));
list_type *list_merge(list_type *head, list_type *tail,
int (*compar)(const list_type *, const list_type *));
static long unsigned node_count(list_type *head);
static list_type *list_split(list_type *head, long unsigned count);
static list_type *node_sort (list_type *head, long unsigned count,
int (*compar)(const list_type *, const list_type *));
list_type *list_sort(list_type *head,
int (*compar)(const list_type *, const list_type *))
{
return node_sort(head, node_count(head), compar);
}
static long unsigned node_count(list_type *head)
{
long unsigned count;
for (count = 0; head != NULL; head = head > next) {
++count;
}
return count;
}
static list_type *node_sort(list_type *head, long unsigned count,
int (*compar)(const list_type *, const list_type *))
{
long unsigned half;
list_type *tail;
if (count > 1) {
half = count / 2;
tail = list_split(head, half);
tail = node_sort(tail, count  half, compar);
head = node_sort(head, half, compar);
head = list_merge(head, tail, compar);
}
return head;
}
static list_type *list_split(list_type *head, long unsigned count)
{
list_type *tail;
while (count != 0) {
head = head > next;
}
tail = head > next;
head > next = NULL;
return tail;
}
list_type *list_merge(list_type *head, list_type *tail,
int (*compar)(const list_type *, const list_type *))
{
list_type *list, *sorted, **node;
node = compar(head, tail) > 0 ? &tail : &head;
sorted = list = *node;
*node = sorted > next;
while (*node != NULL) {
node = compar(head, tail) > 0 ? &tail : &head;
sorted > next = *node;
sorted = *node;
*node = sorted > next;
}
sorted > next = head != NULL ? head : tail;
return list;
}

pete  
P: n/a

JanHinnerk Dumjahn wrote: I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Who said qsort() _must_ implement Quicksort?  
P: n/a

Keith Thompson <ks***@mib.org> wrote: JanHinnerk Dumjahn <hi***@despammed.com> writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has worstcase complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort.
In fact, there's no requirement for it to be implemented as any kind of
quicksort at all, though for obvious reasons it often is.
Richard  
P: n/a

Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge.
what I am doing is
1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one
Are there are more efficient implementations avaiable? Thanks
Try this. Note that if there are multiple integers with the same
number of occurrances, this code returns the highest valued one.
#include <stdio.h>
#include <stdlib.h>
int cmp(const void *p, const void *q)
{
return *((int *) p)  *((int *) q);
}
int sort_and_get_most_popular(int *arr, size_t nelem)
{
int ret = *arr;
size_t i, count = 1, best = 1;
qsort(arr, nelem, sizeof *arr, cmp);
for(i = 1; i < nelem; i++) {
if(arr[i] == arr[i  1])
count++;
else {
if (count > best) {
ret = arr[i  1];
best = count;
}
count = 1;
}
}
return ret;
}
int main(void)
{
int arr[] = { 1, 3, 6, 7, 6, 3, 3, 4, 9, 10 };
int result = sort_and_get_most_popular(arr,
sizeof arr / sizeof *arr);
printf("result: %d\n", result);
return 0;
}
[mark@icepick]$ gcc Wall ansi pedantic O2 foo.c o foo
[mark@icepick]$ ./foo
result: 3
Give it a good review, though It's late, I'm tired, and I whipped it
up very quickly.
Mark F. Haigh mf*****@sbcglobal.net   This discussion thread is closed Replies have been disabled for this discussion.   Question stats  viewed: 7153
 replies: 21
 date asked: Mar 14 '06
