
# Best method to find the most frequent number

I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks
Mar 14 '06 #1
On 2006-03-14, Imran <ab***************@de.bosch.com> wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]
and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge. [snip] Are there more efficient implementations available?
Thanks

Check out:
http://www.jjj.de/fxt/demo/sort/
For some good implementations.

//Peter
--
vim -c ":%s/^/Cr************@tznvy.pbz/|:normal ggVGg?"
"Rainbows are pretty. I don't know why I shoot at them."
Mar 14 '06 #2
On 2006-03-14, Imran <ab***************@de.bosch.com> wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks

If the range of the integers is limited to something "reasonable" like
65535 then you could consider creating an array indexed by the
integer in question and incrementing the count. Without writing pure C
code completely, it might approximate to something like:

while(not finished) begin
countArray[nextInt]++;
end

Very fast. You could keep track of the most common number in the loop
or do a quick scan at the end.

The creation of the count array and the loop details are in your hands...
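For what it's worth, here is a minimal sketch of the count-array idea in C. It assumes every value lies in 0..65535 (the "reasonable" range mentioned above); the function name `most_frequent_small_range` and the `MAX_VALUE` constant are made up for illustration, and it tracks the most common number inside the loop rather than scanning at the end.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_VALUE 65535u   /* assumed upper bound on the input values */

/* Returns the most frequent value in v[0..n-1], or -1 if n == 0, a value
 * is out of range, or the count array cannot be allocated.
 * On a tie, the value that reaches the winning count first is returned. */
long most_frequent_small_range(const unsigned *v, size_t n)
{
    size_t *counts;
    size_t i, best_count = 0;
    long best_value = -1;

    /* one zero-initialised counter per possible value */
    counts = calloc((size_t)MAX_VALUE + 1, sizeof *counts);
    if (counts == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        if (v[i] > MAX_VALUE) {
            free(counts);
            return -1;
        }
        /* bump the counter and remember the leader as we go */
        if (++counts[v[i]] > best_count) {
            best_count = counts[v[i]];
            best_value = (long)v[i];
        }
    }
    free(counts);
    return best_value;
}
```

This is a single pass over the data plus a fixed-size table, so the cost depends on the length of the vector and not on how the values are distributed.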
Mar 14 '06 #3
On Tuesday 14 March 2006 08:50, Imran opined (in
<dv**********@ns2.fe.internet.bosch.com>):
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is
the quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
I guess you mean "find the size of the array"? I'd expect that to be
known upfront.
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks

Let's make this slightly C-specific:

You can use `qsort()` to sort the array. Once sorted, you need just one
pass to determine the most frequent number (and its frequency),
without the need to keep track of more than one count at a time
(well, two if you include current maximum). What you've proposed above,
would have required a separate count for every unique number you
encounter.

--

He who laughs, lasts.

Mar 14 '06 #4
On 2006-03-14, Imran <ab***************@de.bosch.com> wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks

Reading between the lines a bit, you're making another vector of counts
1 .. N long. It might not need to be that long if not every number is
represented; so you could use a sort of hash table where the hashing
function is just n % n_buckets. This could save memory, but wouldn't
be any faster.

But if every time you update one of the counts you keep track of the
maximum and minimum count so far, i.e.:

...

counts[number]++;
if (counts[number] > *max) max = counts + number;
if (counts[number] < *min) min = counts + number;

...

Then as soon as the number of elements left in the original vector is
less than *max - *min, you can stop counting and break out of the loop,
because none of them can get bigger than the max you've found.

Would this be worth it? Well, statistically, I would have thought quite
possibly, but you'd have to just try it and do some tests.
Mar 14 '06 #5
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Imran <ab***************@de.bosch.com> wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks
Reading between the lines a bit, you're making another vector of counts
1 .. N long. It might not need to be that long if not every number is
represented; so you could use a sort of hash table where the hashing
function is just n % n_buckets. This could save memory, but wouldn't
be any faster.

But if every time you update one of the counts you keep track of the
maximum and minimum count so far, i.e.:

...

counts[number]++;
if (counts[number] > *max) max = counts + number;
if (counts[number] < *min) min = counts + number;

...

Then as soon as the number of elements left in the original vector is
less than *max - *min, you can stop counting and break out of the loop,
because none of them can get bigger than the max you've found.

The break condition is a nice optimization alright.

A little simpler might be (no need to keep pointers to maximum and
minimum):

count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < max-secondHighest)
break;

Would this be worth it? Well, statistically, I would have thought quite
possibly, but you'd have to just try it and do some tests.

Mar 14 '06 #6
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and
minimum):

count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < max-secondHighest)
break;

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

It means then you may need to get back from max to the corresponding
number, since you need to return the number not the count. That was why
I had max as a pointer. There are other ways of course, like keep the
corresponding number in a variable as you go along etc...
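To make that concrete, here is one possible sketch of the early exit that keeps the corresponding number in a variable. The names (`mode_with_early_exit`, `leader`, `runner_up`) are hypothetical, and it assumes every value satisfies 0 <= v[i] < range so that a plain count array works. Here `runner_up` is the best count among values other than the current leader, which is what the break test needs.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the most frequent value in v[0..n-1],
 * assuming 0 <= v[i] < range. Returns -1 if n == 0, a value is out of
 * range, or allocation fails. If scanned is non-NULL, *scanned receives
 * the number of elements actually read before stopping. */
int mode_with_early_exit(const int *v, size_t n, int range, size_t *scanned)
{
    size_t *counts;
    size_t i, lead_count = 0, runner_up = 0;
    int leader = -1;

    if (range <= 0)
        return -1;
    counts = calloc((size_t)range, sizeof *counts);
    if (counts == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        int x = v[i];
        size_t c;

        if (x < 0 || x >= range) {
            free(counts);
            return -1;
        }
        c = ++counts[x];
        if (x == leader) {
            lead_count = c;              /* leader extends its lead */
        } else if (c > lead_count) {
            runner_up = lead_count;      /* old leader becomes runner-up */
            leader = x;
            lead_count = c;
        } else if (c > runner_up) {
            runner_up = c;               /* new best among the others */
        }
        /* n - i - 1 elements remain; if even all of them went to the
         * runner-up it could not reach the leader, so stop here. */
        if (n - i - 1 < lead_count - runner_up) {
            i++;
            break;
        }
    }
    if (scanned != NULL)
        *scanned = i;
    free(counts);
    return leader;
}
```

For the input 3 3 3 3 3 3 1 2 1 2 this stops after reading six elements, because the four that remain cannot lift any other value to a count of six. Whether the extra bookkeeping pays off on real data is something you'd have to measure.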
Mar 14 '06 #7
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and
minimum):

count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < max-secondHighest) /****
break;

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

I thought I did that? /**** above
Mar 14 '06 #8
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
A little simpler might be (no need to keep pointers to maximum and
minimum):

count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < max-secondHighest) /****
break;

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

I thought I did that? /**** above

You do break out in the right place, the question is, having broken out
what do you have? The frequency of the highest-frequency number (but not
the highest-frequency number itself-- because that's not necessarily
"number", which was all I was saying). How to get from the max count
back to the corresponding number is the question. That was why I was
using pointers. *max is the highest count, max - counts would then be
the highest-frequency number, assuming the counts array is just
one entry for each possible value of number.

But never mind, this is just details, and not really as complicated as
I'm making it sound, and there are plenty of other ways of doing it that
are just as good or better. It would be pretty obvious what to do when
one actually implemented it I think.
Mar 14 '06 #9
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:

A little simpler might be (no need to keep pointers to maximum and
minimum):

count = ++counts[number];
if (count>=max){
secondHighest=max;
max=count;
}
//numLeft is number left to process in vector.
if(numLeft < max-secondHighest) /****
break;

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.
I thought I did that? /**** above

You do break out in the right place, the question is, having broken out
what do you have? The frequency of the highest-frequency number (but not
the highest-frequency number itself-- because that's not necessarily

"number", which was all I was saying). How to get from the max count
back to the corresponding number is the question. That was why I was
using pointers. *max is the highest count, max - counts would then be
the highest-frequency number, assuming the counts array is just
one entry for each possible value of number.

But never mind, this is just details, and not really as complicated as
I'm making it sound, and there are plenty of other ways of doing it that
are just as good or better. It would be pretty obvious what to do when
one actually implemented it I think.

You're absolutely right!

a one-liner inside the if block makes it all good

mostCommonNumber=number;

Mar 14 '06 #10

"Richard G. Riley" <rg****@gmail.com> wrote in message
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:
On 2006-03-14, Ben C <sp******@spam.eggs> wrote:
On 2006-03-14, Richard G. Riley <rg****@gmail.com> wrote:

> A little simpler might be (no need to keep pointers to maximum and
> minimum):
>
> count = ++counts[number];
> if (count>=max){
> secondHighest=max;
> max=count;
> }
> //numLeft is number left to process in vector.
> if(numLeft < max-secondHighest) /****
> break;

Also what I posted was wrong: max - secondHighest is correct. max - min
is not.

The point at which you break out is not necessarily the point at which
you discover the "winning" max count though; you could have discovered
max and secondHighest on previous iterations, and break out because
numLeft has got small enough.

I thought I did that? /**** above

You do break out in the right place, the question is, having broken out
what do you have? The frequency of the highest-frequency number (but not
the highest-frequency number itself-- because that's not necessarily

"number", which was all I was saying). How to get from the max count
back to the corresponding number is the question. That was why I was
using pointers. *max is the highest count, max - counts would then be
the highest-frequency number, assuming the counts array is just
one entry for each possible value of number.

But never mind, this is just details, and not really as complicated as
I'm making it sound, and there are plenty of other ways of doing it that
are just as good or better. It would be pretty obvious what to do when
one actually implemented it I think.

You're absolutely right!

a one-liner inside the if block makes it all good

mostCommonNumber=number;

The OP also has to consider that the distribution of values might be
multi-modal:
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
The above set has 6 equally populated modes.
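If all the modes are wanted rather than just one, a sort-then-scan pass can collect them. The following is only a sketch under that assumption; `find_all_modes` and `cmp_int` are names invented here, and the caller is assumed to supply a `modes` buffer with room for n ints.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort(); avoids the overflow of a - b. */
static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Sorts arr in place, stores every most-frequent value in modes[] in
 * ascending order, and returns how many modes were found (0 if n == 0). */
size_t find_all_modes(int *arr, size_t n, int *modes)
{
    size_t i, run = 0, best = 0, nmodes = 0;

    qsort(arr, n, sizeof *arr, cmp_int);
    for (i = 0; i < n; i++) {
        /* length of the current run of equal values */
        run = (i > 0 && arr[i] == arr[i - 1]) ? run + 1 : 1;

        /* a run ends at the last element or where the next value differs */
        if (i + 1 == n || arr[i + 1] != arr[i]) {
            if (run > best) {
                best = run;
                nmodes = 0;      /* shorter runs seen so far are discarded */
            }
            if (run == best)
                modes[nmodes++] = arr[i];
        }
    }
    return nmodes;
}
```

On the set above this reports six modes, 1 through 6; on the OP's example it reports the single mode 3.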
--
Fred L. Kleinschmidt
Boeing Associate Technical Fellow
Technical Architect, Software Reuse Project

Mar 14 '06 #11

"Imran" <ab***************@de.bosch.com> wrote in message
news:dv**********@ns2.fe.internet.bosch.com...
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

Forget about sorting the data, the time it takes to sort will be much more
than the time it takes to count it.

Try to only loop through the data once.

Factors which heavily affect how this could be implemented are:
1) the range of the integers
2) the data source of the vector
3) the size of the vector

If the range of the integers is small, say zero to 127 (ASCII) for a text
analysis of William Shakespeare, the implementation is simple:
1) create a counter array of 128 unsigned long or unsigned long long's
2) iterate from the first to the last element of the vector
3) increment the appropriate counter

If the range is very large, you may want to create multiple arrays to help
you keep track of the data to reduce memory usage:
1) one array of bits where each bit is a seen or not-seen indicator of
values in the vector
2) another array or binary tree which keeps track of the count for each of
the seen integers
3) you may need to normalize the data first, by say mapping one set of
numbers to another.
This could allow you to reduce the range of integers used:
1,6,30,32,60,200 to 1,2,3,4,5,6

If the data source of the vector is say from some function, forget about
generating the vector. Just feed the data into your tabulation method.

If the "huge" vector is truly huge, it becomes an issue of finding a method
to balance the memory available for tabulation and the speed of execution.
Rod Pemberton
Mar 14 '06 #12
Imran wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.
There are two common "types" of complexity (of an algorithm):
- expected complexity
- worst-case complexity

For calculating expected complexity you need to know something about the
probabilities of different vectors.
Moreover, good expected complexity is not good enough for some applications
(e.g. real-time)

This leaves you with worst-case complexity.
what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Viewing the worst-case complexity none of the approaches are better than
yours.

I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

My suggestion would be to use a (height-balanced) binary-search-tree.
At the nodes you keep pairs of (value,frequency). For each array element
search the value in the tree. If it exists increment frequency; else add a
new node (value,1). After you are done with the array look for the node
with highest frequency.

This algorithm is especially fast, if you don't have many different values.

There should be BST implementations available in C.
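As a rough illustration of the (value, frequency) tree, here is a sketch in C. Note the assumption: to stay short it uses a plain, unbalanced binary search tree, so it does not have the worst-case guarantee of a height-balanced one (sorted input degenerates it into a list, and the recursive helpers then recurse that deep). All names (`freq_node`, `freq_add`, `mode_via_bst`, ...) are made up for the example.

```c
#include <stddef.h>
#include <stdlib.h>

struct freq_node {
    int value;
    size_t freq;
    struct freq_node *left, *right;
};

/* Increments the count for value, adding a node (value, 1) if it is new.
 * Returns 0 on success, -1 if malloc() fails. */
static int freq_add(struct freq_node **root, int value)
{
    while (*root != NULL) {
        if (value == (*root)->value) {
            (*root)->freq++;
            return 0;
        }
        root = value < (*root)->value ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL)
        return -1;
    (*root)->value = value;
    (*root)->freq = 1;
    (*root)->left = (*root)->right = NULL;
    return 0;
}

/* Walks the whole tree, remembering the node with the highest frequency. */
static void freq_best(const struct freq_node *n, const struct freq_node **best)
{
    if (n == NULL)
        return;
    if (*best == NULL || n->freq > (*best)->freq)
        *best = n;
    freq_best(n->left, best);
    freq_best(n->right, best);
}

static void freq_free(struct freq_node *n)
{
    if (n == NULL)
        return;
    freq_free(n->left);
    freq_free(n->right);
    free(n);
}

/* Stores the most frequent value of v[0..n-1] in *mode and returns 0;
 * returns -1 if n == 0 or memory runs out. */
int mode_via_bst(const int *v, size_t n, int *mode)
{
    struct freq_node *root = NULL;
    const struct freq_node *best = NULL;
    size_t i;
    int rc = -1;

    for (i = 0; i < n; i++)
        if (freq_add(&root, v[i]) != 0)
            goto done;
    freq_best(root, &best);
    if (best != NULL) {
        *mode = best->value;
        rc = 0;
    }
done:
    freq_free(root);
    return rc;
}
```

Memory use is proportional to the number of distinct values, not to their range, which is the main attraction over a plain count array.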

--------

If your integer type is small compared to the size of the array, sorting the
array with bucket sort could be even faster than using a BST.

Have fun
/Jan-Hinnerk

Mar 14 '06 #13
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
[...]
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).
Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 14 '06 #14
Keith Thompson wrote:
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
[...]
I guess the best approach is some kind of sorting and then scanning
through once, like Vladimir suggested. However qsort() is not the best
choice since Quicksort has complexity O(n^2).

Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.

I took a quick look on Google but wasn't able to find anything
on Quicksort with O(n log n). However, I came across heapsort, which seems
to fit quite nicely. (Wish we had done more fun stuff like that at
university...)

qsort() need not have a poor implementation, but it could ;-)
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

Complexity should be the same for both (if you don't need random access). From a
theoretical standpoint I still believe that e.g. bucketsort would be slower
using arrays. However, if I consider real-world effects like processor
cache, using arrays could be a lot faster ;-(

Mar 14 '06 #15
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
Keith Thompson wrote:
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes: [...]
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

Complexity should be the same for both (if you don't need random access). From a
theoretical standpoint I still believe that e.g. bucketsort would be slower
using arrays. However, if I consider real-world effects like processor
cache, using arrays could be a lot faster ;-(

Most decent (O(n log n)) sorting algorithms do require random access.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Mar 14 '06 #16
On 2006-03-14, Keith Thompson <ks***@mib.org> wrote:
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
[...]
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

I'd argue that the fastest way to sort a linked list is to keep it
sorted in the first place.
Mar 14 '06 #17
Jordan Abel wrote:

On 2006-03-14, Keith Thompson <ks***@mib.org> wrote:
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
[...]
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.
Faster sorting algorithms usually work better with linked lists than with
arrays, so you would need some extra memory ;-)

I don't think that's true. The fastest way to sort a linked list is
usually to copy it to an array and sort the array.

I'd argue that the fastest way to sort a linked list is to keep it
sorted in the first place.

I find linked lists to be especially well suited to mergesort.
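#include <stddef.h>

struct list_node {
    struct list_node *next;
    void *data;
};

typedef struct list_node list_type;

/*
 * Only the names list_split and node_sort survive in the post as
 * archived; list_sort, node_count and list_merge are assumed names for
 * the three functions whose headers were lost.
 */
list_type *list_sort(list_type *head,
    int (*compar)(const list_type *, const list_type *));
static long unsigned node_count(list_type *head);
static list_type *list_split(list_type *head, long unsigned count);
static list_type *node_sort (list_type *head, long unsigned count,
    int (*compar)(const list_type *, const list_type *));
static list_type *list_merge(list_type *head, list_type *tail,
    int (*compar)(const list_type *, const list_type *));

/* Sorts the list and returns its new head. */
list_type *list_sort(list_type *head,
    int (*compar)(const list_type *, const list_type *))
{
    return node_sort(head, node_count(head), compar);
}

static long unsigned node_count(list_type *head)
{
    long unsigned count;

    for (count = 0; head != NULL; head = head -> next) {
        ++count;
    }
    return count;
}

static list_type *node_sort(list_type *head, long unsigned count,
    int (*compar)(const list_type *, const list_type *))
{
    long unsigned half;
    list_type *tail;

    if (count > 1) {
        half = count / 2;
        tail = list_split(head, half);
        tail = node_sort(tail, count - half, compar);
        head = node_sort(head, half, compar);
        head = list_merge(head, tail, compar);
    }
    return head;
}

/* Cuts the list after its first count nodes; returns the remainder. */
static list_type *list_split(list_type *head, long unsigned count)
{
    list_type *tail;

    while (--count != 0) {
        head = head -> next;
    }
    tail = head -> next;
    head -> next = NULL;
    return tail;
}

/* Merges two non-empty sorted lists; ties are taken from head first. */
static list_type *list_merge(list_type *head, list_type *tail,
    int (*compar)(const list_type *, const list_type *))
{
    list_type *list, *sorted, **node;

    node = compar(head, tail) > 0 ? &tail : &head;
    sorted = list = *node;
    *node = sorted -> next;
    while (*node != NULL) {
        node = compar(head, tail) > 0 ? &tail : &head;
        sorted -> next = *node;
        sorted = *node;
        *node = sorted -> next;
    }
    /* one list is exhausted; append whatever is left of the other */
    sorted -> next = head != NULL ? head : tail;
    return list;
}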

--
pete
Mar 14 '06 #18
Jan-Hinnerk Dumjahn wrote:
I have taken a quick look on google but wasn't able to find something
on Quicksort with O(N log n). However, I came across heapsort which
seems to fit quite nicely.

Quicksort has O(n log(n)) _average_ complexity.

http://en.wikipedia.org/wiki/Sorting_algorithm
http://en.wikipedia.org/wiki/Quicksort
Mar 15 '06 #19
Jan-Hinnerk Dumjahn wrote:
I guess the best approach is some kind of sorting and then scanning
through once, like Vladimir suggested. However qsort() is not the
best choice since Quicksort has complexity O(n^2).

Who said qsort() _must_ implement Quicksort?
Mar 15 '06 #20
Keith Thompson <ks***@mib.org> wrote:
Jan-Hinnerk Dumjahn <hi***@despammed.com> writes:
[...]
I guess the best approach is some kind of sorting and then scanning through
once, like Vladimir suggested. However qsort() is not the best choice since
Quicksort has complexity O(n^2).

Quicksort has complexity O(n log n). I think pure Quicksort has
worst-case complexity O(n^2), but there's no requirement for qsort()
to be implemented as pure Quicksort.

In fact, there's no requirement for it to be implemented as any kind of
quicksort at all, though for obvious reasons it often is.

Richard
Mar 15 '06 #21
Imran wrote:
I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ]

and I want to find the number that occurs most frequently. What is the
quickest method? My array is huge.

what I am doing is

1. find out the maximum value N
2. loop through 1...N
3. count # times each occurred
4. output the most frequent one

Are there more efficient implementations available?
Thanks

Try this. Note that if there are multiple integers with the same
number of occurrences, this code returns the highest valued one.

#include <stdio.h>
#include <stdlib.h>

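int cmp(const void *p, const void *q)
{
    int a = *(const int *) p, b = *(const int *) q;

    /* compare rather than subtract: a - b can overflow */
    return (a > b) - (a < b);
}

int sort_and_get_most_popular(int *arr, size_t nelem)
{
    int ret = *arr;    /* assumes nelem >= 1 */
    size_t i, count = 1, best = 1;

    qsort(arr, nelem, sizeof *arr, cmp);
    for(i = 1; i < nelem; i++) {
        if(arr[i] == arr[i - 1])
            count++;
        else {
            /* >= so that, on a tie, the higher value wins */
            if (count >= best) {
                ret = arr[i - 1];
                best = count;
            }
            count = 1;
        }
    }
    /* the last run ends at the end of the array, so check it here */
    if (count >= best)
        ret = arr[nelem - 1];
    return ret;
}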

int main(void)
{
int arr[] = { 1, 3, 6, 7, 6, 3, 3, 4, 9, 10 };
int result = sort_and_get_most_popular(arr,
sizeof arr / sizeof *arr);

printf("result: %d\n", result);
return 0;
}

[mark@icepick]$ gcc -Wall -ansi -pedantic -O2 foo.c -o foo
[mark@icepick]$ ./foo
result: 3
Give it a good review, though-- It's late, I'm tired, and I whipped it
up very quickly.

Mark F. Haigh
mf*****@sbcglobal.net

Mar 15 '06 #22
