
# best method to find the frequent numbers

 P: n/a I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10], and I want to find the number that occurs most frequently. What is the quickest method? My array size is huge. What I am doing is: 1. find the maximum value N 2. loop through 1...N 3. count the number of times each value occurs 4. output the most frequent one. Are there more efficient implementations available? Thanks Mar 14 '06 #1
21 Replies

 P: n/a Den 2006-03-14 skrev Imran : I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. [snip] Are there are more efficient implementations avaiable? Thanks Check out: http://www.jjj.de/fxt/demo/sort/ For some good implementations. //Peter -- My REAL email address is: vim -c ":%s/^/Cr************@tznvy.pbz/|:normal ggVGg?" "Rainbows are pretty. I don't know why I shoot at them." Mar 14 '06 #2

 P: n/a On 2006-03-14, Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. what I am doing is 1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one Are there are more efficient implementations avaiable? Thanks If the range of the integers is limited to something "reasonable" like 65535 then you could consider creating an array indexed by the integer in question and incrementing the count. Without writing pure C code completely, it might approximate to something like: while(not finished) begin nextInt = inputIntegers[readIndex++]; countArray[nextInt]++; end Very fast. You could keep track of the most common number in the loop or do a quick scan at the end. The creation of the count array and the loop details are in your hands... Mar 14 '06 #3
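The count-array idea in the reply above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the function name `most_frequent_bounded` and the `MAX_VALUE` bound are made up for the example, and the input is assumed to hold non-negative values no larger than 65535. The most common number is tracked inside the loop, so no second scan is needed.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_VALUE 65535u  /* assumed upper bound on the input values */

/* Returns the most frequent value in data[0..n-1], or -1 on error
   (empty input, out-of-range value, or allocation failure).
   On a tie, the value that reached the top count first wins. */
long most_frequent_bounded(const unsigned *data, size_t n)
{
    size_t *counts;
    size_t best_count = 0;
    long best_value = -1;
    size_t i;

    if (n == 0)
        return -1;
    counts = calloc((size_t)MAX_VALUE + 1, sizeof *counts);
    if (counts == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        if (data[i] > MAX_VALUE) {      /* value outside the assumed range */
            free(counts);
            return -1;
        }
        /* Bump the counter and remember the leader as we go. */
        if (++counts[data[i]] > best_count) {
            best_count = counts[data[i]];
            best_value = (long)data[i];
        }
    }
    free(counts);
    return best_value;
}
```

The cost is one pass over the data plus one counter per possible value, which is why this only pays off when the range is "reasonable".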

 P: n/a On Tuesday 14 March 2006 08:50, Imran opined: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. what I am doing is 1. find out the maximum value N I guess you mean "find the size of the array"? I'd expect that to be known upfront. 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one Are there are more efficient implementations avaiable? Thanks Let's make this slightly C-specific: You can use `qsort()` to sort the array. Once sorted, you need just one pass to determine the most frequent number (and its frequency), without needing to keep track of more than one count at a time (well, two if you include the current maximum). What you've proposed above would have required a separate count for every unique number you encounter. -- BR, Vladimir He who laughs, lasts. Mar 14 '06 #4

 P: n/a On 2006-03-14, Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. what I am doing is 1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one Are there are more efficient implementations avaiable? Thanks Reading between the lines a bit, you're making another vector of counts 1 .. N long. It might not need to be that long if not every number is represented; so you could use a sort of hash table where the hashing function is just n % n_buckets. This could save memory, but wouldn't be any faster. But if every time you update one of the counts you keep track of the maximum and minimum count so far, i.e.: ... counts[number]++; if (counts[number] > *max) max = counts + number; if (counts[number] < *min) min = counts + number; ... Then as soon as the number of elements left in the original vector is less than *max - *min, you can stop counting and break out of the loop, because none of them can get bigger than the max you've found. Would this be worth it? Well, statistically, I would have thought quite possibly, but you'd have to just try it and do some tests. Mar 14 '06 #5
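The "sort of hash table" with `n % n_buckets` as the hashing function might look something like this in C. It is a rough sketch, not the poster's code: the names `struct slot` and `most_frequent_hashed` are invented here, collisions are resolved by linear probing (one choice among several), and the caller is assumed to pick `n_buckets` at least as large as the number of distinct values.

```c
#include <stddef.h>
#include <stdlib.h>

struct slot {
    int value;
    size_t count;   /* count == 0 marks an empty slot */
};

/* Counts occurrences in a table of n_buckets slots and stores the most
   frequent value in *out. Returns 0 on success, -1 on failure (empty
   input, allocation failure, or more distinct values than buckets). */
int most_frequent_hashed(const int *data, size_t n, size_t n_buckets, int *out)
{
    struct slot *table;
    size_t best = 0, i;

    if (n == 0 || n_buckets == 0)
        return -1;
    table = calloc(n_buckets, sizeof *table);
    if (table == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        /* Convert to unsigned so negative values still give a valid index. */
        size_t h = (size_t)(unsigned)data[i] % n_buckets;
        size_t probes = 0;

        /* Linear probing: step past slots that hold a different value. */
        while (table[h].count != 0 && table[h].value != data[i]) {
            h = (h + 1) % n_buckets;
            if (++probes == n_buckets) {    /* every slot is taken */
                free(table);
                return -1;
            }
        }
        table[h].value = data[i];
        if (++table[h].count > best) {
            best = table[h].count;
            *out = data[i];
        }
    }
    free(table);
    return 0;
}
```

As the post says, this trades a little speed for memory: the table is sized by the number of distinct values rather than by the largest value.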

 P: n/a On 2006-03-14, Ben C wrote: On 2006-03-14, Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. what I am doing is 1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one Are there are more efficient implementations avaiable? Thanks Reading between the lines a bit, you're making another vector of counts 1 .. N long. It might not need to be that long if not every number is represented; so you could use a sort of hash table where the hashing function is just n % n_buckets. This could save memory, but wouldn't be any faster. But if every time you update one of the counts you keep track of the maximum and minimum count so far, i.e.: ... counts[number]++; if (counts[number] > *max) max = counts + number; if (counts[number] < *min) min = counts + number; ... Then as soon as the number of elements left in the original vector is less than *max - *min, you can stop counting and break out of the loop, because none of them can get bigger than the max you've found. The break condition is a nice optimization alright. A little simpler might be (no need to keep pointers to the maximum and minimum integer locations or to pay the overhead of pointer addition): count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) break; Would this be worth it? Well, statistically, I would have thought quite possibly, but you'd have to just try it and do some tests. Mar 14 '06 #6

 P: n/a On 2006-03-14, Richard G. Riley wrote: A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition). count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) break; Also what I posted was wrong: max - secondHighest is correct. max - min is not. The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. It means then you may need to get back from max to the corresponding number, since you need to return the number not the count. That was why I had max as a pointer. There are other ways of course, like keep the corresponding number in a variable as you go along etc... Mar 14 '06 #7

 P: n/a On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition). count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) /**** break; Also what I posted was wrong: max - secondHighest is correct. max - min is not. The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. I thought I did that? /**** above Mar 14 '06 #8

 P: n/a On 2006-03-14, Richard G. Riley wrote: On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition). count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) /**** break; Also what I posted was wrong: max - secondHighest is correct. max - min is not. The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. I thought I did that? /**** above You do break out in the right place, the question is, having broken out what do you have? The frequency of the highest-frequency number (but not the highest-frequency number itself-- because that's not necessarily "number", which was all I was saying). How to get from the max count back to the corresponding number is the question. That was why I was using pointers. *max is the highest count, max - counts would then be the highest-frequency number, assuming the counts array is just one entry for each possible value of number. But never mind, this is just details, and not really as complicated as I'm making it sound, and there are plenty of other ways of doing it that are just as good or better. It would be pretty obvious what to do when one actually implemented it I think. Mar 14 '06 #9

 P: n/a On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: A little simpler might be (no need to keep pointers to maximum and minum integer locations and have overhead of pointer addition). count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) /**** break; Also what I posted was wrong: max - secondHighest is correct. max - min is not. The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. I thought I did that? /**** above You do break out in the right place, the question is, having broken out what do you have? The frequency of the highest-frequency number (but not the highest-frequency number itself-- because that's not necessarily "number", which was all I was saying). How to get from the max count back to the corresponding number is the question. That was why I was using pointers. *max is the highest count, max - counts would then be the highest-frequency number, assuming the counts array is just one entry for each possible value of number. But never mind, this is just details, and not really as complicated as I'm making it sound, and there are plenty of other ways of doing it that are just as good or better. It would be pretty obvious what to do when one actually implemented it I think. You're absolutely right! a one liner after the if statement makes it all good mostCommonNumber=number; Mar 14 '06 #10
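Pulling the sub-thread together, one way to write the early-exit loop in C is sketched below. It is a variation on what was posted rather than a copy of it: the leading number is kept in a variable, and `runner_up` holds the highest count among all numbers other than the leader, which is the quantity the break test needs. The function name, the `max_value` and `processed` parameters, and the assumption that every value lies in 0..max_value (with max_value no larger than LONG_MAX) are all choices made for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns the most frequent value, or -1 on error. If processed is not
   NULL it receives the number of elements examined before stopping.
   Assumes every value is in 0..max_value and max_value <= LONG_MAX. */
long most_frequent_early_exit(const unsigned *data, size_t n,
                              unsigned max_value, size_t *processed)
{
    size_t *counts;
    size_t best = 0;        /* count of the current leader */
    size_t runner_up = 0;   /* highest count among all other values */
    long leader = -1;
    size_t i;

    if (n == 0)
        return -1;
    counts = calloc((size_t)max_value + 1, sizeof *counts);
    if (counts == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        unsigned v = data[i];
        size_t c;

        if (v > max_value) {
            free(counts);
            return -1;
        }
        c = ++counts[v];
        if ((long)v == leader) {
            best = c;                 /* the leader pulls further ahead */
        } else if (c > best) {
            runner_up = best;         /* the old leader becomes runner-up */
            leader = (long)v;
            best = c;
        } else if (c > runner_up) {
            runner_up = c;
        }
        /* n - i - 1 elements remain. Even if all of them went to the
           runner-up it could not reach the leader, so stop here. */
        if (n - i - 1 < best - runner_up) {
            i++;
            break;
        }
    }
    free(counts);
    if (processed != NULL)
        *processed = i;
    return leader;
}
```

For an input such as 3 3 3 3 3 3 1 2 4 5 the loop can stop after six elements; for evenly spread data it runs to the end, so whether the extra bookkeeping pays off would have to be measured.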

 P: n/a "Richard G. Riley" wrote in message news:cp************@quark.hadron... On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: On 2006-03-14, Ben C wrote: On 2006-03-14, Richard G. Riley wrote: A little simpler might be (no need to keep pointers to maximum and minimum integer locations and have overhead of pointer addition). count = ++counts[number]; if (count>=max){ secondHighest=max; max=count; } //numLeft is number left to process in vector. if(numLeft < max-secondHighest) /**** break; Also what I posted was wrong: max - secondHighest is correct. max - min is not. The point at which you break out is not necessarily the point at which you discover the "winning" max count though; you could have discovered max and secondHighest on previous iterations, and break out because numLeft has got small enough. I thought I did that? /**** above You do break out in the right place, the question is, having broken out what do you have? The frequency of the highest-frequency number (but not the highest-frequency number itself-- because that's not necessarily "number", which was all I was saying). How to get from the max count back to the corresponding number is the question. That was why I was using pointers. *max is the highest count, max - counts would then be the highest-frequency number, assuming the counts array is just one entry for each possible value of number. But never mind, this is just details, and not really as complicated as I'm making it sound, and there are plenty of other ways of doing it that are just as good or better. It would be pretty obvious what to do when one actually implemented it I think. You're absolutely right! a one liner after the if statement makes it all good mostCommonNumber=number; The OP also has to consider that the distribution of values might be multi-modal: 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 The above set has 6 equally populated modes. -- Fred L. Kleinschmidt Boeing Associate Technical Fellow Technical Architect, Software Reuse Project Mar 14 '06 #11
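If ties matter, one option is to report every mode rather than a single winner. The C sketch below does that with a plain count array; the function name `find_all_modes` and its parameters are hypothetical, and it assumes the values lie in 0..max_value and that the caller supplies an output buffer with room for max_value + 1 entries.

```c
#include <stddef.h>
#include <stdlib.h>

/* Writes every value that attains the highest count into modes[], in
   ascending order, and returns how many were written. Returns 0 on
   empty input, out-of-range input, or allocation failure. */
size_t find_all_modes(const unsigned *data, size_t n,
                      unsigned max_value, unsigned *modes)
{
    size_t *counts;
    size_t best = 0, n_modes = 0, i;
    unsigned v;

    if (n == 0)
        return 0;
    counts = calloc((size_t)max_value + 1, sizeof *counts);
    if (counts == NULL)
        return 0;

    /* First pass: count, and remember the highest count seen. */
    for (i = 0; i < n; i++) {
        if (data[i] > max_value) {
            free(counts);
            return 0;
        }
        if (++counts[data[i]] > best)
            best = counts[data[i]];
    }
    /* Second pass over the counters: collect every value tied for best. */
    for (v = 0; ; v++) {
        if (counts[v] == best)
            modes[n_modes++] = v;
        if (v == max_value)
            break;
    }
    free(counts);
    return n_modes;
}
```

For the set 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 this reports six modes, 1 through 6.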

 P: n/a "Imran" wrote in message news:dv**********@ns2.fe.internet.bosch.com... I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. Forget about sorting the data; the time it takes to sort will be much more than the time it takes to count it. Try to loop through the data only once. Factors which heavily affect how this could be implemented are: 1) the range of the integers 2) the data source of the vector 3) the size of the vector If the range of the integers is small, say zero to 127 (ASCII) for a text analysis of William Shakespeare, the implementation is simple: 1) create a counter array of 128 unsigned longs or unsigned long longs 2) iterate from the first to the last element of the vector 3) increment the appropriate counter If the range is very large, you may want to create multiple arrays to help you keep track of the data and reduce memory usage: 1) one array of bits where each bit is a seen or not-seen indicator of values in the vector 2) another array or binary tree which keeps track of the count for each of the seen integers 3) you may need to normalize the data first, say by mapping one set of numbers to another. This could allow you to reduce the range of integers used: 1,6,30,32,60,200 to 1,2,3,4,5,6 If the data source of the vector is, say, some function, forget about generating the vector. Just feed the data into your tabulation method. If the "huge" vector is truly huge, it becomes an issue of finding a method to balance the memory available for tabulation and the speed of execution. Rod Pemberton Mar 14 '06 #12
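For the small-range case, the "just feed the data into your tabulation method" advice could be sketched in C as a tiny tally object that is handed one value at a time, so the vector never has to exist in memory. The names `struct tally`, `tally_init`, `tally_add` and `tally_most_frequent` are invented for this illustration; note that codes 0 through 127 need 128 counters.

```c
#include <stddef.h>

#define N_CODES 128   /* codes 0..127 need 128 counters */

struct tally {
    unsigned long counts[N_CODES];
};

/* Resets every counter to zero. */
void tally_init(struct tally *t)
{
    size_t i;
    for (i = 0; i < N_CODES; i++)
        t->counts[i] = 0;
}

/* Records one value as it arrives; returns -1 if it is out of range. */
int tally_add(struct tally *t, int code)
{
    if (code < 0 || code >= N_CODES)
        return -1;
    t->counts[code]++;
    return 0;
}

/* Returns the code seen most often (the lowest one on a tie),
   or -1 if nothing has been recorded yet. */
int tally_most_frequent(const struct tally *t)
{
    unsigned long best = 0;
    int best_code = -1;
    int code;

    for (code = 0; code < N_CODES; code++) {
        if (t->counts[code] > best) {
            best = t->counts[code];
            best_code = code;
        }
    }
    return best_code;
}
```

A caller reading text would call `tally_add` once per character as it is read and ask for the result at the end; the final scan over 128 counters is negligible next to the input pass.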

 P: n/a Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. There are two common "types" of complexity (of an algorithm): - expected complexity - worst-case complexity For calculating expected complexity you need to know something about the probabilities of different vectors. Moreover, good expected complexity is not good enough for some applications (e.g. real-time). This leaves you with worst-case complexity. what I am doing is 1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one In terms of worst-case complexity, none of the approaches is better than yours. I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Faster sorting algorithms usually work better with linked lists than with arrays, so you would need some extra memory ;-) My suggestion would be to use a (height-balanced) binary search tree. At the nodes you keep pairs of (value, frequency). For each array element, search for the value in the tree. If it exists, increment its frequency; else add a new node (value, 1). After you are done with the array, look for the node with the highest frequency. This algorithm is especially fast if you don't have many different values. There should be BST implementations available in C. -------- If your integer type is small compared to the size of the array, sorting the array with bucket sort could be even faster than using a BST. Have fun /Jan-Hinnerk Mar 14 '06 #13
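The tree-of-(value, frequency) idea can be sketched in C as below. To stay short, this version uses a plain binary search tree with no rebalancing, so it does not have the height-balanced guarantee the post asks for and can degrade (including deep recursion in the final scan) on already-sorted input. All names here (`struct node`, `tree_count`, `most_frequent_bst`) are made up for the example.

```c
#include <stddef.h>
#include <stdlib.h>

struct node {
    int value;
    size_t freq;
    struct node *left, *right;
};

/* Increments the frequency of value, adding a node (value, 1) if it is
   not in the tree yet. Returns 0 on success, -1 if allocation fails. */
static int tree_count(struct node **root, int value)
{
    while (*root != NULL) {
        if (value == (*root)->value) {
            (*root)->freq++;
            return 0;
        }
        root = value < (*root)->value ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL)
        return -1;
    (*root)->value = value;
    (*root)->freq = 1;
    (*root)->left = (*root)->right = NULL;
    return 0;
}

/* Walks the whole tree and returns the node with the highest frequency. */
static const struct node *tree_best(const struct node *n,
                                    const struct node *best)
{
    if (n == NULL)
        return best;
    if (best == NULL || n->freq > best->freq)
        best = n;
    best = tree_best(n->left, best);
    return tree_best(n->right, best);
}

static void tree_free(struct node *n)
{
    if (n != NULL) {
        tree_free(n->left);
        tree_free(n->right);
        free(n);
    }
}

/* Stores the most frequent value in *out; returns 0 on success, -1 on
   empty input or allocation failure. */
int most_frequent_bst(const int *data, size_t n, int *out)
{
    struct node *root = NULL;
    const struct node *best;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tree_count(&root, data[i]) != 0) {
            tree_free(root);
            return -1;
        }
    }
    best = tree_best(root, NULL);
    if (best == NULL)
        return -1;
    *out = best->value;
    tree_free(root);
    return 0;
}
```

Memory use is one node per distinct value, which is why the approach suits inputs with few different values regardless of how large those values are.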

 P: n/a Jan-Hinnerk Dumjahn writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Quicksort has complexity O(n log n). I think pure Quicksort has worst-case complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort. Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;-) I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array. -- Keith Thompson (The_Other_Keith) ks***@mib.org San Diego Supercomputer Center <*> We must do something. This is something. Therefore, we must do this. Mar 14 '06 #14

 P: n/a Keith Thompson wrote: Jan-Hinnerk Dumjahn writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Quicksort has complexity O(n log n). I think pure Quicksort has worst-case complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort. I have taken a quick look on Google but wasn't able to find anything on Quicksort with O(n log n). However, I came across heapsort, which seems to fit quite nicely. (Wish we had done more fun stuff like that at university...) qsort() need not have a poor implementation, but it could ;-) Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;-) I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array. Complexity should be the same for both (if you don't need random access). From a theoretical point of view I still believe that e.g. bucket sort would be slower using arrays. However, if I consider real-world effects like processor cache, using arrays could be a lot faster ;-( Mar 14 '06 #15

 P: n/a Jan-Hinnerk Dumjahn writes: Keith Thompson wrote: Jan-Hinnerk Dumjahn writes: [...] Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;-) I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array. Complexity should be same for both (if you don't need random access). From a theoretical approach I still believe that e.g. bucketsort would be slower using arrays. However, if I consider real world effects like processor cache using arrays could be a lot faster ;-( Most decent (O(n log n)) sorting algorithms do require random access. -- Keith Thompson (The_Other_Keith) ks***@mib.org San Diego Supercomputer Center <*> We must do something. This is something. Therefore, we must do this. Mar 14 '06 #16

 P: n/a On 2006-03-14, Keith Thompson wrote: Jan-Hinnerk Dumjahn writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Quicksort has complexity O(n log n). I think pure Quicksort has worst-case complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort. Faster sorting algorithms usually work better with linked lists then with arrays, so you would need some extra memory ;-) I don't think that's true. The fastest way to sort a linked list is usually to copy it to an array and sort the array. I'd argue that the fastest way to sort a linked list is to keep it sorted in the first place. Mar 14 '06 #17

 P: n/a Jan-Hinnerk Dumjahn wrote: I have taken a quick look on google but wasn't able to find something on Quicksort with O(N log n). However, I came across heapsort which seems to fit quite nicely. Quicksort has O(n log(n)) _average_ complexity. http://en.wikipedia.org/wiki/Sorting_algorithm http://en.wikipedia.org/wiki/Quicksort Mar 15 '06 #19

 P: n/a Jan-Hinnerk Dumjahn wrote: I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Who said qsort() _must_ implement Quicksort? Mar 15 '06 #20

 P: n/a Keith Thompson wrote: Jan-Hinnerk Dumjahn writes: [...] I guess the best approach is some kind of sorting and then scanning through once, like Vladimir suggested. However qsort() is not the best choice since Quicksort has complexity O(n^2). Quicksort has complexity O(n log n). I think pure Quicksort has worst-case complexity O(n^2), but there's no requirement for qsort() to be implemented as pure Quicksort. In fact, there's no requirement for it to be implemented as any kind of quicksort at all, though for obvious reasons it often is. Richard Mar 15 '06 #21

 P: n/a Imran wrote: I have a vector of integers, such as [1 3 6 7 6 3 3 4 9 10 ] and I want to find out the number which occurs most frequently.what is the quick method. My array size is huge. what I am doing is 1. find out the maximum value N 2. loop through 1...N 3. count # times each occurred 4. output the most frequent one Are there are more efficient implementations avaiable? Thanks

Try this. Note that if there are multiple integers with the same number of occurrences, this code returns the highest valued one.

    #include <stdio.h>
    #include <stdlib.h>

    int cmp(const void *p, const void *q)
    {
        int a = *(const int *) p, b = *(const int *) q;
        return (a > b) - (a < b);   /* avoids overflow of a - b */
    }

    /* nelem must be at least 1 */
    int sort_and_get_most_popular(int *arr, size_t nelem)
    {
        int ret;
        size_t i, count = 1, best = 0;

        qsort(arr, nelem, sizeof *arr, cmp);
        ret = arr[0];
        for (i = 1; i < nelem; i++) {
            if (arr[i] == arr[i - 1])
                count++;
            else {
                if (count >= best) {
                    ret = arr[i - 1];
                    best = count;
                }
                count = 1;
            }
        }
        if (count >= best)          /* the last run ends with the array */
            ret = arr[nelem - 1];
        return ret;
    }

    int main(void)
    {
        int arr[] = { 1, 3, 6, 7, 6, 3, 3, 4, 9, 10 };
        int result = sort_and_get_most_popular(arr, sizeof arr / sizeof *arr);
        printf("result: %d\n", result);
        return 0;
    }

    [mark@icepick]$ gcc -Wall -ansi -pedantic -O2 foo.c -o foo
    [mark@icepick]$ ./foo
    result: 3

Give it a good review, though-- It's late, I'm tired, and I whipped it up very quickly. Mark F. Haigh mf*****@sbcglobal.net Mar 15 '06 #22
