473,385 Members | 1,409 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

frequency list

I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.
% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab. I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet. Thank you time.

Apr 25 '06 #1
19 3110
jason_box said:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.


If you just want to know how many words there are in all, it's pretty easy,
yes. If you wish to record the frequency of each distinct word, it's not as
simple as it sounds. I don't see anyone doing it in fifteen lines, without
using either extremely long lines or, perhaps, non-standard language
extensions (such as, say, C++).

I recommend that you do /not/ try, just yet, to try the latter. Instead,
just write a program to count words. This is easy to do. Start a counter at
0, and record the current "state" of your input - which should be "not in a
word" - i.e. 0. Whenever you hit a non-whitespace character, check the
state. If it's 0 (i.e. not in a word), change it to 1 (i.e. in a word) and
add 1 to the counter. Otherwise, do nothing. Whenever you hit a whitespace
character, change the state to 0.

#include <ctype.h> to get the prototype for isspace().

Once you've dealt with all the input, print the counter.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Apr 25 '06 #2
On 25 Apr 2006 12:07:09 -0700, "jason_box" <cp******@gmail.com> wrote:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.
% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab.
Your first line says "frequency of words", but your example looks like
frequency of characters. Which is it?
I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet.


Do try. Then you can come back for help and critique.

--
Al Balmer
Sun City, AZ
Apr 25 '06 #3
Oh sorry about that, yes it is a frequency of characters in the input
file.

Al Balmer wrote:
On 25 Apr 2006 12:07:09 -0700, "jason_box" <cp******@gmail.com> wrote:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.
% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab.


Your first line says "frequency of words", but your example looks like
frequency of characters. Which is it?
I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet.


Do try. Then you can come back for help and critique.

--
Al Balmer
Sun City, AZ


Apr 25 '06 #4
jason_box said:
Oh sorry about that, yes it is a frequency of characters in the input
file.


#include <limits.h>
#include <stdio.h>
int main(void){
unsigned long c[UCHAR_MAX + 1]={0};int ch;while((ch=getchar())!=EOF)++c[ch];
for(ch=0;ch<256;ch++)if(c[ch]>0)printf("%d:%lu\n",ch,c[ch]);return 0;}
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Apr 25 '06 #5
What is the lenght of an unsigned long versus using a standard long for
the character array?

Richard Heathfield wrote:
jason_box said:
Oh sorry about that, yes it is a frequency of characters in the input
file.


#include <limits.h>
#include <stdio.h>
int main(void){
unsigned long c[UCHAR_MAX + 1]={0};int ch;while((ch=getchar())!=EOF)++c[ch];
for(ch=0;ch<256;ch++)if(c[ch]>0)printf("%d:%lu\n",ch,c[ch]);return 0;}
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)


Apr 25 '06 #6
On 25 Apr 2006 13:28:19 -0700, "jason_box" <cp******@gmail.com> wrote:

<top posting corrected>

Al Balmer wrote:
On 25 Apr 2006 12:07:09 -0700, "jason_box" <cp******@gmail.com> wrote:
<snip>
>
>It just checks for the characters and converts it to the ascci value
>and number of occurrences and just uses a simple space or tab.
Your first line says "frequency of words", but your example looks like
frequency of characters. Which is it? Oh sorry about that, yes it is a frequency of characters in the input
file.


That's actually easier <g>. Do you have your first attempt yet?
> I think
>someone said that it shouldn't take more than 10-15 lines but I haven't
>tried yet.


Do try. Then you can come back for help and critique.


--
Al Balmer
Sun City, AZ
Apr 25 '06 #7
jason_box said:
What is the lenght of an unsigned long versus using a standard long
An standard long is sizeof(long) bytes in size, whereas an unsigned long,
which is just as standard as a standard long, is sizeof(unsigned long)
bytes in size.
for the character array?


What character array?

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Apr 25 '06 #8
Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list. I read that there can be many ways of doing this and even
as simple as not using any crazy data structures to get the encoding. I
know for the huffman tree, it uses a tree structure and generates a
node and eventually combines it with a root and left traversal would
denote a '0' and right traversal would denote '1' and with the
traversal to a specific node, aka a character is your encoding. Guess
now I have to find a smart way of encoding the list that is given into
the huffman codes.

Richard Heathfield wrote:
jason_box said:
What is the lenght of an unsigned long versus using a standard long


An standard long is sizeof(long) bytes in size, whereas an unsigned long,
which is just as standard as a standard long, is sizeof(unsigned long)
bytes in size.
for the character array?


What character array?

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)


Apr 25 '06 #9
jason_box wrote:
I'm alittle new at C and I'm trying to write a simple program that will
record the frequency of words and just print it out. It is suppose to
take stdin and I heard it's only a few lines but I'm not sure where to
start.

The only example I have to work with is if you ran the program say:

File list contains:
This is a test.
% test < list

10 1
32 3
46 1
84 1
97 1
101 1
104 1
105 2
115 3
116 2

It just checks for the characters and converts it to the ascci value
and number of occurrences and just uses a simple space or tab. I think
someone said that it shouldn't take more than 10-15 lines but I haven't
tried yet. Thank you time.


Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.
Apr 25 '06 #10
jason_box said:
Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list.


That's pretty trivial; and having seen you display such mastery over the
frequency list program, I don't think you'll have any trouble doing it
yourself.

Just as a brief hint, though, Huffman encoding works like this: if you have
n bytes in your data and a particular character c occurs f times, then
character c should be represented in the Huffman tree by a bit sequence
log2(n / f) bits long (except that you might want to reserve one relatively
short bit sequence as a "quote" sequence for characters that are very rare
but nevertheless present).

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Apr 25 '06 #11
I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.

Richard Heathfield wrote:
jason_box said:
Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list.


That's pretty trivial; and having seen you display such mastery over the
frequency list program, I don't think you'll have any trouble doing it
yourself.

Just as a brief hint, though, Huffman encoding works like this: if you have
n bytes in your data and a particular character c occurs f times, then
character c should be represented in the Huffman tree by a bit sequence
log2(n / f) bits long (except that you might want to reserve one relatively
short bit sequence as a "quote" sequence for characters that are very rare
but nevertheless present).

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)


Apr 25 '06 #12
I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.

Richard Heathfield wrote:
jason_box said:
Oh nevermind, I understand it now. From this list that is produced from
the frequency encoding, I would need to generate the huffman encoding
for the list.


That's pretty trivial; and having seen you display such mastery over the
frequency list program, I don't think you'll have any trouble doing it
yourself.

Just as a brief hint, though, Huffman encoding works like this: if you have
n bytes in your data and a particular character c occurs f times, then
character c should be represented in the Huffman tree by a bit sequence
log2(n / f) bits long (except that you might want to reserve one relatively
short bit sequence as a "quote" sequence for characters that are very rare
but nevertheless present).

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)


Apr 25 '06 #13
jason_box said:
I guess the generation of the huffman code is where it'll get a little
complicated.
No, not really.
Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier.


Look up "trie".

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Apr 25 '06 #14
I've worked with Trie's before and I didn't have a good time with them,
though that is one good structure to use since you can traverse down
the tree and each node holds a character and there can be extra param
to encode 0 or 1. I'm going to look for a more basic structure, maybe
priority queue would work. Thanks for your help.
Richard Heathfield wrote:
jason_box said:
I guess the generation of the huffman code is where it'll get a little
complicated.


No, not really.
Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier.


Look up "trie".

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)


Apr 25 '06 #15
"jason_box" <cp******@gmail.com> writes:
I guess the generation of the huffman code is where it'll get a little
complicated. Depending on what sort of structure that I used. I just
read from other forums that it does have to be done in a BST structure
but perhaps something simplier. I guess I'll haver to jot this on paper
and see what I can come up with.


You posted two copies of the same article, less than a minute apart.
Apparently there's some flaw in Google Groups that encourages this
kind of mistake.

Also, please don't top-post. Your response goes after, or mixed with,
any quoted text, and the quoted text should be trimmed to just what's
relevant to your response. In particular, don't quote the signature
unless you're commenting on it. See most of the articles in this
newsgroup for examples; see also <http://www.caliburn.nl/topposting.html>.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Apr 25 '06 #16
On Tue, 25 Apr 2006 21:54:06 GMT, Andrew Poelstra <ap*******@shaw.ca>
wrote:


Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.


Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.

--
Al Balmer
Sun City, AZ
Apr 25 '06 #17
Al Balmer wrote:
On Tue, 25 Apr 2006 21:54:06 GMT, Andrew Poelstra <ap*******@shaw.ca>
wrote:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.


Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.

That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.
Apr 25 '06 #18
>From reading some more in our book, I think a min-heap would be the
best to use? or would ti be a max heap since at the end of the
encoding, the max value will be computed from the two previous
children? Also I had a question on how i could store the values
generated by the frequency list that is read in through stdin. I have
the first column as the characters in ascii value adn the second column
thhe frequency and I would liek to store it somehow that they would
always link to each other. What would be teh best method to do the
follwing?

Thank you.

Andrew Poelstra wrote:
Al Balmer wrote:
On Tue, 25 Apr 2006 21:54:06 GMT, Andrew Poelstra <ap*******@shaw.ca>
wrote:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.


Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.

That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.


Apr 26 '06 #19
>From reading some more in our book, I think a min-heap would be the
best to use? or would ti be a max heap since at the end of the
encoding, the max value will be computed from the two previous
children? Also I had a question on how i could store the values
generated by the frequency list that is read in through stdin. I have
the first column as the characters in ascii value adn the second column
thhe frequency and I would liek to store it somehow that they would
always link to each other. What would be teh best method to do the
follwing?

Thank you.

Andrew Poelstra wrote:
Al Balmer wrote:
On Tue, 25 Apr 2006 21:54:06 GMT, Andrew Poelstra <ap*******@shaw.ca>
wrote:
Assuming you meant 'frequency of characters':

I wrote a program like that for help with cryptograms... unfortunately,
it's on my laptop, so I can't simply copy-and-paste it onto here. Also,
it had a curses interface, and so I'd have to modify the display module
to post it.

Basically, you create a
struct letter {
int chr;
int freq;
}

Then make an array of those, and each time you encounter a character,
increase the list[x].freq. chr should be set when you initialize the array.


Or, considering that there are a limited number of possible
characters, just declare a frequency array of UCHAR_MAX + 1 elements
and use the character to index into it.

That's true; however, when sorting you run into problems. So,

Z - 5
X - 4
B - 3
C - 9
H - 11

will turn into
Z - 11
X - 9
B - 5
C - 4
H - 3

when sorted, because obviously the indices are not changed.


Apr 26 '06 #20

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Johan Hahn | last post by:
Wouldn't it be nice if list.count, called without any arguments, returned a dict with the list's unique items as keys and their frequency of occurance as values? >>> .count() {'a': 1, 1: 2, 2:...
2
by: s | last post by:
Here is a snippet of my code: <code> list< MyClass * >outgoing_pool; MyClass* GetWaitingObject(string freq) { MyClass *temp_ptr = NULL; list<MyClass*>::iterator i;
9
by: christopher diggins | last post by:
I would like to survey how widespread the usage of smart pointers in C++ code is today. Any anecdotal experience about the frequency of usage of smart pointer for dynamic allocation in your own...
7
by: Udhay | last post by:
How to get the frequency of an audio file and how to separate the low and high frequency of an audio file
2
by: fatenn | last post by:
I need make from the Oracle Text index of the "word-frequency histogram", this is list of the tokens in this index, where each token contains the list of documents that contain that token and...
19
by: Juha Nieminen | last post by:
If I'm not completely mistaken, the only reason why std::list::size() may be (and usually is) a linear-time operation is because they want std::list::splice() to be a constant-time operation, and...
8
by: Andrew Savige | last post by:
I'm learning Python by reading David Beazley's "Python Essential Reference" book and writing a few toy programs. To get a feel for hashes and sorting, I set myself this little problem today (not...
13
by: umpsumps | last post by:
Hello, Here is my code for a letter frequency counter. It seems bloated to me and any suggestions of what would be a better way (keep in my mind I'm a beginner) would be greatly appreciated.. ...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.