473,401 Members | 2,125 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,401 software developers and data experts.

An interview question.

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt
Nov 14 '05 #1
12 1517
Matt wrote on 27/07/04 :
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC


A classical word count program. What exactly is your C-question?

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html

"C is a sharp tool"

Nov 14 '05 #2

"Matt" <ma********@hotmail.com> wrote
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Firstly you want to query your spec. Usually by "compression" we mean a
reversible process by which the original text is represented in a more
compact format. What you have been given is more like a concordance program.

What you have been given is a typical general assignment.

Firstly you need a definition of what is a word. Probably any sequence of
alphabetical characters or apostrophes separated by whitespace or
non-apostrophe punctuation will do.

You need to go through the input file, inputting lines and pulling out each
word. Beware that some text files contain extremely long lines, because
newline is used only as a paragraph marker.

As you pull out each word, you need to check whether it is in the
dictionary. If it is, the count is incremented, if not the word is added.

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

You need to ask "what running time is acceptable?" to know whether or not
they are looking for optimisation.
Nov 14 '05 #3
Matt wrote:

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt


You might want to read the chapter on compression in "The Turing Omnibus"
by A.K. Dewdney, for a popular introduction to the gist of this algorithm.

--
Julian V. Noble
Professor Emeritus of Physics
jv*@lessspamformother.virginia.edu
^^^^^^^^^^^^^^^^^^
http://galileo.phys.virginia.edu/~jvn/

"For there was never yet philosopher that could endure the toothache
patiently." -- Wm. Shakespeare, Much Ado about Nothing. Act v. Sc. 1.
Nov 14 '05 #4
"Malcolm" <ma*****@55bank.freeserve.co.uk> wrote in message
news:ce**********@newsg1.svr.pol.co.uk...

"Matt" <ma********@hotmail.com> wrote
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
[snip] As you pull out each word, you need to check whether it is in the
dictionary. If it is, the count is incremented, if not the word is added.

The interesting part comes when you need to optimise the program. Searching through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this using a red-black tree. An alternative strategy is to use a hash table to
store the words.


Or just a series of linked-lists with first letter + character count. So
AAAAA goes to the a5 bucket and BBB goes to the b3 bucket (alphabetically).
Worry about comparing / counting after reading in all the words. No need to
compare, say, AAA with AAAA, might save time, and you should know the size
from parsing the input file looking for whitespace.

I'm sure it's a short list of words that would fit in memory, as most
homework usually is. ;-)

--
Mabden
Nov 14 '05 #5
"> > I was asked to compress a text file in a way that evey word be
written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.


If its taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

Rob Gamble
Nov 14 '05 #6

"tigervamp" <ro**********@hotmail.com> wrote in message

If its taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

I should have been a bit clearer. The text file is n bytes in size, so we
need to search the dictionary O(n) times. However the dictionary gets bigger
as we add words, and the bigger the text file the more unique words there
will be. For medium size English language text files the relationship will
be roughly linear, so we have O(n) words in the dictionary. A naive program
is thus O(n^2).

If you use the first-letter index method, you are still technically O(n^2),
but you have cut the search time by a factor of 26, and for practical
purposes this might well be as good as a red-black tree implementation.
Nov 14 '05 #7
kal
ro**********@hotmail.com (tigervamp) wrote in message news:<c9**************************@posting.google. com>...
If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.
Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.


Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

OTOH, if the stream consists solely of 1 million occurrences
of the word "genius"...

OTOOH, You ought not to have beeen writing your name so many
times.
Nov 14 '05 #8
"kal" <k_*****@yahoo.com> wrote in message
news:a5*************************@posting.google.co m...
ro**********@hotmail.com (tigervamp) wrote in message
news:<c9**************************@posting.google. com>...
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.
Shouldn't it be O(n/2) on average?


O(n/2) makes as much sense as O(n(n-1)/4).

[snip] Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.


Alex
Nov 14 '05 #9
Matt wrote:
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt


One of the popular compression algorithm is based on this
technique. I think it might be Huffman, or start with an 'L'.
Perhaps somebody else will chime in. :-)

Essentially, the algorithm creates a table of patterns, then
if the pattern is repeated, writes out the index of the
pattern.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #10

On Wed, 28 Jul 2004, Thomas Matthews wrote:

Matt wrote:

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. [...]


One of the popular compression algorithm is based on this
technique. I think it might be Huffman, or start with an 'L'.
Perhaps somebody else will chime in. :-)

Essentially, the algorithm creates a table of patterns, then
if the pattern is repeated, writes out the index of the
pattern.


This is the way /all/ lossless compression algorithms work!
The simplest of these is ROT-26, in which the index corresponding
to a pattern P is the value of P itself. In Huffman compression,
the size of the index value of P gets shorter as P gets more
common, and longer as P gets less common. In LZW compression,
the number of patterns "tracked" with index values changes dynamically
with the contents of the input data stream.

Remember: The OP's problem is /not/ lossless compression! His
output lets us reconstruct that the input contained, say, two
"AAA"s and one "AAB", but not in which order they occurred.

F-up-to: comp.programming

-Arthur
Nov 14 '05 #11
k_*****@yahoo.com (kal) wrote in message news:<a5*************************@posting.google.c om>...
ro**********@hotmail.com (tigervamp) wrote in message news:<c9**************************@posting.google. com>...
If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.


Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.


Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.


Not quite sure how you come up with this. The act of searching such a
list should be O(n), the time it takes to find an item is directly
propertional to the number of items in the list. Even if on average
you only search half the list before finding the target, you will
still spend twice as long looking on a list that's twice the size.
What am I missing here?

Rob Gamble
Nov 14 '05 #12
k_*****@yahoo.com (kal) writes:
ro**********@hotmail.com (tigervamp) wrote in message
news:<c9**************************@posting.google. com>...
If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.
Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.


Shouldn't it be O(n/2) on average?


n/2 *is* O(n); there's no difference between O(n/2) and O(n).
Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.


Similarly, (n(n-1))) / 4 is O(n^2).

A Google search turns up <http://en.wikipedia.org/wiki/Big_O_notation>.

BTW, this really has nothing to do with C. You might try
comp.programming.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #13

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

54
by: Spammay Blockay | last post by:
I've been tasked with doing technical interviews at my company, and I have generally ask a range of OO, Java, and "good programming technique" concepts. However, one of my favorite exercises I...
10
by: Gopal Krish | last post by:
I was asked this question in an interview. How can you display the contents of an ASP page (from another web server) in a aspx page, ie, first half of the screen will display contents from asp...
0
by: Jobs | last post by:
Download the JAVA , .NET and SQL Server interview sheet and rate yourself. This will help you judge yourself are you really worth of attending interviews. If you own a company best way to judge if...
2
by: Jobs | last post by:
Download the JAVA , .NET and SQL Server interview with answers Download the JAVA , .NET and SQL Server interview sheet and rate yourself. This will help you judge yourself are you really worth of...
0
by: connectrajesh | last post by:
INTERVIEWINFO.NET http://www.interviewinfo.net FREE WEB SITE AND SERVICE FOR JOB SEEKERS /FRESH GRADUATES NO ADVERTISEMENT
18
by: Nobody | last post by:
I've been looking for a job for a while now, and have run into this interview question twice now... and have stupidly kind of blown it twice... (although I've gotten better)... time to finally...
2
by: freepdfforjobs | last post by:
Full eBook with 4000 C#, JAVA,.NET and SQL Server Interview questions http://www.questpond.com/SampleInterviewQuestionBook.zip Download the JAVA , .NET and SQL Server interview sheet and rate...
0
by: freesoftwarepdfs | last post by:
Ultimate list of Interview question website.....Do not miss it http://www.questpond.com http://msdotnetsupport.blogspot.com/2007/01/net-interview-questions-by-dutt-part-2.html...
0
by: Free PDF | last post by:
..NET , SQL Server interview questions websites.... http://www.questpond.com http://www.geocities.com/dotnetinterviews/...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.