An interview question.

Matt

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt

Nov 14 '05 #1

Subscribe Post Reply

1517

Emmanuel Delahaye

Matt wrote on 27/07/04 :

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

A classical word count program. What exactly is your C-question?

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html

"C is a sharp tool"

Nov 14 '05 #2

Malcolm

"Matt" <ma********@hotmail.com> wrote

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Firstly you want to query your spec. Usually by "compression" we mean a
reversible process by which the original text is represented in a more
compact format. What you have been given is more like a concordance program.

What you have been given is a typical general assignment.

Firstly you need a definition of what is a word. Probably any sequence of
alphabetical characters or apostrophes separated by whitespace or
non-apostrophe punctuation will do.

You need to go through the input file, inputting lines and pulling out each
word. Beware that some text files contain extremely long lines, because
newline is used only as a paragraph marker.

As you pull out each word, you need to check whether it is in the
dictionary. If it is, the count is incremented, if not the word is added.

The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

You need to ask "what running time is acceptable?" to know whether or not
they are looking for optimisation.

Nov 14 '05 #3

Julian V. Noble

Matt wrote:

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........

the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt

You might want to read the chapter on compression in "The Turing Omnibus"
by A.K. Dewdney, for a popular introduction to the gist of this algorithm.

--
Julian V. Noble
Professor Emeritus of Physics
jv*@lessspamformother.virginia.edu
^^^^^^^^^^^^^^^^^^
http://galileo.phys.virginia.edu/~jvn/

"For there was never yet philosopher that could endure the toothache
patiently." -- Wm. Shakespeare, Much Ado about Nothing. Act v. Sc. 1.

Nov 14 '05 #4

Mabden

"Malcolm" <ma*****@55bank.freeserve.co.uk> wrote in message
news:ce**********@newsg1.svr.pol.co.uk...

"Matt" <ma********@hotmail.com> wrote
I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
[snip] As you pull out each word, you need to check whether it is in the
dictionary. If it is, the count is incremented, if not the word is added.

The interesting part comes when you need to optimise the program. Searching through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this using a red-black tree. An alternative strategy is to use a hash table to
store the words.

Or just a series of linked-lists with first letter + character count. So
AAAAA goes to the a5 bucket and BBB goes to the b3 bucket (alphabetically).
Worry about comparing / counting after reading in all the words. No need to
compare, say, AAA with AAAA, might save time, and you should know the size
from parsing the input file looking for whitespace.

I'm sure it's a short list of words that would fit in memory, as most
homework usually is. ;-)

--
Mabden

Nov 14 '05 #5

tigervamp

"> > I was asked to compress a text file in a way that evey word be
written

with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.
The interesting part comes when you need to optimise the program. Searching
through the dictionary is an O(N^2) operation. However if you can keep the
dictionary sorted alphabetically, it reduces to O(N log N). You can do this
using a red-black tree. An alternative strategy is to use a hash table to
store the words.

If its taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

Rob Gamble

Nov 14 '05 #6

Malcolm

"tigervamp" <ro**********@hotmail.com> wrote in message

If its taking you O(n^2) time to search a list of items, there is
something horribly wrong with your algorithm. You can search an
unsorted list of items by simply scanning each item in the list until
you find what you are looking for as an O(n) operation. Searching a
binary tree, such as a red-black tree, is an O(log n) operation.

I should have been a bit clearer. The text file is n bytes in size, so we
need to search the dictionary O(n) times. However the dictionary gets bigger
as we add words, and the bigger the text file the more unique words there
will be. For medium size English language text files the relationship will
be roughly linear, so we have O(n) words in the dictionary. A naive program
is thus O(n^2).

If you use the first-letter index method, you are still technically O(n^2),
but you have cut the search time by a factor of 26, and for practical
purposes this might well be as good as a red-black tree implementation.

Nov 14 '05 #7

kal

ro**********@hotmail.com (tigervamp) wrote in message news:<c9**************************@posting.google. com>...

If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.
Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.

Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

OTOH, if the stream consists solely of 1 million occurrences
of the word "genius"...

OTOOH, You ought not to have beeen writing your name so many
times.

Nov 14 '05 #8

Alex Fraser

"kal" <k_*****@yahoo.com> wrote in message
news:a5*************************@posting.google.co m...

ro**********@hotmail.com (tigervamp) wrote in message
news:<c9**************************@posting.google. com>...
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.
Shouldn't it be O(n/2) on average?

O(n/2) makes as much sense as O(n(n-1)/4).

[snip] Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Alex

Nov 14 '05 #9

Thomas Matthews

Matt wrote:

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. Something like
this:

If this is the original file:

---------------------------------------
AAAA BBB CCCC AAAA CCCC AAAA........
the result will be:
---------------------------------------
3 AAAA
1 BBB
2 CCCC

I appreciate any hints.

Matt

One of the popular compression algorithm is based on this
technique. I think it might be Huffman, or start with an 'L'.
Perhaps somebody else will chime in. :-)

Essentially, the algorithm creates a table of patterns, then
if the pattern is repeated, writes out the index of the
pattern.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #10

Arthur J. O'Dwyer

On Wed, 28 Jul 2004, Thomas Matthews wrote:

Matt wrote:

I was asked to compress a text file in a way that evey word be written
with the number of times that such a word is repeated. [...]

One of the popular compression algorithm is based on this
technique. I think it might be Huffman, or start with an 'L'.
Perhaps somebody else will chime in. :-)

Essentially, the algorithm creates a table of patterns, then
if the pattern is repeated, writes out the index of the
pattern.

This is the way /all/ lossless compression algorithms work!
The simplest of these is ROT-26, in which the index corresponding
to a pattern P is the value of P itself. In Huffman compression,
the size of the index value of P gets shorter as P gets more
common, and longer as P gets less common. In LZW compression,
the number of patterns "tracked" with index values changes dynamically
with the contents of the input data stream.

Remember: The OP's problem is /not/ lossless compression! His
output lets us reconstruct that the input contained, say, two
"AAA"s and one "AAB", but not in which order they occurred.

F-up-to: comp.programming

-Arthur

Nov 14 '05 #11

tigervamp

k_*****@yahoo.com (kal) wrote in message news:<a5*************************@posting.google.c om>...

ro**********@hotmail.com (tigervamp) wrote in message news:<c9**************************@posting.google. com>...
If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.

Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.

Shouldn't it be O(n/2) on average?

Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Not quite sure how you come up with this. The act of searching such a
list should be O(n), the time it takes to find an item is directly
propertional to the number of items in the list. Even if on average
you only search half the list before finding the target, you will
still spend twice as long looking on a list that's twice the size.
What am I missing here?

Rob Gamble

Nov 14 '05 #12

Keith Thompson

k_*****@yahoo.com (kal) writes:

ro**********@hotmail.com (tigervamp) wrote in message
news:<c9**************************@posting.google. com>...
If its taking you O(n^2) time to search a list of items,
there is something horribly wrong with your algorithm.
Could that "something" be that it works?
You can search an unsorted list of items by simply scanning
each item in the list until you find what you are looking
for as an O(n) operation.

Shouldn't it be O(n/2) on average?

n/2 *is* O(n); there's no difference between O(n/2) and O(n).
Sounds reasonable. But for a stream consisting of two or more
different words: another O((n-1)/2) for the previous word,
and O((n-2)/2), O((n-3)/2), O((n-4)/2), ...

Looks to be about O( (n(n-1)) / 4 ). But it is customary to
refer to this as O(n^2) for sufficiently large values of n.

Similarly, (n(n-1))) / 4 is O(n^2).

A Google search turns up <http://en.wikipedia.org/wiki/Big_O_notation>.

BTW, this really has nothing to do with C. You might try
comp.programming.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #13

by: Spammay Blockay | last post by:

I've been tasked with doing technical interviews at my company, and I have generally ask a range of OO, Java, and "good programming technique" concepts. However, one of my favorite exercises I...

Java

Interview question

by: Gopal Krish | last post by:

I was asked this question in an interview. How can you display the contents of an ASP page (from another web server) in a aspx page, ie, first half of the screen will display contents from asp...

ASP.NET

Download the JAVA , .NET and SQL Server interview question with answers

by: Jobs | last post by:

Download the JAVA , .NET and SQL Server interview sheet and rate yourself. This will help you judge yourself are you really worth of attending interviews. If you own a company best way to judge if...

Microsoft SQL Server

Download the JAVA , .NET and SQL Server interview with answers

by: Jobs | last post by:

Download the JAVA , .NET and SQL Server interview with answers Download the JAVA , .NET and SQL Server interview sheet and rate yourself. This will help you judge yourself are you really worth of...

C# / C Sharp

VIEW, RATE & SHARE Interview Questions and experiences

by: connectrajesh | last post by:

INTERVIEWINFO.NET http://www.interviewinfo.net FREE WEB SITE AND SERVICE FOR JOB SEEKERS /FRESH GRADUATES NO ADVERTISEMENT

C / C++

PLS HELP w/ Interview Question!!

by: Nobody | last post by:

I've been looking for a job for a while now, and have run into this interview question twice now... and have stupidly kind of blown it twice... (although I've gotten better)... time to finally...

C / C++

Full eBook with 4000 .NET 3.0 ,C#, Ajax,JAVA,.NET and SQL Server Interview questions

by: freepdfforjobs | last post by:

Full eBook with 4000 C#, JAVA,.NET and SQL Server Interview questions http://www.questpond.com/SampleInterviewQuestionBook.zip Download the JAVA , .NET and SQL Server interview sheet and rate...

Javascript

Ultimate list of Interview question website.....Do not miss it

by: freesoftwarepdfs | last post by:

Ultimate list of Interview question website.....Do not miss it http://www.questpond.com http://msdotnetsupport.blogspot.com/2007/01/net-interview-questions-by-dutt-part-2.html...

C# / C Sharp

.NET , SQL Server interview questions websites....

by: Free PDF | last post by:

..NET , SQL Server interview questions websites.... http://www.questpond.com http://www.geocities.com/dotnetinterviews/...

ASP.NET

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

An interview question.

Similar topics