IsAnagram - C / C++

Martin

Hi all!

I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

Here are the strategies I am thinking of :
- Brute-Force (One loop inside the other one) O(n^2)
- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words can
be the same with different letters.

So I am looking for the most efficient algorithm. Do you have an idea?
Thank you!

May 10 '07 #1

Subscribe Reply

3376

Peter Nilsson

Martin <martinga...@gmail.comwrote:

I am looking for the best algorithm to check if two words
are anagrams.

Do you have a question about the C language?

Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

I see you do, although you don't realise it. The signature in
C would be closer to...

bool IsAnagram(char wordOne[], char wordTwo[])

Note that bool is not standard in C90.

Here are the strategies I am thinking of :
- Brute-Force (One loop inside the other one) O(n^2)

How would this work in practice?

- Sort both words and compare both arrays. O(n^2)
considering that I use BubbleSort and words are both
short (less than 30 characters).

A simple binary tree would be quicker.

- Find a way to create a signature for each word
(e.g. A = 1, B = 2, etc.) However, I can have a problem
here... The sum for both words can be the same with
different letters.

You could use the first 26 prime numbers and a bignum
library...

sig = 1;
if (*p == 'a') mul(sig, 2);
if (*p == 'b') mul(sig, 3);
if (*p == 'c') mul(sig, 5);
...

So I am looking for the most efficient algorithm.
Do you have an idea?

Assume you're only considering 26 letters. Create an
array of counts for each letter. Do the same for both
words and compare the arrays. O(N).

If you have a problem implementing that in C, then
feel free to ask a C question. If you have problems
with the algorithm, please post in a more appropriate
newsgroup.

--
Peter

May 10 '07 #2

Jack Klein

On 9 May 2007 19:36:45 -0700, Martin <ma*********@gmail.comwrote in
comp.lang.c:

Hi all!

I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

If you're looking for an algorithm, you should ask in an algorithm
group. news:comp.programming comes to mind.

bool IsAnagram(char[] wordOne, char[] wordTwo)

Here are the strategies I am thinking of :
- Brute-Force (One loop inside the other one) O(n^2)
- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words can
be the same with different letters.

So I am looking for the most efficient algorithm. Do you have an idea?
Thank you!

There is nothing at all about C in your question. An algorithm is
independent of the programming language used to implement it.

Once you have decided on an algorithm, if you have difficulty writing
correct standard C code to implement it, post your problem code here
and ask for help.

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html

May 10 '07 #3

CBFalconer

Martin wrote:

>
I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

Here are the strategies I am thinking of :
- Brute-Force (One loop inside the other one) O(n^2)
- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words
can be the same with different letters.

So I am looking for the most efficient algorithm. Do you have an idea?

Sort both sounds like the most efficient. You should probably use
a simple sort such as shell sort, which is much faster than bubble,
and is better than O(N*N). If one argument comes from a table then
it can be stored in sorted order.

Bear in mind that you are not passing arrays, you are passing
pointers to the first item in the array. So you need to copy the
array first. Assuming they are strings (zero terminated) you will
find the lengths in the process.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>
<http://kadaitcha.cx/vista/dogsbreakfast/index.html>
cbfalconer at maineline dot net
--
Posted via a free Usenet account from http://www.teranews.com

May 10 '07 #4

Chris Dollin

Martin wrote:

I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

Assuming you're going to write this algorithm in C (why else post
/here/?), a couple of points to watch out for:

(a) C doesn't have a `bool` type by default. C90 -- the older but
widely supported standard -- doesn't have it at all; C99 --
the spanking-new [1] but spottily supported standard -- only
has it if you #include <stdbool.h>.

(b) It's `char wordOne[]`, not `char [] wordOne`; C is not Java.

(c) Actually `char wordOne[]` /as a function argument/ means
`char *wordOne`, ie a pointer to one (or more) character(s).
When you try to pass an array as an argument [2] it decays
into a pointer to its first element.

(d) It follows that once you're inside `isAnagram` [3] there's
no way to know /how long the words are/ without extra
information. If the words are C strings, you're OK, because
they will end with the nul character (the one with value 0).
If not, you'll have to smuggle the length in somehow -- an
extra argument or two will do. (Of course, if their lengths
are different, they're not anagrams.)

I have no advice on the algorithm, since that's not a C question;
comp.programming's perhaps the place for that.

[1] Hell's teeth, eight years already?

[2] Or use it in all but a couple of specialised contexts.

[3] I've gratuitously decapitalised your function name because
my Style Rules reserve initial caps for (some) #defines and
type names -- function names, not so. Your stylage may vary.

--
`x.sort = y.sort` Hedgehog
"It took a very long time, much longer than the most generous estimates."
- James White, /Sector General/

May 10 '07 #5

Richard Bos

CBFalconer <cb********@yahoo.comwrote:

Martin wrote:

I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words
can be the same with different letters.

Sort both sounds like the most efficient. You should probably use
a simple sort such as shell sort, which is much faster than bubble,
and is better than O(N*N).

For a max of 30 letters? Insertion or selection. At that size it doesn't
pay to be complicated.

Alternatively, do create a signature, with unsigned long longs, using
not addition but multiplication. If the signatures don't match, they're
not anagrams. Since this is the most usual case, you win a little. If
they _do_ match, fall back to the sorting solution. Yes, the signature
may wrap around for large words, but that's not going to matter because
you need the final check anyway, and all large anagrams are going to
wrap around to the same signature. You do need _unsigned_ integers for
this, mind you. And do check whether it's actually faster or not.

And yet another: if you're going to do great numbers of these checks,
consider using a database. For each word you're being asked to check,
check to see if it's already in the DB. If not, sort it _once_, then add
it to the DB - next time you don't need to sort, you can just fetch the
sorted version from the DB. If you hit two words you've already seen
before, all you need is two DB checks and a single compare. Drawback:
for this to save you anything, you really do need _lots_ of checks, and
a good database.

Richard

May 10 '07 #6

Joachim Schmitz

"Chris Dollin" <eh@electrichedgehog.netschrieb im Newsbeitrag
news:xi***************@fe2.news.blueyonder.co.uk.. .

Martin wrote:

>I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

Assuming you're going to write this algorithm in C (why else post
/here/?), a couple of points to watch out for:

(a) C doesn't have a `bool` type by default. C90 -- the older but
widely supported standard -- doesn't have it at all; C99 --
the spanking-new [1] but spottily supported standard -- only
has it if you #include <stdbool.h>.

(b) It's `char wordOne[]`, not `char [] wordOne`; C is not Java.

(c) Actually `char wordOne[]` /as a function argument/ means
`char *wordOne`, ie a pointer to one (or more) character(s).
When you try to pass an array as an argument [2] it decays
into a pointer to its first element.

(d) It follows that once you're inside `isAnagram` [3] there's
no way to know /how long the words are/ without extra
information. If the words are C strings, you're OK, because
they will end with the nul character (the one with value 0).
If not, you'll have to smuggle the length in somehow -- an
extra argument or two will do. (Of course, if their lengths
are different, they're not anagrams.)

I have no advice on the algorithm, since that's not a C question;
comp.programming's perhaps the place for that.

[1] Hell's teeth, eight years already?

[2] Or use it in all but a couple of specialised contexts.

[3] I've gratuitously decapitalised your function name because
my Style Rules reserve initial caps for (some) #defines and
type names -- function names, not so. Your stylage may vary.

wouldn't isAnagram invade the standard namespace?

Bye, Jojo

May 10 '07 #7

Chris Dollin

Joachim Schmitz wrote:

>
"Chris Dollin" <eh@electrichedgehog.netschrieb im Newsbeitrag
news:xi***************@fe2.news.blueyonder.co.uk.. .

(fx:snip lots of stuff not relevant to Joachim's response, hint)

>[3] I've gratuitously decapitalised your function name because
my Style Rules reserve initial caps for (some) #defines and
type names -- function names, not so. Your stylage may vary.

wouldn't isAnagram invade the standard namespace?

(fx:shock)
(fx:peek)

No, because `A` isn't a lower-case letter.

(fx:mops-brow)

--
"Who are you? What do you want?" /Babylon 5/

Hewlett-Packard Limited registered no:
registered office: Cain Road, Bracknell, Berks RG12 1HN 690597 England

May 10 '07 #8

Joachim Schmitz

"Chris Dollin" <ch**********@hp.comschrieb im Newsbeitrag
news:f1**********@murdoch.hpl.hp.com...

Joachim Schmitz wrote:

>>
"Chris Dollin" <eh@electrichedgehog.netschrieb im Newsbeitrag
news:xi***************@fe2.news.blueyonder.co.uk. ..

(fx:snip lots of stuff not relevant to Joachim's response, hint)

>>[3] I've gratuitously decapitalised your function name because
my Style Rules reserve initial caps for (some) #defines and
type names -- function names, not so. Your stylage may vary.

>wouldn't isAnagram invade the standard namespace?

(fx:shock)
(fx:peek)

No, because `A` isn't a lower-case letter.

Hmm, OK, I had that suspicion but was too lazy to check myself 8-)

Bye, Jojo

May 10 '07 #9

David Mathog

Richard Bos wrote:

CBFalconer <cb********@yahoo.comwrote:

>Martin wrote:
>>I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

>>- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words
can be the same with different letters.

>Sort both sounds like the most efficient. You should probably use
a simple sort such as shell sort, which is much faster than bubble,
and is better than O(N*N).

For a max of 30 letters? Insertion or selection. At that size it doesn't
pay to be complicated.

Alternatively, do create a signature, with unsigned long longs, using
not addition but multiplication. If the signatures don't match, they're
not anagrams. Since this is the most usual case, you win a little. If
they _do_ match, fall back to the sorting solution.

For languages with alphabets of <=32 characters (or 64 I guess)
there's a particularly efficient signature you could use.
Let each letter correspond to one bit position. For each
word, clear a 32 bit integer, XOR the appropriate bit each
time that the corresponding letter appears. Then compare the
integers. If the words are anagrams the two integers will be the same.
The opposite isn't true though, since strings like TAA and TBB yield the
same integer. So a second signature is helpful, which is just setting
the bit, which would produce a second pair of integers which would
distinguish between TAA and TBB. (The second pair can't distinguish
between TAA and TTA, but the first pair can. Together they still can't
distinguish between TAAA and TTTA and so forth, but with real words
it should do a pretty good job. There are other similar methods, for
instance adding each letter's set bit to the test integer, and allow an
extra bit space to expand into for the more common letters.)

This type of test can be done in O(N). As the previous poster noted, if
most of the tests are expected to fail, this will allow you to reject
most of the mismatches very quickly. Also of course the number of
letters in each word must be the same, which is easy enough to test at
the end of the loop where the bits are set - if speed is an issue
you definitely don't want to do this in your central loop:

if(strlen(string1) != strlen(string2)){
mismatch=1;
break;
}
/* now compare strings */

Regards,

David Mathog

May 10 '07 #10

Christopher Benson-Manica

CBFalconer <cb********@yahoo.comwrote:

Martin wrote:

I am looking for the best algorithm to check if two words are
anagrams.

Sort both sounds like the most efficient.

For sufficiently long strings, you can do better than two O(n log n)
sort operations plus n comparisons. Assuming the words are composed
of single-byte unsigned characters (signed chars would require a
little more work), you can use an array of counters of size 2^CHAR_BIT
and count the occurrence of each character in both strings. This
requires two O(n) traversals and counter increment/decrement
(increment when traversing the first string, decrement for the second)
and one O(n) (where n is 2^CHAR_BIT) traversal of the array of
counters; if all counters are 0, the anagram test passes.

--
C. Benson Manica | I *should* know what I'm talking about - if I
cbmanica(at)gmail.com | don't, I need to know. Flames welcome.

May 10 '07 #11

user923005

On May 9, 7:36 pm, Martin <martinga...@gmail.comwrote:

Hi all!

I am looking for the best algorithm to check if two words are
anagrams. Here the signature of the method :

bool IsAnagram(char[] wordOne, char[] wordTwo)

Here are the strategies I am thinking of :
- Brute-Force (One loop inside the other one) O(n^2)
- Sort both words and compare both arrays. O(n^2) considering that I
use BubbleSort and words are both short (less than 30 characters).
- Find a way to create a signature for each word (e.g. A = 1, B = 2,
etc.) However, I can have a problem here... The sum for both words can
be the same with different letters.

So I am looking for the most efficient algorithm. Do you have an idea?
Thank you!

1. change both strings to the same case
2. remove all whitespace and punctuation from both words
3. count the characters into two arrays of UCHAR_MAX dimension by
character.
4. compare the bin lists. If the bins are identical, then anagrams.
If not, then not.
It's O(N), where N is the length of the string.
Followups set to news:comp.programming

May 10 '07 #12

Peter Nilsson

Chris Dollin <chris.dol...@hp.comwrote:

Joachim Schmitz wrote:
wouldn't isAnagram invade the standard namespace?

(fx:shock)
(fx:peek)

No, because `A` isn't a lower-case letter.

(fx:mops-brow)

But C90 needn't be case sensitive with respect to linkage of
external identifiers. I've yet to see one that wasn't, but C90
still allows such implementations.

Try is_anagram.

--
Peter

May 10 '07 #13

Chris Dollin

Peter Nilsson wrote:

Chris Dollin <chris.dol...@hp.comwrote:
>Joachim Schmitz wrote:
wouldn't isAnagram invade the standard namespace?

(fx:shock)
(fx:peek)

No, because `A` isn't a lower-case letter.

(fx:mops-brow)

But C90 needn't be case sensitive with respect to linkage of
external identifiers. I've yet to see one that wasn't, but C90
still allows such implementations.

Heavens, you're right. Also there's the dread 6-character limit.

Bummer.

Well ... hm. I wouldn't use such an implementation unless there
were a gun-like to my head-like.

Try is_anagram.

I'm leaning towards `iz` ... somebody stop me.

--
I have an apt signature for that, but this isn't it.

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

May 11 '07 #14

Richard Tobin

In article <f2**********@murdoch.hpl.hp.com>,
Chris Dollin <ch**********@hp.comwrote:

>Try is_anagram.

>I'm leaning towards `iz` ... somebody stop me.

You could use the Lisp convention: anagramp.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

May 11 '07 #15

Chris Dollin

Richard Tobin wrote:

In article <f2**********@murdoch.hpl.hp.com>,
Chris Dollin <ch**********@hp.comwrote:

>>Try is_anagram.

>>I'm leaning towards `iz` ... somebody stop me.

You could use the Lisp convention: anagramp.

A convention I've disliked from my first encounter with it, alas.

I can't help wondering what the prefix "ana" means.

Oh well, so long as Gremper doesn't come for me ...

--
"We live for the One, you die for the One." Unsaid /Babylon 5/

Hewlett-Packard Limited registered no:
registered office: Cain Road, Bracknell, Berks RG12 1HN 690597 England

May 11 '07 #16

Richard Heathfield

Chris Dollin said:

Richard Tobin wrote:

>In article <f2**********@murdoch.hpl.hp.com>,
Chris Dollin <ch**********@hp.comwrote:

>>>Try is_anagram.

>>>I'm leaning towards `iz` ... somebody stop me.

You could use the Lisp convention: anagramp.

A convention I've disliked from my first encounter with it, alas.

I can't help wondering what the prefix "ana" means.

"back", "upward", or "again".

And the suffix "gram" means "writing" or "record".

So "anagram" presumably means "hold your book up".

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at the above domain, - www.

May 11 '07 #17

Chris Dollin

Richard Heathfield wrote:

Chris Dollin said:

>Richard Tobin wrote:

>>In article <f2**********@murdoch.hpl.hp.com>,
Chris Dollin <ch**********@hp.comwrote:

Try is_anagram.

I'm leaning towards `iz` ... somebody stop me.

You could use the Lisp convention: anagramp.

A convention I've disliked from my first encounter with it, alas.

I can't help wondering what the prefix "ana" means.

"back", "upward", or "again".

And the suffix "gram" means "writing" or "record".

For "gramp" I was thinking of "grandfather" -- "gramps" (too much
American SF). Upward? "anagramp": levitating grandfather.

So "anagram" presumably means "hold your book up".

If "gramp" could mean "head", that would be shiny.

Look! Over there! Something topical! Quick, run!

--
"Never ask that question!" Ambassador Kosh, /Babylon 5/

Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

May 11 '07 #18

Christopher Benson-Manica

Chris Dollin <ch**********@hp.comwrote:

Peter Nilsson wrote:

Try is_anagram.

I'm leaning towards `iz` ... somebody stop me.

If you're willing to discard the idea of grammar, bAnagram should do
the trick. bAnagram or !bAnagram, that is the question :-)

--
C. Benson Manica | I *should* know what I'm talking about - if I
cbmanica(at)gmail.com | don't, I need to know. Flames welcome.

May 11 '07 #19