Pattern Matching in C/C++

I've learned a programming language in C/C++. I want to make a program which try to compare two arrays (random chars)and notify the user about the differrents. The program have to notify both the subtitution and insertion/deletion.

Here are the examples :

array 1 : xxxxxxxxx...(as standard)
array 2 : xxxyxxxxx...

the program will notify that,
"there are subtitution in 4th"

and

array 1 : xxxxxxx..
array 2 : xxyyxxxxx..

"there are insertion yy at 3rd"

I've designed the algorithm with conditional if, but they are still many bugs when i'm used on variant data (array2)

Apr 20 '10 #1

Subscribe Post Reply

5319

Dheeraj Joshi

1,123

Expert 1GB

What errors you are getting? What does the code look like. Please post the appropriate code snippet along with code tags.

Regards
Dheeraj Joshi

Apr 20 '10 #2

abz89

i don't know the error mechanism,
this algorithm results the wrong analyze output..

here is the code (in C#) but i want to porting it to C++

Expand|Select|Wrap|Line Numbers

 /* here is the fragment of my program (abz)

the program kindly array comparer
 
*/

public void algoritma(string dt, string ds, ref string dm, int n, int m, int array2_counts, int array1_counts, string filename)

        {

            // algorithm attributtes

            int different_counts = 0;

            int deletion_counts = 0;

            int insertion_counts = 0;

            array2_counts = ds.Length;

            dm += "array filename: " + filename + "\r\n" + "total char: " + array2_counts + " \r\n ";

            ds += "XXXXXXXXXXXX";

            dt += "XXXXXXXXXXXX";

            array1_counts = dt.Length;

            array2_counts = ds.Length;

            this.progressBar1.Maximum = array1_counts;

            if (this.Button1.Clicked)

            {

                #conditional region

                try

                {

while (n < array2_counts - 12 && m < array1_counts)

                    {

                        if (ds[n] != dt[m])

                        {

                            if (n + 5 < array2_counts && m + 6 < array1_counts)

                            {

                                if ((ds[n] == dt[m + 1]) && (ds[n + 1] == dt[m + 2]) && (ds[n + 2] == dt[m + 3]) && (ds[n + 3] == dt[m + 4]) && (ds[n + 4] == dt[m + 5]) && (ds[n + 5] == dt[m + 6]))

                                {

                                    dm += "there are deletion at, " + dt[m + 1] + " " + (m + 1) + " D";

                                    dm += "\r\n";

                                    m++;

                                    different_counts++;

                                    deletion_counts++;

                                }

                                else if (n + 5 < array2_counts && m + 5 < array1_counts)

                                {

                                    if ((ds[n + 1] == dt[m + 1]) && (ds[n + 2] == dt[m + 2]) && (ds[n + 3] == dt[m + 3]) && (ds[n + 4] == dt[m + 4]) && (ds[n + 5] == dt[m + 5]))

                                    {

                                        dm += "there are substitution at, " + dt[m] + " " + (m + 1) + " " + ds[n];

                                        dm += "\r\n";

                                        different_counts++;

                                    }

                                    else if (n + 5 < array2_counts && m + 5 < array1_counts)

                                    {

                                        if ((ds[n + 1] == dt[m + 1]) && (ds[n + 2] != dt[m + 2]) && (ds[n + 3] == dt[m + 3]) && (ds[n + 4] == dt[m + 4]) && (ds[n + 5] == dt[m + 5]))

                                        {

                                            dm += "there are substitution at, " + dt[m] + " " + (m + 1) + " " + ds[n];

                                            dm += "\r\n";

                                            different_counts++;

                                        }

                                        else if (n + 8 < array2_counts && m + 8 < array1_counts)

                                        {

                                            if ((ds[n + 2] == dt[m + 2]) && (ds[n + 3] == dt[m + 3]) && (ds[n + 4] == dt[m + 4]) && (ds[n + 5] == dt[m + 5]) && (ds[n + 6] == dt[m + 6]) && (ds[n + 7] != dt[m + 7]) && (ds[n + 8] == dt[m + 8]))

                                            {

                                                dm += "there are substitution at, " + dt[m + 1] + " " + (m + 1) + " " + ds[n];

                                                dm += "\r\n";
 
                                                dm += "ada substitusii di posisi, " + dt[m + 1] + " " + (m + 2) + " " + ds[n + 1];

                                                dm += "\r\n";

                                                m++;

                                                n++;

                                                different_counts = different_counts + 2;

                                            }

                                            else if (n + 4 < array2_counts && m + 6 < array1_counts)

                                            {

                                                if ((ds[n] == dt[m + 2]) && (ds[n + 1] == dt[m + 3]) && (ds[n + 2] == dt[m + 4]) && (ds[n + 3] == dt[m + 5]) && (ds[n + 4] == dt[m + 6]))

                                                {

                                                    dm += "there are deletion at, " + dt[m + 1] + " " + (m + 1) + "D";

                                                    dm += "\r\n";

                                                    dm += "there are deletion at, " + dt[m + 2] + " " + (m + 2) + "D";

                                                    dm += "\r\n";

                                                    m = m + 2;

                                                    different_counts = different_counts + 2;

                                                    deletion_counts = deletion_counts + 2;

                                                }

                                                else if (n + 4 < array2_counts && m + 13 < array1_counts)

                                                {

                                                    if ((ds[n] == dt[m + 9]) && (ds[n + 1] == dt[m + 10]) && (ds[n + 2] == dt[m + 11]) && (ds[n + 3] == dt[m + 12]) && (ds[n + 4] == dt[m + 13]))

                                                    {

                                                        dm += "there are deletion at, " + dt[m + 1] + " " + (m + 1) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 2] + " " + (m + 2) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 3] + " " + (m + 3) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 4] + " " + (m + 4) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 5] + " " + (m + 5) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 6] + " " + (m + 6) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 7] + " " + (m + 7) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 8] + " " + (m + 8) + " D" + "\r\n";

                                                        dm += "there are deletion at, " + dt[m + 9] + " " + (m + 9) + " D" + "\r\n";

                                                        m = m + 9;

                                                        different_counts = different_counts + 9;

                                                        deletion_counts = deletion_counts + 9;

                                                    }

                                                    else if (n + 5 < array2_counts && m + 4 < array1_counts)

                                                    {

                                                        if ((ds[n + 1] == dt[m]) && (ds[n + 2] == dt[m + 1]) && (ds[n + 3] == dt[m + 2]) && (ds[n + 4] == dt[m + 3]) && (ds[n + 5] == dt[m + 4]))

                                                        {

                                                            dm += "there are insertion at, " + "I " + m + ".1" + " " + ds[n + 1] + "\r\n";

                                                            m = m - 1;

                                                            different_counts++;

                                                            insertion_counts++;

                                                        }

                                                        else if (n + 7 < array2_counts && m + 5 < array1_counts)

                                                        {

                                                            if ((ds[n + 2] == dt[m]) && (ds[n + 3] == dt[m + 1]) && (ds[n + 4] == dt[m + 2]) && (ds[n + 5] == dt[m + 3]) && (ds[n + 6] == dt[m + 4]) && (ds[n + 7] == dt[m + 5]))

                                                            {

                                                                dm += "there are insertion at, " + "I " + m + ".2" + " " + ds[n + 2] + "\r\n";

                                                                m = m - 1;

                                                                n++;
 
                                                                different_counts = different_counts + 2;

                                                                insertion_counts = insertion_counts + 2;

                                                            }

                                                            else if (n + 8 < array2_counts && m + 4 < array1_counts)

                                                            {

                                                                if ((ds[n + 4] == dt[m]) && (ds[n + 5] == dt[m + 1]) && (ds[n + 6] == dt[m + 2]) && (ds[n + 7] == dt[m + 3]) && (ds[n + 8] == dt[m + 4]))

                                                                {

                                                                    dm += "there are insertion at, " + "I " + m + ".4" + " " + ds[n + 4] + "\r\n";

                                                                    m = m - 1;

                                                                    n = n + 3;

                                                                    different_counts = different_counts + 4;

                                                                    insertion_counts = insertion_counts + 4;

                                                                }

                                                                else if (n + 13 < array2_counts && m + 7 < array1_counts)

                                                                {

                                                                    if ((ds[n + 6] == dt[m]) && (ds[n + 7] == dt[m + 1]) && (ds[n + 8] == dt[m + 2]) && (ds[n + 9] == dt[m + 3]) && (ds[n + 10] == dt[m + 4]) && (ds[n + 11] == dt[m + 5]) && (ds[n + 12] == dt[m + 6]) && (ds[n + 13] == dt[m + 7]))

                                                                    {

                                                                        dm += "there are insertion at, " + "I " + m + ".6" + " " + ds[n + 6] + "\r\n";

                                                                        m = m - 1;

                                                                        n = n + 5;

                                                                        different_counts = different_counts + 6;

                                                                        insertion_counts = insertion_counts + 6;

                                                                    }

                                                                    else if (n + 13 < array2_counts && m + 7 < array1_counts)

                                                                    {

                                                                        if ((ds[n + 6] == dt[m]) && (ds[n + 7] == dt[m + 1]) && (ds[n + 8] == dt[m + 2]) && (ds[n + 9] == dt[m + 3]) && (ds[n + 10] == dt[m + 4]) && (ds[n + 11] == dt[m + 5]) && (ds[n + 12] == dt[m + 6]) && (ds[n + 13] == dt[m + 7]))

                                                                        {

                                                                            dm += "there are insertion at, " + "I " + m + ".6" + " " + ds[n + 6] + "\r\n";

                                                                            m = m - 1;

                                                                            n = n + 5;

                                                                            different_counts = different_counts + 6;

                                                                            insertion_counts = insertion_counts + 6;

                                                                        }

                                                                        else if (n + 2 < array2_counts && m + 1 < array1_counts)

                                                                        {

                                                                            if ((ds[n + 1] == dt[m]) && (ds[n + 2] == dt[m + 1]))

                                                                            {

                                                                                dm += "there are insertion at, " + "I" + m + ".1" + " " + ds[n + 1] + "\r\n";

                                                                                m = m - 1;

                                                                                different_counts++;

                                                                                insertion_counts++;

                                                                            }

                                                                            else if (n + 1 < array2_counts && m < array1_counts)

                                                                            {

                                                                                if ((ds[n + 1] == dt[m]))

                                                                                {

                                                                                    dm += "there are insertion at, " + "I " + m + ".1" + " " + ds[n + 1] + "\r\n";

                                                                                    m = m - 1;

                                                                                    different_counts++;

                                                                                    insertion_counts++;

                                                                                }

                                                                                else if (n + 1 < array2_counts && m + 1 < array1_counts)

                                                                                {

                                                                                    if (ds[n + 1] == dt[m + 1])

                                                                                    {
 
                                                                                        dm += "ada substitusi diposisi, " + dt[m] + " " + (m + 1) + " " + ds[n] + "\r\n";

                                                                                        different_counts++;

                                                                                    }

                                                                                    else

                                                                                    {

                                                                                        dm += "there are substitution at, " + dt[m] + " " + (m + 1) + " " + ds[n] + "\r\n";

                                                                                        different_counts++;

                                                                                    }

                                                                                }

                                                                            }

                                                                        }

                                                                    }

                                                                }

                                                            }

                                                        }

                                                    }

                                                }

                                            }

                                        }

                                    }

                                }

                            }

                        }

                        n++;

                        m++;

Apr 20 '10 #3

sridhard2406

if you know the total index of two arrays, you can compare the each and every index and print the output, which one is getting differed until end of the index.

Apr 20 '10 #4

Dheeraj Joshi

1,123

Expert 1GB

Find out the size of 2 arrays.
If sizes are not same no need to do further calculation. You can exit.
Is sizes are same, traverse through each element of array and compare them. If character at position "n" are not same exit. You can also count the position of difference(which you can calculate using loop index).

Regards
Dheeraj Joshi

Apr 20 '10 #5

abz89

yupz, my algorithm do..
it compare the array length to find out the insertion/deletion

after the array length are the same, continue to substitution compare..
but my algorithm have some error on some data..

can somebody suggest the effective algorithm?? please
thank you.. :)

Apr 20 '10 #6

Dheeraj Joshi

1,123

Expert 1GB

Effective in time and space?
Some algorithm experts may help you.
If the input set is small, almost all algorithms works ok. But as the input set grows the complexity changes.

Regards
Dheeraj Joshi

Apr 20 '10 #7

jkmyoung

2,057

Expert 2GB

By any chance do you have the code in pseudocode? You've haven't really commented your code so it's hard to see what's going on.

Also, let's say you have
bnanan
banana

Is this considered an insertion of an a, then a deletion of a n,
OR a deletion of an n, then an insertion of an a?

Apr 20 '10 #8

abz89

@dheerajjoshim
hmm, my algo is just conditional if, which prior to insertion/deletion then substitution..

@jkmyoung
owh i cannot commented the code well, my english was bad, so i can't descripting it fine :(

at your example, it will insertion of an a, then a deletion of a n
because the array 1 is a standard which to be fixed (may not changed or reassembly)

thx a lot everybody :)

Apr 20 '10 #9

jkmyoung

2,057

Expert 2GB

Ok, so you prefer insertion over deletion.

Where are you setting n and m? If not set they probably have the old values in them. Actually looking at the current algorithm, it is painful.

offhand pseudo algorithm: bound checking excluded.

Expand|Select|Wrap|Line Numbers

 Comparing ds to dt

m, n set to 0

mmax = ds.length - 1

nmax = dt.length - 1

while (m <= mmax and n <= nmax)

  if ds[m] = dt[n], m++, n++, continue

  //did not match. 

  invalidChars = 1

  // try pure insertion first

  for (insert = invalidChars; insert >= 0; insert --){

    if dt[n+insert] = ds[m+invalidChars - insert]{

      ***print off list of inserted and invalid characters

      n += 1 + insert

      m += 1 + invalidChars - insert

      break;

      }

  }

To explain this: for (insert = invalidChars; insert >= 0; insert --){
Lets say you have
abcde
fcghi

We list the comparisons in order below:
compare a -> f, no
invalidChars = 1, insert = 1
a -> c? no, insert = 0
b -> f? no.
invalidChars = 2, insert = 2
a -> g? no. insert = 1
b-> c? no. insert = 0
c -> f? no.
invalidChars = 3, insert = 3
a -> h? no. insert = 2
b-> g? no. insert = 1
c -> c? MATCH!
So,
a deleted
b deleted
f inserted.
===
This biggest problem with this is that it is iterative. Eg if you have
abcbde
acbde
the algorithm will match the wrong b's and say:
c inserted at 2
c deleted at 3
b deleted at 4.
As opposed to the simple:
b deleted at 2

If you want to avoid that, you need to implement saving states and backtracking with some sort of recursion.

Apr 20 '10 #10

abz89

my mean is not prefer insertion over deletion..
at your example, if i prefer deletion over insertion, the array1 lenght will changed :)

the algo have not changed the array1 but array2, cause array1 act as standard, so the array2 have to matched with array1

at my algo, i make the 'fragmen' to compare
cause the real array can up to thousands,

this algo will create a fragments with 13char (array) for each
to make possible char matching process go to the end

here are the links, my algo with a little comment

http://pastebin.com/chvTQq0w
sorry cause it too ugly to post here
my english was bad too :(

Apr 21 '10 #11

abz89

@jkmyoung
no not like this dude..
insertion is the insertion data to array1 so the array1 will changed

eg:

array1 : abcde < 5 char
array2 : abcfde < 6 char

for deletion:

array1 : abcde < 5 char
array2 : abde < 4 char, d deleted

so substitution will executed after the lenght of both array are same..

here is a insertion and subt case:

array1 : abcdef < 6 char
array2 : abcghef < 7 char

at array2 : g will deleted to reduce the array length

array1 : abcdef
array2 : abchef

the insertion/deletion have higher prior than subs, so after lenght data matched the subs comparison will executed..

thx a lot :D

Apr 21 '10 #12

jkmyoung

2,057

Expert 2GB

I know what you mean, but what I mean is coding up the algorithm smartly will not be so simple: Take the core of this example again:
bcb
cb
To the human eye, we can easily see that we just need to delete b, to get the string. To the computer's eye, it's not that easy. It only looks at what we tell it, so it might match the wrong b's first. If we tell it to look for deletion first, we get the reverse problem. eg:
cb
bcb
Again the wrong b's are matched up.
===
To do this smartly, you need to save the results of multiple methods, recursively.
public void algorithma(string dt, string ds, ref string dm, int n, int m, int array2_counts, int array1_counts, string filename)
Is this the only method declaration you are allowed to use?

The algorithm would be more like:

Expand|Select|Wrap|Line Numbers

 
Algorithm(..arguments...){

  while matching, increment m and n

    once we get to our first mismatch, break;
 
  If we reached the end of one of the strings, finish results and return them.
 
  Otherwise:

    Recursively find all possible solutions from here on.

    Pick the best solution.

  return the best solution

}

In order to not go through every case, have a variable, which holds the best results so far, passed in to the function. Once we reach that number, we end output and return null.

Eg a saved state might look something like

Expand|Select|Wrap|Line Numbers

 
 class State

    {

        String mismatchOutputString;

        int errors;
 
        public State()        {

            errors = 0;

            mismatchOutputString = "";

        }
 
        public void Insert(int i, char c)

        {

            mismatchOutputString += "inserted at " + i + ":" + c + "\n";

            errors++;

        }

}

Then when you recurse, you check the number of errors. If worse than your best case so far, throw it away.

Apr 21 '10 #13

abz89

thx a lot to the responses

i've found the algorithm!
what i need is just "levensthein distance"

:D

now i'm trying on it..
hope i can :p

Apr 23 '10 #14

jkmyoung

2,057

Expert 2GB

interesting! Note, that the algorithm uses replacement as a single operation, but you can easily modify it.

Apr 23 '10 #15

abz89

owh my...
i been worked at levenshtein algo in weeks..
and i got nothing! T.T

yesterday i've found what i'm exactly want!
Needleman–Wunsch algorithm

i been trying to see the naligner source code too, but i don't understand, how stupid i am T.T

anyone can help me to write this algo in c#, so i can insert the conditional if when insertion/deletion/subs happened,...please, thx u

Apr 25 '10 #16

jkmyoung

2,057

Expert 2GB

That algorithm is much more complicated. Even if you have a match, the best thing to do might not be to match them!

At each stage, you calculate 3 possible values:
a. use characters from both strings
b. use 1 char from string 1. (gap penalty)
c. use 1 char from string 2. (gap penalty)

Unfortunately, because of the nature of the algorithm, the truncation detection is much more difficult.

A bad approximation for maximum possible value of 2 remaining strings is
Take the maximum value in that array. max1Value
Assuming length(m) <= length (n), the max value is
max1Value * length(m) - gapPenalty * (length (n - m))
(if n is short, just swap m and n)
After calculating the first possible value, you use that as a best value so far, and truncate branches if they don't fit.

If you want more help, please show how you're laying out this matrix.

Apr 28 '10 #17

Pattern Matching in C/C++

Similar topics