On Fri, 03 Oct 2008 16:48:27 -0700, Fernando A. Gómez F.
<fe************ ****@gmail.comw rote:
[...]
>Note that I wasn't kidding when I said it's not efficient code. The
application I needed it for involved very short strings and brute force
worked great. I haven't looked at the Levenshtein article yet, but I
suspect that the algorithm described there is similar to what I did,
but with some actual thought and cleverness applied to it so that it
has a reasonable computational cost. :)
I see. However perhaps it could be improved. It would be interesting, at
least, to see the approach you used, so if you have the time and mood,
I'd be most grateful if you could post such code.
Here you go (see below). Note that in my original problem, I was
comparing two complete strings and producing a metric that indicated how
similar the strings were, rather than searching for one string in
another. But you can easily modify this approach by using the same basic
technique, but only looking at substrings of the string being searched.
Though, actually...now that I look at this implementation again, I think
you could probably feed it a string to be searched and a string to look
for, and it would work okay. It could even be modified to return the
character index of the starting point of the best match for the string
being looked for. You'd want to limit how far it looks ahead, maybe
simply by delimiting attempted matches, or otherwise take into account
gaps within the matched characters, because otherwise it will consider
itself to have successfully matched the string even if each character it
found was actually just located in a different word. But other than that,
it's probably pretty close to what could work.
The basic idea below is to start out doing a basic string compare, but
when reaching a character that doesn't match, scan forward in each string
to try to match the current character in the other string (for two scans
each mis-match). If a matching character can be found, continue trying to
match characters again, looking for the point at which a comparison of the
remaining characters returns the highest "score". In this context, the
score is the number of characters that the algorithm could match before
running out of characters in one of the strings.
You'll note that the algorithm basically enumerates all of the possible
ways that the strings could match. There's essentially a binary search
tree that bifurcates each time there's a character mis-match, which means
that the cost of the algorithm has a worst-case cost that is exponential
by the length of the longest search. In practice, it will hit the
worst-case scenario only rarely, but it's at least worth understanding,
especially since you don't have to get very far into the algorithm for
even an average-case scenario to be relatively expensive as compared to
some other algorithm that might be designed more optimally.
Like I said, I was dealing with such short strings, I made no attempt at
all to be clever. It worked very well for my needs at the time, but I
make no promises about anyone else's. :)
Pete
static int ScoreStrings(st ring str1, string str2)
{
int ich = 0, cchMax = Math.Min(str1.L ength, str2.Length);
int scoreMax;
// Scan the strings, stopping at the first difference
while (ich < cchMax && str1[ich] == str2[ich])
{
ich++;
}
// If we've exhausted one of the strings, that's the best we're
// going to do
if (ich == cchMax)
{
return ich;
}
// For each string, scan forward looking for the current
character
// in the other. Check each possible match, and use the best
score
// we find.
scoreMax = Math.Max(ich,
Math.Max(ScoreS tringsScan(str1 .Substring(ich) , str2, ich),
ScoreStringsSca n(str2.Substrin g(ich), str1, ich)));
return scoreMax;
}
static int ScoreStringsSca n(string str1, string str2, int ichStart)
{
int scoreMax = 0;
for (int ichScan = ichStart; ichScan < str2.Length; ichScan++)
{
if (str1[0] == str2[ichScan])
{
int score = ScoreStrings(st r1,
str2.Substring( ichScan));
if (score scoreMax)
{
scoreMax = score;
}
}
}
return scoreMax;
}