You've gotten a couple of really useful replies, IMHO, regarding
optimizing via a cache, as you requested. That said, it seems to me
that unless you have a very specific need to reuse the enumeration in
completely different places in the code, it might be better to just
have the _client_ of the enumerator cache the results as needed.
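For what it's worth, here's a rough sketch of what I mean by having the
client cache the results. The GetValues() here is just a stand-in for
your own method, not your actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Stand-in for your GetValues(); any lazily-evaluated
    // IEnumerable<T> behaves the same way for this purpose.
    static IEnumerable<KeyValuePair<int, string>> GetValues()
    {
        yield return new KeyValuePair<int, string>(2, "foo");
        yield return new KeyValuePair<int, string>(1, "bar");
    }

    static void Main()
    {
        // The client materializes the enumeration exactly once...
        List<KeyValuePair<int, string>> cached = GetValues().ToList();

        // ...and every later use walks the list; the original
        // enumerator (and thus the search) never runs again.
        int total = cached.Sum(kvp => kvp.Key);
        Console.WriteLine(total);
    }
}
```

The nice thing about this is that the enumerator class stays simple,
and only the clients that actually need to re-enumerate pay the cost
of the copy.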
But even so, I think the basic question has been answered pretty well.
So here are a couple of comments that digress a bit, but which IMHO
should still be useful...
First, the specific code you posted:
[...]
public IEnumerable<KeyValuePair<int, string>> GetValues()
{
    foreach (string someThingToFind in _list)
    {
        string savData = _data;
        int cpt = 0;
        int pos = savData.IndexOf(someThingToFind,
            StringComparison.InvariantCultureIgnoreCase);

        while (savData.Length > 0 && pos >= 0)
        {
            cpt++;
            savData = savData.Substring(pos + someThingToFind.Length);
            pos = savData.IndexOf(someThingToFind,
                StringComparison.InvariantCultureIgnoreCase);
        }

        yield return new KeyValuePair<int, string>(cpt, someThingToFind);
    }
}
I may be missing something, but the above code appears to me to yield
only a single result for each word in your search list.
This may be your intended semantics, but if so it's my opinion that
returning a KeyValuePair isn't appropriate.
Actually, I don't think returning a KeyValuePair is appropriate in any
case. The reason why is that semantically, a KeyValuePair implies a key
that is unique and a value that is associated with that unique key. In
this case, your "key" isn't unique at all as near as I can tell; it's
simply the number of times a specific word showed up in your string.
I understand the temptation to use the KeyValuePair since it _seems_ to
encapsulate the very information you're returning (a pair of related
values). But IMHO it would be much better for you to just define your
own generic struct that does basically the same thing.
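Just as a sketch of what I mean -- the name WordCount is mine, not
anything standard, and you'd obviously pick whatever members suit your
needs:

```csharp
using System;

// A purpose-built pair: no implied key/value semantics, and the
// member names say exactly what each part of the pair is.
public struct WordCount
{
    public readonly string Word;
    public readonly int Count;

    public WordCount(string word, int count)
    {
        Word = word;
        Count = count;
    }

    public override string ToString()
    {
        return Word + ": " + Count;
    }
}

class Demo
{
    static void Main()
    {
        WordCount wc = new WordCount("apple", 3);
        Console.WriteLine(wc);
    }
}
```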
Alternatively, at the very least it's my opinion that the key in the
KeyValuePair should be the string and the count be the value. It seems
much more likely to me that your string is the unique element, rather
than the count.
Why would you care? Well, IMHO the biggest risk is that a KeyValuePair
could be used to construct a Dictionary or similar object directly, and
this would fail as soon as the key -- that is, the integer count --
showed up more than once.
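Just to make that risk concrete, here's a contrived sketch (the data
is made up): two different words that happen to occur the same number
of times produce duplicate "keys", and a ToDictionary() call on the
results blows up:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        // Two different words, same count -- duplicate "keys".
        var results = new[]
        {
            new KeyValuePair<int, string>(2, "foo"),
            new KeyValuePair<int, string>(2, "bar"),
        };

        try
        {
            // Throws ArgumentException: the key 2 is added twice.
            var dict = results.ToDictionary(kvp => kvp.Key,
                                            kvp => kvp.Value);
        }
        catch (ArgumentException)
        {
            Console.WriteLine("duplicate key");
        }
    }
}
```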
I think there are other semantic reasons to fix this part of the code,
but the fact that there's a genuine potential risk for future uses of
the code is probably the only reason one would need to justify fixing it.
My second comment has to do with the "optimization" aspect itself.
IMHO, caching the results is a "sort of" optimization. That is, yes it
has the potential for improving performance, but it doesn't really
change the efficiency of the algorithm.
I note that this algorithm searches the string once for each word in
your list of words. For a very small number of words, this might not be
too bad, but as the number of words goes up, it can dramatically inflate
the cost of the algorithm.
The idea of caching the results can mitigate that problem somewhat, but
given that the algorithm is basically O(N^2), if you can change the
algorithm to something more efficient, you could wind up getting a HUGE
performance increase even without caching.
It depends somewhat on the number of words, the length of the string,
and the number of times you'd enumerate the results. But assuming that
the number of words is more than the number of times you'd enumerate the
results, an O(N) enumeration executed multiple times will still be
faster than an O(N^2) enumeration executed once and cached for future use.
The exact break-even point would depend a little on the specific
costs and on how many times you need to do the enumeration for a
given list of words. But it seems to me that in any situation where
you are actually concerned about performance, the number of words,
length of string, and/or number of enumerations is relatively high,
and these are exactly the situations where the disparity between an
O(N) algorithm and an O(N^2) algorithm can be very significant.
Of course, if you cache the results _and_ fix the algorithm to be better
than O(N^2), you stand to gain even more.
So, what's an O(N) algorithm to accomplish the same thing? A common one
used for string searches such as yours would be a state machine
implementation, where the nodes of the state graph are connected
according to the letters in the search words.
You traverse the graph by extracting a single character at a time from
the string being searched and using that character to determine which
node to move to; if the character has no matching link to a subsequent
node, you return to the root of the graph. The graph will have
"terminal" nodes representing the end of a search word; if and when you
reach one of those, you know you've found a word.
The exact implementation depends on your needs. The basic
implementation suffices in many cases, but if it is possible for your
search words to overlap, that can be addressed as well.
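For illustration, here's a rough sketch of the basic version I
described. Note the simplification: it restarts at the root after a
mismatch or a hit, so it won't catch overlapping or prefix-nested
search words -- the "failure" links of the Aho-Corasick algorithm are
the standard refinement that addresses that:

```csharp
using System;
using System.Collections.Generic;

class Node
{
    public Dictionary<char, Node> Next = new Dictionary<char, Node>();
    public string Word;  // non-null marks a "terminal" node
}

class Demo
{
    // Build the state graph: one path from the root per search word,
    // with shared prefixes sharing nodes.
    static Node BuildGraph(IEnumerable<string> words)
    {
        Node root = new Node();
        foreach (string w in words)
        {
            Node cur = root;
            foreach (char c in w)
            {
                Node next;
                if (!cur.Next.TryGetValue(c, out next))
                {
                    next = new Node();
                    cur.Next[c] = next;
                }
                cur = next;
            }
            cur.Word = w;
        }
        return root;
    }

    // Single pass over the text: each character either follows a link,
    // restarts a match from the root, or drops back to the root.
    static Dictionary<string, int> CountAll(string text, Node root)
    {
        var counts = new Dictionary<string, int>();
        Node cur = root;
        foreach (char c in text)
        {
            Node next;
            if (cur.Next.TryGetValue(c, out next))
                cur = next;
            else if (root.Next.TryGetValue(c, out next))
                cur = next;           // retry this character at the root
            else
                cur = root;

            if (cur.Word != null)
            {
                int n;
                counts.TryGetValue(cur.Word, out n);
                counts[cur.Word] = n + 1;
                cur = root;           // basic version: restart after a hit
            }
        }
        return counts;
    }

    static void Main()
    {
        Node root = BuildGraph(new[] { "cat", "dog" });
        Dictionary<string, int> counts = CountAll("catdogcatcow", root);
        Console.WriteLine(counts["cat"] + " " + counts["dog"]);
    }
}
```

The point to notice is that CountAll() looks at each character of the
text exactly once, no matter how many words are in the list, whereas
your current code scans the whole string once per word.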
Using a state graph requires a small amount of initialization to take
place, and for relatively small tasks this initialization could cost
more than the overhead of a less-efficient algorithm. But in those
cases, usually you don't care about performance anyway. For larger
tasks where performance is actually a concern, the initialization is
relatively minor compared to the benefits of being able to go through a
string only once while still recovering every single word you're looking
for.
So, that's a long way of saying, if you are really serious about
optimizing the code, what you need is a different algorithm. You can
mitigate the problems of the O(N^2) algorithm you've got somewhat, but
in the end you've still got an O(N^2) algorithm. It will never scale
well, no matter what you do to it.
For what it's worth, there are other forms of search algorithms that are
similarly efficient. I just mention the state graph because IMHO it's
relatively easy to understand _and_ implement. You might want to look
at other search algorithms; pretty much anything designed specifically
for matching search terms to an input string will be better than the
algorithm you've got now, and it's entirely possible you could find
something that would suit your specific needs better than the
general-purpose state graph solution.
Pete