Optimize an IEnumerable

timor.super

Hi group,

imagine I want to find number of values of a word in a string ...

Look at my code :

List<stringthingsToFind = new List<string>(new string[]
{"error", "values"});
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // do something
}

.....

class MyClass
{
private readonly List<string_list;
private readonly string _data;

public MyClass(List<stringlistOfThingToSeach, string
data)
{
_list = listOfThingToSeach;
_data = data;
}

public IEnumerable<KeyValuePair<int, string>GetValues()
{
foreach (string someThingToFind in _list)
{
string savData = _data;
int cpt = 0;
int pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
while (savData.Length 0 && pos >= 0)
{
cpt++;
savData = savData.Substring(pos +
someThingToFind.Length);
pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
}
yield return new KeyValuePair<int, string>(cpt,
someThingToFind);
}
}
}
the GetValues function use an yield return to construct an IEnumerable

If I do :

List<stringthingsToFind = new List<string>(new string[]
{"error", "values"});
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // do something
}
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // 2 TIMES
}

then, I will parse the string data 2 times, and I need only one
time ...

so I would like to optimize my GetValues function by constructing a
List<(for example, but maybe something else) during the parsing and
if the function is called a 2nd times, I return the list instead of
doing the parsing again ...
Something like this

public IEnumerable<KeyValuePair<int, string>GetValues()
{
if my list is nothing
{
instanciate my list
do the parsing and create a new keyvaluepair
add the keyvaluepair to my list
yield return the keyvaluepair
}
else
{
return my list
}
}

but ... of course, I only can yield return, not return. Is there a way
to do what I want ?
Thanks for your answer

S.

Sep 21 '07 #1

Subscribe Post Reply

2509

Nicholas Paldino [.NET/C# MVP]

S.

Why not do the parsing in the constructor and then save the list? Then,
your enumerator can just issue a yield return for each element in the parsed
list.

If you want the lazy-load semantics, then your method could just check
to see if a member is set, if it is, then just use it, otherwise, construct
the list of return values and assign it to the member variable. Then, just
iterate through the list and yield return each item.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

<ti*********@gmail.comwrote in message
news:11**********************@n39g2000hsh.googlegr oups.com...

Hi group,

imagine I want to find number of values of a word in a string ...

Look at my code :

List<stringthingsToFind = new List<string>(new string[]
{"error", "values"});
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // do something
}

....

class MyClass
{
private readonly List<string_list;
private readonly string _data;

public MyClass(List<stringlistOfThingToSeach, string
data)
{
_list = listOfThingToSeach;
_data = data;
}

public IEnumerable<KeyValuePair<int, string>GetValues()
{
foreach (string someThingToFind in _list)
{
string savData = _data;
int cpt = 0;
int pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
while (savData.Length 0 && pos >= 0)
{
cpt++;
savData = savData.Substring(pos +
someThingToFind.Length);
pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
}
yield return new KeyValuePair<int, string>(cpt,
someThingToFind);
}
}
}
the GetValues function use an yield return to construct an IEnumerable

If I do :

List<stringthingsToFind = new List<string>(new string[]
{"error", "values"});
MyClass myClass = new MyClass(thingsToFind, data);
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // do something
}
foreach (KeyValuePair<int, stringaValue in
myClass.GetValues())
{ // 2 TIMES
}

then, I will parse the string data 2 times, and I need only one
time ...

so I would like to optimize my GetValues function by constructing a
List<(for example, but maybe something else) during the parsing and
if the function is called a 2nd times, I return the list instead of
doing the parsing again ...
Something like this

public IEnumerable<KeyValuePair<int, string>GetValues()
{
if my list is nothing
{
instanciate my list
do the parsing and create a new keyvaluepair
add the keyvaluepair to my list
yield return the keyvaluepair
}
else
{
return my list
}
}

but ... of course, I only can yield return, not return. Is there a way
to do what I want ?
Thanks for your answer

S.

Sep 21 '07 #2

Christof Nordiek

<ti*********@gmail.comschrieb im Newsbeitrag
news:11**********************@n39g2000hsh.googlegr oups.com...

Hi group,

imagine I want to find number of values of a word in a string ...

<snip>

public IEnumerable<KeyValuePair<int, string>GetValues()
{
if my list is nothing
{
instanciate my list
do the parsing and create a new keyvaluepair
add the keyvaluepair to my list
yield return the keyvaluepair
}
else
{
return my list
}
}

but ... of course, I only can yield return, not return. Is there a way
to do what I want ?

You can do it like:

public IEnumerable<KeyValuePair<int, string>GetValue()
{
if my list is nothing
initialice and fill list:

foreach(.....)
yield return item;

}
or

public IEnumerable<KeyValuePair<int, string>GetValue()
{
if my list is nothing
{
instanciate list
foreach()
{
get item
add item to list
yield return item
}
else
foreach(.....)
yield return item;
}

This doubles the yield return statment (in code), but in the first run it
will start faster.

Christof

Sep 21 '07 #3

Nicholas Paldino [.NET/C# MVP]

It will only start slightly faster, as the only overhead you are
removing with returning the item from the list as you generate it is the
overhead from iterating through the loop a second time, and the lookup of
the item from the list, which probably in most cases, is going to be
negligible.

It might even be more costly, as I think the enumerator that is
generated by the compiler is going to be more complex in this case, but
tests would have to be run to be sure.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"Christof Nordiek" <cn@nospam.dewrote in message
news:OB**************@TK2MSFTNGP06.phx.gbl...

<ti*********@gmail.comschrieb im Newsbeitrag
news:11**********************@n39g2000hsh.googlegr oups.com...
>Hi group,

imagine I want to find number of values of a word in a string ...

<snip>
>public IEnumerable<KeyValuePair<int, string>GetValues()
{
if my list is nothing
{
instanciate my list
do the parsing and create a new keyvaluepair
add the keyvaluepair to my list
yield return the keyvaluepair
}
else
{
return my list
}
}

but ... of course, I only can yield return, not return. Is there a way
to do what I want ?

You can do it like:

public IEnumerable<KeyValuePair<int, string>GetValue()
{
if my list is nothing
initialice and fill list:

foreach(.....)
yield return item;

}
or

public IEnumerable<KeyValuePair<int, string>GetValue()
{
if my list is nothing
{
instanciate list
foreach()
{
get item
add item to list
yield return item
}
else
foreach(.....)
yield return item;
}

This doubles the yield return statment (in code), but in the first run it
will start faster.

Christof

Sep 21 '07 #4

Peter Duniho

Nicholas Paldino [.NET/C# MVP] wrote:

It will only start slightly faster, as the only overhead you are
removing with returning the item from the list as you generate it is the
overhead from iterating through the loop a second time, and the lookup of
the item from the list, which probably in most cases, is going to be
negligible.

It really depends on what you mean by "faster".

I agree that the total cost of the complete enumeration may be basically
the same either way.

However, the initial construction of the parsed data could be very
expensive. By caching it as it's being enumerated the first time, this
cost is spread over the entire enumeration. If the client code is doing
something relatively expensive with each iteration (there's no way to
know from the sample code here whether that's the case), then the client
code may find the parsing enumerator more responsive implemented that
way. Otherwise, while the subsequent calls to the enumerator might be
extremely quick, that first one could be VERY slow.

Why would the code care? Any place where responsiveness is an issue.
For example, if the iteration is somehow tied to the user-interface,
making the user sit and wait for that first one to complete without any
feedback isn't all that great of an idea. Or if the code is doing some
sort of i/o as part of the iteration of the enumerator, at the very
least intermingling the enumeration with the i/o allows for efficient
use of the CPU and i/o resources. And in some situations, there may be
a responsiveness requirement for the i/o itself (for example, network
i/o where the other end will drop the connection after some timeout).

IMHO, all of the proposals from you and Christof's have value, including
the "build the list as you go" one from Christof. Which is most
appropriate depends on the specific situation.

Finally, one warning about the "build the list as you go". One design
disadvantage I see with this technique is that there's no guarantee that
the client will do the entire enumeration. So it would be important not
just to check for existence of the cached list, but to verify that it's
complete. This would probably require inclusion of some sort of "where
did the enumeration leave off last" state in the enumerator that can be
referenced the next time the enumeration is done.

Pete

Sep 21 '07 #5

Peter Duniho

You've gotten a couple of really useful replies, IMHO, regarding
optimizing specifically via a cache, as you've requested. Though,
personally it seems to me that unless you have a very specific need to
reuse the enumeration in completely different places in the code, it
might be better to just have the _client_ of the enumerator cache the
results as needed.

But even so, I think the basic question has been answered pretty well.
So here are a couple of comments that digress a bit, but which IMHO
should still be useful...

First, the specific code you posted:

[...]
public IEnumerable<KeyValuePair<int, string>GetValues()
{
foreach (string someThingToFind in _list)
{
string savData = _data;
int cpt = 0;
int pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
while (savData.Length 0 && pos >= 0)
{
cpt++;
savData = savData.Substring(pos +
someThingToFind.Length);
pos = savData.IndexOf(someThingToFind,
StringComparison.InvariantCultureIgnoreCase);
}
yield return new KeyValuePair<int, string>(cpt,
someThingToFind);
}
}

I may be missing something, but the above code appears to me to only
return an iteration instance once for each word in your search list.
This may be your intended semantics, but if so it's my opinion that
returning a KeyValuePair isn't appropriate.

Actually, I don't think returning a KeyValuePair is appropriate in any
case. The reason why is that semantically, a KeyValuePair implies a key
that is unique and a value that is associated with that unique key. In
this case, your "key" isn't unique at all as near as I can tell; it's
simply the number of times a specific word showed up in your string.

I understand the temptation to use the KeyValuePair since it _seems_ to
encapsulate the very information you're returning (a pair of related
values). But IMHO it would be much better for you to just define your
own generic struct that does basically the same thing.

Alternatively, at the very least it's my opinion that the key in the
KeyValuePair should be the string and the count be the value. It seems
much more likely to me that your string is the unique element, rather
than the count.

Why would you care? Well, IMHO the biggest risk is that a KeyValuePair
could be used to construct a Dictionary or similar object directly, and
this would fail as soon as the key -- that is, the integer count --
showed up more than once.

I think there are other semantic reasons to fix this part of the code,
but the fact that there's a genuine potential risk for future uses of
the code is probably the only reason one would need to justify fixing it.
My second comment has to do with the "optimization" aspect itself.
IMHO, caching the results is a "sort of" optimization. That is, yes it
has the potential for improving performance, but it doesn't really
change the efficiency of the algorithm.

I note that this algorithm searches the string once for each word in
your list of words. For a very small number of words, this might not be
too bad, but as the number of words goes up, it can dramatically inflate
the cost of the algorithm.

The idea of caching the results can mitigate that problem somewhat, but
given that the algorithm is basically O(N^2), if you can change the
algorithm to something more efficient, you could wind up getting a HUGE
performance increase even without caching.

It depends somewhat on the number of words, the length of the string,
and the number of times you'd enumerate the results. But assuming that
the number of words is more than the number of times you'd enumerate the
results, an O(N) enumeration executed multiple times will still be
faster than an O(N^2) enumeration executed once and cached for future use.

The exact break-even point would depend a little on the specific
performance and how many times you need to do the enumeration for a
given list of words. But it seems to me that in any situation where you
are actually concerned about performance, that means the number of
words, length of string, and/or number of enumerations is relatively
high, and these are the situations where the disparity between an O(N)
algorithm and an O(N^2) algorithm can be very significant.

Of course, if you cache the results _and_ fix the algorithm to be better
than O(N^2), you stand to gain even more.

So, what's an O(N) algorithm to accomplish the same thing? A common one
used for string searches such as yours would be a state machine
implementation, where the nodes of the state graph are connected
according to the letters in the search words.

You traverse the graph by extracting a single character at a time from
the string being searched and using that character to determine which
node to move to; if the character has no matching link to a subsequent
node, you return to the root of the graph. The graph will have
"terminal" nodes representing the end of a search word; if and when you
reach one of those, you know you've found a word.

The exact implementation depends on your needs. The basic
implementation suffices in many cases, but if it is possible for your
search words to overlap, that can be addressed as well.

Using a state graph requires a small amount of initialization to take
place, and for relatively small tasks this initialization could cost
more than the overhead of a less-efficient algorithm. But in those
cases, usually you don't care about performance anyway. For larger
tasks where performance is actually a concern, the initialization is
relatively minor compared to the benefits of being able to go through a
string only once while still recovering every single word you're looking
for.

So, that's a long way of saying, if you are really serious about
optimizing the code, what you need is a different algorithm. You can
mitigate the problems of the O(N^2) algorithm you've got somewhat, but
in the end you've still got an O(N^2) algorithm. It will never scale
well, no matter what you do to it.

For what it's worth, there are other forms of search algorithms that are
similarly efficient. I just mention the state graph because IMHO it's
relatively easy to understand _and_ implement. You might want to look
at other search algorithms; pretty much anything designed specifically
for matching search terms to an input string will be better than the
algorithm you've got now, and it's entirely possible you could find
something that would suit your specific needs better than the
general-purpose state graph solution.

Pete

Sep 21 '07 #6

Similar topics

problem using IEnumerable<T>

by: jcc | last post by:

Hi guys, I'm a newbie to C#. My Visual Studio 2005 failed to compile the following code with error as 'HelloWorld.A' does not implement interface member...

C# / C Sharp

Implementing generic IEnumerable

by: KH | last post by:

I can't seem to figure this one out... I've searched MSDN and Goog, and made my best guesses to no avail,, so help would be much appreciated! public ref class T sealed : public...

.NET Framework

IEnumerable is strange

by: Tin Gherdanarra | last post by:

Dear mpdls, here is a simple example of an IEnumerable that generates integers: It works, but I have only a vague idea of what's going on. I understand that /yield/ wraps the humble integer...

C# / C Sharp

Returning data contained in Ienumerable<type> as type[]?

by: =?Utf-8?B?RnJhbmNvaXNWaWxqb2Vu?= | last post by:

Hi there, Does anybody know how to return the results of an IEnumerable<typeas an array of the same type i.e type in a web service call without first having to collect all elements in the...

.NET Framework

syntax wonderings, iterator and IEnumerable

by: =?Utf-8?B?a2VubmV0aEBub3NwYW0ubm9zcGFt?= | last post by:

When creating multiple iterators, the original is defined as returning IEnumerator, ie public IEnumerator GetEnumerator() { yield x; ...} whereas the additional ones are defined as returning...

C# / C Sharp

Why IEnumerable<T> inherits from IEnumerable?

by: Morgan Cheng | last post by:

In .Net 2.0, Generics are introduced. I found that IEnumerable<T> inherites from IEnumerable. This makes implementation of IEnumerable<Thas to have two GetEnumerator methods defined( one for...

C# / C Sharp

IEnumerable objects are essentially arrays, right?

by: gnewsgroup | last post by:

I've read the msdn doc about IEnumerable. It seems to me that IEnumerable objects are essentially wrapped-up arrays. It simply gives us the foreach convenience. Is this correct?

C# / C Sharp

What is the disadvantage if I don't implement IEnumerable<T>

by: Tony Johansson | last post by:

Hello! Below I have a working program. I have one generic class called Farm<T> with this header definition public class Farm<T: IEnumerable<Twhere T : Animal Now to my question I changed the...

C# / C Sharp

IEnumerable<T> interface

by: =?Utf-8?B?UElFQkFMRA==?= | last post by:

Something I found surprising this week involves the IEnumerable<Tinterface. I have a class that I wrote a couple of years ago which implements the IEnumerable interface. This week I realized it...

C# / C Sharp

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA