Remove items from list while enumerating

pamelafluente

Hi

I have a sorted list with several thousands items. In my case, but this
is not important, objects are stored only in Keys, Values are all
Nothing. Several of the stored objects (might be a large number) have
to be removed (say when ToBeRemoved = true).

class SomeObj
ToBeRemoved be a boolean field
end class

It is more convenient to iterate backword the collection, getting the
element
DirectCast(SList.List.GetKey(i), SomeObj) and in case (say if
ToBeRemoved is true) to remove them with SList.List.RemoveAt(i) or to
rebuild the List taking only the ones not removed?

(In the latter case, a problem is that the list should not be recreated
using a New()
constructor, because it may be being used by other objects and one does

not want to disconnect the SList reference. )

How do I do this in the most efficient way?

-P

Oct 11 '06 #1

Subscribe Post Reply

6729

Bruce Wood

pa***********@libero.it wrote:

Hi

I have a sorted list with several thousands items. In my case, but this
is not important, objects are stored only in Keys, Values are all
Nothing. Several of the stored objects (might be a large number) have
to be removed (say when ToBeRemoved = true).

class SomeObj
ToBeRemoved be a boolean field
end class

It is more convenient to iterate backword the collection, getting the
element
DirectCast(SList.List.GetKey(i), SomeObj) and in case (say if
ToBeRemoved is true) to remove them with SList.List.RemoveAt(i) or to
rebuild the List taking only the ones not removed?

(In the latter case, a problem is that the list should not be recreated
using a New()
constructor, because it may be being used by other objects and one does

not want to disconnect the SList reference. )

How do I do this in the most efficient way?

There are two usual ways.

You hit upon one of them: iterate backward through the list and use
RemoveAt(i).

The other is to build a second list of items to be removed, and then
after the loop is done, loop through the list of items and remove them.
This tends to work better with keyed collections like Hashtable where
indexing isn't available, or in which it doesn't mean much:

ArrayList thingsToRemove = new ArrayList();
foreach (Thing t in myCollection)
{
if (ShouldRemove(t))
{
thingsToRemove.Add(t);
}
}
foreach (Thing t in thingsToRemove)
{
myCollection.Remove(t);
}

Oct 11 '06 #2

Brian Gideon

pa***********@libero.it wrote:

>
[snip]

How do I do this in the most efficient way?

Hello,

I assume by efficient you mean execution speed. I also assume that
you're talking about the SortedList because you said the collection
stores key-value pairs. This is actually quite tricky. You mentioned
enumerating through the collection backwards and calling RemoveAt.
That seems like an efficient method on the surface, but in the case of
a SortedList consider what's happening. The RemoveAt method will cause
all items lower in the list to move up to replace the removed item.
That's a O(n) operation. The loop to enumerate the collection in the
first place is also a O(n) operation. The combined result is O(n^2).

You could use a SortedDictionary or equivalent data structure. The
Remove and RemoveAt methods are O(log(n)). So the combined operation
would be O(n*log(n)). That's a significant improvement.
Unfortunately, the SortedDictionary only exists in 2.0. So if you're
still on 1.1 you'd have to download the open source .NET Power
Collections to get an equivalent data structure.

Depending on the exact requirements you could also create your own
custom collection that uses an ArrayList internally. The
responsibility of keeping the list sorted would be on you. The
advantage is that you have complete control of how and what items to
remove. I *think* I could convince you that removing the desired items
would be a O(n) operation if done correctly. I'm prepared to be wrong
about that though. The disadvantage is that it would require more
coding on your part.

Brian

Oct 11 '06 #3

pamelafluente

Thanks all.

*Very* helpful.

Actually I was suspecting that the mere removal could be O(n) since
I assume that some recompact operation on the binary tree must
be carried out.

So the SortedDictionary is better fore removal operations.
But at this point would be better to recreate the whole list. This
would be O(n) ?

In such a a case, the problem is how do I keep the references that have
already
been attached to other objects?

And is perhaps the list inizialization a time consuming operation ?

-P

Brian Gideon ha scritto:

pa***********@libero.it wrote:

[snip]

How do I do this in the most efficient way?

Hello,

I assume by efficient you mean execution speed. I also assume that
you're talking about the SortedList because you said the collection
stores key-value pairs. This is actually quite tricky. You mentioned
enumerating through the collection backwards and calling RemoveAt.
That seems like an efficient method on the surface, but in the case of
a SortedList consider what's happening. The RemoveAt method will cause
all items lower in the list to move up to replace the removed item.
That's a O(n) operation. The loop to enumerate the collection in the
first place is also a O(n) operation. The combined result is O(n^2).

You could use a SortedDictionary or equivalent data structure. The
Remove and RemoveAt methods are O(log(n)). So the combined operation
would be O(n*log(n)). That's a significant improvement.
Unfortunately, the SortedDictionary only exists in 2.0. So if you're
still on 1.1 you'd have to download the open source .NET Power
Collections to get an equivalent data structure.

Depending on the exact requirements you could also create your own
custom collection that uses an ArrayList internally. The
responsibility of keeping the list sorted would be on you. The
advantage is that you have complete control of how and what items to
remove. I *think* I could convince you that removing the desired items
would be a O(n) operation if done correctly. I'm prepared to be wrong
about that though. The disadvantage is that it would require more
coding on your part.

Brian

Oct 12 '06 #4

pamelafluente

Brian Gideon ha scritto:

Depending on the exact requirements you could also create your own
custom collection that uses an ArrayList internally. The
responsibility of keeping the list sorted would be on you. The
advantage is that you have complete control of how and what items to
remove. I *think* I could convince you that removing the desired items

Actually I do use something similar to what you are describing. I even
have
a heavy structure which uses 2 hashtables and 1 arraylist for maximum
speed (I use it for a few items).

In this specific case like you rightly noted I was talking about a
sorted list.

For this task it's a convenient structure because I need to keep the
(key) objects
ordered with respect to a given comparer (Values are always Null).

The brilliant idea you are sketching about an O(n) removal using
arraylist I imagine is based on having an hashtable with (key, Index)
so one removes directly the affected Index, which would be O(n).

I have also another doubt. Since I only need an ordered list of KEYs
(no values) with a comparer, is there something I can use instead of
having to add (key, Nothing) each time ?
-P

would be a O(n) operation if done correctly. I'm prepared to be wrong
about that though. The disadvantage is that it would require more
coding on your part.

Brian

Oct 12 '06 #5

Brian Gideon

pa***********@libero.it wrote:

The brilliant idea you are sketching about an O(n) removal using
arraylist I imagine is based on having an hashtable with (key, Index)
so one removes directly the affected Index, which would be O(n).

It's really not all that brilliant. It would look something like the
following.

public class YourCollection
{
private IList list = new ArrayList();

public void Add(SomeObj item)
{
// Your code to add items goes here.
// Remember to keep everything sorted!
}

public void Remove(SomeObj item)
{
// Your code to remove items goes here.
// Remember to keep everything sorted!
}

public void RemoveWhere(YourPredicateDelegate predicateMethod)
{
IList newList = new ArrayList(list.Length);
foreach (SomeObj item in list)
{
if (predicateMethod(item))
{
newList.Add(item);
}
}
newList.TrimToSize();
list = newList;
}
}

public static void Main()
{
YourCollection collection = new YourCollection();

// Add a bunch of items here.
collection.Add(...);
collection.Add(...);

// Remove unwanted items based on the result of a delegate.
collection.RemoveWhere(new YourPredicateDelegate(RemoveTest));
}

private static bool RemoveTest(SomeObj item)
{
return item.ToBeRemoved;
}

Let me explain a few things. First, notice that YourCollection is
never recreated. Instead, it is the internal ArrayList that is
recreated. But, since that logic is encapsulated by the proxy
collection code outside of YourCollection never even knows it's
happening. Second, notice that I have created the new list with an
initial capacity equal to the number of items in the old list and then
TrimToSize is called later. This will prevent the ArrayList from
performing resizing operations in the loop. Third, I'm using the fact
that the items were already sorted when adding them to the new list and
since adding to the end of an ArrayList is O(1) then the entire loop is
O(n). Fourth, I'm using a delegate to drive the conditional logic of
removing items. This predicate logic, as it is often called, is built
into 2.0, but I didn't know which framework version you were using so I
defined my own predicate delegate. It looks more elegant in 2.0
especially when using anonymous methods :) Fifth, the tradeoff is
memory usage. Notice that the RemoveWhere method will consume twice as
much memory until the GC runs.

I have also another doubt. Since I only need an ordered list of KEYs
(no values) with a comparer, is there something I can use instead of
having to add (key, Nothing) each time ?

Well, if you use a custom collection like I did in my example then it's
easy to make that happen.

Oct 12 '06 #6

Brian Gideon

pa***********@libero.it wrote:

Thanks all.

*Very* helpful.

Actually I was suspecting that the mere removal could be O(n) since
I assume that some recompact operation on the binary tree must
be carried out.

First, it sounds like you're familiar with Big-Oh notation. I was a
little worried I would just confuse you more, but I also felt that
using terms like constant, linear, exponential, and logarithmic could
be just as confusing. Anyway, a SortedList is implemented with an
array. That's why it's so slow.

>
So the SortedDictionary is better fore removal operations.

Yes, it is implemented using a top-down red black tree. It's just a
special kind of binary search tree.

But at this point would be better to recreate the whole list. This
would be O(n) ?

In such a a case, the problem is how do I keep the references that have
already
been attached to other objects?

And is perhaps the list inizialization a time consuming operation ?

I answered those questions in my other post :)

Oct 12 '06 #7

pamelafluente

Brian Gideon ha scritto:

pa***********@libero.it wrote:
The brilliant idea you are sketching about an O(n) removal using

>
Let me explain a few things. First, notice that YourCollection is
never recreated. Instead, it is the internal ArrayList that is
recreated. But, since that logic is encapsulated by the proxy
collection code outside of YourCollection never even knows it's
happening.

Hmmm this seems a real good idea.

Further, if I get correctly the essence of your logic, I would have
"almost" the same result (without the assle of having to code the
insertion method, which are anyway not so easy to do the smart way) by
wrapping a SortedDictionary. I could wrap with a "YourCollection "
object and recreate the internal SortedDictionary when doing the
removal. And usually I do need wrapping anyway because I need to carry
some additional information such as, comparers, culture info, etc. But
if I understand well, your class would also have insertion O(n) while
the SortedDictionary doesn't.... Maybe not, I guess that to find the
insertion point you may have to spend about the same as to perform an
ordering task.

Hmm I am just wondering how come the class you just sketched is not
readily available in the framework ? Or am I missing something ?

Second, notice that I have created the new list with an

initial capacity equal to the number of items in the old list and then
TrimToSize is called later. This will prevent the ArrayList from
performing resizing operations in the loop.

Nice trick to keep in mind ;)

Third, I'm using the fact

that the items were already sorted when adding them to the new list and
since adding to the end of an ArrayList is O(1) then the entire loop is
O(n). Fourth, I'm using a delegate to drive the conditional logic of
removing items. This predicate logic, as it is often called, is built
into 2.0, but I didn't know which framework version you were using so I
defined my own predicate delegate. It looks more elegant in 2.0
especially when using anonymous methods :) Fifth, the tradeoff is
memory usage. Notice that the RemoveWhere method will consume twice as
much memory until the GC runs.

Well, if you use a custom collection like I did in my example then it's
easy to make that happen.

Yes I can "see" that. But I doubt I will be ever able to code that
collection... I think it's not so easy as it might seem at first sight.
:)

-P

Oct 12 '06 #8

Brian Gideon

pa***********@libero.it wrote:

Brian Gideon ha scritto:
Let me explain a few things. First, notice that YourCollection is
never recreated. Instead, it is the internal ArrayList that is
recreated. But, since that logic is encapsulated by the proxy
collection code outside of YourCollection never even knows it's
happening.

Hmmm this seems a real good idea.

Further, if I get correctly the essence of your logic, I would have
"almost" the same result (without the assle of having to code the
insertion method, which are anyway not so easy to do the smart way) by
wrapping a SortedDictionary.

Yes, that is correct. The advantage, like you said, is that insertions
are really simple because you'd let the SortedDictionary ensure the
sort order. The disadvantage is that removing the unwanted items
becomes O(n*log(n)) instead of O(n).

I could wrap with a "YourCollection "
object and recreate the internal SortedDictionary when doing the
removal.

In the case of a SortedDictionary there would be no benefit in
recreating it. The runtime complexity would be the same as if you just
started removing from the existing collection.

And usually I do need wrapping anyway because I need to carry
some additional information such as, comparers, culture info, etc. But
if I understand well, your class would also have insertion O(n) while
the SortedDictionary doesn't....

Yes, that's correct. Using an ArrayList would yield O(n) insertions.
Using a SortedDictionary would yield O(log(n)) insertions.

Maybe not, I guess that to find the
insertion point you may have to spend about the same as to perform an
ordering task.

Most good sorting algorithms are O(n*log(n)) while finding the
insertions point in a well balanced BST is O(log(n)).

Hmm I am just wondering how come the class you just sketched is not
readily available in the framework ? Or am I missing something ?

It's because you're wanting to do something specialized as fast as
possible. The building blocks are there, but I don't think it's
feasible for the framework to provide a collection for every possible
scenario. Keep in mind that the SortedDictionary is a pretty good
match and could be used right out of the box. It's just that in your
scenario there is theoretically a faster way to remove unwanted items
at the cost of more memory, insertion speed, and more coding.

>
Second, notice that I have created the new list with an
initial capacity equal to the number of items in the old list and then
TrimToSize is called later. This will prevent the ArrayList from
performing resizing operations in the loop.

Nice trick to keep in mind ;)

That's actually an important trick. If you allowed the list to resize
then insertions would no longer be O(1). It's difficult to determine
what the complexity is because I don't know what growth algorithm the
ArrayList is using. But, assume it's a trivial 100% growth algorithm
then the number of resizing operations would be log-base-2(n). Now,
each resizing operation involves copying more and more of the n so it
gets really complicated from here trying to figure out the Big-Oh.
After a fair bit of non-trivial calculus I *think* I could convince
someone that the resizes suddenly cause amoritized O(log(n)) insertions
instead of O(1). In reality I'm sure the ArrayList uses a smarter
growth algorithm than 100%, but I bet it still results in an amoritized
complexity O(log(n)). Maybe someone smarter than I can confirm that?

Have I totally confused you? It just goes to show that analyzing
runtime performance is very complicated.

Oct 12 '06 #9

pamelafluente

Brian Gideon ha scritto:

>
In the case of a SortedDictionary there would be no benefit in
recreating it. The runtime complexity would be the same as if you just
started removing from the existing collection.

I meant recreating it omitting the removed items. I am not clear why
it's the same. I would think that recreating a collection with the
remaining items would be less than O(n), while removing m items, should
be m * O(n), *if* it is true that the "recompact" procedure is O(n).

>
Maybe not, I guess that to find the
insertion point you may have to spend about the same as to perform an
ordering task.

Most good sorting algorithms are O(n*log(n)) while finding the
insertions point in a well balanced BST is O(log(n)).

You have almost convinced me to create this Collection. :) But there is
something holding me back. If I use an *arraylist*, even if by simple
iteration and comparing the object I find the insertion point in
maximum n operations, say O(n), then the mere insertion, as we said,
would be O(n) due to recompact (is that true?). So we end up again with
O(n*n).

So what I am really missing is how to do the Add() method. The Remove
method you have illustrated seems simple enough and efficient.

>
Hmm I am just wondering how come the class you just sketched is not
readily available in the framework ? Or am I missing something ?

It's because you're wanting to do something specialized as fast as
possible. The building blocks are there, but I don't think it's
feasible for the framework to provide a collection for every possible
scenario. Keep in mind that the SortedDictionary is a pretty good
match and could be used right out of the box. It's just that in your
scenario there is theoretically a faster way to remove unwanted items
at the cost of more memory, insertion speed, and more coding.

I use all the time a sorted collection of keys only. This is because
when using a comparer you can easily get rid of the necessity of having
a key,value
pair. I think would be invaluable to have a Sorted Collection of keys
only.

Also I have taken a look at the SortedDictionary and tried to replace
my SortedList, but I do not seems to find an easy way to do it. The
index based method I use to restore the object on deserialization seem
missing in the SortedDictionary, and it does not seem to work as a
replacement. Further all that "Of Type" stuff is really annoying. It
seemed theoretically a good idea to have typed collection. But in
practice I find them very bothering.

>

Second, notice that I have created the new list with an
initial capacity equal to the number of items in the old list and then
TrimToSize is called later. This will prevent the ArrayList from
performing resizing operations in the loop.
Nice trick to keep in mind ;)

That's actually an important trick.

Indeed. I will use it all the time.

If you allowed the list to resize

then insertions would no longer be O(1). It's difficult to determine
what the complexity is because I don't know what growth algorithm the
ArrayList is using. But, assume it's a trivial 100% growth algorithm
then the number of resizing operations would be log-base-2(n). Now,
each resizing operation involves copying more and more of the n so it
gets really complicated from here trying to figure out the Big-Oh.
After a fair bit of non-trivial calculus I *think* I could convince
someone that the resizes suddenly cause amoritized O(log(n)) insertions
instead of O(1). In reality I'm sure the ArrayList uses a smarter
growth algorithm than 100%, but I bet it still results in an amoritized
complexity O(log(n)). Maybe someone smarter than I can confirm that?

Intuitively I would expect its complexity to depend on the memory
allocated (not on n, is this true?). I think that allocating space for
16 items should be different than allocating mem for 1000. Just
intuition...

>
Have I totally confused you? It just goes to show that analyzing
runtime performance is very complicated.

No at all. I love this kind of teaching and I value it. *Very* helpful.

-P

Oct 12 '06 #10

Brian Gideon

pa***********@libero.it wrote:

Brian Gideon ha scritto:
In the case of a SortedDictionary there would be no benefit in
recreating it. The runtime complexity would be the same as if you just
started removing from the existing collection.

I meant recreating it omitting the removed items. I am not clear why
it's the same. I would think that recreating a collection with the
remaining items would be less than O(n), while removing m items, should
be m * O(n), *if* it is true that the "recompact" procedure is O(n).

I was talking about the SortedDictionary which is a BST. So even just
adding the omitted items, which in the worst case would be n of them,
is O(n*log(n)).

Most good sorting algorithms are O(n*log(n)) while finding the
insertions point in a well balanced BST is O(log(n)).

You have almost convinced me to create this Collection. :) But there is
something holding me back. If I use an *arraylist*, even if by simple
iteration and comparing the object I find the insertion point in
maximum n operations, say O(n), then the mere insertion, as we said,
would be O(n) due to recompact (is that true?). So we end up again with
O(n*n).

The problem with the ArrayList is the growing operation and not the
recompact. The growing doesn't happen everytime and since you'd be
adding the items inorder then they would always be added at the end of
list which doesn't require any searching. So most insertions would be
constant and then others would be linear in the case of the growing
operation.

So what I am really missing is how to do the Add() method. The Remove
method you have illustrated seems simple enough and efficient.

Yeah, in the case of an ArrayList you'd have to make sure it stays
sorted. That's why I think would be easier for you to just use the
SortedDictionary. Sure, it will be a little slower to do this removal
operation, but unless you're dealing with millions of operations it's
not going to be that big of a deal.

Hmm I am just wondering how come the class you just sketched is not
readily available in the framework ? Or am I missing something ?
It's because you're wanting to do something specialized as fast as
possible. The building blocks are there, but I don't think it's
feasible for the framework to provide a collection for every possible
scenario. Keep in mind that the SortedDictionary is a pretty good
match and could be used right out of the box. It's just that in your
scenario there is theoretically a faster way to remove unwanted items
at the cost of more memory, insertion speed, and more coding.

I use all the time a sorted collection of keys only. This is because
when using a comparer you can easily get rid of the necessity of having
a key,value
pair. I think would be invaluable to have a Sorted Collection of keys
only.

Maybe, but I don't think it's that big of deal in most cases to just
store null for the value. I suppose it would be a waste of memory, but
it shouldn't be critical unless you're storing a lot of items.

>
Also I have taken a look at the SortedDictionary and tried to replace
my SortedList, but I do not seems to find an easy way to do it. The
index based method I use to restore the object on deserialization seem
missing in the SortedDictionary, and it does not seem to work as a
replacement. Further all that "Of Type" stuff is really annoying. It
seemed theoretically a good idea to have typed collection. But in
practice I find them very bothering.

Yep, SortedDictionary is missing the indexing. I didn't realize you
needed that functionality. The Of Type stuff may seem annoying, but it
actually speeds the collection up because value types don't have to be
boxed/unboxed.

If you allowed the list to resize
then insertions would no longer be O(1). It's difficult to determine
what the complexity is because I don't know what growth algorithm the
ArrayList is using. But, assume it's a trivial 100% growth algorithm
then the number of resizing operations would be log-base-2(n). Now,
each resizing operation involves copying more and more of the n so it
gets really complicated from here trying to figure out the Big-Oh.
After a fair bit of non-trivial calculus I *think* I could convince
someone that the resizes suddenly cause amoritized O(log(n)) insertions
instead of O(1). In reality I'm sure the ArrayList uses a smarter
growth algorithm than 100%, but I bet it still results in an amoritized
complexity O(log(n)). Maybe someone smarter than I can confirm that?

Intuitively I would expect its complexity to depend on the memory
allocated (not on n, is this true?). I think that allocating space for
16 items should be different than allocating mem for 1000. Just
intuition...

I don't think the memory allocation is an issue. The issue is that
when the ArrayList resizes itself it will create a new array with a
larger size and then copy the contents of the original array into the
new array. At least that's what remember when I looked at the SSCLI
implementation code. I ran few tests yesterday and I did, in fact,
observe amortized logarithmic complexity when inserting many items at
the end of an ArrayList. But, the operations were *so* fast that I
couldn't tell until I was experimenting with over 1,000,000 insertions.
I bet the ArrayList is doing super fast bulk memory copy operations
when it resizes, but who knows. I just know that the MSDN
documentation says that an insertion that causes a resize becomes O(n)
which means the amoritized complexity of doing many insertions should
be O(1).

Have I totally confused you? It just goes to show that analyzing
runtime performance is very complicated.

No at all. I love this kind of teaching and I value it. *Very* helpful.

Yeah, I love trying to figure out the fastest way of doing things.
Though, honestly, we shouldn't be that worried about it unless we're
dealing with millions of items. I just like the theory behind it all :)

Oct 14 '06 #11

Remove items from list while enumerating

Similar topics