Quickly finding duplicates in an ArrayList

Basically I have an ArrayList of strings; I need a fast way of
searching through and comparing the elements of the ArrayList and
then returning a new ArrayList containing the items that appear 3 times.

For example, if the ArrayList has the following elements:

elems[0] = "blue"
elems[1] = "red"
elems[2] = "blue"
elems[3] = "green"
elems[4] = "red"
elems[5] = "red"

I want to have a function that goes through the list and then returns a
new arraylist like:

newElems[0] = "red"
newElems[1] = "red"
newElems[2] = "red"

In actuality my ArrayList holds objects that have a string property,
which I reference like elems[0].filename = "red"; but the above
example is easier to follow.

I would like to be able to perform this search as fast as possible. I
personally don't have any real experience or familiarity with topics
such as binary trees, etc; would something like that be the route to
take? Thanks.

Nov 16 '05 #1
I know of two routes that you can follow.

The first: look at the ArrayList.IndexOf method, which is probably the
quickest way to find a certain item in the underlying array. I believe there
is an overload that lets you specify the index at which to start the search.
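
For example, here is a minimal sketch (assuming the list holds plain
strings) that counts occurrences of one value by repeatedly calling
IndexOf with a start index:

    // Count how many times "value" appears in "list" by searching again
    // from the position just after each match.
    int CountOccurrences(ArrayList list, string value)
    {
        int count = 0;
        int index = list.IndexOf(value);             // first match, or -1
        while (index != -1)
        {
            count++;
            if (index + 1 >= list.Count)
                break;                               // no room left to search
            index = list.IndexOf(value, index + 1);  // continue after the match
        }
        return count;
    }

Note that doing this for every element is still roughly quadratic overall.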

Secondly, if you create the ArrayList yourself (rather than receiving it
from third-party code), you could use ArrayList.Contains() when populating
the list to ensure duplicates don't get into the list in the first place.

My "two" euro cents. :)

"paradox" <de*@demiurgeinc.com> schrieb im Newsbeitrag
news:11**********************@o13g2000cwo.googlegr oups.com...
Basically I have an ArrayList of strings; I need a fast way of
searching through and comparing the elements of the ArrayList and the
return an ArrayList of items that have 3 Duplicates.

For example, if the ArrayList has the following elements:

elems[0] = "blue"
elems[1] = "red"
elems[2] = "blue"
elems[3] = "green"
elems[4] = "red"
elems[5] = "red"

I want to have a function that goes through the list and then returns a
new arraylist like:

newElems[0] = "red"
newElems[1] = "red"
newElems[2] = "red"

In actuality my arraylist is an array of objects, which have a string
property and I'm referencing like: elems[0].filename = "red"; but the
above example is easier to follow.

I would like to be able to perform this search as fast as possible. I
personally don't have any real experience or familiarity with topics
such as binary trees, etc; would something like that be the route to
take? Thanks.

Nov 16 '05 #2
Though it is written in VB.NET, the demo project at the link below is easy
to follow and should be helpful.

http://getdotnetco.web101.discountas...wnloadPage.htm
"paradox" <de*@demiurgeinc.com> wrote in message
news:11**********************@o13g2000cwo.googlegr oups.com...
Basically I have an ArrayList of strings; I need a fast way of
searching through and comparing the elements of the ArrayList and the
return an ArrayList of items that have 3 Duplicates.

For example, if the ArrayList has the following elements:

elems[0] = "blue"
elems[1] = "red"
elems[2] = "blue"
elems[3] = "green"
elems[4] = "red"
elems[5] = "red"

I want to have a function that goes through the list and then returns a
new arraylist like:

newElems[0] = "red"
newElems[1] = "red"
newElems[2] = "red"

In actuality my arraylist is an array of objects, which have a string
property and I'm referencing like: elems[0].filename = "red"; but the
above example is easier to follow.

I would like to be able to perform this search as fast as possible. I
personally don't have any real experience or familiarity with topics
such as binary trees, etc; would something like that be the route to
take? Thanks.

Nov 16 '05 #3
Well... actually the point of the code is to find the duplicates and
then display them to users. To give you a much better understanding,
the program I am creating is for detecting conflicts in the pk3 (zip)
files used in Quake 3. As people develop new maps, they create new map
elements and reuse existing ones (textures, shaders, etc.), and these
are all stored in pk3 files. The problem is that they create their own
version of certain shaders or textures without renaming them or
changing the directory structure, and this causes the game to fail to
load properly. Specifically, the problem seems to occur when there are
3 pk3 files containing a shader element with the same name but a
different file size.

I've gone through the pk3 (zip) files and populated an ArrayList called
conflicts with objects of type element (a class I've defined).

I'm trying to produce a report that goes through this list and returns
these elements (and their associated pk3 files), so that the end user
can easily determine which files they need to remove in order to
connect successfully.

Using IndexOf isn't much different from the slow 3-nested-loop approach
I'm trying to avoid.
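
For reference, the class involved is roughly this shape (only the
filename property and the conflicts list are mentioned in the posts;
the other members and their names are assumptions added for
illustration):

    // Rough sketch of the "element" class described above; only filename
    // is mentioned explicitly, the pk3 and size members are assumptions.
    public class Element
    {
        public string filename;   // e.g. "scripts/foo.shader" (illustrative)
        public string pk3;        // the pk3 archive this entry came from (assumed)
        public long size;         // file size, since conflicts differ by size (assumed)
    }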

Nov 16 '05 #4
I would iterate through the ArrayList and, for each item, add it to a
Hashtable with a value of 1 if it does not already exist there; if it
already exists, increment the value of that item in the Hashtable.

Once you've gone through the ArrayList, iterate through the HashTable
to see which items have a value greater than 1.

It's heavy on memory, but runs in linear time. Should be much better
than 3 nested loops.

Nov 16 '05 #5
Walk through the ArrayList and add each item to a Hashtable that
stores the number of occurrences of that item. Since Hashtable inserts
and lookups have O(1) time complexity, the entire operation would be
O(n), which is good.

Hashtable table = new Hashtable();
foreach (string item in elems)
{
    if (!table.Contains(item))
    {
        table.Add(item, 0);
    }

    int count = (int)table[item] + 1;
    table[item] = count;
    if (count == 2)
    {
        // Duplicate items are identified here.
    }
}
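
If you need the duplicates collected into a new ArrayList rather than
just spotted inside the loop, a second pass over the table works. This
is a rough sketch building on the code above, using a threshold of 3 to
match the original question:

    // Second pass: collect every value that occurred at least 3 times.
    ArrayList duplicates = new ArrayList();
    foreach (DictionaryEntry entry in table)
    {
        if ((int)entry.Value >= 3)
        {
            duplicates.Add(entry.Key);
        }
    }

This adds each duplicated value once; if you want one entry per
occurrence (as in the example output), add it (int)entry.Value times
instead.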

Brian

Nov 16 '05 #6
This is almost what I was going to suggest, with one difference: since
you need to track the original ArrayList indexes of the duplicate
items, you should create new ArrayLists and store those in the hashtable.
Something like:

Hashtable h = new Hashtable();
for (int i = 0; i < MainArrayList.Count; i++) {
    // Get a key (name, whatever) in some way that's appropriate for your objects:
    someKeyType key = GetKeyFromObject(MainArrayList[i]);
    // Make a new ArrayList if necessary.
    if (h[key] == null) {
        h[key] = new ArrayList();
    }
    // Store the index of this item in the ArrayList.
    ((ArrayList)h[key]).Add(i);
}
foreach (someKeyType key in h.Keys) {
    if (((ArrayList)h[key]).Count > 1) {
        // Do whatever you need to do to show the duplicates to the user.
        // ((ArrayList)h[key]) contains the indexes of the duplicate items.
    }
}

Again, harsh on memory, but fast.
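
As a usage sketch for the objects described earlier (GetKeyFromObject
is the placeholder from the code above; Element and filename follow the
hypothetical class sketch further up):

    // Hypothetical key function: two entries conflict when they share a filename.
    string GetKeyFromObject(object o)
    {
        return ((Element)o).filename;
    }

With that in place, checking Count >= 3 instead of Count > 1 in the
final loop matches the three-pk3-file case from the original post.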
"Kenny" <kv*@thereturnexchange.com> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
I would iterate through the ArrayList and for each item, add it to a
HashTable if it does not already exist in the HashTable and set its
value to 1. If it already exists, increment the value of that item in
the HashTable.

Once you've gone through the ArrayList, iterate through the HashTable
to see which items have a value greater than 1.

It's heavy on memory, but runs in linear time. Should be much better
than 3 nested loops.

Nov 16 '05 #7
Very nice Jason, that is exactly what I needed.

Thank you all very much!

Nov 16 '05 #8
Out of curiosity, how exactly does a Hashtable work? Why is it so
fast?

Nov 16 '05 #9
Google for it first. There's a lot of information available that does
a better job of explaining it than I could. If there is still
something you don't quite understand, go ahead and post back and we'll
be more than happy to clarify.

Brian

Nov 16 '05 #10
Yeah, I did google it immediately after posting the question and was
able to find sufficient information. If I could have deleted the post,
I would have. ;)

I do have a small question though. Does the .NET Hashtable work in
exactly the same manner as a general hashtable? Also, I tried using
Reflector to track down the GetHashCode function to see how it worked;
I ended up at the String class (since the key I was using was a
string), and the code I found for the GetHashCode function was:

[MethodImpl(MethodImplOptions.InternalCall)]
public override extern int GetHashCode();

I couldn't figure out where to find any more info...

Nov 16 '05 #11
The GetHashCode method is a virtual member of Object, so everything
has a default implementation of it. The string's version of the method
is internal to the .NET framework; that's why you can't see it. One of
the most frequently used ways of getting the hash code of an object is
to XOR all of the object's fields. In the case of a string you might
XOR all of the characters together. I seriously doubt that's exactly
what the framework is doing, though. XOR works well because it is fast
and produces a well-spread distribution of values. That's important
when it comes time to map the hash code to a slot in the underlying
array. A good distribution of hash codes lowers the probability that
two keys will get mapped to the same slot; that is called a collision.
There are many strategies for handling collisions. I believe the
Hashtable in .NET uses double hashing (rehashing), where a second hash
function determines the probe sequence and the table size is kept
prime.
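
As a rough illustration of the XOR idea (a toy example only, not the
framework's actual implementation):

    // Toy XOR-based GetHashCode for a two-field class, illustrating the
    // idea described above. Not how String.GetHashCode actually works.
    public class FileEntry
    {
        public string Name;
        public long Size;

        public override int GetHashCode()
        {
            // XOR the fields' hash codes together.
            int hash = (Name == null) ? 0 : Name.GetHashCode();
            return hash ^ Size.GetHashCode();
        }

        public override bool Equals(object obj)
        {
            FileEntry other = obj as FileEntry;
            return other != null && other.Name == Name && other.Size == Size;
        }
    }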

Nov 16 '05 #12
