473,396 Members | 2,139 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

set matching

hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

..possibility one ...
...is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
...if i find an equal i iterate in both arrays, ...
...if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
...this is called "Merged join", i read.
...it performs well if the matching elements are "close" together.

..possibility two...
...i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
...binarysearch is used instead of sequentially searching the remainder
of an array.
...this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

any suggestions for a still faster solution to find matches in two
sorted arrays?

thanks

Mar 30 '06 #1
11 2941
Solandre:

Do you have a requirement (other than searching) to maintain your lists in
sorted order?
What are you storing in the arrays? (is it just a primitive, like a number,
or is it an array of data structures)?

Maybe you could get better performance by switching to a different data
structure (hash tables, binary tree, etc).

For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.

There could be a few different ways to go, but which is the best depends
upon your specific situation. You can find a pretty good discussion of data
structures here:
http://msdn.microsoft.com/vcsharp/pr...atastructures/
"solandre" <am*@gmx.info> wrote in message
news:11********************@i40g2000cwc.googlegrou ps.com...
hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

.possibility one ...
..is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
..if i find an equal i iterate in both arrays, ...
..if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

.possibility two...
..i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
..binarysearch is used instead of sequentially searching the remainder
of an array.
..this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

any suggestions for a still faster solution to find matches in two
sorted arrays?

thanks

Mar 30 '06 #2


solandre wrote:
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.


I don't really understand this. Are your input lists sorted or not?

if they are, and they are reasonable equally sized (within a few orders
of magnitude), you can iterate them simultaneously, by maintaining an
index into them seperatly and move the index pointing to the smallest
element forward, perhaps this is what you described above?

Using binary-search to find the next matching pair isn't really likely
to be good. Although an approach where you jump in steps with increasing
length by squaring and then afterwards search linearly may work well if
the lists have very different sizes.

--
Helge Jensen
mailto:he**********@slog.dk
sip:he**********@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-
Mar 30 '06 #3
thanks for the replies.

yes the arrays are sorted. merged join might be the wron name for it,
is it called transversal searching?
yes we hold two pointers one for each array.
jumping with increasing steps instead binarysearching the rest is
interesting and i will evaluate it.
i think it really depends on the structure of the data, where the net
match or next higher possibly is (in our case it is assumingly far
away).

we have sorted out binary trees, as trees always need to hold pointers
to the next or previouse node. which we dont want to afford now for
reason of size.

the arrays hold primitive ulongs (is it a primitve type?)

Mar 30 '06 #4
Matching sorted lists... sounds like what I heard called a "master file
update" waaaay back in college (a COBOL class).

If the referenced objects match, do what you do with them and advance both
references
Else advance the reference to the lesser object

Mar 30 '06 #5

J.Marsch wrote:
For example, have you looked at Hash tables?
Pro: Lookup time in a hash table is logarithmic (O(logN)), so even a very
large hash table will yield extremely fast lookups
Con: A hash table is not sorted, so you won't have a list in order, also,
each key (lookup value) must be unique.


Hash tables have an average case runtime of O(1). One con I would add
to the list is that hash tables consume more memory.

Mar 30 '06 #6
Solandre,

I'm still a little confused. Can you provide an example of what the
operation in question would produce. Consider two sets.

A = { 2, 3, 5, 7, 11, 13, 17, 19 }
B = { 1, 1, 2, 3, 5, 8, 13, 21 }

Would the result be the intersection of the two? Result = { 2, 3, 5, 13
}

Brian

solandre wrote:
hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

.possibility one ...
..is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
..if i find an equal i iterate in both arrays, ...
..if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.

.possibility two...
..i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
..binarysearch is used instead of sequentially searching the remainder
of an array.
..this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.

any suggestions for a still faster solution to find matches in two
sorted arrays?

thanks


Mar 30 '06 #7


Brian Gideon wrote:
B = { 1, 1, 2, 3, 5, 8, 13, 21 }


B is not actually a set, unless the two 1's are somehow different :)

--
Helge Jensen
mailto:he**********@slog.dk
sip:he**********@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-
Mar 31 '06 #8
..there would be no duplicates in the arrays
..considering these arrays, with no dup. 1 in arrB, the result would be
your mentioned { 2, 3, 5, 13}

Consider two sets.
A = { 2, 3, 5, 7, 11, 13, 17, 19 }
B = { 1, 1, 2, 3, 5, 8, 13, 21 }
Would the result be the intersection of the two? Result = { 2, 3, 5, 13 }


Mar 31 '06 #9
Yeah, I was just trying to be cute and make B the Fibonacci numbers :)

Helge Jensen wrote:
Brian Gideon wrote:
B = { 1, 1, 2, 3, 5, 8, 13, 21 }


B is not actually a set, unless the two 1's are somehow different :)

--
Helge Jensen
mailto:he**********@slog.dk
sip:he**********@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-


Mar 31 '06 #10
Okay. Yeah, you want to intersect two sets. Try the following code.
The ICollection that is returned will be in sorted order if the two
input collections are sorted. You said the collections would not have
duplicates, but this code can handle that case anyway. It runs in
linear time so it should be pretty fast.

public static ICollection Intersect(ICollection lhs, ICollection rhs)
{
ICollection larger = lhs.Count <= rhs.Count ? lhs : rhs;
ICollection smaller = lhs.Count <= rhs.Count ? rhs : lhs;

Hashtable table = new Hashtable();
foreach (object element in larger)
{
if (!table.Contains(element))
{
table.Add(element, element);
}
}

ArrayList result = new ArrayList();
foreach (object element in smaller)
{
if (table.Contains(element))
{
table.Remove(element);
result.Add(element);
}
}

return result;
}

Brian

solandre wrote:
.there would be no duplicates in the arrays
.considering these arrays, with no dup. 1 in arrB, the result would be
your mentioned { 2, 3, 5, 13}

Consider two sets.
A = { 2, 3, 5, 7, 11, 13, 17, 19 }
B = { 1, 1, 2, 3, 5, 8, 13, 21 }
Would the result be the intersection of the two? Result = { 2, 3, 5, 13 }


Mar 31 '06 #11
"solandre" <am*@gmx.info> wrote in message
news:11********************@i40g2000cwc.googlegrou ps.com...
hi there,

i have to find matches in two sorted arrays. there are two good
possibilities as a temporary solution, performing different, depending
on the inner structure of the data. but i try to find a better one.

.possibility one ...
..is running through the arrays in a zig-zag, always taking the an
element from array A and searching the remaining elements of array B
sequentially for an equal or a higher match.
..if i find an equal i iterate in both arrays, ...
..if i find a higher, i flip sides take the found element from array B
and search array A from the position where the last element from array
A was taken +1 sequentially for an equal or a higher.
..this is called "Merged join", i read.
..it performs well if the matching elements are "close" together.
Basically this sounds like a single pass of the Merge Sort algorithm (without the actual merge that
is)
.possibility two...
..i enhanced this method with binarysearch, as the arrays are several
10 million elements big.
Are Both arrays of similar Size?
..binarysearch is used instead of sequentially searching the remainder
of an array.
..this performs better if the matching elements are far appart, but
incredibly poor, if the matches are close together.
Define "Incredibly"
If you "Know" that a section of your data is highly localized you can narrow the scope of the binary
search to attempt to speed up the performance in the localized section.
any suggestions for a still faster solution to find matches in two
sorted arrays?


The first variant (Merge sort) should perform fairly well in the general case, however, if you have
knowledge about your data and how it is likely to be distributed you can possibly improve the
performance further. Let me ask you a few questions about the usage patterns.

After you intersect 2 Arrays, What then?
Intersect 2 new Arrays?
Intersect the first array with a whole bunch of secondary arrays?
The reason that I ask is that if you are repeatedly searching the same array, you could optimize the
search for THAT particular data set. A hashtable could improve your performance, but only if you are
searching the same Array more than once.

Are both Arrays of similar size?
Does the data have any known clumping or is it fairly uniform?
Do both Arrays, in general span the same range of possible values or is one of the arrays more
localized or clumpy?

If you "Know" you are in a localized/clumpy section of data....use the standard Merge sort method
If you "Know" you are likely going to have to go a long way to find the next match ...Use a binary
search.

If you can develop a heuristic to determine locality, you can alter your search method accordingly.
Unfortunately, adding the calculation of a heuristic ADDS to the overhead. So you will have to
decide if it is worth it.

Good luck
Bill



Apr 2 '06 #12

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: gsv2com | last post by:
One of my weaknesses has always been pattern matching. Something I definitely need to study up on and maybe you guys can give me a pointer here. I'm looking to remove all of this code and just...
176
by: Thomas Reichelt | last post by:
Moin, short question: is there any language combining the syntax, flexibility and great programming experience of Python with static typing? Is there a project to add static typing to Python? ...
17
by: Andrew McLean | last post by:
I have a problem that is suspect isn't unusual and I'm looking to see if there is any code available to help. I've Googled without success. Basically, I have two databases containing lists of...
1
by: Henry | last post by:
I have a table that stores a list of zip codes using a varchar column type, and I need to perform some string prefix pattern matching search. Let's say that I have the columns: 94000-1235 94001...
10
by: bpontius | last post by:
The GES Algorithm A Surprisingly Simple Algorithm for Parallel Pattern Matching "Partially because the best algorithms presented in the literature are difficult to understand and to implement,...
5
by: olaufr | last post by:
Hi, I'd need to perform simple pattern matching within a string using a list of possible patterns. For example, I want to know if the substring starting at position n matches any of the string I...
1
by: solarin | last post by:
Hi, I've developed a program under VS 6.0. I can compile it and run it, but when I try to debbug , all my breakpoints are dissabled and I can see the following messages: Loaded...
2
by: santhoshs | last post by:
Hello I am required to parse two files that contain email addresses and figure out a way to get the matching and non-matching email addresses from both the files. I was able to get the matching...
11
by: tech | last post by:
Hi, I need a function to specify a match pattern including using wildcard characters as below to find chars in a std::string. The match pattern can contain the wildcard characters "*" and "?",...
1
by: sora | last post by:
Hi, I've developed a MFC program under VS 6.0. My debugger *was* working fine and I've used it often for my project. Then, one day, the errors below appear and they prevent me from using the...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.