
# What's the cleanest way to compare 2 dictionaries?

Hi list,

I am sure there are many ways of doing a comparison, but I'd like to see
what you would do if you have 2 dictionaries (containing lots of
data - like 20000 keys, with each key holding a dozen or so records)
and you want to build a list of differences between the two.

I'd like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.

What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonish me :=) )

Thanks,

Aug 9 '06 #1

I make it 4 bins:
a_exclusive_keys
b_exclusive_keys
common_keys_equal_values
common_keys_diff_values

Something like:

a = {1: 1, 2: 2, 3: 3, 4: 4}
b = {2: 2, 3: -3, 5: 5}
keya = set(a.keys())
keyb = set(b.keys())
a_xclusive = keya - keyb    # keys only in a
b_xclusive = keyb - keya    # keys only in b
_common = keya & keyb       # keys in both
common_eq = set(k for k in _common if a[k] == b[k])
common_neq = _common - common_eq

If you know simple set arithmetic, it should read OK.
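
The same arithmetic can be wrapped in a small helper. This is only a sketch: the name `compare_dicts` and the order of the returned tuple are made up for illustration, and the `print()` call assumes Python 3.

```python
def compare_dicts(a, b):
    """Split the keys of two dicts into four bins (hypothetical helper)."""
    keys_a = set(a)                  # iterating a dict yields its keys
    keys_b = set(b)
    a_exclusive = keys_a - keys_b    # keys only in a
    b_exclusive = keys_b - keys_a    # keys only in b
    common = keys_a & keys_b         # keys in both
    # Of the shared keys, which ones map to equal values?
    common_eq = set(k for k in common if a[k] == b[k])
    common_neq = common - common_eq
    return a_exclusive, b_exclusive, common_eq, common_neq


a = {1: 1, 2: 2, 3: 3, 4: 4}
b = {2: 2, 3: -3, 5: 5}
a_only, b_only, same, different = compare_dicts(a, b)
print(sorted(a_only), sorted(b_only), sorted(same), sorted(different))
# prints: [1, 4] [5] [2] [3]
```

Each set operation touches every key about once, so 20000 keys should not be a problem.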

Aug 9 '06 #2

Thanks, that's very clean. Gives me a good reason to move up to Python
2.4.

Aug 9 '06 #3

Oh, wait, works in 2.3 too.

Just have to:

from sets import Set as set
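
If the same script has to run on both 2.3 and later versions, one possible arrangement (a sketch, not something posted in this thread) is to try the built-in name first and fall back to the `sets` module only when it is missing:

```python
# Use the built-in set where it exists (Python 2.4 and later);
# fall back to the sets module on 2.3, where the name is undefined.
try:
    set
except NameError:
    from sets import Set as set

keya = set([1, 2, 3, 4])
keyb = set([2, 3, 5])
print(sorted(keya - keyb))   # keys only in the first set
```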

Aug 9 '06 #4

definition: common keys with different values.

Here's another possible addition: you say that "each key contains a
dozen or so of records". I presume that you mean like this:

a = {1: ['rec1a', 'rec1b'], 42: ['rec42a', 'rec42b']}  # "dozen" -2 to save typing :-)

Now what happens if the other dictionary contains:

b = {1: ['rec1a', 'rec1b'], 42: ['rec42b', 'rec42a']}

Key 42 would be marked as different by Paddy's classification, but the
values are the same, just not in the same order. How do you want to
treat that? avalue == bvalue? sorted(avalue) == sorted(bvalue)? Oh, and
are you sure the buckets don't contain duplicates? Maybe you need
set(avalue) == set(bvalue). What about 'rec1a' vs 'Rec1a' vs 'REC1A'?
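
To make those questions concrete, here is a small sketch of how the candidate definitions disagree on the same data. The names `avalue` and `bvalue` come from the text above; the list with a deliberate duplicate (`with_dup`) is an assumption added for illustration.

```python
avalue = ['rec42a', 'rec42b']
bvalue = ['rec42b', 'rec42a']

# Ordered list: position matters, so these differ.
print(avalue == bvalue)                      # False
# Unordered list (bag): order ignored, duplicates still count.
print(sorted(avalue) == sorted(bvalue))      # True

# Add a duplicate record to one side (hypothetical data).
with_dup = avalue + ['rec42a']
# Set: order and duplicates are both ignored.
print(set(with_dup) == set(bvalue))          # True
# Bag: the extra copy makes the two sides differ.
print(sorted(with_dup) == sorted(bvalue))    # False

# Case-insensitive comparison: normalise before comparing.
print('rec1a'.lower() == 'REC1A'.lower())    # True
```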

All comparisons are equal, but some comparisons are more equal than
others :-)

Cheers,
John

Aug 9 '06 #5
John,

Yes, there are several scenarios.

a) Comparing keys only.

That's been answered (although I haven't gotten it to work under 2.3
yet)

b) Comparing records.

Now it gets more fun - as you pointed out. I was assuming that there
is no shortcut here. If the key exists in both sets, and if I wish to
know if the records are the same, I would have to do a record-by-record
comparison. However, since there are only a handful of records per
key, this wouldn't be so bad. Maybe I just overload the compare
operator or something.

Aug 10 '06 #6

Hi Johns,
The following is my attempt to give more/deeper comparison info.
Assume you have your data parsed and presented as two dicts a and b
each having as values a dict representing a record.
Further assume you have a function that can compute if two record level
dicts are the same and another function that can compute if two values
in a record level dict are the same.

With a slight modification of my earlier prog we get:

def komparator(a, b, check_equal):
    keya = set(a.keys())
    keyb = set(b.keys())
    a_xclusive = keya - keyb
    b_xclusive = keyb - keya
    _common = keya & keyb
    common_eq = set(k for k in _common if check_equal(a[k], b[k]))
    common_neq = _common - common_eq
    return (a_xclusive, b_xclusive, common_eq, common_neq)

a_xclusive, b_xclusive, common_eq, common_neq = komparator(
    a, b, record_dict__equality_checker)

common_neq = [(key,
               komparator(a[key], b[key], value__equality_checker))
              for key in common_neq]

Now we get extra info on intra record differences with little extra
code.

Look out though, you could get swamped with data :-)
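
For anyone who wants to run the idea above, here is a self-contained sketch. The two checker functions are placeholders that simply use `==`, and the sample data is invented; a real program would substitute its own definition of equality at each level.

```python
def komparator(a, b, check_equal):
    keya = set(a)
    keyb = set(b)
    a_xclusive = keya - keyb
    b_xclusive = keyb - keya
    _common = keya & keyb
    common_eq = set(k for k in _common if check_equal(a[k], b[k]))
    common_neq = _common - common_eq
    return (a_xclusive, b_xclusive, common_eq, common_neq)


# Placeholder checkers (assumptions): plain == at both levels.
def record_dict__equality_checker(rec_a, rec_b):
    return rec_a == rec_b


def value__equality_checker(val_a, val_b):
    return val_a == val_b


# Invented sample data: each value is a record-level dict.
a = {'k1': {'name': 'x', 'size': 1}, 'k2': {'name': 'y', 'size': 2}}
b = {'k2': {'name': 'y', 'size': 3}, 'k3': {'name': 'z', 'size': 4}}

a_xclusive, b_xclusive, common_eq, common_neq = komparator(
    a, b, record_dict__equality_checker)

# Drill into each differing record to see which fields differ.
details = [(key, komparator(a[key], b[key], value__equality_checker))
           for key in common_neq]
print(details)
```

Here only `'k2'` is shared, and the second-level comparison reports that its `'size'` field differs while `'name'` matches.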

Aug 10 '06 #7
John Henry wrote:
> John,
>
> Yes, there are several scenarios.
>
> a) Comparing keys only.
>
> That's been answered (although I haven't gotten it to work under 2.3
> yet)

(1) What's the problem with getting it to work under 2.3?

> b) Comparing records.

You haven't got that far yet. The next problem is actually comparing
two *collections* of records, and you need to decide whether for
equality purposes the collections should be treated as an unordered
list, an ordered list, a set, or something else. Then you need to
consider how equality of records is to be defined, e.g. case sensitive
or not.

> Now it gets more fun - as you pointed out. I was assuming that there
> is no shortcut here. If the key exists in both sets, and if I wish to
> know if the records are the same, I would have to do a record-by-record
> comparison. However, since there are only a handful of records per
> key, this wouldn't be so bad. Maybe I just overload the compare
> operator or something.

IMHO, "something" would be better than "overload the compare operator".
In any case, you need to DEFINE what you mean by equality of a
collection of records, *then* implement it.

"only a handful": Naturally 0 and 1 are special, but otherwise the
number of records in the bag shouldn't really be a factor in your
implementation.
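
One way to write the definition down before implementing anything is a plain function rather than an overloaded operator. The sketch below assumes the "bag" reading (order ignored, duplicates counted) and records that can be sorted; `records_equal` and its `normalize` parameter are hypothetical names, not from any library.

```python
def records_equal(a_records, b_records, normalize=None):
    """Compare two collections of records as bags.

    Order is ignored but duplicates count. normalize is an optional
    function applied to every record first, e.g. str.lower for a
    case-insensitive match.
    """
    if normalize is not None:
        a_records = [normalize(r) for r in a_records]
        b_records = [normalize(r) for r in b_records]
    return sorted(a_records) == sorted(b_records)


print(records_equal(['rec42a', 'rec42b'], ['rec42b', 'rec42a']))   # True
print(records_equal(['rec1a'], ['REC1A']))                         # False
print(records_equal(['rec1a'], ['REC1A'], normalize=str.lower))    # True
print(records_equal(['rec1a', 'rec1a'], ['rec1a']))                # False
```

The function works the same for 0, 1 or a dozen records, and it can be passed as the `check_equal` argument of the comparator posted earlier.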

HTH,
John

Aug 10 '06 #8

John Machin wrote:
> John Henry wrote:
>> John,
>>
>> Yes, there are several scenarios.
>>
>> a) Comparing keys only.
>>
>> That's been answered (although I haven't gotten it to work under 2.3
>> yet)
>
> (1) What's the problem with getting it to work under 2.3?

Let me comment on this part first, I am still chewing other parts of

When I do it under 2.3, I get:

    common_eq = set(k for k in _common if a[k] == b[k])
                        ^
SyntaxError: invalid syntax

Don't know why that is.

I can't upgrade yet. Some part of my code doesn't compile under 2.4
and I haven't got a chance to investigate further.

Aug 11 '06 #9
There are no generator expressions in 2.3. Turn it into a list
comprehension::

common_eq = set([k for k in _common if a[k] == b[k]])
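
A quick side-by-side sketch of the two spellings, reusing the sample dicts from earlier in the thread. On 2.3 itself `set` would still have to come from the `sets` module; the variable names `from_genexp` and `from_listcomp` are just for illustration.

```python
a = {1: 1, 2: 2, 3: 3, 4: 4}
b = {2: 2, 3: -3, 5: 5}
_common = set(a) & set(b)

# Generator expression: needs Python 2.4 or later.
from_genexp = set(k for k in _common if a[k] == b[k])
# List comprehension: builds a temporary list first, also valid on 2.3.
from_listcomp = set([k for k in _common if a[k] == b[k]])

print(from_genexp == from_listcomp)   # True
```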

Ciao,
Marc 'BlackJack' Rintsch
Aug 11 '06 #10
Thank you. That works.
Aug 11 '06 #11

I have gone the whole hog and got something that's runnable:

========dict_diff.py=============================

from pprint import pprint as pp

a = {1:{'1':'1'}, 2:{'2':'2'}, 3:dict("AA BB CC".split()), 4:{'4':'4'}}
b = {             2:{'2':'2'}, 3:dict("BB CD EE".split()), 5:{'5':'5'}}

def record_comparator(a, b, check_equal):
    keya = set(a.keys())
    keyb = set(b.keys())
    a_xclusive = keya - keyb
    b_xclusive = keyb - keya
    _common = keya & keyb
    common_eq = set(k for k in _common if check_equal(a[k], b[k]))
    common_neq = _common - common_eq
    return {"A excl keys": a_xclusive, "B excl keys": b_xclusive,
            "Common & eq": common_eq, "Common keys neq values": common_neq}

comp_result = record_comparator(a, b, dict.__eq__)

# Further data on common keys, neq values
common_neq = comp_result["Common keys neq values"]
common_neq = [(key, record_comparator(a[key], b[key], str.__eq__))
              for key in common_neq]
comp_result["Common keys neq values"] = common_neq

print "\na =",; pp(a)
print "\nb =",; pp(b)
print "\ncomp_result = " ; pp(comp_result)

==========================================

When run it gives:

a ={1: {'1': '1'},
 2: {'2': '2'},
 3: {'A': 'A', 'C': 'C', 'B': 'B'},
 4: {'4': '4'}}

b ={2: {'2': '2'}, 3: {'C': 'D', 'B': 'B', 'E': 'E'}, 5: {'5': '5'}}

comp_result =
{'A excl keys': set([1, 4]),
 'B excl keys': set([5]),
 'Common & eq': set([2]),
 'Common keys neq values': [(3,
                             {'A excl keys': set(['A']),
                              'B excl keys': set(['E']),
                              'Common & eq': set(['B']),
                              'Common keys neq values': set(['C'])})]}
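
For readers on Python 3, a rough port of the script might look like the sketch below. It is an adaptation, not part of the thread: it assumes Python 3 key views (which support `-` and `&` directly), and it swaps `dict.__eq__` and `str.__eq__` for `operator.eq`, because an unbound `str.__eq__` returns `NotImplemented` rather than `False` when handed a non-string.

```python
import operator
from pprint import pprint


def record_comparator(a, b, check_equal=operator.eq):
    # In Python 3, dict.keys() is a set-like view, so - and & work directly
    # and each operation returns an ordinary set.
    a_exclusive = a.keys() - b.keys()
    b_exclusive = b.keys() - a.keys()
    common = a.keys() & b.keys()
    common_eq = {k for k in common if check_equal(a[k], b[k])}
    return {"A excl keys": a_exclusive,
            "B excl keys": b_exclusive,
            "Common & eq": common_eq,
            "Common keys neq values": common - common_eq}


a = {1: {'1': '1'}, 2: {'2': '2'}, 3: dict("AA BB CC".split()), 4: {'4': '4'}}
b = {2: {'2': '2'}, 3: dict("BB CD EE".split()), 5: {'5': '5'}}

comp_result = record_comparator(a, b)
# Expand each differing record into its own field-level comparison.
comp_result["Common keys neq values"] = [
    (key, record_comparator(a[key], b[key]))
    for key in sorted(comp_result["Common keys neq values"])
]
pprint(comp_result)
```

The bins come out the same as in the run above; only the printed form of the sets changes (`{1, 4}` instead of `set([1, 4])`).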