Hi list,
I am sure there are many ways of doing a comparison, but I would like to see
what you would do if you have 2 dictionaries (containing lots of
data - like 20000 keys, and each key contains a dozen or so records)
and you want to build a list of differences between these two sets.
I would like to end up with 3 lists: what's in A and not in B, what's in B
and not in A, and of course, what's in both A and B.
What do you think is the cleanest way to do it? (I am sure you will
come up with ways that astonish me :=) )
Thanks,
I make it 4 bins:
a_exclusive_keys
b_exclusive_keys
common_keys_equal_values
common_keys_diff_values
Something like:
a = {1:1, 2:2, 3:3, 4:4}
b = {2:2, 3:-3, 5:5}
keya = set(a.keys())
keyb = set(b.keys())
a_xclusive = keya - keyb
b_xclusive = keyb - keya
_common = keya & keyb
common_eq = set(k for k in _common if a[k] == b[k])
common_neq = _common - common_eq
If you know simple set arithmetic, it should read OK.
- Paddy.
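For readers on a current interpreter, the same four bins can be sketched without building the key sets by hand. This is only an illustration and assumes Python 3, where dict key views support set operators directly; the function name compare_dicts is made up for this sketch and is not from the thread.

```python
def compare_dicts(a, b):
    """Split the keys of two dicts into four sets (assumes Python 3 key views)."""
    a_only = a.keys() - b.keys()      # keys only in a
    b_only = b.keys() - a.keys()      # keys only in b
    common = a.keys() & b.keys()      # keys present in both
    common_eq = {k for k in common if a[k] == b[k]}
    common_neq = common - common_eq   # shared keys whose values differ
    return a_only, b_only, common_eq, common_neq

# Same sample data as in the post above.
a = {1: 1, 2: 2, 3: 3, 4: 4}
b = {2: 2, 3: -3, 5: 5}
a_only, b_only, common_eq, common_neq = compare_dicts(a, b)
print(sorted(a_only), sorted(b_only), sorted(common_eq), sorted(common_neq))
# prints: [1, 4] [5] [2] [3]
```

Each step is a single pass over keys, so 20000 keys per dictionary should not be a concern for this approach.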
Thanks, that's very clean. It gives me a good reason to move up to Python
2.4.
Oh, wait, works in 2.3 too.
Just have to:
from sets import Set as set
Paddy has already pointed out a necessary addition to your requirement
definition: common keys with different values.
Here's another possible addition: you say that "each key contains a
dozen or so of records". I presume that you mean like this:
a = {1: ['rec1a', 'rec1b'], 42: ['rec42a', 'rec42b']}  # "dozen" -2 to save typing :-)
Now what happens if the other dictionary contains:
b = {1: ['rec1a', 'rec1b'], 42: ['rec42b', 'rec42a']}
Key 42 would be marked as different by Paddy's classification, but the
values are the same, just not in the same order. How do you want to
treat that? avalue == bvalue? sorted(avalue) == sorted(bvalue)? Oh, and
are you sure the buckets don't contain duplicates? Maybe you need
set(avalue) == set(bvalue). What about 'rec1a' vs 'Rec1a' vs 'REC1A'?
All comparisons are equal, but some comparisons are more equal than
others :-)
Cheers,
John
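To make the question concrete, the candidate comparisons can be run side by side on the key 42 values. This is a rough sketch only: the variable names, the duplicate example with 'x' and 'y', and lower-casing as the case normalisation are all assumptions made for illustration.

```python
avalue = ['rec42a', 'rec42b']
bvalue = ['rec42b', 'rec42a']

ordered_equal = avalue == bvalue                  # order matters
sorted_equal = sorted(avalue) == sorted(bvalue)   # order ignored, duplicates counted
set_equal = set(avalue) == set(bvalue)            # order and duplicates both ignored

# Duplicates are what separate the sorted() and set() definitions.
dup_sorted = sorted(['x', 'x', 'y']) == sorted(['x', 'y'])
dup_set = set(['x', 'x', 'y']) == set(['x', 'y'])

# One possible (assumed) treatment of case: lower-case before comparing.
case_equal = 'rec1a' == 'Rec1a'.lower() == 'REC1A'.lower()

print(ordered_equal, sorted_equal, set_equal)   # prints: False True True
print(dup_sorted, dup_set, case_equal)          # prints: False True True
```

The three definitions give different answers on the same data, which is why the choice has to be made before any code is written.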
John,
Yes, there are several scenarios.
a) Comparing keys only.
That's been answered (although I haven't gotten it to work under 2.3
yet)
b) Comparing records.
Now it gets more fun - as you pointed out. I was assuming that there
is no shortcut here. If the key exists in both sets, and if I wish to
know if the records are the same, I would have to do a record-by-record
comparison. However, since there are only a handful of records per
key, this wouldn't be so bad. Maybe I just overload the compare
operator or something.
Hi Johns,
The following is my attempt to give more/deeper comparison info.
Assume you have your data parsed and presented as two dicts a and b
each having as values a dict representing a record.
Further assume you have a function that can compute if two record level
dicts are the same and another function that can compute if two values
in a record level dict are the same.
With a slight modification of my earlier prog we get:
def komparator(a, b, check_equal):
    keya = set(a.keys())
    keyb = set(b.keys())
    a_xclusive = keya - keyb
    b_xclusive = keyb - keya
    _common = keya & keyb
    common_eq = set(k for k in _common if check_equal(a[k], b[k]))
    common_neq = _common - common_eq
    return (a_xclusive, b_xclusive, common_eq, common_neq)
a_xclusive, b_xclusive, common_eq, common_neq = komparator(
    a, b, record_dict__equality_checker)
common_neq = [(key, komparator(a[key], b[key], value__equality_checker))
              for key in common_neq]
Now we get extra info on intra record differences with little extra
code.
Look out though, you could get swamped with data :-)
- Paddy.
John Henry wrote:
> John,
> Yes, there are several scenarios.
> a) Comparing keys only.
> That's been answered (although I haven't gotten it to work under 2.3
> yet)
(1) What's the problem with getting it to work under 2.3?
(2) Why not upgrade?
>
> b) Comparing records.
You haven't got that far yet. The next problem is actually comparing
two *collections* of records, and you need to decide whether for
equality purposes the collections should be treated as an unordered
list, an ordered list, a set, or something else. Then you need to
consider how equality of records is to be defined, e.g. case sensitive
or not.
>
> Now it gets more fun - as you pointed out. I was assuming that there
> is no shortcut here. If the key exists in both sets, and if I wish to
> know if the records are the same, I would have to do a record-by-record
> comparison. However, since there are only a handful of records per
> key, this wouldn't be so bad. Maybe I just overload the compare
> operator or something.
IMHO, "something" would be better than "overload the compare operator".
In any case, you need to DEFINE what you mean by equality of a
collection of records, *then* implement it.
"only a handful": Naturally 0 and 1 are special, but otherwise the
number of records in the bag shouldn't really be a factor in your
implementation.
HTH,
John
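One way to follow the "define, then implement" advice is to put the definition into the parameters of a small helper. The sketch below is an assumption-laden illustration, not anything proposed in the thread: the name records_equal and its parameters are hypothetical, records are assumed to be hashable (strings here), and collections.Counter is a later addition to the standard library than the 2.3/2.4 versions discussed.

```python
from collections import Counter

def records_equal(xs, ys, ordered=False, normalize=None):
    """Hypothetical helper: compare two collections of records.

    ordered=True   -> records must match position by position
    ordered=False  -> treat both as bags: order ignored, duplicates counted
    normalize      -> optional function applied to every record first
    """
    if normalize is not None:
        xs = [normalize(x) for x in xs]
        ys = [normalize(y) for y in ys]
    if ordered:
        return list(xs) == list(ys)
    # Counter equality is multiset equality; it needs hashable records.
    return Counter(xs) == Counter(ys)

print(records_equal(['rec42a', 'rec42b'], ['rec42b', 'rec42a']))                # True
print(records_equal(['rec42a', 'rec42b'], ['rec42b', 'rec42a'], ordered=True))  # False
print(records_equal(['rec1a'], ['REC1A'], normalize=str.lower))                 # True
```

A function like this could then be passed as the check_equal argument of Paddy's komparator, and it works the same way for a bag of 2 records or 200.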
John Machin wrote:
> John Henry wrote:
> > John,
> > Yes, there are several scenarios.
> > a) Comparing keys only.
> > That's been answered (although I haven't gotten it to work under 2.3
> > yet)
> (1) What's the problem with getting it to work under 2.3?
> (2) Why not upgrade?
Let me comment on this part first; I am still chewing on the other parts of
your message.
When I do it under 2.3, I get:
common_eq = set(k for k in _common if a[k] == b[k])
^
SyntaxError: invalid syntax
Don't know why that is.
I can't upgrade yet. Some parts of my code don't compile under 2.4
and I haven't had a chance to investigate further.
In <11*********************@h48g2000cwc.googlegroups.com>, John Henry
wrote:
> When I do it under 2.3, I get:
> common_eq = set(k for k in _common if a[k] == b[k])
> ^
> SyntaxError: invalid syntax
> Don't know why that is.
There are no generator expressions in 2.3. Turn it into a list
comprehension::
common_eq = set([k for k in _common if a[k] == b[k]])
Ciao,
Marc 'BlackJack' Rintsch
Thank you. That works.
I have gone the whole hog and got something that's runnable:
========dict_diff.py=============================
from pprint import pprint as pp
a = {1:{'1':'1'}, 2:{'2':'2'}, 3:dict("AA BB CC".split()), 4:{'4':'4'}}
b = { 2:{'2':'2'}, 3:dict("BB CD EE".split()), 5:{'5':'5'}}
def record_comparator(a, b, check_equal):
    keya = set(a.keys())
    keyb = set(b.keys())
    a_xclusive = keya - keyb
    b_xclusive = keyb - keya
    _common = keya & keyb
    common_eq = set(k for k in _common if check_equal(a[k], b[k]))
    common_neq = _common - common_eq
    return {"A excl keys": a_xclusive, "B excl keys": b_xclusive,
            "Common & eq": common_eq, "Common keys neq values": common_neq}
comp_result = record_comparator(a, b, dict.__eq__)
# Further data on common keys, neq values
common_neq = comp_result["Common keys neq values"]
common_neq = [(key, record_comparator(a[key], b[key], str.__eq__))
              for key in common_neq]
comp_result["Common keys neq values"] = common_neq
print "\na =",; pp(a)
print "\nb =",; pp(b)
print "\ncomp_result = " ; pp(comp_result)
==========================================
When run it gives:
a ={1: {'1': '1'},
 2: {'2': '2'},
 3: {'A': 'A', 'C': 'C', 'B': 'B'},
 4: {'4': '4'}}
b ={2: {'2': '2'}, 3: {'C': 'D', 'B': 'B', 'E': 'E'}, 5: {'5': '5'}}
comp_result =
{'A excl keys': set([1, 4]),
 'B excl keys': set([5]),
 'Common & eq': set([2]),
 'Common keys neq values': [(3,
                             {'A excl keys': set(['A']),
                              'B excl keys': set(['E']),
                              'Common & eq': set(['B']),
                              'Common keys neq values': set(['C'])})]}
- Paddy.
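The script above is Python 2 (print statements, set([...]) in the output). As a hedged sketch of how it might look on Python 3, the version below keeps the same data and bin names but makes two substitutions that are assumptions of this port rather than anything from the thread: key views replace set(a.keys()), and operator.eq replaces dict.__eq__ and str.__eq__, since calling those dunder methods directly can return NotImplemented instead of False when the types do not match.

```python
from operator import eq
from pprint import pprint

def record_comparator(a, b, check_equal=eq):
    """Classify the keys of two dicts into the four bins used above."""
    common = a.keys() & b.keys()
    common_eq = {k for k in common if check_equal(a[k], b[k])}
    return {
        "A excl keys": a.keys() - b.keys(),
        "B excl keys": b.keys() - a.keys(),
        "Common & eq": common_eq,
        "Common keys neq values": common - common_eq,
    }

a = {1: {'1': '1'}, 2: {'2': '2'}, 3: dict("AA BB CC".split()), 4: {'4': '4'}}
b = {2: {'2': '2'}, 3: dict("BB CD EE".split()), 5: {'5': '5'}}

comp_result = record_comparator(a, b)
# Drill into each shared key whose records differ, field by field.
comp_result["Common keys neq values"] = [
    (key, record_comparator(a[key], b[key]))
    for key in sorted(comp_result["Common keys neq values"])
]
pprint(comp_result)
```

The bins come out the same as in the run shown above, only printed with Python 3 set literals such as {1, 4}.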