470,848 Members | 1,489 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 470,848 developers. It's quick & easy.

Re: Checking for unique fields: performance.

En Fri, 18 Apr 2008 12:23:08 -0300, Shawn Milochik <Sh***@Milochik.comescribió:
I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.

For the uniqueness, I made a dictionary with keys which correspond to the
field names. The values were originally lists, where I would store values
found in that field. Once I detected a duplicate, I deleted the entire
element from the dictionary. Any which remained by the end are considered
unique. Also, if the value was empty, the dictionary element was deleted and
that field considered not unique.

A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file which took 42 minutes to run dropped
down to six seconds.
A dictionary with keys is perfectly reasonable. But a *list* of values has to be searched linearly for every value: a O(n) process. As your friend suggested, searching a dictionary requires O(1) time. A set is even better in this case, because you don't have any use for the values in the inner dictionary (sets and dictionaries are very similar in the implementation).
Here is the excerpt of the bit of code which checks for uniqueness. It's
fully functional, so I'm just looking for any suggestions for improving it
or any comments. Note that fieldNames is a list containing all column
headers.

#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldUnique.has_key(fieldNames[index]):
#if the current value is a duplicate
if fieldUnique[fieldNames[index]].has_key(value):
#sys.stderr.write("Field %s is not unique. Found a
duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
#drop the whole hash element
fieldUnique.__delitem__(fieldNames[index])
else:
#add the new value to the list
fieldUnique[fieldNames[index]][value] = 1
- Instead of using fieldNames[index] all along the place, save it in a variable.
- Your code doesn't show it, but if you are traversing the fieldNames vector like this:
for index in range(len(fieldNames)):
... using fieldNames[index] ...
it's better to use:
for fieldName in fieldNames:
... using fieldName all along the place ...

- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]) use: del fieldUnique[fieldName]
(In normal code, __special__ methods are never used)

#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldName in fieldUnique:
#if the current value is a duplicate
if value in fieldUnique[fieldName]:
#drop the whole field as it's not unique
del fieldUnique[fieldName]
else:
#add the new value to the set
fieldUnique[fieldName].add(value)
else:
# use a set to store the unique values
fieldUnique[fieldName] = set([value])

--
Gabriel Genellina

Jun 27 '08 #1
0 1236

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

2 posts views Thread by reneeccwest | last post: by
4 posts views Thread by bwmiller16 | last post: by
7 posts views Thread by john | last post: by
10 posts views Thread by Phil Latio | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.