I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.
For the uniqueness, I made a dictionary with keys which correspond to the
field names. The values were originally lists, where I would store values
found in that field. Once I detected a duplicate, I deleted the entire
element from the dictionary. Any which remained by the end are considered
unique. Also, if the value was empty, the dictionary element was deleted and
that field considered not unique.
A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file which took 42 minutes to run dropped
down to six seconds.
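That kind of speedup is what you'd expect from replacing a linear list scan with a hashed lookup. A minimal timing sketch (illustrative numbers only, not the poster's data) comparing membership tests on a list versus a set:

```python
import timeit

# Membership in a list scans elements one by one (O(n) worst case);
# membership in a set is a hash lookup (O(1) on average).
values_list = list(range(100_000))
values_set = set(values_list)

# Searching for an item near the end of the list is the worst case.
list_time = timeit.timeit(lambda: 99_999 in values_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in values_set, number=100)

print("list membership: %.4fs" % list_time)
print("set membership:  %.4fs" % set_time)
```

The gap grows linearly with the number of stored values, which is why the difference on a large file is so dramatic.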
Here is the excerpt of the bit of code which checks for uniqueness. It's
fully functional, so I'm just looking for any suggestions for improving it
or any comments. Note that fieldNames is a list containing all column
headers.
    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value).
    if fieldUnique.has_key(fieldNames[index]):
        #if the current value is a duplicate
        if fieldUnique[fieldNames[index]].has_key(value):
            #sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
            #drop the whole hash element
            fieldUnique.__delitem__(fieldNames[index])
        else:
            #add the new value to the list
            fieldUnique[fieldNames[index]][value] = 1
A dictionary with keys is perfectly reasonable. But a *list* of values
has to be searched linearly for every value: an O(n) process. As your
friend suggested, searching a dictionary requires O(1) time. A set is
even better in this case, because you don't have any use for the values
in the inner dictionary (sets and dictionaries are very similar in
implementation).

- Your code doesn't show it, but if you are traversing the fieldNames
list like this:

    for index in range(len(fieldNames)):
        ... using fieldNames[index] ...

it's better to use:

    for fieldName in fieldNames:
        ... using fieldName ...

- Instead of using fieldNames[index] all over the place, save it in a
variable.
- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]) use:
del fieldUnique[fieldName]
(In normal code, __special__ methods are never used)
    #fieldUnique maps each field still considered unique to the set of
    #values seen so far; initialize it once before the loop, e.g.:
    #    fieldUnique = dict((name, set()) for name in fieldNames)

    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value).
    if fieldName in fieldUnique:
        #if the current value is a duplicate
        if value in fieldUnique[fieldName]:
            #drop the whole field as it's not unique
            del fieldUnique[fieldName]
        else:
            #add the new value to the set
            fieldUnique[fieldName].add(value)
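Put together, the whole scan might look like the following self-contained sketch. The sample data and column names are made up for illustration; the real input is a tab-delimited file with a header row of field names.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real file.
sample = "id\tname\tcity\n1\tAlice\tParis\n2\tBob\tParis\n3\tCarol\tLyon\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
fieldNames = next(reader)

# One set of seen values per field still being tracked as unique.
fieldUnique = dict((name, set()) for name in fieldNames)

for row in reader:
    for fieldName, value in zip(fieldNames, row):
        if fieldName not in fieldUnique:
            continue  # already found a duplicate in this field
        if not value or value in fieldUnique[fieldName]:
            # empty value or duplicate: the field is not unique
            del fieldUnique[fieldName]
        else:
            fieldUnique[fieldName].add(value)

# Fields remaining in fieldUnique had all-unique, non-empty values.
print(sorted(fieldUnique))  # prints ['id', 'name']
```

Note that once a field is deleted it is never re-added, which preserves the original "stop tracking after the first duplicate" behaviour.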
--
Gabriel Genellina