473,396 Members | 1,827 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Re: Checking for unique fields: performance.

En Fri, 18 Apr 2008 12:23:08 -0300, Shawn Milochik <Sh***@Milochik.comescribió:
I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.

For the uniqueness, I made a dictionary with keys which correspond to the
field names. The values were originally lists, where I would store values
found in that field. Once I detected a duplicate, I deleted the entire
element from the dictionary. Any which remained by the end are considered
unique. Also, if the value was empty, the dictionary element was deleted and
that field considered not unique.

A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file which took 42 minutes to run dropped
down to six seconds.
A dictionary with keys is perfectly reasonable. But a *list* of values has to be searched linearly for every value: a O(n) process. As your friend suggested, searching a dictionary requires O(1) time. A set is even better in this case, because you don't have any use for the values in the inner dictionary (sets and dictionaries are very similar in the implementation).
Here is the excerpt of the bit of code which checks for uniqueness. It's
fully functional, so I'm just looking for any suggestions for improving it
or any comments. Note that fieldNames is a list containing all column
headers.

#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldUnique.has_key(fieldNames[index]):
#if the current value is a duplicate
if fieldUnique[fieldNames[index]].has_key(value):
#sys.stderr.write("Field %s is not unique. Found a
duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
#drop the whole hash element
fieldUnique.__delitem__(fieldNames[index])
else:
#add the new value to the list
fieldUnique[fieldNames[index]][value] = 1
- Instead of using fieldNames[index] all along the place, save it in a variable.
- Your code doesn't show it, but if you are traversing the fieldNames vector like this:
for index in range(len(fieldNames)):
... using fieldNames[index] ...
it's better to use:
for fieldName in fieldNames:
... using fieldName all along the place ...

- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]) use: del fieldUnique[fieldName]
(In normal code, __special__ methods are never used)

#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
if fieldName in fieldUnique:
#if the current value is a duplicate
if value in fieldUnique[fieldName]:
#drop the whole field as it's not unique
del fieldUnique[fieldName]
else:
#add the new value to the set
fieldUnique[fieldName].add(value)
else:
# use a set to store the unique values
fieldUnique[fieldName] = set([value])

--
Gabriel Genellina

Jun 27 '08 #1
0 1349

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: reneeccwest | last post by:
Hello, I plan to create a table with 3 unique keys. Combination of three fields has to be unique for each row in a table that are vendor ID (char 8), vendor name (char 40), and vendor...
3
by: June Moore | last post by:
Hi, I would like to add a unique index that consists of two fields in a table. e.g. tbl_A (field1,field2) -- field1 & field2 Indexed and combination must be Unique. Can anyone tell me the...
5
by: Kamil | last post by:
Hello What should I use for better perfomance since unique constraint always use index ? Thanks Kamil
18
by: Neil | last post by:
I am using SQL 7 with an MS Access 2000 MDB front end, using bound forms with ODBC linked tables. In one form, the user needs to be able to check a box to select one or more records. This is...
4
by: bwmiller16 | last post by:
Guys - I'm doing a database consistency check for a client and I find that they're building unique indexes for performance/query reasons where they could be using non-unique indexes. Note...
5
by: Terri | last post by:
I have a form with a multi-select combo. I dynamically build a SELECT statement, open a report, and set the recordsource to my dynamic SELECT statement. I count the records returned in the report...
10
by: deko | last post by:
I understand it's possible to make a composite Primary Key by holding down the control key and selecting multiple fields, then right-clicking and selecting Primary Key. But I've heard that's not a...
7
by: john | last post by:
I've made an import macro to import bank transactions. Those transactions don't have an ID_number when downloaded from internet. They get it when they are added to the main transaction table in my...
10
by: Phil Latio | last post by:
I am inserting data into user table which contains 5 fields, sounds simple enough normally but 2 of the fields are designated as UNIQUE. If someone does enter a value which already exists, how do I...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.