I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.
For the uniqueness, I made a dictionary with keys which correspond to the
field names. The values were originally lists, where I would store values
found in that field. Once I detected a duplicate, I deleted the entire
element from the dictionary. Any which remained by the end are considered
unique. Also, if the value was empty, the dictionary element was deleted and
that field considered not unique.
A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file which took 42 minutes to run dropped
down to six seconds.
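That kind of speedup is what you'd expect from replacing a linear list scan with a hashed lookup. A minimal timing sketch (illustrative numbers only, not the poster's data) comparing membership tests on a list versus a set:

```python
import timeit

# Membership in a list scans elements one by one (O(n) worst case);
# membership in a set is a hash lookup (O(1) on average).
values_list = list(range(100_000))
values_set = set(values_list)

# Searching for an item near the end of the list is the worst case.
list_time = timeit.timeit(lambda: 99_999 in values_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in values_set, number=100)

print("list membership: %.4fs" % list_time)
print("set membership:  %.4fs" % set_time)
```

The gap grows linearly with the number of stored values, which is why the difference on a large file is so dramatic.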
Here is the excerpt of the bit of code which checks for uniqueness. It's
fully functional, so I'm just looking for any suggestions for improving it
or any comments. Note that fieldNames is a list containing all column
headers.
    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value).
    if fieldUnique.has_key(fieldNames[index]):
        #if the current value is a duplicate
        if fieldUnique[fieldNames[index]].has_key(value):
            #sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
            #drop the whole hash element
            fieldUnique.__delitem__(fieldNames[index])
        else:
            #add the new value to the list
            fieldUnique[fieldNames[index]][value] = 1
A dictionary with keys is perfectly reasonable. But a *list* of values
has to be searched linearly for every value: an O(n) process. As your
friend suggested, searching a dictionary requires O(1) time. A set is
even better in this case, because you don't have any use for the values
in the inner dictionary (sets and dictionaries are very similar in
implementation).

- Your code doesn't show it, but if you are traversing the fieldNames
list like this:

    for index in range(len(fieldNames)):
        ... using fieldNames[index] ...

it's better to use:

    for fieldName in fieldNames:
        ... using fieldName ...

- Instead of using fieldNames[index] all over the place, save it in a
variable.
- Instead of a.has_key(b), use: b in a
- Instead of fieldUnique.__delitem__(fieldNames[index]) use:
del fieldUnique[fieldName]
(In normal code, __special__ methods are never used)
    #fieldUnique maps each field still considered unique to the set of
    #values seen so far; initialize it once before the loop, e.g.:
    #    fieldUnique = dict((name, set()) for name in fieldNames)

    #check for unique values
    #if we are still tracking that field (we haven't yet
    #found a duplicate value).
    if fieldName in fieldUnique:
        #if the current value is a duplicate
        if value in fieldUnique[fieldName]:
            #drop the whole field as it's not unique
            del fieldUnique[fieldName]
        else:
            #add the new value to the set
            fieldUnique[fieldName].add(value)
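Put together, the whole scan might look like the following self-contained sketch. The sample data and column names are made up for illustration; the real input is a tab-delimited file with a header row of field names.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real file.
sample = "id\tname\tcity\n1\tAlice\tParis\n2\tBob\tParis\n3\tCarol\tLyon\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
fieldNames = next(reader)

# One set of seen values per field still being tracked as unique.
fieldUnique = dict((name, set()) for name in fieldNames)

for row in reader:
    for fieldName, value in zip(fieldNames, row):
        if fieldName not in fieldUnique:
            continue  # already found a duplicate in this field
        if not value or value in fieldUnique[fieldName]:
            # empty value or duplicate: the field is not unique
            del fieldUnique[fieldName]
        else:
            fieldUnique[fieldName].add(value)

# Fields remaining in fieldUnique had all-unique, non-empty values.
print(sorted(fieldUnique))  # prints ['id', 'name']
```

Note that once a field is deleted it is never re-added, which preserves the original "stop tracking after the first duplicate" behaviour.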
--
Gabriel Genellina