bellefy@gmail.com wrote:[color=blue]
> key|streetnum|streetname|streetsuffix|unitnum|
> ---------------------------------------------
> 12345|2069|Happy St.|NULL|A-2|
> ACGH23|2069|Happy|St.|A-2|
>
>
> So my question is: How can I identify duplicate records using multiple
> columns without looping through the entire recordset in PHP?[/color]
You can do a self-join:
SELECT t1.key, t2.key
FROM tbl AS t1 INNER JOIN tbl AS t2 ON
CONCAT_WS(' ', t1.streetnum, t1.streetname, t1.streetsuffix, t1.unitnum) =
CONCAT_WS(' ', t2.streetnum, t2.streetname, t2.streetsuffix, t2.unitnum)
That catches the specific example you give above. Unfortunately,
finding duplicates in such free-form data is one of those limitless
problems. There are so many ways that the data can be "the same" that
it's hard to automate.
For instance, what if `streetname` contains the string such as "Happy
St., suite A-2"? Is that equal to "Happy St #A2"? How can you make an
expression to compare these?
Database analysts are often assigned with multi-week projects to de-dupe
vast sets of data. It's unlikely that you'll ever catch 100% of the
duplicates through a single query, but you can catch some reasonable set
of them. Then it becomes a question of how much work is it worth to
catch the last few cases?
Sometimes the most successful method is to sort the dataset by
streetname (or other likely key) and then use your eyeballs to spot
duplicates. A very tedious task, but much more likely to spot all the
uncommon cases of duplicates.
[color=blue]
> I tried using CONCAT to combine the columns into one string but didn't
> work because CONCAT fails if any column is NULL.[/color]
Read the description of CONCAT_WS() here:
http://dev.mysql.com/doc/mysql/en/string-functions.html
That function skips NULL arguments, instead of returning NULL for the
whole expression.
Regards,
Bill K.