
Can Access use Fuzzy Logic

Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?

Apr 4 '06 #1
24 Replies


ca******@gmail.com wrote:
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?


Fuzzy logic is basically the same as determining a probability density
distribution empirically. Of course, a temperature that is considered
'cold' varies from month to month as well as from person to person. I
think all you need is a Public function that assigns a value to each
record based on the fuzzy logic in the function. Then you can use the
function's return value to identify duplicate records where duplicate
is used in a fuzzy sense.
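For illustration, a minimal sketch of such a Public function might look like the following. It is only a simplified stand-in for the fuzzy scoring described above; the field names, the digit-stripping and the key format are invented for the example.

' Sketch: reduce a record to a single comparison key that a query can
' group or join on. Nz() handles Null fields; all names are hypothetical.
Public Function DupeKey(ByVal varLastName As Variant, _
                        ByVal varPhone As Variant) As String
    Dim strName As String, strDigits As String
    Dim strPhone As String, i As Long, ch As String

    strName = UCase(Trim(Nz(varLastName, "")))
    strPhone = Nz(varPhone, "")

    ' Keep only the digits of the phone number
    For i = 1 To Len(strPhone)
        ch = Mid(strPhone, i, 1)
        If ch Like "[0-9]" Then strDigits = strDigits & ch
    Next i

    DupeKey = strName & "|" & strDigits
End Function

A query grouping on DupeKey([LastName], [Phone]) with HAVING Count(*) > 1 would then list candidate duplicate groups.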

James A. Fortune
CD********@FortuneJames.com

Apr 4 '06 #2

Not really clear on what you want to do, but I think the answer to your
question is Yes.

You can import the data to Access and then use SQL to pare the records
down to a distinct set. Check out the help file on SELECT DISTINCT and
SELECT DISTINCTROW by looking in the help file contents under Microsoft
Jet SQL Reference|Overview|SQL Reserved Words.
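As a rough sketch (the table and field names below are invented), the imported rows could be landed in a staging table and a de-duplicated copy created from VBA like this:

' Sketch: copy only distinct rows from a staging table into a new table.
' tblImport, tblDistinct and the field list are hypothetical names.
Public Sub MakeDistinctCopy()
    Dim strSQL As String

    strSQL = "SELECT DISTINCT FirstName, LastName, Phone, Email, Address " & _
             "INTO tblDistinct FROM tblImport;"

    CurrentDb.Execute strSQL, dbFailOnError
End Sub

Note that DISTINCT only removes rows that are identical character for character, which is why the cleanup steps discussed elsewhere in this thread matter.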

With that many records, I'd be very cautious about importing anomalous
values and be very strict in your Field attributes for each Record. I'd
suggest you ensure consistency of your dataset from a formatting
perspective and Trim all the fields in Excel before moving them to
Access. You'll also have to watch out for Nulls, empty strings, etc., but
if you are sure to remove all anomalies like these from your dataset,
you can do pretty much anything you want in SQL to arrive at a unique
recordset, including using the aforementioned fuzzy logic to identify
records with properly phrased WHERE clauses.

Apr 4 '06 #3


ca******@gmail.com wrote:
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?

Strictly speaking, one could use fuzzy logic here, but I would think
that a few clever heuristics might be simpler. You don't give details
on the nature of the duplication, but many slight variations on name,
address, etc. can be cleared up by normalizing values: removing extra
spaces, forcing all capitalization to be the same, forcing all phone
numbers into the same format, etc.
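A small VBA helper along those lines (a sketch only; the helper name is mine) might be:

' Sketch: trim, collapse repeated spaces, and force one case so that
' "  main   STREET " and "Main Street" compare as equal.
Public Function NormalizeText(ByVal varValue As Variant) As String
    Dim strText As String

    strText = Trim(Nz(varValue, ""))
    Do While InStr(strText, "  ") > 0
        strText = Replace(strText, "  ", " ")
    Loop
    NormalizeText = UCase(strText)
End Function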
-Will Dwinnell
http://will.dwinnell.com

Apr 5 '06 #4


CD********@FortuneJames.com wrote:
Fuzzy logic is basically the same as determining a probability density
distribution empirically. Of course, a temperature that is considered
'cold' varies from month to month as well as from person to person. I
think all you need is a Public function that assigns a value to each
record based on the fuzzy logic in the function. Then you can use the
function's return value to identify duplicate records where duplicate
is used in a fuzzy sense.

Fuzziness is not probability.

Here are links to some reasonably good introductory material on fuzzy
logic:

http://www.austinlinks.com/Fuzzy/overview.html
http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf
http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf
http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf
http://www.faqs.org/faqs/fuzzy-logic/part1/
http://www.fuzzy-logic.com/ch3.htm
-Will Dwinnell
http://will.dwinnell.com

Apr 5 '06 #5

"ca******@gmail.com" <ca******@gmail.com> wrote in
news:11**********************@t31g2000cwb.googlegroups.com:
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?


I think you've asked the wrong question. You don't want fuzzy logic
but fuzzy criteria matching.

I've de-duped millions of records, and it's very hard work.

The first thing you need to do is decide what constitutes uniqueness
(name, name+address, etc.).

Then you need to get all records into exactly the same format.
This means that you need to process certain fields to get them all
formatted the same. For instance, phone numbers might be entered as
123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
456-7890. To de-dupe on phone numbers, you'd need to first process
out all the variations. What I'd do with phone numbers is first
strip out the leading +1 or 1, then I'd strip out all characters
except the numbers. I wouldn't reformat them, but leave them that
way because the de-duping process is not going to be helped by
adding more characters. When the final data is cleaned up, you may
choose to reformat the data (or just display it with the appropriate
parens and dashes).
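A sketch of that stripping step, using the VBA string functions mentioned further down in this post (the function name, and the assumption of North American numbers, are mine):

' Sketch: keep only the digits, then drop a leading country code of 1,
' so "+1 (123) 456-7890" and "123.456.7890" both become "1234567890".
Public Function NormalizePhone(ByVal varPhone As Variant) As String
    Dim strIn As String, strOut As String
    Dim i As Long, ch As String

    strIn = Nz(varPhone, "")
    For i = 1 To Len(strIn)
        ch = Mid(strIn, i, 1)
        If ch Like "[0-9]" Then strOut = strOut & ch
    Next i

    If Len(strOut) = 11 And Left(strOut, 1) = "1" Then
        strOut = Mid(strOut, 2)
    End If

    NormalizePhone = strOut
End Function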

Another problem is that if there are multiple phone numbers, you
might end up with them in the wrong columns. That is, if there are
just Phone1 and Phone2 instead of Home, Work, Mobile, you might end
up with two records that are the same person and the Phone1 field
might match the Phone2 field of the other record. You can
restructure the data to fix this problem (normalization), by putting
the phone numbers in a separate table with a link to the source
record and the phone number. Then you'd only have to search for dups
on a single field in this phone number table.

Of course, there could be lots of false matches if, say, there are
many different people sharing the same phone number, so you wouldn't
want to use phone number by itself to determine duplicates.

To do all this parsing, you'll need to familiarize yourself with the
VBA string functions, Replace(), Mid(), Left() and Right().

Addresses are much more complicated. I've got lots of code that I've
written to process addresses into standard formats, and to evaluate
the contents and encode them in a format that is easily comparable.
But it's very complicated and I'm not prepared to share it here. The
main point I want to make is that this is a hard task.

For names, you probably want to look into the name processing
schemes, like Soundex(), Soundex2() and Simul(). Plain old Soundex()
is virtually useless (the matches are way too loose), and Simul() is
rather heavy on processing, so I generally use a combination of
Soundex(), Soundex2() and substring comparisons. However, none of
these works well with anything but standard English/American names.
If there are lots of foreign names, it is going to be less
successful.
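Note that Soundex(), Soundex2() and Simul() are not built into Access, so you would have to write or find implementations. For reference only, a simplified classic Soundex in VBA might look like this; it ignores the H/W adjacency refinement of the full algorithm and is not the code referred to above:

' Simplified Soundex sketch: first letter, then up to three digit codes.
Public Function SimpleSoundex(ByVal varName As Variant) As String
    Dim strName As String, strCode As String
    Dim strPrev As String, strCur As String
    Dim i As Long, ch As String

    strName = UCase(Trim(Nz(varName, "")))
    If Len(strName) = 0 Then Exit Function

    strCode = Left(strName, 1)
    strPrev = SoundexDigit(Left(strName, 1))

    For i = 2 To Len(strName)
        ch = Mid(strName, i, 1)
        strCur = SoundexDigit(ch)
        If strCur <> "" And strCur <> strPrev Then
            strCode = strCode & strCur
            If Len(strCode) = 4 Then Exit For
        End If
        strPrev = strCur
    Next i

    SimpleSoundex = Left(strCode & "000", 4)
End Function

Private Function SoundexDigit(ByVal ch As String) As String
    Select Case ch
        Case "B", "F", "P", "V": SoundexDigit = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z": SoundexDigit = "2"
        Case "D", "T": SoundexDigit = "3"
        Case "L": SoundexDigit = "4"
        Case "M", "N": SoundexDigit = "5"
        Case "R": SoundexDigit = "6"
        Case Else: SoundexDigit = ""   ' vowels, H, W, Y and non-letters
    End Select
End Function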

To summarize, the way I'd do this is import the spreadsheet(s) into
Access, then add fields for the processed data (so I'd retain the
original values in case I needed to go back to them), and
restructure the data into more than one table. Then I'd do my
de-duping on the resulting fields. First I'd start with exact
matches on the processed/derived fields and then I'd work with
various forms of LIKE matches on the records to find "fuzzy"
matches.
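As a concrete sketch of the exact-match pass (the table and derived field names are invented for the example), a grouping query over the processed fields will surface candidate duplicate groups for review:

' Sketch: list groups of records sharing the same normalized phone and
' the same Soundex code for the last name. Names are hypothetical.
Public Sub ListExactCandidates()
    Dim rs As DAO.Recordset
    Dim strSQL As String

    strSQL = "SELECT CleanPhone, NameSoundex, Count(*) AS Dups " & _
             "FROM tblClean " & _
             "GROUP BY CleanPhone, NameSoundex " & _
             "HAVING Count(*) > 1;"

    Set rs = CurrentDb.OpenRecordset(strSQL)
    Do While Not rs.EOF
        Debug.Print rs!CleanPhone, rs!NameSoundex, rs!Dups
        rs.MoveNext
    Loop
    rs.Close
End Sub

The looser LIKE pass can then be restricted to records that did not collapse in this step.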

The problem is that none of this can be easily done in an automated
fashion. It almost always requires a human being to make a final
determination of what really is a duplicate. What I do is try to do
as many of them with queries and code, and then have the remainder
evaluated by a human being. But with hundreds of thousands of
records, this can be impossible. So you're left with recognizing
that you can never get a perfectly de-duped resultset.

This is why it's important in a database application to have as many
protections against the production of duplicate records as possible.
This requires designing your data entry process to check for
duplicates and flag possible dupes for the users who then decide
whether or not to go ahead and create the new record, or use the
existing record. It's also important to build tools for the users so
that they can review possible duplicates and reconcile them. This
can be very complicated, and probably should be restricted to
administrative or well-trained users.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Apr 5 '06 #6

Predictor wrote:
Fuzziness is not probability.

Here are links to some reasonably good introductory material on fuzzy
logic:

http://www.austinlinks.com/Fuzzy/overview.html
http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf
http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf
http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf
http://www.faqs.org/faqs/fuzzy-logic/part1/
http://www.fuzzy-logic.com/ch3.htm
-Will Dwinnell
http://will.dwinnell.com


Great set of links on an interesting subject, thanks, Will. I was
keenly interested in FL after one of our mathematicians at Atomic Energy
gave a presentation on it in the early 90s, but have never really
pursued it.
--
Tim http://www.ucs.mun.ca/~tmarshal/
^o<
/#) "Burp-beep, burp-beep, burp-beep?" - Quaker Jake
/^^ "Whatcha doin?" - Ditto "TIM-MAY!!" - Me
Apr 5 '06 #7

David W. Fenton wrote
"ca******@gmail.com" <ca******@gmail.com> wrote
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?
I think you've asked the wrong question. You don't want fuzzy logic
but fuzzy criteria matching.


One way is finding groups of similar records, then deciding
statistically (fuzzy) whether they represent the same
object. If so, pick its most probable correct attributes.
The first thing you need to do is decide what constitutes uniqueness
(name, name+address, etc.).
Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.
Then you need to get all records into exactly the same format.
This means that you need to process certain fields to get them all
formatted the same. For instance, phone numbers might be entered as
123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
456-7890. To de-dupe on phone numbers, you'd need to first process
out all the variations. What I'd do with phone numbers is first
strip out the leading +1 or 1, then I'd strip out all characters
except the numbers. I wouldn't reformat them, but leave them that
way because the de-duping process is not going to be helped by
adding more characters. When the final data is cleaned up, you may
choose to reformat the data (or just display it with the appropriate
parens and dashes).
Similarity of phone numbers can also be defined as (a function
of) the number of common digits comparing reversed.
The problem is that none of this can be easily done in an automated
fashion. It almost always requires a human being to make a final
determination of what really is a duplicate. What I do is try to do
as many of them with queries and code, and then have the remainder
evaluated by a human being. But with hundreds of thousands of
records, this can be impossible. So you're left with recognizing
that you can never get a perfectly de-duped resultset.
A Bayesian belief network could be maintained of what
constitutes similarity. Distributions can initially be derived
from the available data. Tune and set thresholds to produce
acceptable results on new problems as they arise.
It may sound overly complex, but if it's a recurring problem
and the dataset is this simple, it may be worth building your
own (in Access).
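A full belief network is well beyond a newsgroup post, but the underlying idea of weighting several similarity measures and applying a tuned threshold can be sketched very simply; the weights and cutoff below are placeholders, not derived from any data:

' Sketch: combine several 0-to-1 field similarities into one score and
' flag the pair as a likely duplicate above a threshold.
Public Function LikelySameRecord(ByVal dblNameSim As Double, _
                                 ByVal dblPhoneSim As Double, _
                                 ByVal dblAddrSim As Double) As Boolean
    Dim dblScore As Double

    dblScore = 0.5 * dblNameSim + 0.3 * dblPhoneSim + 0.2 * dblAddrSim
    LikelySameRecord = (dblScore >= 0.8)
End Function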
This is why it's important in a database application to have as many
protections against the production of duplicate records as possible.


amen

--
Paul

Apr 5 '06 #8

Predictor wrote:
CD********@FortuneJames.com wrote:
Fuzzy logic is basically the same as determining a probability density
distribution empirically. Of course, a temperature that is considered
'cold' varies from month to month as well as from person to person. I
think all you need is a Public function that assigns a value to each
record based on the fuzzy logic in the function. Then you can use the
function's return value to identify duplicate records where duplicate
is used in a fuzzy sense.

Fuzziness is not probability.

Here are links to some reasonably good introductory material on fuzzy
logic:

http://www.austinlinks.com/Fuzzy/overview.html
http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf
http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf
http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf
http://www.faqs.org/faqs/fuzzy-logic/part1/
http://www.fuzzy-logic.com/ch3.htm
-Will Dwinnell
http://will.dwinnell.com


It's analogous to what I said. I agree that fuzziness is not
probability, but it acts in exactly the same fashion. Fuzziness is
probability density distribution determination in a fuzzy sense. I'll
check the links to see if fuzzy logic has grown beyond this. Plus, it
doesn't change the advice I gave either way.

James A. Fortune
CD********@FortuneJames.com

Apr 5 '06 #9

"kaniest" <ka*****@invalid.invalid> wrote in
news:44***********************@news.xs4all.nl:
David W. Fenton wrote
"ca******@gmail.com" <ca******@gmail.com> wrote
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?


I think you've asked the wrong question. You don't want fuzzy
logic but fuzzy criteria matching.


One way is finding groups of similar records, then deciding
statistically (fuzzy) whether they represent the same
object. If so, pick its most probable correct attributes.


This statement is meaningless to me. Can you amplify what you mean
by it and how it would be implemented?
The first thing you need to do is decide what constitutes
uniqueness (name, name+address, etc.).


Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.


Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.

Secondly, for a database there are database-level definitions of
uniqueness and then there are real-world definitions of uniqueness,
which aren't necessarily the same thing (this is the old natural vs.
surrogate key question). I find that you *can't* depend on the db
engine validation of unique data for eliminating real duplicates,
because real-world data is often incomplete.

And it won't help with variations in the way data that is actually
identical is entered (e.g., Street vs. St.).
Then you need to get all records into exactly the same
format. This means that you need to process certain fields to get
them all formatted the same. For instance, phone numbers might be
entered as 123-456-7890 and (123) 456-7890 and 123.456.7890 and
+1 123 456-7890. To de-dupe on phone numbers, you'd need to first
process out all the variations. What I'd do with phone numbers is
first strip out the leading +1 or 1, then I'd strip out all
characters except the numbers. I wouldn't reformat them, but
leave them that way because the de-duping process is not going to
be helped by adding more characters. When the final data is
cleaned up, you may choose to reformat the data (or just display
it with the appropriate parens and dashes).


Similarity of phone numbers can also be defined as (a function
of) the number of common digits comparing reversed.


Eh? Telephone numbers are unique once they are all in a common
format. What you've described sounds like you'd consider 212
123-4567 and 212 123-7654 as the same statistically, since they have
the same digits. The numbers 212 123-4567 and 212 124-4567 differ by
only one number, but are less likely to be an indication of a
duplicate record than 212 123-4567 and 212 123-4568.

So, I don't see much that can be gained by statistical analysis of
phone numbers.
The problem is that none of this can be easily done in an
automated fashion. It almost always requires a human being to
make a final determination of what really is a duplicate. What I
do is try to do as many of them with queries and code, and then
have the remainder evaluated by a human being. But with hundreds
of thousands of records, this can be impossible. So you're left
with recognizing that you can never get a perfectly de-duped
resultset.


A Bayesian belief network could be maintained of what
constitutes similarity. Distributions can initially be derived
from the available data. Tune and set thresholds to produce
acceptable results on new problems as they arise.
It may sound overly complex, but if it's a recurring problem
and the dataset is this simple, it may be worth building your
own (in Access).


I see this as a ridiculous recommendation for a practical problem,
since I don't have the mathematical skills to even begin to
implement a Bayesian algorithm (though I do understand how it
works). It would also be rather difficult to implement, I would
think.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Apr 5 '06 #10

kaniest wrote:
A Bayesian belief network could be maintained of what
constitutes similarity. Distributions can initially be derived
from the available data. Tune and set thresholds to produce
acceptable results on new problems as they arise.
It may sound overly complex, but if it's a recurring problem
and the dataset is this simple, it may be worth building your
own (in Access).


Paul,

That's a brilliant idea. It's too bad that a Dot Net data structure
rather than Access lends itself to this approach. If this is done in
software it should work at least as well as Fuzzy Logic based code.
Fuzzy logic tends to shine in hardware implementation. A Bayesian
belief network would have to be optimally pruned for burning onto chips
once a decent distribution converges but that idea can't match the
hardware options or sizes available with Fuzzy Logic.

James A. Fortune
CD********@FortuneJames.com

Conversation from about 15 years ago:
Me: Why did you start your node probabilities at zero rather than using
the information gained from building it in the first place?
M. Ramsisi: Because all the text books do it that way.

Apr 5 '06 #11


David W. Fenton wrote:
"kaniest" <ka*****@invalid.invalid> wrote in
news:44***********************@news.xs4all.nl:
David W. Fenton wrote
"ca******@gmail.com" <ca******@gmail.com> wrote
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?

I think you've asked the wrong question. You don't want fuzzy
logic but fuzzy criteria matching.
One way is finding groups of similar records, then deciding
statistically (fuzzy) whether they represent the same
object. If so, pick its most probable correct attributes.


This statement is meaningless to me. Can you amplify what you mean
by it and how it would be implemented?


Several links on fuzzy logic have been given.
The first thing you need to do is decide what constitutes
uniqueness (name, name+address, etc.).
Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.


Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.


Some people keep databases in directories and text files ...
Why not in a spreadsheet?
Similarity of phone numbers can also be defined as (a function
of) the number of common digits comparing reversed.


Eh? Telephone numbers are unique once they are all in a common
format. What you've described sounds like you'd consider 212
123-4567 and 212 123-7654 as the same statistically, since they have
the same digits. The numbers 212 123-4567 and 212 124-4567 differ by
only one number, but are less likely to be an indication of a
duplicate record than 212 123-4567 and 212 123-4568.


I meant the length of a common tail. That could be one of
the applied measures, along with many others that take phone
numbers into account.
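In code, the length of a common tail between two already-normalized numbers could be measured like this (a sketch; the name is mine):

' Sketch: count how many trailing digits two normalized phone numbers
' share, e.g. 2121234567 and 7181234567 share a tail of length 7.
Public Function CommonTailLength(ByVal strA As String, ByVal strB As String) As Long
    Dim n As Long

    Do While n < Len(strA) And n < Len(strB)
        If Right(strA, n + 1) <> Right(strB, n + 1) Then Exit Do
        n = n + 1
    Loop
    CommonTailLength = n
End Function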
So, I don't see much that can be gained by statistical analysis of
phone numbers.


The point is that several competing measures of similarity can be
weighted against each other. Simply pick measures that work,
whatever their "real" meaning may be.
Apr 5 '06 #12

CD********@FortuneJames.com schreef:
kaniest wrote:
A Bayesian belief network could be maintained of what
constitutes similarity. Distributions can initially be derived
from the available data. Tune and set thresholds to produce
acceptable results on new problems as they arise.
It may sound overly complex, but if it's a recurring problem
and the dataset is this simple, it may be worth building your
own (in Access).


Paul,

That's a brilliant idea. It's too bad that a Dot Net data structure
rather than Access lends itself to this approach. If this is done in
software it should work at least as well as Fuzzy Logic based code.
Fuzzy logic tends to shine in hardware implementation. A Bayesian
belief network would have to be optimally pruned for burning onto
chips once a decent distribution converges but that idea can't match
the hardware options or sizes available with Fuzzy Logic.

Yes, an unattended system that works from the basic data would require some
special tricks. But creating a simple network manually (after some data
mining and decision analysis) and coding it shouldn't be that much of a
problem. Some marketeers do it all the time.

Apr 5 '06 #13

kaniest wrote:
David W. Fenton wrote:

Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.


Some people keep databases in directories and text files ...
Why not in a spreadsheet?


Because if one wishes to apply relational principles and a means to
enforce data integrity, a spreadsheet or a text file or a directory
structure isn't going to do it. For one person managing one's own data,
it's OK, but especially in an environment where multiple people need to
access the information, forget it.

But you're right. A collection of data does not necessarily have to be
kept in a specific database jar. It's like clothes. I know what I have
when it's strewn around my room. But when my maid or laundry service
needs to put away clean clothes, she needs a set of drawers/closets in
order to know where to put things.

--
Tim http://www.ucs.mun.ca/~tmarshal/
^o<
/#) "Burp-beep, burp-beep, burp-beep?" - Quaker Jake
/^^ "Whatcha doin?" - Ditto "TIM-MAY!!" - Me
Apr 5 '06 #14

kaniest wrote:
CD********@FortuneJames.com schreef:
kaniest wrote:
A Bayesian belief network could be maintained of what
constitutes similarity. Distributions can initially be derived
from the available data. Tune and set thresholds to produce
acceptable results on new problems as they arise.
It may sound overly complex, but if it's a recurring problem
and the dataset is this simple, it may be worth building your
own (in Access).


Paul,

That's a brilliant idea. It's too bad that a Dot Net data structure
rather than Access lends itself to this approach. If this is done in
software it should work at least as well as Fuzzy Logic based code.
Fuzzy logic tends to shine in hardware implementation. A Bayesian
belief network would have to be optimally pruned for burning onto
chips once a decent distribution converges but that idea can't match
the hardware options or sizes available with Fuzzy Logic.

Yes, an unattended system that works from the basic data would require some
special tricks. But creating a simple network manually (after some data
mining and decision analysis) and coding it shouldn't be that much of a
problem. Some marketeers do it all the time.


I agree that the data mining and decision analysis (or their
equivalents) are important parts of the process. Everybody thinks they
know how to connect the data better than what an analysis shows to be
the best. That often results in, ahem..., non-optimal performance. A
simple network shouldn't be too much trouble to set up in Access. The
concept of data normalization is how database designers attempt to
minimize entropy. For the way I would go about it, Access would not be
a good choice, especially if further development is anticipated.

James A. Fortune
CD********@FortuneJames.com

Apr 5 '06 #15

CD********@FortuneJames.com wrote:
I agree that the data mining and decision analysis (or their
equivalents) are important parts of the process. Everybody thinks
they know how to connect the data better than what an analysis shows
to be the best. That often results in, ahem..., non-optimal
performance. A simple network shouldn't be too much trouble to set
up in Access. The concept of data normalization is how database
designers attempt to minimize entropy. For the way I would go about
it, Access would not be a good choice, especially if further
development is anticipated.


Serious apps in this area often "grow" from some manager's
spreadsheet. Migrating the manager to Access may be an
improvement ...
Apr 5 '06 #16

Tim Marshall schreef:
kaniest wrote:
David W. Fenton wrote:

Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.


Some people keep databases in directories and text files ...
Why not in a spreadsheet?


Because if one wishes to apply relational principles and a means to
enforce data integrity, a spreadsheet or a text file or a directory
structure isn't going to do it. For one person managing one's own
data, it's OK, but especially in an environment where multiple people
need to access the information, forget it.


Sounds almost like Access. Anyway, when handling structured data
some kind of db engine or db _management system_ comes in handy.

--
Paul
Apr 6 '06 #17

"kaniest" <ka*****@invalid.invalid> wrote in
news:44***********************@news.xs4all.nl:
David W. Fenton wrote:
"kaniest" <ka*****@invalid.invalid> wrote in
news:44***********************@news.xs4all.nl:
David W. Fenton wrote
"ca******@gmail.com" <ca******@gmail.com> wrote
> Here's the issue:
>
> I have roughly 20 MS Excel spreadsheets; each row contains a record.
> These records were hand entered by people in call centers.
>
> The problem is, there can be and are duplicate phone numbers, emails,
> addresses, and even person names. I need to sift through all this data
> (roughly 300,000+ records) and use fuzzy logic to break it down, so that
> I have only unique records.
>
> Can I use Access, or something else, to sort through all this data?

I think you've asked the wrong question. You don't want fuzzy
logic but fuzzy criteria matching.

One way is finding groups of similar records, then deciding
statistically (fuzzy) whether they represent the same
object. If so, pick its most probable correct attributes.


This statement is meaningless to me. Can you amplify what you
mean by it and how it would be implemented?


Several links on fuzzy logic has been given.


But what's the real-world application to *this* problem space?
The first thing you need to do is decide what constitutes
uniqueness (name, name+address, etc.).

Actually, when using fuzzy logic you want to decide what
constitutes similarity. Uniqueness will already be defined
in the database.


Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.


Some people keep databases in directories and text files ...
Why not in a spreadsheet?


Huh?

The key point here is that the accuracy of any "fuzzy logic"
comparison is going to be increased if your source data is already
pre-processed and regularized. A spreadsheet has very few facilities
for maintaining data consistency during the data entry process. A
database *does* have the ability to define what constitutes
uniqueness at the db engine level, but in the present instance,
there is no engine enforcing any uniqueness rules in the data entry
process.

Of course, it could very well be that the spreadsheet is just a data
transfer medium, and not the data entry method. It may be that the
data actually comes out of a database and actually *did* have data
validation and uniqueness rules applied to it.

But the point I'm making is that any de-duping will be made more
accurate by pre-processing the data to weed out irregularities in
the data entry, by regularizing the format of the data values.
Similarity of phone numbers can also be defined as (a function
of) the number of common digits comparing reversed.


Eh? Telephone numbers are unique once they are all in a common
format. What you've described sounds like you'd consider 212
123-4567 and 212 123-7654 as the same statistically, since they
have the same digits. The numbers 212 123-4567 and 212 124-4567
differ by only one number, but are less likely to be an
indication of a duplicate record than 212 123-4567 and 212
123-4568.


I meant the length of a common tail. That could be one of
the applied measures, along with many others that take phone
numbers into account.


"Common tail" is a statistical term (if I'm understanding it
correctly, like the long tail of a distribution), not a database
term. You'll have to explain how it applies to database operations.
So, I don't see much that can be gained by statistical analysis
of phone numbers.


The point is that several competing measures of similarity can be
weighted against each other. Simply pick measures that work,
whatever their "real" meaning may be.


I understand now. But it would help those of us who have no real
statistical background if, when you're posting in a database
newsgroup, you translate the jargon into terms that mean something.
This whole back and forth could have been avoided.

I don't know that the original poster has actually clarified whether
he really meant "fuzzy logic" as you're defining it, or if he meant
"fuzzy matches".

My address de-duping routines actually do use certain kinds of
measures of similarity between certain fields. But I've never done
any kind of real statisticial evaluation of this, just an ad hoc
choice based on eyeballing the results.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Apr 6 '06 #18

David W. Fenton schreef:
I don't know that the original poster has actually clarified whether
he really meant "fuzzy logic" as you're defining it, or if he meant
"fuzzy matches".
Right, fuzzy logic proper works with a fuzzy notion
of set-membership. Then "fuzzy" is used in many other
areas where an imprecise result is appreciated.
My address de-duping routines actually do use certain kinds of
measures of similarity between certain fields. But I've never done
any kind of real statisticial evaluation of this, just an ad hoc
choice based on eyeballing the results.


Since you seem familiar with the problem, I'd figure you may
appreciate the possibilities of another perspective on the
matter. Replacing Boolean criteria with statistical criteria may
solve some problems, but the interplay of criteria (possibly
using Bayes' rule) carries things one step further.

keep up the good work,
Paul
Apr 6 '06 #19

"kaniest" <ka*****@invalid.invalid> wrote
Well, since the source data is a spreadsheet, there *aren't* any
definitions of what constitutes uniqueness.


Some people keep databases in directories and text files ...
Why not in a spreadsheet?


I disagree. I would agree that some people keep "data" in directories, text
files, and spreadsheets, but those are not, in my view, "databases", and
especially not "relational databases." Databases have structure and rules
for storing and retrieving data, that directories, text files, and
spreadsheets lack. Calling something a "database" doesn't, in fact, make it
one.

A given database implementation may use directories, text files, and perhaps
even spreadsheets. Access does not -- it has a monolithic file structure
containing the database objects.

Just because you have some data stored, and in someone's view it constitutes
a "database" does not mean it is even one step along the path from raw,
random data to a relational database.

David is correct that deduplicating/cleaning/scrubbing data that is not in
proper relational form is a "decidedly non-trivial task" and one that is
unlikely to be _easily_ handled even by a highly skilled database developer
/ programmer.

Larry Linson
Microsoft Access MVP
Apr 6 '06 #20

Larry Linson wrote
Just because you have some data stored, and in someone's view it
constitutes a "database" does not mean it is even one step along the
path from raw, random data to a relational database.


The validity of the data in an RDBMS does not guarantee its
relevance. The merits of solutions to import problems
compare to the (business) risk of errors - manual intervention,
missing accounts, offended customers, etc.
Consider James' remark on connecting data. Of course
it requires expertise and skill - and analysis. Access isn't
that different from other tools.

--
Paul
Apr 6 '06 #21

"kaniest" <ka*****@invalid.invalid> wrote
Larry Linson wrote
Just because you have some data stored, and in someone's view it
constitutes a "database" does not mean it is even one step along the
path from raw, random data to a relational database.


The validity of the data in a rdbms does not guarantee its
relevance. The merits of solutions to import problems
compare to the (busines) risk of errors - manual intervention,
missing accounts, offended customers, etc.
Consider James' remark on connecting data. Of course
it requires expertise and skill - and analysis. Access isn't
that different from other tools.


Are you saying "GIGO" in a lot of words? That's been a given, ever since the
computers I used were made with vacuum tubes.

"Relevance" of data was not the subject of this thread... but a specific
question about using databases, especially Access with "fuzzy logic" as a
solution to a particular problem posed by the original poster. "Fuzzy logic"
or "fuzzy matching" are not native features in Access and may, or may not,
be useful for the o.p.'s needs. Reminds me of some time I spent, about
twenty years ago, working with rule-based systems in AI when all the pundits
predicted that the next year was going to be "the Year of Artificial
Intelligence", for several years running. None of those years actually was
"the Year of AI", and interest waned in the subject.

Access is, in fact, the UI and development tool. The actual database engine
can be the Jet database engine that comes with Access and is installed by
default, or you can use Access as a client to the heaviest-duty server DBs.
Access and whatever database engine you choose are a combination that is
quite different from some user's collection of directories, files, and
spreadsheets. And, so are a number of other databases.
Apr 7 '06 #22

Larry Linson schreef:
"kaniest" <ka*****@invalid.invalid> wrote
Larry Linson wrote
Just because you have some data stored, and in someone's view it
constitutes a "database" does not mean it is even one step along the
path from raw, random data to a relational database.


The validity of the data in an RDBMS does not guarantee its
relevance. The merits of solutions to import problems
compare to the (business) risk of errors - manual intervention,
missing accounts, offended customers, etc.
Consider James' remark on connecting data. Of course
it requires expertise and skill - and analysis. Access isn't
that different from other tools.


Are you saying "GIGO" in a lot of words? That's been a given, ever
since the computers I used were made with vacuum tubes.

"Relevance" of data was not the subject of this thread... but a
specific question about using databases, especially Access with
"fuzzy logic" as a solution to a particular problem posed by the
original poster. "Fuzzy logic" or "fuzzy matching" are not native
features in Access and may, or may not, be useful for the o.p.'s
needs. Reminds me of some time I spent, about twenty years ago,
working with rule-based-systems in AI when all the pundits predicted
that the next year was going to be "the Year of Artificial
Intelligence", for several years running. None of those years
actually was "the Year of AI", and interest waned in the subject.


At the time we (being db guys) figured that if it really works,
it's not AI. It's one thing to offer the user a set of near matches
from a non-equi join (like a Levenshtein distance in another
thread), it's another thing to decide how to modify invalid import
data to make it pass validation.
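For readers who have not met it, the Levenshtein distance mentioned above is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another; a textbook dynamic-programming version in VBA (generic, not taken from the other thread) is roughly:

' Sketch: classic edit distance (insert/delete/substitute) between two strings.
Public Function Levenshtein(ByVal s As String, ByVal t As String) As Long
    Dim d() As Long
    Dim i As Long, j As Long, cost As Long

    ReDim d(0 To Len(s), 0 To Len(t))

    For i = 0 To Len(s)
        d(i, 0) = i
    Next i
    For j = 0 To Len(t)
        d(0, j) = j
    Next j

    For i = 1 To Len(s)
        For j = 1 To Len(t)
            If Mid(s, i, 1) = Mid(t, j, 1) Then cost = 0 Else cost = 1
            d(i, j) = d(i - 1, j) + 1                               ' deletion
            If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1        ' insertion
            If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost  ' substitution
        Next j
    Next i

    Levenshtein = d(Len(s), Len(t))
End Function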
BTW, analysis doesn't necessarily show that an experienced
developer or handler makes suboptimal choices, although
there may be surprises. Where most specs were once decided from
"this is the best we can do for the money" proposals by
techies, there's now an increasing demand for transparent,
rational trade-offs in a wider scope.

--
Paul
Apr 7 '06 #23

David W. Fenton wrote:
"ca******@gmail.com" <ca******@gmail.com> wrote in
news:11**********************@t31g2000cwb.googlegroups.com:
Here's the issue:

I have roughly 20 MS Excel spreadsheets; each row contains a record.
These records were hand entered by people in call centers.

The problem is, there can be and are duplicate phone numbers, emails,
addresses, and even person names. I need to sift through all this data
(roughly 300,000+ records) and use fuzzy logic to break it down, so that
I have only unique records.

Can I use Access, or something else, to sort through all this data?


I think you've asked the wrong question. You don't want fuzzy logic
but fuzzy criteria matching.

I've de-duped millions of records, and it's very hard work.

The first thing you need to do is decide what constitutes uniqueness
(name, name+address, etc.).

Then you need to get all records into exactly the same format.
This means that you need to process certain fields to get them all
formatted the same. For instance, phone numbers might be entered as
123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123
456-7890. To de-dupe on phone numbers, you'd need to first process
out all the variations. What I'd do with phone numbers is first
strip out the leading +1 or 1, then I'd strip out all characters
except the numbers. I wouldn't reformat them, but leave them that
way because the de-duping process is not going to be helped by
adding more characters. When the final data is cleaned up, you may
choose to reformat the data (or just display it with the appropriate
parens and dashes).

Another problem is that if there are multiple phone numbers, you
might end up with them in the wrong columns. That is, if there are
just Phone1 and Phone2 instead of Home, Work, Mobile, you might end
up with two records that are the same person and the Phone1 field
might match the Phone2 field of the other record. You can
restructure the data to fix this problem (normalization), by putting
the phone numbers in a separate table with a link to the source
record and the phone number. Then you'd only have to search for dups
on a single field in this phone number table.

Of course, there could be lots of false matches if, say, there are
many different people sharing the same phone number, so you wouldn't
want to use phone number by itself to determine duplicates.

To do all this parsing, you'll need to familiarize yourself with the
VBA string functions, Replace(), Mid(), Left() and Right().

Addresses are much more complicated. I've got lots of code that I've
written to process addresses into standard formats, and to evaluate
the contents and encode them in a format that is easily comparable.
But it's very complicated and I'm not prepared to share it here. The
main point I want to make is that this is a hard task.


Regular expressions are much more powerful than VBA string functions
for parsing phone numbers, email addresses, etc. out of unstructured
data. And OLE makes it easy to use VBScript's RegExp object within an
Access app. Plus you'll be able to leverage your knowledge of RegEx in
just about every other programming environment you encounter in the
future.
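For example (a sketch; the pattern and the function name are mine), a late-bound use of the VBScript RegExp object to keep only the digits of a phone field:

' Sketch: strip everything except digits using VBScript's RegExp object.
Public Function DigitsOnly(ByVal varValue As Variant) As String
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True                 ' replace all matches, not just the first
    re.Pattern = "[^0-9]"            ' any character that is not a digit
    DigitsOnly = re.Replace(Nz(varValue, ""), "")
End Function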

Search MSDN or the Internet to learn more about what you can do with
RegEx, or check out Jeffrey E. F. Friedl's "Mastering Regular
Expressions" book (published by O'Reilly).

Kevin

Apr 7 '06 #24

"Kevin G. Anderson" <kg*****@pobox.com> wrote in
news:11*********************@g10g2000cwb.googlegroups.com:
Regular expressions are a much more powerful than VBA string
functions for parsing phone numbers, email addresses, etc. out of
unstructured data. And OLE makes it easy to use VBScript's RegEx
object within an Access app. Plus you'll be able to leverage your
knowledge of RegEx in just about every other programming
evironment you encounter in the future.
I, for one, have never found RegExp to be easy to understand at all.
My eyes just glaze over when I try to figure it out, to be honest.
Search MSDN or the Internet to learn more about what you can do
with RegEx, or check out Jeffrey E. F. Friedl's "Mastering Regular
Expressions" book (published by O'Reilly).


I don't see how this helps you. I don't know how RegExp works for
stripping out characters, but given that it's an external library,
while VBA string handling functions are native to Access (and thus,
perhaps, more efficient when accessed through the Access Expression
Service (which is how functions are used in queries)), I would
expect it to be less efficient than the native VBA string handling
functions.

On the other hand, if you're writing custom functions using the
string handling tools, and RegExp provides a function that does the
same thing, the RegExp is perhaps going to be faster.

If you use RegExp already, then it's certainly worth looking at.

But if you don't, I'm not certain there's going to be much utility
to delving too deeply into learning it.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Apr 7 '06 #25
