
# Can Access use Fuzzy Logic?

 P: n/a Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? Apr 4 '06 #1
24 Replies

 P: n/a ca******@gmail.com wrote: Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? Fuzzy logic is basically the same as determining a probability density distribution empirically. Of course, a temperature that is considered 'cold' varies from month to month as well as from person to person. I think all you need is a Public function that assigns a value to each record based on the fuzzy logic in the function. Then you can use the function's return value to identify duplicate records, where duplicate is used in a fuzzy sense. James A. Fortune CD********@FortuneJames.com Apr 4 '06 #2
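James's "Public function that assigns a value to each record" idea can be sketched outside Access; the Python fragment below is illustrative only (an Access solution would use VBA, and the function names, field choices, and 0.85 threshold are my own assumptions). It collapses a record's identifying fields into one normalized key and then compares keys with a graded ratio rather than a strict equality test:

```python
from difflib import SequenceMatcher

def record_key(name, phone, email):
    """Collapse a record's identifying fields into one comparable key,
    normalizing case, whitespace, and phone formatting along the way."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return f"{name.strip().lower()}|{digits}|{email.strip().lower()}"

def fuzzy_match(rec_a, rec_b, threshold=0.85):
    """Graded comparison of two (name, phone, email) tuples:
    a ratio of 1.0 means the normalized keys are identical."""
    a, b = record_key(*rec_a), record_key(*rec_b)
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Records that differ only in formatting then score as duplicates, while genuinely different records fall well below the threshold.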

 P: n/a Not really clear on what you want to do, but I think the answer to your question is Yes. You can import the data to Access and then use SQL to pare the records down to a distinct set. Check out the help file on SELECT DISTINCT and SELECT DISTINCTROW by looking in the help file contents under Microsoft Jet SQL Reference|Overview|SQL Reserved Words. With that many records, I'd be very cautious about importing radicals and be very strict in your Field attributes for each Record. I'd suggest you ensure continuity of your dataset from a formatting perspective and Trim all the fields in Excel before moving them to Access. You'll also have to watch out for Nulls, Empty Sets, etc., but if you are sure to remove all radicals like these from your dataset, you can do pretty well anything you want in SQL to arrive at a unique recordset, including using the aforementioned fuzzy logic to identify records with properly phrased WHERE clauses. Apr 4 '06 #3
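For readers without Jet at hand, SELECT DISTINCT behaves the same way in any SQL engine; this sketch uses Python's built-in sqlite3 module purely as a stand-in (the table name and sample rows are invented). Note that DISTINCT only collapses *exact* duplicates, which is why the post stresses Trimming and normalizing the fields first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts (name TEXT, phone TEXT)")
con.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ann Lee", "5551234"), ("Ann Lee", "5551234"), ("Bob Ray", "5559876")],
)
# SELECT DISTINCT collapses exact duplicates; it will NOT catch
# near-duplicates like "Ann  Lee" vs "Ann Lee", so normalize first.
rows = con.execute(
    "SELECT DISTINCT name, phone FROM contacts ORDER BY name"
).fetchall()
```

The same statement pasted into an Access query (against an imported table) pares the recordset down the way the post describes.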

 P: n/a ca******@gmail.com wrote: Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? Strictly speaking, one could use fuzzy logic here, but I would think that a few clever heuristics might be simpler. You don't give details on the nature of the duplication, but many slight variations on name, address, etc. can be cleared up by normalizing values: removing extra space, forcing all capitalization to be the same, forcing all phone numbers into the same format, etc. -Will Dwinnell http://will.dwinnell.com Apr 5 '06 #4
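The normalization heuristics Will describes -- collapse extra space, force one capitalization, force one phone format -- are only a few lines in any language. A Python sketch (the function names are mine):

```python
import re

def normalize(value):
    """Cheap heuristics: collapse runs of whitespace and force one
    capitalization, so 'JOHN  SMITH ' and 'John Smith' compare equal."""
    return re.sub(r"\s+", " ", value).strip().lower()

def normalize_phone(value):
    """Force every phone number into the same format: digits only."""
    return re.sub(r"\D", "", value)
```

Applying these before any DISTINCT or matching step removes a large share of the "slight variation" duplicates up front.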

 P: n/a CD********@FortuneJames.com wrote: Fuzzy logic is basically the same as determining a probability density distribution empirically. Of course, a temperature that is considered 'cold' varies from month to month as well as from person to person. I think all you need is a Public function that assigns a value to each record based on the fuzzy logic in the function. Then you can use the function's return value to identify duplicate records where duplicate is used in a fuzzy sense. Fuzziness is not probability. Here are links to some reasonably good introductory material on fuzzy logic: http://www.austinlinks.com/Fuzzy/overview.html http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf http://www.faqs.org/faqs/fuzzy-logic/part1/ http://www.fuzzy-logic.com/ch3.htm -Will Dwinnell http://will.dwinnell.com Apr 5 '06 #5

 P: n/a Predictor wrote: Fuzziness is not probability. Here are links to some reasonably good introductory material on fuzzy logic: http://www.austinlinks.com/Fuzzy/overview.html http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf http://www.faqs.org/faqs/fuzzy-logic/part1/ http://www.fuzzy-logic.com/ch3.htm -Will Dwinnell http://will.dwinnell.com Great set of links on an interesting subject, thanks, Will. I was keenly interested in FL after one of our mathematicians at Atomic Energy gave a presentation on it in the early 90s, but have never really pursued it. -- Tim http://www.ucs.mun.ca/~tmarshal/ ^o< /#) "Burp-beep, burp-beep, burp-beep?" - Quaker Jake /^^ "Whatcha doin?" - Ditto "TIM-MAY!!" - Me Apr 5 '06 #7

 P: n/a David W. Fenton wrote "ca******@gmail.com" wrote Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? I think you've asked the wrong question. You don't want fuzzy logic but fuzzy criteria matching. One way is finding groups of similar records, then deciding statistically (fuzzy) whether they represent the same object. If so, pick its most probable correct attributes. The first thing you need to do is decide what constitutes uniqueness (name, name+address, etc.). Actually, when using fuzzy logic you want to decide what constitutes similarity. Uniqueness will already be defined in the database. Then you need to get all records into exactly the same format. This means that you need to process certain fields to get them all formatted the same. For instance, phone numbers might be entered as 123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123 456-7890. To de-dupe on phone numbers, you'd need to first process out all the variations. What I'd do with phone numbers is first strip out the leading +1 or 1, then I'd strip out all characters except the numbers. I wouldn't reformat them, but leave them that way because the de-duping process is not going to be helped by adding more characters. When the final data is cleaned up, you may choose to reformat the data (or just display it with the appropriate parens and dashes). Similarity of phone numbers can also be defined as (a function of) the number of common digits when compared reversed. The problem is that none of this can be easily done in an automated fashion.
It almost always requires a human being to make a final determination of what really is a duplicate. What I do is try to do as many of them with queries and code, and then have the remainder evaluated by a human being. But with hundreds of thousands of records, this can be impossible. So you're left with recognizing that you can never get a perfectly de-duped resultset. A Bayesian belief network of what constitutes similarity could be maintained. Distributions can initially be derived from the available data. Tune and set thresholds to produce acceptable results on new problems as they arise. It may sound overly complex, but if it's a repeating problem and the dataset is this simple, it may be worth building your own (in Access). This is why it's important in a database application to have as many protections against the production of duplicate records as possible. amen -- Paul Apr 5 '06 #8
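David's phone-number recipe (strip the leading +1 or 1, then strip everything except the digits) might look like this in Python -- a sketch, with the 11-digit/US-country-code assumption made explicit in a comment:

```python
import re

def dedupe_key(phone):
    """Strip every non-digit, then drop a leading US country code:
    assumes a bare 10-digit national number is the desired key."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```

All four formats from the post then collapse to the same key, which is exactly what the de-duping pass needs.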

 P: n/a Predictor wrote: CD********@FortuneJames.com wrote: Fuzzy logic is basically the same as determining a probability density distribution empirically. Of course, a temperature that is considered 'cold' varies from month to month as well as from person to person. I think all you need is a Public function that assigns a value to each record based on the fuzzy logic in the function. Then you can use the function's return value to identify duplicate records where duplicate is used in a fuzzy sense. Fuzziness is not probability. Here are links to some reasonably good introductory material on fuzzy logic: http://www.austinlinks.com/Fuzzy/overview.html http://www.ncst.ernet.in/education/a...uzzy/fuzzy.pdf http://www.fpk.tu-berlin.de/~anderl/...uzzyintro4.pdf http://www.phys.ufl.edu/docs/matlab/...y/fuzzy_tb.pdf http://www.faqs.org/faqs/fuzzy-logic/part1/ http://www.fuzzy-logic.com/ch3.htm -Will Dwinnell http://will.dwinnell.com It's analogous to what I said. I agree that fuzziness is not probability, but it acts in exactly the same fashion. Fuzziness is probability density distribution determination in a fuzzy sense. I'll check the links to see if fuzzy logic has grown beyond this. Plus, it doesn't change the advice I gave either way. James A. Fortune CD********@FortuneJames.com Apr 5 '06 #9

 P: n/a "kaniest" wrote in news:44***********************@news.xs4all.nl: David W. Fenton wrote "ca******@gmail.com" wrote Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? I think you've asked the wrong question. You don't want fuzzy logic but fuzzy criteria matching. One way is finding groups of similar records, then deciding statistically (fuzzy) whether they represent the same object. If so, pick its most probable correct attributes. This statement is meaningless to me. Can you amplify what you mean by it and how it would be implemented? The first thing you need to do is decide what constitutes uniqueness (name, name+address, etc.). Actually, when using fuzzy logic you want to decide what constitutes similarity. Uniqueness will already be defined in the database. Well, since the source data is a spreadsheet, there *aren't* any definitions of what constitutes uniqueness. Secondly, for a database there are database-level definitions of uniqueness and then there are real-world definitions of uniqueness, which aren't necessarily the same thing (this is the old natural vs. surrogate key question). I find that you *can't* depend on the db engine validation of unique data for eliminating real duplicates, because real-world data is often incomplete. And it won't help with variations in the way data that is actually identical is entered (e.g., Street vs. St.). Then you need to get all records into exactly the same format. This means that you need to process certain fields to get them all formatted the same.
For instance, phone numbers might be entered as 123-456-7890 and (123) 456-7890 and 123.456.7890 and +1 123 456-7890. To de-dupe on phone numbers, you'd need to first process out all the variations. What I'd do with phone numbers is first strip out the leading +1 or 1, then I'd strip out all characters except the numbers. I wouldn't reformat them, but leave them that way because the de-duping process is not going to be helped by adding more characters. When the final data is cleaned up, you may choose to reformat the data (or just display it with the appropriate parens and dashes). Similarity of phone numbers can also be defined as (a function of) the number of common digits when compared reversed. Eh? Telephone numbers are unique once they are all in a common format. What you've described sounds like you'd consider 212 123-4567 and 212 123-7654 as the same statistically, since they have the same digits. The numbers 212 123-4567 and 212 124-4567 differ by only one number, but are less likely to be an indication of a duplicate record than 212 123-4567 and 212 123-4568. So, I don't see much that can be gained by statistical analysis of phone numbers. The problem is that none of this can be easily done in an automated fashion. It almost always requires a human being to make a final determination of what really is a duplicate. What I do is try to do as many of them with queries and code, and then have the remainder evaluated by a human being. But with hundreds of thousands of records, this can be impossible. So you're left with recognizing that you can never get a perfectly de-duped resultset. A Bayesian belief network of what constitutes similarity could be maintained. Distributions can initially be derived from the available data. Tune and set thresholds to produce acceptable results on new problems as they arise. It may sound overly complex, but if it's a repeating problem and the dataset is this simple, it may be worth building your own (in Access).
I see this as a ridiculous recommendation for a practical problem, since I don't have the mathematical skills to even begin to implement a Bayesian algorithm (though I do understand how it works). It would also be rather difficult to implement, I would think. -- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/ Apr 5 '06 #10

 P: n/a kaniest wrote: A Bayesian belief network of what constitutes similarity could be maintained. Distributions can initially be derived from the available data. Tune and set thresholds to produce acceptable results on new problems as they arise. It may sound overly complex, but if it's a repeating problem and the dataset is this simple, it may be worth building your own (in Access). Paul, That's a brilliant idea. It's too bad that a Dot Net data structure rather than Access lends itself to this approach. If this is done in software it should work at least as well as Fuzzy Logic based code. Fuzzy logic tends to shine in hardware implementation. A Bayesian belief network would have to be optimally pruned for burning onto chips once a decent distribution converges, but that idea can't match the hardware options or sizes available with Fuzzy Logic. James A. Fortune CD********@FortuneJames.com Conversation from about 15 years ago: Me: Why did you start your node probabilities at zero rather than using the information gained from building it in the first place? M. Ramsisi: Because all the textbooks do it that way. Apr 5 '06 #11

 P: n/a David W. Fenton wrote: "kaniest" wrote in news:44***********************@news.xs4all.nl: David W. Fenton wrote "ca******@gmail.com" wrote Here's the issue: I have roughly 20 MS Excel spreadsheets; each row contains a record. These records were hand entered by people in call centers. The problem is, there can be and are duplicate phone numbers, emails, addresses, even person names. I need to sift through all this data (roughly 300,000+ records) and use fuzzy logic to break it down so that I have only unique records. Can I use Access or what to sort through all this data? I think you've asked the wrong question. You don't want fuzzy logic but fuzzy criteria matching. One way is finding groups of similar records, then deciding statistically (fuzzy) whether they represent the same object. If so, pick its most probable correct attributes. This statement is meaningless to me. Can you amplify what you mean by it and how it would be implemented? Several links on fuzzy logic have been given. The first thing you need to do is decide what constitutes uniqueness (name, name+address, etc.). Actually, when using fuzzy logic you want to decide what constitutes similarity. Uniqueness will already be defined in the database. Well, since the source data is a spreadsheet, there *aren't* any definitions of what constitutes uniqueness. Some people keep databases in directories and text files ... Why not in a spreadsheet? Similarity of phone numbers can also be defined as (a function of) the number of common digits when compared reversed. Eh? Telephone numbers are unique once they are all in a common format. What you've described sounds like you'd consider 212 123-4567 and 212 123-7654 as the same statistically, since they have the same digits. The numbers 212 123-4567 and 212 124-4567 differ by only one number, but are less likely to be an indication of a duplicate record than 212 123-4567 and 212 123-4568. I meant the length of a common tail.
That could be one of the applied measures, along with many others that take phone numbers into account. So, I don't see much that can be gained by statistical analysis of phone numbers. The point is that several competing measures of similarity can be weighted against each other. Simply pick measures that work, whatever their "real" meaning may be. Apr 5 '06 #12
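Paul's two ideas here -- the length of a common tail as a phone-number similarity, and weighting several competing measures against each other -- can be sketched as follows (the 0.6/0.4 weights and the record layout are invented for illustration; real weights would be tuned against the data):

```python
def common_tail(a, b):
    """Length of the shared suffix: compare the strings reversed."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n

def combined_score(rec_a, rec_b, weights=(0.6, 0.4)):
    """Weight competing similarity measures against each other.
    Here: phone common-tail ratio plus exact-name agreement."""
    phone_sim = common_tail(rec_a["phone"], rec_b["phone"]) / max(
        len(rec_a["phone"]), len(rec_b["phone"]), 1
    )
    name_sim = 1.0 if rec_a["name"].lower() == rec_b["name"].lower() else 0.0
    return weights[0] * phone_sim + weights[1] * name_sim
```

A pair scoring above some tuned threshold would then be flagged as a candidate duplicate for human review.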

 P: n/a CD********@FortuneJames.com schreef: kaniest wrote: A Bayesian belief network of what constitutes similarity could be maintained. Distributions can initially be derived from the available data. Tune and set thresholds to produce acceptable results on new problems as they arise. It may sound overly complex, but if it's a repeating problem and the dataset is this simple, it may be worth building your own (in Access). Paul, That's a brilliant idea. It's too bad that a Dot Net data structure rather than Access lends itself to this approach. If this is done in software it should work at least as well as Fuzzy Logic based code. Fuzzy logic tends to shine in hardware implementation. A Bayesian belief network would have to be optimally pruned for burning onto chips once a decent distribution converges, but that idea can't match the hardware options or sizes available with Fuzzy Logic. Yes, an unattended system that works from the basic data would require some special tricks. But creating a simple network manually (after some data mining and decision analysis) and coding it shouldn't be that much of a problem. Some marketeers do it all the time. Apr 5 '06 #13

 P: n/a kaniest wrote: David W. Fenton wrote: Well, since the source data is a spreadsheet, there *aren't* anydefinitions of what constitutes uniqueness. Some people keep databases in directories and text files ... Why not in a spreadsheet? Because if one wishes to apply relational principles and a means to enforce data integrity, a spreadsheet or a text file or a directory structure isn't going to do it. For one person managing one's own data, it's OK, but especially in an environment where multiple people need to access the information, forget it. But you're right. A collection of data does not necessarily have to be kept in a specific database jar. It's like clothes. I know what I have when it's strewn around my room. But when my maid or laundry service needs to put away clean clothes, she needs a set of drawers/closets in order to know where to put things. -- Tim http://www.ucs.mun.ca/~tmarshal/ ^o< /#) "Burp-beep, burp-beep, burp-beep?" - Quaker Jake /^^ "Whatcha doin?" - Ditto "TIM-MAY!!" - Me Apr 5 '06 #14

 P: n/a kaniest wrote: CD********@FortuneJames.com schreef: kaniest wrote: A Bayesian belief network of what constitutes similarity could be maintained. Distributions can initially be derived from the available data. Tune and set thresholds to produce acceptable results on new problems as they arise. It may sound overly complex, but if it's a repeating problem and the dataset is this simple, it may be worth building your own (in Access). Paul, That's a brilliant idea. It's too bad that a Dot Net data structure rather than Access lends itself to this approach. If this is done in software it should work at least as well as Fuzzy Logic based code. Fuzzy logic tends to shine in hardware implementation. A Bayesian belief network would have to be optimally pruned for burning onto chips once a decent distribution converges, but that idea can't match the hardware options or sizes available with Fuzzy Logic. Yes, an unattended system that works from the basic data would require some special tricks. But creating a simple network manually (after some data mining and decision analysis) and coding it shouldn't be that much of a problem. Some marketeers do it all the time. I agree that the data mining and decision analysis (or their equivalents) are important parts of the process. Everybody thinks they know how to connect the data better than what an analysis shows to be the best. That often results in, ahem..., non-optimal performance. A simple network shouldn't be too much trouble to set up in Access. The concept of data normalization is how database designers attempt to minimize entropy. For the way I would go about it, Access would not be a good choice, especially if further development is anticipated. James A. Fortune CD********@FortuneJames.com Apr 5 '06 #15

 P: n/a CD********@FortuneJames.com wrote: I agree that the data mining and decision analysis (or their equivalents) are important parts of the process. Everybody thinks they know how to connect the data better than what an analysis shows to be the best. That often results in, ahem..., non-optimal performance. A simple network shouldn't be too much trouble to set up in Access. The concept of data normalization is how database designers attempt to minimize entropy. For the way I would go about it, Access would not be a good choice, especially if further development is anticipated. Serious apps in this area often "grow" from some manager's spreadsheet. Migrating the manager to Access may be an improvement ... Apr 5 '06 #16

 P: n/a Tim Marshall schreef: kaniest wrote: David W. Fenton wrote: Well, since the source data is a spreadsheet, there *aren't* any definitions of what constitutes uniqueness. Some people keep databases in directories and text files ... Why not in a spreadsheet? Because if one wishes to apply relational principles and a means to enforce data integrity, a spreadsheet or a text file or a directory structure isn't going to do it. For one person managing one's own data, it's OK, but especially in an environment where multiple people need to access the information, forget it. Sounds almost like Access. Anyway, when handling structured data some kind of db engine or db _management system_ comes in handy. -- Paul Apr 6 '06 #17

 P: n/a David W. Fenton schreef: I don't know that the original poster has actually clarified whether he really meant "fuzzy logic" as you're defining it, or if he meant "fuzzy matches". Right, fuzzy logic proper works with a fuzzy notion of set-membership. Then "fuzzy" is used in many other areas where an imprecise result is appreciated. My address de-duping routines actually do use certain kinds of measures of similarity between certain fields. But I've never done any kind of real statistical evaluation of this, just an ad hoc choice based on eyeballing the results. Since you seem familiar with the problem, I'd figure you may appreciate the possibilities of another perspective on the matter. Replacing boolean criteria by statistical criteria may solve some problems, but the interplay of criteria (possibly using Bayes' rule) carries things one step further. keep up the good work, Paul Apr 6 '06 #19
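The Levenshtein distance mentioned later in the thread (from another discussion) is the classic edit-distance measure behind many fuzzy field matches; a compact Python version for reference -- a sketch, not the routine from that other thread:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: the minimum number of
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

A non-equi join that offers the user near matches could rank candidates by this distance on the normalized name or address fields.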

 P: n/a "kaniest" wrote Well, since the source data is a spreadsheet, there *aren't* any definitions of what constitutes uniqueness. Some people keep databases in directories and text files ... Why not in a spreadsheet? I disagree. I would agree that some people keep "data" in directories, text files, and spreadsheets, but those are not, in my view, "databases", and especially not "relational databases." Databases have structure and rules for storing and retrieving data that directories, text files, and spreadsheets lack. Calling something a "database" doesn't, in fact, make it one. A given database implementation may use directories, text files, and perhaps even spreadsheets. Access does not -- it has a monolithic file structure containing the database objects. Just because you have some data stored, and in someone's view it constitutes a "database", does not mean it is even one step along the path from raw, random data to a relational database. David is correct that deduplicating/cleaning/scrubbing data that is not in proper relational form is a "decidedly non-trivial task" and one that is unlikely to be _easily_ handled even by a highly skilled database developer / programmer. Larry Linson Microsoft Access MVP Apr 6 '06 #20

 P: n/a Larry Linson wrote Just because you have some data stored, and in someone's view it constitutes a "database", does not mean it is even one step along the path from raw, random data to a relational database. The validity of the data in an rdbms does not guarantee its relevance. The merits of solutions to import problems compare to the (business) risk of errors - manual intervention, missing accounts, offended customers, etc. Consider James' remark on connecting data. Of course it requires expertise and skill - and analysis. Access isn't that different from other tools. -- Paul Apr 6 '06 #21

 P: n/a "kaniest" wrote Larry Linson wrote Just because you have some data stored, and in someone's view it constitutes a "database", does not mean it is even one step along the path from raw, random data to a relational database. The validity of the data in an rdbms does not guarantee its relevance. The merits of solutions to import problems compare to the (business) risk of errors - manual intervention, missing accounts, offended customers, etc. Consider James' remark on connecting data. Of course it requires expertise and skill - and analysis. Access isn't that different from other tools. Are you saying "GIGO" in a lot of words? That's been a given, ever since the computers I used were made with vacuum tubes. "Relevance" of data was not the subject of this thread... but a specific question about using databases, especially Access, with "fuzzy logic" as a solution to a particular problem posed by the original poster. "Fuzzy logic" or "fuzzy matching" are not native features in Access and may, or may not, be useful for the o.p.'s needs. Reminds me of some time I spent, about twenty years ago, working with rule-based systems in AI when all the pundits predicted that the next year was going to be "the Year of Artificial Intelligence", for several years running. None of those years actually was "the Year of AI", and interest waned in the subject. Access is, in fact, the UI and development tool. The actual database engine can be the Jet database engine that comes with Access and is installed by default, or you can use Access as a client to the heaviest-duty server DBs. Access and whatever database engine you choose are a combination that is quite different from some user's collection of directories, files, and spreadsheets. And, so are a number of other databases. Apr 7 '06 #22

 P: n/a Larry Linson schreef: "kaniest" wrote Larry Linson wrote Just because you have some data stored, and in someone's view it constitutes a "database", does not mean it is even one step along the path from raw, random data to a relational database. The validity of the data in an rdbms does not guarantee its relevance. The merits of solutions to import problems compare to the (business) risk of errors - manual intervention, missing accounts, offended customers, etc. Consider James' remark on connecting data. Of course it requires expertise and skill - and analysis. Access isn't that different from other tools. Are you saying "GIGO" in a lot of words? That's been a given, ever since the computers I used were made with vacuum tubes. "Relevance" of data was not the subject of this thread... but a specific question about using databases, especially Access, with "fuzzy logic" as a solution to a particular problem posed by the original poster. "Fuzzy logic" or "fuzzy matching" are not native features in Access and may, or may not, be useful for the o.p.'s needs. Reminds me of some time I spent, about twenty years ago, working with rule-based systems in AI when all the pundits predicted that the next year was going to be "the Year of Artificial Intelligence", for several years running. None of those years actually was "the Year of AI", and interest waned in the subject. At the time we (being db guys) figured that if it really works, it's not AI. It's one thing to offer the user a set of near matches from a non-equi join (like a Levenshtein distance in another thread); it's another thing to decide how to modify invalid import data to make it pass validation. BTW, analysis doesn't necessarily show that an experienced developer or handler makes suboptimal choices, although there may be surprises.
Where once most specs were decided from "this is the best we can do for the money" proposals by techies, there's now an increasing demand for transparent, rational trade-offs in a wider scope. -- Paul Apr 7 '06 #23