By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
454,968 Members | 1,226 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 454,968 IT Pros & Developers. It's quick & easy.

String Matching in Database, how to negate leading zeros and partial match strings of

natalie99
P: 41
Hi All

This may be a very basic question, please bare with me I am learning!

I have two sets of data, one is the baseline of a Billing Inventory, the other is a seperate data set from the Live Inventory.

The field I am trying to match is the ID field.

There is no primary key in the Baseline inventory, ID can be duplicated, however, the other fields in the record will determine if it is, or is not, a duplicate entry or two legitimate data fields.

(The baseline inventory is not able to be altered, it is externally owned)

Below are some examples that are in fact, the same ID:

eg.
a) 0000016ABC-00-00-1
=
b) 16ABC-00-00-1

a) FFTG7654GRFT
=
b) FFTG-7654-GRFT

a) King Street West 1022456
=
b) KingStreetWest 1022456

There is no set format for ID field entries, they may contain spaces, hyphens, numbers, letters or other punctuation marks eg. underscores.

I need a way of finding these string or pattern matches, and then confirming whether or not they are legitimate single IDs, or simply formatting errors or typos.

I have tried the following basic solutions:

find_unmatched()
fuzzy duplicate finder excel add-in
get_similar()

Both sets of data can export into Excel or Access.

There are approximately 35,000 lines per data set, however, I can eliminate approximately 20,000 with the find_unmatched query.

Thanks for any help anyone can give.

Natalie
Feb 1 '08 #1
Share this Question
Share on Google+
4 Replies


ADezii
Expert 5K+
P: 8,679
Hi All

This may be a very basic question, please bare with me I am learning!

I have two sets of data, one is the baseline of a Billing Inventory, the other is a seperate data set from the Live Inventory.

The field I am trying to match is the ID field.

There is no primary key in the Baseline inventory, ID can be duplicated, however, the other fields in the record will determine if it is, or is not, a duplicate entry or two legitimate data fields.

(The baseline inventory is not able to be altered, it is externally owned)

Below are some examples that are in fact, the same ID:

eg.
a) 0000016ABC-00-00-1
=
b) 16ABC-00-00-1

a) FFTG7654GRFT
=
b) FFTG-7654-GRFT

a) King Street West 1022456
=
b) KingStreetWest 1022456

There is no set format for ID field entries, they may contain spaces, hyphens, numbers, letters or other punctuation marks eg. underscores.

I need a way of finding these string or pattern matches, and then confirming whether or not they are legitimate single IDs, or simply formatting errors or typos.

I have tried the following basic solutions:

find_unmatched()
fuzzy duplicate finder excel add-in
get_similar()

Both sets of data can export into Excel or Access.

There are approximately 35,000 lines per data set, however, I can eliminate approximately 20,000 with the find_unmatched query.

Thanks for any help anyone can give.

Natalie
I'm not sure if it would be feasible in your case, but from what you have shown me, could you possibly strip all "0", "-", and " "s from the Strings then do an exact comparison on the IDs.? The following Function, not Optimized, will do the stripping:
Expand|Select|Wrap|Line Numbers
  1. Public Function fRemoveCharactersFromString(strIDField As String) As String
  2. Dim intLengthOfString As Integer, intCounter As Integer, strNewString As String
  3. Dim strCharToTest As String
  4.  
  5. If Len(strIDField) = 0 Then Exit Function
  6.  
  7. intLengthOfString = Len(strIDField)
  8.  
  9. For intCounter = 1 To intLengthOfString
  10.   strCharToTest = Mid$(strIDField, intCounter, 1)
  11.   If strCharToTest <> "0" And strCharToTest <> "-" And strCharToTest <> " " Then
  12.     strNewString = strNewString & Mid$(strIDField, intCounter, 1)
  13.   End If
  14. Next
  15.  
  16. Debug.Print strNewString
  17. End Function
Expand|Select|Wrap|Line Numbers
  1. fRemoveCharactersFromString("0-00P-h-il 00a-de0l ph i -a") ==> Philadelphia
Feb 3 '08 #2

natalie99
P: 41
Hi!

Thanks for your reply, I had thought of that avenue as a possible solution, however, some IDs are only differentiated through punctuation, and therefore the "-" etc are needed. That is why I am trying to find a 'similar data' type comparison rather than exact.

For example: the following are three different IDs

09-33-4
09334
0900003340

which i found in the stripping and compare process. I am considering this as an avenue to pursue once all other comparitive processes have failed, as I have 35000 IDs, it would be very timely to sort the results with ID duplicates like these.

:(

I really appreciate your idea and help though, please keep it coming!

natalie
Feb 12 '08 #3

ADezii
Expert 5K+
P: 8,679
Hi!

Thanks for your reply, I had thought of that avenue as a possible solution, however, some IDs are only differentiated through punctuation, and therefore the "-" etc are needed. That is why I am trying to find a 'similar data' type comparison rather than exact.

For example: the following are three different IDs

09-33-4
09334
0900003340

which i found in the stripping and compare process. I am considering this as an avenue to pursue once all other comparitive processes have failed, as I have 35000 IDs, it would be very timely to sort the results with ID duplicates like these.

:(

I really appreciate your idea and help though, please keep it coming!

natalie
Here's a long shot, how about a Public Function, excluding Hypens, that would check for all 'base' values contained within a String? In the about case, it would check for a 0, 9, 2-3s, and a 4 within the Base String, and if these are found would return True for a Partial Match.
Feb 12 '08 #4

natalie99
P: 41
thanks adezi, I have finally got about 90% matches - woohoo!
Mar 14 '08 #5

Post your reply

Sign in to post your reply or Sign up for a free account.