By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
434,921 Members | 1,471 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 434,921 IT Pros & Developers. It's quick & easy.

remove duplicate text within a string

P: 4
How do I remove text that was duplicated within a field. For example:
Series
---------
AC Delco AC Delco Taper Spoiler
Bendix Bendix High Performance Altinators
B&I B&I Headlamps
Bridgestone Firestone Birdgestone Firestone AllWeather Tires
Dec 2 '11 #1
Share this Question
Share on Google+
21 Replies


Rabbit
Expert Mod 10K+
P: 12,366
You could split the string by the space character and then loop through the array looking for distinct words.
Dec 2 '11 #2

P: 4
How do I go about doing that in sql. I have a process in java script but I can't get it to work in Access.

Expand|Select|Wrap|Line Numbers
  1. Function removeDuplicates (field) {
  2.     var temp = field.value;
  3.     var array = temp.split(" ");
  4.     array.sort();
  5.     temp = array.join(" ");
  6.  
  7.     do {
  8.         var newTemp = temp;
  9.         var temp = newTemp.replace(/\s(\w+\s)\1/, " $1");
  10.     } while (temp.length != newTemp.length);
  11.  
  12.     temp = temp.replace(/^(\w+\s)\1/, "$1");
  13.     temp = temp.replace(/(\s\w+)\1$/, "$1");
  14.  
  15. var orig = field.value.split(" ");
  16.     var finalStr = "";
  17.     for (var i=0; i<orig.length; i++) {
  18.         if (temp.indexOf(" " + orig[i] + " ") != -1) {
  19.             finalStr += orig[i] + " ";
  20.             temp = temp.split(" " + orig[i] + " ").join(" ");
  21.         } else if ((temp.indexOf(orig[i]) != -1) && (temp.indexOf(" "
  22. + orig[i]) == (temp.length-orig[i].length-1))) {
  23.             finalStr += orig[i] + " ";
  24.             temp = temp.substring(0, (temp.length-orig[i].length-1));
  25.         } else if (temp.indexOf(orig[i] + " ") == 0) {
  26.             finalStr += orig[i] + " ";
  27.             temp = temp.substring(orig[i].length+1, temp.length);
  28.         } else if (temp == orig[i]) {
  29.             finalStr += orig[i];
  30.             temp = "";
  31.         }
  32.     }
  33.  
  34.     if (finalStr.substring(finalStr.length-1, finalStr.length) == " ") {
  35.         finalStr = finalStr.substring(0, finalStr.length-1);
  36.     }
  37.     field.value = finalStr;
  38. }
Dec 2 '11 #3

100+
P: 332
This would be a starter. i.e. remove successive words from string:
Expand|Select|Wrap|Line Numbers
  1. Function removeSuccessingWordInString(val As String) As String
  2. Dim sVal, s As Variant
  3. Dim p As Integer
  4. Static cleanString As String
  5. sVal = Split(val, " ")
  6. For p = 0 To UBound(sVal) - 1
  7.     If sVal(p) = sVal(p + 1) Then
  8.         cleanString = Trim(Replace(val, sVal(p), "", 1, 1))
  9.         removeSuccessingWordInString cleanString
  10.     End If
  11. Next p
  12. removeSuccessingWordInString = cleanString
  13. End Function
Dec 2 '11 #4

dsatino
100+
P: 393
Just as a comment:
This appears to be the concatenation of 3 seperate fields from 3 separate tables. If you have control over the concatentation, then it would be easier to rewrite that then try an remove duplication.

If you're getting this from an external source, you might try asking them to modify the concatenation.
Dec 2 '11 #5

NeoPa
Expert Mod 15k+
P: 31,492
Bobbie, why not explain in words (preferably as part of the original question, but failing that now) what it is you want, rather than dumping some code (in a language other than VBA) for us to try to determine what you should have asked.
Dec 3 '11 #6

ADezii
Expert 5K+
P: 8,638
I see no easy Method to accomplish this. The best results that I can come up with is the following, using a very unorthodox approach. Notice the last 2 Records:
Expand|Select|Wrap|Line Numbers
  1. Series                                                              No_Dups
  2. AC Delco AC Delco Taper Spoiler                                     AC Delco Taper Spoiler
  3. Bendix Bendix High Performance Altinators                           Bendix High Performance tinators
  4. B&I B&I Headlamps                                                    B&I Headlamps
  5. Bridgestone Firestone Bridgestone Firestone                          Bridgestone Firestone
  6. AllWeather Tires                                                     AllWeather Tires
  7. BIG KMART BIG KMART BIG KMART BIG KMART                              BIG KMART
  8. TJ Max TJ Thompsom                                                   TJ Max Thompsom
  9. Bill Clinton Hillary Clinton George Bush Bill Clinton George Bush    Bill Clinton Hillary George Bush
  10.  
Dec 3 '11 #7

P: 4
@NeoPa
I have an Acces table of fields and one of the fields is concatenated and unfortunately one of the fields was concatenated twice. I don't have the data in from the separate fields otherwise I will simply fix the concatenated field. So I'm stuck with this field to fix. I can't use spaces or drop the first or second word because this problem of duplicate words isn't in all the fields.

So I was looking for a way using an access query to remove the duplicae words. The code I attached earlier I found on the internet and looks like it would work but it's in Java.
Dec 3 '11 #8

NeoPa
Expert Mod 15k+
P: 31,492
I recognise that at least as an attempt to specify the problem Bobbie (which is a good step forward :-)). Specifications are often a problem to get right, so all we expect is an honest attempt.

Now let's see if we can tidy it up a little. Correct me if I go astray.
  1. You have a table within which is a single field which was previously put together by concatenating various separate data strings.
  2. Unfortunately, the first of these values was included twice in the resultant string.
  3. The string values were originally joined together using spaces to separate them.
  4. Any or all of the values may, but don't necessarily have to, contain embedded spaces within themselves.
  5. So, we have a string field in a table which has a duplication at the start which may consist of one or more words separated in all cases by spaces.
  6. The only indication we have that an original value may have ended and a new one begun (within the resultant data we're working with) is the occurrence of a space character.
  7. You need to determine some logic first, and later some code that implements that logic, to handle de-duplication of the first value added to the string.
Dec 3 '11 #9

ADezii
Expert 5K+
P: 8,638
Thanks to NeoPa's clarification, the following Function, when used in a Query (Calculated Field), will handle 'Start of Field Duplication' for a Maximum of 6 Words. It can easily be expanded if so desired. I ran it against Test Data, some of it yours, and I'll post the Code and results below:
Expand|Select|Wrap|Line Numbers
  1. Public Function fCheckForDupes(strSeries As String)
  2. Dim varDups As Variant
  3. Dim intNumOfWords As Integer
  4. Dim strRetVal As String
  5.  
  6. varDups = Split(strSeries, " ")
  7.  
  8. intNumOfWords = UBound(varDups) + 1
  9.  
  10. Select Case intNumOfWords
  11.   Case 1    '1 Word, return Word
  12.     strRetVal = varDups(0)
  13.   Case 2    '2 Words, check for equality
  14.     If varDups(0) = varDups(1) Then     'Equal, drop the 2nd Word
  15.       strRetVal = varDups(0)
  16.     Else    '2 Words, not equal, return original String
  17.       strRetVal = varDups(0) + " " & varDups(1)
  18.     End If
  19.   Case 3    'Is the 1st equal to the 2nd Word?
  20.     If varDups(0) = varDups(1) Then     '1st & 2nd equal, Drop 2nd
  21.       strRetVal = varDups(0) & " " & varDups(2)
  22.     Else    'No equality, return entire String
  23.       strRetVal = varDups(0) + " " & varDups(1) & " " & varDups(2)
  24.     End If
  25.   Case 4   'Is the 1st = the 2nd OR are the 1st and 2nd = 3rd and 4th?
  26.     If varDups(0) = varDups(1) Then     'Drop the 2nd
  27.       strRetVal = varDups(0) & " " & varDups(2) & " " & varDups(3)
  28.     ElseIf (varDups(0) = varDups(2)) And (varDups(1) = varDups(3)) Then
  29.       strRetVal = varDups(0) & " " & varDups(1)     'Drop 3rd and 4th
  30.     Else
  31.       strRetVal = strSeries
  32.     End If
  33.   Case 5
  34.     If varDups(0) = varDups(1) Then     'Drop the 2nd
  35.       strRetVal = varDups(0) & " " & varDups(2) & " " & varDups(3) & _
  36.                   " " & varDups(4)
  37.     ElseIf (varDups(0) = varDups(2)) And (varDups(1) = varDups(3)) Then
  38.       strRetVal = varDups(0) & " " & varDups(1) & " " & varDups(4)    '
  39.     Else
  40.       strRetVal = strSeries
  41.     End If
  42.   Case 6
  43.     If varDups(0) = varDups(1) Then
  44.       strRetVal = varDups(0) & " " & varDups(2) & " " & varDups(3) & _
  45.                   " " & varDups(4) & " " & varDups(5)
  46.     ElseIf (varDups(0) = varDups(2)) And (varDups(1) = varDups(3)) Then
  47.       strRetVal = varDups(0) & " " & varDups(1) & " " & varDups(4) & _
  48.                   " " & varDups(5)
  49.     ElseIf (varDups(0) = varDups(3)) And (varDups(1) = varDups(4)) And _
  50.            (varDups(2) = varDups(5)) Then
  51.       strRetVal = varDups(0) & " " & varDups(1) & " " & varDups(2)
  52.     Else
  53.       strRetVal = strSeries
  54.     End If
  55.   Case Else
  56.     'If all else fails, return the Original String
  57.     strRetVal = strSeries
  58. End Select
  59.  
  60. fCheckForDupes = strRetVal
  61. End Function
Expand|Select|Wrap|Line Numbers
  1. Series                                             New_Series
  2. AC Delco AC Delco Taper Spoiler                    AC Delco Taper Spoiler
  3. Bendix Bendix High Performance Altinators          Bendix High Performance Altinators
  4. B&I B&I Headlamps                                  B&I Headlamps
  5. Bridgestone Firestone Bridgestone Firestone        Bridgestone Firestone
  6. AllWeather Tires                                   AllWeather Tires
  7. Encyclopedia                                       Encyclopedia
  8. One Two Three                                      One Two Three
  9. YaDa YaDa                                          YaDa
  10. One One Three Four                                 One Three Four
  11. One Two One Two Five                               One Two Five
  12. One Two Three Four Five                            One Two Three Four Five
  13. One Two Three Four                                 One Two Three Four
  14. One One Three Four Five Six                        One Three Four Five Six
  15. One Two Three One Two Three                        One Two Three
  16. 1 2 3 4 5 6                                        1 2 3 4 5 6
Dec 3 '11 #10

NeoPa
Expert Mod 15k+
P: 31,492
My understanding is that strings with no duplications at the start are not possible in this scenario, but let's wait for confirmation of the spec before taking the next step (at least that's how I prefer to play, personally).
Dec 4 '11 #11

ADezii
Expert 5K+
P: 8,638
@NeoPa:
My understanding is that strings with no duplications at the start are not possible in this scenario
That was my understanding also after reading your Post (#9), and my Code is based entirely on this Assumption.
Dec 4 '11 #12

NeoPa
Expert Mod 15k+
P: 31,492
Excellent. I was thrown by your posted example data, but as long as that's all clear all's good :-)
Dec 4 '11 #13

NeoPa
Expert Mod 15k+
P: 31,492
This is an alternative procedure to allow you to strip the duplicates from your field :

Expand|Select|Wrap|Line Numbers
  1. Public Function StripDup(ByVal varWith As Variant) As String
  2.     Dim intX As Integer, intY As Integer
  3.     Dim strWith As String
  4.  
  5.     If IsNull(varWith) Then
  6.         StripDup = ""
  7.         Exit Function
  8.     End If
  9.     StripDup = Trim(varWith)
  10.     varWith = Split(StripDup, " ")
  11.     For intX = (UBound(varWith) \ 2) + 1 To 1 Step -1
  12.         For intY = 0 To intX - 1
  13.             If varWith(intY) <> varWith(intY + intX) Then Exit For
  14.         Next intY
  15.         If intY > (intX - 1) Then Exit For
  16.     Next intX
  17.     If intX > 0 Then
  18.         StripDup = ""
  19.         For intY = intX To UBound(varWith)
  20.             StripDup = StripDup & " " & varWith(intY)
  21.         Next intY
  22.         StripDup = LTrim(StripDup)
  23.     End If
  24. End Function
Dec 4 '11 #14

ADezii
Expert 5K+
P: 8,638
Nice job, NeoPa! Let me be the first to say that your StripDup is better then my StripDup.
Dec 4 '11 #15

NeoPa
Expert Mod 15k+
P: 31,492
Thank you ADezii. A generous comment :-)
Dec 4 '11 #16

P: 4
I appreciate all your help and please forgive my ignorance because I clearly know very little about access and what it can do. Can I take the code you have just provided below and open access, go to design query in SQL view and paste your code. How do I get it to run and updated that field.
Table is called Distinct_Configs_w_Attirbutes and the field is called potential Group Description?
Dec 5 '11 #17

NeoPa
Expert Mod 15k+
P: 31,492
Not exactly. Let's see if we can describe the process clearly for you.
  1. Find a module in your project that you can use to hold the procedure code (that I posted). This can be an existing procedure or a new one created specially.

    To do this you first need to find your way to where the modules are visible to you :
    1. Ensure you're in your database in Access.
    2. Switch to the VBA IDE using Alt-F11.
    3. Switch to the Project Explorer pane using Ctrl-R.
    4. Expand the project and you will see (if there is anything there with code) up to three separate trees :
      1. Microsoft Office Access Class Objects.
      2. Modules.
      3. Class Objects.
    5. Standard modules are found within the Modules branch.
  2. Select an existing Standard module, or create a new one :
    1. Right click in the Project Explorer pane.
    2. Select Insert.
    3. Select Module (and name it etc).
  3. If the module isn't opened then double-click on it to open it up.
  4. See Require Variable Declaration before going any further and follow instructions found there.
  5. Copy and Paste the code from post #14 (Click on Line Numbers and Select at the top first to allow it to work smoothly).
  6. In your query, simply refer to the procedure by name and pass, as the parameter, the field that contains the unfixed data :
    Expand|Select|Wrap|Line Numbers
    1. SELECT StripDup([UnfixedField]) AS [FixedField]
    2. FROM   [SomeTable]
    3. etc

You should now see the data as it should have been.
Dec 5 '11 #18

P: 2
Code posted in #14 simply doesn't work. I test with input "Hello World Hello Hello World World" and get back the same string unchanged. I test with input "Hello Hello Hello" and get a runtime error 9, subscript out of range.

May I offer the following function instead. This is basically an O(n^2) algorithm and does not take advantage of VBA Collections. However for small strings it's probably just fine. Don't call it on a many-megabyte input :)

Expand|Select|Wrap|Line Numbers
  1. Public Function DropDupeWords(ByVal words As Variant) As String
  2.     'Removes duplicate words in the input, ignores case, preserves original order.
  3.     'Result has the first unique words found in the original casing.
  4.     'A word is any sequence of non-space characters.
  5.  
  6.     'Don't do anything on null or empty
  7.     DropDupeWords = ""
  8.     If IsNull(words) Then
  9.         Exit Function
  10.     End If
  11.     Dim trimmed As String
  12.     trimmed = Trim(words)
  13.     If trimmed = "" Then
  14.         Exit Function
  15.     End If
  16.     'If no embedded space, there's nothing to do
  17.     If InStr(trimmed, " ") = 0 Then
  18.         DropDupeWords = trimmed
  19.         Exit Function
  20.     End If
  21.  
  22.     'Split into words
  23.     Dim wordarr As Variant
  24.     wordarr = Split(trimmed, " ")
  25.     Dim limit As Long
  26.     limit = UBound(wordarr)
  27.  
  28.     'Build new array of unique words, initialized with first word.
  29.     Dim uniquewords() As String
  30.     ReDim uniquewords(limit)
  31.     Dim i As Integer, unique As Integer, u As Integer, found As Boolean
  32.     uniquewords(0) = wordarr(0)
  33.     unique = 0
  34.     For i = 1 To limit
  35.         'Skip empty elements resulting from multiple spaces
  36.         If wordarr(i) <> "" Then
  37.             found = False
  38.             For u = 0 To unique
  39.                 'Ignore case when comparing words
  40.                 If StrComp(wordarr(i), uniquewords(u), vbTextCompare) = 0 Then
  41.                     found = True
  42.                 End If
  43.             Next u
  44.             If Not found Then
  45.                 unique = unique + 1
  46.                 uniquewords(unique) = wordarr(i)
  47.             End If
  48.         End If
  49.     Next i
  50.  
  51.     'Build result from the array of unique words
  52.     Dim result As String
  53.     result = uniquewords(0)
  54.     For i = 1 To unique
  55.         result = result + " " + uniquewords(i)
  56.     Next i
  57.     DropDupeWords = result
  58.  
  59. End Function
  60.  
Aug 26 '15 #19

NeoPa
Expert Mod 15k+
P: 31,492
ChrisLott:
Code posted in #14 simply doesn't work. I test with input "Hello World Hello Hello World World" and get back the same string unchanged. I test with input "Hello Hello Hello" and get a runtime error 9, subscript out of range.
I was surprised to find you're correct in your statement. Although the second example is the only problem here.

Bear in mind that "Hello World Hello Hello World World" is supposed to return "Hello World Hello Hello World World", as there is no duplicate that matches the requirements specified in the question (or more particularly expressed in post #9 as the OP was unable to express themself clearly enough when they posted the question originally).

Your suggested code is not a match for the stated requirement and it's also far from efficient. Remember, this is not simply about deleting duplicates. It is about deleting a specific duplicate introduced by doubling the first value added to the result. Thus "A B C A B C B C A A B C" should return only "A B C B C A A B C". No other duplicates need, or should, be removed.

With this in mind I've revisited my earlier code and this new version should do the trick as expected :
Expand|Select|Wrap|Line Numbers
  1. Public Function StripDup(ByVal varWith As Variant) As String
  2.     Dim intX As Integer, intY As Integer, intZ As Integer
  3.     Dim strWith As String
  4.  
  5.     If Not varWith > "" Then
  6.         StripDup = ""
  7.         Exit Function
  8.     End If
  9.     StripDup = Trim(varWith)
  10.     varWith = Split(StripDup, " ")
  11.     intX = 0
  12.     For intY = 1 To (UBound(varWith) + 1) \ 2
  13.         If varWith(intY) = varWith(intX) Then
  14.             For intX = 1 To intY - 1
  15.                 If varWith(intX) <> varWith(intX + intY) Then Exit For
  16.             Next intX
  17.             If intX = intY Then
  18.                 For intX = 0 To intY - 1
  19.                     varWith(intX) = ""
  20.                 Next intX
  21.                 Exit For
  22.             End If
  23.         End If
  24.     Next intY
  25.     If varWith(0) = "" Then
  26.         StripDup = Trim(Join(varWith, " "))
  27.     End If
  28. End Function
Aug 26 '15 #20

P: 2
Thanks for clarifying the spec. A person I'm supporting was struggling with a related problem and landed here via google but didn't read the intro carefully enough. My little function does not solve the problem of removing a proper prefix of duplicate words; it removes all duplicate words. I'll leave it up on the off chance it will help the next person who lands here by keyword search :)
Aug 27 '15 #21

NeoPa
Expert Mod 15k+
P: 31,492
ChrisLott:
I'll leave it up on the off chance it will help the next person who lands here by keyword search :)
That's good thinking and I agree. Let's leave it there. It may well help people looking for answers.
Aug 27 '15 #22

Post your reply

Sign in to post your reply or Sign up for a free account.