459,223 Members | 1,403 Online
Need help? Post your question and get tips & solutions from a community of 459,223 IT Pros & Developers. It's quick & easy.

# Algorithm Question

 P: n/a This really an algorithm question more that a Python question, but it would be implemented in Python.... I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string My guess is that finding the smallest possible T satisfying the constraint would be hard. However, for my application just keeping it reasonably small would help. In my case the list A contains over two million addresses. The (top down) heuristic approach I am tempted to employ is to start by dividing the entries in A into sets of tokens, then take the union of all these sets as a starting point for B. Then I would try to trim B by 1. looking for elements that I could remove while still satisfying the constraint 2. replacing two elements by a common sub-string if that reduced T Anyway. It occurred to me that this might be a known problem. Any pointers gratefully received. - Andrew Sep 10 '06 #1
9 Replies

 P: n/a On 2006-09-10, Andrew McLean

 P: n/a Andrew McLean wrote: I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. Carl Banks Sep 11 '06 #3

 P: n/a Carl Banks wrote: Andrew McLean wrote: >I have a list of strings, A. I want to find a set of strings B such thatfor any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? >But I also want to minimise T = sum_j t_j wheret_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. You are quite right. I was trying to generalise my real problem and missed out a constraint. I also want to keep length(B) small. Unfortunately, I'm a bit unsure about the relative importance of T and length(B), which makes the problem rather ill defined. I'll have to give this a bit more thought.... Sep 11 '06 #4

 P: n/a Andrew McLean wrote: Carl Banks wrote: Andrew McLean wrote: I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. You are quite right. I was trying to generalise my real problem and missed out a constraint. I also want to keep length(B) small. Unfortunately, I'm a bit unsure about the relative importance of T and length(B), which makes the problem rather ill defined. I'll have to give this a bit more thought.... A quick silly question: what is the problem that you are trying to solve? Sep 11 '06 #5