459,223 Members | 1,403 Online Need help? Post your question and get tips & solutions from a community of 459,223 IT Pros & Developers. It's quick & easy.

# Algorithm Question

 P: n/a This really an algorithm question more that a Python question, but it would be implemented in Python.... I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string My guess is that finding the smallest possible T satisfying the constraint would be hard. However, for my application just keeping it reasonably small would help. In my case the list A contains over two million addresses. The (top down) heuristic approach I am tempted to employ is to start by dividing the entries in A into sets of tokens, then take the union of all these sets as a starting point for B. Then I would try to trim B by 1. looking for elements that I could remove while still satisfying the constraint 2. replacing two elements by a common sub-string if that reduced T Anyway. It occurred to me that this might be a known problem. Any pointers gratefully received. - Andrew Sep 10 '06 #1
9 Replies

 P: n/a On 2006-09-10, Andrew McLean

 P: n/a Andrew McLean wrote: I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. Carl Banks Sep 11 '06 #3

 P: n/a Carl Banks wrote: Andrew McLean wrote: >I have a list of strings, A. I want to find a set of strings B such thatfor any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? >But I also want to minimise T = sum_j t_j wheret_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. You are quite right. I was trying to generalise my real problem and missed out a constraint. I also want to keep length(B) small. Unfortunately, I'm a bit unsure about the relative importance of T and length(B), which makes the problem rather ill defined. I'll have to give this a bit more thought.... Sep 11 '06 #4

 P: n/a Andrew McLean wrote: Carl Banks wrote: Andrew McLean wrote: I have a list of strings, A. I want to find a set of strings B such that for any "a in A" there exists "b in B" such that b is a sub-string of a. B=A? But I also want to minimise T = sum_j t_j where t_j = count of the number of elements in A which have b[j] as a sub-string If there not elements in A that are substrings of any other element in A, and if B=A, then t_j would be 1 for all j. Which means B=A would be optimal (since elements of B have to be substring of at least one element in A). It looks like the B={set of all elements in A that are not a substring of any other element in A} is the generally optimal solution. I suspect you mistyped or omitted something--problem is underspecified at best. You are quite right. I was trying to generalise my real problem and missed out a constraint. I also want to keep length(B) small. Unfortunately, I'm a bit unsure about the relative importance of T and length(B), which makes the problem rather ill defined. I'll have to give this a bit more thought.... A quick silly question: what is the problem that you are trying to solve? Sep 11 '06 #5

 P: n/a Andrew McLean wrote: Now I want to issue a series of queries, such that when I combine all the data returned I have accessed all the records in the database. However, I want to minimise the total number of queries and also want to keep the number of records returned by more than one query small. The first requirement makes sense, but what's the reason for the second one ? In any case, as you realized the problem is not well defined. For example, what's better: a) 2 queries each covering 2/3 of the total records (say R) whose overlap is 1/3*R, or b) 4 queries each covering 1/4*R with zero overlap ? George Sep 15 '06 #7

 P: n/a At Thursday 14/9/2006 20:31, Andrew McLean wrote: >Now I want to issue a series of queries, such that when I combine allthe data returned I have accessed all the records in the database.However, I want to minimise the total number of queries and also want tokeep the number of records returned by more than one query small. This is known as a "set cover" algorithm. You have a set of subsets, and want to determine the smallest set of those subsets, whose union is the universal set - (uh, what a mess!) See "The Algorithm Design Manual" by Steven Skiena. http://www.cs.sunysb.edu/~algorith/f...et-cover.shtml Full text (not-so-legal, I presume): http://www2.toki.or.id/book/AlgDesig...BOOK/NODE4.HTM Gabriel Genellina Softlab SRL __________________________________________________ Preguntá. Respondé. Descubrí. Todo lo que querías saber, y lo que ni imaginabas, está en Yahoo! Respuestas (Beta). ¡Probalo ya! http://www.yahoo.com.ar/respuestas Sep 15 '06 #8

 P: n/a Gabriel Genellina wrote: This is known as a "set cover" algorithm. You have a set of subsets, and want to determine the smallest set of those subsets, whose union is the universal set - (uh, what a mess!) I thought of that too, but he seems to be adding a second desired property: the intersections of the subsets should be as small as possible, or in other words the subsets should be as distinct as possible. George Sep 15 '06 #9

 P: n/a In message , Andrew McLean wrote: I have the ability to query a database in a legacy system and extract records which match a particular pattern. Specifically, I can perform queries for records that contain a given search term as a sub-string of a particular column. What are the constraints on search terms? Are you allowed to specify the null search term, which would match everything? Sep 26 '06 #10

### This discussion thread is closed

Replies have been disabled for this discussion. 