By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
459,223 Members | 1,403 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 459,223 IT Pros & Developers. It's quick & easy.

Algorithm Question

P: n/a
This really an algorithm question more that a Python question, but it
would be implemented in Python....

I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
But I also want to minimise T = sum_j t_j where

t_j = count of the number of elements in A which have b[j] as a sub-string

My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just keeping it
reasonably small would help.

In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by
dividing the entries in A into sets of tokens, then take the union of
all these sets as a starting point for B. Then I would try to trim B by

1. looking for elements that I could remove while still satisfying the
constraint

2. replacing two elements by a common sub-string if that reduced T

Anyway. It occurred to me that this might be a known problem. Any
pointers gratefully received.

- Andrew

Sep 10 '06 #1
Share this Question
Share on Google+
9 Replies


P: n/a
On 2006-09-10, Andrew McLean <an*********@andros.org.ukwrote:
This really an algorithm question more that a Python question,
but it would be implemented in Python....
In that case, try comp.programming. But still...
I have a list of strings, A. I want to find a set of strings B
such that for any "a in A" there exists "b in B" such that b is
a sub-string of a. But I also want to minimise T = sum_j t_j
where

t_j = count of the number of elements in A which have b[j] as a
sub-string

My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just
keeping it reasonably small would help.

In my case the list A contains over two million addresses.
I'm afraid I don't understand your description.

How about an example of B for some A?

A = ["abb", "bbc"]
B = ["a", "ab", "abb", "b", "bb", "bbc", "bc", "c"]

Is that what you're after?

--
Neil Cerutti
Sep 11 '06 #2

P: n/a
Andrew McLean wrote:
I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
B=A?
But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string
If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.
Carl Banks

Sep 11 '06 #3

P: n/a
Carl Banks wrote:
Andrew McLean wrote:
>I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.

B=A?
>But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string

If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.
You are quite right. I was trying to generalise my real problem and
missed out a constraint. I also want to keep length(B) small.
Unfortunately, I'm a bit unsure about the relative importance of T and
length(B), which makes the problem rather ill defined. I'll have to give
this a bit more thought....

Sep 11 '06 #4

P: n/a

Andrew McLean wrote:
Carl Banks wrote:
Andrew McLean wrote:
I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
B=A?
But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string
If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.

You are quite right. I was trying to generalise my real problem and
missed out a constraint. I also want to keep length(B) small.
Unfortunately, I'm a bit unsure about the relative importance of T and
length(B), which makes the problem rather ill defined. I'll have to give
this a bit more thought....
A quick silly question: what is the problem that you are trying to
solve?

Sep 11 '06 #5

P: n/a
John Machin wrote:
A quick silly question: what is the problem that you are trying to
solve?
A fair question :-)

The problem may seem a bit strange, but here it is:

I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column. The specific column contains an address. This
database can only be accessed through this particular interface (don't
ask why, it's one of the reasons it's a *legacy* system).

I also have access to a list that contains the vast majority (possibly
all) the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.

Now the current approach I use is to divide the addresses I have into
tokens and take the last token in the address (excluding the postal
code). The union of these "last tokens" forms my set of queries. The
last token in the address is typically a county or a town in a UK address.

This works, but I was wondering if I could do something more efficient.
The problem is that while the search term "London" matches all the
addresses in London it also returns all the addresses containing "London
Road", and a lot of towns have a London Road. Perhaps I would be better
off searching for "Road", "Street", "Avenue" ....

It occurred to me that this my be isomorphic to a known problem, however
given that I want to keep two things small, the problem isn't very well
defined.

The current approach works, I was just musing whether there was a faster
approach, so don't think about it too hard.

- Andrew
Sep 14 '06 #6

P: n/a
Andrew McLean wrote:
Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.
The first requirement makes sense, but what's the reason for the second
one ? In any case, as you realized the problem is not well defined. For
example, what's better:
a) 2 queries each covering 2/3 of the total records (say R) whose
overlap is 1/3*R, or
b) 4 queries each covering 1/4*R with zero overlap ?

George

Sep 15 '06 #7

P: n/a
At Thursday 14/9/2006 20:31, Andrew McLean wrote:
>Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.
This is known as a "set cover" algorithm. You have a set of subsets,
and want to determine the smallest set of those subsets, whose union
is the universal set - (uh, what a mess!)

See "The Algorithm Design Manual" by Steven Skiena.
http://www.cs.sunysb.edu/~algorith/f...et-cover.shtml
Full text (not-so-legal, I presume):
http://www2.toki.or.id/book/AlgDesig...BOOK/NODE4.HTM

Gabriel Genellina
Softlab SRL

__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Sep 15 '06 #8

P: n/a
Gabriel Genellina wrote:
This is known as a "set cover" algorithm. You have a set of subsets,
and want to determine the smallest set of those subsets, whose union
is the universal set - (uh, what a mess!)
I thought of that too, but he seems to be adding a second desired
property: the intersections of the subsets should be as small as
possible, or in other words the subsets should be as distinct as
possible.

George

Sep 15 '06 #9

P: n/a
In message <ee*******************@news.demon.co.uk>, Andrew McLean wrote:
I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column.
What are the constraints on search terms? Are you allowed to specify the
null search term, which would match everything?
Sep 26 '06 #10

This discussion thread is closed

Replies have been disabled for this discussion.