473,396 Members | 1,860 Online

# mapping subintervals

hi,
I have the following problem which is turning out to be non-trivial. I
realize that this is not
exactly a python problem but more of an algorithm problem -- but I
post it here because
I want to implement this in python.

I want to write a code that given an interval (integer tuple:
start,stop) will find which other
interval it matches to. Here is an example,
suppose I have the following NON-OVERLAPPING intervals

10 - 21 =='a1'
34 - 51 =='a2'
77 - 101 =='a3'
etc

So, suppose I'm give an interval such as (42,50), I should go through
the list and map it to "a2" etc
if the region is a subset of an exiting interval. If the query
interval does not exist in the table or
maps to two or more intervals (eg, (45, 80)) the program return a
None.

One naive way to solve this problem is to create an array such as
follows:
[None, None, None, ...., None, a1, a1, a1, ..., a1, None, None ...,
None, a2, ... etc] at indicies
1 2 3 9 10 11 12 21
22 23 33 34, ...

now with this in place I can easily solve the problem. However, this
is not a feasable solution
because the initial list has intervals whose range can go to billions!
So I need a smarter
idea. So what I can do is sort the initial list and then go through
each element from the start
and test if the a X[i][0] and b < X[i][1]
where (a,b) is my query start and stop
and X is a n x 2 array with the known intervals.
I don't like this solution because it can be really really slow as for
each query I'm doing a
linear search and on average I'll be searching almost half the list
before I find the right
interval.
Is there a smarter way to do this?
I've my problem is not clear please let me know and I'll try to
explain the unclear parts again.
Many thanks
Lee

Jun 13 '07 #1
3 1288
On Jun 13, 2:32 pm, Lee Sander <lesa...@gmail.comwrote:
hi,
I have the following problem which is turning out to be non-trivial. I
realize that this is not
exactly a python problem but more of an algorithm problem -- but I
post it here because
I want to implement this in python.

I want to write a code that given an interval (integer tuple:
start,stop) will find which other
interval it matches to. Here is an example,
suppose I have the following NON-OVERLAPPING intervals

10 - 21 =='a1'
34 - 51 =='a2'
77 - 101 =='a3'
etc

So, suppose I'm give an interval such as (42,50), I should go through
the list and map it to "a2" etc
if the region is a subset of an exiting interval. If the query
interval does not exist in the table or
maps to two or more intervals (eg, (45, 80)) the program return a
None.
....snip...
Many thanks
Lee
OK - I'm going to assume your intervals are inclusive (i.e. 34-51
contains both 34 and 51).

If your intervals are all really all non-overlapping, one thing you
can try is to put all the endpoints in a single list, and sort it.
Then, you can use the bisect module to search for intervals, which
will give you a logarithmic time algorithm.

Here, I'm going to assume you just need the index of the containing
interval. If you really need a name (i.e. 'a1' or 'a2'), you can use a
list of names, and index into that.

I hope those assumptions are valid! if so, the following should work:

from bisect import bisect

# assume initial intervals are sorted
intervals=[(10,21),(34,51), (77,101)]
# get a sorted list of endpoints
endpts=sorted([a for a,b in intervals]+[b for a,b in intervals])

def test_interval(ivl,endpts):
# Find where the left endpoint of ivl would lie in the list
# i.e. the index of the first element greater than ivl[0]
l_idx=bisect(endpts,ivl[0])
# If l_idx is even, then it lies between two intervals, and thus
# is not contained in any interval. Otherwise it returns the index
if l_idx % 2 == 0: return None
# if l_idx is out of bounds (i.e. ==len(endpts)), then ivl is
# not contained in any interval (too large)
if l_idx==len(endpts): return None
# Now, all we need to do is check that the right endpoint of ivl is
# less than or equal to the corresponding endpoint in the list.
# Happily, that endpoint is at index l_idx
if ivl[1]<=endpts[l_idx]:
# So return the index of the interval:
return (l_idx-1)/2
else:
return None
Then, from a shell:
>>print tst.test_interval((0,13),endpts)
None
>>print tst.test_interval((15,21),endpts)
0
>>print tst.test_interval((35,40),endpts)
1
>>print tst.test_interval((40,80),endpts)
None
>>print tst.test_interval((109,200),endpts)
None

Jun 13 '07 #2
Matteo skrev:
OK - I'm going to assume your intervals are inclusive (i.e. 34-51
contains both 34 and 51).

If your intervals are all really all non-overlapping, one thing you
can try is to put all the endpoints in a single list, and sort it.
Then, you can use the bisect module to search for intervals, which
will give you a logarithmic time algorithm.

Here, I'm going to assume you just need the index of the containing
interval. If you really need a name (i.e. 'a1' or 'a2'), you can use a
list of names, and index into that.

I hope those assumptions are valid! if so, the following should work:
I have taken the liberty of simplifying your code, using the fact that
tuples are sorted lexicographically. Note that this requires all
intervals to be tuples and not lists (since list(a) < tuple(b) is always
True).

from bisect import bisect

def test_interval(ivl,intervals):

# Find where ivl would lie in the list
# i.e. the index of the first interval sorting as larger than ivl
idx=bisect(intervals,ivl)
# Left endpoints equal is a special case - a matching interval will be
# to the right of the insertion point
if idx < len(intervals) and intervals[idx][0] == ivl[0]:
if intervals[idx][1] >= ivl[1]:
return idx
else:
return None
# Otherwise, we need to check to the left of the insertion point
if idx 0 and intervals[idx-1][1] >= ivl[1]:
return idx-1
else:
return None
>>intervals =[(10, 21), (34, 51), (77, 101)]
print test_interval((34,35),intervals)
1
>>print test_interval((34,53),intervals)
None
>>print test_interval((77,53),intervals)
2
>>print test_interval((77,83),intervals)
2
>>print test_interval((77,102),intervals)
None
>>print test_interval((77,101),intervals)
2

u"Nis J\xf8rgensen"
Jun 13 '07 #3
Dear Matteo and Nis,
Thankyou very much for your help. I wasn't aware of the bisect
library but it's really useful.
thank you both once again
Lee

On 13 Jun, 23:21, Nis Jørgensen <n...@superlativ.dkwrote:
Matteo skrev:
OK - I'm going to assume your intervals are inclusive (i.e. 34-51
contains both 34 and 51).
If your intervals are all really all non-overlapping, one thing you
can try is to put all the endpoints in a single list, and sort it.
Then, you can use the bisect module to search for intervals, which
will give you a logarithmic time algorithm.
Here, I'm going to assume you just need the index of the containing
interval. If you really need a name (i.e. 'a1' or 'a2'), you can use a
list of names, and index into that.
I hope those assumptions are valid! if so, the following should work:

I have taken the liberty of simplifying your code, using the fact that
tuples are sorted lexicographically. Note that this requires all
intervals to be tuples and not lists (since list(a) < tuple(b) is always
True).

from bisect import bisect

def test_interval(ivl,intervals):

# Find where ivl would lie in the list
# i.e. the index of the first interval sorting as larger than ivl
idx=bisect(intervals,ivl)
# Left endpoints equal is a special case - a matching interval will be
# to the right of the insertion point
if idx < len(intervals) and intervals[idx][0] == ivl[0]:
if intervals[idx][1] >= ivl[1]:
return idx
else:
return None
# Otherwise, we need to check to the left of the insertion point
if idx 0 and intervals[idx-1][1] >= ivl[1]:
return idx-1
else:
return None
>intervals =[(10, 21), (34, 51), (77, 101)]
print test_interval((34,35),intervals)
1
>print test_interval((34,53),intervals)
None
>print test_interval((77,53),intervals)
2
>print test_interval((77,83),intervals)
2
>print test_interval((77,102),intervals)
None
>print test_interval((77,101),intervals)

2

u"Nis J\xf8rgensen"

Jun 14 '07 #4

This thread has been closed and replies have been disabled. Please start a new discussion.