
getting a submatrix of all true


I have a largish data set (1000 observations x 100 floating point
variables), and some of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of the number of missing variables per observation,
picking a number on the low end of the distribution, and dropping the
rows that exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.
Example:

0 0 0
0 0 0      candidate submatrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0      0 0 0 0 0
0 0 0 0 0      0 0 0 0 0      candidate submatrix has 15 elements
0 0 0 0 0      0 0 0 0 0
0 0 1 0 0

0 0
0 0      candidate submatrix has 8 elements
0 0
0 0

I want to programmatically extract the 15 element matrix

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X
John Hunter

Jul 18 '05 #1
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jd******@ace.bsd.uchicago.edu> wrote:

I have a largish data set (1000 observations x 100 floating point
variables), and some of the data are missing. I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix
so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.

I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).

My current approach is to drop rows (observations) that have more than
some prespecified number of missing variables, and then drop the
columns (variables) of the reduced data set that have any missing
values. I chose the threshold for dropping a row by eyeballing the
distribution of the number of missing variables per observation,
picking a number on the low end of the distribution, and dropping the
rows that exceed the threshold.

Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.
Example:

0 0 0
0 0 0      candidate submatrix has 12 elements
0 0 0
0 0 0

1 0 0 0 1
0 0 0 0 0      0 0 0 0 0
0 0 0 0 0      0 0 0 0 0      candidate submatrix has 15 elements
0 0 0 0 0      0 0 0 0 0
0 0 1 0 0

0 0
0 0      candidate submatrix has 8 elements
0 0
0 0

I want to programmatically extract the 15 element matrix

If I understand your original optimality definition, that would be suboptimal.
I.e., how about the 16-element matrix? (x's mark the corresponding zeroes)

1 0 0 0 1
x x 0 x x
x x 0 x x
x x 0 x x
x x 1 x x

Or do they have to be adjacent?

Following the approach described above, I get the desired answer in
the example below, though this is a hack solution and I have the
feeling there is a better one.

from Numeric import nonzero, array, take, sum

X = array([[1, 0, 0, 0, 1],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0]])

goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
X = take(X, goodObsInd) # drop the bad

goodVarInd = nonzero(sum(X)==0) # variables with no missing data
X = take(X, goodVarInd, 1 ) # drop the bad variables

print X


Brute force seems to work on this little example. Maybe it can
be memoized and optimized and/or whatever to handle your larger matrix fast enough?

====< submatrix.py >================================
X = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0]
    ]

def optimrcs(mx, rows=[], cols=[]):
    if not rows or not cols: return 0, [], []
    maxscore = 0
    for r in rows:
        if not mx[r].count(1): continue
        for c in cols:
            if not mx[r][c]: continue
            # eval column submatrix vs row submatrix
            colsxthis = cols[:]; colsxthis.remove(c)
            colscore, rset, cset = optimrcs(mx, rows, colsxthis)
            if colscore > maxscore:
                maxscore, maxrows, maxcols = colscore, rset, cset
            rowsxthis = rows[:]; rowsxthis.remove(r)
            rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
            if rowscore >= maxscore:
                maxscore, maxrows, maxcols = rowscore, rset, cset
    if not maxscore:
        return len(rows)*len(cols), rows, cols
    return maxscore, maxrows, maxcols

if __name__ == '__main__':
    score, rowsel, colsel = optimrcs(X, range(5), range(5))
    print 'Score = %s, rows = %s, cols = %s' % (score, rowsel, colsel)
    print
    for r in range(5):
        row = X[r]
        for c in range(5):
            if r in rowsel and c in colsel:
                print '<%s>' % row[c],
            else:
                print ' %s ' % row[c],
        print
====================================================
Running it results thus:

[21:24] C:\pywk\clp>submatrix.py
Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>

Regards,
Bengt Richter
Jul 18 '05 #2

"John Hunter" <jd******@ace.bsd.uchicago.edu> wrote in message
news:ma**********************************@python.org...

I have a largish data set (1000 observations x 100 floating point
variables), and some of the data are missing.

All too typical -- missing data are the bane of statistics.
I want to try a
variety of clustering, neural network, etc, algorithms on the data,
and to keep life simple I want to reduce the dimensions of the matrix so that I have no missing values, since not all the algorithms are
able to handle them and there is sufficient redundancy in the
variables that I can afford to lose some.


Statisticians have tried a variety of approaches. Googling for
'statistics "missing data"' will give you some leads if you want.

Terry J. Reedy
Jul 18 '05 #3
>>>>> "Bengt" == Bengt Richter <bo**@oz.net> writes:
Bengt> If I understand your original optimality definition, that
Bengt> would be suboptimal. I.e., how about the 16-element
Bengt> matrix? (x's mark the corresponding zeroes)

Bengt> Or do they have to be adjacent?

No, in general they won't be. I missed that one. Just goes to show
that my "solution" is not one, which I knew. But it does efficiently
deliver good solutions, where good means large but not optimal.

Bengt> Brute force seems to work on this little example. Maybe it
Bengt> can be memoized and optimized and/or whatever to handle
Bengt> your larger matrix fast enough?

Thanks for the example. Unfortunately, it is too slow even for
moderate size matrices (30,10). I've been running it for over two
hours for a 30x10 matrix and it hasn't finished. And my data are
1000x100!

Here is a little code to generate larger matrices for testing....

from Numeric import zeros, array, Int, put, reshape
from RandomArray import uniform

numRows, numCols = 30,10
numMissing = int(0.05*numRows*numCols) # 5% missing

X = zeros((numRows*numCols,))
ind = uniform(0, numRows*numCols, (numMissing,)).astype(Int)

put(X,ind,1)
X = reshape(X, (numRows,numCols))

# turn it into a list for your code....
X = [ [val for val in row] for row in X]
for row in X: print row

Last night, I began to formulate the problem as a logic statement,
hoping this would give me an idea of how to proceed. No progress yet,
but I have come to the conclusion that with P ones, brute force
requires 2^P combinations. With 1000x100 with 5% missing that gives
me 2^5000. Not good.
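
To make the 2^P formulation concrete, here is a minimal sketch of that
brute force (in modern Python/NumPy rather than the Numeric used above;
brute_force_2p is a hypothetical name): every one of the P ones must be
covered by dropping either its row or its column, so we enumerate all
2^P assignments and keep the best.

import itertools
import numpy as np

def brute_force_2p(X):
    # every 1 must lose its row (pick 0) or its column (pick 1)
    ones = [(int(r), int(c)) for r, c in zip(*np.nonzero(X))]
    nrows, ncols = X.shape
    best = (0, [], [])
    for choice in itertools.product((0, 1), repeat=len(ones)):
        drop_r = set(r for (r, c), pick in zip(ones, choice) if pick == 0)
        drop_c = set(c for (r, c), pick in zip(ones, choice) if pick == 1)
        keep_r = [r for r in range(nrows) if r not in drop_r]
        keep_c = [c for c in range(ncols) if c not in drop_c]
        score = len(keep_r) * len(keep_c)
        if score > best[0]:
            best = (score, keep_r, keep_c)
    return best

X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])
print(brute_force_2p(X))   # (16, [1, 2, 3, 4], [0, 1, 3, 4])

On the 5x5 example this is only 2^3 = 8 assignments, which is why it is
instant; at 2^5000 the same loop is hopeless.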

Thanks for your suggestion, though. If you have any more thoughts,
let me know.

John Hunter


Jul 18 '05 #6
>>>>> "Terry" == Terry Reedy <tj*****@udel.edu> writes:
Terry> Statisticians have tried a variety of approaches. Googling
Terry> for ' statistics "missing data" 'will give you some leads
Terry> if you want.

I have done some searching. I'm familiar with the common methods
(delete every row that contains any missing, replace missing via mean
or regression or something clever) but haven't seen any discussion of
dropping variables and observations together to yield data sets with
no missing values. Have you seen something like this?
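
For concreteness, a minimal sketch of those two common methods (modern
NumPy rather than the Numeric used elsewhere in this thread, with NaN
marking missing values):

import numpy as np

def listwise_delete(A):
    # drop every observation (row) containing any missing value
    return A[~np.isnan(A).any(axis=1)]

def mean_impute(A):
    # replace each missing value with its variable's (column) mean
    out = A.copy()
    col_means = np.nanmean(A, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_means[cols]
    return out

A = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [3.0, 8.0]])
print(listwise_delete(A))   # [[2. 4.] [3. 8.]]
print(mean_impute(A))       # the NaN becomes 6.0, the mean of 4 and 8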

John Hunter

Jul 18 '05 #7

"John Hunter" <jd******@ace.bsd.uchicago.edu> wrote in message
news:ma*********************************@python.org...
>> "Terry" == Terry Reedy <tj*****@udel.edu> writes:


Terry> Statisticians have tried a variety of approaches. Googling
Terry> for 'statistics "missing data"' will give you some leads
Terry> if you want.

I have done some searching. I'm familiar with the common methods
(delete every row that contains any missing, replace missing via mean or regression or something clever) but haven't seen any discussion of dropping variables and observations together to yield data sets with
no missing values. Have you seen something like this?


There are also calculation methods for at least some analyses that
allow for missing data. One of the google hits is for the book
Statistical Analysis with Missing Data. I have not seen it, but it is
not unique.

As I hinted, there are no really nice solutions to missing data. I
have done both row and column deletion. Sometimes I have done
multiple analyses with different deletion strategies: once with enough
vars deleted so all or most cases are complete, and again with enough
cases deleted so that all or most var are complete.

I would start by counting (with a program) the number of missing
values for each row and then constructing the frequency distribution
thereof. Then the same for the columns, with the addition of a
correlation table or tables.
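
A minimal sketch of that counting step (modern NumPy; missing_freq is a
hypothetical helper, with a boolean mask marking missing values):

import numpy as np

def missing_freq(mask, axis=1):
    # frequency distribution of missing counts per row (axis=1) or column (axis=0)
    counts = mask.sum(axis=axis)
    values, freqs = np.unique(counts, return_counts=True)
    return dict(zip(values.tolist(), freqs.tolist()))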

One thing one can do with vars is to combine some to make a composite
measure. For instance, if three variables more-or-less measure the
same thing, one can combine (perhaps by the mean of those present) to
make one variable that is present if any of the three are, so it is
only missing for cases (rows) that are missing all three. This type
of work requires that you look at the variables and consider their
meaning, rather than just inputting them into a blind procedure that
considers all vars to be the same.
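
A minimal sketch of that composite-measure idea (modern NumPy;
hypothetical helper name, NaN marking missing values): average
whichever of the related measures are present, so the composite is
missing only when all of them are.

import numpy as np

def composite(measures):
    # measures: observations x related variables, NaN = missing
    out = np.full(len(measures), np.nan)
    any_present = ~np.isnan(measures).all(axis=1)
    out[any_present] = np.nanmean(measures[any_present], axis=1)
    return out

three_vars = np.array([[1.0, np.nan, 3.0],
                       [np.nan, np.nan, np.nan],
                       [2.0, 2.5, np.nan]])
print(composite(three_vars))   # [2.0, nan, 2.25]: missing only where all three are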

Terry J. Reedy
Jul 18 '05 #8
bo**@oz.net (Bengt Richter) wrote in message news:<be**********@216.39.172.122>...
On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <jd******@ace.bsd.uchicago.edu> wrote:
I am currently using a hack that works, but it makes me wonder if
there is an optimal solution. I define optimal as the removal of rows
and columns such that there are no missing values and
max(numRows*numCols).
There must be! For each missing entry you can choose to drop either
the row or the column - giving a maximum of 2^N possible ways for N
missing elements. Given that the order in which the rows/columns are
deleted is unimportant the number of possibilities to test is less
than that, but the difficulty is that your decision about row or
column for one element affects the decision for other elements if they
share rows or columns. Graph theoretic approaches probably help - it's
similar to reordering matrix elements to avoid fill in when
decomposing sparse matrices. Depending on N, you might be able to test
all possibilities in a reasonable time, but I have a nasty feeling you
need to check all 2^M to be sure of an optimal solution, where M is
the subset of N which share a row or column with another problematic
element.
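
A sketch of that interaction structure (modern Python/NumPy;
one_components is a hypothetical helper): cluster the problematic
1-cells so that two cells land in the same cluster when they share a
row or column, since only those choices constrain each other directly;
the largest cluster is the M above.

import numpy as np

def one_components(X):
    ones = [(int(r), int(c)) for r, c in zip(*np.nonzero(X))]
    parent = list(range(len(ones)))          # union-find over the 1-cells
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, (r1, c1) in enumerate(ones):
        for j, (r2, c2) in enumerate(ones[:i]):
            if r1 == r2 or c1 == c2:
                parent[find(i)] = find(j)    # merge interacting cells
    groups = {}
    for i in range(len(ones)):
        groups.setdefault(find(i), []).append(ones[i])
    return list(groups.values())

X = np.array([[1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]])
print(one_components(X))   # [[(0, 0), (0, 4)], [(2, 2)]] -> largest cluster M = 2
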
Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.


Sadly I can only give you a method which is certainly not optimal, but
at least finds something which is probably not too bad, and fairly
quickly. Uses the Numeric module to speed up, so that a few thousand
by a few hundred is quickly computable. I'm curious to know if it does
much better or worse than other algorithms.

Interesting problem!

Cheers,

Jon

=====================================================================
from Numeric import *

def pickrcs(a, debug=0):
    b = array(a, copy=1)                 # local copy of array is modified
    dr = zeros(b.shape[0])               # Arrays to note deleted rows...
    dc = zeros(b.shape[1])               # ... and columns
    sizeleft = [b.shape[0], b.shape[1]]  # Remaining dimensions
    while 1:                             # Keep deleting till none left
        sr = sum(b,1)                    # Row scores = ij's to remove
        sc = sum(b,0)                    # Column scores
        mr = argmax(sr)                  # index highest row score
        mc = argmax(sc)                  # index highest column score
        if sr[mr]==0 and sc[mc]==0:
            break                        # stop when nothing to delete
        # pick the row/column to delete fewest useful elements
        if sizeleft[1]-sr[mr] > sizeleft[0]-sc[mc]:
            b[:,mc] = 0                  # Zero out column
            dc[mc] = 1                   # tags column as deleted
            sizeleft[1] -= 1
        else:
            b[mr,:] = 0                  # Zero out row
            dr[mr] = 1
            sizeleft[0] -= 1
    # end of deletions loop - should be no missing elements now
    #
    # paranoia!, check if anything was left after deletions
    if sum(dr)<b.shape[0] and sum(dc)<b.shape[1]:
        dr = compress(dr==0, range(b.shape[0]))  # gives return format of
        dc = compress(dc==0, range(b.shape[1]))  # optimrcs posted by Bengt
        score = dr.shape[0]*dc.shape[0]          # Richter
        return score, dr, dc
    else:
        print "Sorry, didn't manage to let anything survive"
        return 0, 0, 0

# test code

X = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0]
    ]

if __name__ == '__main__':
    a = array(X)
    score, rowsel, colsel = pickrcs(a)
    print "score=", score
    for r in range(5):
        row = X[r]
        for c in range(5):
            if r in rowsel and c in colsel:
                print '<%s>' % row[c],
            else:
                print ' %s ' % row[c],
        print
    # now try for a larger problem - use a random pattern
    import RandomArray, time
    start = time.time()
    x0 = 1000                            # problem size
    x1 = 600
    a = RandomArray.random((x0,x1))
    a = where(a>0.999, 1, 0)             # ~ 0.1% of elements are 1
    print "Using a large random matrix"
    print "Number of non-zeros=", sum(ravel(a)), "in", x0, "x", x1, "matrix"
    score, rowsel, colsel = pickrcs(a)
    print 'Score = %s' % (score)
    print 'Number of rows deleted=', a.shape[0]-rowsel.shape[0]
    print 'Number of columns deleted=', a.shape[1]-colsel.shape[0]
    print time.time()-start, " seconds"

==sample output (of course second part is random!)===============

score= 16
1 0 0 0 1
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 0 <0> <0>
<0> <0> 1 <0> <0>
Using a large random matrix
Number of non-zeros= 607 in 1000 x 600 matrix
Score = 331776
Number of rows deleted= 424
Number of columns deleted= 24
3.38999998569 seconds
Jul 18 '05 #9
John Hunter <jd******@ace.bsd.uchicago.edu> wrote:
Another way of formulating the question: for a sparse boolean matrix
(sparse on True), what is the optimal way to remove rows and columns
so that the total number of elements in the matrix is maximal and
there are no True values left.


After having read Bengt's code (and scraping the parts of my exploded
head from the floor) and after later getting the hint about the size
of the problem being 2**N, it finally dawned on me that the problem
would be related to getting all possible combinations of a range.

The following code has the advantage that it generates solutions by
index, for example "M[i]" returns the score and the row and column
information for index i. (note that this is not the same as finding the
optimal solution, but it still could be handy to be able to generate a
specific one by index)

However for small matrices it is possible to do exhaustive searching
with "max(M)".

Maybe if this would be combined with some heuristic algorithm it would
be better (genetic algorithm?)

Code below (needs Python 2.3, very lightly tested), I hope this helps,

Anton

class Combinations:

    def __init__(self, n, k):
        self.n, self.k, self.count = n, k, self.noverk(n,k)

    def __getitem__(self, index):
        # combination number index
        if index > self.count - 1: raise IndexError
        res, x, rest = [], self.n, index
        for i in range(self.k, 0, -1):
            while self.noverk(x,i) > rest: x -= 1
            rest -= self.noverk(x,i)
            res.append(x)
        return res

    def noverk(self, n, k):
        return reduce(lambda a,b: a*(n-b)/(b+1), range(k), 1)

class AllCombinations:

    def __init__(self, n):
        self.n, self.count = n, 2**n

    def __getitem__(self, index):
        # combination number index for all k in combinations(n,k)
        if index > self.count - 1: raise IndexError
        n, rest = self.n, index
        for k in range(n+1):
            c = Combinations(n,k)
            if rest - c.count < 0:
                return c[rest]
            rest -= c.count

class ScoreMatrix:

    def __init__(self, X):
        self.AC = AllCombinations(len(X))
        self.X = X

    def __getitem__(self, i):
        # score selected rows and columns by index
        rows = self.AC[i][::-1]
        if rows:
            Y = [self.X[i] for i in rows]
            Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
            cols = [i for i,z in Z]
            score = sum(map(len, [z for i,z in Z]))
            return score, rows, cols
        else: return 0, [], []

def test():
    X = [ [1, 0, 0, 0, 1],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 1, 0, 0], ]

    M = ScoreMatrix(X)
    sc, rows, cols = max(M)
    print sc
    for i, r in enumerate(X):
        for j, c in enumerate(r):
            if i in rows and j in cols: print c,
            else: print 'x',
        print

if __name__=='__main__':
    test()

output:

16
x x x x x
0 0 x 0 0
x x x x x
0 0 x 0 0
0 0 x 0 0
0 0 x 0 0
Jul 18 '05 #10
John Hunter <jd******@ace.bsd.uchicago.edu> wrote:
>> "Bengt" == Bengt Richter <bo**@oz.net> writes:
Bengt> Brute force seems to work on this little example. Maybe it
Bengt> can be memoized and optimized and/or whatever to handle
Bengt> your larger matrix fast enough?

Thanks for the example. Unfortunately, it is too slow even for
moderate size matrices (30,10). I've been running it for over two
hours for a 30x10 matrix and it hasn't finished. And my data are
1000x100!


I posted a somewhat long version before (inspired by Bengts idea) but
I remembered an easier way to generate all combinations, and also I
noticed that there is freedom in choosing to analyze rows or columns,
which can be computationally advantageous. Below is a version that
depends mostly on 2**N where N is the smallest of those two values.
Unfortunately 2**100 is still way too big a number, but my code can now
routinely solve 100x15 matrices ...
Last night, I began to formulate the problem as a logic statement,
hoping this would give me an idea of how to proceed. But no progress
yet. But I have come to the conclusion that with P ones, brute force
requires 2^P combinations. With 1000x100 with 5% missing that gives
me 2^5000. Not good.
IMO it depends (except for the amount of time spent in copying data of
course) on the smallest of the number of rows or columns, so that's
2**100 in this case. Maybe for some matrices, your number is
preferable?
Thanks for your suggestion, though. If you have any more thoughts,
let me know.


I hope you don't mind me following this too :-) Nice problem, and I'm
not so sure about my math as I sound above, if anyone can prove me
right or wrong I'd be happy anyway ...

Anton

---

class ScoreMatrix:

    def __init__(self, X):
        n1, n2 = len(X), len(X[0])
        if n2 < n1:
            self.X = zip(*X)
            self.rotated = True
            n = n2
        else:
            self.X = X
            n = n1
            self.rotated = False
        self.R = range(n)
        self.count = 2**n

    def __getitem__(self, i):
        # score selected rows and columns by index
        if not (-1 < i < self.count): raise IndexError
        rows = [x for j,x in enumerate(self.R) if 1<<j & i]
        if rows:
            Y = [self.X[i] for i in rows]
            Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
            cols = [i for i,z in Z]
            score = sum(map(len, [z for i,z in Z]))
            if self.rotated: return score, cols, rows
            else: return score, rows, cols
        else: return 0, [], []

def test():
    from random import random, randint
    f = .05
    r, c = 100, 15
    X = []
    for i in range(r):
        X.append([])
        for j in range(c):
            X[i].append(int(random()<f))
    M = ScoreMatrix(X)
    highscore = 0, [], []
    for i, sc in enumerate(M):
        if sc > highscore:
            mi, highscore = i, sc
    sc, rows, cols = highscore
    print "maximum score: %s" % (sc)
    print "index of this score: %s" % (mi)
    print "tabledata:"
    for i, r in enumerate(X):
        for j, c in enumerate(r):
            if i in rows and j in cols: print c,
            else: print 'x',
        print

if __name__=='__main__':
    test()

Jul 18 '05 #11
Following Bengt's and an***@vredegoor.doge.nl (Anton Vredegoor)'s leads,
the following code can be fast (at times). It is quite sensitive to the
probability of non-zeroness (.01 works well; .05 is nowhere near so
nice).

I get 2.25 min to do 25 x 25 at .05
2.75 min to do 30 x 30 at .05
It gets slow _very_ fast, but gets good numbers if the probability is low
.25 min to do 45 x 45 at .01
1.5 min to do 50 x 50 at .01
-Scott David Daniels
Sc***********@Acm.Org (not fully back up yet, though).
Here's the code:
################################
# myopt.py
__version__ = '0.3'

try:
    assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
except NameError:
    def enumerate(iterable):
        # Not exact, but close enough
        lst = list(iterable)
        return zip(range(len(lst)), lst)

def bitcount(row):
    '''Return the number of on bits in the integer argument'''
    assert row >= 0
    result = 0
    while row:
        result += 1
        lsbit = row & -row
        row ^= lsbit
    return result

def rowencode(vector):
    '''convert from a buncha numbers to a bit-representation'''
    bit = 1L
    result = 0
    for element in vector:
        if element:
            result |= bit
        bit <<= 1
    return result

def rowdecode(value, mask=-1):
    '''convert from a bit-rep to a buncha numbers (at least as long as mask)'''
    if mask < 0: mask = value
    assert value >= 0
    result = []
    while mask or value:
        result.append(value & 1)
        mask >>= 1
        value >>= 1
    return result

class Answer(object):
    '''An answer represents a result calculation.'''
    __slots__ = 'rows', 'colmask', '_score'
    totalrequests = 0
    totalcalcs = 0

    def __init__(self, colmask, rows):
        '''The columns in colmask are the ones we keep.'''
        self.colmask = colmask
        self.rows = rows
        self._score = None

    def score(self):
        '''Calculate the score lazily'''
        self.__class__.totalrequests += 1
        if self._score is None:
            self.__class__.totalcalcs += 1
            self._score = bitcount(self.colmask) * len(self.rows)
        return self._score

    def __repr__(self):
        return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
                                     len(self.rows), self.rows,
                                     self.colmask, self._score)

totalcalls = 0

def choose(rows, keepcols, N=0, keeprows=None):
    '''Choose rows and columns to keep. Return an Answer for the choice'''
    global totalcalls
    totalcalls += 1
    if keeprows is None:
        keeprows = []
    try:
        while 0 == rows[N] & keepcols:
            keeprows.append(N)
            N += 1
    except IndexError:
        return Answer(keepcols, keeprows)

    # a difference: some kept columns in this row are non-0
    # must drop either those columns or this row

    # Calculate result if we keep this row (drop problem columns)
    withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N])

    # Calculate result if we drop this row
    skiprow = choose(rows, keepcols, N+1, keeprows)

    # Choose the better answer (keep the row if everything is equal).
    if (withrow.colmask == skiprow.colmask
        or withrow.score() >= skiprow.score()):
        return withrow
    else:
        return skiprow

# The data used from the example
X = [ [1, 0, 0, 0, 1],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 1, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 1, 0, 0],
      [0, 0, 1, 0, 0] ]

def printrep(row, symbols, mask=0):
    '''A row representing a single row (column-masked by mask)'''
    assert mask >= 0
    result = []
    for element in row:
        result.append(symbols[(1 & mask) * 2 + (element != 0)])
        mask >>= 1
    assert mask == 0  # mask doesn't extend beyond data.
    return ''.join(result)

def printanswer(data, rows, keepcols):
    '''Print the represented row'''
    toohuge = len(data)
    rowqueue = rows + [toohuge]
    rowqueue.reverse()
    nextrow = rowqueue.pop()
    for rownumber, row in enumerate(data):
        if nextrow > rownumber:
            # This row was zapped
            print '#', printrep(row, '01OI', keepcols)
        else:
            assert rownumber == nextrow  # This row was kept
            nextrow = rowqueue.pop()
            print '_', printrep(row, '01~@', keepcols)
    assert nextrow == toohuge and not rowqueue

def getanswer(data):
    '''Calculate the best-cut for a particular matrix'''
    columns = max([len(row) for row in data])
    rowdata = [rowencode(row) for row in data]
    return choose(rowdata, (1L << columns) - 1)

def main(data=X):
    global totalcalls

    totalcalls = 0
    answer = getanswer(data)
    print 'Requested: %s, Calculated: %s,' % (
        Answer.totalrequests, Answer.totalcalcs),
    print 'answer: %r,' % answer,
    print 'Score: %s' % answer.score()
    print 'Total choose calls required: %s' % totalcalls

    printanswer(data, answer.rows, answer.colmask)

def rangen(rows, columns, probability=0.05):
    '''create a rows by columns data table with 1s at the given probability'''
    import random
    result = []
    for row in range(rows):
        result.append([int(random.random() < probability)
                       for column in range(columns)])
    return result

if __name__ == '__main__':
    import sys
    assert getanswer([[0]]).score() == 1
    assert getanswer([[0,1], [1,0]]).score() == 1
    assert getanswer([[0,1,0], [1,0,0]]).score() == 2
    if len(sys.argv) < 2:
        main(X)
    else:
        args = sys.argv[1:]
        if '.' in args[-1]:
            assert len(args) > 1
            probability = float(args.pop())
        else:
            probability = .2
        rows = int(args[0])
        if len(args) == 1:
            cols = rows
        else:
            assert len(args) == 2
            cols = int(args[1])
        main(rangen(rows, cols, probability))
Jul 18 '05 #12
In article <3f********@nntp0.pdx.net>,
sc***********@acm.org (Scott David Daniels) writes:
Following Bengt's and an***@vredegoor.doge.nl (Anton Vredegoor)'s leads,
the following code can be fast (at times). It is quite sensitive to the
probability of non-zeroness (.01 works well; .05 is nowhere near so
nice).

I get 2.25 min to do 25 x 25 at .05
2.75 min to do 30 x 30 at .05
It gets slow _very_ fast, but gets good numbers if the probability is low
.25 min to do 45 x 45 at .01
1.5 min to do 50 x 50 at .01


OK, a bit faster: (bitcount faster, allow choose to sometimes quit early).
I get 0.25 min to do 25 x 25 at .05
5.5 min to do 30 x 30 at .05 (?! maybe above was wrong)
and:
1 sec to do 45 x 45 at .01
9 sec to do 50 x 50 at .01

-Scott David Daniels
Sc***********@Acm.Org

###################################
# myopt.py

__version__ = '0.4'

try:
    assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
except NameError:
    def enumerate(iterable):
        lst = list(iterable)
        return zip(range(len(lst)), lst)

def bitcount(row):
    '''Return the number of on bits in the integer'''
    assert row >= 0
    result = 0
    while row:
        result += 1
        lsbit = row & -row
        row ^= lsbit
    return result

bytebits = [bitcount(n) for n in range(256)]  # precalculate byte counts

def bitcount(row):  # replace previous bitcount
    '''Return the number of on bits in the integer (byte at a time)'''
    assert row >= 0
    result = bytebits[row & 255]
    while row >= 256:
        row >>= 8
        result += bytebits[row & 255]
    return result

def rowencode(vector):
    '''convert from a buncha numbers to a bit-representation'''
    bit = 1L
    result = 0
    for element in vector:
        if element:
            result |= bit
        bit <<= 1
    return result

class Answer(object):
    '''An answer represents a result calculation.'''
    __slots__ = 'rows', 'colmask', '_score'
    totalrequests = 0
    totalcalcs = 0

    def __init__(self, colmask, rows):
        '''The columns in colmask are the ones we keep.'''
        self.colmask = colmask
        self.rows = rows
        self._score = None

    def score(self):
        '''Calculate the score lazily'''
        self.__class__.totalrequests += 1
        if self._score is None:
            self.__class__.totalcalcs += 1
            self._score = bitcount(self.colmask) * len(self.rows)
        return self._score

    def __repr__(self):
        return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
                                     len(self.rows), self.rows,
                                     self.colmask, self._score)

totalcalls = 0

def choose(rows, keepcols, N=0, keeprows=None, best=None):
    '''Choose rows and columns to keep. Return an Answer for the choice'''
    global totalcalls
    totalcalls += 1
    if keeprows is None:
        keeprows = []
    try:
        while 0 == rows[N] & keepcols:
            keeprows.append(N)
            N += 1
    except IndexError:
        return Answer(keepcols, keeprows)

    # a difference: some kept columns in this row are non-0
    # must drop either those columns or this row

    # Calculate result if we keep this row (drop problem columns)
    newmask = keepcols & ~rows[N]
    if best and newmask == best.colmask:
        withrow = best  # Already have a calculated answer with this mask; use it.
    else:
        withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N], best)

    # Calculate result if we drop this row
    skiprow = choose(rows, keepcols, N+1, keeprows, best or withrow)

    # Choose the better answer (keep the row if everything is equal).
    if (withrow.colmask == skiprow.colmask
        or withrow.score() >= skiprow.score()):
        return withrow
    else:
        return skiprow

# The data used from the example
X = [ [1, 0, 0, 0, 1],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 1, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 1, 0, 0],
      [0, 0, 1, 0, 0] ]

def printrep(row, symbols, mask=0):
    '''A row representing a single row (column-masked by mask)'''
    assert mask >= 0
    result = []
    for element in row:
        result.append(symbols[(1 & mask) * 2 + (element != 0)])
        mask >>= 1
    assert mask == 0  # mask doesn't extend beyond data.
    return ''.join(result)

def printanswer(data, rows, keepcols):
    '''Print the represented row'''
    toohuge = len(data)
    rowqueue = rows + [toohuge]
    rowqueue.reverse()
    nextrow = rowqueue.pop()
    for rownumber, row in enumerate(data):
        if nextrow > rownumber:
            # This row was zapped
            print '#', printrep(row, '01OI', keepcols)
        else:
            assert rownumber == nextrow  # This row was kept
            nextrow = rowqueue.pop()
            print '_', printrep(row, '01~@', keepcols)
    assert nextrow == toohuge and not rowqueue

def getanswer(data):
    '''Calculate the best-cut for a particular matrix'''
    columns = max([len(row) for row in data])
    rowdata = [rowencode(row) for row in data]
    return choose(rowdata, (1L << columns) - 1)

def main(data=X):
    global totalcalls

    totalcalls = 0
    answer = getanswer(data)
    print 'Requested: %s, Calculated: %s,' % (
        Answer.totalrequests, Answer.totalcalcs),
    print 'answer: %r,' % answer,
    print 'Score: %s' % answer.score()
    print 'Total choose calls required: %s' % totalcalls

    printanswer(data, answer.rows, answer.colmask)

def rangen(rows, columns, probability=0.05):
    '''create a rows by columns data table with 1s at the given probability'''
    import random
    result = []
    for row in range(rows):
        result.append([int(random.random() < probability)
                       for column in range(columns)])
    return result

if __name__ == '__main__':
    import sys
    assert getanswer([[0]]).score() == 1
    assert getanswer([[0,1], [1,0]]).score() == 1
    assert getanswer([[0,1,0], [1,0,0]]).score() == 2
    if len(sys.argv) < 2:
        main(X)
    else:
        args = sys.argv[1:]
        if '.' in args[-1]:
            assert len(args) > 1
            probability = float(args.pop())
        else:
            probability = .2
        rows = int(args[0])
        if len(args) == 1:
            cols = rows
        else:
            assert len(args) == 2
            cols = int(args[1])
        main(rangen(rows, cols, probability))
Jul 18 '05 #13
>>>>> "Jon" == Jon <wr****@esrf.fr> writes:

Jon> Sadly I can only give you a method which is certainly not
Jon> optimal, but at least finds something which is probably not
Jon> too bad, and fairly quickly. Uses the Numeric module to speed
Jon> up, so that a few thousand by a few hundred is quickly
Jon> computable. I'm curious to know if it does much better or
Jon> worse than other algorithms.

Hi Jon, thanks for the suggestion. For my data set, your solution
performed a little better size-wise than my poor man's approach, which
I gave in the original post. And it is certainly fast enough. I
liked your idea of zero-ing out the elements when you made a decision
to delete a row or column. When I was thinking about solutions using
Numeric, I kept running up against the problem of deleting a row or
column (which entails a whole array copy so is expensive).

If anyone wants to try their algorithm on "real world data" which,
unlike the ones discussed here, have lots of correlation between
missing rows and columns that might be exploited, I have uploaded the
missing data matrix to a web server. It can be loaded with:

from urllib import urlopen
url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [ int(val) for val in line.split()]
      for line in urlopen(url).readlines()]
# L is 1072x83

I think I may adapt your algorithm to terminate when the percent
missing falls below some threshold, removing the requirement that
*all* missing values be removed. This will drop the worst offenders
observation or variable wise. Then I can use a regression approach to
fill in the remaining ones.
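
A minimal sketch of that regression fill-in step (modern NumPy rather
than Numeric; regression_fill is a hypothetical helper): predict each
incomplete column from the complete columns plus an intercept by least
squares, and fill in only its missing entries.

import numpy as np

def regression_fill(A, mask):
    # A: observations x variables; mask: True where a value is missing
    complete = ~mask.any(axis=0)                   # fully observed columns
    Xc = np.column_stack([A[:, complete], np.ones(len(A))])
    out = A.copy()
    for j in np.nonzero(mask.any(axis=0))[0]:
        ok = ~mask[:, j]                           # rows where column j is observed
        coef, _, _, _ = np.linalg.lstsq(Xc[ok], A[ok, j], rcond=None)
        out[~ok, j] = Xc[~ok] @ coef               # predicted values for the gaps
    return out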

Jon> Interesting problem!

I agree. There are lots of extensions to the original formulation
that are relevant to the missing value problem. As we have seen in
the posts here, there are good exact solutions for smallish matrices,
but for larger ones a statistical approach is required. The right
statistical solution surely depends on the correlation structure of
the matrix.

Also, it would be nice to generalize to higher dimensions (eg 3D
matrices). For my problem, I have N observations of M variables over
D days. So this is the 3D version of the 2D problem above. Your
solution is readily adaptable to this case. More generally, you might
want to impose extra requirements, like, no days can be dropped, or at
most 10% of the days can be dropped, or if you want to compute changes
in variables over days, that no days can be dropped.
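
A sketch of that N-D adaptation (modern NumPy; trim_nd is a
hypothetical helper, and the extra per-axis constraints are left out):
at each step drop the slice, along whichever axis, with the greatest
fraction of missing cells.

import numpy as np

def trim_nd(mask, thresh=0.0):
    # mask: N-D boolean array, True = missing
    keep = [list(range(s)) for s in mask.shape]    # surviving indices per axis
    m = mask.copy()
    while m.size and m.any():
        best = None
        for ax in range(m.ndim):
            other = tuple(i for i in range(m.ndim) if i != ax)
            frac = m.mean(axis=other)              # missing fraction per slice
            i = int(frac.argmax())
            if best is None or frac[i] > best[0]:
                best = (frac[i], ax, i)
        fracval, ax, i = best
        if fracval <= thresh:
            break
        m = np.delete(m, i, axis=ax)               # drop the worst slice
        del keep[ax][i]
    return keep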

The more general formulation is something like

Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
matrix C, what is the largest submatrix of X of dimensions (N1, M1,
P1, ...) such that the fraction of remaining dimensions satisfies

(N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ...)

and the total percent true <= p.

The original problem is a special case of this where alpha=0, beta=0,
p=0 and C unspecified.

Anyone looking for a dissertation project :-) ?

John Hunter

Jul 18 '05 #14
>>>>> "John" == John Hunter <jd******@ace.bsd.uchicago.edu> writes:

John> I think I may adapt your algorithm to terminate when the
John> percent missing falls below some threshold, removing the
John> requirement that *all* missing values be removed. This will
John> drop the worst offenders observation or variable wise. Then
John> I can use a regression approach to fill in the remaining
John> ones.
I think I have found a major improvement on your algorithm, Jon, by
changing the rule which determines whether to drop a row or a column
after you have found the worst offenders for each. The original
approach was to keep the one with the most valid (zero) elements,
which makes sense since that is the quantity you are trying to
maximize. But consider the case where there is correlation so that
true values on a row or column tend to be correlated with other true
values on that row or column, as in

X = 1 0
1 0
1 0
1 0
1 0
0 0
0 0

Your algorithm would keep the column, since keeping a row gives you
only one zero whereas keeping a column gives you 2. By looking at the
*fraction* of ones in the worst offending row and column, you can do
much better. By deleting the row or column with the greatest fraction
of ones, you reduce the likelihood that you'll encounter a one in
future iterations. Using this approach, I was able to find a 893x67
submatrix with a score of 59831 versus the 462x81 submatrix with a
score of 37422.

I have also added thresholds to the code, so that you can terminate
with a fraction of missing values remaining. By setting the
thresholds to zero, you get the special case of no missing values.

The example below also includes a no-copy optimization that Jon
emailed me off list. The algorithm run time for a 1073x82 matrix is
around 30ms on my 700MHz machine.

John Hunter

from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *

def trim_missing(a, thresh=(0.05,0.05)):
    """
    a is a numObservations by numVariables boolean matrix

    thresh = (othresh, vthresh) are the observation and variable
    thresholds that trigger deletion

        remove any rows where percent missing vars > vthresh
        remove any cols where percent missing obs > othresh

    This is done iteratively, rechecking the percent missing on each
    iteration of row or col removal, with the largest offender removed
    on each iteration

    return value is score, rowInd, colInd where the two ind arrays
    are the indices of the rows and cols that were kept
    """
    othresh, vthresh = thresh    # thresholds for deletion
    dr = zeros(a.shape[0])       # Arrays to note deleted rows...
    dc = zeros(a.shape[1])       # ... and columns
    sizeleft = list(a.shape)     # Remaining dimensions
    sr = sum(a,1)                # Row scores = ij's to remove
    sc = sum(a,0)                # Column scores
    while 1:                     # Keep deleting till none left
        mr = argmax(sr)          # index highest row score
        mc = argmax(sc)          # index highest column score
        fracRow = sr[mr]/sizeleft[1]
        fracCol = sc[mc]/sizeleft[0]

        if fracRow<=vthresh and fracCol<=othresh:
            break                # stop when missing criteria are satisfied
        if fracRow-vthresh<fracCol-othresh:
            sr = sr-a[:,mc]      # rows reduced by contents of column
            sc[mc] = 0           # sum of column now zero
            dc[mc] = 1           # tags column as deleted
            sizeleft[1] -= 1
        else:
            # deleting the row
            sc = sc-a[mr,:]      # columns reduced by contents of row
            sr[mr] = 0           # sum of row is now zero
            dr[mr] = 1           # tags row as deleted
            sizeleft[0] -= 1

    score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
    dr = compress(dr==0, range(a.shape[0]))  # gives return format of
    dc = compress(dc==0, range(a.shape[1]))  # optimrcs posted by Bengt

    return score, dr, dc

url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [ int(val) for val in line.split()]
      for line in urlopen(url).readlines()]
a = array(L)
start = time.time()
score, goodRows, goodCols = trim_missing(a, thresh=(0.0, 0.0))
end = time.time()
print end-start
print 'Original size', a.shape
print 'Number of non-zeros in', sum(ravel(a))
print 'Reduced size', len(goodRows), len(goodCols)
print 'Score', score

Jul 18 '05 #15
On Mon, 07 Jul 2003 11:00:11 -0500, John Hunter <jd******@ace.bsd.uchicago.edu> wrote:
>> "Jon" == Jon <wr****@esrf.fr> writes:

Jon> Sadly I can only give you a method which is certainly not
Jon> optimal, but at least finds something which is probably not
Jon> too bad, and fairly quickly. Uses the Numeric module to speed
Jon> up, so that a few thousand by a few hundred is quickly
Jon> computable. I'm curious to know if it does much better or
Jon> worse than other algorithms.

Hi Jon, thanks for the suggestion. For my data set, your solution
performed a little better size-wise than my poor man's approach, which
I gave in the original post. And it is certainly fast enough. I
liked your idea of zero-ing out the elements when you made a decision
to delete a row or column. When I was thinking about solutions using
Numeric, I kept running up against the problem of deleting a row or
column (which entails a whole array copy so is expensive).

If anyone wants to try their algorithm on "real world data" which,
unlike the ones discussed here, have lots of correlation between
missing rows and columns that might be exploited, I have uploaded the
missing data matrix to a web server. It can be loaded with:

from urllib import urlopen
url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
L = [ [ int(val) for val in line.split()]
for line in urlopen(url).readlines()]
# L is 1072x83

The statistics on this data are interesting, referring to counts of 1's:
(not very tested)

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
count:freq -- cols with count ones occurred freq times
0:16 1:4 8:3 14:2 16:4 20:4 21:3 23:1
25:6 27:1 28:7 32:4 41:1 59:1 60:3 62:3
65:1 95:2 96:1 108:3 125:1 134:1 158:1 177:1
199:1 206:2 237:1 241:1 252:1 254:1 1061:2
count:freq -- rows with count ones occurred freq times
0:3 1:2 2:460 3:67 4:140 5:66 6:42 7:49
8:61 9:17 10:46 11:23 12:8 13:8 14:12 15:10
16:3 17:5 18:1 19:7 20:4 21:2 22:5 23:1
27:2 28:2 30:2 31:1 33:1 34:1 36:2 38:1
40:1 41:1 44:4 45:7 48:1 55:2 56:2

And sorted per descending frequency of number of cols and rows with particular counts:

[15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts -freq
nr=1072, nc=83
totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
freq:count -- cols with count ones occurred freq times
16:0 7:28 6:25 4:32 4:20 4:16 4:1 3:108
3:62 3:60 3:21 3:8 2:1061 2:206 2:95 2:14
1:254 1:252 1:241 1:237 1:199 1:177 1:158 1:134
1:125 1:96 1:65 1:59 1:41 1:27 1:23
freq:count -- rows with count ones occurred freq times
460:2 140:4 67:3 66:5 61:8 49:7 46:10 42:6
23:11 17:9 12:14 10:15 8:13 8:12 7:45 7:19
5:22 5:17 4:44 4:20 3:16 3:0 2:56 2:55
2:36 2:30 2:28 2:27 2:21 2:1 1:48 1:41
1:40 1:38 1:34 1:33 1:31 1:23 1:18

There are two columns with 1061 ones out of a possible 1072, so they account for 2
in most of the row counts. Hence the most popular row count (2) mostly goes to 0.
Interestingly, 4(2) is 'way more popular than 3(1). This doesn't look like what
I'd expect from uniform overall probability of some percentage. Discounting the
stuck columns, no ones shouldn't be more likely than one one, since there's only
one way to get all zeroes, but there's then 81 equally probable ways to get one one.
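
A quick binomial check of that point (a sketch in modern Python; n = 81
free columns and p = 0.05 are illustrative values, not fitted to the
data):

from math import comb   # Python 3.8+

n, p = 81, 0.05
pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
print(pmf(0), pmf(1))   # ~0.016 vs ~0.067: one 1 is ~4x as likely as none

In general P(one 1)/P(no 1s) = n*p/(1-p), so one 1 beats none whenever
n*p/(1-p) > 1.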

Look at a uniform 5% on 1072 x 83 (twice to show variation).
There's a distribution with tails, centered on 5% of 1072 and 83 it seems:

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
nr=1072, nc=83
totcc=4338, totrc=4338 => *100./(83*1072) => 4.88
count:freq -- cols with count ones occurred freq times
35:1 39:1 40:2 41:1 42:5 43:1 45:5 46:3
47:2 48:6 49:3 50:5 51:5 52:5 53:5 54:5
55:5 56:5 58:3 59:4 61:1 62:2 63:1 64:1
65:3 70:1 78:2
count:freq -- rows with count ones occurred freq times
0:21 1:74 2:134 3:219 4:218 5:176 6:113 7:64
8:29 9:15 10:6 11:2 12:1

[17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
nr=1072, nc=83
totcc=4394, totrc=4394 => *100./(83*1072) => 4.94
count:freq -- cols with count ones occurred freq times
37:1 40:1 41:3 43:1 44:2 45:2 46:5 47:5
48:1 49:3 50:6 51:8 52:7 53:7 54:3 55:3
56:4 57:3 58:3 59:1 60:1 61:1 62:1 63:1
64:3 65:2 68:2 69:1 70:1 72:1
count:freq -- rows with count ones occurred freq times
0:19 1:58 2:150 3:218 4:218 5:173 6:111 7:63
8:38 9:13 10:5 11:5 12:1

I think I may adapt your algorithm to terminate when the percent
missing falls below some threshold, removing the requirement that
*all* missing values be removed. This will drop the worst offenders
observation or variable wise. Then I can use a regression approach to
fill in the remaining ones.

Jon> Interesting problem!

Yes. So many, so little time ... ;-/

I agree. There are lots of extensions to the original formulation
that are relevant to the missing value problem. As we have seen in
the posts here, there are good exact solutions for smallish matrices,
but for larger ones a statistical approach is required. The right
statistical solution surely depends on the correlation structure of
the matrix.

I'm pretty sure there are better exact algorithms than the brute force
one I first posted, but I haven't done a new one since, just some little
things to show things about the data, as above.
Also, it would be nice to generalize to higher dimensions (eg 3D
matrices). For my problem, I have N observations of M variables over
D days. So this is the 3D version of the 2D problem above. Your
solution is readily adaptable to this case. More generally, you might
want to impose extra requirements, like, no days can be dropped, or at
most 10% of the days can be dropped, or if you want to compute changes
in variables over days, that no days can be dropped.

I'm wondering if you might not have other considerations about which
data points to drop too. E.g., if an observation makes the difference
between being able to invert a matrix or not, you might opt to drop
several other things instead, even though in terms of count the set
would be less optimum. (Curiousness: what do your observations represent?)
The more general formulation is something like

Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
matrix C, what is the largest submatrix of X of dimensions (N1, M1,
P1, ...) such that the fraction of remaining dimensions satisfies

(N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ..)

and the total percent true <= p.

The original problem is a special case of this where alpha=0, beta=0,
p=0 and C unspecified.

Anyone looking for a dissertation project :-) ?

John Hunter


Regards,
Bengt Richter
Jul 18 '05 #16
>>>>> "Bengt" == Bengt Richter <bo**@oz.net> writes:

Bengt> Curiousness: what do your observations represent?

By day, I do research in neurology and neurobiology. By night, stock
market analysis: academia doesn't pay! I have written several modules
to parse the Yahoo Finance pages, eg,
http://biz.yahoo.com/z/a/x/xoma.html and
http://biz.yahoo.com/p/x/xoma.html. The 2 fields which you identified
as missing on all but 2 observations are Revenue Estimates: "Year Ago
Sales" and "Sales Growth" for This Year, both of which are N/A for
almost all tickers.

Jon> Interesting problem!
Bengt> Yes. So many, so little time ... ;-/

Exactly.

JDH

Jul 18 '05 #17
>>>>> "John" == John Hunter <jd******@ace.bsd.uchicago.edu> writes:
John> Using this approach, I was able to find a 893x67 submatrix
John> with a score of 59831 versus the 462x81 submatrix with a
John> score of 37422.

Onwards and upwards.... Since this is beginning to look like a
gradient descent algorithm, it is natural to worry about local minima.
So I followed the lead of simulated annealing algorithms and added a
little noise to the step where the decision to delete the row or
column is made, eg,

from MLab import randn

def trim_missing(a, thresh=(0.05,0.05), sigma=0.005):
    ...snip...
    if fracRow-vthresh+sigma*randn()<fracCol-othresh:
        # delete the column
    else:
        # delete the row

Repeatedly stepping over a variety of sigmas in search of a minimum
has yielded minor improvement for the dataset of interest. Best
results:

Deterministic: 59831 893 67
Random: 59928 908 66 for sigma=0.0017

For random arrays the results are more striking. For random matrices
500x500, comparing the scores of the deterministic (sigma=0) versus
random (best sigma in [0,0.01]) searches:

% ones   mean determ score   mean random score   mean(ratio)   std(ratio)
  1%        41753.2              42864.6             1.03         0.0093
  5%         5525.8               6174.2             1.12         0.06
 10%         1968.3               2193.9             1.12         0.05

For a matrix of the same number of total elements, but unbalanced in
the dimensions, eg, 1000x250, the results are more pronounced, in
favor of the random search algorithm

% ones   mean determ score   mean random score   mean(ratio)   std(ratio)
  1%        48724.0              51451.8             1.06         0.018
  5%         6340.5               7542.3             1.19         0.038
 10%         2092.5               2458.6             1.180        0.091

Below is a script that performs a random search of a random matrix for
a variety of sigmas.

JDH

from __future__ import division
import sys, time
from urllib import urlopen
from Numeric import *
from MLab import randn, rand, mean, std

def trim_missing(a, thresh=(0.05,0.05), sigma=0.005):
    """
    a is a numObservations by numVariables boolean matrix

    thresh = (othresh, vthresh) are the observation and variable
    thresholds that trigger deletion

        remove any rows where percent missing vars > vthresh
        remove any cols where percent missing obs > othresh

    This is done iteratively, rechecking the percent missing on each
    iteration of row or col removal, with the largest offender removed
    on each iteration

    return value is score, rowInd, colInd where the two ind arrays
    are the indices of the rows and cols that were kept
    """
    othresh, vthresh = thresh    # thresholds for deletion
    dr = zeros(a.shape[0])       # Arrays to note deleted rows...
    dc = zeros(a.shape[1])       # ... and columns
    sizeleft = list(a.shape)     # Remaining dimensions
    sr = sum(a,1)                # Row scores = ij's to remove
    sc = sum(a,0)                # Column scores
    while 1:                     # Keep deleting till none left
        mr = argmax(sr)          # index highest row score
        mc = argmax(sc)          # index highest column score
        fracRow = sr[mr]/sizeleft[1]
        fracCol = sc[mc]/sizeleft[0]

        if fracRow<=vthresh and fracCol<=othresh:
            break                # stop when missing criteria are satisfied
        if fracRow-vthresh+sigma*randn()<fracCol-othresh:
            sr = sr-a[:,mc]      # rows reduced by contents of column
            sc[mc] = 0           # sum of column now zero
            dc[mc] = 1           # tags column as deleted
            sizeleft[1] -= 1
        else:
            # deleting the row
            sc = sc-a[mr,:]      # columns reduced by contents of row
            sr[mr] = 0           # sum of row is now zero
            dr[mr] = 1           # tags row as deleted
            sizeleft[0] -= 1

    score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
    dr = compress(dr==0, range(a.shape[0]))  # gives return format of
    dc = compress(dc==0, range(a.shape[1]))  # optimrcs posted by Bengt

    return score, dr, dc

results = []
for trial in range(10):
    a = rand(1000,250)
    a = where(a>0.90,1,0)

    score0, goodRows, goodCols = trim_missing(a, thresh=(0,0), sigma=0)
    maxScore = 0, score0, len(goodRows), len(goodCols)
    print 'Deterministic:', score0, len(goodRows), len(goodCols)
    for sigma in arange(0.0001, 0.01, 0.001):
        for i in range(10):
            score, goodRows, goodCols = trim_missing(
                a, thresh=(0,0), sigma=sigma)
            if score>maxScore[1]:
                #print '\t', sigma, score, len(goodRows), len(goodCols)
                maxScore = sigma, score, len(goodRows), len(goodCols)
    print '\t', maxScore

    results.append((score0, maxScore[1], maxScore[1]/score0))

results = array(results)
print mean(results)
print std(results)
Jul 18 '05 #18
