# Programmatically finding "significant" data points

Hi all,

I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.

Here is some sample data:

    data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
            1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
            0.45, 0.35, 0.10]

In this data, some of the significant points include: data data data data data data data data .....

How do I sort through this data and pull out these points of significance?

Thanks for your help!

Erik

Nov 14 '06 #1
erikcw wrote:

> I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> ....
>
> How do I sort through this data and pull out these points of significance?

Get a book on statistics. One idea is as follows. If you expect the points to be centred around a single value, you can calculate the median or mean of the points, calculate their standard deviation (aka spread), and remove points which are more than N-times the standard deviation from the median.

Jeremy

-- Jeremy Sanders http://www.jeremysanders.net/

Nov 14 '06 #2
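The standard-deviation idea in the post above could be sketched in Python 3 roughly as follows. This is only an illustration: the function name `split_by_spread`, the `n_sigma` parameter and its default are assumptions, not something given in the thread, and the population standard deviation is an arbitrary choice.

```python
import statistics

def split_by_spread(values, n_sigma=1.5):
    """Split values into (typical, outliers) by distance from the median.

    A value counts as an outlier when it lies more than n_sigma standard
    deviations away from the median (n_sigma is a hypothetical tuning knob).
    """
    centre = statistics.median(values)
    spread = statistics.pstdev(values)  # population standard deviation
    limit = n_sigma * spread
    typical = [v for v in values if abs(v - centre) <= limit]
    outliers = [v for v in values if abs(v - centre) > limit]
    return typical, outliers

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

typical, outliers = split_by_spread(data)
print(outliers)
```

Note that this treats the list as an unordered sample, so it finds unusually large or small values, not the points where the series turns.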

"erikcw" wrote:

> I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]

silly solution:

    for i in range(1, len(data)-1):
        if data[i-1] < data[i] > data[i+1] or data[i-1] > data[i] < data[i+1]:
            print i

(the above doesn't handle the "edges", but that's easy to fix)

Nov 14 '06 #3
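For readers on Python 3, the neighbour comparison above might look like the sketch below. The function name `turning_points` and the `include_edges` flag are illustrative assumptions. Treating the first and last items as turning points is just one possible way to handle the "edges" the post mentions.

```python
def turning_points(values, include_edges=False):
    """Return indices whose value is strictly above or below both neighbours.

    With include_edges=True the first and last indices are added as well,
    which is one (assumed) way of handling the ends of the list.
    """
    found = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if prev < cur > nxt or prev > cur < nxt:
            found.append(i)
    if include_edges and len(values) >= 2:
        found = [0] + found + [len(values) - 1]
    return found

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

print(turning_points(data))
# [2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]
```

Because the comparisons are strict, a flat run of equal values is never reported as a turning point.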

erikcw wrote:

> Hi all, I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]
>
> In this data, some of the significant points include: data data data data data data data data ....
>
> How do I sort through this data and pull out these points of significance?

I think you are looking for "extrema":

    def w3(items):
        items = iter(items)
        view = None, items.next(), items.next()
        for item in items:
            view = view[1:] + (item,)
            yield view

    for i, (a, b, c) in enumerate(w3(data)):
        if a > b < c:
            print i+1, "min", b
        elif a < b > c:
            print i+1, "max", b
        else:
            print i+1, "---", b

Peter

Nov 14 '06 #5
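The sliding three-item window in the post above relies on Python 2's `items.next()`. A rough Python 3 equivalent, assuming a `collections.deque` is an acceptable substitute for the tuple slicing, could look like this. The names `windows3` and `classify` are made up for the sketch.

```python
from collections import deque

def windows3(items):
    """Yield every run of three consecutive items as a tuple."""
    view = deque(maxlen=3)  # the oldest item drops out automatically
    for item in items:
        view.append(item)
        if len(view) == 3:
            yield tuple(view)

def classify(values):
    """Return (index, label, value) for each interior point of values."""
    labels = []
    # start=1 because the middle item of the first window is values[1]
    for i, (a, b, c) in enumerate(windows3(values), start=1):
        if a > b < c:
            labels.append((i, "min", b))
        elif a < b > c:
            labels.append((i, "max", b))
        else:
            labels.append((i, "---", b))
    return labels

for index, label, value in classify([0.10, 0.50, 0.60, 0.40, 0.39, 0.50]):
    print(index, label, value)
```

Since the window is built from any iterable, this variant also works on generators or file input, and it simply yields nothing when fewer than three items are available.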

If the order doesn't matter, you can sort the data and remove x * 0.5 * n values from each end, where x is the proportion of numbers you want and n is the number of data points. If you have too many similar values though, this falls down. I suggest you check out quantiles in a good statistics book.

Alan.

Peter Otten wrote:

> I think you are looking for "extrema": [...]

Nov 14 '06 #6
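One possible reading of the sort-and-trim suggestion above, sketched in Python 3: sort a copy of the data and take `x * 0.5 * n` values from each end. The function name `extreme_tails` and the decision to round the count down are assumptions made for the example.

```python
def extreme_tails(values, proportion):
    """Return (lowest, highest): the values in each tail of the sorted data.

    proportion is the total share of the data to pull out, so half of it
    is taken from each end. The count is rounded down (an assumption).
    """
    ordered = sorted(values)
    k = int(proportion * 0.5 * len(ordered))  # items taken from each end
    if k == 0:
        return [], []
    return ordered[:k], ordered[-k:]

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

lowest, highest = extreme_tails(data, 0.25)
print(lowest, highest)
```

Sorting discards the positions of the values, so this answers "which values are extreme overall" and says nothing about where the series changes direction.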

erikcw wrote:

> I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.

In calculus, you identify high and low points by looking where the derivative changes its sign. When working with discrete samples, you can look at the sign changes in finite differences:

    >>> data = [...]
    >>> diff = [data[i + 1] - data[i] for i in range(len(data) - 1)]
    >>> map(str, diff)
    ['0.4', '0.1', '-0.2', '-0.01', '0.11', '0.5', '-0.2', '-0.2', '0.6', '-0.1', '0.2', '0.1', '0.1', '-0.45', '0.15', '-0.3', '-0.2', '0.1', '-0.4', '0.05', '-0.1', '-0.25']

The high points are those where diff changes from + to -, and the low points are those where diff changes from - to +.

HTH,

-- Roberto Bonvallet

Nov 14 '06 #7
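The sign-change rule described above can be turned into a small Python 3 function that separates highs from lows. This is a sketch, and the name `highs_and_lows` is an assumption. Index `i` in the result refers to `values[i]`, the point sitting between `diff[i - 1]` and `diff[i]`.

```python
def highs_and_lows(values):
    """Classify interior points by the sign change of the finite differences."""
    diff = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    highs, lows = [], []
    for i in range(1, len(diff)):
        if diff[i - 1] > 0 and diff[i] < 0:
            highs.append(i)  # rising into values[i], falling out of it
        elif diff[i - 1] < 0 and diff[i] > 0:
            lows.append(i)   # falling into values[i], rising out of it
    return highs, lows

data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

highs, lows = highs_and_lows(data)
print(highs)  # [2, 6, 9, 13, 15, 18, 20]
print(lows)   # [4, 8, 10, 14, 17, 19]
```

A difference of exactly zero (two equal neighbours) matches neither branch, so plateaus are skipped in this version.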

Jeremy Sanders wrote:

>> How do I sort through this data and pull out these points of significance?
>
> Get a book on statistics. One idea is as follows. If you expect the points to be centred around a single value, you can calculate the median or mean of the points, calculate their standard deviation (aka spread), and remove points which are more than N-times the standard deviation from the median.

Standard deviation was the first thought that jumped to my mind too. However, that's not what the OP is after. He seems to be looking for points where the direction changes.

Ganesan

-- Ganesan Rajagopal

Nov 14 '06 #8

erikcw wrote:

> Hi all, I have a collection of ordered numerical data in a list.

Called a "time series" in statistics.

> The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]
>
> In this data, some of the significant points include: data data data data data data data data ....
>
> How do I sort through this data and pull out these points of significance?

The best place to ask about an algorithm for this is not comp.lang.python -- maybe sci.stat.math would be better. Once you have an algorithm, coding it in Python should not be difficult. I'd suggest using the NumPy array rather than the native Python list, which is not designed for crunching numbers.

Nov 14 '06 #10

erikcw wrote:

> Hi all, I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]
>
> In this data, some of the significant points include: data data data data data data data data ....
>
> How do I sort through this data and pull out these points of significance?

It's obviously a kind of time series and you are searching for a "moving_max(data,t,window)>data(t)" / "moving_min(data,t,window)
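The post above is cut off, so its exact condition is unknown. One plausible interpretation of a moving-window test, offered purely as an assumption, is to call a point significant when it is the largest or smallest value within `window` positions on either side. The function name `window_extrema` and the default window size are hypothetical.

```python
def window_extrema(values, window=2):
    """Return (highs, lows): indices that are the max or min of their window.

    The window covers up to `window` items on each side of the point and is
    clipped at the ends of the list.
    """
    highs, lows = [], []
    for t, v in enumerate(values):
        start = max(0, t - window)
        neighbourhood = values[start:t + window + 1]
        if v == max(neighbourhood):
            highs.append(t)
        elif v == min(neighbourhood):
            lows.append(t)
    return highs, lows

sample = [1, 5, 2, 3, 9, 4, 0]
print(window_extrema(sample, window=1))  # ([1, 4], [0, 2, 6])
print(window_extrema(sample, window=2))  # ([1, 4], [0, 6])
```

A wider window keeps fewer points, which is one way to make "significant" stricter than a plain direction change. Ties are reported for every tied index in this sketch, and a run of equal values counts as highs.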

"robert" wrote:

> Hi all, I have a collection of ordered numerical data in a list. The numbers when plotted on a line chart make a low-high-low-high-high-low (random) pattern. I need an algorithm to extract the "significant" high and low points from this data.
>
> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]
>
> In this data, some of the significant points include: data data data data data data data data ....
>
> How do I sort through this data and pull out these points of significance?

Using zip and map, it's easy to compute first and second derivatives of a time series of values. The first lambda computes

Nov 19 '06 #12

.... dang touchy keyboard!

> Here is some sample data: data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20, 1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40, 0.45, 0.35, 0.10]
>
> In this data, some of the significant points include: data data data data data data data data

Using the first derivative, and looking for sign changes, finds many of the values you marked as "significant".

-- Paul

    data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
            1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
            0.45, 0.35, 0.10]

    delta = lambda (x1,x2) : x2-x1
    dy_dx = [0]+map(delta,zip(data,data[1:]))
    d2y_dx2 = [0]+map(delta,zip(dy_dx,dy_dx[1:]))

    sgnChange = lambda (x1,x2) : x1*x2<0
    sigs = map(sgnChange,zip(dy_dx,dy_dx[1:]))
    print [i for i,v in enumerate(sigs) if v]

    [2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]

Nov 19 '06 #13
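The code in this post uses Python 2 features that no longer exist (tuple-unpacking lambdas, `map` returning a list, the `print` statement). A Python 3 sketch of the same zip-based approach, with list comprehensions standing in for `map` and a helper named `deltas` that is not from the thread, might read:

```python
data = [0.10, 0.50, 0.60, 0.40, 0.39, 0.50, 1.00, 0.80, 0.60, 1.20,
        1.10, 1.30, 1.40, 1.50, 1.05, 1.20, 0.90, 0.70, 0.80, 0.40,
        0.45, 0.35, 0.10]

def deltas(seq):
    """Differences between consecutive items, padded with a leading 0.

    The padding keeps the result the same length as seq, so that index i
    of the differences lines up with index i of the data.
    """
    return [0] + [b - a for a, b in zip(seq, seq[1:])]

dy_dx = deltas(data)      # first differences
d2y_dx2 = deltas(dy_dx)   # second differences (computed but unused here)

# A negative product of neighbouring differences means the sign flipped.
sign_changes = [a * b < 0 for a, b in zip(dy_dx, dy_dx[1:])]
sigs = [i for i, changed in enumerate(sign_changes) if changed]
print(sigs)
# [2, 4, 6, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20]
```

The printed indices match the output quoted in the post, and each one points at the item of `data` where the series changes direction.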

