446,143 Members | 1,919 Online
Need help? Post your question and get tips & solutions from a community of 446,143 IT Pros & Developers. It's quick & easy.

# sampling items from a nested list

 P: n/a So, I have a list of lists, where the items in each sublist are of basically the same form. It looks something like: py> data = [[('a', 0), .... ('b', 1), .... ('c', 2)], .... .... [('d', 2), .... ('e', 0)], .... .... [('f', 0), .... ('g', 2), .... ('h', 1), .... ('i', 0), .... ('j', 0)]] Now, I'd like to sample down the number of items in each sublist in the following manner. I need to count the occurrences of each 'label' (the second item in each tuple) in all the items of all the sublists, and randomly remove some items until the number of occurrences of each 'label' is equal. So, given the data above, one possible resampling would be: [[('b', 1), ('c', 2)], [('e', 0)], [('g', 2), ('h', 1), ('i', 0)]] Note that there are now only 2 examples of each label. I have code that does this, but it's a little complicated: py> import random py> def resample(data): .... # determine which indices are associated with each label .... label_indices = {} .... for i, group in enumerate(data): .... for j, (item, label) in enumerate(group): .... label_indices.setdefault(label, []).append((i, j)) .... # sample each set of indices down .... min_count = min(len(indices) .... for indices in label_indices.itervalues()) .... for label, indices in label_indices.iteritems(): .... label_indices[label] = random.sample(indices, min_count) .... # return the resampled data .... return [[(item, label) .... for j, (item, label) in enumerate(group) .... if (i, j) in label_indices[label]] .... for i, group in enumerate(data)] .... py> py> resample(data) [[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]] py> resample(data) [[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]] Can anyone see a simpler way of doing this? Steve Jul 18 '05 #1
5 Replies

 P: n/a Steven Bethard wrote: So, I have a list of lists, where the items in each sublist are of basically the same form. It looks something like: .... Can anyone see a simpler way of doing this? Steve You just make these up to keep us amused, don't you? ;-) If you don't need to preserve the ordering, would the following work?: data = [[('a', 0), ... ('b', 1), ... ('c', 2)], ... ... [('d', 2), ... ('e', 0)], ... ... [('f', 0), ... ('g', 2), ... ('h', 1), ... ('i', 0), ... ('j', 0)]] ... def resample2(data): ... bag = {} ... random.shuffle(data) ... return [[(item, label) ... for item, label in group ... if bag.setdefault(label,[]).append(item) ... or len(bag[label]) < 3] ... for group in data if not random.shuffle(group)] ... resample2(data) [[('a', 0), ('c', 2), ('b', 1)], [('h', 1), ('g', 2), ('i', 0)], []] resample2(data) [[('h', 1), ('f', 0), ('j', 0), ('g', 2)], [('b', 1), ('c', 2)], []] resample2(data) [[('e', 0), ('d', 2)], [('i', 0), ('h', 1), ('g', 2)], [('b', 1)]] Michael Jul 18 '05 #2

 P: n/a Michael Spencer wrote: Steven Bethard wrote: So, I have a list of lists, where the items in each sublist are of basically the same form. It looks something like: ... Can anyone see a simpler way of doing this? Steve You just make these up to keep us amused, don't you? ;-) Heh heh. I wish. It's actually about resampling data read in the Yamcha data format: http://chasen.org/~taku/software/yamcha/ So each sublist is a "sentence" and each tuple is the feature vector for a "word". The point is to even out the number of positive and negative examples because support vector machines typically work better with balanced data sets. If you don't need to preserve the ordering, would the following work?: [snip] >>> def resample2(data): ... bag = {} ... random.shuffle(data) ... return [[(item, label) ... for item, label in group ... if bag.setdefault(label,[]).append(item) ... or len(bag[label]) < 3] ... for group in data if not random.shuffle(group)] It would be preferable to preserve ordering, but it's not absolutely crucial. Thanks for the suggestion! STeVe Jul 18 '05 #3

 P: n/a Michael Spencer wrote: >>> def resample2(data): ... bag = {} ... random.shuffle(data) ... return [[(item, label) ... for item, label in group ... if bag.setdefault(label,[]).append(item) ... or len(bag[label]) < 3] ... for group in data if not ....which failed to calculate the minimum count of labels, try this instead (while I was at it, I removed the insance LC) def resample3(data): ... bag = {} ... sample = [] ... labels = [label for group in data for item, label in group] ... min_count = min(labels.count(label) for label in set(labels)) ... random.shuffle(data) ... for subgroup in data: ... random.shuffle(subgroup) ... subgroupsample = [] ... for item, label in subgroup: ... bag.setdefault(label,[]).append(item) ... if len(bag[label]) <= min_count: ... subgroupsample.append((item,label)) ... sample.append(subgroupsample) ... return sample ... Cheers Michael Jul 18 '05 #4

 P: n/a Steven Bethard wrote: Michael Spencer wrote: Steven Bethard wrote: So, I have a list of lists, where the items in each sublist are of basically the same form. It looks something like: ... Can anyone see a simpler way of doing this? Steve You just make these up to keep us amused, don't you? ;-) Heh heh. I wish. It's actually about resampling data read in the Yamcha data format: http://chasen.org/~taku/software/yamcha/ So each sublist is a "sentence" and each tuple is the feature vector for a "word". The point is to even out the number of positive and negative examples because support vector machines typically work better with balanced data sets. If you don't need to preserve the ordering, would the following work?: [snip] >>> def resample2(data): ... bag = {} ... random.shuffle(data) ... return [[(item, label) ... for item, label in group ... if bag.setdefault(label,[]).append(item) ... or len(bag[label]) < 3] ... for group in data if not random.shuffle(group)] It would be preferable to preserve ordering, but it's not absolutely crucial. Thanks for the suggestion! STeVe Maybe combine this with a DSU pattern? Not sure whether the result would be better than what you started with Michael Jul 18 '05 #5

 P: n/a Steven Bethard wrote: py> data = [[('a', 0), ... ('b', 1), ... ('c', 2)], ... ... [('d', 2), ... ('e', 0)], ... ... [('f', 0), ... ('g', 2), ... ('h', 1), ... ('i', 0), ... ('j', 0)]] I need to count the occurrences of each 'label' (the second item in each tuple) in all the items of all the sublists, and randomly remove some items until the number of occurrences of each 'label' is equal. If the tuples are "heavier" than this, you can avoid comparing them using the following algorithm (which probably still leaves some room for optimization, e.g. simpler return_list building [or returning a generator instead of a list], or directly building the sample set instead of converting a random.sample to a set): def resample(data): counts = {} for i in data: for j in i: counts[j[1]] = counts.setdefault(j[1], 0) + 1 min_count = min(counts.itervalues()) # Same keys, so we can reuse the counts dictionary. indices = counts for label, count in counts.iteritems(): indices[label] = set(random.sample(xrange(count), min_count)) # Same thing with a generator expression, building a new dict (dunno # what's faster). #indices = dict(((label, set(random.sample(xrange(count), min_count))) # for label, count in counts.iteritems())) # "done" maps labels to the number of tuples (with that label) which # have been added to return_list. done = {} return_list = [] for i in data: return_list.append([]) for j in i: if done.setdefault(j[1], 0) in indices[j[1]]: return_list[-1].append(j) done[j[1]] += 1 return return_list -- Felix Wiemann -- http://www.ososo.de/ Jul 18 '05 #6

### This discussion thread is closed

Replies have been disabled for this discussion.