473,473 Members | 2,169 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

sampling items from a nested list

So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:

py> data = [[('a', 0),
.... ('b', 1),
.... ('c', 2)],
....
.... [('d', 2),
.... ('e', 0)],
....
.... [('f', 0),
.... ('g', 2),
.... ('h', 1),
.... ('i', 0),
.... ('j', 0)]]

Now, I'd like to sample down the number of items in each sublist in the
following manner. I need to count the occurrences of each 'label' (the
second item in each tuple) in all the items of all the sublists, and
randomly remove some items until the number of occurrences of each
'label' is equal. So, given the data above, one possible resampling
would be:

[[('b', 1),
('c', 2)],

[('e', 0)],

[('g', 2),
('h', 1),
('i', 0)]]

Note that there are now only 2 examples of each label. I have code that
does this, but it's a little complicated:

py> import random
py> def resample(data):
.... # determine which indices are associated with each label
.... label_indices = {}
.... for i, group in enumerate(data):
.... for j, (item, label) in enumerate(group):
.... label_indices.setdefault(label, []).append((i, j))
.... # sample each set of indices down
.... min_count = min(len(indices)
.... for indices in label_indices.itervalues())
.... for label, indices in label_indices.iteritems():
.... label_indices[label] = random.sample(indices, min_count)
.... # return the resampled data
.... return [[(item, label)
.... for j, (item, label) in enumerate(group)
.... if (i, j) in label_indices[label]]
.... for i, group in enumerate(data)]
....
py>
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]

Can anyone see a simpler way of doing this?

Steve
Jul 18 '05 #1
5 1810
Steven Bethard wrote:
So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:
....
Can anyone see a simpler way of doing this?

Steve


You just make these up to keep us amused, don't you? ;-)

If you don't need to preserve the ordering, would the following work?:
data = [[('a', 0), ... ('b', 1),
... ('c', 2)],
...
... [('d', 2),
... ('e', 0)],
...
... [('f', 0),
... ('g', 2),
... ('h', 1),
... ('i', 0),
... ('j', 0)]]
... def resample2(data): ... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not random.shuffle(group)]
... resample2(data) [[('a', 0), ('c', 2), ('b', 1)], [('h', 1), ('g', 2), ('i', 0)], []] resample2(data) [[('h', 1), ('f', 0), ('j', 0), ('g', 2)], [('b', 1), ('c', 2)], []] resample2(data) [[('e', 0), ('d', 2)], [('i', 0), ('h', 1), ('g', 2)], [('b', 1)]]


Michael

Jul 18 '05 #2
Michael Spencer wrote:
Steven Bethard wrote:
So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:
...

Can anyone see a simpler way of doing this?

Steve


You just make these up to keep us amused, don't you? ;-)


Heh heh. I wish. It's actually about resampling data read in the
Yamcha data format:

http://chasen.org/~taku/software/yamcha/

So each sublist is a "sentence" and each tuple is the feature vector for
a "word". The point is to even out the number of positive and negative
examples because support vector machines typically work better with
balanced data sets.
If you don't need to preserve the ordering, would the following work?:
[snip]
>>> def resample2(data):

... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not
random.shuffle(group)]


It would be preferable to preserve ordering, but it's not absolutely
crucial. Thanks for the suggestion!

STeVe
Jul 18 '05 #3
Michael Spencer wrote:
>>> def resample2(data): ... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not


....which failed to calculate the minimum count of labels, try this instead
(while I was at it, I removed the insance LC)
def resample3(data): ... bag = {}
... sample = []
... labels = [label for group in data for item, label in group]
... min_count = min(labels.count(label) for label in set(labels))
... random.shuffle(data)
... for subgroup in data:
... random.shuffle(subgroup)
... subgroupsample = []
... for item, label in subgroup:
... bag.setdefault(label,[]).append(item)
... if len(bag[label]) <= min_count:
... subgroupsample.append((item,label))
... sample.append(subgroupsample)
... return sample
...


Cheers

Michael

Jul 18 '05 #4
Steven Bethard wrote:
Michael Spencer wrote:
Steven Bethard wrote:
So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:

...

Can anyone see a simpler way of doing this?

Steve

You just make these up to keep us amused, don't you? ;-)

Heh heh. I wish. It's actually about resampling data read in the
Yamcha data format:

http://chasen.org/~taku/software/yamcha/

So each sublist is a "sentence" and each tuple is the feature vector for
a "word". The point is to even out the number of positive and negative
examples because support vector machines typically work better with
balanced data sets.
If you don't need to preserve the ordering, would the following work?:

[snip]
>>> def resample2(data):

... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not
random.shuffle(group)]

It would be preferable to preserve ordering, but it's not absolutely
crucial. Thanks for the suggestion!

STeVe

Maybe combine this with a DSU pattern? Not sure whether the result would be
better than what you started with

Michael

Jul 18 '05 #5
Steven Bethard wrote:
py> data = [[('a', 0),
... ('b', 1),
... ('c', 2)],
...
... [('d', 2),
... ('e', 0)],
...
... [('f', 0),
... ('g', 2),
... ('h', 1),
... ('i', 0),
... ('j', 0)]]

I need to count the occurrences of each 'label' (the second item in
each tuple) in all the items of all the sublists, and randomly remove
some items until the number of occurrences of each 'label' is equal.


If the tuples are "heavier" than this, you can avoid comparing them
using the following algorithm (which probably still leaves some room for
optimization, e.g. simpler return_list building [or returning a
generator instead of a list], or directly building the sample set
instead of converting a random.sample to a set):

def resample(data):
counts = {}
for i in data:
for j in i:
counts[j[1]] = counts.setdefault(j[1], 0) + 1

min_count = min(counts.itervalues())

# Same keys, so we can reuse the counts dictionary.
indices = counts
for label, count in counts.iteritems():
indices[label] = set(random.sample(xrange(count), min_count))

# Same thing with a generator expression, building a new dict (dunno
# what's faster).
#indices = dict(((label, set(random.sample(xrange(count), min_count)))
# for label, count in counts.iteritems()))

# "done" maps labels to the number of tuples (with that label) which
# have been added to return_list.
done = {}
return_list = []
for i in data:
return_list.append([])
for j in i:
if done.setdefault(j[1], 0) in indices[j[1]]:
return_list[-1].append(j)
done[j[1]] += 1
return return_list

--
Felix Wiemann -- http://www.ososo.de/
Jul 18 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
by: Charlotte Henkle | last post by:
Hello; I'm pondering how to count the number of times an item appears in total in a nested list. For example: myList=,,] I'd like to know that a appeared three times, and b appeared twice,...
5
by: Peter Collinson | last post by:
Hi... Is there any way to style a List Item a different color and size than the <LI> in an Ordered List? I'd like a red super-script number and a dark blue text in a page's footnotes. And...
1
by: steflhermitte | last post by:
Dear cpp-ians, I want to apply a stratified sampling on an image. Following simplified example will explain my problem. The original image I with nrows and ncols is now a vector V of length...
2
by: Antoine | last post by:
I would like to construct my own list of items in a grid/ table/ item list layout but I have a problem. I want to add a sort of index row based on time, such as there might be blank values. Sure...
4
by: kenfar | last post by:
I've got a large table on db2 8.2.1 that I rarely perform runstats on. It has about 600 million rows organized in a single MDC time dimension on a non-dpf warehouse. Anyhow, we recently ran...
5
by: Brian Quinlan | last post by:
This is less a Python question and more a optimization/probability question. Imaging that you have a list of objects and there frequency in a population e.g. lst = and you want to drawn n...
6
by: deko | last post by:
How do I construct an XHTML-compliant nested unordered list? This displays correctly (both FF and IE): <ul> <li>list item</li> <li>list item</li> <li>list item</li> <ul> <li>nested list...
1
by: ib | last post by:
Is there a way that I can view the items/elements of a generic list in VS2005? For example, if I use STL std::vector, I am able to use the watch window to see each element in this container. But...
0
by: bearophileHUGS | last post by:
Alexy>But in Python it's very slow...< I'm the first one to say that CPython is slow, but almost any language is slow if you use such wrong algorithms like you do. There are many ways to solve...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
1
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
0
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
muto222
php
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.