So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:
py> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]
Now, I'd like to sample down the number of items in each sublist in the
following manner. I need to count the occurrences of each 'label' (the
second item in each tuple) in all the items of all the sublists, and
randomly remove some items until the number of occurrences of each
'label' is equal. So, given the data above, one possible resampling
would be:
[[('b', 1),
  ('c', 2)],
 [('e', 0)],
 [('g', 2),
  ('h', 1),
  ('i', 0)]]
Note that there are now only 2 examples of each label. I have code that
does this, but it's a little complicated:
py> import random
py> def resample(data):
...     # determine which indices are associated with each label
...     label_indices = {}
...     for i, group in enumerate(data):
...         for j, (item, label) in enumerate(group):
...             label_indices.setdefault(label, []).append((i, j))
...     # sample each set of indices down
...     min_count = min(len(indices)
...                     for indices in label_indices.itervalues())
...     for label, indices in label_indices.iteritems():
...         label_indices[label] = random.sample(indices, min_count)
...     # return the resampled data
...     return [[(item, label)
...              for j, (item, label) in enumerate(group)
...              if (i, j) in label_indices[label]]
...             for i, group in enumerate(data)]
...
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]
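A quick way to sanity-check any resampled result is to tally the labels and confirm that every count matches. The following is only a minimal sketch: it assumes a Python version that ships collections.Counter (2.7 or 3.x), and label_counts and is_balanced are hypothetical helper names, not part of the code above.

```python
from collections import Counter

def label_counts(data):
    """Tally how often each label (second tuple element) occurs overall."""
    return Counter(label for group in data for _item, label in group)

def is_balanced(data):
    """Return True when every label occurs the same number of times."""
    return len(set(label_counts(data).values())) <= 1

# First result shown above, pasted in as a literal.
sampled = [[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
print(label_counts(sampled))
print(is_balanced(sampled))
```

Running the same check on the original data should report it as unbalanced, since label 0 occurs more often than labels 1 and 2.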
Can anyone see a simpler way of doing this?
Steve
Steven Bethard wrote:
> So, I have a list of lists, where the items in each sublist are of
> basically the same form. It looks something like:
> ...
> Can anyone see a simpler way of doing this?
>
> Steve
You just make these up to keep us amused, don't you? ;-)
If you don't need to preserve the ordering, would the following work?:
>>> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]
...
>>> def resample2(data):
...     bag = {}
...     random.shuffle(data)
...     return [[(item, label)
...              for item, label in group
...              if bag.setdefault(label, []).append(item)
...              or len(bag[label]) < 3]
...             for group in data if not random.shuffle(group)]
...
>>> resample2(data)
[[('a', 0), ('c', 2), ('b', 1)], [('h', 1), ('g', 2), ('i', 0)], []]
>>> resample2(data)
[[('h', 1), ('f', 0), ('j', 0), ('g', 2)], [('b', 1), ('c', 2)], []]
>>> resample2(data)
[[('e', 0), ('d', 2)], [('i', 0), ('h', 1), ('g', 2)], [('b', 1)]]
Michael
Michael Spencer wrote:
> Steven Bethard wrote:
>> So, I have a list of lists, where the items in each sublist are of
>> basically the same form. It looks something like:
>> ...
>> Can anyone see a simpler way of doing this?
>
> You just make these up to keep us amused, don't you? ;-)
Heh heh. I wish. It's actually about resampling data read in the
Yamcha data format: http://chasen.org/~taku/software/yamcha/
So each sublist is a "sentence" and each tuple is the feature vector for
a "word". The point is to even out the number of positive and negative
examples because support vector machines typically work better with
balanced data sets.
> If you don't need to preserve the ordering, would the following work?:
> [snip]
> >>> def resample2(data):
> ...     bag = {}
> ...     random.shuffle(data)
> ...     return [[(item, label)
> ...              for item, label in group
> ...              if bag.setdefault(label, []).append(item)
> ...              or len(bag[label]) < 3]
> ...             for group in data if not random.shuffle(group)]
It would be preferable to preserve ordering, but it's not absolutely
crucial. Thanks for the suggestion!
STeVe
Michael Spencer wrote:
> >>> def resample2(data):
> ...     bag = {}
> ...     random.shuffle(data)
> ...     return [[(item, label)
> ...              for item, label in group
> ...              if bag.setdefault(label, []).append(item)
> ...              or len(bag[label]) < 3]
> ...             for group in data if not

...which failed to calculate the minimum count of labels. Try this instead
(while I was at it, I removed the insane LC):
>>> def resample3(data):
...     bag = {}
...     sample = []
...     labels = [label for group in data for item, label in group]
...     min_count = min(labels.count(label) for label in set(labels))
...     random.shuffle(data)
...     for subgroup in data:
...         random.shuffle(subgroup)
...         subgroupsample = []
...         for item, label in subgroup:
...             bag.setdefault(label, []).append(item)
...             if len(bag[label]) <= min_count:
...                 subgroupsample.append((item, label))
...         sample.append(subgroupsample)
...     return sample
...
Cheers
Michael
Steven Bethard wrote:
> It would be preferable to preserve ordering, but it's not absolutely
> crucial. Thanks for the suggestion!
Maybe combine this with a DSU (decorate-sort-undecorate) pattern? Not sure
whether the result would be better than what you started with.
Michael
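For what it's worth, the DSU idea could look roughly like the sketch below: decorate each item with its (sublist, position) indices, sample each label down, sort on the decoration to restore the original order, then undecorate. This is written for Python 3 rather than the Python 2 used elsewhere in the thread, and resample_dsu and its rng parameter are hypothetical names introduced only for illustration.

```python
import random

def resample_dsu(data, rng=random):
    """Balance label counts while keeping the original ordering."""
    # Decorate: remember where every item came from, grouped by label.
    by_label = {}
    for i, group in enumerate(data):
        for j, (item, label) in enumerate(group):
            by_label.setdefault(label, []).append((i, j, item))
    if not by_label:
        return [[] for _ in data]

    # Sample every label down to the size of the rarest one.
    min_count = min(len(entries) for entries in by_label.values())
    kept = []
    for label, entries in by_label.items():
        for i, j, item in rng.sample(entries, min_count):
            kept.append((i, j, item, label))

    # Sort on the (i, j) decoration only, so items are never compared.
    kept.sort(key=lambda entry: entry[:2])

    # Undecorate: rebuild the sublists in their original order.
    result = [[] for _ in data]
    for i, _j, item, label in kept:
        result[i].append((item, label))
    return result

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]
print(resample_dsu(data))
```

Unlike resample2 and resample3, this version does not shuffle the input in place, so data is left untouched.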
Steven Bethard wrote:
> py> data = [[('a', 0),
> ...          ('b', 1),
> ...          ('c', 2)],
> ...
> ...         [('d', 2),
> ...          ('e', 0)],
> ...
> ...         [('f', 0),
> ...          ('g', 2),
> ...          ('h', 1),
> ...          ('i', 0),
> ...          ('j', 0)]]
>
> I need to count the occurrences of each 'label' (the second item in
> each tuple) in all the items of all the sublists, and randomly remove
> some items until the number of occurrences of each 'label' is equal.
If the tuples are "heavier" than this, you can avoid comparing them
using the following algorithm (which probably still leaves some room for
optimization, e.g. simpler return_list building [or returning a
generator instead of a list], or directly building the sample set
instead of converting a random.sample to a set):
import random

def resample(data):
    counts = {}
    for i in data:
        for j in i:
            counts[j[1]] = counts.setdefault(j[1], 0) + 1
    min_count = min(counts.itervalues())
    # Same keys, so we can reuse the counts dictionary.
    indices = counts
    for label, count in counts.iteritems():
        indices[label] = set(random.sample(xrange(count), min_count))
    # Same thing with a generator expression, building a new dict (dunno
    # what's faster).
    #indices = dict(((label, set(random.sample(xrange(count), min_count)))
    #                for label, count in counts.iteritems()))
    # "done" maps labels to the number of tuples (with that label) which
    # have been seen so far, whether or not they were added to return_list.
    done = {}
    return_list = []
    for i in data:
        return_list.append([])
        for j in i:
            if done.setdefault(j[1], 0) in indices[j[1]]:
                return_list[-1].append(j)
            done[j[1]] += 1
    return return_list
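The two optimizations mentioned above (returning a generator, and simpler bookkeeping) might be sketched as follows. This is a Python 3 adaptation rather than a drop-in replacement for the Python 2 code, so range and dict.values() stand in for xrange and itervalues(); iter_resampled and its rng parameter are hypothetical names used only for this illustration.

```python
import random
from collections import Counter

def iter_resampled(data, rng=random):
    """Yield one resampled sublist per input sublist, in input order."""
    # Count how often each label occurs across all sublists.
    counts = Counter(label for group in data for _item, label in group)
    min_count = min(counts.values(), default=0)

    # For each label, pick which of its occurrences (0, 1, 2, ...) survive.
    keep = {label: set(rng.sample(range(count), min_count))
            for label, count in counts.items()}

    # seen[label] is the number of items with that label visited so far.
    seen = Counter()
    for group in data:
        out = []
        for item, label in group:
            if seen[label] in keep[label]:
                out.append((item, label))
            seen[label] += 1
        yield out

data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]
print(list(iter_resampled(data)))
```

As in the original, only integer occurrence numbers are sampled and compared, so the tuples themselves are never hashed or tested for equality.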
--
Felix Wiemann -- http://www.ososo.de/