I have a self-organizing net whose aim is to cluster words.
Let's say the clustering is based on their sets of 2-grams.
Words are then instances of this class:
class clusterable(str):
    def __abs__(self):  # the set of q-grams (to be calculated only once)
        return set([(self+self[0])[n:n+2] for n in range(len(self))])
    def __sub__(self, other):  # the q-grams distance between 2 words
        set1 = abs(self)
        set2 = abs(other)
        return len(set1|set2) - len(set1&set2)
I'm looking for the medium of a set of words, meaning the word which
minimizes the sum of the distances to those words.
Aka: sum([medium - word for word in words])
Thanks for ideas, Paolino
On Wed, 10 Aug 2005, Paolino wrote:
> I have a self-organizing net whose aim is to cluster words. Let's say the
> clustering is based on their sets of 2-grams. Words are then instances of
> this class:
>
> class clusterable(str):
>     def __abs__(self):  # the set of q-grams (to be calculated only once)
>         return set([(self+self[0])[n:n+2] for n in range(len(self))])
>     def __sub__(self, other):  # the q-grams distance between 2 words
>         set1 = abs(self)
>         set2 = abs(other)
>         return len(set1|set2) - len(set1&set2)
Firstly:
- What do you mean by "to be calculated only once"? The code in __abs__
will run every time anyone calls abs() on the object. Do you mean that
clients should avoid calling abs more than once? If so, how about
memoising the function, or computing the 2-gram set up front, so clients
don't need to worry about it?
- Could i suggest frozenset instead of set, since the 2-gram set of a
string can't change?
- How about making the last line "return len(set1 ^ set2)"?
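The three suggestions above could be combined roughly as follows. This is only a sketch in present-day Python 3, not the original poster's code: the class name `Clusterable` and the cache attribute `_digrams` are made up for illustration, and `self[:1]` is used instead of `self[0]` so that the empty string does not raise an IndexError.

```python
class Clusterable(str):
    """A word that caches its 2-gram set (hypothetical variant of `clusterable`)."""

    def __abs__(self):
        # Compute the 2-gram set on first use, then reuse the cached frozenset.
        try:
            return self._digrams
        except AttributeError:
            wrapped = self + self[:1]  # wrap around to the first character
            self._digrams = frozenset(
                wrapped[n:n + 2] for n in range(len(self))
            )
            return self._digrams

    def __sub__(self, other):
        # Symmetric difference: 2-grams that occur in exactly one of the words.
        # Its size equals len(set1 | set2) - len(set1 & set2).
        return len(abs(self) ^ abs(other))
```

Both operands have to be `Clusterable` instances, since `abs()` is not defined for a plain `str`.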
> I'm looking for the medium of a set of words, meaning the word which
> minimizes the sum of the distances to those words.
I think i understand. Does the word have to be drawn from the set of words
you're looking at? You can do that straightforwardly like this:

def distance(w, ws):
    return sum([w - x for x in ws])

def medium(ws):
    return min([(distance(w, ws), w) for w in ws])[1]

However, this is not terribly efficient - it's O(N**2) if you're counting
calls to __sub__.
If you want a more efficient algorithm, well, that's tricky. Luckily, i am
one of the most brilliant hackers alive, so here is an O(N) solution:

def distance_(w, counts, h, n):
    "Returns the total distance from the word to the words in the set; the set is specified by its digram counts, horizon and size."
    # counts.get(digram, 0) rather than counts[digram], so that a word from
    # outside the set (with digrams never seen) doesn't raise a KeyError
    return h + sum([(n - (2 * counts.get(digram, 0))) for digram in abs(w)])

def horizon(counts):
    "Returns the sum of the sizes of the digram sets of all the words."
    return sum(counts.itervalues())  # Python 2; counts.values() on Python 3

def countdigrams(ws):
    "Returns a map from digram to the number of words in which that digram appears."
    counts = {}
    for w in ws:
        for digram in abs(w):
            counts[digram] = counts.get(digram, 0) + 1
    return counts

def distance(w, ws):
    "Returns the total distance from the word to the words in the set."
    counts = countdigrams(ws)
    return distance_(w, counts, horizon(counts), len(ws))

def medium(ws):
    "Returns the word in the set with the least total distance to the other words."
    counts = countdigrams(ws)
    h = horizon(counts)
    n = len(ws)
    return min([(distance_(w, counts, h, n), w) for w in ws])[1]

Note that this code calls abs a lot, so you'll want to memoise it. Also,
all of those list comprehensions could be replaced by generator
expressions, which would probably be faster - they certainly wouldn't
allocate as much memory; i'm on 2.3 at the moment, so i don't have
genexps.
I am ashamed to admit that i don't really understand how this code works.
I had a flash of insight into how the problem could be solved, wrote the
skeleton, then set to the details; by the time i'd finished with the
details, i'd forgotten the fundamental idea! I think it's something like
using the counts to represent the ensemble properties of the population of
words, which means measuring the total distance for each word is O(1).
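
One way to reconstruct the idea: for a word w with 2-gram set A(w), the distance to a word x is len(A(w) ^ A(x)) = len(A(w)) + len(A(x)) - 2 * len(A(w) & A(x)). Summed over all n words, the middle term gives the horizon h, and the last term becomes twice the sum of counts[d] over the digrams d of w, since counts[d] is exactly the number of words whose set contains d. That leaves h + the sum of (n - 2 * counts[d]) over d in A(w), which depends on w's own digrams only. The sketch below checks that identity against the brute-force sum. It is written for Python 3 with generator expressions, and the helper names (`digrams`, `total_distance_brute`, `total_distance_counts`) are made up for illustration.

```python
from collections import Counter

def digrams(word):
    """Wrap-around 2-gram set, as in the clusterable class (empty for '')."""
    wrapped = word + word[:1]
    return frozenset(wrapped[n:n + 2] for n in range(len(word)))

def total_distance_brute(w, words):
    """Sum of len(A(w) ^ A(x)) over every x: one set operation per word."""
    return sum(len(digrams(w) ^ digrams(x)) for x in words)

def total_distance_counts(w, counts, h, n):
    """The same total, computed from the ensemble statistics alone."""
    # A Counter returns 0 for digrams that never occur in the set.
    return h + sum(n - 2 * counts[d] for d in digrams(w))

def medium(words):
    """Word with the least total distance; ties go to the earliest word."""
    counts = Counter(d for x in words for d in digrams(x))
    h = sum(counts.values())  # the "horizon": sum of all digram-set sizes
    n = len(words)
    return min(words, key=lambda w: total_distance_counts(w, counts, h, n))
```

Note that `min` with a `key` breaks ties by position in the list, whereas taking the minimum of (distance, word) tuples breaks them alphabetically. The cost per word is proportional to its number of digrams, independent of n.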
> Aka: sum([medium - word for word in words])
I have no idea what you're trying to do here!
tom
--
I sometimes think that the IETF is one of the crown jewels in all of
western civilization. -- Tim O'Reilly
On Wed, 10 Aug 2005 16:51:55 +0200, Paolino <pa*************@tiscali.it> wrote:
> I have a self-organizing net whose aim is to cluster words. Let's say the
> clustering is based on their sets of 2-grams. Words are then instances of
> this class:
>
> class clusterable(str):
>     def __abs__(self):  # the set of q-grams (to be calculated only once)
>         return set([(self+self[0])[n:n+2] for n in range(len(self))])
>     def __sub__(self, other):  # the q-grams distance between 2 words
>         set1 = abs(self)
>         set2 = abs(other)
>         return len(set1|set2) - len(set1&set2)
>
> I'm looking for the medium of a set of words, meaning the word which
> minimizes the sum of the distances to those words.
> Aka: sum([medium - word for word in words])
> Thanks for ideas, Paolino
Just wondering if this is a desired result:
>>> clusterable('banana')-clusterable('bananana')
0
i.e., resulting from
>>> abs(clusterable('banana'))-abs(clusterable('bananana'))
set([])
>>> abs(clusterable('banana'))
set(['na', 'ab', 'ba', 'an'])
>>> abs(clusterable('bananana'))
set(['na', 'ab', 'ba', 'an'])
Regards,
Bengt Richter
Bengt Richter wrote:
> Just wondering if this is a desired result:
> >>> clusterable('banana')-clusterable('bananana')
> 0
Yes, the clustering is the main filter; it should (I hope) cut the
space of words down by one or two orders of magnitude.
Final choices must be made with the expensive Levenshtein distance, or
another edit-type distance.
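
A rough sketch of that two-stage approach, in Python 3: the cheap 2-gram distance discards most candidates, and the Levenshtein distance is computed only for the survivors. The function names and the threshold `max_qgram_distance` are hypothetical, chosen only for illustration; a real threshold would have to be tuned on the actual word list.

```python
def digrams(word):
    """Wrap-around 2-gram set, as in the clusterable class."""
    wrapped = word + word[:1]
    return frozenset(wrapped[n:n + 2] for n in range(len(word)))

def levenshtein(a, b):
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def closest(word, candidates, max_qgram_distance=4):
    """Filter by 2-gram distance first, then rank survivors by edit distance."""
    target = digrams(word)
    survivors = [c for c in candidates
                 if len(target ^ digrams(c)) <= max_qgram_distance]
    if not survivors:
        return None
    return min(survivors, key=lambda c: levenshtein(word, c))
```

With this arrangement 'banana' and 'bananana' pass the filter together (2-gram distance 0), and the edit distance then tells them apart.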
For now I'm using an empirical solution where I suppose the best set has
length L equal to the average of the lengths. Then I choose the first L
2-grams from the frequency distribution of 2-grams.
I have no clue whether this is the right set, and I'm sure that the set is
not a word, as there is no chance of chaining those 2-grams to form a word.
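
The empirical procedure might look like the sketch below (Python 3). It rests on assumptions not stated in the post: the average is taken as the arithmetic mean rounded to an integer, "first" is read as "most frequent", each word counts a 2-gram once, and the word list is non-empty. The name `empirical_centre` is made up.

```python
from collections import Counter

def digrams(word):
    """Wrap-around 2-gram set, as in the clusterable class."""
    wrapped = word + word[:1]
    return frozenset(wrapped[n:n + 2] for n in range(len(word)))

def empirical_centre(words):
    """The L most frequent 2-grams, with L the rounded mean word length."""
    # Assumption: "average" means the arithmetic mean of the word lengths.
    length = round(sum(len(w) for w in words) / len(words))
    # Number of words in which each 2-gram appears.
    counts = Counter(d for w in words for d in digrams(w))
    # Ties between equally frequent 2-grams are broken arbitrarily.
    return frozenset(d for d, _ in counts.most_common(length))
```

The result is a set of 2-grams, not a word, so it can only serve as a cluster centre in 2-gram space.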
Thanks for comments
Paolino
On 10 Aug 2005 12:46:08 -0700, gene tani <ge*******@gmail.com> wrote:
> this sounds like LSI / singular value decomposition (?)
Why do you think so? I don't see it, but you might see something I
don't. LSI can be used to cluster things, but I see no reason to
believe that he's using LSI for his clustering.
I ask because I've done some LSI [1], and could help him out with that
if he is doing it.
While I'm on the subject, is there any general interest in my python LSI code?
[1] http://llimllib.f2o.org/files/lsi_paper.pdf
Peace
Bill Mill