help in algorithm

Paolino

I have a self-organizing net whose aim is to cluster words.
Let's say the clustering is based on their sets of 2-grams.
Words are then instances of this class.

class clusterable(str):
    def __abs__(self):  # the set of q-grams (to be calculated only once)
        return set([(self+self[0])[n:n+2] for n in range(len(self))])
    def __sub__(self, other):  # the q-grams distance between 2 words
        set1 = abs(self)
        set2 = abs(other)
        return len(set1|set2) - len(set1&set2)

I'm looking for the medium of a set of words, that is, the word which
minimizes the sum of the distances from those words.

Aka: sum([medium-word for word in words])

Thanks for ideas, Paolino


Aug 10 '05 #1


gene tani

this sounds like LSI / singular value decomposition (?)

http://javelina.cet.middlebury.edu/l...xplanation.htm

Aug 10 '05 #2

Tom Anderson

On Wed, 10 Aug 2005, Paolino wrote:

> I have a self organizing net which aim is clustering words. Let's think
> the clustering is about their 2-grams set. Words then are instances of
> this class.
>
> class clusterable(str):
>     def __abs__(self):# the set of q-grams (to be calculated only once)
>         return set([(self+self[0])[n:n+2] for n in range(len(self))])
>     def __sub__(self,other): # the q-grams distance between 2 words
>         set1=abs(self)
>         set2=abs(other)
>         return len(set1|set2)-len(set1&set2)

Firstly:

- What do you mean by "to be calculated only once"? The code in __abs__
will run every time anyone calls abs() on the object. Do you mean that
clients should avoid calling abs more than once? If so, how about
memoising the function, or computing the 2-gram set up front, so clients
don't need to worry about it?

- Could i suggest frozenset instead of set, since the 2-gram set of a
string can't change?

- How about making the last line "return len(set1 ^ set2)"?
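
The three suggestions above can be combined into one small sketch. This is only an illustration, not the original poster's code: the class name `Clusterable`, the `_digrams` attribute, and the use of `__new__` are assumptions, and it is written for current Python 3 rather than the 2.3/2.4 interpreters discussed in the thread. It also assumes an empty word should simply have an empty 2-gram set.

```python
class Clusterable(str):
    """A word that computes its 2-gram set once, up front (hypothetical variant)."""

    def __new__(cls, value):
        self = super().__new__(cls, value)
        # Wrap around to the first character, as the original class does.
        # self[:1] instead of self[0] avoids an IndexError on the empty string.
        padded = self + self[:1]
        # frozenset: the 2-gram set of an immutable string never changes.
        self._digrams = frozenset(padded[n:n + 2] for n in range(len(self)))
        return self

    def __abs__(self):
        # No recomputation: clients can call abs() as often as they like.
        return self._digrams

    def __sub__(self, other):
        # Symmetric difference: same value as len(a | b) - len(a & b).
        return len(abs(self) ^ abs(other))


print(Clusterable("abc") - Clusterable("abd"))
```

Because the set is built in `__new__`, callers no longer have to care how often `abs()` is invoked.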

> I'm looking for the medium of a set of words, as the word which
> minimizes the sum of the distances from those words.

I think i understand. Does the word have to be drawn from the set of words
you're looking at? You can do that straightforwardly like this:

def distance(w, ws):
    return sum([w - x for x in ws])

def medium(ws):
    return min([(distance(w, ws), w) for w in ws])[1]

However, this is not terribly efficient - it's O(N**2) if you're counting
calls to __sub__.

If you want a more efficient algorithm, well, that's tricky. Luckily, i am
one of the most brilliant hackers alive, so here is an O(N) solution:

def distance_(w, counts, h, n):
    "Returns the total distance from the word to the words in the set; the set is specified by its digram counts, horizon and size."
    # counts.get: a digram that appears in no word of the set counts as 0,
    # so w does not have to be a member of the set.
    return h + sum([(n - (2 * counts.get(digram, 0))) for digram in abs(w)])

def horizon(counts):
    "Returns the sum of all digram counts, i.e. the total size of the words' digram sets."
    return sum(counts.values())

def countdigrams(ws):
    "Returns a map from digram to the number of words in which that digram appears."
    counts = {}
    for w in ws:
        for digram in abs(w):
            counts[digram] = counts.get(digram, 0) + 1
    return counts

def distance(w, ws):
    "Returns the total distance from the word to the words in the set."
    counts = countdigrams(ws)
    return distance_(w, counts, horizon(counts), len(ws))

def medium(ws):
    "Returns the word in the set with the least total distance to the other words."
    counts = countdigrams(ws)
    h = horizon(counts)
    n = len(ws)
    return min([(distance_(w, counts, h, n), w) for w in ws])[1]

Note that this code calls abs a lot, so you'll want to memoise it. Also,
all of those list comprehensions could be replaced by generator
expressions, which would probably be faster - they certainly wouldn't
allocate as much memory; i'm on 2.3 at the moment, so i don't have
genexps.

I am ashamed to admit that i don't really understand how this code works.
I had a flash of insight into how the problem could be solved, wrote the
skeleton, then set to the details; by the time i'd finished with the
details, i'd forgotten the fundamental idea! I think it's something like
using the counts to represent the ensemble properties of the population of
words, which means measuring the total distance for each word is O(1).
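
One way to see why the counting version gives the same totals, offered as a reading of the code rather than as the author's own explanation: for two sets, len(A ^ B) = len(A) + len(B) - 2 * len(A & B). Summing that over the n words x in the set gives n * len(abs(w)) + h - 2 * (sum of counts[d] for d in abs(w)), where h is the sum of all digram-set sizes, which equals the sum of the counts. Regrouping per digram of w yields h + sum(n - 2 * counts[d]). The sketch below checks this numerically; the helper names `digrams`, `brute_force_total` and `counted_total` are made up for the illustration and plain strings are used instead of the `clusterable` class.

```python
def digrams(word):
    """2-grams of a word, wrapping around to its first character as in the thread."""
    padded = word + word[:1]
    return frozenset(padded[n:n + 2] for n in range(len(word)))


def brute_force_total(w, ws):
    """Sum of pairwise 2-gram distances from w to every word in ws."""
    return sum(len(digrams(w) ^ digrams(x)) for x in ws)


def counted_total(w, ws):
    """Same total, computed from the per-digram counts of the whole set."""
    counts = {}
    for x in ws:
        for d in digrams(x):
            counts[d] = counts.get(d, 0) + 1
    h = sum(counts.values())  # equals the sum of len(digrams(x)) over ws
    n = len(ws)
    return h + sum(n - 2 * counts.get(d, 0) for d in digrams(w))


words = ["banana", "bandana", "cabana", "canal"]
for w in words + ["outsider"]:  # the identity also holds for a word outside the set
    assert brute_force_total(w, words) == counted_total(w, words)
```

Once the counts are built, each evaluation costs one pass over the digrams of w, so it is proportional to the word's length rather than strictly constant.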

> Aka:sum([medium-word for word in words])

I have no idea what you're trying to do here!

tom

--
I sometimes think that the IETF is one of the crown jewels in all of
western civilization. -- Tim O'Reilly

Aug 11 '05 #3

Bengt Richter

On Wed, 10 Aug 2005 16:51:55 +0200, Paolino <pa*************@tiscali.it> wrote:

> I have a self organizing net which aim is clustering words.
> Let's think the clustering is about their 2-grams set.
> Words then are instances of this class.
>
> class clusterable(str):
>     def __abs__(self):# the set of q-grams (to be calculated only once)
>         return set([(self+self[0])[n:n+2] for n in range(len(self))])
>     def __sub__(self,other): # the q-grams distance between 2 words
>         set1=abs(self)
>         set2=abs(other)
>         return len(set1|set2)-len(set1&set2)
>
> I'm looking for the medium of a set of words, as the word which
> minimizes the sum of the distances from those words.
>
> Aka:sum([medium-word for word in words])
> Thanks for ideas, Paolino

Just wondering if this is a desired result:

    >>> clusterable('banana')-clusterable('bananana')
    0

i.e., resulting from

    >>> abs(clusterable('banana'))-abs(clusterable('bananana'))
    set([])
    >>> abs(clusterable('banana'))
    set(['na', 'ab', 'ba', 'an'])
    >>> abs(clusterable('bananana'))
    set(['na', 'ab', 'ba', 'an'])

Regards,
Bengt Richter

Aug 11 '05 #4

Paolino

Bengt Richter wrote:

> Just wondering if this is a desired result:
>
>     >>> clusterable('banana')-clusterable('bananana')
>     0

Yes, the clustering is the main filter; it's good (I hope) for cutting the
space of words down by one or two orders of magnitude.
Final choices must be made with the expensive Levenshtein distance, or
another edit-type distance.
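
A minimal sketch of that two-stage idea, under stated assumptions: the cheap 2-gram distance prunes the vocabulary, and the Levenshtein distance ranks what is left. The threshold `max_qgram_distance` is a hypothetical stand-in for "belongs to the same cluster" (the actual filter in this thread is the self-organizing net), and all function names are invented for the example.

```python
def digrams(word):
    """2-grams of a word, wrapping around to its first character as in the thread."""
    padded = word + word[:1]
    return frozenset(padded[n:n + 2] for n in range(len(word)))


def qgram_distance(a, b):
    """Cheap distance: size of the symmetric difference of the 2-gram sets."""
    return len(digrams(a) ^ digrams(b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def closest(query, vocabulary, max_qgram_distance=4):
    """Prune with the 2-gram distance, then pick the best survivor by edit distance."""
    candidates = [w for w in vocabulary
                  if qgram_distance(query, w) <= max_qgram_distance]
    if not candidates:
        return None
    return min(candidates, key=lambda w: levenshtein(query, w))


print(closest("banana", ["bandana", "cabana", "zebra"]))
```

The expensive function only runs on the candidates that pass the cheap filter, which is the point of clustering first.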

Now I'm using an empirical solution where I suppose the best set has
length L equal to the medium of the lengths. Then I choose the first
L 2-grams from the frequency distribution of 2-grams.

I have no clue whether this is the right set, and I'm sure that the set is
not a word, as there is no chance of chaining those 2-grams to form a word.
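
The empirical rule described above might look like the sketch below. Two assumptions are baked in and may not match the actual code: "the lengths" is read as the sizes of the words' 2-gram sets, and "medium" as their arithmetic mean, rounded. For comparison, `majority_digrams` keeps every 2-gram that occurs in more than half of the words; since each 2-gram contributes to the total distance independently, that set minimizes the sum of distances when the result does not have to be a real word. All names are invented for the example.

```python
from collections import Counter


def digrams(word):
    """2-grams of a word, wrapping around to its first character as in the thread."""
    padded = word + word[:1]
    return frozenset(padded[n:n + 2] for n in range(len(word)))


def digram_counts(words):
    """Number of words in which each 2-gram appears."""
    counts = Counter()
    for w in words:
        counts.update(digrams(w))
    return counts


def top_l_digrams(words):
    """Empirical prototype: the L most frequent 2-grams, L = rounded mean set size (assumed)."""
    if not words:
        return frozenset()
    mean_size = sum(len(digrams(w)) for w in words) / len(words)
    L = int(round(mean_size))
    return frozenset(d for d, _ in digram_counts(words).most_common(L))


def majority_digrams(words):
    """2-grams present in more than half of the words."""
    counts = digram_counts(words)
    return frozenset(d for d, c in counts.items() if 2 * c > len(words))


def total_distance(digram_set, words):
    """Sum of 2-gram distances from a candidate set to every word."""
    return sum(len(digram_set ^ digrams(w)) for w in words)


words = ["banana", "bandana", "cabana"]
print(total_distance(top_l_digrams(words), words),
      total_distance(majority_digrams(words), words))
```

Neither set is guaranteed to correspond to a word, so this only characterizes a cluster centre in 2-gram space.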

Thanks for comments

Paolino

Aug 11 '05 #5

Bill Mill

On 10 Aug 2005 12:46:08 -0700, gene tani <ge*******@gmail.com> wrote:

> this sounds like LSI / singular value decomposition (?)

Why do you think so? I don't see it, but you might see something I
don't. LSI can be used to cluster things, but I see no reason to
believe that he's using LSI for his clustering.

I ask because I've done some LSI [1], and could help him out with that
if he is doing it.

While I'm on the subject, is there any general interest in my python LSI code?

[1] http://llimllib.f2o.org/files/lsi_paper.pdf

Peace
Bill Mill

Aug 11 '05 #6
