
How to Calculate Article Uniqueness?

When publishing a new article to our website, we want to ensure it is unique - at least unique to our own website.

Of course, there are always similar articles around, and there is software that calculates the uniqueness of your article.

But how is that usually calculated - and in particular, how does Google calculate it?

Here is one approach that I came up with based on what I have read.

1) From the new article, take the first four-word group (this will be the 1st, 2nd, 3rd and 4th words) and see if it occurs in the comparison article.
If it is found in the second article, record it as one hit.

2) Then take the next four-word group (this will be the 2nd, 3rd, 4th and 5th words) and see if it occurs in the comparison article.
If it is found in the second article, record it as another hit.

3) Continue like this all the way through the article.
In a 1,000-word article there will be 997 four-word groups.

If there are 120 hits, then the similarity percentage
would be 120 / 997 * 100 ≈ 12 %.

In other words, the article uniqueness would be 88 %.
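
For reference, here is a minimal Python sketch of the approach described above. The function names and the simple lower-case/whitespace tokenisation are my own assumptions - this is just one way of coding the idea, not anything Google documents.

def word_groups(words, n=4):
    # All consecutive n-word groups in a list of words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def uniqueness_percent(new_text, comparison_text, n=4):
    # Percentage of n-word groups in new_text that do NOT occur in comparison_text.
    new_words = new_text.lower().split()        # naive tokenisation: lower-case + whitespace split
    comp_words = comparison_text.lower().split()

    groups = word_groups(new_words, n)
    if not groups:                              # article shorter than n words
        return 100.0

    comp_set = set(word_groups(comp_words, n))  # set membership keeps the lookups fast
    hits = sum(1 for g in groups if g in comp_set)
    return 100.0 * (len(groups) - hits) / len(groups)

With a 1,000-word article there are 997 groups; 120 hits gives 100 * (1 - 120/997) ≈ 88 % uniqueness, matching the example above.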

Is this similar to Google's calculation?

Or does Google just compare every word in the article?

Maybe a three-word group would be better?

Any thoughts?


Thanks.


Oct 12 '13 #1
3 Replies


Nepomuk
There are many different ways to compare texts, and many scientific articles have been written about the topic. Just have a look at this Google Scholar search. There are probably some articles there that would interest you.
From the (very limited) research I've done, a very common approach seems to be the one described in An O(ND) Difference Algorithm and Its Variations by Eugene Myers. Maybe have a look at that.
Oct 12 '13 #2

Hmm - I read some of them, but found them
a bit too theoretical. But thanks anyway.

I was hoping for something a bit more down-to-earth,
something I can write into my code, similar to the above.

...
Oct 17 '13 #3

Nepomuk
It all depends on how accurate you want your results to be. Your question was how Google and others do it; well, they use algorithms based on research such as the material I linked to. And though comparing word groups may be part of that, it is far from everything.

There are tools that do work similar to what you want; for example, diff compares two files and tells you which lines have changed, been moved, etc. You could of course implement something similar, just based on words rather than lines. GNU diff is open source and available here, so you can look at what it does. Of course you could always hack together a solution that actually uses such a diff tool, or maybe there's a diff library available for the language you're using.
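
As a rough illustration of the word-based idea (not what Google or GNU diff actually do): in Python, the standard library's difflib.SequenceMatcher accepts any sequence, so you can feed it lists of words instead of lines. The sample texts below are made up for the example.

import difflib

def word_similarity(text_a, text_b):
    # Similarity ratio between 0.0 and 1.0, comparing the texts word by word.
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return difflib.SequenceMatcher(None, words_a, words_b).ratio()

article = "the quick brown fox jumps over the lazy dog"
other = "a quick brown fox leaped over a lazy dog"
print("Uniqueness: %.1f %%" % (100 * (1 - word_similarity(article, other))))

A proper diff library for your language would also tell you which word runs match and where the differences are, but the principle is the same.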
Oct 17 '13 #4
