How to Calculate Article Uniqueness?

290 100+
When publishing a new article to our website, we want to ensure it is unique,
or at least unique within our own website.

Of course, there are always similar articles around, so there are software tools that calculate the uniqueness of an article.

But how do they usually calculate it, and in particular, how does Google calculate it?

Here is one approach that I came up with, based on what I have read.

1) From the new article, take the first four-word group (the 1st, 2nd, 3rd and 4th words) and see if it occurs in the comparison article.
If it is found in the comparison article, record it as one hit.

2) Then take the next four-word group (the 2nd, 3rd, 4th and 5th words) and see if it occurs in the comparison article.
If it is found in the comparison article, record it as another hit.

3) Continue through the whole article.
In a 1000-word article there will be 997 four-word groups (1000 - 4 + 1).

If there are 120 hits, then the similarity percentage
would be 120/997 * 100 ≈ 12%

and the Article Uniqueness ≈ 88%
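
The counting scheme above can be sketched in a few lines of Python. This is only an illustration of the hit-counting idea, not how Google or any commercial checker works. The function names `word_groups` and `uniqueness` are made up for this example, and the tokenizer (lowercase, letters, digits and apostrophes only) is an assumption.

```python
import re


def word_groups(text, n=4):
    """Return every run of n consecutive words in text, in order."""
    # Assumed tokenizer: lowercase, keep letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text of W words yields W - n + 1 groups (none if W < n).
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def uniqueness(new_article, comparison_article, n=4):
    """Percentage of n-word groups in new_article NOT found in comparison_article."""
    new_groups = word_groups(new_article, n)
    if not new_groups:
        return 100.0  # too short to contain a single group
    known = set(word_groups(comparison_article, n))  # set gives fast lookups
    hits = sum(1 for group in new_groups if group in known)
    return 100.0 * (1 - hits / len(new_groups))


if __name__ == "__main__":
    new = "the cat sat on the mat today"
    old = "yesterday the cat sat on the mat again"
    print(uniqueness(new, old))  # 3 of 4 groups are hits
```

Passing `n=3` tries three-word groups instead. Shorter groups produce more chance hits between unrelated texts, longer groups fewer, so the right value depends on how strict the check should be.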

Is this similar to Google's calculation?

Or does Google just compare every word in the article?

Maybe a three-word group would be better?

Any thoughts?

Thanks.
Oct 12 '13 #1
3 1609
Nepomuk
3,112 Expert 2GB
There are many different ways to compare texts, and many scientific articles have been written on the topic. Just have a look at this Google Scholar search. There are probably some articles there that would interest you.
From the (very limited) research I've done, a very common approach seems to be the one described in An O(ND) Difference Algorithm and Its Variations by Eugene Myers. Maybe have a look at that.
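
For a rough feel of what that paper computes, here is a minimal Python sketch of its basic greedy pass, applied to lists of words. It returns only the length of the shortest edit script (the number of inserted plus deleted words), not the script itself, and it leaves out the paper's space-saving refinements. The function name `edit_script_length` is made up for this example.

```python
def edit_script_length(a, b):
    """Smallest number of single-item insertions and deletions turning a into b.

    a and b can be any sequences (lists of words, strings, ...).
    """
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # Extend from whichever neighbouring diagonal reaches further.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: take an item from b (insertion)
            else:
                x = v[k - 1] + 1    # step right: drop an item from a (deletion)
            y = x - k
            # Follow the run of matching items for free.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d


if __name__ == "__main__":
    old = "the quick brown fox".split()
    new = "the slow brown fox".split()
    print(edit_script_length(old, new))  # one word deleted, one inserted
```

A small edit count relative to the combined length of the two texts means they are close to each other. How to turn that into a percentage is a separate choice.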
Oct 12 '13 #2
jeddiki
290 100+
Hmm, I read some of them but found them
a bit too theoretical. But thanks anyway.

I was hoping for something a bit more down-to-earth,
something I can write into my code, similar to the above.

...
Oct 17 '13 #3
Nepomuk
3,112 Expert 2GB
It all depends on how accurate you want your results to be. Your question was how Google and others do it; well, they use algorithms based on research such as the articles I linked to. And though comparing word groups may be part of that, it is far from everything.

There are tools that work similarly to what you want; for example, diff compares two files and tells you which lines have been changed, moved, etc. You could of course implement something similar based on words rather than lines. GNU diff is open source and available here, so you can look at what it does. Of course, you could always hack together a solution that actually uses such a diff tool, or maybe there's a diff library available for the language you're using.
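
As one example of the library route, Python ships a `difflib` module in its standard library. The sketch below compares two texts word by word with `difflib.SequenceMatcher`. Note that `SequenceMatcher` uses its own matching heuristic, which is not the same algorithm as GNU diff, so treat this as a stand-in rather than a port. The helper name `word_diff` is made up for this example, and splitting on whitespace is an assumption.

```python
import difflib


def word_diff(old_text, new_text):
    """Compare two texts word by word.

    Returns (ratio, changes): ratio is a similarity score between 0.0 and 1.0,
    changes lists every non-matching region as (tag, old_words, new_words).
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off the heuristic that ignores very frequent items.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return matcher.ratio(), changes


if __name__ == "__main__":
    ratio, changes = word_diff("the quick brown fox", "the slow brown fox")
    print(ratio)    # 2 * matched words / total words in both texts
    print(changes)
```

Unlike the four-word-group count, this looks at the order of the words across the whole text, so a moved paragraph shows up as a deletion in one place and an insertion in another.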
Oct 17 '13 #4
