How to Calculate Article Uniqueness?


When publishing a new article to our website, we want to ensure it is unique, or at least unique within our own website.

Of course, there are always similar articles around, so there are software tools that calculate the uniqueness of your article.

But how do they usually calculate it, and in particular, how does Google calculate it?

Here is one approach that I came up with based on what I have read.

1) From the new article, take the first four-word group (the 1st, 2nd, 3rd and 4th words) and see if it occurs in the comparison article.
If it is found in the comparison article, record it as one hit.

2) Then take the next four-word group (the 2nd, 3rd, 4th and 5th words) and see if it occurs in the comparison article.
If it is found, record it as another hit.

3) Continue like this through the whole article.
In a 1000-word article there will be 997 four-word groups.

If there are 120 hits, then the similarity percentage
would be 120/997 * 100 ≈ 12 %

So the Article Uniqueness ≈ 88 %
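The steps above can be sketched in Python roughly as follows. This is only a minimal sketch of the approach described in this post, not a description of how Google or any commercial tool works. The function names `word_groups` and `uniqueness` are made up for illustration, and the choices to lowercase the text and ignore punctuation are assumptions that the post does not specify.

```python
import re


def word_groups(text, n=4):
    """Return every run of n consecutive words in text, in order.

    Assumption: words are lowercased and punctuation is ignored.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def uniqueness(new_article, comparison_article, n=4):
    """Percentage of n-word groups in new_article NOT found in comparison_article."""
    new_groups = word_groups(new_article, n)
    if not new_groups:
        # Fewer than n words: nothing to compare, so treat it as fully unique.
        return 100.0
    known = set(word_groups(comparison_article, n))  # set gives fast lookups
    hits = sum(1 for group in new_groups if group in known)
    return 100.0 * (1 - hits / len(new_groups))


if __name__ == "__main__":
    new = "the quick brown fox jumps over the lazy dog"
    old = "a quick brown fox jumps high"
    print(round(uniqueness(new, old), 1))
```

Passing `n=3` switches to three-word groups, which makes it easy to try both sizes on the same pair of articles and compare the scores.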

Is this similar to Google's calculation?

Or does Google just compare every word in the article?

Maybe a three-word group would be better?

Any thoughts?

Thanks.


Oct 12 '13 #1


Nepomuk

3,112

Expert 2GB

There are many different ways to compare texts, and many scientific articles have been written on the topic. Just have a look at this Google Scholar search. There are probably some articles there that would interest you.
From the (very limited) research I've done, a very common approach seems to be the one described in An O(ND) Difference Algorithm and Its Variations by Eugene Myers. Maybe have a look at that.

Oct 12 '13 #2

jeddiki

290

100+

Hmm, I read some of them but found them
a bit too theoretical. But thanks anyway.

I was hoping for something a bit more down-to-earth,
something I can write into my code, similar to the above.

...

Oct 17 '13 #3

Nepomuk

3,112

Expert 2GB

It all depends on how accurate you want your results to be. Your question was how Google and others do it; well, they use algorithms based on research such as the stuff I linked to. And though comparing word groups may be part of that, it is far from everything.

There are tools that work similarly to what you want; for example, diff compares two files and tells you which lines have been changed, added or removed. You could of course implement something similar based on words rather than lines. GNU diff is open source and available here, so you can look at what it does. Of course, you could always hack together a solution that actually uses such a diff tool, or maybe there's a diff library available for the language you're using.
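As one illustration of the "diff library" idea, here is a minimal sketch using Python's standard `difflib` module on words instead of lines. Note that `difflib.SequenceMatcher` uses its own matching algorithm, not the one in GNU diff, so the results are not guaranteed to match what `diff` would report. The helper names `word_similarity` and `changed_words` are made up for this example, and splitting on whitespace is an assumption.

```python
import difflib


def word_similarity(new_article, comparison_article):
    """Return a similarity ratio between 0.0 and 1.0, comparing word by word."""
    a = new_article.lower().split()
    b = comparison_article.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio()


def changed_words(new_article, comparison_article):
    """List the word runs that differ, as (operation, old_words, new_words)."""
    a = new_article.split()
    b = comparison_article.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print(word_similarity("a b c d", "a b x d"))
    print(changed_words("a b c d", "a b x d"))
```

Unlike the word-group count, this compares the two texts as ordered sequences, so it also reports where the differences are, at the cost of being slower on long articles.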

Oct 17 '13 #4
