473,416 Members | 1,727 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,416 software developers and data experts.

PHP/MySQL fulltext search relevancy formula update

The table already has a fulltext index and from there I can use the
MySQL fulltext search query to get results as well as the relevancy
score.

The problem I have is that MySQL has a default setting whereby the
minimum amount of characters is 4 for a search. Being that we're
government and full of TLA (three-letter acronyms), that is not
practical, and furthermore, the app I'm building must be fully
portable, so having MySQL tweaked everywhere this app goes is a
useless exercise.

Hence, for 3-letter words I have to fudge together a relevancy score
to match that of larger and more commonly accepted words.

Here is the function I created, calculate_relevancy(), with notes I
got from a posting by MySQL AB's Sergei Golubchik on the exact formula
(according to him):

[PHP]
/**
* Calculate the relevancy score of a search query if the "SELECT
MATCH () AGAINST ()" fulltext indexing query is unavailable
*
* @access public
* @param mixed $sql SQL query resulting in the resultset used for
calculating $qcount value
* @param object $result Resultset
* @param object $row Row of resultset to calculate relevancy
* @param mixed $keyword
* @param int $rows (conditional optional) number of rows in table -
must exist if $tableName is not passed
* @param mixed $tableName (optional) name of table to perform
counting query to calculate $rows value
* @return float $score
* @see DBConnection
* @see MySQLQuery
* @see http://groups.google.com/groups?q=mysql+fulltext+maximum+score&hl=en&lr=&se lm=9tt6u1%24sca%241%40FreeBSD.csie.NCTU.edu.tw&rnu m=1
*/
function calculate_relevancy($sql, $result, $row, $keyword, $rows = 0,
$tableName = '') {
/*---------------------------------------------------------------------------------------------------------------------------
Explanation of integer variables:

1) $w1 = intermediate weight
2) $dcount = number of times word is present in document (or query
row)
3) $sumw1 = sum of all values of $w1 found
4) $uniq = number of unique words in the document (or query row)
5) $rows = total number of rows in the table (as int parameter or
calculated via $tableName query
6) $found = number of rows in the table that contain the word in
question (@sizeof($result))
7) $qcount = number of times this word is present in the query
statement in $sql parameter
----------------------------------------------------------------------------------------------------------------------------*/
global $section;
if (!$tableName && (int)$rows === 0) return 0; // TO PREVENT DIVISION
BY ZERO
if ((int)$rows == 0 && (!class_exists('DBConnection') ||
!class_exists('MySQLQuery'))) return 0; // CLASSES MUST EXIST TO
RE-CALCULATE $rows

// INITIALIZE VARS, ARRAYS, ETC.
list($w1, $dcount, $sumw1, $uniq, $found, $qcount) = array(0, 0, 0,
0, 0, 0);
$keyword = trim(preg_replace('/[\s\t]{2,}/i', ' ', $keyword)); //
TRIM EXTRA WHITESPACE FOR MORE ACCURATE COUNT
$keywordArray = explode(' ', $keyword); // CONVERT INTO ARRAY TO
SEARCH WEIGHT FOR EACH INDIVIDUAL KEYWORD

if ((int)$rows !== 0 && $rows == $found) return 0; // TO PREVENT
DIVISION BY ZERO

// CALCULATE $uniq
$wordComparatorArray = array();
foreach (@array_keys(get_object_vars($row)) as $field) {
if ($row->$field && !in_array(strtolower($row->$field),
$wordComparatorArray)) {
array_push($wordComparatorArray, trim(strtolower($row->$field)));
$uniq++;
}
}

// CALCULATE $qcount
foreach ($keywordArray as $word)
$qcount += @substr_count($sql, strtolower($word)); // NUMBER OF
INSTANCES OF EACH KEYWORD IN SQL QUERY

// CALCULATE $sumw1, EACH VALUE OF $dcount, $w1 PER KEYWORD PARTITION
PER RESULTSET ROW
@reset($keywordArray); // RESET KEYWORD ARRAY TO ENSURE
ACCURATE SCAN
for ($i = 0; $i < @sizeof($result); $i++) { // SCAN EVERY ROW
foreach ($keywordArray as $word) { // SCAN EACH COLUMN FIELD
PER KEYWORD ARRAY OF WORDS
foreach (@array_keys(get_object_vars($row)) as $field) { // SCAN
EACH COLUMN FIELD
if (!$isSetDCount) $dcount +=
@substr_count(strtolower($row->$field), strtolower($word));
$isSetDCount = false;
}
$w1 = log((float)$dcount) + 1;
if (!$isSetSumW1) $sumw1 += $w1;
$dcount = 0;
}
list($isSetDCount, $isSetSumW1) = array(true, true);
}

// CALCULATE $found
$found = @sizeof($result);
// CALCULATE $rows IF NOT PROVIDED AS PARAMETER
if ((int)$rows === 0) {
$query = new MySQLQuery("SELECT count(*) AS rows FROM $tableName
LIMIT 1", $dbConn);
$tempTableRowCountResult = $query->getResult();
$query = null;
$dbConnObj->close();
$dbConn = null;
$dbConnObj = null;
$rows = $tempTableRowCountResult[0]->rows;
}

print_r($row); print_r("<P>");
print_r("$w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
log((float)(($rows - $found) / $found)) * $qcount<P>");
print_r(($w1 / $sumw1) . ' ' . ($uniq / (1 + 0.0115 * $uniq)) . ' ' .
(log((float)(($rows - $found) / $found))) . ' ' . $qcount . "<P>");
$score = (float)(($w1 / $sumw1) * ($uniq / (1 + 0.0115 * $uniq)) *
log((float)(($rows - $found) / $found)) * $qcount);
return $score;

}

[/PHP]

Problem is that the scores that come back, even if tested for
4+-letter words, are about 10X the size of an actual relevancy score
for 4+-letter words (for 3 letter words it is also anywhere between
7.75 and 30, which I unfortunately cannot compare against with an
actual relevancy score as a 3-letter word is too small for fulltext
query in default settings in MySQL 4.0+).

Just a heads-up, I am not sure what to do at this point.

Thanx
Phil
Jul 17 '05 #1
0 5746

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: JT | last post by:
I have a MySQL fulltext search form in place and I now want to filter the results further. I've added a few multiple select boxes on my form (ResourceType and Topic), and I'm able to build and...
2
by: Phil Powell | last post by:
Relevancy scores are normally defined by a MySQL query on a table that has a fulltext index. The rules for relevancy scoring will exclude certain words due to their being too short (minimum...
2
by: Maziar Aflatoun | last post by:
Hi, I have the following table CREATE TABLE `fulltext_sample` ( `copy` text, FULLTEXT KEY `copy` (`copy`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; INSERT INTO `fulltext_sample` VALUES...
4
by: Ka | last post by:
I install a mysql server in default installation with latin charset, but I want to use GBK(a chinese charset), so that I can store and search chinese words directly. so, I download, unpack and...
0
by: Henry Hank | last post by:
Environment: I'm setting up a database server on a Dell Poweredge 2650, dual 1.8GHZ pentium with 1GB of memory and RAID5 drives. I've installed RedHat 9, and updated the kernel to 2.4.20-19.9smp....
0
by: Phil Powell | last post by:
I have a rather complicated query with a combination of LEFT JOINs and two MATCHES where the first match is non-boolean to get the accurate score, the second to search as boolean: SELECT...
0
by: Phil Powell | last post by:
Retracing my problem leads me to believe I never successfully created fulltext indexes for MySQL 3.23.58 MyISAM tables. I went to the MySQL manual and was able - or so I thought - to create them,...
9
by: elyob | last post by:
Hi, I'm looking at storing snippets of details in MySQL about what credit cards a business excepts. Rather than have a whole column for Visa, another for Amex etc ... I am looking at having a...
2
by: John Kotuby | last post by:
Hi all, Maybe this belongs in the Full Text group but I am writing an ASP.NET application with a SQL Server 2005 backend, so I am posing the question here. I have been using fulltext search...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.