The table already has a fulltext index and from there I can use the
MySQL fulltext search query to get results as well as the relevancy
score.
The problem I have is that MySQL has a default setting whereby the
minimum amount of characters is 4 for a search. Being that we're
government and full of TLA (three-letter acronyms), that is not
practical, and furthermore, the app I'm building must be fully
portable, so having MySQL tweaked everywhere this app goes is a
useless exercise.
Hence, for 3-letter words I have to fudge together a relevancy score
to match that of larger and more commonly accepted words.
Here is the function I created, calculate_relevancy(), with notes I
got from a posting by MySQL AB's Sergei Golubchik on the exact formula
(according to him):
[PHP]
/**
* Calculate the relevancy score of a search query if the "SELECT
MATCH () AGAINST ()" fulltext indexing query is unavailable
*
* @access public
* @param mixed $sql SQL query resulting in the resultset used for
calculating $qcount value
* @param object $result Resultset
* @param object $row Row of resultset to calculate relevancy
* @param mixed $keyword
* @param int $rows (conditional optional) number of rows in table -
must exist if $tableName is not passed
* @param mixed $tableName (optional) name of table to perform
counting query to calculate $rows value
* @return float $score
* @see DBConnection
* @see MySQLQuery
* @see http://groups.google.com/groups?q=mysql+fulltext+maximum+score&hl=en&lr=&se lm=9tt6u1%24sca%241%40FreeBSD.csie.NCTU.edu.tw&rnu m=1
*/
function calculate_relevancy($sql, $result, $row, $keyword, $rows = 0,
$tableName = '') {
/*---------------------------------------------------------------------------------------------------------------------------
Explanation of integer variables:
1) $w1 = intermediate weight
2) $dcount = number of times word is present in document (or query
row)
3) $sumw1 = sum of all values of $w1 found
4) $uniq = number of unique words in the document (or query row)
5) $rows = total number of rows in the table (as int parameter or
calculated via $tableName query
6) $found = number of rows in the table that contain the word in
question (@sizeof($result))
7) $qcount = number of times this word is present in the query
statement in $sql parameter
----------------------------------------------------------------------------------------------------------------------------*/
global $section;
if (!$tableName && (int)$rows === 0) return 0; // TO PREVENT DIVISION
BY ZERO
if ((int)$rows == 0 && (!class_exists('DBConnection') ||
!class_exists('MySQLQuery'))) return 0; // CLASSES MUST EXIST TO
RE-CALCULATE $rows
// INITIALIZE VARS, ARRAYS, ETC.
list($w1, $dcount, $sumw1, $uniq, $found, $qcount) = array(0, 0, 0,
0, 0, 0);
$keyword = trim(preg_replace('/[\s\t]{2,}/i', ' ', $keyword)); //
TRIM EXTRA WHITESPACE FOR MORE ACCURATE COUNT
$keywordArray = explode(' ', $keyword); // CONVERT INTO ARRAY TO
SEARCH WEIGHT FOR EACH INDIVIDUAL KEYWORD
if ((int)$rows !== 0 && $rows == $found) return 0; // TO PREVENT
DIVISION BY ZERO
// CALCULATE $uniq
$wordComparatorArray = array();
foreach (@array_keys(get_object_vars($row)) as $field) {
if ($row->$field && !in_array(strtolower($row->$field),
$wordComparatorArray)) {
array_push($wordComparatorArray, trim(strtolower($row->$field)));
$uniq++;
}
}
// CALCULATE $qcount
foreach ($keywordArray as $word)
$qcount += @substr_count($sql, strtolower($word)); // NUMBER OF
INSTANCES OF EACH KEYWORD IN SQL QUERY
// CALCULATE $sumw1, EACH VALUE OF $dcount, $w1 PER KEYWORD PARTITION
PER RESULTSET ROW
@reset($keywordArray); // RESET KEYWORD ARRAY TO ENSURE
ACCURATE SCAN
for ($i = 0; $i < @sizeof($result); $i++) { // SCAN EVERY ROW
foreach ($keywordArray as $word) { // SCAN EACH COLUMN FIELD
PER KEYWORD ARRAY OF WORDS
foreach (@array_keys(get_object_vars($row)) as $field) { // SCAN
EACH COLUMN FIELD
if (!$isSetDCount) $dcount +=
@substr_count(strtolower($row->$field), strtolower($word));
$isSetDCount = false;
}
$w1 = log((float)$dcount) + 1;
if (!$isSetSumW1) $sumw1 += $w1;
$dcount = 0;
}
list($isSetDCount, $isSetSumW1) = array(true, true);
}
// CALCULATE $found
$found = @sizeof($result);
// CALCULATE $rows IF NOT PROVIDED AS PARAMETER
if ((int)$rows === 0) {
$query = new MySQLQuery("SELECT count(*) AS rows FROM $tableName
LIMIT 1", $dbConn);
$tempTableRowCountResult = $query->getResult();
$query = null;
$dbConnObj->close();
$dbConn = null;
$dbConnObj = null;
$rows = $tempTableRowCountResult[0]->rows;
}
print_r($row); print_r("<P>");
print_r("$w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
log((float)(($rows - $found) / $found)) * $qcount<P>");
print_r(($w1 / $sumw1) . ' ' . ($uniq / (1 + 0.0115 * $uniq)) . ' ' .
(log((float)(($rows - $found) / $found))) . ' ' . $qcount . "<P>");
$score = (float)(($w1 / $sumw1) * ($uniq / (1 + 0.0115 * $uniq)) *
log((float)(($rows - $found) / $found)) * $qcount);
return $score;
}
[/PHP]
Problem is that the scores that come back, even if tested for
4+-letter words, are about 10X the size of an actual relevancy score
for 4+-letter words (for 3 letter words it is also anywhere between
7.75 and 30, which I unfortunately cannot compare against with an
actual relevancy score as a 3-letter word is too small for fulltext
query in default settings in MySQL 4.0+).
Just a heads-up, I am not sure what to do at this point.
Thanx
Phil