473,230 Members | 1,817 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,230 software developers and data experts.

PHP/MySQL fulltext search relevancy formula update

The table already has a fulltext index and from there I can use the
MySQL fulltext search query to get results as well as the relevancy
score.

The problem I have is that MySQL has a default setting whereby the
minimum amount of characters is 4 for a search. Being that we're
government and full of TLA (three-letter acronyms), that is not
practical, and furthermore, the app I'm building must be fully
portable, so having MySQL tweaked everywhere this app goes is a
useless exercise.

Hence, for 3-letter words I have to fudge together a relevancy score
to match that of larger and more commonly accepted words.

Here is the function I created, calculate_relevancy(), with notes I
got from a posting by MySQL AB's Sergei Golubchik on the exact formula
(according to him):

[PHP]
/**
* Calculate the relevancy score of a search query if the "SELECT
MATCH () AGAINST ()" fulltext indexing query is unavailable
*
* @access public
* @param mixed $sql SQL query resulting in the resultset used for
calculating $qcount value
* @param object $result Resultset
* @param object $row Row of resultset to calculate relevancy
* @param mixed $keyword
* @param int $rows (conditional optional) number of rows in table -
must exist if $tableName is not passed
* @param mixed $tableName (optional) name of table to perform
counting query to calculate $rows value
* @return float $score
* @see DBConnection
* @see MySQLQuery
* @see http://groups.google.com/groups?q=mysql+fulltext+maximum+score&hl=en&lr=&se lm=9tt6u1%24sca%241%40FreeBSD.csie.NCTU.edu.tw&rnu m=1
*/
function calculate_relevancy($sql, $result, $row, $keyword, $rows = 0,
$tableName = '') {
/*---------------------------------------------------------------------------------------------------------------------------
Explanation of integer variables:

1) $w1 = intermediate weight
2) $dcount = number of times word is present in document (or query
row)
3) $sumw1 = sum of all values of $w1 found
4) $uniq = number of unique words in the document (or query row)
5) $rows = total number of rows in the table (as int parameter or
calculated via $tableName query
6) $found = number of rows in the table that contain the word in
question (@sizeof($result))
7) $qcount = number of times this word is present in the query
statement in $sql parameter
----------------------------------------------------------------------------------------------------------------------------*/
global $section;
if (!$tableName && (int)$rows === 0) return 0; // TO PREVENT DIVISION
BY ZERO
if ((int)$rows == 0 && (!class_exists('DBConnection') ||
!class_exists('MySQLQuery'))) return 0; // CLASSES MUST EXIST TO
RE-CALCULATE $rows

// INITIALIZE VARS, ARRAYS, ETC.
list($w1, $dcount, $sumw1, $uniq, $found, $qcount) = array(0, 0, 0,
0, 0, 0);
$keyword = trim(preg_replace('/[\s\t]{2,}/i', ' ', $keyword)); //
TRIM EXTRA WHITESPACE FOR MORE ACCURATE COUNT
$keywordArray = explode(' ', $keyword); // CONVERT INTO ARRAY TO
SEARCH WEIGHT FOR EACH INDIVIDUAL KEYWORD

if ((int)$rows !== 0 && $rows == $found) return 0; // TO PREVENT
DIVISION BY ZERO

// CALCULATE $uniq
$wordComparatorArray = array();
foreach (@array_keys(get_object_vars($row)) as $field) {
if ($row->$field && !in_array(strtolower($row->$field),
$wordComparatorArray)) {
array_push($wordComparatorArray, trim(strtolower($row->$field)));
$uniq++;
}
}

// CALCULATE $qcount
foreach ($keywordArray as $word)
$qcount += @substr_count($sql, strtolower($word)); // NUMBER OF
INSTANCES OF EACH KEYWORD IN SQL QUERY

// CALCULATE $sumw1, EACH VALUE OF $dcount, $w1 PER KEYWORD PARTITION
PER RESULTSET ROW
@reset($keywordArray); // RESET KEYWORD ARRAY TO ENSURE
ACCURATE SCAN
for ($i = 0; $i < @sizeof($result); $i++) { // SCAN EVERY ROW
foreach ($keywordArray as $word) { // SCAN EACH COLUMN FIELD
PER KEYWORD ARRAY OF WORDS
foreach (@array_keys(get_object_vars($row)) as $field) { // SCAN
EACH COLUMN FIELD
if (!$isSetDCount) $dcount +=
@substr_count(strtolower($row->$field), strtolower($word));
$isSetDCount = false;
}
$w1 = log((float)$dcount) + 1;
if (!$isSetSumW1) $sumw1 += $w1;
$dcount = 0;
}
list($isSetDCount, $isSetSumW1) = array(true, true);
}

// CALCULATE $found
$found = @sizeof($result);
// CALCULATE $rows IF NOT PROVIDED AS PARAMETER
if ((int)$rows === 0) {
$query = new MySQLQuery("SELECT count(*) AS rows FROM $tableName
LIMIT 1", $dbConn);
$tempTableRowCountResult = $query->getResult();
$query = null;
$dbConnObj->close();
$dbConn = null;
$dbConnObj = null;
$rows = $tempTableRowCountResult[0]->rows;
}

print_r($row); print_r("<P>");
print_r("$w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
log((float)(($rows - $found) / $found)) * $qcount<P>");
print_r(($w1 / $sumw1) . ' ' . ($uniq / (1 + 0.0115 * $uniq)) . ' ' .
(log((float)(($rows - $found) / $found))) . ' ' . $qcount . "<P>");
$score = (float)(($w1 / $sumw1) * ($uniq / (1 + 0.0115 * $uniq)) *
log((float)(($rows - $found) / $found)) * $qcount);
return $score;

}

[/PHP]

Problem is that the scores that come back, even if tested for
4+-letter words, are about 10X the size of an actual relevancy score
for 4+-letter words (for 3 letter words it is also anywhere between
7.75 and 30, which I unfortunately cannot compare against with an
actual relevancy score as a 3-letter word is too small for fulltext
query in default settings in MySQL 4.0+).

Just a heads-up, I am not sure what to do at this point.

Thanx
Phil
Jul 17 '05 #1
0 5734

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: JT | last post by:
I have a MySQL fulltext search form in place and I now want to filter the results further. I've added a few multiple select boxes on my form (ResourceType and Topic), and I'm able to build and...
2
by: Phil Powell | last post by:
Relevancy scores are normally defined by a MySQL query on a table that has a fulltext index. The rules for relevancy scoring will exclude certain words due to their being too short (minimum...
2
by: Maziar Aflatoun | last post by:
Hi, I have the following table CREATE TABLE `fulltext_sample` ( `copy` text, FULLTEXT KEY `copy` (`copy`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; INSERT INTO `fulltext_sample` VALUES...
4
by: Ka | last post by:
I install a mysql server in default installation with latin charset, but I want to use GBK(a chinese charset), so that I can store and search chinese words directly. so, I download, unpack and...
0
by: Henry Hank | last post by:
Environment: I'm setting up a database server on a Dell Poweredge 2650, dual 1.8GHZ pentium with 1GB of memory and RAID5 drives. I've installed RedHat 9, and updated the kernel to 2.4.20-19.9smp....
0
by: Phil Powell | last post by:
I have a rather complicated query with a combination of LEFT JOINs and two MATCHES where the first match is non-boolean to get the accurate score, the second to search as boolean: SELECT...
0
by: Phil Powell | last post by:
Retracing my problem leads me to believe I never successfully created fulltext indexes for MySQL 3.23.58 MyISAM tables. I went to the MySQL manual and was able - or so I thought - to create them,...
9
by: elyob | last post by:
Hi, I'm looking at storing snippets of details in MySQL about what credit cards a business excepts. Rather than have a whole column for Visa, another for Amex etc ... I am looking at having a...
2
by: John Kotuby | last post by:
Hi all, Maybe this belongs in the Full Text group but I am writing an ASP.NET application with a SQL Server 2005 backend, so I am posing the question here. I have been using fulltext search...
0
by: VivesProcSPL | last post by:
Obviously, one of the original purposes of SQL is to make data query processing easy. The language uses many English-like terms and syntax in an effort to make it easy to learn, particularly for...
3
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 3 Jan 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). For other local times, please check World Time Buddy In...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE...
0
by: fareedcanada | last post by:
Hello I am trying to split number on their count. suppose i have 121314151617 (12cnt) then number should be split like 12,13,14,15,16,17 and if 11314151617 (11cnt) then should be split like...
0
by: stefan129 | last post by:
Hey forum members, I'm exploring options for SSL certificates for multiple domains. Has anyone had experience with multi-domain SSL certificates? Any recommendations on reliable providers or specific...
0
Git
by: egorbl4 | last post by:
Скачал я git, хотел начать настройку, а там вылезло вот это Что это? Что мне с этим делать? ...
0
by: MeoLessi9 | last post by:
I have VirtualBox installed on Windows 11 and now I would like to install Kali on a virtual machine. However, on the official website, I see two options: "Installer images" and "Virtual machines"....
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.