473,289 Members | 1,923 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,289 software developers and data experts.

PHP/MySQL fulltext search relevancy formula update

The table already has a fulltext index and from there I can use the
MySQL fulltext search query to get results as well as the relevancy
score.

The problem I have is that MySQL has a default setting whereby the
minimum amount of characters is 4 for a search. Being that we're
government and full of TLA (three-letter acronyms), that is not
practical, and furthermore, the app I'm building must be fully
portable, so having MySQL tweaked everywhere this app goes is a
useless exercise.

Hence, for 3-letter words I have to fudge together a relevancy score
to match that of larger and more commonly accepted words.

Here is the function I created, calculate_relevancy(), with notes I
got from a posting by MySQL AB's Sergei Golubchik on the exact formula
(according to him):

[PHP]
/**
* Calculate the relevancy score of a search query if the "SELECT
MATCH () AGAINST ()" fulltext indexing query is unavailable
*
* @access public
* @param mixed $sql SQL query resulting in the resultset used for
calculating $qcount value
* @param object $result Resultset
* @param object $row Row of resultset to calculate relevancy
* @param mixed $keyword
* @param int $rows (conditional optional) number of rows in table -
must exist if $tableName is not passed
* @param mixed $tableName (optional) name of table to perform
counting query to calculate $rows value
* @return float $score
* @see DBConnection
* @see MySQLQuery
* @see http://groups.google.com/groups?q=mysql+fulltext+maximum+score&hl=en&lr=&se lm=9tt6u1%24sca%241%40FreeBSD.csie.NCTU.edu.tw&rnu m=1
*/
function calculate_relevancy($sql, $result, $row, $keyword, $rows = 0,
$tableName = '') {
/*---------------------------------------------------------------------------------------------------------------------------
Explanation of integer variables:

1) $w1 = intermediate weight
2) $dcount = number of times word is present in document (or query
row)
3) $sumw1 = sum of all values of $w1 found
4) $uniq = number of unique words in the document (or query row)
5) $rows = total number of rows in the table (as int parameter or
calculated via $tableName query
6) $found = number of rows in the table that contain the word in
question (@sizeof($result))
7) $qcount = number of times this word is present in the query
statement in $sql parameter
----------------------------------------------------------------------------------------------------------------------------*/
global $section;
if (!$tableName && (int)$rows === 0) return 0; // TO PREVENT DIVISION
BY ZERO
if ((int)$rows == 0 && (!class_exists('DBConnection') ||
!class_exists('MySQLQuery'))) return 0; // CLASSES MUST EXIST TO
RE-CALCULATE $rows

// INITIALIZE VARS, ARRAYS, ETC.
list($w1, $dcount, $sumw1, $uniq, $found, $qcount) = array(0, 0, 0,
0, 0, 0);
$keyword = trim(preg_replace('/[\s\t]{2,}/i', ' ', $keyword)); //
TRIM EXTRA WHITESPACE FOR MORE ACCURATE COUNT
$keywordArray = explode(' ', $keyword); // CONVERT INTO ARRAY TO
SEARCH WEIGHT FOR EACH INDIVIDUAL KEYWORD

if ((int)$rows !== 0 && $rows == $found) return 0; // TO PREVENT
DIVISION BY ZERO

// CALCULATE $uniq
$wordComparatorArray = array();
foreach (@array_keys(get_object_vars($row)) as $field) {
if ($row->$field && !in_array(strtolower($row->$field),
$wordComparatorArray)) {
array_push($wordComparatorArray, trim(strtolower($row->$field)));
$uniq++;
}
}

// CALCULATE $qcount
foreach ($keywordArray as $word)
$qcount += @substr_count($sql, strtolower($word)); // NUMBER OF
INSTANCES OF EACH KEYWORD IN SQL QUERY

// CALCULATE $sumw1, EACH VALUE OF $dcount, $w1 PER KEYWORD PARTITION
PER RESULTSET ROW
@reset($keywordArray); // RESET KEYWORD ARRAY TO ENSURE
ACCURATE SCAN
for ($i = 0; $i < @sizeof($result); $i++) { // SCAN EVERY ROW
foreach ($keywordArray as $word) { // SCAN EACH COLUMN FIELD
PER KEYWORD ARRAY OF WORDS
foreach (@array_keys(get_object_vars($row)) as $field) { // SCAN
EACH COLUMN FIELD
if (!$isSetDCount) $dcount +=
@substr_count(strtolower($row->$field), strtolower($word));
$isSetDCount = false;
}
$w1 = log((float)$dcount) + 1;
if (!$isSetSumW1) $sumw1 += $w1;
$dcount = 0;
}
list($isSetDCount, $isSetSumW1) = array(true, true);
}

// CALCULATE $found
$found = @sizeof($result);
// CALCULATE $rows IF NOT PROVIDED AS PARAMETER
if ((int)$rows === 0) {
$query = new MySQLQuery("SELECT count(*) AS rows FROM $tableName
LIMIT 1", $dbConn);
$tempTableRowCountResult = $query->getResult();
$query = null;
$dbConnObj->close();
$dbConn = null;
$dbConnObj = null;
$rows = $tempTableRowCountResult[0]->rows;
}

print_r($row); print_r("<P>");
print_r("$w1 / $sumw1 * $uniq / (1 + 0.0115 * $uniq) *
log((float)(($rows - $found) / $found)) * $qcount<P>");
print_r(($w1 / $sumw1) . ' ' . ($uniq / (1 + 0.0115 * $uniq)) . ' ' .
(log((float)(($rows - $found) / $found))) . ' ' . $qcount . "<P>");
$score = (float)(($w1 / $sumw1) * ($uniq / (1 + 0.0115 * $uniq)) *
log((float)(($rows - $found) / $found)) * $qcount);
return $score;

}

[/PHP]

Problem is that the scores that come back, even if tested for
4+-letter words, are about 10X the size of an actual relevancy score
for 4+-letter words (for 3 letter words it is also anywhere between
7.75 and 30, which I unfortunately cannot compare against with an
actual relevancy score as a 3-letter word is too small for fulltext
query in default settings in MySQL 4.0+).

Just a heads-up, I am not sure what to do at this point.

Thanx
Phil
Jul 17 '05 #1
0 5738

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: JT | last post by:
I have a MySQL fulltext search form in place and I now want to filter the results further. I've added a few multiple select boxes on my form (ResourceType and Topic), and I'm able to build and...
2
by: Phil Powell | last post by:
Relevancy scores are normally defined by a MySQL query on a table that has a fulltext index. The rules for relevancy scoring will exclude certain words due to their being too short (minimum...
2
by: Maziar Aflatoun | last post by:
Hi, I have the following table CREATE TABLE `fulltext_sample` ( `copy` text, FULLTEXT KEY `copy` (`copy`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; INSERT INTO `fulltext_sample` VALUES...
4
by: Ka | last post by:
I install a mysql server in default installation with latin charset, but I want to use GBK(a chinese charset), so that I can store and search chinese words directly. so, I download, unpack and...
0
by: Henry Hank | last post by:
Environment: I'm setting up a database server on a Dell Poweredge 2650, dual 1.8GHZ pentium with 1GB of memory and RAID5 drives. I've installed RedHat 9, and updated the kernel to 2.4.20-19.9smp....
0
by: Phil Powell | last post by:
I have a rather complicated query with a combination of LEFT JOINs and two MATCHES where the first match is non-boolean to get the accurate score, the second to search as boolean: SELECT...
0
by: Phil Powell | last post by:
Retracing my problem leads me to believe I never successfully created fulltext indexes for MySQL 3.23.58 MyISAM tables. I went to the MySQL manual and was able - or so I thought - to create them,...
9
by: elyob | last post by:
Hi, I'm looking at storing snippets of details in MySQL about what credit cards a business excepts. Rather than have a whole column for Visa, another for Amex etc ... I am looking at having a...
2
by: John Kotuby | last post by:
Hi all, Maybe this belongs in the Full Text group but I am writing an ASP.NET application with a SQL Server 2005 backend, so I am posing the question here. I have been using fulltext search...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE...
0
by: MeoLessi9 | last post by:
I have VirtualBox installed on Windows 11 and now I would like to install Kali on a virtual machine. However, on the official website, I see two options: "Installer images" and "Virtual machines"....
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: Aftab Ahmad | last post by:
Hello Experts! I have written a code in MS Access for a cmd called "WhatsApp Message" to open WhatsApp using that very code but the problem is that it gives a popup message everytime I clicked on...
0
by: Aftab Ahmad | last post by:
So, I have written a code for a cmd called "Send WhatsApp Message" to open and send WhatsApp messaage. The code is given below. Dim IE As Object Set IE =...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: Vimpel783 | last post by:
Hello! Guys, I found this code on the Internet, but I need to modify it a little. It works well, the problem is this: Data is sent from only one cell, in this case B5, but it is necessary that data...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.