Lots of FULLTEXT stuff (suggestions)

Hi all,

I'm planning to use MySQL's full-text search for my forum system
(possibly 5+ million posts). I've been playing with it a lot lately to
see the performance and functionality and have some
suggestions/questions.

First, since a few of you may be wanting to know, here is a thread where
I was doing some speed/optimization tests and stuff with 3 million
posts: http://www.sitepointforums.com/showt...threadid=69555
(From post #12)

In particular, I discovered that IN BOOLEAN MODE is really slow if you
want to sort by relevance (with a lot of matching rows, anyway). :-( For
non-BOOLEAN searches, though, I can get 1000 relevance-sorted results in
about 8-10 secs. for searches that match a LOT of rows and everything
has to be read from disk. The full-text processing seems to be very fast
(max 1-2 seconds of "FULLTEXT initialization" in PROCESSLIST). It's the
disk seeks to read random rows from the data file ("Sending data") that
take the most time (7200 RPM/~8ms seek IDE drive). Searches are *MUCH*
faster when the needed parts of the data file are cached by the OS!
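
A quick back-of-the-envelope check on why "Sending data" dominates. This is only a rough sketch: it assumes every matching row costs one uncached random seek and ignores rotational latency, transfer time, and OS caching. The function name is made up for illustration.

```python
def estimated_fetch_seconds(rows, seek_ms):
    """Estimate the time to read `rows` random rows at one disk seek each.

    Assumption: each row costs exactly one uncached random seek; rotational
    latency, transfer time, and OS caching are all ignored.
    """
    return rows * seek_ms / 1000.0


# 1000 relevance-sorted results on a drive with a ~8 ms seek time
print(estimated_fetch_seconds(1000, 8.0))
```

Under those assumptions the estimate comes out to roughly 8 seconds, which is in line with the 8-10 secs. I measured for uncached searches.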

Anyway, my suggestions:

--------------------------------------------------
*) Min/Max Word Length -- This should really be able to be set on at
least a per table basis (others may want per index). Right now, people
that don't have control of the server are at the mercy of the admin to
change the min/max word length.

I would also suggest that ft_min_word_len be 3 and ft_max_word_len be 32
by default. I think these would be better defaults for everyone than the
current 4/254.

Or if we could use

SET ft_min_word_len = n;

etc. for the current connection it would be nice.

*) Parser: Indexing of Any and All Numbers -- I think it would be a good
idea to index any sequence of digits less than ft_min_word_len long.
Anything numeric could be very relevant for searching -- software
versions, ages, dates, etc. -- and shouldn't be excluded.

Even anything *containing* a number (among letters) is probably relevant
for searching, again, even if it's shorter than ft_min_word_len, e.g.
RC1, B2, 8oz, F5, etc.

*) Parser: Other Things -- I've seen people trying to search
catalog/item/part numbers with "pieces" of the "number" separated by -
or / for example (making some "pieces" too short). How about indexing
words that are on either side of a "-" or "/" (with no space) no matter
their length? I don't mean including the - or / in the index -- just the
usual word characters on either side (I think) as *separate* words, not
a *single* word with the - or / removed. This would help with things
like CD-ROM, TCP/IP, etc.

Single quotes being counted as a word character is another issue I have.
(I discovered that they're not counted as part of the word when on the
end(s): 'quote' (thank God! :-))) Example: if someone searches for
MySQL, it won't find rows with MySQL's. Since possessive's (sic) are the
biggest problem, how about stripping any 's from the end of the word in
the index? So MySQL's would be indexed as MySQL.

*) "Always Index" Words -- Like it says in the full-text TODO section of
the manual. This should be able to be set on at least a per table basis
(again, others may want per index).

*) Stopword File -- I would also like to be able to define this per
table somehow.

*) Miscellaneous -- Mostly functionality related, from the TODO:
STEMMING! (controlled more finely than server level I hope), multi-byte
character set support, proximity operators. Anything to get it closer to
Verity's full-text functionality. ;-)

Any speed/optimization improvements are welcome for gigs of data,
especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance
like a natural language query, although this is probably difficult if a
wildcard* is used?). And the FULLTEXT index shouldn't always be chosen
for non-const join types when another index would find fewer rows first.
e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since
I guess it's a problem auto sorting by relevance if it's not using the
FULLTEXT index.
--------------------------------------------------
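
To make the parser suggestions above concrete, here is a rough Python sketch of the rules I have in mind. It is NOT how MySQL's parser works or would be implemented; it only illustrates the proposed behavior. The names (index_words, ALWAYS_INDEX, STOPWORDS) and the sample word lists are made up, and the 4/254 limits are just the current defaults mentioned above.

```python
import re

# Hypothetical settings for illustration; in MySQL these are server-wide.
MIN_WORD_LEN = 4
MAX_WORD_LEN = 254
STOPWORDS = {"that", "with", "have"}   # sample stopword list
ALWAYS_INDEX = {"php"}                 # sample "always index" words

# A chunk is a run of word characters, single quotes, "-" or "/".
CHUNK_RE = re.compile(r"[A-Za-z0-9_'/-]+")


def index_words(text):
    """Return the lowercased words the proposed rules would index."""
    words = []
    for chunk in CHUNK_RE.findall(text):
        # Split on "-" and "/" so CD-ROM gives two separate words.
        pieces = [p for p in re.split(r"[-/]", chunk) if p]
        joined = len(pieces) > 1   # pieces came from one "-" or "/" term
        for piece in pieces:
            word = piece.strip("'").lower()   # quotes on the ends don't count
            if word.endswith("'s"):
                word = word[:-2]              # MySQL's is indexed as mysql
            if not word or len(word) > MAX_WORD_LEN:
                continue
            has_digit = any(ch.isdigit() for ch in word)
            # Digits and "-" or "/" pieces are exempt from the minimum length.
            long_enough = len(word) >= MIN_WORD_LEN or has_digit or joined
            if word in ALWAYS_INDEX or (long_enough and word not in STOPWORDS):
                words.append(word)
    return words


print(index_words("MySQL's CD-ROM and TCP/IP in RC1 'quote' at 5"))
```

With these rules, short pieces like CD, ROM, IP, RC1, and 5 get indexed, while ordinary short words such as "and" are still dropped by the minimum length.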

To other full-text users: what do you think of these suggestions?

To the developers: any word on if and when any of these things would be
implemented? I know from the TODO and other list messages that some
will. Any *estimates* (in months or MySQL version) on when would be
great. Just any info on new full-text features, even ones that I didn't
mention, would be awesome to hear. :-) Also, how they would be
implemented and used by us (syntax, etc.).

How about changing the default min/max (or just min if you want) word
length? I think everyone *really* wishes ft_min_word_len was 3. Seems
like that and indexing numbers shorter than min_word_len could be easily
done. Please? :-)

Here are a couple of mailing list threads about full-text:
http://lists.mysql.com/list.php?3:sss:2365
http://lists.mysql.com/list.php?3:sss:6749

There Sergei is talking about a new .frm format (plain text) that will
allow more of these features. Will it allow us to somehow define how to
parse things or something?? Could you elaborate more on what this will
bring? In November 2001, he said the new .frm format would be here "this
year." It's been almost 2 years since then, so when is it due? ;-/ Talk
of a "dynamic" stopword list sounds interesting.

Also, are the current MySQL versions using the "2 level" full-text index
format yet? I'm thinking not?

Finally, in the full-text TODO, it says "Generic user-suppliable UDF
preparser." Could you also elaborate on this? The "generic" part almost
makes it sound like some sort of "script" to define how to parse the
text. But UDF makes it sound like a separate thing that has to be loaded
with CREATE FUNCTION. But UDFs won't work with your MySQL binaries, will
they, since they're compiled statically?

Looking forward to any comments from the developers and other users.
Thanks in advance!

Matt
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql

Jul 19 '05 #1