Commit 239f0714 authored by serg@serg.mysql.com

boolean fulltext search weighting scheme changed

parent 5f2d79c5
@@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
Docs/safe-mysql.xml
mysys/test_vsnprintf
Docs/manual.de.log
Docs/internals.info
@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
* mysys functions:: Functions In The @code{mysys} Library
* DBUG:: DBUG Tags To Use
* protocol:: MySQL Client/Server Protocol
* Fulltext Search:: Fulltext Search in MySQL
@end menu
@@ -535,7 +536,7 @@ Print query.
@end table
-@node protocol, , DBUG, Top
+@node protocol, Fulltext Search, DBUG, Top
@chapter MySQL Client/Server Protocol
@menu
@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00
@c @printindex fn
@node Fulltext Search, , protocol, Top
@chapter Fulltext Search in MySQL
Hopefully, there will someday be a complete description of the
fulltext search algorithms.
For now, these are just unsorted notes.
@menu
* Weighting in boolean mode::
@end menu
@node Weighting in boolean mode, , , Fulltext Search
@section Weighting in boolean mode
The basic idea is as follows: in the expression
@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
is enough to match the whole expression, whereas @code{C},
@code{D}, and @code{E} must @strong{all} match. So it is
reasonable to assign a weight of 1 to each of @code{A}, @code{B}, and
@code{(C and D and E)}, and a weight of 1/3 to each of @code{C},
@code{D}, and @code{E}.

Things become more complicated when considering the boolean
operators used in MySQL FTB. Obviously, @code{+A +B}
should be treated as @code{A and B}, and @code{A B}
as @code{A or B}. The problem is that @code{+A B} can @strong{not}
be rewritten in and/or terms (which is why this extended
set of operators was chosen). Still, approximations can be used:
@code{+A B C} can be approximated as @code{A or (A and (B or C))}
or as @code{A or (A and B) or (A and C) or (A and B and C)}.

Applying the above logic (and omitting the mathematical
transformations and normalization), one gets that for
@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
should be @code{A_i = 1/N} and, if @code{N==0}, @code{B_j = 1}.
Otherwise, the first rewriting approach gives @code{B_j = 1/3},
and the second one gives @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}.
The second expression gives a somewhat steeper increase in total
weight as the number of matched B's increases, because it assigns
higher weights to individual B's. The first expression is also
much simpler, so it is the first one that is implemented in MySQL.
@summarycontents
@contents
......
@@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.
@itemize @bullet
@item
Boolean fulltext search weighting scheme changed to something more reasonable.
@item
Fixed a bug in boolean fulltext search that caused MySQL to ignore queries of
@code{ft_min_word_len} characters.
@item
@@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
break;
if (yn & FTB_FLAG_YES)
{
-ftbe->cur_weight+=weight;
+ftbe->cur_weight += weight / ftbe->ythresh;
if (++ftbe->yesses == ythresh)
{
yn=ftbe->flags;
@@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
}
else
{
-ftbe->cur_weight+=weight;
+ftbe->cur_weight += ftbe->ythresh ? weight/3 : weight;
if (ftbe->yesses < ythresh)
break;
yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
......
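The weighting rule from the documentation hunk, and the two changed lines in `_ftb_climb_the_tree`, can be illustrated with a small standalone sketch. It is not MySQL code: `struct expr_sketch`, `add_required`, `add_optional`, and `optional_weight_alt` are hypothetical names, and the sketch assumes that `ythresh` equals N, the number of required (`+`) words in the expression. The last function evaluates the second, unimplemented formula only for comparison.

```c
#include <assert.h>

/* Hypothetical stand-in for the two FTB expression fields touched by the
   patch. Assumption: ythresh is N, the number of required (+) words. */
struct expr_sketch {
  double cur_weight;
  int ythresh;
};

/* A matched required word contributes weight / N (A_i = 1/N). */
void add_required(struct expr_sketch *e, double weight)
{
  assert(e->ythresh > 0);
  e->cur_weight += weight / e->ythresh;
}

/* A matched optional word contributes the full weight when there are no
   required words (N == 0), and one third of it otherwise (B_j = 1/3). */
void add_optional(struct expr_sketch *e, double weight)
{
  e->cur_weight += e->ythresh ? weight / 3 : weight;
}

/* The second rewriting approach from the notes, shown for comparison only:
   B_j = (1 + (M-1)*2^M) / (M * (2^(M+1) - 1)), or 1 when N == 0. */
double optional_weight_alt(int n_required, int m_optional)
{
  double pow2 = 1.0;
  int i;

  if (n_required == 0)
    return 1.0;
  assert(m_optional > 0);
  for (i = 0; i < m_optional; i++)
    pow2 *= 2.0;                       /* pow2 = 2^M */
  return (1.0 + (m_optional - 1) * pow2) / (m_optional * (2.0 * pow2 - 1.0));
}
```

For `+A B C` with unit word weights (so `ythresh` is 1), one call to `add_required` and two calls to `add_optional` leave `cur_weight` at 1 + 2/3. The second formula would assign 5/14 to each of `B` and `C`, which is slightly higher, in line with the remark that it grows faster as more optional words match.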