Commit 7040048f authored by unknown

boolean fulltext search weighting scheme changed


Docs/internals.texi:
  fulltext chapter added
Docs/manual.texi:
  news updated
BitKeeper/etc/ignore:
  Added Docs/internals.info to the ignore list
myisam/ft_boolean_search.c:
  weighting scheme changed
parent 9b59e430
......@@ -463,3 +463,4 @@ mysql-test/r/rpl000001.eval
Docs/safe-mysql.xml
mysys/test_vsnprintf
Docs/manual.de.log
Docs/internals.info
......@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
* mysys functions:: Functions In The @code{mysys} Library
* DBUG:: DBUG Tags To Use
* protocol:: MySQL Client/Server Protocol
* Fulltext Search:: Fulltext Search in MySQL
@end menu
......@@ -535,7 +536,7 @@ Print query.
@end table
@node protocol, , DBUG, Top
@node protocol, Fulltext Search, DBUG, Top
@chapter MySQL Client/Server Protocol
@menu
......@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00
@c @printindex fn
@node Fulltext Search, , protocol, Top
@chapter Fulltext Search in MySQL
Eventually this chapter should contain a complete description of the
fulltext search algorithms.
For now it is just a collection of unsorted notes.
@menu
* Weighting in boolean mode::
@end menu
@node Weighting in boolean mode, , , Fulltext Search
@section Weighting in boolean mode
The basic idea is as follows: in the expression
@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
is enough to match the whole expression, whereas @code{C},
@code{D}, and @code{E} must @strong{all} match. So it is
reasonable to assign a weight of 1 to each of @code{A}, @code{B}, and
@code{(C and D and E)}, which means that @code{C}, @code{D}, and @code{E}
should each get a weight of 1/3.
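As a rough illustration of this weighting (not the code used in MySQL), the sketch below sums the weights of the matched terms for the expression @code{A or B or (C and D and E)}. The helper names @code{and_term_weight} and @code{example_weight} are hypothetical and exist only for this example; the function computes a weight and does not decide whether the expression matches.

```c
#include <assert.h>

/* Hypothetical helper: weight of a single term inside an AND group
   that contains n_terms terms, so the whole group sums to 1. */
double and_term_weight(int n_terms)
{
  assert(n_terms > 0);
  return 1.0 / n_terms;
}

/* Hypothetical helper: total weight for `A or B or (C and D and E)`.
   Each argument is 1 if the term matched and 0 otherwise. */
double example_weight(int a, int b, int c, int d, int e)
{
  double w = and_term_weight(3);   /* C, D and E get 1/3 each */
  return a * 1.0 + b * 1.0 + (c + d + e) * w;
}
```

With these assumptions, matching @code{A} alone and matching all of @code{C}, @code{D}, and @code{E} both contribute a total weight of 1, while matching only two of the three AND terms contributes 2/3.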
Things become more complicated when considering the boolean
operators used in MySQL FTB. Obviously, @code{+A +B}
should be treated as @code{A and B}, and @code{A B}
as @code{A or B}. The problem is that @code{+A B} can @strong{not}
be rewritten in and/or terms (which is why this extended
set of operators was chosen). Still, approximations can be used:
@code{+A B C} can be approximated as @code{A or (A and (B or C))}
or as @code{A or (A and B) or (A and C) or (A and B and C)}.
Applying the above logic (and omitting the mathematical
transformations and normalization), one finds that for
@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
should be @code{A_i = 1/N} and, if @code{N==0}, @code{B_j = 1}.
Otherwise, the first rewriting approach gives @code{B_j = 1/3},
and the second one gives @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}.
The second expression gives a somewhat steeper increase in total
weight as the number of matched B's increases, because it assigns
higher weights to individual B's. The first expression is also
much simpler, so it is the first one that is implemented in MySQL.
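To make the two candidate formulas easier to compare, here is a minimal sketch that evaluates them directly. The function names are hypothetical and are not taken from @code{myisam/ft_boolean_search.c}; the sketch only transcribes the formulas above and assumes @code{M} is small enough for @code{2^M} to fit in a @code{long}.

```c
#include <assert.h>

/* Weight of each required (+) term A_i when there are n required terms. */
double required_weight(int n)
{
  assert(n > 0);
  return 1.0 / n;
}

/* First rewriting approach: B_j = 1 if N == 0, otherwise 1/3. */
double optional_weight_simple(int n)
{
  return n == 0 ? 1.0 : 1.0 / 3.0;
}

/* Second rewriting approach:
   B_j = 1 if N == 0, otherwise (1 + (M-1)*2^M) / (M * (2^(M+1) - 1)). */
double optional_weight_expanded(int n, int m)
{
  double pow2m;
  assert(m > 0 && m < 30);         /* keep 2^M well inside a long */
  if (n == 0)
    return 1.0;
  pow2m = (double) (1L << m);      /* 2^M */
  return (1.0 + (m - 1) * pow2m) / (m * (2.0 * pow2m - 1.0));
}
```

For @code{M = 1} both approaches give 1/3. For larger @code{M} the second formula gives larger values (5/14 for @code{M = 2}, 17/45 for @code{M = 3}), which is the steeper growth mentioned above.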
@summarycontents
@contents
......
......@@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.
@itemize @bullet
@item
Boolean fulltext search weighting scheme changed to something more reasonable.
@item
Fixed bug in boolean fulltext search, that caused MySQL to ignore queries of
@code{ft_min_word_len} characters.
@item
......@@ -322,7 +322,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
break;
if (yn & FTB_FLAG_YES)
{
ftbe->cur_weight+=weight;
ftbe->cur_weight += weight / ftbe->ythresh;
if (++ftbe->yesses == ythresh)
{
yn=ftbe->flags;
......@@ -360,7 +360,7 @@ void _ftb_climb_the_tree(FTB *ftb, FTB_WORD *ftbw, FT_SEG_ITERATOR *ftsi_orig)
}
else
{
ftbe->cur_weight+=weight;
ftbe->cur_weight += ftbe->ythresh ? weight/3 : weight;
if (ftbe->yesses < ythresh)
break;
yn= (ftbe->yesses++ == ythresh) ? ftbe->flags : 0 ;
......