Commit 3dcd3b74 authored by unknown's avatar unknown

Fulltext search section moved

parent e75b64c0
...@@ -120,6 +120,7 @@ version see the relevant distribution. ...@@ -120,6 +120,7 @@ version see the relevant distribution.
* Tutorial:: @strong{MySQL} Tutorial * Tutorial:: @strong{MySQL} Tutorial
* Server:: @strong{MySQL} Server * Server:: @strong{MySQL} Server
* Replication:: Replication * Replication:: Replication
* Fulltext Search:: Fulltext Search
* Performance:: Getting maximum performance from @strong{MySQL} * Performance:: Getting maximum performance from @strong{MySQL}
* MySQL Benchmarks:: The @strong{MySQL} benchmark suite * MySQL Benchmarks:: The @strong{MySQL} benchmark suite
* Tools:: @strong{MySQL} Utilities * Tools:: @strong{MySQL} Utilities
...@@ -600,6 +601,13 @@ Replication in MySQL ...@@ -600,6 +601,13 @@ Replication in MySQL
* Replication FAQ:: Frequently Asked Questions about replication * Replication FAQ:: Frequently Asked Questions about replication
* Replication Problems:: Troubleshooting Replication. * Replication Problems:: Troubleshooting Replication.
MySQL Full-text Search
* Fulltext Search::
* Fulltext Fine-tuning::
* Fulltext Features to Appear in MySQL 4.0::
* Fulltext TODO::
Getting Maximum Performance from MySQL Getting Maximum Performance from MySQL
* Optimize Basics:: Optimization overview * Optimize Basics:: Optimization overview
...@@ -868,15 +876,8 @@ How MySQL Compares to @code{mSQL} ...@@ -868,15 +876,8 @@ How MySQL Compares to @code{mSQL}
MySQL Internals MySQL Internals
* MySQL threads:: MySQL threads * MySQL threads:: MySQL threads
* MySQL full-text search:: MySQL full-text search
* MySQL test suite:: MySQL test suite * MySQL test suite:: MySQL test suite
MySQL Full-text Search
* Fulltext Fine-tuning::
* Fulltext features to appear in MySQL 4.0::
* Fulltext TODO::
Credits Credits
* Developers:: * Developers::
...@@ -15379,7 +15380,7 @@ In @strong{MySQL} Version 3.23.23 or later, you can also create special ...@@ -15379,7 +15380,7 @@ In @strong{MySQL} Version 3.23.23 or later, you can also create special
@code{MyISAM} table type supports @code{FULLTEXT} indexes. They can be @code{MyISAM} table type supports @code{FULLTEXT} indexes. They can be
created only from @code{VARCHAR} and @code{TEXT} columns. created only from @code{VARCHAR} and @code{TEXT} columns.
Indexing always happens over the entire column and partial indexing is not Indexing always happens over the entire column and partial indexing is not
supported. See @ref{MySQL full-text search} for details. supported. See @ref{Fulltext Search} for details.
@cindex multi-column indexes @cindex multi-column indexes
@cindex indexes, multi-column @cindex indexes, multi-column
...@@ -16122,7 +16123,7 @@ For @code{MATCH ... AGAINST()} to work, a @strong{FULLTEXT} index ...@@ -16122,7 +16123,7 @@ For @code{MATCH ... AGAINST()} to work, a @strong{FULLTEXT} index
must be created first. @xref{CREATE TABLE, , @code{CREATE TABLE}}. must be created first. @xref{CREATE TABLE, , @code{CREATE TABLE}}.
@code{MATCH ... AGAINST()} is available in @strong{MySQL} Version @code{MATCH ... AGAINST()} is available in @strong{MySQL} Version
3.23.23 or later. For details and usage examples 3.23.23 or later. For details and usage examples
@pxref{MySQL full-text search}. @pxref{Fulltext Search}.
@end table @end table
@findex casts @findex casts
...@@ -18496,7 +18497,7 @@ In @strong{MySQL} Version 3.23.23 or later, you can also create special ...@@ -18496,7 +18497,7 @@ In @strong{MySQL} Version 3.23.23 or later, you can also create special
@code{MyISAM} table type supports @code{FULLTEXT} indexes. They can be created @code{MyISAM} table type supports @code{FULLTEXT} indexes. They can be created
only from @code{VARCHAR} and @code{TEXT} columns. only from @code{VARCHAR} and @code{TEXT} columns.
Indexing always happens over the entire column, partial indexing is not Indexing always happens over the entire column, partial indexing is not
supported. See @ref{MySQL full-text search} for details of operation. supported. See @ref{Fulltext Search} for details of operation.
@item @item
The @code{FOREIGN KEY}, @code{CHECK}, and @code{REFERENCES} clauses don't The @code{FOREIGN KEY}, @code{CHECK}, and @code{REFERENCES} clauses don't
...@@ -22675,7 +22676,7 @@ For more information about how @strong{MySQL} uses indexes, see ...@@ -22675,7 +22676,7 @@ For more information about how @strong{MySQL} uses indexes, see
@code{FULLTEXT} indexes can index only @code{VARCHAR} and @code{FULLTEXT} indexes can index only @code{VARCHAR} and
@code{TEXT} columns, and only in @code{MyISAM} tables. @code{FULLTEXT} indexes @code{TEXT} columns, and only in @code{MyISAM} tables. @code{FULLTEXT} indexes
are available in @strong{MySQL} Version 3.23.23 and later. are available in @strong{MySQL} Version 3.23.23 and later.
@ref{MySQL full-text search}. @ref{Fulltext Search}.
@findex DROP INDEX @findex DROP INDEX
@node DROP INDEX, Comments, CREATE INDEX, Reference @node DROP INDEX, Comments, CREATE INDEX, Reference
...@@ -26913,7 +26914,7 @@ tables}. ...@@ -26913,7 +26914,7 @@ tables}.
@cindex increasing, speed @cindex increasing, speed
@cindex speed, increasing @cindex speed, increasing
@cindex databases, replicating @cindex databases, replicating
@node Replication, Performance, Server, Top @node Replication, Fulltext Search, Server, Top
@chapter Replication in MySQL @chapter Replication in MySQL
@menu @menu
* Replication Intro:: Introduction * Replication Intro:: Introduction
...@@ -27871,10 +27872,208 @@ Once you have collected the evidence on the phantom problem, try hard to ...@@ -27871,10 +27872,208 @@ Once you have collected the evidence on the phantom problem, try hard to
isolate it into a separate test case first. Then report the problem to isolate it into a separate test case first. Then report the problem to
@email{bugs@@lists.mysql.com} with as much info as possible. @email{bugs@@lists.mysql.com} with as much info as possible.
@cindex searching, full-text
@cindex full-text search
@cindex FULLTEXT
@node Fulltext Search, Performance, Replication, Top
@chapter MySQL Full-text Search
Since Version 3.23.23, @strong{MySQL} has support for full-text indexing
and searching. Full-text indexes in @strong{MySQL} are an index of type
@code{FULLTEXT}. @code{FULLTEXT} indexes can be created from @code{VARCHAR}
and @code{TEXT} columns at @code{CREATE TABLE} time or added later with
@code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, adding
@code{FULLTEXT} index with @code{ALTER TABLE} (or @code{CREATE INDEX}) would
be much faster than inserting rows into the empty table with a @code{FULLTEXT}
index.
Full-text search is performed with the @code{MATCH} function.
@example
mysql> CREATE TABLE t (a VARCHAR(200), b TEXT, FULLTEXT (a,b));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO t VALUES
-> ('MySQL has now support', 'for full-text search'),
-> ('Full-text indexes', 'are called collections'),
-> ('Only MyISAM tables','support collections'),
-> ('Function MATCH ... AGAINST()','is used to do a search'),
-> ('Full-text search in MySQL', 'implements vector space model');
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM t WHERE MATCH (a,b) AGAINST ('MySQL');
+---------------------------+-------------------------------+
| a | b |
+---------------------------+-------------------------------+
| MySQL has now support | for full-text search |
| Full-text search in MySQL | implements vector-space-model |
+---------------------------+-------------------------------+
2 rows in set (0.00 sec)
mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
+------------------------------+-------------------------------+--------+
| a | b | x |
+------------------------------+-------------------------------+--------+
| MySQL has now support | for full-text search | 0.3834 |
| Full-text indexes | are called collections | 0.3834 |
| Only MyISAM tables | support collections | 0.7668 |
| Function MATCH ... AGAINST() | is used to do a search | 0 |
| Full-text search in MySQL | implements vector space model | 0 |
+------------------------------+-------------------------------+--------+
5 rows in set (0.00 sec)
@end example
The function @code{MATCH} matches a natural language query @code{AGAINST}
a text collection (which is simply the columns that are covered by a
@code{FULLTEXT} index). For every row in a table it returns relevance -
a similarity measure between the text in that row (in the columns that are
part of the collection) and the query. When it is used in a @code{WHERE}
clause (see example above) the rows returned are automatically sorted with
relevance decreasing. Relevance is a non-negative floating-point number.
Zero relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents (rows)
that contain a particular word.
MySQL uses a very simple parser to split text into words. A ``word'' is
any sequence of letters, numbers, @samp{'}, and @samp{_}. Any ``word''
that is present in the stopword list or just too short (3 characters
or less) is ignored.
Every correct word in the collection and in the query is weighted,
according to its significance in the query or collection. This way, a
word that is present in many documents will have lower weight (and may
even have a zero weight), because it has lower semantic value in this
particular collection. Otherwise, if the word is rare, it will receive a
higher weight. The weights of the words are then combined to compute the
relevance of the row.
Such a technique works best with large collections (in fact, it was
carefully tuned this way). For very small tables, word distribution
does not reflect adequately their semantical value, and this model
may sometimes produce bizarre results.
For example, search for the word "search" will produce no results in the
above example. Word "search" is present in more than half of rows, and
as such, is effectively treated as a stopword (that is, with semantical value
zero). It is, really, the desired behavior - a natural language query
should not return every other row in 1GB table.
A word that matches half of rows in a table is less likely to locate relevant
documents. In fact, it will most likely find plenty of irrelevant documents.
We all know this happens far too often when we are trying to find something on
the Internet with a search engine. It is with this reasoning that such rows
have been assigned a low semantical value in @strong{a particular dataset}.
@menu
* Fulltext Fine-tuning::
* Fulltext Features to Appear in MySQL 4.0::
* Fulltext TODO::
@end menu
@node Fulltext Fine-tuning, Fulltext Features to Appear in MySQL 4.0, , Fulltext Search
@section Fine-tuning MySQL Full-text Search
Unfortunately, full-text search has no user-tunable parameters yet,
although adding some is very high on the TODO. However, if you have a
@strong{MySQL} source distribution (@xref{Installing source}.), you can
somewhat alter the full-text search behavior.
Note that full-text search was carefully tuned for the best searching
effectiveness. Modifying the default behavior will, in most cases,
only make the search results worse. Do not alter the @strong{MySQL} sources
unless you know what you are doing!
@itemize
@item
Minimal length of word to be indexed is defined in
@code{myisam/ftdefs.h} file by the line
@example
#define MIN_WORD_LEN 4
@end example
Change it to the value you prefer, recompile @strong{MySQL}, and rebuild
your @code{FULLTEXT} indexes.
@item
The stopword list is defined in @code{myisam/ft_static.c}
Modify it to your taste, recompile @strong{MySQL} and rebuild
your @code{FULLTEXT} indexes.
@item
The 50% threshold is caused by the particular weighting scheme chosen. To
disable it, change the following line in @code{myisam/ftdefs.h}:
@example
#define GWS_IN_USE GWS_PROB
@end example
to
@example
#define GWS_IN_USE GWS_FREQ
@end example
and recompile @strong{MySQL}.
There is no need to rebuild the indexes in this case.
@end itemize
@node Fulltext Features to Appear in MySQL 4.0, Fulltext TODO, Fulltext Fine-tuning, Fulltext Search
@section New Features of Full-text Search to Appear in MySQL 4.0
This section includes a list of the fulltext features that are already
implemented in the 4.0 tree. It explains
@strong{More functions for full-text search} entry of @ref{TODO MySQL 4.0}.
@itemize @bullet
@item @code{REPAIR TABLE} with @code{FULLTEXT} indexes,
@code{ALTER TABLE} with @code{FULLTEXT} indexes, and
@code{OPTIMIZE TABLE} with @code{FULLTEXT} indexes are now
up to 100 times faster.
@item @code{MATCH ... AGAINST} now supports the following
@strong{boolean operators}:
@itemize @bullet
@item @code{+}word means the that word @strong{must} be present in every
row returned.
@item @code{-}word means the that word @strong{must not} be present in every
row returned.
@item @code{<} and @code{>} can be used to decrease and increase word
weight in the query.
@item @code{~} can be used to assign a @strong{negative} weight to a noise
word.
@item @code{*} is a truncation operator.
@end itemize
Boolean search utilizes a more simplistic way of calculating the relevance,
that does not have a 50% threshold.
@item Searches are now up to 2 times faster due to optimized search algorithm.
@item Utility program @code{ft_dump} added for low-level @code{FULLTEXT}
index operations (querying/dumping/statistics).
@end itemize
@node Fulltext TODO, , Fulltext Features to Appear in MySQL 4.0, Fulltext Search
@section Full-text Search TODO
@itemize @bullet
@item Make all operations with @code{FULLTEXT} index @strong{faster}.
@item Support for braces @code{()} in boolean full-text search.
@item Support for "always-index words". They could be any strings
the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc.
@item Support for full-text search in @code{MERGE} tables.
@item Support for multi-byte charsets.
@item Make stopword list to depend of the language of the data.
@item Stemming (dependent of the language of the data, of course).
@item Generic user-supplyable UDF (?) preparser.
@item Make the model more flexible (by adding some adjustable
parameters to @code{FULLTEXT} in @code{CREATE/ALTER TABLE}).
@end itemize
@cindex performance, maximizing @cindex performance, maximizing
@cindex optimization @cindex optimization
@node Performance, MySQL Benchmarks, Replication, Top @node Performance, MySQL Benchmarks, Fulltext Search, Top
@chapter Getting Maximum Performance from MySQL @chapter Getting Maximum Performance from MySQL
Optimization is a complicated task because it ultimately requires Optimization is a complicated task because it ultimately requires
...@@ -40160,11 +40359,10 @@ This is a relatively low traffic list, in comparison with ...@@ -40160,11 +40359,10 @@ This is a relatively low traffic list, in comparison with
@menu @menu
* MySQL threads:: MySQL threads * MySQL threads:: MySQL threads
* MySQL full-text search:: MySQL full-text search
* MySQL test suite:: MySQL test suite * MySQL test suite:: MySQL test suite
@end menu @end menu
@node MySQL threads, MySQL full-text search, MySQL internals, MySQL internals @node MySQL threads, MySQL test suite, , MySQL internals
@section MySQL Threads @section MySQL Threads
The @strong{MySQL} server creates the following threads: The @strong{MySQL} server creates the following threads:
...@@ -40211,208 +40409,9 @@ started to read and apply updates from the master. ...@@ -40211,208 +40409,9 @@ started to read and apply updates from the master.
@code{mysqladmin processlist} only shows the connection, @code{INSERT DELAYED}, @code{mysqladmin processlist} only shows the connection, @code{INSERT DELAYED},
and replication threads. and replication threads.
@cindex searching, full-text
@cindex full-text search
@cindex FULLTEXT
@node MySQL full-text search, MySQL test suite, MySQL threads, MySQL internals
@section MySQL Full-text Search
Since Version 3.23.23, @strong{MySQL} has support for full-text indexing
and searching. Full-text indexes in @strong{MySQL} are an index of type
@code{FULLTEXT}. @code{FULLTEXT} indexes can be created from @code{VARCHAR}
and @code{TEXT} columns at @code{CREATE TABLE} time or added later with
@code{ALTER TABLE} or @code{CREATE INDEX}. For large datasets, adding
@code{FULLTEXT} index with @code{ALTER TABLE} (or @code{CREATE INDEX}) would
be much faster than inserting rows into the empty table with a @code{FULLTEXT}
index.
Full-text search is performed with the @code{MATCH} function.
@example
mysql> CREATE TABLE t (a VARCHAR(200), b TEXT, FULLTEXT (a,b));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO t VALUES
-> ('MySQL has now support', 'for full-text search'),
-> ('Full-text indexes', 'are called collections'),
-> ('Only MyISAM tables','support collections'),
-> ('Function MATCH ... AGAINST()','is used to do a search'),
-> ('Full-text search in MySQL', 'implements vector space model');
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM t WHERE MATCH (a,b) AGAINST ('MySQL');
+---------------------------+-------------------------------+
| a | b |
+---------------------------+-------------------------------+
| MySQL has now support | for full-text search |
| Full-text search in MySQL | implements vector-space-model |
+---------------------------+-------------------------------+
2 rows in set (0.00 sec)
mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
+------------------------------+-------------------------------+--------+
| a | b | x |
+------------------------------+-------------------------------+--------+
| MySQL has now support | for full-text search | 0.3834 |
| Full-text indexes | are called collections | 0.3834 |
| Only MyISAM tables | support collections | 0.7668 |
| Function MATCH ... AGAINST() | is used to do a search | 0 |
| Full-text search in MySQL | implements vector space model | 0 |
+------------------------------+-------------------------------+--------+
5 rows in set (0.00 sec)
@end example
The function @code{MATCH} matches a natural language query @code{AGAINST}
a text collection (which is simply the columns that are covered by a
@strong{FULLTEXT} index). For every row in a table it returns relevance -
a similarity measure between the text in that row (in the columns that are
part of the collection) and the query. When it is used in a @code{WHERE}
clause (see example above) the rows returned are automatically sorted with
relevance decreasing. Relevance is a non-negative floating-point number.
Zero relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents (rows)
that contain a particular word.
MySQL uses a very simple parser to split text into words. A ``word'' is
any sequence of letters, numbers, @samp{'}, and @samp{_}. Any ``word''
that is present in the stopword list or just too short (3 characters
or less) is ignored.
Every correct word in the collection and in the query is weighted,
according to its significance in the query or collection. This way, a
word that is present in many documents will have lower weight (and may
even have a zero weight), because it has lower semantic value in this
particular collection. Otherwise, if the word is rare, it will receive a
higher weight. The weights of the words are then combined to compute the
relevance of the row.
Such a technique works best with large collections (in fact, it was
carefully tuned this way). For very small tables, word distribution
does not reflect adequately their semantical value, and this model
may sometimes produce bizarre results.
For example, search for the word "search" will produce no results in the
above example. Word "search" is present in more than half of rows, and
as such, is effectively treated as a stopword (that is, with semantical value
zero). It is, really, the desired behavior - a natural language query
should not return every other row in 1GB table.
A word that matches half of rows in a table is less likely to locate relevant
documents. In fact, it will most likely find plenty of irrelevant documents.
We all know this happens far too often when we are trying to find something on
the Internet with a search engine. It is with this reasoning that such rows
have been assigned a low semantical value in @strong{a particular dataset}.
@menu
* Fulltext Fine-tuning::
* Fulltext features to appear in MySQL 4.0::
* Fulltext TODO::
@end menu
@node Fulltext Fine-tuning, Fulltext features to appear in MySQL 4.0, MySQL full-text search, MySQL full-text search
@subsection Fine-tuning MySQL Full-text Search
Unfortunately, full-text search has no user-tunable parameters yet,
although adding some is very high on the TODO. However, if you have a
@strong{MySQL} source distribution (@xref{Installing source}.), you can
somewhat alter the full-text search behavior.
Note that full-text search was carefully tuned for the best searching
effectiveness. Modifying the default behavior will, in most cases,
only make the search results worse. Do not alter the @strong{MySQL} sources
unless you know what you are doing!
@itemize
@item
Minimal length of word to be indexed is defined in
@code{myisam/ftdefs.h} file by the line
@example
#define MIN_WORD_LEN 4
@end example
Change it to the value you prefer, recompile @strong{MySQL}, and rebuild
your @code{FULLTEXT} indexes.
@item
The stopword list is defined in @code{myisam/ft_static.c}
Modify it to your taste, recompile @strong{MySQL} and rebuild
your @code{FULLTEXT} indexes.
@item
The 50% threshold is caused by the particular weighting scheme chosen. To
disable it, change the following line in @code{myisam/ftdefs.h}:
@example
#define GWS_IN_USE GWS_PROB
@end example
to
@example
#define GWS_IN_USE GWS_FREQ
@end example
and recompile @strong{MySQL}.
There is no need to rebuild the indexes in this case.
@end itemize
@node Fulltext features to appear in MySQL 4.0, Fulltext TODO, Fulltext Fine-tuning, MySQL full-text search
@subsection New Features of Full-text Search to Appear in MySQL 4.0
This section includes a list of the fulltext features that are already
implemented in the 4.0 tree. It explains
@strong{More functions for full-text search} entry of @ref{TODO MySQL 4.0}.
@itemize @bullet
@item @code{REPAIR TABLE} with @code{FULLTEXT} indexes,
@code{ALTER TABLE} with @code{FULLTEXT} indexes, and
@code{OPTIMIZE TABLE} with @code{FULLTEXT} indexes are now
up to 100 times faster.
@item @code{MATCH ... AGAINST} now supports the following
@strong{boolean operators}:
@itemize @bullet
@item @code{+}word means the that word @strong{must} be present in every
row returned.
@item @code{-}word means the that word @strong{must not} be present in every
row returned.
@item @code{<} and @code{>} can be used to decrease and increase word
weight in the query.
@item @code{~} can be used to assign a @strong{negative} weight to a noise
word.
@item @code{*} is a truncation operator.
@end itemize
Boolean search utilizes a more simplistic way of calculating the relevance,
that does not have a 50% threshold.
@item Searches are now up to 2 times faster due to optimized search algorithm.
@item Utility program @code{ft_dump} added for low-level @code{FULLTEXT}
index operations (querying/dumping/statistics).
@end itemize
@node Fulltext TODO, , Fulltext features to appear in MySQL 4.0, MySQL full-text search
@subsection Full-text Search TODO
@itemize @bullet
@item Make all operations with @code{FULLTEXT} index @strong{faster}.
@item Support for braces @code{()} in boolean fulltext search.
@item Support for "always-index words". They could be any strings
the user wants to treat as words, examples are "C++", "AS/400", "TCP/IP", etc.
@item Support for fulltext search in @code{MERGE} tables.
@item Support for multi-byte charsets.
@item Make stopword list to depend of the language of the data.
@item Stemming (dependent of the language of the data, of course).
@item Generic user-supplyable UDF (?) preparser.
@item Make the model more flexible (by adding some adjustable
parameters to @code{FULLTEXT} in @code{CREATE/ALTER TABLE}).
@end itemize
@cindex mysqltest, MySQL Test Suite @cindex mysqltest, MySQL Test Suite
@cindex testing mysqld, mysqltest @cindex testing mysqld, mysqltest
@node MySQL test suite, , MySQL full-text search, MySQL internals @node MySQL test suite, , MySQL threads, MySQL internals
@section MySQL Test Suite @section MySQL Test Suite
Until recently, our main full-coverage test suite was based on proprietary Until recently, our main full-coverage test suite was based on proprietary
...@@ -47563,7 +47562,7 @@ the @code{.MYD} file. ...@@ -47563,7 +47562,7 @@ the @code{.MYD} file.
Better replication. Better replication.
@item @item
More functions for full-text search. More functions for full-text search.
@xref{Fulltext features to appear in MySQL 4.0}. @xref{Fulltext Features to Appear in MySQL 4.0}.
@item @item
Character set casts and syntax for handling multiple character sets. Character set casts and syntax for handling multiple character sets.
@item @item
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment