subdist optimization
1. randomize all vectors via multiplication by a random orthogonal matrix * to generate the matrix fill the square matrix with normally distributed random values and create an orthogonal matrix with the QR decomposition * the rnd generator is seeded with the number of dimensions, so the matrix will be always the same for a given table * multiplication by an orthogonal matrix is a "rotation", so does not change distances or angles 2. when calculating the distance, first calculate a "subdistance", the distance between projections to the first subdist_part coordinates (=192, best by test, if it's larger it's less efficient, if it's smaller the error rate is too high) 3. calculate the full distance only if "subdistance" isn't confidently higher (above subdist_margin) than the distance we're comparing with * it might look like it would make sense to do a second projection at, say, subdist_part*2, and so on - but in practice one check is enough, the projected distance converges quickly and if it isn't confidently higher at subdist_part, it won't be later either This optimization introduces a constant overhead per insert/search operation - an input/query vector has to be multiplied by the matrix. And the optimization saves on every distance calculation. Thus it is only beneficial when a number of distance calculations (which grows with M and with the table size) is high enough to outweigh the constant overhead. Let's use MIN_ROWS table option to estimate the number of rows in the table. use_subdist_heuristic() is optimal for mnist and fashion-mnist (784 dimensions, 60k rows) and variations of gist (960 dimensions, 200k, 400k, 600k, 800k, 1000k rows)