    subdist optimization · 92878e97
    Sergei Golubchik authored
    1. randomize all vectors via multiplication by a random orthogonal
       matrix (see the first sketch after this list)
       * to generate the matrix, fill a square matrix with normally
         distributed random values and create an orthogonal matrix via
         the QR decomposition
       * the random number generator is seeded with the number of
         dimensions, so the matrix is always the same for a given table
       * multiplication by an orthogonal matrix is a rotation, so it
         does not change distances or angles
    2. when calculating the distance, first calculate a "subdistance":
       the distance between the projections onto the first subdist_part
       coordinates (=192, best by test: a larger value is less
       efficient, a smaller one makes the error rate too high)
    3. calculate the full distance only if the "subdistance" isn't
       confidently higher (above subdist_margin) than the distance
       we're comparing with (see the second sketch after this list)
       * it might look like it would make sense to do a second
         projection at, say, subdist_part*2, and so on - but in
         practice one check is enough: the projected distance converges
         quickly, and if it isn't confidently higher at subdist_part,
         it won't be later either
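
    A minimal sketch of step 1, using Eigen and std::mt19937 purely for
    illustration - the actual code in vector_mhnsw.cc does not
    necessarily use these libraries, only the idea is the same:

      #include <Eigen/Dense>
      #include <random>

      // Build the deterministic random orthogonal matrix for a table
      // with `dims` dimensions. Seeding with `dims` makes the matrix
      // always the same for a given table.
      static Eigen::MatrixXf make_rotation(int dims)
      {
        std::mt19937 gen(dims);                      // seed = number of dimensions
        std::normal_distribution<float> nd(0.0f, 1.0f);

        Eigen::MatrixXf A(dims, dims);               // normally distributed entries
        for (int i= 0; i < dims; i++)
          for (int j= 0; j < dims; j++)
            A(i, j)= nd(gen);

        Eigen::HouseholderQR<Eigen::MatrixXf> qr(A); // QR decomposition
        Eigen::MatrixXf Q= qr.householderQ();        // Q is orthogonal
        return Q;
      }

      // Rotation preserves distances and angles:
      // |Q*u - Q*v| == |u - v| for any orthogonal Q.
      static Eigen::VectorXf rotate(const Eigen::MatrixXf &Q,
                                    const Eigen::VectorXf &v)
      {
        return Q * v;
      }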
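
    And a sketch of steps 2 and 3; subdist_part and subdist_margin come
    from the description above, but the exact form of the margin check
    and the constants used here are assumptions:

      #include <algorithm>
      #include <cstddef>

      // Squared L2 distance over the first n coordinates only.
      static float partial_distance(const float *a, const float *b, size_t n)
      {
        float d= 0;
        for (size_t i= 0; i < n; i++)
        {
          float diff= a[i] - b[i];
          d+= diff * diff;
        }
        return d;
      }

      /*
        Returns the full distance, or the subdistance alone when it is
        already confidently above `limit` (the distance we're comparing
        with), so the caller can discard the candidate without the full
        calculation.
      */
      static float distance_with_subdist(const float *a, const float *b,
                                         size_t dims, float limit)
      {
        const size_t subdist_part= 192;   // best by test, per the description
        const float subdist_margin= 1.1f; // illustrative value, not the real one

        float sub= partial_distance(a, b, std::min(subdist_part, dims));
        if (sub > limit * subdist_margin) // confidently higher => skip full calc
          return sub;
        return partial_distance(a, b, dims);
      }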
    
    This optimization introduces a constant overhead per insert/search
    operation - the input/query vector has to be multiplied by the
    matrix. The optimization, in turn, saves time on every distance
    calculation. Thus it is only beneficial when the number of distance
    calculations (which grows with M and with the table size) is high
    enough to outweigh the constant overhead. Let's use the MIN_ROWS
    table option to estimate the number of rows in the table.
    use_subdist_heuristic() is tuned to be optimal for mnist and
    fashion-mnist (784 dimensions, 60k rows) and for variations of gist
    (960 dimensions; 200k, 400k, 600k, 800k and 1000k rows)
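
    To illustrate the trade-off, a hedged sketch of the kind of check
    use_subdist_heuristic() could make; the signature, the cost model
    and the constants are purely illustrative, only the inputs
    (dimensions, M, MIN_ROWS) come from the reasoning above:

      #include <cmath>
      #include <cstddef>

      // Enable the subdist optimization only when the estimated number
      // of distance calculations (growing with M and with the expected
      // table size) outweighs the O(dims^2) cost of multiplying each
      // input/query vector by the rotation matrix.
      static bool use_subdist_heuristic(size_t dims, size_t M, size_t min_rows)
      {
        double overhead= (double) dims * dims;     // one matrix-vector multiply
        double dist_calcs= (double) M * std::log2((double) min_rows + 2);
        double saving_per_calc= (double) dims / 2; // illustrative constant
        return dist_calcs * saving_per_calc > overhead;
      }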