    subdist optimization · 92878e97
    Sergei Golubchik authored
    1. randomize all vectors via multiplication by a random orthogonal
       matrix (see the first sketch after this list)
       * to generate the matrix, fill a square matrix with normally
         distributed random values and create an orthogonal matrix via
         the QR decomposition
       * the random number generator is seeded with the number of
         dimensions, so the matrix is always the same for a given table
       * multiplication by an orthogonal matrix is a rotation, so it
         does not change distances or angles
    2. when calculating the distance, first calculate a "subdistance":
       the distance between the projections onto the first subdist_part
       coordinates (=192, best by test: a larger value is less
       efficient, a smaller one makes the error rate too high)
    3. calculate the full distance only if the "subdistance" isn't
       confidently higher (above subdist_margin) than the distance
       we're comparing with (see the second sketch after this list)
       * it might look like it would make sense to do a second
         projection at, say, subdist_part*2, and so on - but in
         practice one check is enough: the projected distance converges
         quickly, and if it isn't confidently higher at subdist_part,
         it won't be later either
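
    A minimal sketch of step 1, using Eigen and std::mt19937 purely for
    illustration - the actual code in vector_mhnsw.cc does not
    necessarily use these libraries, only the idea is the same:

      #include <Eigen/Dense>
      #include <random>

      // Build the deterministic random orthogonal matrix for a table
      // with `dims` dimensions. Seeding with `dims` makes the matrix
      // always the same for a given table.
      static Eigen::MatrixXf make_rotation(int dims)
      {
        std::mt19937 gen(dims);                      // seed = number of dimensions
        std::normal_distribution<float> nd(0.0f, 1.0f);

        Eigen::MatrixXf A(dims, dims);               // normally distributed entries
        for (int i= 0; i < dims; i++)
          for (int j= 0; j < dims; j++)
            A(i, j)= nd(gen);

        Eigen::HouseholderQR<Eigen::MatrixXf> qr(A); // QR decomposition
        Eigen::MatrixXf Q= qr.householderQ();        // Q is orthogonal
        return Q;
      }

      // Rotation preserves distances and angles:
      // |Q*u - Q*v| == |u - v| for any orthogonal Q.
      static Eigen::VectorXf rotate(const Eigen::MatrixXf &Q,
                                    const Eigen::VectorXf &v)
      {
        return Q * v;
      }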
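
    And a sketch of steps 2 and 3; subdist_part and subdist_margin come
    from the description above, but the exact form of the margin check
    and the constants used here are assumptions:

      #include <algorithm>
      #include <cstddef>

      // Squared L2 distance over the first n coordinates only.
      static float partial_distance(const float *a, const float *b, size_t n)
      {
        float d= 0;
        for (size_t i= 0; i < n; i++)
        {
          float diff= a[i] - b[i];
          d+= diff * diff;
        }
        return d;
      }

      /*
        Returns the full distance, or the subdistance alone when it is
        already confidently above `limit` (the distance we're comparing
        with), so the caller can discard the candidate without the full
        calculation.
      */
      static float distance_with_subdist(const float *a, const float *b,
                                         size_t dims, float limit)
      {
        const size_t subdist_part= 192;   // best by test, per the description
        const float subdist_margin= 1.1f; // illustrative value, not the real one

        float sub= partial_distance(a, b, std::min(subdist_part, dims));
        if (sub > limit * subdist_margin) // confidently higher => skip full calc
          return sub;
        return partial_distance(a, b, dims);
      }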
    
    This optimization introduces a constant overhead per insert/search
    operation - the input/query vector has to be multiplied by the
    matrix. The optimization, in turn, saves time on every distance
    calculation. Thus it is only beneficial when the number of distance
    calculations (which grows with M and with the table size) is high
    enough to outweigh the constant overhead. Let's use the MIN_ROWS
    table option to estimate the number of rows in the table.
    use_subdist_heuristic() is tuned to be optimal for mnist and
    fashion-mnist (784 dimensions, 60k rows) and for variations of gist
    (960 dimensions; 200k, 400k, 600k, 800k and 1000k rows)
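
    To illustrate the trade-off, a hedged sketch of the kind of check
    use_subdist_heuristic() could make; the signature, the cost model
    and the constants are purely illustrative, only the inputs
    (dimensions, M, MIN_ROWS) come from the reasoning above:

      #include <cmath>
      #include <cstddef>

      // Enable the subdist optimization only when the estimated number
      // of distance calculations (growing with M and with the expected
      // table size) outweighs the O(dims^2) cost of multiplying each
      // input/query vector by the rotation matrix.
      static bool use_subdist_heuristic(size_t dims, size_t M, size_t min_rows)
      {
        double overhead= (double) dims * dims;     // one matrix-vector multiply
        double dist_calcs= (double) M * std::log2((double) min_rows + 2);
        double saving_per_calc= (double) dims / 2; // illustrative constant
        return dist_calcs * saving_per_calc > overhead;
      }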