• unknown's avatar
    Bug#30237 (Performance regression in boolean expressions) · fb1be0f1
    unknown authored
    This is a performance bug, related to the parsing or 'OR' and 'AND' boolean
    expressions.
    
    Let N be the number of expressions involved in a OR (respectively AND).
    
    When N=1
    
    For example, "select 1" involve only 1 term: there is no OR operator.
    
    In 4.0 and 4.1, parsing expressions not involving OR had no overhead.
    In 5.0, parsing adds some overhead, with Select->expr_list.
    
    With this patch, the overhead introduced in 5.0 has been removed,
    so that performances for N=1 should be identical to the 4.0 performances,
    which are optimal (there is no code executed at all)
    
    The overhead in 5.0 was in fact affecting significantly some operations.
    For example, loading 1 Million rows into a table with INSERTs,
    for a table that has 100 columns, leads to parsing 100 Millions of
    expressions, which means that the overhead related to Select->expr_list
    is executed 100 Million times ...
    
    Considering that N=1 is by far the most probable expression,
    this case should be optimal.
    
    When N=2
    
    For example, "select a OR b" involves 2 terms in the OR operator.
    
    In 4.0 and 4.1, parsing expressions involving 2 terms created 1 Item_cond_or
    node, which is the expected result.
    In 5.0, parsing these expression also produced 1 node, but with some extra
    overhead related to Select->expr_list : creating 1 list in Select->expr_list
    and another in Item_cond::list is inefficient.
    
    With this patch, the overhead introduced in 5.0 has been removed
    so that performances for N=2 should be identical to the 4.0 performances.
    Note that the memory allocation uses the new (thd->mem_root) syntax
    directly.
    The cost of "is_cond_or" is estimated to be neglectable: the real problem
    of the performance degradation comes from unneeded memory allocations.
    
    When N>=3
    
    For example, "select a OR b OR c ...", which involves 3 or more terms.
    
    In 4.0 and 4.1, the parser had no significant cost overhead, but produced
    an Item tree which is difficult to evaluate / optimize during runtime.
    In 5.0, the parser produces a better Item tree, using the Item_cond
    constructor that accepts a list of children directly, but at an extra cost
    related to Select->expr_list.
    
    With this patch, the code is implemented to take the best of the two
    implementations:
    - there is no overhead with Select->expr_list
    - the Item tree generated is optimized and flattened.
    
    This is achieved by adding children nodes into the Item tree directly,
    with Item_cond::add(), which avoids the need for temporary lists and memory
    allocation
    
    Note that this patch also provide an extra optimization, that the previous
    code in 5.0 did not provide: expressions are flattened in the Item tree,
    based on what the expression already parsed is, and not based on the order
    in which rules are reduced.
    
    For example : "(a OR b) OR c", "a OR (b OR c)" would both be represented
    with 2 Item_cond_or nodes before this patch, and with 1 node only with this
    patch. The logic used is based on the mathematical properties of the OR
    operator (it's associative), and produces a simpler tree.
    
    
    sql/item_cmpfunc.h:
      Improved performances for parsing boolean expressions
    sql/sql_yacc.yy:
      Improved performances for parsing boolean expressions
    mysql-test/r/parser_precedence.result:
      Added test cases to cover boolean operator precedence
    mysql-test/t/parser_precedence.test:
      Added test cases to cover boolean operator precedence
    fb1be0f1
item_cmpfunc.h 48.1 KB