• Ankur Arora's avatar
    audit: cache ctx->major in audit_filter_syscall() · 06954599
    Ankur Arora authored
    ctx->major contains the current syscall number. This is, of course, a
    constant for the duration of the syscall. Unfortunately, GCC's alias
    analysis cannot prove that it is not modified via a pointer in the
    audit_filter_syscall() loop, and so always loads it from memory.
    
    In and of itself the load isn't very expensive (ops dependent on the
    ctx->major load are only used to determine the direction of control flow
    and have short dependence chains and, in any case the related branches
    get predicted perfectly in the fastpath) but still cache ctx->major
    in a local for two reasons:
    
    * ctx->major is in the first cacheline of struct audit_context and has
      similar alignment as audit_entry::list audit_entry. For cases
      with a lot of audit rules, doing this reduces one source of contention
      from a potentially busy cache-set.
    
    * audit_in_mask() (called in the hot loop in audit_filter_syscall())
      does cast manipulation and error checking on ctx->major:
    
         audit_in_mask(const struct audit_krule *rule, unsigned long val):
                 if (val > 0xffffffff)
                         return false;
    
                 word = AUDIT_WORD(val);
                 if (word >= AUDIT_BITMASK_SIZE)
                         return false;
    
                 bit = AUDIT_BIT(val);
    
                 return rule->mask[word] & bit;
    
      The clauses related to the rule need to be evaluated in the loop, but
      the rest is unnecessarily re-evaluated for every loop iteration.
      (Note, however, that most of these are cheap ALU ops and the branches
       are perfectly predicted. However, see discussion on cycles
       improvement below for more on why it is still worth hoisting.)
    
    On a Skylakex system change in getpid() latency (aggregated over
    12 boot cycles):
    
                 Min     Mean  Median     Max       pstdev
                (ns)     (ns)    (ns)    (ns)
    
     -        201.30   216.14  216.22  228.46      (+- 1.45%)
     +        196.63   207.86  206.60  230.98      (+- 3.92%)
    
    Performance counter stats for 'bin/getpid' (3 runs) go from:
        cycles               836.89  (  +-   .80% )
        instructions        2000.19  (  +-   .03% )
        IPC                    2.39  (  +-   .83% )
        branches             430.14  (  +-   .03% )
        branch-misses          1.48  (  +-  3.37% )
        L1-dcache-loads      471.11  (  +-   .05% )
        L1-dcache-load-misses  7.62  (  +- 46.98% )
    
     to:
        cycles               805.58  (  +-  4.11% )
        instructions        1654.11  (  +-   .05% )
        IPC                    2.06  (  +-  3.39% )
        branches             430.02  (  +-   .05% )
        branch-misses          1.55  (  +-  7.09% )
        L1-dcache-loads      440.01  (  +-   .09% )
        L1-dcache-load-misses  9.05  (  +- 74.03% )
    
    (Both aggregated over 12 boot cycles.)
    
    instructions: we reduce around 8 instructions/iteration because some of
    the computation is now hoisted out of the loop (branch count does not
    change because GCC, for reasons unclear, only hoists the computations
    while keeping the basic-blocks.)
    
    cycles: improve by about 5% (in aggregate and looking at individual run
    numbers.) This is likely because we now waste fewer pipeline resources
    on unnecessary instructions which allows the control flow to
    speculatively execute further ahead shortening the execution of the loop
    a little. The final gating factor on the performance of this loop
    remains the long dependence chain due to the linked-list load.
    Signed-off-by: default avatarAnkur Arora <ankur.a.arora@oracle.com>
    Signed-off-by: default avatarPaul Moore <paul@paul-moore.com>
    06954599
auditsc.c 80.8 KB