    bpf: support precision propagation in the presence of subprogs
    Add support for precision backtracking in the presence of subprogram frames in
    jump history.
    
    This means supporting a few different kinds of subprogram invocation
    situations, each requiring slightly different handling in the precision
    backtracking logic:
      - static subprogram calls;
      - global subprogram calls;
      - callback-calling helpers/kfuncs.
    
    For each of those, we need to handle a few precision propagation cases:
      - what to do with the precision of subprog returns (r0);
      - what to do with the precision of input arguments;
      - for all of them, callee-saved registers in the caller function should
        be propagated while ignoring the subprog/callback part of the jump
        history.
    
    N.B. Async callback-calling helpers (currently only
    bpf_timer_set_callback()) are transparent to all this because they set
    a separate async callback environment, and thus the callback's history is
    not shared with the main program's history. So as far as the changes in
    this commit go, such a helper is just a regular helper.
    
    Let's look at all these situations in more detail, starting with a static
    subprogram call. We'll use an excerpt of a simple main program and its
    static subprog, indenting the subprog's frame slightly to make everything
    clear.
    
    frame 0				frame 1			precision set
    =======				=======			=============
    
     9: r6 = 456;
    10: r1 = 123;						fr0: r6
    11: call pc+10;						fr0: r1, r6
    				22: r0 = r1;		fr0: r6;     fr1: r1
    				23: exit		fr0: r6;     fr1: r0
    12: r1 = <map_pointer>					fr0: r0, r6
    13: r1 += r0;						fr0: r0, r6
    14: r1 += r6;						fr0: r6
    15: exit
    
    As can be seen above, the main function passes 123 as the single argument
    to an identity (`return x;`) subprog. The returned value is used to adjust
    the map pointer offset, which forces r0 to be marked as precise. Then
    instruction #14 does the same for callee-saved r6, which has to be
    backtracked all the way to instruction #9. For brevity, the precision sets
    for instructions #13 and #14 are combined in the diagram above.
    
    First, for subprog calls, the precision of r0 returned from the subprog
    (requested in frame 0) has to move into the subprog's frame 1 and be
    cleared from frame 0. So we go back into the subprog's frame knowing we
    need to mark r0 precise. We then see that insn #22 sets r0 from r1, so now
    we care about marking r1 precise. When we pop up from the subprog's frame
    back into the caller at insn #11, we keep r1, as it's an argument-passing
    register, so we eventually find `10: r1 = 123;` and satisfy the precision
    propagation chain for insn #13.
    
    This example demonstrates two sets of rules (see the sketch right below):
      - the precision of r0 returned after a subprog call has to be moved into
        the subprog's precision set;
      - *static* subprog arguments (r1-r5) are moved back to the caller's
        precision set.
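
    A condensed sketch of the static-call side of these rules, written in the
    spirit of the new bt_*() backtracking helpers (not the verbatim verifier.c
    code; the is_static_subprog_call() predicate is made up for brevity and
    error handling is trimmed):

    	if (is_static_subprog_call(insn)) {
    		int i;

    		/* walking backwards, the call insn means we are leaving
    		 * the callee's frame; only input args r1-r5 may still be
    		 * requested as precise, and their precision is handed
    		 * over to the caller's frame
    		 */
    		for (i = BPF_REG_1; i <= BPF_REG_5; i++) {
    			if (bt_is_reg_set(bt, i)) {
    				bt_clear_reg(bt, i);
    				bt_set_frame_reg(bt, bt->frame - 1, i);
    			}
    		}
    		if (bt_subprog_exit(bt))	/* bt->frame-- */
    			return -EFAULT;
    		return 0;
    	}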
    
    Let's look at what happens with callee-saved precision propagation. Insn
    #14 marks r6 as precise. When we get into the subprog's frame, we keep r6
    in frame 0's precision set *only*. The subprog itself has its own
    independent set of r6-r10 registers and is not affected. When we
    eventually make our way out of the subprog's frame, we keep r6 in the
    precision set until we reach `9: r6 = 456;`, satisfying propagation.
    r6-r10 propagation is perhaps the simplest aspect: these registers always
    stay in their original frame.
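
    The per-frame bookkeeping that makes this trivial is the backtrack state
    introduced earlier in this series, roughly (field comments are added here
    for illustration):

    	struct backtrack_state {
    		struct bpf_verifier_env *env;
    		u32 frame;			  /* frame being backtracked */
    		u32 reg_masks[MAX_CALL_FRAMES];	  /* precise regs, per frame */
    		u64 stack_masks[MAX_CALL_FRAMES]; /* precise slots, per frame */
    	};

    Because the masks are indexed by frame, the caller's r6 bit simply sits
    untouched in reg_masks[0] while the callee's instructions are processed.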
    
    That's pretty much all we have to do to support precision propagation
    across *static subprog* invocation.
    
    Let's look at what happens when we have a global subprog invocation.
    
    frame 0				frame 1			precision set
    =======				=======			=============
    
     9: r6 = 456;
    10: r1 = 123;						fr0: r6
    11: call pc+10; # global subprog			fr0: r6
    12: r1 = <map_pointer>					fr0: r0, r6
    13: r1 += r0;						fr0: r0, r6
    14: r1 += r6;						fr0: r6;
    15: exit
    
    Starting from insn #13, r0 has to be precise. We backtrack all the way to
    insn #11 (call pc+10) and see that the subprog is global, so it was
    already validated in isolation. As opposed to a static subprog, a global
    subprog always returns an unknown scalar in r0, so that satisfies
    precision propagation and we drop r0 from the precision set. We are done
    with insn #13.
    
    Now for insn #14. r6 is in the precision set, and we backtrack to `call
    pc+10;`. Here we need to recognize that this is effectively both an exit
    from and an entry to the global subprog, which means we stay in the
    caller's frame. So we carry on with r6 still in the precision set until we
    satisfy it at insn #9. The only hard part with global subprogs is knowing
    when a call targets a global func.
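
    A condensed sketch of the global-call case (same caveats: illustrative
    predicate name and simplified checks; BPF_REGMASK_ARGS covers r1-r5):

    	if (is_global_subprog_call(insn)) {
    		/* r1-r5 are scratched by the call and the global subprog
    		 * was verified in isolation, so none of them may still be
    		 * requested as precise at this point
    		 */
    		if (bt_reg_mask(bt) & BPF_REGMASK_ARGS)
    			return -EFAULT;
    		/* a global subprog always returns an unknown scalar in r0,
    		 * so a request for r0 precision is satisfied right here
    		 */
    		bt_clear_reg(bt, BPF_REG_0);
    		/* no frame switch: r6-r10 backtracking simply continues
    		 * in the caller's frame
    		 */
    		return 0;
    	}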
    
    Lastly, callback-calling helpers and kfuncs do simulate subprog calls, so
    the jump history will have subprog instructions in between the caller
    program's instructions, but the rules of propagating r0 and r1-r5 differ,
    because we don't actually call the callback directly. We call the
    helper/kfunc, which at runtime will call the subprog, so the only
    difference from normal helper/kfunc handling is that we need to make sure
    to skip the callback-simulating part of the jump history. Let's look at an
    example to make this clearer.
    
    frame 0				frame 1			precision set
    =======				=======			=============
    
     8: r6 = 456;
     9: r1 = 123;						fr0: r6
    10: r2 = &callback;					fr0: r6
    11: call bpf_loop;					fr0: r6
    				22: r0 = r1;		fr0: r6      fr1:
    				23: exit		fr0: r6      fr1:
    12: r1 = <map_pointer>					fr0: r0, r6
    13: r1 += r0;						fr0: r0, r6
    14: r1 += r6;						fr0: r6;
    15: exit
    
    Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit`,
    we see that this isn't actually a static subprog call (it's a `call
    bpf_loop;` helper call instead), so we clear r0 from the precision set.
    
    For callee-saved registers there is no difference: r6 stays in frame 0's
    precision set, we go through insns #22 and #23, ignoring them, until we
    get back to caller frame 0, eventually satisfying the precision
    backtracking logic at insn #8 (`r6 = 456;`).
    
    Assuming the callback needed to mark r0 as precise at insn #23, we'd
    backtrack to insn #22, switching from r0 to r1, and then, when we pop back
    to frame 0 at insn #11, we'd clear r1-r5 from the precision set: we don't
    do a subprog call directly, so there is no input argument precision
    propagation.
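
    How does the backwards walk tell a subprog exit from a callback exit? A
    condensed sketch of the BPF_EXIT handling (close to, but not verbatim,
    the new logic; subseq_idx is the index of the insn we arrived from):

    	/* exit from a *subprog* resumes right after a bpf_pseudo_call
    	 * insn, while exit from a *callback* resumes after a helper or
    	 * kfunc call, in which case r0 precision must not be carried
    	 * into the callback's frame
    	 */
    	bool r0_precise = subseq_idx - 1 >= 0 &&
    			  bpf_pseudo_call(&insns[subseq_idx - 1]) &&
    			  bt_is_reg_set(bt, BPF_REG_0);

    	bt_clear_reg(bt, BPF_REG_0);	/* exit always sets caller's r0 */
    	if (bt_subprog_enter(bt))	/* bt->frame++, into callee's frame */
    		return -EFAULT;
    	if (r0_precise)			/* subprog call: keep chasing r0 */
    		bt_set_reg(bt, BPF_REG_0);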
    
    That's pretty much it. With these changes, it seems like the only still
    unsupported situation for precision backpropagation is when the program
    accesses the stack through registers other than r10. This is left as an
    unsupported (though rare) case for now.
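
    For instance, a (made-up) sequence like the following, where a stack slot
    is written through r1 rather than r10, is still not tracked by precision
    backtracking:

    	1: r1 = r10;
    	2: r1 += -8;
    	3: *(u64 *)(r1 + 0) = r6;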
    
    As for the results: for selftests, there are a few positive changes for
    bigger programs, with cls_redirect in its dynptr variant benefitting the
    most:
    
    [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0'
    File                                      Program        Insns (A)  Insns (B)  Insns     (DIFF)
    ----------------------------------------  -------------  ---------  ---------  ----------------
    pyperf600_bpf_loop.bpf.linked1.o          on_event            2060       2002      -58 (-2.82%)
    test_cls_redirect_dynptr.bpf.linked1.o    cls_redirect       15660       2914  -12746 (-81.39%)
    test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect       61620      59088    -2532 (-4.11%)
    xdp_synproxy_kern.bpf.linked1.o           syncookie_tc      109980      86278  -23702 (-21.55%)
    xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp      97716      85147  -12569 (-12.86%)
    
    Cilium programs don't really regress. They don't use subprogs and so are
    mostly unaffected, but some other fixes and improvements could have
    changed something. This doesn't appear to be the case:
    
    [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0'
    File           Program                         Insns (A)  Insns (B)  Insns (DIFF)
    -------------  ------------------------------  ---------  ---------  ------------
    bpf_host.o     tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
    bpf_lxc.o      tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
    bpf_overlay.o  tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
    bpf_xdp.o      tail_handle_nat_fwd_ipv6            12475      12504  +29 (+0.23%)
    bpf_xdp.o      tail_nodeport_nat_ingress_ipv6       6363       6371   +8 (+0.13%)
    
    Looking at (somewhat anonymized) Meta production programs, we see mostly
    insignificant variation in the number of instructions, with one program
    (syar_bind6_protect6) benefitting the most at -17%.
    
    [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0'
    Program                   Insns (A)  Insns (B)  Insns     (DIFF)
    ------------------------  ---------  ---------  ----------------
    on_request_context_event        597        585      -12 (-2.01%)
    read_async_py_stack           43789      43657     -132 (-0.30%)
    read_sync_py_stack            35041      37599    +2558 (+7.30%)
    rrm_usdt                        946        940       -6 (-0.63%)
    sysarmor_inet6_bind           28863      28249     -614 (-2.13%)
    sysarmor_inet_bind            28845      28240     -605 (-2.10%)
    syar_bind4_protect4          154145     147640    -6505 (-4.22%)
    syar_bind6_protect6          165242     137088  -28154 (-17.04%)
    syar_task_exit_setgid         21289      19720    -1569 (-7.37%)
    syar_task_exit_setuid         21290      19721    -1569 (-7.37%)
    do_uprobe                     19967      19413     -554 (-2.77%)
    tw_twfw_ingress              215877     204833   -11044 (-5.12%)
    tw_twfw_tc_in                215877     204833   -11044 (-5.12%)
    
    But checking duration (wall clock) differences, that is, the actual time
    taken by the verifier to validate programs, we see sometimes dramatic
    improvements, all the way up to about 16x:
    
    [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20
    Program                                   Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
    ----------------------------------------  -----------------  -----------------  --------------------
    tw_twfw_ingress                                     4488374             272836    -4215538 (-93.92%)
    tw_twfw_tc_in                                       4339111             268175    -4070936 (-93.82%)
    tw_twfw_egress                                      3521816             270751    -3251065 (-92.31%)
    tw_twfw_tc_eg                                       3472878             284294    -3188584 (-91.81%)
    balancer_ingress                                     343119             291391      -51728 (-15.08%)
    syar_bind6_protect6                                   78992              64782      -14210 (-17.99%)
    ttls_tc_ingress                                       11739               8176       -3563 (-30.35%)
    kprobe__security_inode_link                           13864              11341       -2523 (-18.20%)
    read_sync_py_stack                                    21927              19442       -2485 (-11.33%)
    read_async_py_stack                                   30444              28136        -2308 (-7.58%)
    syar_task_exit_setuid                                 10256               8440       -1816 (-17.71%)
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>