1. 31 Mar, 2020 25 commits
    • net: dsa: bcm_sf2: Fix overflow checks · d0802dc4
      Florian Fainelli authored
      Commit f949a12f ("net: dsa: bcm_sf2: fix buffer overflow doing
      set_rxnfc") tried to fix some user-controlled buffer overflows in
      bcm_sf2_cfp_rule_set() and bcm_sf2_cfp_rule_del(), but the fix used
      CFP_NUM_RULES which, while correct in that it prevents overflowing the
      bitmaps, is not representative of what the device actually supports.
      Correct that by using bcm_sf2_cfp_rule_size() instead.
      
      The latter subtracts 1 from the number of rules, so change the checks
      from greater-than-or-equal to greater-than accordingly.
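      A standalone sketch of the distinction (driver internals simplified;
      the struct, helper shape, and the num_cfp_rules value of 128 are
      illustrative, not the driver's exact code):

        #include <stdio.h>

        #define CFP_NUM_RULES 256 /* bitmap capacity, a compile-time constant */

        struct priv {
                unsigned int num_cfp_rules; /* what this device supports */
        };

        /* Models bcm_sf2_cfp_rule_size(): highest valid rule index,
         * i.e. the per-device rule count minus 1. */
        static unsigned int cfp_rule_size(const struct priv *priv)
        {
                return priv->num_cfp_rules - 1;
        }

        static int cfp_rule_index_valid(const struct priv *priv,
                                        unsigned int loc)
        {
                /* Old check: 'loc >= CFP_NUM_RULES' -- safe for the bitmap
                 * but too permissive for a device with fewer rules. New
                 * check uses '>' because cfp_rule_size() already
                 * subtracted 1. */
                return loc > cfp_rule_size(priv) ? -1 : 0;
        }

        int main(void)
        {
                struct priv p = { .num_cfp_rules = 128 };

                /* Index 200 fits the 256-entry bitmap but not this device. */
                printf("%d %d\n", cfp_rule_index_valid(&p, 127),
                       cfp_rule_index_valid(&p, 200));
                return 0;
        }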
      
      Fixes: f949a12f ("net: dsa: bcm_sf2: fix buffer overflow doing set_rxnfc")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0802dc4
    • hv_netvsc: Remove unnecessary round_up for recv_completion_cnt · f87238d3
      Haiyang Zhang authored
      vzalloc_node() already rounds the total size up to whole pages, and
      sizeof(u64) is smaller than sizeof(struct recv_comp_data). So the
      round_up of recv_completion_cnt is not necessary and may cause extra
      memory allocation.
      
      To save memory, remove this unnecessary round_up for recv_completion_cnt.
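      As a standalone illustration of the reasoning (not the driver's code;
      the struct layout and count are made up): pre-rounding the element
      count to a sizeof(u64) multiple can only grow the byte request, while
      a page-granular allocator like vzalloc_node() rounds to pages anyway.

        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_SIZE 4096UL
        #define ROUND_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

        /* Assumed layout for illustration only. */
        struct recv_comp_data {
                uint64_t tid;
                uint32_t status;
        };

        int main(void)
        {
                unsigned long cnt = 1001; /* hypothetical recv_completion_cnt */

                /* Old: round the count up to a sizeof(u64) multiple first. */
                unsigned long old_bytes =
                        ROUND_UP(cnt, sizeof(uint64_t)) *
                        sizeof(struct recv_comp_data);
                /* New: exact size; the allocator page-rounds either way. */
                unsigned long new_bytes = cnt * sizeof(struct recv_comp_data);

                printf("old: %lu bytes, new: %lu bytes, pages: %lu\n",
                       old_bytes, new_bytes,
                       ROUND_UP(new_bytes, PAGE_SIZE) / PAGE_SIZE);
                return 0;
        }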
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f87238d3
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · d9679cd9
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter/IPVS updates for net-next:
      
      1) Add support to specify a stateful expression in set definitions,
         this allows users to specify e.g. counters per set elements.
      
      2) Flowtable software counter support.
      
      3) Flowtable hardware offload counter support, from wenxu.
      
      4) Parallelize flowtable hardware offload requests, from Paul Blakey.
         This includes a patch to add one work entry per offload command.

      5) Several patches to rework nf_queue refcount handling, from Florian
         Westphal.

      6) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
         information is missing and set up indirect flow block as TC_SETUP_FT,
         patch from wenxu.

      7) Stricter netlink attribute sanity check on filters, from Romain Bellan
         and Florent Fourcot.

      8) Annotations to make sparse happy, from Jules Irenge.

      9) Improve icmp errors in debugging information, from Haishuang Yan.

      10) Fix warning in IPVS icmp error debugging, from Haishuang Yan.

      11) Fix endianness issue in tcp extension header, from Sergey Marinkevich.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d9679cd9
    • Merge branch 'Add-packet-trap-policers-support' · 6fe9a949
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      Add packet trap policers support
      
      Background
      ==========
      
      Devices capable of offloading the kernel's datapath and performing
      functions such as bridging and routing must also be able to send (trap)
      specific packets to the kernel (i.e., the CPU) for processing.
      
      For example, a device acting as a multicast-aware bridge must be able to
      trap IGMP membership reports to the kernel for processing by the bridge
      module.
      
      Motivation
      ==========
      
      In most cases, the underlying device is capable of handling packet rates
      that are several orders of magnitude higher compared to those that can
      be handled by the CPU.
      
      Therefore, in order to prevent the underlying device from overwhelming
      the CPU, devices usually include packet trap policers that are able to
      police the trapped packets to rates that can be handled by the CPU.
      
      Proposed solution
      =================
      
      This patch set allows capable device drivers to register their supported
      packet trap policers with devlink. User space can then tune the
      parameters of these policers (currently, rate and burst size) and read
      from the device the number of packets that were dropped by the policer,
      if supported.
      
      These packet trap policers can then be bound to existing packet trap
      groups, which are used to aggregate logically related packet traps. As a
      result, trapped packets are policed to rates that can be handled by the
      host CPU.
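      For intuition, a trap policer of this kind is conventionally modeled
      as a token bucket; a minimal standalone sketch (not the devlink or
      hardware implementation) follows:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Token bucket: 'rate' packets per second refill, 'burst'
         * caps how many tokens can accumulate. */
        struct policer {
                uint64_t rate;    /* packets per second */
                uint64_t burst;   /* maximum bucket depth, in packets */
                double tokens;    /* current bucket level */
                uint64_t last_ns; /* timestamp of last refill */
        };

        static bool policer_admit(struct policer *p, uint64_t now_ns)
        {
                double elapsed = (now_ns - p->last_ns) / 1e9;

                p->last_ns = now_ns;
                p->tokens += elapsed * p->rate;
                if (p->tokens > p->burst)
                        p->tokens = p->burst; /* burst bounds accumulation */

                if (p->tokens < 1.0)
                        return false; /* over rate: drop, bump drop counter */
                p->tokens -= 1.0;
                return true; /* within rate: trap the packet to the CPU */
        }

        int main(void)
        {
                struct policer pol = { .rate = 1000, .burst = 128,
                                       .tokens = 128 };
                int passed = 0;

                /* Offer 10000 packets in one instant: only 'burst' pass. */
                for (int i = 0; i < 10000; i++)
                        passed += policer_admit(&pol, 0);
                printf("admitted %d of 10000\n", passed); /* prints 128 */
                return 0;
        }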
      
      Example usage
      =============
      
      Instantiate netdevsim:
      
      Dump available packet trap policers:
      netdevsim/netdevsim10:
        policer 1 rate 1000 burst 128
        policer 2 rate 2000 burst 256
        policer 3 rate 3000 burst 512
      
      Change the parameters of a packet trap policer:
      
      Bind a packet trap policer to a packet trap group:
      
      Dump parameters and statistics of a packet trap policer:
      netdevsim/netdevsim10:
        policer 3 rate 100 burst 16
          stats:
              rx:
                dropped 92
      
      Unbind a packet trap policer from a packet trap group:
      
      Patch set overview
      ==================
      
      Patch #1 adds the core infrastructure in devlink which allows capable
      device drivers to register their supported packet trap policers with
      devlink.
      
      Patch #2 extends the existing devlink-trap documentation.
      
      Patch #3 extends netdevsim to register a few dummy packet trap policers
      with devlink. Used later on to selftest the core infrastructure.
      
      Patches #4-#5 add infrastructure in devlink to allow binding of packet
      trap policers to packet trap groups.
      
      Patch #6 extends netdevsim to allow such binding.
      
      Patch #7 adds a selftest over netdevsim that verifies the core
      devlink-trap policers functionality.
      
      Patches #8-#14 gradually add devlink-trap policers support in mlxsw.
      
      Patch #15 adds a selftest over mlxsw. All registered packet trap
      policers are verified to handle the configured rate and burst size.
      
      Future plans
      ============
      
      * Allow changing default association between packet traps and packet
        trap groups
      * Add more packet traps. For example, for control packets (e.g., IGMP)
      
      v3:
      * Rebase
      
      v2 (address comments from Jiri and Jakub):
      * Patch #1: Add 'strict_start_type' in devlink policy
      * Patch #1: Have device drivers provide max/min rate/burst size for each
        policer. Use them to check validity of user provided parameters
      * Patch #3: Remove check about burst size being a power of 2 and instead
        add a debugfs knob to fail the operation
      * Patch #3: Provide max/min rate/burst size when registering policers
        and remove the validity checks from nsim_dev_devlink_trap_policer_set()
      * Patch #5: Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
        devlink_trap_group_set() and bail if not present
      * Patch #5: Add extack error message in case trap group was partially
        modified
      * Patch #7: Add test case with new 'fail_trap_policer_set' knob
      * Patch #7: Add test case for partially modified trap group
      * Patch #10: Provide max/min rate/burst size when registering policers
      * Patch #11: Remove the max/min validity checks from
        __mlxsw_sp_trap_policer_set()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fe9a949
    • selftests: mlxsw: Add test cases for devlink-trap policers · 9f3e63c5
      Ido Schimmel authored
      Add test cases that verify that each registered packet trap policer:
      
      * Honors the imposed limitations of rate and burst size
      * Able to police trapped packets to the specified rate
      * Able to police trapped packets to the specified burst size
      * Able to be unbound from its trap group
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f3e63c5
    • mlxsw: spectrum_trap: Add support for setting of packet trap group parameters · 39defcbb
      Ido Schimmel authored
      Implement support for setting of packet trap group parameters by
      invoking the trap_group_init() callback with the new parameters.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39defcbb
    • mlxsw: spectrum_trap: Switch to use correct packet trap group · d12d8468
      Ido Schimmel authored
      Some packet traps are currently exposed to user space as being members
      of the "l3_drops" trap group, but internally they are members of a
      different group.
      
      Switch these traps to use the correct group so that they are all subject
      to the same policer, as exposed to user space.
      
      Set the trap priority of packets trapped due to loopback error during
      routing to the lowest priority. Such packets are not routed again by the
      kernel and therefore should not mask other traps (e.g., host miss) that
      should be routed.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d12d8468
    • mlxsw: spectrum_trap: Do not initialize dedicated discard policer · bc82521e
      Ido Schimmel authored
      The policer is now initialized as part of the registration with devlink,
      so there is no need to initialize it before the registration.
      
      Remove the initialization.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc82521e
    • mlxsw: spectrum_trap: Add devlink-trap policer support · 13f2e64b
      Ido Schimmel authored
      Register supported packet trap policers with devlink and implement
      callbacks to change their parameters and read their counters.
      
      Prevent user space from passing invalid policer parameters down to the
      device by checking their validity and communicating the failure via an
      appropriate extack message.
      
      v2:
      * Remove the max/min validity checks from __mlxsw_sp_trap_policer_set()
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13f2e64b
    • mlxsw: spectrum_trap: Prepare policers for registration with devlink · 4561705b
      Ido Schimmel authored
      Prepare an array of policer IDs to register with devlink and their
      associated parameters.
      
      The array is composed of both policers that are currently bound to
      exposed trap groups and policers that are not bound to any trap group.
      
      v2:
      * Provide max/min rate/burst size when registering policers
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4561705b
    • mlxsw: spectrum: Track used packet trap policer IDs · 03484e49
      Ido Schimmel authored
      During initialization the driver configures various packet trap groups
      and binds policers to them.
      
      Currently, most of these groups are not exposed to user space and
      therefore their policers should not be exposed as well. Otherwise, user
      space will be able to alter policer parameters without knowing which
      packet traps are policed by the policer.
      
      Use a bitmap to track the used policer IDs so that these policers will
      not be registered with devlink in a subsequent patch.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03484e49
    • mlxsw: reg: Extend QPCR register · 2b84d7c3
      Ido Schimmel authored
      The QoS Policer Configuration Register (QPCR) is used to configure
      hardware policers. Extend this register with the following fields and
      defines, which will be used by subsequent patches:
      
      1. Violate counter: reads the number of packets dropped by the policer
      2. Clear counter: to ensure we start counting from 0
      3. Rate and burst size limits
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b84d7c3
    • selftests: netdevsim: Add test cases for devlink-trap policers · 5fbff58e
      Ido Schimmel authored
      Add test cases for packet trap policer set / show commands as well as
      for the binding of these policers to packet trap groups.
      
      Both good and bad flows are tested for maximum coverage.
      
      v2:
      * Add test case with new 'fail_trap_policer_set' knob
      * Add test case for partially modified trap group
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fbff58e
    • netdevsim: Add support for setting of packet trap group parameters · 0dc8249a
      Ido Schimmel authored
      Add a dummy callback to set trap group parameters. Return an error when
      the 'fail_trap_group_set' debugfs file is set in order to exercise error
      paths and verify that the error is propagated to user space when it
      should be.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0dc8249a
    • devlink: Allow setting of packet trap group parameters · c064875a
      Ido Schimmel authored
      The previous patch allowed device drivers to publish their default
      binding between packet trap policers and packet trap groups. However,
      some users might not be content with this binding and would like to
      change it.
      
      In case user space passed a packet trap policer identifier when setting
      a packet trap group, invoke the appropriate device driver callback and
      pass the new policer identifier.
      
      v2:
      * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
        devlink_trap_group_set() and bail if not present
      * Add extack error message in case trap group was partially modified
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c064875a
    • devlink: Add packet trap group parameters support · f9f54392
      Ido Schimmel authored
      Packet trap groups are used to aggregate logically related packet traps.
      Currently, these groups allow user space to batch operations such as
      setting the trap action of all member traps.
      
      In order to prevent the CPU from being overwhelmed by too many trapped
      packets, it is desirable to bind a packet trap policer to these groups.
      For example, to limit all the packets that encountered an exception
      during routing to 10Kpps.
      
      Allow device drivers to bind default packet trap policers to packet trap
      groups when the latter are registered with devlink.
      
      The next patch will enable user space to change this default binding.
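      A sketch of the driver-side binding this enables: a trap group names
      its default policer at registration time. The macro and registration
      helper shapes below follow the devlink headers of this era and should
      be treated as illustrative, not authoritative:

        #include <linux/kernel.h>
        #include <net/devlink.h>

        static const struct devlink_trap_group drv_trap_groups[] = {
                /* Bind the generic l3_drops group to policer ID 1
                 * by default. */
                DEVLINK_TRAP_GROUP_GENERIC(L3_DROPS, 1),
        };

        static int drv_trap_groups_init(struct devlink *devlink)
        {
                return devlink_trap_groups_register(devlink, drv_trap_groups,
                                                    ARRAY_SIZE(drv_trap_groups));
        }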
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9f54392
    • netdevsim: Add devlink-trap policer support · ad188458
      Ido Schimmel authored
      Register three dummy packet trap policers with devlink and implement
      callbacks to change their parameters and read their counters.
      
      This will be used later on in the series to test the devlink-trap
      policer infrastructure.
      
      v2:
      * Remove check about burst size being a power of 2 and instead add a
        debugfs knob to fail the operation
      * Provide max/min rate/burst size when registering policers and remove
        the validity checks from nsim_dev_devlink_trap_policer_set()
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad188458
    • Documentation: Add description of packet trap policers · ef7d5c7d
      Ido Schimmel authored
      Extend devlink-trap documentation with information about packet trap
      policers.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef7d5c7d
    • devlink: Add packet trap policers support · 1e8c6619
      Ido Schimmel authored
      Devices capable of offloading the kernel's datapath and performing
      functions such as bridging and routing must also be able to send (trap)
      specific packets to the kernel (i.e., the CPU) for processing.
      
      For example, a device acting as a multicast-aware bridge must be able to
      trap IGMP membership reports to the kernel for processing by the bridge
      module.
      
      In most cases, the underlying device is capable of handling packet rates
      that are several orders of magnitude higher compared to those that can
      be handled by the CPU.
      
      Therefore, in order to prevent the underlying device from overwhelming
      the CPU, devices usually include packet trap policers that are able to
      police the trapped packets to rates that can be handled by the CPU.
      
      This patch allows capable device drivers to register their supported
      packet trap policers with devlink. User space can then tune the
      parameters of these policers (currently, rate and burst size) and read
      from the device the number of packets that were dropped by the policer,
      if supported.
      
      Subsequent patches in the series will allow device drivers to create a
      default binding between these policers and packet trap groups and allow
      user space to change the binding.
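      A sketch of how a driver might register policers against the interface
      this patch introduces. The macro argument order and the callback
      signatures follow the devlink headers of this era; treat the exact
      shapes and values as illustrative:

        #include <linux/kernel.h>
        #include <net/devlink.h>

        static const struct devlink_trap_policer drv_trap_policers[] = {
                /* id, init rate (pps), init burst,
                 * max/min rate, max/min burst */
                DEVLINK_TRAP_POLICER(1, 1000, 128, 8000, 1, 2048, 1),
                DEVLINK_TRAP_POLICER(2, 2000, 256, 8000, 1, 2048, 1),
        };

        static int drv_trap_policer_set(struct devlink *devlink,
                                        const struct devlink_trap_policer *policer,
                                        u64 rate, u64 burst,
                                        struct netlink_ext_ack *extack)
        {
                /* Push the new rate/burst to hardware; devlink has already
                 * range-checked them against the max/min values above. */
                return 0;
        }

        static int drv_trap_policer_counter_get(struct devlink *devlink,
                                                const struct devlink_trap_policer *policer,
                                                u64 *p_drops)
        {
                /* Read the policer's violate (drop) counter from hardware. */
                *p_drops = 0;
                return 0;
        }

        static int drv_traps_init(struct devlink *devlink)
        {
                return devlink_trap_policers_register(devlink, drv_trap_policers,
                                                      ARRAY_SIZE(drv_trap_policers));
        }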
      
      v2:
      * Add 'strict_start_type' in devlink policy
      * Have device drivers provide max/min rate/burst size for each policer.
        Use them to check validity of user provided parameters
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e8c6619
    • Merge branch 'cgroup-bpf_link' · 8596a75f
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      The bpf_link abstraction itself was formalized in [0] with justifications
      for why its semantics is a good fit for attaching BPF programs of various
      types. This patch set adds a bpf_link-based BPF program attachment
      mechanism for cgroup BPF programs.
      
      Cgroup BPF link is semantically compatible with current BPF_F_ALLOW_MULTI
      semantics of attaching cgroup BPF programs directly. Thus cgroup bpf_link can
      co-exist with legacy BPF program multi-attachment.
      
      bpf_link is destroyed and automatically detached when the last open FD holding
      the reference to bpf_link is closed. This means that by default, when the
      process that created bpf_link exits, attached BPF program will be
      automatically detached due to bpf_link's clean up code. Cgroup bpf_link, like
      any other bpf_link, can be pinned in BPF FS and by those means survive the
      exit of process that created the link. This is useful in many scenarios to
      provide long-living BPF program attachments. Pinning also means that there
      could be many owners of bpf_link through independent FDs.
      
      Additionally, auto-detachment of cgroup bpf_link is implemented. When a
      cgroup is dying it will automatically detach all active bpf_links. This
      ensures that cgroup clean up is not delayed by an active bpf_link even
      though there is no chance for any BPF program to be run for that cgroup.
      In that sense it's similar to the existing behavior of dropping the
      refcnt of an attached bpf_prog. But in the case of bpf_link, bpf_link is
      not destroyed and is still available to the user as long as at least one
      active FD is still open (or if it's pinned in BPF FS).
      
      There are two main cgroup-specific differences between bpf_link-based and
      direct bpf_prog-based attachment.
      
      First, as opposed to direct bpf_prog attachment, the cgroup itself
      doesn't "own" the bpf_link, which makes it possible to auto-clean up the
      attached bpf_link when the user process abruptly exits without
      explicitly detaching the BPF program. This makes for a safe default
      behavior proven in BPF tracing program types. But bpf_link doesn't bump
      cgroup->bpf.refcnt either, and because of that it doesn't prevent the
      cgroup from cleaning up its BPF state.
      
      Second, only owners of bpf_link (those who created bpf_link in the first place
      or obtained a new FD by opening bpf_link from BPF FS) can detach and/or update
      it. This makes sure that no other process can accidentally remove/replace BPF
      program.
      
      This patch set also implements the LINK_UPDATE sub-command, which allows
      replacing bpf_link's underlying bpf_prog, similarly to the BPF_F_REPLACE
      flag behavior for direct bpf_prog cgroup attachment. Similarly to
      LINK_CREATE, it is supposed to be a generic command for different types
      of bpf_links.
      
        [0] https://lore.kernel.org/bpf/20200228223948.360936-1-andriin@fb.com/
      
      v2->v3:
        - revert back to just MULTI mode (Alexei);
        - fix tinyconfig compilation warning (kbuild test robot);
      
      v1->v2:
        - implement exclusive and overridable exclusive modes (Andrey Ignatov);
        - fix build for !CONFIG_CGROUP_BPF build;
        - add more selftests for non-multi mode and inter-operability;
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8596a75f
    • selftests/bpf: Test FD-based cgroup attachment · 7cccee42
      Andrii Nakryiko authored
      Add selftests to exercise FD-based cgroup BPF program attachments and
      their intermixing with legacy cgroup BPF attachments. Auto-detachment
      and program replacement (both unconditional and cmpxchg-like) are tested
      as well.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-5-andriin@fb.com
      7cccee42
    • libbpf: Add support for bpf_link-based cgroup attachment · cc4f864b
      Andrii Nakryiko authored
      Add bpf_program__attach_cgroup(), which uses BPF_LINK_CREATE subcommand to
      create an FD-based kernel bpf_link. Also add low-level bpf_link_create() API.
      
      If expected_attach_type is not specified explicitly with
      bpf_program__set_expected_attach_type(), libbpf will try to determine proper
      attach type from BPF program's section definition.
      
      Also add support for bpf_link's underlying BPF program replacement:
        - unconditional through high-level bpf_link__update_program() API;
        - cmpxchg-like with specifying expected current BPF program through
          low-level bpf_link_update() API.
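      A hypothetical usage sketch of the APIs described above (the object,
      cgroup fd, section title, and error handling are made up for
      illustration; API names are as added by this patch):

        #include <bpf/bpf.h>
        #include <bpf/libbpf.h>

        int attach_example(struct bpf_object *obj, int cgroup_fd,
                           struct bpf_program *new_prog)
        {
                struct bpf_program *prog;
                struct bpf_link *link;

                prog = bpf_object__find_program_by_title(obj,
                                                         "cgroup_skb/egress");
                if (!prog)
                        return -1;

                /* High-level attach: creates an FD-based kernel bpf_link
                 * via the LINK_CREATE sub-command. */
                link = bpf_program__attach_cgroup(prog, cgroup_fd);
                if (libbpf_get_error(link))
                        return -1;

                /* Unconditional replacement of the link's underlying
                 * program (LINK_UPDATE under the hood). */
                if (bpf_link__update_program(link, new_prog))
                        return -1;

                /* The attachment goes away when the last FD is closed. */
                bpf_link__destroy(link);
                return 0;
        }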
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-4-andriin@fb.com
      cc4f864b
    • bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc
      Andrii Nakryiko authored
      Add a new operation (LINK_UPDATE), which allows replacing the active
      bpf_prog under a given bpf_link. Currently this is only supported for
      bpf_cgroup_link, but will be extended to other kinds of bpf_links in
      follow-up patches.
      
      For bpf_cgroup_link, implemented functionality matches existing semantics for
      direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
      unconditionally set new bpf_prog regardless of which bpf_prog is currently
      active under given bpf_link, or, optionally, can specify expected active
      bpf_prog. If active bpf_prog doesn't match expected one, no changes are
      performed, old bpf_link stays intact and attached, operation returns
      a failure.
      
      The cgroup_bpf_replace() operation resolves the race between
      auto-detachment and bpf_prog update in the same fashion as is done for
      bpf_link detachment, except in this case the update has no way of
      succeeding because the target cgroup is marked as dying. So in this
      case an error is returned.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
      0c991ebc
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko authored
      Implement a new sub-command to attach cgroup BPF programs and return an
      FD-based bpf_link back on success. A bpf_link, once attached to a
      cgroup, cannot be replaced, except by an owner having its FD. Cgroup
      bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and
      prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed.
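      A minimal raw-syscall sketch of the sub-command described above (fds
      are assumed valid and error handling is elided; the attach type is an
      arbitrary cgroup type chosen for illustration):

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int cgroup_link_create(int prog_fd, int cgroup_fd)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.link_create.prog_fd = prog_fd;
                attr.link_create.target_fd = cgroup_fd;
                attr.link_create.attach_type = BPF_CGROUP_INET_INGRESS;

                /* Returns an FD owning the link; closing the last FD
                 * (if the link is not pinned) detaches the program. */
                return syscall(__NR_bpf, BPF_LINK_CREATE, &attr,
                               sizeof(attr));
        }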
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
      af6eea57
  2. 30 Mar, 2020 15 commits
    • Merge branch 'subreg-bounds' · e5ffcc91
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      This series adds ALU32 signed and unsigned min/max bounds.
      
      The origins of this work is to fix do_refine_retval_range() which before
      this series clamps the return value bounds to [0, max]. However, this
      is not correct because its possible these functions may return negative
      errors so the correct bound is [*MIN, max]. Where *MIN is the signed
      and unsigned min values U64_MIN and S64_MIN. And 'max' here is the max
      positive value returned by this routine.
      
      Patch 1 changes do_refine_retval_range() to return the correct bounds,
      but this breaks existing programs that were depending on the old
      incorrect bound. To repair these old programs we add ALU32 bounds to
      properly track the return values from these helpers. The ALU32 bounds
      are needed because clang realizes these helpers return 'int' type and
      will use jmp32 ops with the return value. With the current state of
      things this does little to help the 64-bit bounds, and with patch 1
      applied it will cause many programs to fail the verifier pass. See
      patch 5 for trace details on how this happens.
      
      Patch 2 does the ALU32 addition: it adds the new bounds and populates
      them through the verifier. Design note: initially a var32 was added,
      but as pointed out by Alexei and Edward it is not strictly needed, so
      it was removed here. This worked out nicely.
      
      Patch 3 notes that the refined return value can now also bound the
      32-bit subregister, allowing better bounds tracking in these cases.
      
      Patch 4 adds a C test case to test_progs which will cause the verifier
      to fail if the new 32-bit bounds or do_refine_retval_range() are
      incorrect.
      
      Patches 5 and 6 fix test cases that broke after refining the return
      values from helpers. I attempted to be explicit about each failure and
      why we need the change. See patches for details.
      
      Patch 7 adds some bounds check tests to ensure bounds checking when
      mixing alu32, alu64 and jmp32 ops together.
      
      Thanks to Alexei, Edward, and Daniel for the initial feedback; it helped
      clean this up a lot.
      
      v2:
        - rebased to bpf-next
        - fixed tnum equals optimization for combining 32->64bits
        - updated patch to fix verifier test correctly
        - updated refine_retval_range to set both s32_*_value and s*_value we
          need both to get better bounds tracking
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e5ffcc91
    • bpf: Test_verifier, add alu32 bounds tracking tests · 41f70fe0
      John Fastabend authored
      It's possible to have divergent ALU32 and ALU64 bounds when using JMP32
      instructions and ALU64 arithmetic operations. Sometimes clang will even
      generate this code. Because the case is a bit tricky let's add a
      specific test for it.

      Here is a pseudocode asm version to illustrate the idea,
      
       1 r0 = 0xffffffff00000001;
       2 if w0 > 1 goto %l[fail];
       3 r0 += 1
       5 if w0 > 2 goto %l[fail]
       6 exit
      
      The intent here is the verifier will fail the load if the 32bit bounds
      are not tracked correctly through ALU64 op. Similarly we can check the
      64bit bounds are correctly zero extended after ALU32 ops.
      
       1 r0 = 0xffffffff00000001;
       2 w0 += 1
       2 if r0 > 3 goto %l[fail];
       6 exit
      
      The above will fail if we do not correctly zero extend 64bit bounds
      after 32bit op.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560430155.10843.514209255758200922.stgit@john-Precision-5820-Tower
      41f70fe0
    • bpf: Test_verifier, #65 error message updates for trunc of boundary-cross · 32f13a5a
      John Fastabend authored
      After the changes that add update_reg_bounds after ALU ops and 32-bit
      bounds tracking, truncation of a boundary-crossing range will fail
      earlier and with a different error message. The test error trace is now
      the following
      
      11: (17) r1 -= 2147483584
      12: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
          R1_w=invP(id=0,smin_value=-2147483584,smax_value=63)
          R10=fp0 fp-8_w=mmmmmmmm
      12: (17) r1 -= 2147483584
      13: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
          R1_w=invP(id=0,
                    umin_value=18446744069414584448,umax_value=18446744071562068095,
                    var_off=(0xffffffff00000000; 0xffffffff))
          R10=fp0 fp-8_w=mmmmmmmm
      13: (77) r1 >>= 8
      14: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0)
          R1_w=invP(id=0,
                    umin_value=72057594021150720,umax_value=72057594029539328,
                    var_off=(0xffffffff000000; 0xffffff),
                    s32_min_value=-16777216,s32_max_value=-1,
                    u32_min_value=-16777216)
          R10=fp0 fp-8_w=mmmmmmmm
      14: (0f) r0 += r1
      value 72057594021150720 makes map_value pointer be out of bounds
      
      Because we now have 'umin_value == umax_value', instead of
      'umin_value != umax_value' as before, we can fail earlier, noting
      that the pointer addition is out of bounds.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560428103.10843.6316594510312781186.stgit@john-Precision-5820-Tower
      32f13a5a
    • bpf: Test_verifier, bpf_get_stack return value add <0 · 9ac26e99
      John Fastabend authored
      With the current ALU32 subreg handling and the retval refine fix from
      the last patches we see an expected failure in test_verifier. With
      verbose verifier state being printed at each step for clarity we have
      the following relevant lines [I omit register states that are not
      necessarily useful to see the failure cause],
      
      #101/p bpf_get_stack return R0 within range FAIL
      Failed to load prog 'Success'!
      [..]
      14: (85) call bpf_get_stack#67
       R0_w=map_value(id=0,off=0,ks=8,vs=48,imm=0)
       R3_w=inv48
      15:
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
      15: (b7) r1 = 0
      16:
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
      16: (bf) r8 = r0
      17:
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
       R8_w=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
      17: (67) r8 <<= 32
      18:
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
       R8_w=inv(id=0,smax_value=9223372032559808512,
                     umax_value=18446744069414584320,
                     var_off=(0x0; 0xffffffff00000000),
                     s32_min_value=0,
                     s32_max_value=0,
                     u32_max_value=0,
                     var32_off=(0x0; 0x0))
      18: (c7) r8 s>>= 32
      19
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
       R8_w=inv(id=0,smin_value=-2147483648,
                     smax_value=2147483647,
                     var32_off=(0x0; 0xffffffff))
      19: (cd) if r1 s< r8 goto pc+16
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
       R8_w=inv(id=0,smin_value=-2147483648,
                     smax_value=0,
                     var32_off=(0x0; 0xffffffff))
      20:
       R0=inv(id=0,smax_value=48,var32_off=(0x0; 0xffffffff))
       R1_w=inv0
       R8_w=inv(id=0,smin_value=-2147483648,
                     smax_value=0,
       R9=inv48
      20: (1f) r9 -= r8
      21: (bf) r2 = r7
      22:
       R2_w=map_value(id=0,off=0,ks=8,vs=48,imm=0)
      22: (0f) r2 += r8
      value -2147483648 makes map_value pointer be out of bounds
      
      After the call to bpf_get_stack() on line 14 and some moves we have at
      line 16 an r8 bound with max_value 48 but an unknown min value. This is
      to be expected: the bpf_get_stack() call can only return a max of the
      input size but is free to return any negative error in the 32-bit
      register space. The C helper is returning an int so will use the lower
      32 bits.
      
      Lines 17 and 18 clear the top 32 bits with a left/right shift but use
      ARSH, so we still have a worst case min bound before line 19 of
      -2147483648. At this point the signed check 'r1 s< r8', meant to protect
      the addition on line 22 where the dst reg is a map_value pointer, may
      very well return true with a large negative number. Then the final line
      22 will detect this as an invalid operation and fail the program. What
      we want to do is proceed only if r8 is a positive non-error. So change
      'r1 s< r8' to 'r1 s> r8' so that we jump if r8 is negative.
      
      Next we will throw an error because we access past the end of the map
      value. The map value size is 48 and sizeof(struct test_val) is 48, so
      we walk off the end of the map value on the second call to
      bpf_get_stack(). Fix this by changing sizeof(struct test_val) to
      24 by using 'sizeof(struct test_val) / 2'. After this everything passes
      as expected.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560426019.10843.3285429543232025187.stgit@john-Precision-5820-Tower
      9ac26e99
    • bpf: Test_progs, add test to catch retval refine error handling · d2db08c7
      John Fastabend authored
      Before this series the verifier would clamp the return bounds of
      bpf_get_stack() to [0, X], and this led the verifier to believe
      that a JMP_JSLT 0 would be false and so it would prune that path.

      The result is anything hidden behind that JSLT would be unverified.
      Add a test to catch this case by hiding a goto pc-1 behind the
      check, which will cause an infinite loop if not rejected.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560423908.10843.11783152347709008373.stgit@john-Precision-5820-Tower
      d2db08c7
    • bpf: Verifier, refine 32bit bound in do_refine_retval_range · fa123ac0
      John Fastabend authored
      Further refine the return value range in do_refine_retval_range() by
      noting these are int return types (we will assume here that int is a
      32-bit type).

      Two reasons to pull this out of the original patch. First, including it
      would make the original fix impossible to backport. And second, I've
      not seen this as being problematic in practice, unlike the other case.
      
      Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
      fa123ac0
    • bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132
      John Fastabend authored
      It is not possible for the current verifier to track ALU32 and JMP ops
      correctly. This can result in the verifier aborting with errors even
      though the program should be verifiable. BPF code that hits this can
      work around it by changing int variables to 64-bit types, marking
      variables volatile, etc. But this is all very ugly so it would be
      better to avoid these tricks.
      
      But, the main reason to address this now is do_refine_retval_range() was
      assuming return values could not be negative. Once we fixed this code that
      was previously working will no longer work. See do_refine_retval_range()
      patch for details. And we don't want to suddenly cause programs that used
      to work to fail.
      
      The simplest example code snippet that illustrates the problem is likely
      this,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
      
      The expected 64-bit and 32-bit bounds after each line are shown on the
      right. The current issue is without the w* bounds we are forced to use
      the worst case bound of [0, U32_MAX]. To resolve this type of case,
      jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit
      32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
      from branch_taken logic creating new bounds we can track 32-bit bounds
      explicitly.
      
      The next case we observed is ALU ops after the jmp32,
      
       53: w8 = w0                    // r8 <- [0, S32_MAX],
                                      // w8 <- [-S32_MIN, X]
       54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                      // w8 <- [0, X]
       55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                      // w8 <- [0, X+1]
      
      In order to keep the bounds accurate at this point we also need to track
      ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
      ops, mov, add, sub, etc.
      
      Finally there is a question of how and when to merge bounds. The cases
      are enumerated here,
      
      1. MOV ALU32   - zext 32-bit -> 64-bit
      2. MOV ALU64   - copy 64-bit -> 32-bit
      3. op  ALU32   - zext 32-bit -> 64-bit
      4. op  ALU64   - n/a
      5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
      6. jmp ALU64   - 32-bit: (>> (<< var64_off))
      
      Details for each case,
      
      For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
      from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
      bounds correctly. See zext_32_to_64.
      
      For "MOV ALU64" copy all bounds including 32-bit into new register. If
      the src register had 32-bit bounds the dst register will as well.
      
      For "op ALU32" zero extend 32-bit into 64-bit the same as move,
      see zext_32_to_64.
      
      For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
      is done here. Except we have a special case. When RSH or ARSH is
      done we can't simply ignore shifting bits from 64-bit reg into the
      32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
      This will be correct in the sense that they will represent a valid
      state of the register. However we could lose some accuracy if an
      ARSH is following a jmp32 operation. We can handle this special
      case in a follow up series.
      
      For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
      from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
      special case if 64-bit bounds has zero'd upper 32bits at which point
      we can simply copy 32-bit bounds into 64-bit register. This catches
      a common compiler trick where upper 32-bits are zeroed and then
      32-bit ops are used followed by a 64-bit compare or 64-bit op on
      a pointer. See __reg_combine_64_into_32().
      
      For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
      counterpart. For example s32_min_value = (s32)reg->smin_value. For
      tnum use only the lower 32bits via, (>>(<<var_off)). See
      __reg_combine_64_into_32().
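      A standalone sketch of the two merge directions described above. The
      struct is a simplified stand-in for the verifier's bpf_reg_state (tnum
      handling omitted), so treat it as conceptual rather than the kernel's
      code:

        #include <stdint.h>

        struct reg_bounds {
                int64_t  smin_value, smax_value;
                uint64_t umin_value, umax_value;
                int32_t  s32_min_value, s32_max_value;
                uint32_t u32_min_value, u32_max_value;
        };

        /* Cases (1) and (3): after a 32-bit op the upper 32 bits are
         * known zero, so the 64-bit bounds collapse to the unsigned
         * 32-bit bounds. */
        void zext_32_to_64(struct reg_bounds *reg)
        {
                reg->umin_value = reg->u32_min_value;
                reg->umax_value = reg->u32_max_value;
                /* A zero-extended value is non-negative as a signed 64-bit. */
                reg->smin_value = reg->u32_min_value;
                reg->smax_value = reg->u32_max_value;
        }

        /* Case (6): derive 32-bit bounds from a 64-bit jump; only trust
         * the 64-bit bounds when they actually fit the subregister. */
        void reg_combine_64_into_32(struct reg_bounds *reg)
        {
                if (reg->umax_value <= UINT32_MAX) {
                        reg->u32_min_value = (uint32_t)reg->umin_value;
                        reg->u32_max_value = (uint32_t)reg->umax_value;
                }
                if (reg->smin_value >= INT32_MIN &&
                    reg->smax_value <= INT32_MAX) {
                        reg->s32_min_value = (int32_t)reg->smin_value;
                        reg->s32_max_value = (int32_t)reg->smax_value;
                }
        }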
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
      3f50f132
    • bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly · 10060503
      John Fastabend authored
      do_refine_retval_range() is called to refine the return values from
      specified helpers, probe_read_str and get_stack at the moment. The
      reasoning is that both have a max value as part of their input
      arguments and, because the helpers ensure the return value will not be
      larger than this, we can set the smax value of the return register, r0.

      However, the return value is a signed integer, so setting umax is
      incorrect. It leads to further confusion when do_refine_retval_range()
      then calls __reg_deduce_bounds(), which will see a umax value as
      meaning the value is unsigned, and then, assuming it is unsigned, set
      smin = umin, which in this case results in 'smin = 0' and 'smax = X',
      where X is the input argument from the helper call.
      
      Here are the comments from __reg_deduce_bounds() on why this would be
      safe to do.
      
       /* Learn sign from unsigned bounds.  Signed bounds cross the sign
        * boundary, so we must be careful.
        */
       if ((s64)reg->umax_value >= 0) {
      	/* Positive.  We can't learn anything from the smin, but smax
      	 * is positive, hence safe.
      	 */
      	reg->smin_value = reg->umin_value;
      	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
      						  reg->umax_value);
      
      But now we incorrectly have a return value with type int with the
      signed bounds (0, X). Suppose the return value is negative, which is
      possible; then we have the verifier and reality out of sync. Among
      other things this may result in any error handling code being falsely
      detected as dead code and removed. For instance the example below shows
      that using bpf_probe_read_str() causes the error path to be identified
      as dead code and removed.
      
      From the 'llvm-objdump -S' dump,
      
       r2 = 100
       call 45
       if r0 s< 0 goto +4
       r4 = *(u32 *)(r7 + 0)
      
      But from the xlated dump,
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto
      
      Due to verifier state after call being
      
       R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))
      
      To fix this, omit setting the umax value, because it's not safe. The
      only actual bound we know is the smax. This results in the correct
      bounds (SMIN, X) where X is the max length from the helper. After this
      the new verifier state looks like the following after call 45.
      
      R0=inv(id=0,smax_value=100)
      
      The xlated version then no longer removes the dead code, giving the
      expected result,
      
        (b7) r2 = 100
        (85) call bpf_probe_read_compat_str#-96768
        (c5) if r0 s< 0x0 goto pc+4
        (61) r4 = *(u32 *)(r7 +0)
      
      Note, bpf_probe_read_* calls are root only, so we won't hit this case
      with non-root bpf users.
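      For reference, the pattern at issue as a hedged BPF C sketch (the
      section name and probed pointer are stand-ins for illustration): with
      the fix, the verifier keeps the negative-return branch alive instead
      of pruning it.

        #include <linux/bpf.h>
        #include <linux/ptrace.h>
        #include <bpf/bpf_helpers.h>

        char _license[] SEC("license") = "GPL";

        SEC("kprobe/do_sys_open")
        int probe_str_check(struct pt_regs *ctx)
        {
                char buf[100];
                int ret;

                /* The source address is a stand-in; what matters is that
                 * the helper returns an int that may be a negative error. */
                ret = bpf_probe_read_str(buf, sizeof(buf), (void *)ctx);
                if (ret < 0)    /* previously falsely removed as dead code */
                        return 0;

                /* buf[0..ret-1] is safe to use here */
                return ret;
        }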
      
      v3: the comment had some documentation about the meta-set-to-null case,
      which is not relevant here and confusing to include in the comment.
      
      v2 note: In the original version we set msize_smax_value from
      check_func_arg() and propagated this into the smax of the retval. The
      logic was that smax is the bound on the retval we set, and because the
      type in the helper is ARG_CONST_SIZE we know the reg is a positive
      tnum_const(), so umax = smax. Alexei pointed out though that this is a
      bit odd to read, because the register in check_func_arg() has a C type
      of u32 and the umax bound would be the normally relevant bound here.
      Pulling in extra knowledge about future checks makes reading the code a
      bit tricky. Further, having signed metadata that can only ever be
      positive is also a bit odd. So I dropped the msize_smax_value metadata
      and made it a u64 msize_max_value to indicate it is unsigned. And
      additionally I save the bound from the umax value in check_func_arg(),
      which is the same as smax due to the tnum_const() and negative check
      noted above, but reads better. By my analysis nothing functionally
      changes in v2, but it does get easier to read, so that is a win.
      
      Fixes: 849fa506 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
      10060503
    • bpf, lsm: Make BPF_LSM depend on BPF_EVENTS · 4edf16b7
      KP Singh authored
      LSM and tracing programs share their helpers with bpf_tracing_func_proto
      which is only defined (in bpf_trace.c) when BPF_EVENTS is enabled.
      
      Instead of adding a __weak symbol, make BPF_LSM depend on BPF_EVENTS so
      that both tracing and LSM programs can actually share helpers.
      
      Fixes: fc611f47 ("bpf: Introduce BPF_PROG_TYPE_LSM")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: KP Singh <kpsingh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200330204059.13024-1-kpsingh@chromium.org
      4edf16b7
    • Merge branch 'bpf_sk_assign' · c58b1558
      Alexei Starovoitov authored
      Joe Stringer says:
      
      ====================
      Introduce a new helper that allows assigning a previously-found socket
      to the skb as the packet is received towards the stack, to cause the
      stack to guide the packet towards that socket subject to local routing
      configuration. The intention is to support TProxy use cases more
      directly from eBPF programs attached at TC ingress, to simplify and
      streamline Linux stack configuration in scale environments with Cilium.
      
      Normally in ip{,6}_rcv_core(), the skb will be orphaned, dropping any
      existing socket reference associated with the skb. Existing tproxy
      implementations in netfilter get around this restriction by running the
      tproxy logic after ip_rcv_core() in the PREROUTING table. However, this
      is not an option for TC-based logic (including eBPF programs attached at
      TC ingress).
      
      This series introduces the BPF helper bpf_sk_assign() to associate the
      socket with the skb on the ingress path as the packet is passed up the
      stack. The initial patch in the series simply takes a reference on the
      socket to ensure safety, but later patches relax this for listen
      sockets.
      
      To ensure delivery to the relevant socket, we still consult the routing
      table, for full examples of how to configure see the tests in patch #5;
      the simplest form of the route would look like this:
      
        $ ip route add local default dev lo
      
      This series is laid out as follows:
      * Patch 1 extends the eBPF API to add sk_assign() and defines a new
        socket free function to allow the later paths to understand when the
        socket associated with the skb should be kept through receive.
      * Patches 2-3 optimize the receive path to avoid taking a reference on
        listener sockets during receive.
      * Patches 4-5 extend the selftests with examples of the new
        functionality and validation of correct behaviour.
      
      Changes since v4:
      * Fix build with CONFIG_INET disabled
      * Rebase
      
      Changes since v3:
      * Use sock_gen_put() directly instead of sock_edemux() from sock_pfree()
      * Commit message wording fixups
      * Add acks from Martin, Lorenz
      * Rebase
      
      Changes since v2:
      * Add selftests for UDP socket redirection
      * Drop the early demux optimization patch (defer for more testing)
      * Fix check for orphaning after TC act return
      * Tidy up the tests to clean up properly and be less noisy.
      
      Changes since v1:
      * Replace the metadata_dst approach with using the skb->destructor to
        determine whether the socket has been prefetched. This is much
        simpler.
      * Avoid taking a reference on listener sockets during receive
      * Restrict assigning sockets across namespaces
      * Restrict assigning SO_REUSEPORT sockets
      * Fix cookie usage for socket dst check
      * Rebase the tests against test_progs infrastructure
      * Tidy up commit messages
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c58b1558
    • selftests: bpf: Extend sk_assign tests for UDP · 8a02a170
      Joe Stringer authored
      Add support for testing UDP sk_assign to the existing tests.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Lorenz Bauer <lmb@cloudflare.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-6-joe@wand.net.nz
      8a02a170
    • selftests: bpf: Add test for sk_assign · 2d7824ff
      Lorenz Bauer authored
      Attach a tc direct-action classifier to lo in a fresh network
      namespace, and rewrite all connection attempts to localhost:4321
      to localhost:1234 (for port tests) and connections to unreachable
      IPv4/IPv6 IPs to the local socket (for address tests). Includes
      implementations for both TCP and UDP.
      
      Keep in mind that both client to server and server to client traffic
      passes the classifier.
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-5-joe@wand.net.nz
      Co-authored-by: Joe Stringer <joe@wand.net.nz>
      2d7824ff
    • bpf: Don't refcount LISTEN sockets in sk_assign() · 7ae215d2
      Joe Stringer authored
      Avoid taking a reference on listen sockets by checking the socket type
      in the sk_assign and in the corresponding skb_steal_sock() code in the
      transport layer, and by ensuring that the prefetch free (sock_pfree)
      function uses the same logic to check whether the socket is refcounted.
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
      7ae215d2
    • net: Track socket refcounts in skb_steal_sock() · 71489e21
      Joe Stringer authored
      Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
      the determination of whether the socket is reference counted in the case
      where it is prefetched by earlier logic such as early_demux.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
      71489e21
    • bpf: Add socket assign support · cf7fbe66
      Joe Stringer authored
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via a local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (e.g., by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
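      A hedged sketch of a TC ingress program using the new helper (the
      hard-coded address/port and the minimal tuple construction are
      illustrative only; helper names are as added by this series):

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_endian.h>
        #include <bpf/bpf_helpers.h>

        char _license[] SEC("license") = "GPL";

        SEC("classifier")
        int tproxy_assign(struct __sk_buff *skb)
        {
                struct bpf_sock_tuple tuple = {};
                struct bpf_sock *sk;

                /* Look up the local proxy socket on 127.0.0.1:1234. */
                tuple.ipv4.daddr = bpf_htonl(0x7f000001);
                tuple.ipv4.dport = bpf_htons(1234);

                sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                       BPF_F_CURRENT_NETNS, 0);
                if (!sk)
                        return TC_ACT_OK;

                /* Associate the socket with the skb; the stack delivers
                 * to it subject to the local route
                 * ('ip route add local default dev lo'). */
                if (bpf_sk_assign(skb, sk, 0))
                        bpf_printk("sk_assign failed");

                bpf_sk_release(sk);
                return TC_ACT_OK;
        }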
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66