1. 21 Feb, 2020 40 commits
    • net/mlx5: Add fsm_reactivate callback support · b7331aa2
      Eran Ben Elisha authored
      Add support for fsm reactivate via MIRC (Management Image Re-activation
      Control) set and query commands.
      For re-activation flow, driver shall first run MIRC set, and then wait
      until FW is done (via querying MIRC status).
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
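The set-then-poll sequence from this commit message can be sketched in plain user-space C. This is only an illustration, not the mlx5 driver code: the `fw_ops` callbacks, the status values, the retry bound, and the fake device are all hypothetical stand-ins for the MIRC set and query commands.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical status values; the real MIRC status encoding is not shown here. */
enum mirc_status { MIRC_BUSY, MIRC_DONE, MIRC_FAILED };

/* Hypothetical callbacks standing in for the MIRC set and query commands. */
struct fw_ops {
	int (*mirc_set)(void *ctx);
	int (*mirc_query)(void *ctx, enum mirc_status *status);
};

/* Issue the set command, then poll the status until FW reports completion. */
static int fsm_reactivate(const struct fw_ops *ops, void *ctx, int max_polls)
{
	enum mirc_status status;
	int err, i;

	err = ops->mirc_set(ctx);
	if (err)
		return err;

	for (i = 0; i < max_polls; i++) {
		err = ops->mirc_query(ctx, &status);
		if (err)
			return err;
		if (status == MIRC_DONE)
			return 0;
		if (status == MIRC_FAILED)
			return -EIO;
		/* A real driver would sleep between polls. */
	}
	return -ETIMEDOUT;
}

/* Fake device that reports completion after a fixed number of queries. */
struct fake_fw { int polls_until_done; int set_called; };

static int fake_set(void *ctx)
{
	((struct fake_fw *)ctx)->set_called = 1;
	return 0;
}

static int fake_query(void *ctx, enum mirc_status *status)
{
	struct fake_fw *fw = ctx;

	*status = (--fw->polls_until_done <= 0) ? MIRC_DONE : MIRC_BUSY;
	return 0;
}
```

Bounding the number of polls keeps the caller from waiting forever if the firmware never leaves the busy state; the bound itself is an assumption of this sketch.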
    • net/mlxfw: Add reactivate flow support to FSM burn flow · 958dfd0d
      Eran Ben Elisha authored
      Expose fsm_reactivate callback to the mlxfw_dev_ops struct. FSM reactivate
      is needed before flashing the new image in order to flush the old flashed
      but not running firmware image.
      
      If the mlxfw_dev does not support reactivation, this step is skipped.
      If the image flash later fails, the extack provides a hint advising the
      user that the failure might be related to the skipped reactivation.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
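The optional-callback behaviour described in this commit message might look roughly like the following user-space C sketch. The `dev_ops_stub` table, the `hint` out-parameter, and the choice to treat a reactivation error as fatal are assumptions made for illustration; only the `fsm_reactivate` name comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table; only fsm_reactivate is named in the commit message. */
struct dev_ops_stub {
	int (*fsm_reactivate)(void *dev);   /* optional, may be NULL */
	int (*flash)(void *dev);
};

/* Returns the flash result; *hint is set when a failure may stem from the skip. */
static int burn_image(const struct dev_ops_stub *ops, void *dev, bool *hint)
{
	bool reactivated = false;
	int err;

	*hint = false;
	if (ops->fsm_reactivate) {
		err = ops->fsm_reactivate(dev);
		if (err)
			return err;   /* assumption: reactivation errors are fatal */
		reactivated = true;
	}

	err = ops->flash(dev);
	if (err && !reactivated)
		*hint = true;   /* advise the user via the error message */
	return err;
}

/* Trivial callbacks used to exercise the sketch. */
static int ok_cb(void *dev) { (void)dev; return 0; }
static int fail_cb(void *dev) { (void)dev; return -1; }
```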
    • net/mlxfw: Use MLXFW_ERR_MSG macro for error reporting · 5042e8b9
      Saeed Mahameed authored
      Rather than always calling both mlxfw_err and NL_SET_ERR_MSG_MOD with the
      same message, use the dedicated macro.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlxfw: Convert pr_* to dev_* in mlxfw_fsm.c · 6a3f707c
      Saeed Mahameed authored
      Introduce mlxfw_{info, err, dbg} macros and make them call corresponding
      dev_* macros, then convert all instances of pr_* to mlxfw_*.
      
      This will allow printing the device name mlxfw is operating on.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlxfw: More error messages coverage · f7fe7aa8
      Saeed Mahameed authored
      Make sure mlxfw_firmware_flash reports a detailed, user-readable error
      message in every possible error path: every time mlxfw_dev->ops->*() is
      called and returns an error, or when image initialization fails.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlxfw: Improve FSM err message reporting and return codes · 86a1270f
      Saeed Mahameed authored
      Report unique and standard error codes corresponding to the specific
      FW flash error. In addition, add more detailed error messages to
      netlink.
      
      Before:
      $ devlink dev flash pci/0000:05:00.0 file ...
      Error: mlxfw: Firmware flash failed.
      devlink answers: Invalid argument
      
      After:
      $ devlink dev flash pci/0000:05:00.0 file ...
      Error: mlxfw: Firmware flash failed: pending reset.
      devlink answers: Device busy
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
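One way to picture the change from a single generic return code to per-error codes is a lookup table, sketched below in user-space C. The enum, the table, and the fallback entry are hypothetical; the only pairing taken from the before/after output above is "pending reset" with "Device busy" (EBUSY), and the generic case maps to "Invalid argument" (EINVAL) purely as an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical FSM error states; only "pending reset" appears in the log above. */
enum fsm_err { FSM_ERR_OK, FSM_ERR_PENDING_RESET, FSM_ERR_UNKNOWN };

struct fsm_err_desc {
	int errno_val;    /* standard code returned to the caller */
	const char *msg;  /* detail appended to the netlink error message */
};

static const struct fsm_err_desc fsm_err_table[] = {
	[FSM_ERR_OK]            = { 0,       "ok" },
	[FSM_ERR_PENDING_RESET] = { -EBUSY,  "pending reset" },
	[FSM_ERR_UNKNOWN]       = { -EINVAL, "unknown error" },
};

/* Look up the descriptor, falling back to the generic entry when out of range. */
static const struct fsm_err_desc *fsm_err_lookup(unsigned int err)
{
	if (err >= sizeof(fsm_err_table) / sizeof(fsm_err_table[0]))
		err = FSM_ERR_UNKNOWN;
	return &fsm_err_table[err];
}
```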
    • net/mlxfw: Generic mlx FW flash status notify · 4ae57566
      Saeed Mahameed authored
      FW flash status notify is currently implemented via a callback to the
      caller mlx module, and all it is doing is to call
      devlink_flash_update_status_notify with the specific module devlink
      instance.
      
      Instead of repeating the whole process for all mlx modules and
      re-implementing the status_notify callback again and again, just provide
      the devlink instance as part of mlxfw_dev when calling mlxfw_firmware_flash
      and let mlxfw do the devlink status updates directly.
      
      This makes adding status notify support to mlx5 simple: as done in this
      patch, it takes a single line that provides the devlink instance to
      mlxfw_firmware_flash.
      
      mlxfw now depends on NET_DEVLINK, like all other mlx modules.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · b105e8e2
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2020-02-21
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 25 non-merge commits during the last 4 day(s) which contain
      a total of 33 files changed, 2433 insertions(+), 161 deletions(-).
      
      The main changes are:
      
      1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
         with reuseport BPF programs, from Jakub Sitnicki.
      
      2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
         to specify the tracepoint/function dynamically, from Eelco Chaudron.
      
      3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
         guided optimizations, from Daniel Xu.
      
      4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.
      
      5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
         BTF defined maps, from Andrii Nakryiko.
      
      6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
         been observed that former fails in envs with low memlock, from Yonghong Song.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e65ee2fb
      David S. Miller authored
      Conflict resolution of ice_virtchnl_pf.c based upon work by
      Stephen Rothwell.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-sockmap-listen' · eb1e1478
      Daniel Borkmann authored
      Jakub Sitnicki says:
      
      ====================
      This patch set turns SOCK{MAP,HASH} into generic collections for TCP
      sockets, both listening and established. Adding support for listening
      sockets enables us to use these BPF map types with reuseport BPF programs.
      
      Why? SOCKMAP and SOCKHASH, in comparison to REUSEPORT_SOCKARRAY, allow
      the socket to be in more than one map at the same time.
      
      Having a BPF map type that can hold listening sockets, and gracefully
      co-exist with reuseport BPF is important if, in the future, we want
      BPF programs that run at socket lookup time [0]. Cover letter for v1 of
      this series tells the full story of how we got here [1].
      
      Although SOCK{MAP,HASH} are not a drop-in replacement for SOCKARRAY just
      yet, because UDP support is lacking, it's a step in this direction. We're
      working with Lorenz on extending SOCK{MAP,HASH} to hold UDP sockets, and
      expect to post an RFC series for sockmap + UDP in the near future.
      
      I've dropped Acks from all patches that have been touched since v6.
      
      The audit for missing READ_ONCE annotations for access to sk_prot is
      ongoing. Thus far I've found one location specific to TCP listening sockets
      that needed annotating. This got fixed in this iteration. I wonder if
      sparse checker could be put to work to identify places where we have
      sk_prot access while not holding sk_lock...
      
      The patch series depends on another one, posted earlier [2], that has
      been split out of it.
      
      v6 -> v7:
      
      - Extended the series to cover SOCKHASH. (patches 4-8, 10-11) (John)
      
      - Rebased onto recent bpf-next. Resolved conflicts in recent fixes to
        sk_state checks on sockmap/sockhash update path. (patch 4)
      
      - Added missing READ_ONCE annotation in sock_copy. (patch 1)
      
      - Split out patches that simplify sk_psock_restore_proto [2].
      
      v5 -> v6:
      
      - Added a fix-up for patch 1 which I forgot to commit in v5. Sigh.
      
      v4 -> v5:
      
      - Rebase onto recent bpf-next to resolve conflicts. (Daniel)
      
      v3 -> v4:
      
      - Make tcp_bpf_clone parameter names consistent across function declaration
        and definition. (Martin)
      
      - Use sock_map_redirect_okay helper everywhere we need to take a different
        action for listening sockets. (Lorenz)
      
      - Expand comment explaining the need for a callback from reuseport to
        sockarray code in reuseport_detach_sock. (Martin)
      
      - Mention the possibility of using a u64 counter for reuseport IDs in the
        future in the description for patch 10. (Martin)
      
      v2 -> v3:
      
      - Generate reuseport ID when group is created. Please see patch 10
        description for details. (Martin)
      
      - Fix the build when CONFIG_NET_SOCK_MSG is not selected by either
        CONFIG_BPF_STREAM_PARSER or CONFIG_TLS. (kbuild bot & John)
      
      - Allow updating sockmap from BPF on BPF_SOCK_OPS_TCP_LISTEN_CB callback. An
        oversight in previous iterations. Users may want to populate the sockmap with
        listening sockets from BPF as well.
      
      - Removed RCU read lock assertion in sock_map_lookup_sys. (Martin)
      
      - Get rid of a warning when child socket was cloned with parent's psock
        state. (John)
      
      - Check for tcp_bpf_unhash rather than tcp_bpf_recvmsg when deciding if
        sk_proto needs restoring on clone. Check for recvmsg in the context of
        listening socket cloning was confusing. (Martin)
      
      - Consolidate sock_map_sk_is_suitable with sock_map_update_okay. This led
        to adding dedicated predicates for sockhash. Update self-tests
        accordingly. (John)
      
      - Annotate unlikely branch in bpf_{sk,msg}_redirect_map when socket isn't
        in a map, or isn't a valid redirect target. (John)
      
      - Document paired READ/WRITE_ONCE annotations and cover shared access in
        more detail in patch 2 description. (John)
      
      - Correct a couple of log messages in sockmap_listen self-tests so the
        message reflects the actual failure.
      
      - Rework reuseport tests from sockmap_listen suite so that ENOENT error
        from bpf_sk_select_reuseport handler does not happen on happy path.
      
      v1 -> v2:
      
      - af_ops->syn_recv_sock callback is no longer overridden and burdened with
        restoring sk_prot and clearing sk_user_data in the child socket. As child
        socket is already hashed when syn_recv_sock returns, it is too late to
        put it in the right state. Instead patches 3 & 4 address restoring
        sk_prot and clearing sk_user_data before we hash the child socket.
        (Pointed out by Martin Lau)
      
      - Annotate shared access to sk->sk_prot with READ_ONCE/WRITE_ONCE macros as
        we write to it from sk_msg while socket might be getting cloned on
        another CPU. (Suggested by John Fastabend)
      
      - Convert tests for SOCKMAP holding listening sockets to return-on-error
        style, and hook them up to test_progs. Also use BPF skeleton for setup.
        Add new tests to cover the race scenario discovered during v1 review.
      
      RFC -> v1:
      
      - Switch from overriding proto->accept to af_ops->syn_recv_sock, which
        happens earlier. Clearing the psock state after accept() does not work
        for child sockets that become orphaned (never got accepted). v4-mapped
        sockets need special care.
      
      - Return the socket cookie on SOCKMAP lookup from syscall to be on par with
        REUSEPORT_SOCKARRAY. Requires SOCKMAP to take u64 on lookup/update from
        syscall.
      
      - Make bpf_sk_redirect_map (ingress) and bpf_msg_redirect_map (egress)
        SOCKMAP helpers fail when target socket is a listening one.
      
      - Make bpf_sk_select_reuseport helper fail when target is a TCP established
        socket.
      
      - Teach libbpf to recognize SK_REUSEPORT program type from section name.
      
      - Add a dedicated set of tests for SOCKMAP holding listening sockets,
        covering map operations, overridden socket callbacks, and BPF helpers.
      
      [0] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
      [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/
      [2] https://lore.kernel.org/bpf/20200217121530.754315-1-jakub@cloudflare.com/
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Tests for sockmap/sockhash holding listening sockets · 44d28be2
      Jakub Sitnicki authored
      Now that SOCKMAP and SOCKHASH map types can store listening sockets,
      the user-space and BPF APIs are open to a new set of potential pitfalls.
      
      Exercise the map operations, with extra attention to code paths susceptible
      to races between map ops and socket cloning, and BPF helpers that work with
      SOCKMAP/SOCKHASH to gain confidence that all works as expected.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-12-jakub@cloudflare.com
    • selftests/bpf: Extend SK_REUSEPORT tests to cover SOCKMAP/SOCKHASH · 11318ba8
      Jakub Sitnicki authored
      Parametrize the SK_REUSEPORT tests so that the map type for storing sockets
      is not hard-coded in the test setup routine.
      
      This, together with careful state cleaning after the tests, lets us run the
      test cases for REUSEPORT_ARRAY, SOCKMAP, and SOCKHASH to have test coverage
      for all supported map types. The last two support only TCP sockets at the
      moment.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-11-jakub@cloudflare.com
    • net: Generate reuseport group ID on group creation · 035ff358
      Jakub Sitnicki authored
      Commit 736b4602 ("net: Add ID (if needed) to sock_reuseport and expose
      reuseport_lock") has introduced lazy generation of reuseport group IDs that
      survive group resize.
      
      By comparing the identifier we check if BPF reuseport program is not trying
      to select a socket from a BPF map that belongs to a different reuseport
      group than the one the packet is for.
      
      Because SOCKARRAY used to be the only BPF map type that can be used with
      reuseport BPF, it was possible to delay the generation of reuseport group
      ID until a socket from the group was inserted into BPF map for the first
      time.
      
      Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
      either generate the reuseport ID on map update, like SOCKARRAY does, or
      allocate an ID from the start when reuseport group gets created.
      
      This patch takes the latter approach to keep sockmap free of calls into
      reuseport code. This streamlines the reuseport_id access as its lifetime
      now matches the longevity of reuseport object.
      
      The cost of this simplification, however, is that we allocate reuseport IDs
      for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
      setups. With the way identifiers are currently generated, we can have at
      most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
      get close to the limit, we can switch to a u64 counter like sk_cookie.
      
      Another change is that we now always call into SOCKARRAY logic to unlink
      the socket from the map when unhashing or closing the socket. Previously we
      did it only when at least one socket from the group was in a BPF map.
      
      It is worth noting that this doesn't conflict with sockmap tear-down in
      case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
      group. sockmap tear-down happens first:
      
        prot->unhash
        `- tcp_bpf_unhash
           |- tcp_bpf_remove
           |  `- while (sk_psock_link_pop(psock))
           |     `- sk_psock_unlink
           |        `- sock_map_delete_from_link
           |           `- __sock_map_delete
           |              `- sock_map_unref
           |                 `- sk_psock_put
           |                    `- sk_psock_drop
           |                       `- rcu_assign_sk_user_data(sk, NULL)
           `- inet_unhash
              `- reuseport_detach_sock
                 `- bpf_sk_reuseport_detach
                    `- WRITE_ONCE(sk->sk_user_data, NULL)
      Suggested-by: Martin Lau <kafai@fb.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
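The eager-ID approach and the group comparison it enables can be sketched in user-space C as follows. The structures, the plain counter, and the specific error codes are illustrative assumptions; the kernel uses its own ID allocator and structures.

```c
#include <errno.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures; field names are illustrative. */
struct reuse_group { uint32_t id; };
struct sock_stub { struct reuse_group *group; };

static uint32_t next_reuseport_id = 1; /* the kernel uses an ID allocator instead */

/* Assign the ID eagerly, when the group is created. */
static void reuse_group_init(struct reuse_group *g)
{
	g->id = next_reuseport_id++;
}

/* Reject a selected socket that belongs to a different group than the packet's. */
static int select_reuseport(const struct reuse_group *pkt_group,
			    const struct sock_stub *selected)
{
	if (!selected->group)
		return -EINVAL;   /* not a member of any reuseport group */
	if (selected->group->id != pkt_group->id)
		return -EBADF;    /* illustrative code for a group mismatch */
	return 0;
}
```

Because the ID exists for the whole lifetime of the group, the check needs no call back into map code to create it lazily.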
    • bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH · 9fed9000
      Jakub Sitnicki authored
      SOCKMAP & SOCKHASH now support storing references to listening
      sockets. Nothing keeps us from using these map types as a collection of
      sockets to select from in BPF reuseport programs. Whitelist the map types
      with the bpf_sk_select_reuseport helper.
      
      The restriction that the socket has to be a member of a reuseport group
      still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
      set are not a valid target and we signal it with -EINVAL.
      
      The main benefit from this change is that, in contrast to
      REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a
      listening socket can be in just one BPF map at the same time.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com
    • bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH · 1d59f3bc
      Jakub Sitnicki authored
      Don't require the kernel code, like BPF helpers, that needs access to
      SOCK{MAP,HASH} map contents to live in net/core/sock_map.c. Expose the
      lookup operation to all kernel-land.
      
      Lookup from BPF context is not whitelisted yet, while syscalls have a
      dedicated lookup handler.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-8-jakub@cloudflare.com
    • bpf, sockmap: Return socket cookie on lookup from syscall · c1cdf65d
      Jakub Sitnicki authored
      Tooling that populates the SOCK{MAP,HASH} with sockets from user-space
      needs a way to inspect its contents. Returning the struct sock * that the
      map holds to user-space is neither safe nor useful. An approach established
      by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
      instead.
      
      Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
      value size for lookup to be possible. This requires special handling on
      update, though. Attempts to do a lookup on a map holding u32 values will be
      met with ENOSPC error.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com
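The value-size rule for syscall lookups can be illustrated with a small user-space C sketch. The structures and function name are hypothetical stand-ins, and the ENOENT result for an empty slot is an assumption; the ENOSPC result for a map holding u32 values is what the commit message describes.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins; the real map and socket structures are far richer. */
struct sock_stub { uint64_t cookie; };
struct sockmap_stub {
	uint32_t value_size;          /* 4 or 8 bytes, fixed at map creation */
	uint32_t max_entries;
	struct sock_stub **socks;
};

/* Copy the socket cookie, never the kernel pointer, into the caller's buffer. */
static int sockmap_lookup_sys(const struct sockmap_stub *map, uint32_t key,
			      void *value)
{
	if (map->value_size != sizeof(uint64_t))
		return -ENOSPC;       /* a u32 value cannot hold a u64 cookie */
	if (key >= map->max_entries || !map->socks[key])
		return -ENOENT;       /* assumption: empty slot reports "not found" */
	memcpy(value, &map->socks[key]->cookie, sizeof(uint64_t));
	return 0;
}
```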
    • bpf, sockmap: Don't set up upcalls and progs for listening sockets · 6e830c2f
      Jakub Sitnicki authored
      Now that sockmap/sockhash can hold listening sockets, when setting up the
      psock we will (i) grab references to verdict/parser progs, and (ii) override
      socket upcalls sk_data_ready and sk_write_space.
      
      However, since we cannot redirect to listening sockets, we don't need to
      link the socket to the BPF progs. More importantly, we don't want the
      listening socket to have overridden upcalls because they would get
      inherited by child sockets cloned from it.
      
      Introduce a separate initialization path for listening sockets that does
      not change the upcalls and ignores the BPF progs.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-6-jakub@cloudflare.com
    • bpf, sockmap: Allow inserting listening TCP sockets into sockmap · 8ca30379
      Jakub Sitnicki authored
      In order for sockmap/sockhash types to become generic collections for
      storing TCP sockets we need to loosen the checks during map update, while
      tightening the checks in redirect helpers.
      
      Currently sock{map,hash} require the TCP socket to be in established state,
      which prevents inserting listening sockets.
      
      Change the update pre-checks so the socket can also be in listening state.
      
      Since it doesn't make sense to redirect with sock{map,hash} to listening
      sockets, add appropriate socket state checks to BPF redirect helpers too.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-5-jakub@cloudflare.com
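The loosened update check and the tightened redirect check amount to two predicates over the socket state. A minimal user-space C sketch, with an illustrative state enum and function names that are not the kernel's definitions:

```c
#include <stdbool.h>

/* Only the states discussed above; values are illustrative, not the kernel's. */
enum tcp_state_stub { ST_ESTABLISHED, ST_LISTEN, ST_CLOSE };

/* Map update: accept sockets that are established or listening. */
static bool update_okay(enum tcp_state_stub st)
{
	return st == ST_ESTABLISHED || st == ST_LISTEN;
}

/* Redirect helpers: only an established socket is a valid target. */
static bool redirect_okay(enum tcp_state_stub st)
{
	return st == ST_ESTABLISHED;
}
```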
    • tcp_bpf: Don't let child socket inherit parent protocol ops on copy · e8025155
      Jakub Sitnicki authored
      Prepare for cloning listening sockets that have their protocol callbacks
      overridden by sk_msg. Child sockets must not inherit parent callbacks that
      access state stored in sk_user_data owned by the parent.
      
      Restore the child socket protocol callbacks before it gets hashed and any
      of the callbacks can get invoked.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
    • net, sk_msg: Clear sk_user_data pointer on clone if tagged · f1ff5ce2
      Jakub Sitnicki authored
      sk_user_data can hold a pointer to an object that is not intended to be
      shared between the parent socket and the child that gets a pointer copy on
      clone. This is the case when sk_user_data points at a reference-counted
      object, like struct sk_psock.
      
      One way to resolve it is to tag the pointer with a no-copy flag by
      repurposing its lowest bit. Based on the bit-flag value we clear the child
      sk_user_data pointer after cloning the parent socket.
      
      The no-copy flag is stored in the pointer itself as opposed to externally,
      say in socket flags, to guarantee that the pointer and the flag are copied
      from parent to child socket in an atomic fashion. Parent socket state is
      subject to change while copying, as we don't hold any locks at that time.
      
      This approach relies on an assumption that sk_user_data holds a pointer to
      an object aligned to at least 2 bytes. A manual audit of existing users of
      rcu_dereference_sk_user_data helper confirms our assumption.
      
      Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
      char value or a pathological case of "struct { char c; }". To be safe, warn
      when the flag-bit is set when setting sk_user_data to catch any future
      misuses.
      
      It is worth considering why clearing sk_user_data unconditionally is not an
      option. There exist users, DRBD, NVMe, and Xen drivers being among them,
      that rely on the pointer being copied when cloning the listening socket.
      
      Potentially we could distinguish these users by checking if the listening
      socket has been created in kernel-space via sock_create_kern, and hence has
      sk_kern_sock flag set. However, this is not the case for NVMe and Xen
      drivers, which create sockets without marking them as belonging to the
      kernel.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
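The low-bit tagging idea from this commit message can be shown in a few lines of user-space C. The structure, macro, and function names below are hypothetical stand-ins, not the kernel's; the sketch relies on the same assumption stated above, that the pointed-to object is aligned to at least 2 bytes.

```c
#include <stdint.h>
#include <stddef.h>

#define USER_DATA_NOCOPY ((uintptr_t)1)   /* lowest bit marks "do not copy" */

struct sock_stub { uintptr_t user_data; };

/* Store a pointer together with the no-copy flag; needs >= 2-byte alignment. */
static void set_user_data_nocopy(struct sock_stub *sk, void *ptr)
{
	sk->user_data = (uintptr_t)ptr | USER_DATA_NOCOPY;
}

/* Strip the flag bit to recover the original pointer. */
static void *get_user_data(const struct sock_stub *sk)
{
	return (void *)(sk->user_data & ~USER_DATA_NOCOPY);
}

/* On clone, pointer and flag arrive in one word; clear the child if tagged. */
static void clone_sock(struct sock_stub *child, const struct sock_stub *parent)
{
	*child = *parent;
	if (child->user_data & USER_DATA_NOCOPY)
		child->user_data = 0;
}
```

Keeping the flag inside the pointer word means the child can never observe a pointer from one moment and a flag from another, which is the reason the commit message gives for not storing the flag elsewhere.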
    • net, sk_msg: Annotate lockless access to sk_prot on clone · b8e202d1
      Jakub Sitnicki authored
      sk_msg and ULP frameworks override protocol callbacks pointer in
      sk->sk_prot, while tcp accesses it locklessly when cloning the listening
      socket, that is with neither sk_lock nor sk_callback_lock held.
      
      Once we enable use of listening sockets with sockmap (and hence sk_msg),
      there will be shared access to sk->sk_prot if socket is getting cloned
      while being inserted/deleted to/from the sockmap from another CPU:
      
      Read side:
      
      tcp_v4_rcv
        sk = __inet_lookup_skb(...)
        tcp_check_req(sk)
          inet_csk(sk)->icsk_af_ops->syn_recv_sock
            tcp_v4_syn_recv_sock
              tcp_create_openreq_child
                inet_csk_clone_lock
                  sk_clone_lock
                    READ_ONCE(sk->sk_prot)
      
      Write side:
      
      sock_map_ops->map_update_elem
        sock_map_update_elem
          sock_map_update_common
            sock_map_link_no_progs
              tcp_bpf_init
                tcp_bpf_update_sk_prot
                  sk_psock_update_proto
                    WRITE_ONCE(sk->sk_prot, ops)
      
      sock_map_ops->map_delete_elem
        sock_map_delete_elem
          __sock_map_delete
           sock_map_unref
             sk_psock_put
               sk_psock_drop
                 sk_psock_restore_proto
                   tcp_update_ulp
                     WRITE_ONCE(sk->sk_prot, proto)
      
      Mark the shared access with READ_ONCE/WRITE_ONCE annotations.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
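For readers outside the kernel, the paired annotations can be approximated in user-space C with relaxed C11 atomics. This is an analogue only: READ_ONCE and WRITE_ONCE are kernel macros, not C11 atomics, and the structure and function names below are hypothetical.

```c
#include <stdatomic.h>

struct proto_stub { const char *name; };

struct sock_stub {
	/* Written by map update/delete, read locklessly by the clone path. */
	_Atomic(const struct proto_stub *) prot;
};

/* Writer side: publish the new callbacks pointer in a single store. */
static void sock_set_prot(struct sock_stub *sk, const struct proto_stub *p)
{
	atomic_store_explicit(&sk->prot, p, memory_order_relaxed);
}

/* Reader side: load the pointer exactly once and use only that snapshot. */
static const struct proto_stub *sock_get_prot(struct sock_stub *sk)
{
	return atomic_load_explicit(&sk->prot, memory_order_relaxed);
}
```

The point of the pairing is that the reader sees either the old or the new pointer in full, and that the compiler does not re-read the field midway through the clone.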
    • Merge tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog · 0c0ddd6a
      Linus Torvalds authored
      Pull watchdog fixes from Wim Van Sebroeck:
      
       - mtk_wdt needs RESET_CONTROLLER to build
      
       - da9062 driver fixes:
           - fix power management ops
           - do not ping the hw during stop()
           - add dependency on I2C
      
      * tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: da9062: Add dependency on I2C
        watchdog: da9062: fix power management ops
        watchdog: da9062: do not ping the hw during stop()
        watchdog: fix mtk_wdt.c RESET_CONTROLLER build error
    • Merge tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · bb65619e
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small char/misc driver fixes for 5.6-rc3.
      
        Also included in here are some updates for some documentation files
        that I seem to be maintaining these days.
      
        The driver fixes are:
         - small fixes for the habanalabs driver
         - fsi driver bugfix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        Documentation/process: Swap out the ambassador for Canonical
        habanalabs: patched cb equals user cb in device memset
        habanalabs: do not halt CoreSight during hard reset
        habanalabs: halt the engines before hard-reset
        MAINTAINERS: remove unnecessary ':' characters
        fsi: aspeed: add unspecified HAS_IOMEM dependency
        COPYING: state that all contributions really are covered by this file
        Documentation/process: Change Microsoft contact for embargoed hardware issues
        embargoed-hardware-issues: drop Amazon contact as the email address now bounces
        Documentation/process: Add Arm contact for embargoed HW issues
    • Merge tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · e5553ac7
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are some small staging driver fixes for 5.6-rc3, along with the
        removal of an unused/unneeded driver as well.
      
        The android vsoc driver is not needed anymore by anyone, so it was
        removed.
      
        The other driver fixes are:
         - ashmem bugfixes
         - greybus audio driver bugfix
         - wireless driver bugfixes and tiny cleanups to error paths
      
        All of these have been in linux-next for a while now with no reported
        issues"
      
      * tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: rtl8723bs: Remove unneeded goto statements
        staging: rtl8188eu: Remove some unneeded goto statements
        staging: rtl8723bs: Fix potential overuse of kernel memory
        staging: rtl8188eu: Fix potential overuse of kernel memory
        staging: rtl8723bs: Fix potential security hole
        staging: rtl8188eu: Fix potential security hole
        staging: greybus: use after free in gb_audio_manager_remove_all()
        staging: android: Delete the 'vsoc' driver
        staging: rtl8723bs: fix copy of overlapping memory
        staging: android: ashmem: Disallow ashmem memory from being remapped
        staging: vt6656: fix sign of rx_dbm to bb_pre_ed_rssi.
      e5553ac7
    • Linus Torvalds's avatar
      Merge tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · ef11f1b7
      Linus Torvalds authored
      Pull tty/serial driver fixes from Greg KH:
       "Here are a number of small tty and serial driver fixes for 5.6-rc3
        that resolve a bunch of reported issues.
      
        They are:
         - vt selection and ioctl fixes
         - serdev bugfix
         - atmel serial driver fixes
         - qcom serial driver fixes
         - other minor serial driver fixes
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        vt: selection, close sel_buffer race
        vt: selection, handle pending signals in paste_selection
        serial: cpm_uart: call cpm_muram_init before registering console
        tty: serial: qcom_geni_serial: Fix RX cancel command failure
        serial: 8250: Check UPF_IRQ_SHARED in advance
        tty: serial: imx: setup the correct sg entry for tx dma
        vt: vt_ioctl: fix race in VT_RESIZEX
        vt: fix scrollback flushing on background consoles
        tty: serial: tegra: Handle RX transfer in PIO mode if DMA wasn't started
        tty/serial: atmel: manage shutdown in case of RS485 or ISO7816 mode
        serdev: ttyport: restore client ops on deregistration
        serial: ar933x_uart: set UART_CS_{RX,TX}_READY_ORIDE
      ef11f1b7
    • Linus Torvalds's avatar
      Merge tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · cee853e8
      Linus Torvalds authored
      Pull USB/Thunderbolt fixes from Greg KH:
       "Here are a number of small USB driver fixes for 5.6-rc3.
      
        Included in here are:
        - MAINTAINER file updates
        - USB gadget driver fixes
        - usb core quirk additions and fixes for regressions
        - xhci driver fixes
        - usb serial driver id additions and fixes
        - thunderbolt bugfix
      
        Thunderbolt patches come in through here now that USB4 is really
        thunderbolt.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (34 commits)
        USB: misc: iowarrior: add support for the 100 device
        thunderbolt: Prevent crash if non-active NVMem file is read
        usb: gadget: udc-xilinx: Fix xudc_stop() kernel-doc format
        USB: misc: iowarrior: add support for the 28 and 28L devices
        USB: misc: iowarrior: add support for 2 OEMed devices
        USB: Fix novation SourceControl XL after suspend
        xhci: Fix memory leak when caching protocol extended capability PSI tables - take 2
        Revert "xhci: Fix memory leak when caching protocol extended capability PSI tables"
        MAINTAINERS: Sort entries in database for THUNDERBOLT
        usb: dwc3: debug: fix string position formatting mixup with ret and len
        usb: gadget: serial: fix Tx stall after buffer overflow
        usb: gadget: ffs: ffs_aio_cancel(): Save/restore IRQ flags
        usb: dwc2: Fix SET/CLEAR_FEATURE and GET_STATUS flows
        usb: dwc2: Fix in ISOC request length checking
        usb: gadget: composite: Support more than 500mA MaxPower
        usb: gadget: composite: Fix bMaxPower for SuperSpeedPlus
        usb: gadget: u_audio: Fix high-speed max packet size
        usb: dwc3: gadget: Check for IOC/LST bit in TRB->ctrl fields
        USB: core: clean up endpoint-descriptor parsing
        USB: quirks: blacklist duplicate ep on Sound Devices USBPre2
        ...
      cee853e8
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm · 88f8bbfa
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Varied fixes for rc3.
      
        i915 is the largest, they are seeing some ACPI problems with their CI
        which hopefully get solved soon [1].
      
        msm has a bunch of fixes for new hw added in the merge, a bunch of
        amdgpu fixes, and nouveau adds support for some new firmwares for
        turing tu11x GPUs that were just released into linux-firmware by
        nvidia, they operate the same as the ones we already have for tu10x so
        should be fine to hook up.
      
        Otherwise it's just misc fixes for panfrost and sun4i.
      
        core:
         - Allow only one rotation argument, and allow zero rotation in video
           cmdline.
      
        i915:
         - Workaround missing Display Stream Compression (DSC) state readout
           by forcing modeset when it's enabled at probe
         - Fix EHL port clock voltage level requirements
         - Fix queuing retire workers on the virtual engine
         - Fix use of partially initialized waiters
         - Stop using drm_pci_alloc/drm_pci_free
         - Fix rewind of RING_TAIL by forcing a context reload
         - Fix locking on resetting ring->head
         - Propagate our bug filing URL change to stable kernels
      
        panfrost:
         - Small compiler warning fix for panfrost.
         - Fix use of performance counters in panfrost when using a per-fd
           address space.
      
        sun4i:
         - Fix dt binding
      
        nouveau:
         - tu11x modesetting fix
         - ACR/GR firmware support for tu11x (fw is public now)
      
        msm:
         - fix UBWC on GPU and display side for sc7180
         - fix DSI suspend/resume issue encountered on sc7180
         - fix some breakage on so called "linux-android" devices
            (fallout from sc7180/a618 support, not seen earlier due to
             bootloader/firmware differences)
         - couple other misc fixes
      
        amdgpu:
         - HDCP fixes
         - xclk fix for raven
         - GFXOFF fixes"
      
      [1] The Intel suspend testing should now be fixed by commit 63fb9623
          ("ACPI: PM: s2idle: Check fixed wakeup events in acpi_s2idle_wake()")
      
      * tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm: (39 commits)
        drm/amdgpu/display: clean up hdcp workqueue handling
        drm/amdgpu: add is_raven_kicker judgement for raven1
        drm/i915/gt: Avoid resetting ring->head outside of its timeline mutex
        drm/i915/execlists: Always force a context reload when rewinding RING_TAIL
        drm/i915: Wean off drm_pci_alloc/drm_pci_free
        drm/i915/gt: Protect defer_request() from new waiters
        drm/i915/gt: Prevent queuing retire workers on the virtual engine
        drm/i915/dsc: force full modeset whenever DSC is enabled at probe
        drm/i915/ehl: Update port clock voltage level requirements
        drm/i915: Update drm/i915 bug filing URL
        MAINTAINERS: Update drm/i915 bug filing URL
        drm/i915: Initialise basic fence before acquiring seqno
        drm/i915/gem: Require per-engine reset support for non-persistent contexts
        drm/nouveau/kms/gv100-: Re-set LUT after clearing for modesets
        drm/nouveau/gr/tu11x: initial support
        drm/nouveau/acr/tu11x: initial support
        drm/amdgpu/gfx10: disable gfxoff when reading rlc clock
        drm/amdgpu/gfx9: disable gfxoff when reading rlc clock
        drm/amdgpu/soc15: fix xclk for raven
        drm/amd/powerplay: always refetch the enabled features status on dpm enablement
        ...
      88f8bbfa
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 3dc55dba
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
          Cong Wang.
      
       2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
          is finished, from Magnus Karlsson.
      
       3) Set table field properly to RT_TABLE_COMPAT when necessary, from
          Jethro Beekman.
      
       4) NLA_STRING attributes are not necessarily NULL terminated, deal with
          that in IFLA_ALT_IFNAME. From Eric Dumazet.
      
       5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.
      
       6) Handle mtu==0 devices properly in wireguard, from Jason A.
          Donenfeld.
      
       7) Fix several lockdep warnings in bonding, from Taehee Yoo.
      
       8) Fix cls_flower port blocking, from Jason Baron.
      
       9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.
      
      10) Fix RDMA race in qede driver, from Michal Kalderon.
      
      11) Fix several false lockdep warnings by adding conditions to
          list_for_each_entry_rcu(), from Madhuparna Bhowmik.
      
      12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.
      
      13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.
      
      14) Hey, variables declared in switch statement before any case
          statements are not initialized. I learn something every day. Get
          rid of this stuff in several parts of networking, from Kees
          Cook.
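
      Item 14 refers to a C rule: a declaration placed between `switch (...)`
      and the first `case` label is in scope, but its initializer never runs
      because control jumps straight to a label. The sketch below shows the
      "distribute" pattern of moving each variable into the case that uses it.
      The names `payload_len` and `kind` are hypothetical and not taken from
      the kernel patches.

```c
#include <stddef.h>
#include <string.h>

/*
 * Problematic shape (shown only as a comment):
 *
 *	switch (kind) {
 *		size_t len = 0;		// in scope, but the "= 0" is skipped
 *	case 1:
 *		...
 *	}
 *
 * Fixed shape: each case declares and initializes its own variable.
 */
size_t payload_len(int kind, const char *name)
{
	switch (kind) {
	case 1: {
		size_t len = strlen(name);	/* initialized on entry to this block */
		return len;
	}
	case 2: {
		size_t len = 0;			/* initializer is actually executed */
		return len;
	}
	default:
		return (size_t)-1;		/* unknown kind */
	}
}
```

      Scoping each variable to its own braced case body keeps the initializer
      on a path that control flow really takes.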
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
        bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
        bnxt_en: Improve device shutdown method.
        net: netlink: cap max groups which will be considered in netlink_bind()
        net: thunderx: workaround BGX TX Underflow issue
        ionic: fix fw_status read
        net: disable BRIDGE_NETFILTER by default
        net: macb: Properly handle phylink on at91rm9200
        s390/qeth: fix off-by-one in RX copybreak check
        s390/qeth: don't warn for napi with 0 budget
        s390/qeth: vnicc Fix EOPNOTSUPP precedence
        openvswitch: Distribute switch variables for initialization
        net: ip6_gre: Distribute switch variables for initialization
        net: core: Distribute switch variables for initialization
        udp: rehash on disconnect
        net/tls: Fix to avoid gettig invalid tls record
        bpf: Fix a potential deadlock with bpf_map_do_batch
        bpf: Do not grab the bucket spinlock by default on htab batch ops
        ice: Wait for VF to be reset/ready before configuration
        ice: Don't tell the OS that link is going down
        ice: Don't reject odd values of usecs set by user
        ...
      3dc55dba
    • David S. Miller's avatar
      Merge branch 'Migrate-QRTR-Nameservice-to-Kernel' · b4d9785c
      David S. Miller authored
      Manivannan Sadhasivam says:
      
      ====================
      Migrate QRTR Nameservice to Kernel
      
      This patchset migrates the Qualcomm IPC Router (QRTR) Nameservice from userspace
      to kernel under net/qrtr.
      
      The userspace implementation of it can be found here:
      https://github.com/andersson/qrtr/blob/master/src/ns.c
      
      This change is required for enabling the WiFi functionality of some Qualcomm
      WLAN devices using ATH11K without any dependency on a userspace daemon. Since
      the QRTR NS is not usually packaged in most of the distros, users need to clone,
      build and install it to get the WiFi working. It will become a hassle when the
      user doesn't have any other source of network connectivity.
      
      The original userspace code is published under the BSD3 license. For migrating it
      to the Linux kernel, I have adopted the Dual BSD/GPL license.
      
      This patchset has been verified on Dragonboard410c and Intel NUC with QCA6390
      WLAN device.
      
      Changes in v2:
      
      * Sorted the local variables in reverse XMAS tree order
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4d9785c
    • Manivannan Sadhasivam's avatar
      net: qrtr: Fix the local node ID as 1 · 31d6cbee
      Manivannan Sadhasivam authored
      In order to start the QRTR nameservice, the local node ID needs to be
      valid. Hence, fix it to 1. Previously, the node ID was configured through
      a userspace tool before starting the nameservice daemon. Since we have now
      integrated the nameservice handling into the kernel, this change is necessary
      for making it functional.
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31d6cbee
    • Manivannan Sadhasivam's avatar
      net: qrtr: Migrate nameservice to kernel from userspace · 0c2204a4
      Manivannan Sadhasivam authored
      The QRTR nameservice has been maintained in userspace for some time. This
      commit migrates it to the Linux kernel. This change is required in order to
      eliminate the need to start a userspace daemon for making the WiFi
      functional for ath11k based devices. Since the QRTR NS is not usually
      packaged in most of the distros, users need to clone, build and install it
      to get the WiFi working. It will become a hassle when the user doesn't
      have any other source of network connectivity.
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c2204a4
    • Dan Murphy's avatar
      net: phy: dp83867: Add speed optimization feature · cd26d72d
      Dan Murphy authored
      Set the speed optimization bit on the DP83867 PHY.
      This feature can also be strapped on the 64 pin PHY devices
      but the 48 pin devices do not have the strap pin available to enable
      this feature in the hardware.  The PHY team suggests setting this bit.
      
      With this bit set the PHY will auto negotiate and report the link
      parameters in the PHYSTS register.  This register provides a single
      location within the register set for quick access to commonly accessed
      information.
      
      In this case, when auto negotiation is on, the PHY core reads the bits
      that have been configured; if auto negotiation is off, the PHY core
      reads the BMCR register and sets the phydev parameters accordingly.

      This Gigabit PHY can throttle the speed to 100Mbps or 10Mbps to accommodate a
      4-wire cable.  If this should occur the PHYSTS register contains the
      current negotiated speed and duplex mode.
      
      In overriding the genphy_read_status the dp83867_read_status will do a
      genphy_read_status to setup the LP and pause bits.  And then the PHYSTS
      register is read and the phydev speed and duplex mode settings are
      updated.
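
      The last step described above, turning a PHYSTS value into speed and
      duplex settings, can be sketched in plain userspace C. The bit positions
      used here (bit 15 for 1000Mbps, bit 14 for 100Mbps, bit 13 for full
      duplex) are assumptions for illustration and should be checked against
      the DP83867 datasheet; `struct link_params` and `decode_physts()` are
      hypothetical names, not the driver's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PHYSTS bit layout; verify against the datasheet. */
#define PHYSTS_1000	(1u << 15)
#define PHYSTS_100	(1u << 14)
#define PHYSTS_DUPLEX	(1u << 13)

/* Hypothetical stand-in for the speed/duplex fields of phydev. */
struct link_params {
	int speed_mbps;
	bool full_duplex;
};

/* Decode the negotiated link parameters from a raw PHYSTS value. */
struct link_params decode_physts(uint16_t physts)
{
	struct link_params lp;

	lp.full_duplex = (physts & PHYSTS_DUPLEX) != 0;

	if (physts & PHYSTS_1000)
		lp.speed_mbps = 1000;
	else if (physts & PHYSTS_100)
		lp.speed_mbps = 100;
	else
		lp.speed_mbps = 10;	/* neither speed bit set */

	return lp;
}
```

      Reading the result from one status register means a link that was
      throttled for a 4-wire cable is reported at its actual speed rather
      than the advertised one.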
      Signed-off-by: Dan Murphy <dmurphy@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd26d72d
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · b0dd1eb2
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
      
       - A few y2038 fixes which missed the merge window while dependencies
         in NFS were being sorted out.
      
       - A bunch of fixes. Some minor, some not.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        MAINTAINERS: use tabs for SAFESETID
        lib/stackdepot.c: fix global out-of-bounds in stack_slabs
        mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
        mm/vmscan.c: don't round up scan size for online memory cgroup
        lib/string.c: update match_string() doc-strings with correct behavior
        mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()
        mm/swapfile.c: fix a comment in sys_swapon()
        scripts/get_maintainer.pl: deprioritize old Fixes: addresses
        get_maintainer: remove uses of P: for maintainer name
        selftests/vm: add missed tests in run_vmtests
        include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
        Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"
        y2038: hide timeval/timespec/itimerval/itimerspec types
        y2038: remove unused time32 interfaces
        y2038: remove ktime to/from timespec/timeval conversion
      b0dd1eb2
    • Randy Dunlap's avatar
      MAINTAINERS: use tabs for SAFESETID · bb8d00ff
      Randy Dunlap authored
      Use tabs for indentation instead of spaces for SAFESETID.  All (!) other
      entries in MAINTAINERS use tabs (according to my simple grepping).
      
      Link: http://lkml.kernel.org/r/2bb2e52a-2694-816d-57b4-6cabfadd6c1a@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Micah Morton <mortonm@chromium.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8d00ff
    • Alexander Potapenko's avatar
      lib/stackdepot.c: fix global out-of-bounds in stack_slabs · 305e519c
      Alexander Potapenko authored
      Walter Wu has reported a potential case in which init_stack_slab() is
      called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
      initialized.  In that case init_stack_slab() will overwrite
      stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in a memory
      corruption.
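
      The out-of-bounds write described above comes from storing into slot
      `index + 1` without checking that the slot exists. A minimal userspace
      sketch of a guarded store follows; `MAX_SLABS`, `slabs` and
      `install_next_slab()` are hypothetical stand-ins, not the kernel's
      identifiers, and this is not claimed to be the exact upstream fix.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLABS 4	/* hypothetical stand-in for STACK_ALLOC_MAX_SLABS */

static void *slabs[MAX_SLABS];

/*
 * Hand a preallocated buffer to the slot after `index`.
 * Returns true and clears *prealloc when the buffer was consumed,
 * false when `index` is already the last valid slot.
 */
bool install_next_slab(size_t index, void **prealloc)
{
	if (index + 1 < MAX_SLABS) {
		slabs[index + 1] = *prealloc;
		*prealloc = NULL;	/* ownership moved into the table */
		return true;
	}
	return false;			/* would have written past the array */
}
```

      When the last slot is already in use, the caller keeps ownership of the
      buffer and nothing past the end of the array is touched.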
      
      Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reported-by: Walter Wu <walter-zh.wu@mediatek.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      305e519c
    • Wei Yang's avatar
      mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM · 18e19f19
      Wei Yang authored
      When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
      doesn't work before sparse_init_one_section() is called.
      
      This leads to a crash when hotplugging memory:
      
          BUG: unable to handle page fault for address: 0000000006400000
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G        W         5.5.0-next-20200205+ #343
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:__memset+0x24/0x30
          Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
          RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
          RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
          RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
          RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
          R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
          R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
          FS:  0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           sparse_add_section+0x1c9/0x26a
           __add_pages+0xbf/0x150
           add_pages+0x12/0x60
           add_memory_resource+0xc8/0x210
           __add_memory+0x62/0xb0
           acpi_memory_device_add+0x13f/0x300
           acpi_bus_attach+0xf6/0x200
           acpi_bus_scan+0x43/0x90
           acpi_device_hotplug+0x275/0x3d0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x1a7/0x370
           worker_thread+0x30/0x380
           kthread+0x112/0x130
           ret_from_fork+0x35/0x40
      
      We should use memmap, as the code did before.
      
      On x86 the impact is limited to x86_32 builds, or x86_64 configurations
      that override the default setting for SPARSEMEM_VMEMMAP.
      
      Other memory hotplug archs (arm64, ia64, and ppc) also default to
      SPARSEMEM_VMEMMAP=y.
      
      [dan.j.williams@intel.com: changelog update]
      [rppt@linux.ibm.com: changelog update]
      Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18e19f19
    • Gavin Shan's avatar
      mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Gavin Shan authored
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: the corresponding LRU list is eligible for
      reclaiming only when its size logically right shifted by @sc->priority
      is bigger than zero in the former formula.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          (((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaning the further iteration on @sc->priority and the sibling and child
      memory cgroups will be skipped.  The overall behaviour has been changed.
      This fixes the issue by applying the roundup to offlined memory cgroups
      only, to give more preference to reclaim memory from offlined memory
      cgroup.  It sounds reasonable as that memory is unlikely to be used by
      anyone.
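
      The arithmetic in the two formulas above, and the effect of rounding up
      only for offlined cgroups, can be reproduced with a small userspace
      sketch. `scan_size()` and its parameters are hypothetical names; the
      kernel's own helpers are not used here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pages to scan for one LRU list.
 * lru_pages:   size of the LRU list in pages
 * priority:    reclaim priority (the list size is shifted right by this)
 * fraction:    weight of this list, e.g. swappiness
 * denominator: sum of the weights plus one
 * online:      whether the memory cgroup is still online
 */
uint64_t scan_size(uint64_t lru_pages, unsigned int priority,
		   uint64_t fraction, uint64_t denominator, bool online)
{
	uint64_t scan = (lru_pages >> priority) * fraction;

	if (online)
		return scan / denominator;		/* plain division */

	/* Offlined cgroup: round up so a small list is still scanned. */
	return (scan + denominator - 1) / denominator;
}
```

      With swappiness 60, priority 12 and a denominator of 60 + 140 + 1 = 201,
      an online list of 0x4000 pages yields 1, an online list of 0x1000 pages
      yields 0, and an offlined list of 0x1000 pages yields 1.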
      
      The issue was found by starting up 8 VMs on an Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given 2 vCPUs and
      2GB memory.  It took 264 seconds for all VMs to be completely up and
      784MB swap is consumed after that.  With this patch applied, it took 236
      seconds and 60MB swap to do the same thing.  So there is a 10% performance
      improvement for my case.  Note that KSM is disabled while THP is enabled
      in the testing.
      
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76073c64
    • Alexandru Ardelean's avatar
      lib/string.c: update match_string() doc-strings with correct behavior · c11d3fa0
      Alexandru Ardelean authored
      There were a few attempts at changing behavior of the match_string()
      helpers (i.e.  'match_string()' & 'sysfs_match_string()'), to change &
      extend the behavior according to the doc-string.
      
      But the simplest approach is to just fix the doc-strings.  The current
      behavior is fine as-is, and some bugs were introduced trying to fix it.
      
      As for extending the behavior, new helpers can always be introduced if
      needed.
      
      The match_string() helpers behave more like 'strncmp()' in the sense
      that they go up to n elements or until the first NULL element in the
      array of strings.
      
      This change updates the doc-strings with this info.
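
      The documented behavior, stopping after n elements or at the first NULL
      entry, whichever comes first, can be illustrated with a userspace
      re-implementation. `match_string_sketch()` is a hypothetical name, and
      it returns -1 on failure rather than a kernel error code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the index of `string` in `array`, or -1 if it is not found.
 * The search stops after `n` elements or at the first NULL element,
 * so entries placed after a NULL are never examined.
 */
int match_string_sketch(const char *const *array, size_t n, const char *string)
{
	for (size_t i = 0; i < n; i++) {
		if (array[i] == NULL)
			break;			/* NULL terminates the search early */
		if (strcmp(array[i], string) == 0)
			return (int)i;
	}
	return -1;
}
```

      For an array { "off", "on", NULL, "auto" } searched with n = 4, "on" is
      found at index 1 while "auto" is not found, because the NULL entry ends
      the walk before it is reached.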
      
      Link: http://lkml.kernel.org/r/20200213072722.8249-1-alexandru.ardelean@analog.com
      Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
      Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Tobin C . Harding" <tobin@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c11d3fa0
    • Vasily Averin's avatar
      mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · 75866af6
      Vasily Averin authored
      for_each_mem_cgroup() increases the css reference counter for the memory
      cgroup and requires the use of mem_cgroup_iter_break() if the walk is cancelled.
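
      The rule can be modelled in userspace with a reference-counting
      iterator: each step takes a reference on the element it returns and
      drops the one on the previous element, so leaving the loop early
      without a "break" helper leaks one reference. All names below
      (`struct group`, `iter()`, `iter_break()`, `expand_all()`) are
      hypothetical and only mimic the shape of the kernel API.

```c
#include <stddef.h>

struct group {
	int refs;		/* reference count held by iterators */
	int ok;			/* nonzero if per-group work succeeds */
	struct group *next;
};

/* Advance the walk: take a ref on the next group, drop the ref on prev. */
static struct group *iter(struct group *head, struct group *prev)
{
	struct group *next = prev ? prev->next : head;

	if (next)
		next->refs++;
	if (prev)
		prev->refs--;
	return next;
}

/* Drop the reference still held on `pos` when a walk is abandoned. */
static void iter_break(struct group *pos)
{
	if (pos)
		pos->refs--;
}

/* Walk all groups; on the first failure, release the ref and bail out. */
int expand_all(struct group *head)
{
	for (struct group *g = iter(head, NULL); g; g = iter(head, g)) {
		if (!g->ok) {
			iter_break(g);	/* without this, g->refs stays at 1 */
			return -1;
		}
	}
	return 0;
}
```

      After an aborted walk every `refs` field is back to zero, which is the
      property the missing css_put broke.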
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75866af6
    • Christoph Hellwig's avatar