1. 04 Feb, 2022 11 commits
    • netfilter: ctnetlink: disable helper autoassign · d1ca60ef
      Florian Westphal authored
      When userspace, e.g. conntrackd, inserts an entry with a specified helper,
      it's possible that the helper is lost immediately after it's added:
      
      ctnetlink_create_conntrack
        -> nf_ct_helper_ext_add + assign helper
          -> ctnetlink_setup_nat
            -> ctnetlink_parse_nat_setup
               -> parse_nat_setup -> nfnetlink_parse_nat_setup
                                     -> nf_nat_setup_info
                                       -> nf_conntrack_alter_reply
                                         -> __nf_ct_try_assign_helper
      
      ... and __nf_ct_try_assign_helper will zero the helper again.
      
      Set the IPS_HELPER bit to bypass the auto-assign logic; it is unwanted
      here, just as when a helper is assigned via the ruleset.
      
      Dropped old 'not strictly necessary' comment, it referred to use of
      rcu_assign_pointer() before it got replaced by RCU_INIT_POINTER().
      
      NB: Fixes tag intentionally incorrect, this extends the referenced commit,
      but this change won't build without IPS_HELPER introduced there.
      
      Fixes: 6714cf54 ("netfilter: nf_conntrack: fix explicit helper attachment and NAT")
      Reported-by: Pham Thanh Tuyen <phamtyn@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
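The bypass described in the commit above can be sketched with a toy model. This is only an illustration under stated assumptions: `toy_conn`, `toy_helper`, `TOY_IPS_HELPER` and both functions are hypothetical stand-ins, not the kernel's `struct nf_conn`, its status bits, or `__nf_ct_try_assign_helper()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "helper was set explicitly" status bit. */
#define TOY_IPS_HELPER (1u << 0)

struct toy_helper { const char *name; };

struct toy_conn {
    unsigned int status;
    const struct toy_helper *helper;
};

/* Explicit assignment, e.g. userspace names a helper for the new entry.
 * mark_explicit models whether the status bit is set alongside it. */
static void toy_assign_helper_explicit(struct toy_conn *ct,
                                       const struct toy_helper *h,
                                       bool mark_explicit)
{
    assert(ct != NULL);
    ct->helper = h;
    if (mark_explicit)
        ct->status |= TOY_IPS_HELPER;
}

/* Auto-assign step that runs later (in the commit: during NAT setup).
 * auto_match is whatever automatic lookup found, possibly NULL. */
static void toy_try_assign_helper(struct toy_conn *ct,
                                  const struct toy_helper *auto_match)
{
    assert(ct != NULL);
    if (ct->status & TOY_IPS_HELPER)
        return;              /* explicit helper: leave it alone */
    ct->helper = auto_match; /* may overwrite with NULL */
}
```

Without the bit, a later auto-assign pass with no match clears the helper; with the bit set, the explicitly chosen helper survives.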
    • MAINTAINERS: netfilter: update git links · 1f6339e0
      Florian Westphal authored
      nf and nf-next have a new location.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: re-init state for retransmitted syn-ack · 82b72cb9
      Florian Westphal authored
      TCP conntrack assumes that a syn-ack retransmit is identical to the
      previous syn-ack.  This isn't correct and causes stuck 3whs in some more
      esoteric scenarios.  tcpdump to illustrate the problem:
      
       client > server: Flags [S] seq 1365731894, win 29200, [mss 1460,sackOK,TS val 2083035583 ecr 0,wscale 7]
       server > client: Flags [S.] seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215367629 ecr 2082921663]
      
      Note the invalid/outdated synack ack number.
      Conntrack marks this syn-ack as out-of-window/invalid, but it did
      initialize the reply direction parameters based on this packet's content.
      
       client > server: Flags [S] seq 1365731894, win 29200, [mss 1460,sackOK,TS val 2083036623 ecr 0,wscale 7]
      
      ... retransmit...
      
       server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215368644 ecr 2082921663]
      
      and another bogus synack. This repeats, then client re-uses for a new
      attempt:
      
      client > server: Flags [S], seq 2375731741, win 29200, [mss 1460,sackOK,TS val 2083100223 ecr 0,wscale 7]
      server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215430754 ecr 2082921663]
      
      ... but still gets an invalid syn-ack.
      
      This repeats until:
      
       server > client: Flags [S.], seq 145824453, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215437785 ecr 2082921663]
       server > client: Flags [R.], seq 145824454, ack 643160523, win 65535, [mss 8952,wscale 5,TS val 3215443451 ecr 2082921663]
       client > server: Flags [S], seq 2375731741, win 29200, [mss 1460,sackOK,TS val 2083115583 ecr 0,wscale 7]
       server > client: Flags [S.], seq 162602410, ack 2375731742, win 65535, [mss 8952,wscale 5,TS val 3215445754 ecr 2083115583]
      
      This syn-ack has the correct ack number, but conntrack flags it as
      invalid: The internal state was created from the first syn-ack seen
      so the sequence number of the syn-ack is treated as being outside of
      the announced window.
      
      Don't assume that retransmitted syn-ack is identical to previous one.
      Treat it like the first syn-ack and reinit state.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
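A much-simplified sketch of why re-initializing on every syn-ack helps, using the sequence numbers from the trace above. The struct, field names and window test are assumptions made for illustration; the real TCP tracker keeps more state and also validates the ack number, which this toy ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy reply-direction state; field names are illustrative only. */
struct toy_reply_state {
    bool     seen_synack;
    uint32_t td_end;     /* sequence number just past the SYN */
    uint32_t td_maxwin;  /* window announced by the syn-ack */
};

static void toy_init_from_synack(struct toy_reply_state *st,
                                 uint32_t seq, uint16_t win)
{
    st->seen_synack = true;
    st->td_end = seq + 1;          /* SYN consumes one sequence number */
    st->td_maxwin = win ? win : 1;
}

/* Wraparound-safe check: is seq within one window of the recorded end? */
static bool toy_seq_in_window(const struct toy_reply_state *st, uint32_t seq)
{
    uint32_t end = seq + 1;
    uint32_t ahead = end - st->td_end;
    uint32_t behind = st->td_end - end;

    return ahead <= st->td_maxwin || behind <= st->td_maxwin;
}

/* reinit == false: assume a later syn-ack matches the first one.
 * reinit == true:  treat every syn-ack like the first and reset state. */
static bool toy_handle_synack(struct toy_reply_state *st, uint32_t seq,
                              uint16_t win, bool reinit)
{
    assert(st != NULL);
    if (!st->seen_synack || reinit) {
        toy_init_from_synack(st, seq, win);
        return true;
    }
    return toy_seq_in_window(st, seq);
}
```

With `reinit` false, state built from the syn-ack at seq 145824453 rejects the later syn-ack at seq 162602410; with `reinit` true, the later syn-ack is accepted.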
    • netfilter: conntrack: move synack init code to helper · cc4f9d62
      Florian Westphal authored
      It seems more readable to use a common helper in the followup fix rather
      than copypaste or goto.
      
      No functional change intended.  The function is only called for syn-ack
      or syn in reply direction in case of simultaneous open.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nft_payload: don't allow th access for fragments · a9e8503d
      Florian Westphal authored
      Loads relative to ->thoff naturally expect that this points to the
      transport header, but this is only true if pkt->fragoff == 0.
      
      This has little effect for rulesets with connection tracking/nat because
      these enable ip defrag. For other rulesets this prevents false matches.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
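The rule from the commit above (a transport-header-relative load is only meaningful when the fragment offset is zero) can be sketched as follows. `toy_pktinfo` and `toy_load_th()` are hypothetical names for this example, not the nftables implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative packet metadata: where the transport header would start
 * and the fragment offset of this packet (0 for the first fragment or
 * for an unfragmented packet). */
struct toy_pktinfo {
    size_t       thoff;
    unsigned int fragoff;
};

/* Copy len bytes at transport-header offset `offset` into dst.
 * Returns 0 on success, -1 if the load must be refused. */
static int toy_load_th(const struct toy_pktinfo *pkt,
                       const uint8_t *data, size_t data_len,
                       size_t offset, uint8_t *dst, size_t len)
{
    assert(pkt != NULL && data != NULL && dst != NULL);

    /* Non-first fragment: thoff does not point at a transport header. */
    if (pkt->fragoff != 0)
        return -1;

    /* Bounds checks written to avoid size_t overflow. */
    if (pkt->thoff > data_len ||
        offset > data_len - pkt->thoff ||
        len > data_len - pkt->thoff - offset)
        return -1;

    memcpy(dst, data + pkt->thoff + offset, len);
    return 0;
}
```

Refusing the load outright, rather than reading whatever bytes sit at `thoff`, is what avoids the false matches mentioned in the commit message.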
    • netfilter: conntrack: don't refresh sctp entries in closed state · 77b33719
      Florian Westphal authored
      Vivek Thrivikraman reported:
       An SCTP server application which is accessed continuously by client
       application.
       When the session disconnects the client retries to establish a connection.
       After restart of SCTP server application the session is not established
       because of stale conntrack entry with connection state CLOSED as below.
      
       (removing this entry manually established new connection):
      
       sctp 9 CLOSED src=10.141.189.233 [..]  [ASSURED]
      
      Just skip timeout update of closed entries, we don't want them to
      stay around forever.
      Reported-and-tested-by: Vivek Thrivikraman <vivek.thrivikraman@est.tech>
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1579
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
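A minimal sketch of "skip timeout update of closed entries", assuming a toy entry with a state and a remaining timeout. The enum, struct and function are hypothetical; the real SCTP tracker has more states and conditions.

```c
#include <assert.h>
#include <stddef.h>

enum toy_sctp_state { TOY_SCTP_CLOSED, TOY_SCTP_ESTABLISHED };

struct toy_entry {
    enum toy_sctp_state state;
    unsigned int        timeout; /* seconds left before the entry expires */
};

/* Apply a packet that moves the entry to new_state. The timeout is only
 * refreshed when the entry is not simply staying in CLOSED, so a stale
 * closed entry can still expire while retries keep arriving. */
static void toy_sctp_packet(struct toy_entry *e,
                            enum toy_sctp_state new_state,
                            unsigned int fresh_timeout)
{
    enum toy_sctp_state old_state;

    assert(e != NULL);
    old_state = e->state;
    e->state = new_state;

    if (old_state == TOY_SCTP_CLOSED && new_state == TOY_SCTP_CLOSED)
        return; /* do not keep a closed entry alive */

    e->timeout = fresh_timeout;
}
```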
    • net: sparx5: Fix get_stat64 crash in tcpdump · ed14fc7a
      Steen Hegelund authored
      This problem was found with Sparx5 when the tcpdump tool requests the
      do_get_stats64 (sparx5_get_stats64) statistic.
      
      The portstats pointer was incorrectly incremented when fetching priority
      based statistics.
      
      Fixes: af4b1102 (net: sparx5: add ethtool configuration and statistics support)
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Link: https://lore.kernel.org/r/20220203102900.528987-1-steen.hegelund@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • gcc-plugins/stackleak: Use noinstr in favor of notrace · dcb85f85
      Kees Cook authored
      While the stackleak plugin was already using notrace, objtool is now a
      bit more picky.  Update the notrace uses to noinstr.  Silences the
      following objtool warnings when building with:
      
      CONFIG_DEBUG_ENTRY=y
      CONFIG_STACK_VALIDATION=y
      CONFIG_VMLINUX_VALIDATION=y
      CONFIG_GCC_PLUGIN_STACKLEAK=y
      
        vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section
      
      Note that the plugin's addition of calls to stackleak_track_stack() from
      noinstr functions is expected to be safe, as it isn't runtime
      instrumentation and is self-contained.
      
      Cc: Alexander Popov <alex.popov@linux.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · eb2eb516
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, netfilter, and ieee802154.
      
        Current release - regressions:
      
         - Partially revert "net/smc: Add netlink net namespace support", fix
           uABI breakage
      
         - netfilter:
            - nft_ct: fix use after free when attaching zone template
            - nft_byteorder: track register operations
      
        Previous releases - regressions:
      
         - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
      
         - phy: qca8081: fix speeds lower than 2.5Gb/s
      
         - sched: fix use-after-free in tc_new_tfilter()
      
        Previous releases - always broken:
      
         - tcp: fix mem under-charging with zerocopy sendmsg()
      
         - tcp: add missing tcp_skb_can_collapse() test in
           tcp_shift_skb_data()
      
         - neigh: do not trigger immediate probes on NUD_FAILED from
           neigh_managed_work, avoid a deadlock
      
         - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
           false-positives
      
         - netfilter: nft_reject_bridge: fix for missing reply from prerouting
      
         - smc: forward wakeup to smc socket waitqueue after fallback
      
         - ieee802154:
            - return meaningful error codes from the netlink helpers
            - mcr20a: fix lifs/sifs periods
            - at86rf230, ca8210: stop leaking skbs on error paths
      
         - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
      
         - ax25: add refcount in ax25_dev to avoid UAF bugs
      
         - eth: mlx5e:
            - fix SFP module EEPROM query
            - fix broken SKB allocation in HW-GRO
            - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
      
         - eth: amd-xgbe:
            - fix skb data length underflow
            - ensure reset of the tx_timer_active flag, avoid Tx timeouts
      
         - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
      
         - eth: e1000e: handshake with CSME starts from Alder Lake platforms"
      
      * tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
        ax25: fix reference count leaks of ax25_dev
        net: stmmac: ensure PTP time register reads are consistent
        net: ipa: request IPA register values be retained
        dt-bindings: net: qcom,ipa: add optional qcom,qmp property
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
        tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
        net: sparx5: do not refer to skb after passing it on
        Partially revert "net/smc: Add netlink net namespace support"
        net/mlx5e: Avoid field-overflowing memcpy()
        net/mlx5e: Use struct_group() for memcpy() region
        net/mlx5e: Avoid implicit modify hdr for decap drop rule
        net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
        net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
        net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
        net/mlx5: E-Switch, Fix uninitialized variable modact
        net/mlx5e: Fix handling of wrong devices during bond netevent
        net/mlx5e: Fix broken SKB allocation in HW-GRO
        net/mlx5e: Fix wrong calculation of header index in HW_GRO
        ...
    • Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 551007a8
      Linus Torvalds authored
      Pull selinux fix from Paul Moore:
       "One small SELinux patch to ensure that a policy structure field is
        properly reset after freeing so that we don't inadvertently do a
        double-free on certain error conditions"
      
      * tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix double free of cond_list on error paths
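The pattern named in the pull message above, resetting a field after freeing it so a second cleanup pass cannot free it again, might look like the sketch below. `toy_policy` and `toy_cond_list_destroy()` are hypothetical and do not reproduce the SELinux policy structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical policy object owning a heap-allocated list. */
struct toy_policy {
    int   *cond_list;
    size_t cond_list_len;
};

/* Free the list and reset the fields. Because free(NULL) is a no-op,
 * calling this twice (e.g. once on an error path and once in a generic
 * teardown) is harmless once the pointer has been cleared. */
static void toy_cond_list_destroy(struct toy_policy *p)
{
    assert(p != NULL);
    free(p->cond_list);
    p->cond_list = NULL;
    p->cond_list_len = 0;
}
```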
    • Merge tag 'linux-kselftest-fixes-5.17-rc3' of... · 25b20ae8
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Important fixes to several tests and documentation clarification on
        running mainline kselftest on stable releases. A few notable fixes:
      
         - fix kselftest run hang due to child processes that haven't been
           terminated. Fix signals all child processes
      
         - fix false pass/fail results from vdso_test_abi, openat2, mincore
      
         - build failures when using -j (multiple jobs) option
      
         - exec test build failure due to incorrect build rule for a run-time
           created "pipe"
      
         - zram test fixes related to interaction with zram-generator to make
           sure zram test to coordinate deleted with zram-generator
      
         - zram test compression ratio calculation fix and skipping
           max_comp_streams.
      
         - increasing rtc test timeout
      
         - cpufreq test to write test results to stdout which will be necessary
           on automated test systems"
      
      * tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kselftest: Fix vdso_test_abi return status
        selftests: skip mincore.check_file_mmap when fs lacks needed support
        selftests: openat2: Skip testcases that fail with EOPNOTSUPP
        selftests: openat2: Add missing dependency in Makefile
        selftests: openat2: Print also errno in failure messages
        selftests: futex: Use variable MAKE instead of make
        selftests/exec: Remove pipe from TEST_GEN_FILES
        selftests/zram: Adapt the situation that /dev/zram0 is being used
        selftests/zram01.sh: Fix compression ratio calculation
        selftests/zram: Skip max_comp_streams interface on newer kernel
        docs/kselftest: clarify running mainline tests on stables
        kselftest: signal all child processes
        selftests: cpufreq: Write test output to stdout as well
        selftests: rtc: Increase test timeout so that all tests run
  2. 03 Feb, 2022 16 commits
    • ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference count leaks of ax25_dev.
      
      Fixes: d01ffb9e ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
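The kind of leak fixed above can be sketched generically: a lookup takes a reference, and an early error return forgets to drop it. Everything here (`toy_dev`, `toy_dev_get()`, `toy_dev_put()`, the two ioctl functions) is hypothetical and only mirrors the get/put pairing idea, not the ax25 code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev {
    int refcnt;
    int usable;
};

/* Lookup that hands out a counted reference. */
static struct toy_dev *toy_dev_get(struct toy_dev *d)
{
    if (d)
        d->refcnt++;
    return d;
}

static void toy_dev_put(struct toy_dev *d)
{
    assert(d != NULL && d->refcnt > 0);
    d->refcnt--;
}

/* Leaky variant: the error path returns while still holding the ref. */
static int toy_ioctl_leaky(struct toy_dev *d)
{
    struct toy_dev *ref = toy_dev_get(d);

    if (!ref)
        return -1;
    if (!ref->usable)
        return -1;          /* reference is never dropped */
    toy_dev_put(ref);
    return 0;
}

/* Fixed variant: a single exit drops the reference on every path. */
static int toy_ioctl_fixed(struct toy_dev *d)
{
    struct toy_dev *ref = toy_dev_get(d);
    int err = 0;

    if (!ref)
        return -1;
    if (!ref->usable)
        err = -1;
    toy_dev_put(ref);
    return err;
}
```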
    • net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an error
      of about 1 second in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
      
      Fixes: 92ba6888 ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
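One common way to make a split seconds/nanoseconds read consistent is to re-read the seconds register and retry if it changed. The sketch below simulates that against a fake clock; `fake_ptp` and all functions are assumptions for illustration, not the stmmac driver's register accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC_U64 1000000000ULL

/* Fake hardware clock: after `tick_after` register accesses, the time
 * rolls over from (sec, 999999999) to (sec + 1, 0). */
struct fake_ptp {
    uint32_t     sec;
    uint32_t     nsec;
    unsigned int accesses;
    unsigned int tick_after;
};

static void fake_tick(struct fake_ptp *hw)
{
    if (++hw->accesses == hw->tick_after) {
        hw->sec++;
        hw->nsec = 0;
    }
}

static uint32_t read_sec(struct fake_ptp *hw)
{
    uint32_t v = hw->sec;
    fake_tick(hw);
    return v;
}

static uint32_t read_nsec(struct fake_ptp *hw)
{
    uint32_t v = hw->nsec;
    fake_tick(hw);
    return v;
}

/* Two independent reads: a rollover in between mixes old and new values. */
static uint64_t get_time_naive(struct fake_ptp *hw)
{
    uint64_t ns = read_nsec(hw);

    ns += (uint64_t)read_sec(hw) * NSEC_PER_SEC_U64;
    return ns;
}

/* Read seconds before and after nanoseconds; retry until both agree. */
static uint64_t get_time_consistent(struct fake_ptp *hw)
{
    uint32_t sec0, sec1, nsec;

    assert(hw != NULL);
    sec1 = read_sec(hw);
    do {
        sec0 = sec1;
        nsec = read_nsec(hw);
        sec1 = read_sec(hw);
    } while (sec0 != sec1);

    return (uint64_t)sec1 * NSEC_PER_SEC_U64 + nsec;
}
```

In the simulated rollover, the naive read pairs the old nanoseconds with the new seconds and lands almost a full second ahead, while the retry loop settles on a matching pair.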
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() · 1f2cfdd3
      Mickaël Salaün authored
      The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
      kernel/printk/sysctl.c introduced an incorrect __user attribute to the
      buffer argument.  I spotted this change in [1] as well as the kernel
      test robot.  Revert this change to please sparse:
      
        kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
        kernel/printk/sysctl.c:20:51:    expected void *
        kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer
      
      Fixes: faaa357a ("printk: move printk sysctl to printk/sysctl.c")
      Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net [1]
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
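The failure mode in the revert above, where the flag lands on the worker thread rather than on the thread that later checks it, can be modelled with a toy "current task" pointer. All names (`toy_task`, `toy_current`, `TOY_PF_USED_ASYNC`, and the functions) are hypothetical; the "async module probe is requested" exception from the commit message is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PF_USED_ASYNC 0x1u

struct toy_task { unsigned int flags; };

/* Stands in for the kernel's notion of the currently running task. */
static struct toy_task *toy_current;

/* Scheduling async work marks whichever task is current. */
static void toy_async_schedule(void)
{
    assert(toy_current != NULL);
    toy_current->flags |= TOY_PF_USED_ASYNC;
}

/* Run fn as if on a worker thread, then return to the caller's task. */
static void toy_work_on_worker(struct toy_task *worker, void (*fn)(void))
{
    struct toy_task *saved = toy_current;

    toy_current = worker;
    fn();
    toy_current = saved;
}

static void toy_probe(void)
{
    toy_async_schedule();
}

/* Returns whether module init would wait for async work to finish.
 * flag_based models the reverted behaviour (check the caller's flag);
 * otherwise init always synchronizes. worker may be NULL to probe
 * directly on the modprobe task. */
static bool toy_init_would_sync(struct toy_task *modprobe,
                                struct toy_task *worker, bool flag_based)
{
    toy_current = modprobe;
    if (worker)
        toy_work_on_worker(worker, toy_probe);
    else
        toy_probe();

    if (!flag_based)
        return true;
    return (toy_current->flags & TOY_PF_USED_ASYNC) != 0;
}
```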
    • Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 305e6c42
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
      
       - Eric's fix for a long standing cgroup1 permission issue where it only
         checks for uid 0 instead of CAP which inadvertently allows
         unprivileged userns roots to modify release_agent userhelper
      
       - Fixes for the fallout from Waiman's recent cpuset work
      
      * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
        cgroup-v1: Require capabilities to set release_agent
        cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
        cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
    • Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request undoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d ("net: ipa: use autosuspend")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning · 2bdfd282
      Waiman Long authored
      It was found that a "suspicious RCU usage" lockdep warning was issued
      with the rcu_read_lock() call in update_sibling_cpumasks().  It is
      because the update_cpumasks_hier() function may sleep. So we have
      to release the RCU lock, call update_cpumasks_hier() and reacquire
      it afterward.
      
      Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
      instead of stating that in the comment.
      
      Fixes: 4716909c ("cpuset: Track cpusets that use parent's effective_cpus")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Tested-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • tools/resolve_btfids: Do not print any commands when building silently · 7f3bdbc3
      Nathan Chancellor authored
      When building with 'make -s', there is some output from resolve_btfids:
      
      $ make -sj"$(nproc)" oldconfig prepare
        MKDIR     .../tools/bpf/resolve_btfids/libbpf/
        MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
        LINK     resolve_btfids
      
      Silent mode means that no information should be emitted about what is
      currently being done. Use the $(silent) variable from Makefile.include
      to avoid defining the msg macro so that there is no information printed.
      
      Fixes: fbbb68de ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org
    • Revert "mm/gup: small refactoring: simplify try_grab_page()" · c36c04c2
      John Hubbard authored
      This reverts commit 54d516b1
      
      That commit did a refactoring that effectively combined fast and slow
      gup paths (again).  And that was again incorrect, for two reasons:
      
       a) Fast gup and slow gup get reference counts on pages in different
          ways and with different goals: see Linus' writeup in commit
          cd1adf1b ("Revert "mm/gup: remove try_get_page(), call
          try_get_compound_head() directly""), and
      
       b) try_grab_compound_head() also has a specific check for
          "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
          can fall back to slow gup. This resulted in new failures, as
          recently reported by Will McVicker [1].
      
      But (a) has problems too, even though they may not have been reported
      yet.  So just revert this.
      
      Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
      Fixes: 54d516b1 ("mm/gup: small refactoring: simplify try_grab_page()")
      Reported-and-tested-by: Will McVicker <willmcvicker@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: stable@vger.kernel.org # 5.15
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · d394bb77
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - fix missed change for PTR->PTR_WD conversion
      
       - kernel-doc fixes
      
      * tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: KVM: fix vz.c kernel-doc notation
        MIPS: octeon: Fix missed PTR->PTR_WD conversion
    • bpf: Use VM_MAP instead of VM_ALLOC for ringbuf · b293dcc4
      Hou Tao authored
      After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
      after mapping"), non-VM_ALLOC mappings will be marked as accessible
      in __get_vm_area_node() when KASAN is enabled. But the flag for the
      ringbuf area is currently VM_ALLOC, so KASAN complains about
      out-of-bounds accesses after vmap() returns. Because the ringbuf area is
      created by mapping already-allocated pages, use VM_MAP instead.
      
      After the change, info in /proc/vmallocinfo also changes from
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
      to
        [start]-[end]   24576 ringbuf_map_alloc+0x171/0x290 vmap user
      
      Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
      b293dcc4
    • Daniel Borkmann's avatar
      net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry moved to the NUD_FAILED state, where __neigh_event_send()
      triggered an immediate probe via neigh_probe(), as introduced by commit
      cd28ca0a ("neigh: reduce arp latency"), while the table lock was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler(), as was done before cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given this only happens on
      actual failure state and regular / expected state is NUD_VALID with the
      entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
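
The effect of the new parameter can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: toy_table, toy_neigh, toy_probe(), toy_event_send(), toy_timer_handler() and toy_managed_work() are hypothetical stand-ins, and the table lock is modeled as a plain flag so that a re-entrant acquisition is counted instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the neighbour table. The lock is a flag so a
 * second acquisition can be detected instead of deadlocking. */
struct toy_table {
	bool lock_held;
	int deadlocks;       /* re-entrant lock attempts observed */
};

struct toy_neigh {
	struct toy_table *tbl;
	bool failed;         /* models the NUD_FAILED state */
	bool probe_pending;  /* probe deferred to the timer handler */
	int probes_sent;
};

/* Sending a probe goes through the output path, which may need to create a
 * neighbour entry and therefore take the table lock. */
static void toy_probe(struct toy_neigh *n)
{
	if (n->tbl->lock_held) {
		n->tbl->deadlocks++;  /* a real non-recursive lock would hang here */
		return;
	}
	n->tbl->lock_held = true;
	n->probes_sent++;
	n->tbl->lock_held = false;
}

/* Model of the event-send step: probe right away only when the caller
 * allows it, otherwise leave the probe to the timer. */
static void toy_event_send(struct toy_neigh *n, bool immediate_ok)
{
	if (!n->failed)
		return;
	if (immediate_ok)
		toy_probe(n);
	else
		n->probe_pending = true;
}

/* Timer handler: in this model it runs without the table lock held and
 * sends any deferred probe. */
static void toy_timer_handler(struct toy_neigh *n)
{
	if (n->probe_pending) {
		n->probe_pending = false;
		toy_probe(n);
	}
}

/* Periodic work that visits a managed entry with the table lock held. */
static void toy_managed_work(struct toy_neigh *n, bool immediate_ok)
{
	n->tbl->lock_held = true;
	toy_event_send(n, immediate_ok);
	n->tbl->lock_held = false;
}
```

In this model, calling toy_managed_work() with immediate_ok set to true on a failed entry records one re-entrant lock attempt, while passing false only marks the probe as pending, and a later toy_timer_handler() call sends it without touching the held lock.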
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4a81f6da
    • Eric Dumazet's avatar
      tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data() · b67985be
      Eric Dumazet authored
      tcp_shift_skb_data() might collapse three packets into a larger one.
      
      P_A, P_B, P_C  -> P_ABC
      
      Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
      because it was enough.
      
      In commit 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions"),
      this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
      
      But the now needed test over P_C has been missed.
      
      This probably broke MPTCP.
      
      Then later, commit 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      added an extra condition to tcp_skb_can_collapse(), but the missing call
      from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
      might have different skb_zcopy_pure() status.
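
The missing check can be illustrated with a toy userspace model. It is a sketch under stated assumptions rather than the kernel implementation: toy_skb, toy_can_collapse() and toy_shift() are hypothetical names, and the MPTCP extension comparison is reduced to comparing one integer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an skb: only the properties that decide
 * whether two packets may be merged are modeled. */
struct toy_skb {
	int len;
	int mptcp_ext;    /* equal values model compatible MPTCP extensions */
	bool zcopy_pure;  /* models the skb_zcopy_pure() status */
};

/* Pairwise compatibility test, in the spirit of tcp_skb_can_collapse(). */
static bool toy_can_collapse(const struct toy_skb *to,
			     const struct toy_skb *from)
{
	return to->mptcp_ext == from->mptcp_ext &&
	       to->zcopy_pure == from->zcopy_pure;
}

/* Merge P_B and then P_C into P_A and return the resulting length of P_A.
 * With check_c == false the second merge is done without a test, which
 * models the behavior described as missing above. */
static int toy_shift(struct toy_skb *a, const struct toy_skb *b,
		     const struct toy_skb *c, bool check_c)
{
	if (!toy_can_collapse(a, b))
		return a->len;
	a->len += b->len;

	if (check_c && !toy_can_collapse(a, c))
		return a->len;
	a->len += c->len;
	return a->len;
}
```

With P_A and P_B compatible but P_C differing in its zcopy_pure flag, the unchecked variant merges all three packets, whereas the checked variant stops after merging P_B.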
      
      Fixes: 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions")
      Fixes: 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b67985be
  3. 02 Feb, 2022 13 commits