1. 25 Aug, 2021 2 commits
    • pipe: do FASYNC notifications for every pipe IO, not just state changes · fe67f4dd
      Linus Torvalds authored
      It turns out that the SIGIO/FASYNC situation is almost exactly the same
      as the EPOLLET case was: user space really wants to be notified after
      every operation.
      
      Now, in a perfect world it should be sufficient to only notify user
      space on "state transitions" when the IO state changes (ie when a pipe
      goes from unreadable to readable, or from unwritable to writable).  User
      space should then do as much as possible - fully emptying the buffer or
      what not - and we'll notify it again the next time the state changes.
      
      But as with EPOLLET, we have at least one case (stress-ng) where the
      kernel sent SIGIO due to the pipe being marked for asynchronous
      notification, but the user space signal handler then didn't actually
      necessarily read it all before returning (it read more than what was
      written, but since there could be multiple writes, it could leave data
      pending).
      
      The user space code then expected to get another SIGIO for subsequent
      writes - even though the pipe had been readable the whole time - and
      would only then read more.
      
      This is arguably a user space bug - and Colin King already fixed the
      stress-ng code in question - but the kernel regression rules are clear:
      it doesn't matter if kernel people think that user space did something
      silly and wrong.  What matters is that it used to work.
      
      So if user space depends on specific historical kernel behavior, it's a
      regression when that behavior changes.  It's on us: we were silly to
      have that non-optimal historical behavior, and our old kernel behavior
      was what user space was tested against.
      
      Because of how the FASYNC notification was tied to wakeup behavior, this
      was first broken by commits f467a6a6 and 1b6b26ae ("pipe: fix
      and clarify pipe read/write wakeup logic"), but at the time it seems
      nobody noticed.  Probably because the stress-ng problem case ends up
      being timing-dependent too.
      
      It was then unwittingly fixed by commit 3a34b13a ("pipe: make pipe
      writes always wake up readers") only to be broken again by commit
      3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal
      loads").
      
      And at that point the kernel test robot noticed the performance
      regression in the stress-ng.sigio.ops_per_sec case.  So the "Fixes" tag
      below is somewhat ad hoc, but it matches when the issue was noticed.
      
      Fix it for good (knock wood) by simply making the kill_fasync() case
      separate from the wakeup case.  FASYNC is quite rare, and we clearly
      shouldn't even try to use the "avoid unnecessary wakeups" logic for it.
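
A minimal user-space sketch of the robust pattern this message alludes to: a SIGIO handler that drains the pipe until EAGAIN, so it does not depend on getting one signal per write. This is not the stress-ng code. The names drain_nonblocking(), arm_sigio(), sigio_fd and sigio_bytes are hypothetical, chosen only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int sigio_fd = -1;                  /* hypothetical: read end of the pipe */
static volatile sig_atomic_t sigio_bytes;  /* bytes drained by the handler so far */

/*
 * Read until the pipe is empty. Returns the number of bytes drained,
 * or -1 on an unexpected error. The fd must be in O_NONBLOCK mode.
 */
long drain_nonblocking(int fd)
{
	char buf[64];
	long total = 0;

	assert(fd >= 0);
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0) {
			total += n;
			continue;
		}
		if (n == 0)
			return total;	/* EOF: every writer has closed */
		if (errno == EINTR)
			continue;
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return total;	/* nothing left to read right now */
		return -1;
	}
}

/* SIGIO handler: consume everything that is pending, not just one write. */
static void on_sigio(int signo)
{
	int saved_errno = errno;
	long n;

	(void)signo;
	n = drain_nonblocking(sigio_fd);
	if (n > 0)
		sigio_bytes += (sig_atomic_t)n;
	errno = saved_errno;
}

/* Install the handler and request SIGIO delivery for fd. Returns 0 or -1. */
int arm_sigio(int fd)
{
	struct sigaction sa;
	int flags;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigio;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGIO, &sa, NULL) != 0)
		return -1;

	sigio_fd = fd;
	if (fcntl(fd, F_SETOWN, getpid()) != 0)
		return -1;
	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}
```

A caller would create the pipe, call arm_sigio() on the read end, and let the handler drain. Because the handler loops until EAGAIN, it should behave the same whether the kernel signals on every write or only on state transitions.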
      
      Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/
      Fixes: 3b844826 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · 62add982
      Linus Torvalds authored
      Pull ucount fixes from Eric Biederman:
       "This branch fixes a regression that made it impossible to increase
        rlimits that had been converted to the ucount infrastructure, and also
        fixes a reference counting bug where the reference was not incremented
        soon enough.
      
        The fixes are trivial and the bugs have been encountered in the wild,
        and the fixes have been tested"
      
      * 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        ucounts: Increase ucounts reference counter before the security hook
        ucounts: Fix regression preventing increasing of rlimits in init_user_ns
  2. 24 Aug, 2021 1 commit
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 6e764bcd
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Several small fixes, the first three are significant:
      
         - mlx5 crash unloading drivers with a rare HW config
      
         - missing userspace reporting for the new dmabuf objects
      
         - random rxe failure due to missing memory zeroing
      
         - static checker/etc reports: missing spin lock init, null pointer
           deref on error, extra unlock on error path, memory allocation under
           spinlock, missing IRQ vector cleanup
      
         - kconfig typo in the new irdma driver"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/rxe: Zero out index member of struct rxe_queue
        RDMA/efa: Free IRQ vectors on error flow
        RDMA/rxe: Fix memory allocation while in a spin lock
        RDMA/bnxt_re: Remove unpaired rtnl unlock in bnxt_re_dev_init()
        IB/hfi1: Fix possible null-pointer dereference in _extend_sdma_tx_descs()
        RDMA/irdma: Use correct kconfig symbol for AUXILIARY_BUS
        RDMA/bnxt_re: Add missing spin lock initialization
        RDMA/uverbs: Track dmabuf memory regions
        RDMA/mlx5: Fix crash when unbind multiport slave
  3. 23 Aug, 2021 3 commits
    • ucounts: Increase ucounts reference counter before the security hook · bbb6d0f3
      Alexey Gladkov authored
      We need to increment the ucounts reference counter before calling
      security_prepare_creds(), because this function may fail and
      abort_creds() will then try to decrement this reference.
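
The ordering problem can be sketched with a toy user-space model. This is not the kernel code: struct toy_counts, struct toy_cred, toy_abort() and toy_prepare() are hypothetical stand-ins, and the ref_first flag exists only to contrast the two orderings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_counts { int refs; };                 /* stand-in for the counted object */
struct toy_cred   { struct toy_counts *counts; };

/* Error path: always drops one reference, like the abort path described above. */
static void toy_abort(struct toy_cred *cred)
{
	if (cred->counts)
		cred->counts->refs--;
}

/*
 * ref_first = true  models the fixed ordering (take the reference, then
 *                   run the hook that may fail);
 * ref_first = false models the old ordering (hook first, reference later).
 * hook_fails simulates the fallible security hook.
 */
int toy_prepare(struct toy_cred *cred, struct toy_counts *counts,
		bool ref_first, bool hook_fails)
{
	assert(cred != NULL && counts != NULL);

	cred->counts = counts;		/* pointer is set, no reference held yet */
	if (ref_first)
		counts->refs++;

	if (hook_fails) {
		toy_abort(cred);	/* drops a reference regardless of ordering */
		return -1;
	}

	if (!ref_first)
		counts->refs++;
	return 0;
}
```

With ref_first set, a failing hook leaves the count where it started. Without it, the error path drops a reference that was never taken, so the count ends up one too low.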
      
      [   96.465056][ T8641] FAULT_INJECTION: forcing a failure.
      [   96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0
      [   96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0
      [   96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      [   96.497254][ T8641] Call Trace:
      [   96.500517][ T8641]  dump_stack_lvl+0x1d3/0x29f
      [   96.505758][ T8641]  ? show_regs_print_info+0x12/0x12
      [   96.510944][ T8641]  ? log_buf_vmcoreinfo_setup+0x498/0x498
      [   96.516652][ T8641]  should_fail+0x384/0x4b0
      [   96.521141][ T8641]  prepare_alloc_pages+0x1d1/0x5a0
      [   96.526236][ T8641]  __alloc_pages+0x14d/0x5f0
      [   96.530808][ T8641]  ? __rmqueue_pcplist+0x2030/0x2030
      [   96.536073][ T8641]  ? lockdep_hardirqs_on_prepare+0x3e2/0x750
      [   96.542056][ T8641]  ? alloc_pages+0x3f3/0x500
      [   96.546635][ T8641]  allocate_slab+0xf1/0x540
      [   96.551120][ T8641]  ___slab_alloc+0x1cf/0x350
      [   96.555689][ T8641]  ? kzalloc+0x1d/0x30
      [   96.559740][ T8641]  __kmalloc+0x2e7/0x390
      [   96.563980][ T8641]  ? kzalloc+0x1d/0x30
      [   96.568029][ T8641]  kzalloc+0x1d/0x30
      [   96.571903][ T8641]  security_prepare_creds+0x46/0x220
      [   96.577174][ T8641]  prepare_creds+0x411/0x640
      [   96.581747][ T8641]  __sys_setfsuid+0xe2/0x3a0
      [   96.586333][ T8641]  do_syscall_64+0x3d/0xb0
      [   96.590739][ T8641]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   96.596611][ T8641] RIP: 0033:0x445a69
      [   96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      [   96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a
      [   96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69
      [   96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000
      [   96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000
      [   96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc
      [   96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000
      
      Fixes: 905ae01c ("Add a reference to ucounts for each cred")
      Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • ucounts: Fix regression preventing increasing of rlimits in init_user_ns · 5ddf994f
      Eric W. Biederman authored
      "Ma, XinjianX" <xinjianx.ma@intel.com> reported:
      
      > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
      > in kselftest failed with following message.
      >
      > # selftests: mqueue: mq_perf_tests
      > #
      > # Initial system state:
      > #       Using queue path:                       /mq_perf_tests
      > #       RLIMIT_MSGQUEUE(soft):                  819200
      > #       RLIMIT_MSGQUEUE(hard):                  819200
      > #       Maximum Message Size:                   8192
      > #       Maximum Queue Size:                     10
      > #       Nice value:                             0
      > #
      > # Adjusted system state for testing:
      > #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
      > #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
      > #       Maximum Message Size:                   16777216
      > #       Maximum Queue Size:                     65530
      > #       Nice value:                             -20
      > #       Continuous mode:                        (disabled)
      > #       CPUs to pin:                            3
      > # ./mq_perf_tests: mq_open() at 296: Too many open files
      > not ok 2 selftests: mqueue: mq_perf_tests # exit=1
      >
      > Test env:
      > rootfs: debian-10
      > gcc version: 9
      
      After investigation, the problem turned out to be that ucount_max for
      the rlimits in init_user_ns was being set to the initial rlimit value.
      The practical problem is that ucount_max provides a limit that
      applications inside the user namespace cannot exceed, which means in
      practice that rlimits that have been converted to use the ucount
      infrastructure were not able to exceed their initial rlimits.
      
      Solve this by setting the relevant values of ucount_max to
      RLIM_INFINITY.  A limit in init_user_ns is pointless, so the code
      should allow the values to grow as large as possible without risking
      an underflow or an overflow.
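
A toy model of why a per-namespace maximum frozen at the initial rlimit defeats a later setrlimit(). This is a simplified sketch, not the kernel implementation: struct toy_ns and toy_charge_allowed() are hypothetical, and the model only assumes that a charge must fit under both the task's rlimit and every namespace maximum on the way up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical namespace node: a maximum plus a link to the parent. */
struct toy_ns {
	rlim_t max;
	const struct toy_ns *parent;
};

/*
 * A charge is allowed only if it fits under the task's own rlimit and
 * under the maximum of every namespace from ns up to the root.
 */
bool toy_charge_allowed(const struct toy_ns *ns, rlim_t usage, rlim_t rlimit)
{
	assert(ns != NULL);

	if (usage > rlimit)
		return false;
	for (; ns != NULL; ns = ns->parent)
		if (usage > ns->max)
			return false;
	return true;
}
```

In this model, a root namespace whose max is stuck at 819200 rejects a large charge even after the task raises its rlimit to RLIM_INFINITY, while a root max of RLIM_INFINITY leaves the task's rlimit as the only bound.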
      
      As the ltp test case was a bit of a pain I have reproduced the rlimit failure
      and tested the fix with the following little C program:
      > #include <stdio.h>
      > #include <fcntl.h>
      > #include <sys/stat.h>
      > #include <mqueue.h>
      > #include <sys/time.h>
      > #include <sys/resource.h>
      > #include <errno.h>
      > #include <string.h>
      > #include <stdlib.h>
      > #include <limits.h>
      > #include <unistd.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	struct mq_attr mq_attr;
      > 	struct rlimit rlim;
      > 	mqd_t mqd;
      > 	int ret;
      >
      > 	ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
      > 	       rlim.rlim_cur, rlim.rlim_max);
      > 	rlim.rlim_cur = RLIM_INFINITY;
      > 	rlim.rlim_max = RLIM_INFINITY;
      > 	ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	memset(&mq_attr, 0, sizeof(struct mq_attr));
      > 	mq_attr.mq_maxmsg = 65536 - 1;
      > 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
      >
      > 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
      > 	if (mqd == (mqd_t)-1) {
      > 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	ret = mq_close(mqd);
      > 	if (ret) {
      > 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	return EXIT_SUCCESS;
      > }
      
      Fixes: 6e52a9f0 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
      Fixes: d7c9e99a ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
      Fixes: 21d1c5e3 ("Reimplement RLIMIT_NPROC on top of ucounts")
      Reported-by: kernel test robot <lkp@intel.com>
      Acked-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • Revert "media: dvb header files: move some headers to staging" · d5ae8d7f
      Linus Torvalds authored
      This reverts commit 819fbd3d.
      
      It turns out that some user-space applications use these uapi header
      files, so even though the only user of the interface is an old driver
      that was moved to staging, moving the header files causes unnecessary
      pain.
      
      Generally, we really don't want user space to use kernel headers
      directly (exactly because it causes pain when we re-organize), and
      instead copy them as needed.  But these things happen, and the headers
      were in the uapi directory, so I guess it's not entirely unreasonable.
      
      Link: https://lore.kernel.org/lkml/4e3e0d40-df4a-94f8-7c2d-85010b0873c4@web.de/
      Reported-by: Soeren Moch <smoch@web.de>
      Cc: stable@kernel.org  # 5.13
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Aug, 2021 2 commits
  5. 21 Aug, 2021 9 commits
  6. 20 Aug, 2021 23 commits