1. 15 Aug, 2024 20 commits
  2. 14 Aug, 2024 18 commits
    • netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests · bd662c42
      Phil Sutter authored
      Objects' dump callbacks are not concurrency-safe per-se with reset bit
      set. If two CPUs perform a reset at the same time, at least counter and
      quota objects suffer from value underrun.
      
      Prevent this by introducing dedicated locking callbacks for nfnetlink
      and the asynchronous dump handling to serialize access.
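
      The underrun can be reproduced in a small model. The sketch below is
      not the nf_tables code: struct counter and all function names are
      hypothetical, and the two "CPUs" are simulated by interleaving the
      fetch and reset steps by hand in a single thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for a stateful object such as a counter. */
struct counter {
    uint64_t packets;
};

/* Step 1 of dump-and-reset: snapshot the current total. */
uint64_t counter_fetch(const struct counter *c)
{
    return c->packets;
}

/* Step 2: subtract the snapshot, so anything counted in between survives. */
void counter_reset(struct counter *c, uint64_t snapshot)
{
    c->packets -= snapshot;
}

/* Two resets whose steps interleave, as can happen without a lock. */
uint64_t racy_double_reset(struct counter *c)
{
    uint64_t a = counter_fetch(c);   /* "CPU 0" takes its snapshot */
    uint64_t b = counter_fetch(c);   /* "CPU 1" snapshots the same total */

    counter_reset(c, a);
    counter_reset(c, b);             /* subtracts the total a second time */
    return c->packets;
}

/* The same two resets run back to back, as serialization would force. */
uint64_t serialized_double_reset(struct counter *c)
{
    uint64_t a = counter_fetch(c);
    counter_reset(c, a);

    uint64_t b = counter_fetch(c);   /* now observes 0 */
    counter_reset(c, b);
    return c->packets;
}
```

      Starting from 100 packets, the interleaved variant wraps the unsigned
      value below zero, while the serialized variant ends at 0.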
      
      Fixes: 43da04a5 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: Introduce nf_tables_getobj_single · 69fc3e9e
      Phil Sutter authored
      Outsource the reply skb preparation for non-dump getobj requests into a
      distinct function. Prep work for object reset locking.
      
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: Audit log dump reset after the fact · e0b6648b
      Phil Sutter authored
      In theory, dumpreset may fail and invalidate the preceding log message.
      Fix this and use the occasion to prepare for object reset locking, which
      benefits from a few unrelated changes:
      
      * Add an early call to nfnetlink_unicast if not resetting which
        effectively skips the audit logging but also unindents it.
      * Extract the table's name from the netlink attribute (which is verified
        via earlier table lookup) to not rely upon validity of the looked up
        table pointer.
      * Do not use local variable family, it will vanish.
      
      Fixes: 8e6cf365 ("audit: log nftables configuration change events")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • selftests: netfilter: add test for br_netfilter+conntrack+queue combination · ea2306f0
      Florian Westphal authored
      Trigger cloned skbs leaving softirq protection.
      This triggers a splat without the preceding change
      ("netfilter: nf_queue: drop packets with cloned unconfirmed
       conntracks"):
      
      WARNING: at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm..
      
      because local delivery and forwarding will race for confirmation.
      
      Based on a reproducer script from Yi Chen.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_queue: drop packets with cloned unconfirmed conntracks · 7d8dc1c7
      Florian Westphal authored
      Conntrack assumes an unconfirmed entry (not yet committed to global hash
      table) has a refcount of 1 and is not visible to other cores.
      
      With multicast forwarding this assumption breaks down because such
      skbs get cloned after being picked up, i.e.  ct->use refcount is > 1.
      
      Likewise, bridge netfilter will clone broad/multicast frames and
      all frames in case they need to be flood-forwarded during learning
      phase.
      
      For ip multicast forwarding or plain bridge flood-forward this will
      "work" because packets don't leave softirq and are implicitly
      serialized.
      
      With nfqueue this no longer holds true, the packets get queued
      and can be reinjected in arbitrary ways.
      
      Disable this feature, I see no other solution.
      
      After this patch, nfqueue cannot queue packets except the last
      multicast/broadcast packet.
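
      The condition described above can be written as a small predicate.
      This is a sketch under assumptions, not the patch itself: struct
      ct_model and drop_instead_of_queue() are hypothetical names, and the
      real conntrack entry carries far more state than these two fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a conntrack entry. */
struct ct_model {
    bool confirmed;      /* already committed to the global hash table */
    unsigned int use;    /* reference count; each clone of the skb holds one */
};

/* True when the packet must be dropped rather than queued to userspace. */
bool drop_instead_of_queue(const struct ct_model *ct)
{
    if (ct == NULL)
        return false;    /* no conntrack attached, nothing to race on */

    if (ct->confirmed)
        return false;    /* confirmed entries are safe to share */

    /* Unconfirmed and referenced by more than one skb: a clone exists. */
    return ct->use > 1;
}
```

      Only the unconfirmed entry with a reference count above 1 is rejected;
      once the other clones are gone, the last packet holds the sole
      reference and can be queued.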
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: initialise extack before use · e9767137
      Donald Hunter authored
      Fix missing initialisation of extack in flow offload.
      
      Fixes: c29f74e0 ("netfilter: nf_flow_table: hardware offload support")
      Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nfnetlink: Initialise extack before use in ACKs · d1a7b382
      Donald Hunter authored
      Add missing extack initialisation when ACKing BATCH_BEGIN and BATCH_END.
      
      Fixes: bf2ac490 ("netfilter: nfnetlink: Handle ACK flags for batch messages")
      Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · d07b4328
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "s390:
      
         - Fix failure to start guests with kvm.use_gisa=0
      
         - Panic if (un)share fails to maintain security.
      
        ARM:
      
         - Use kvfree() for the kvmalloc'd nested MMUs array
      
         - Set of fixes to address warnings in W=1 builds
      
         - Make KVM depend on assembler support for ARMv8.4
      
         - Fix for vgic-debug interface for VMs without LPIs
      
         - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest
      
         - Minor code / comment cleanups for configuring PAuth traps
      
         - Take kvm->arch.config_lock to prevent destruction / initialization
           race for a vCPU's CPUIF which may lead to a UAF
      
        x86:
      
         - Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
      
         - Fix smatch issues
      
         - Small cleanups
      
         - Make x2APIC ID 100% readonly
      
         - Fix typo in uapi constant
      
        Generic:
      
         - Use synchronize_srcu_expedited() on irqfd shutdown"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
        KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
        KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
        KVM: eventfd: Use synchronize_srcu_expedited() on shutdown
        KVM: selftests: Add a testcase to verify x2APIC is fully readonly
        KVM: x86: Make x2APIC ID 100% readonly
        KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
        KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
        KVM: SVM: Fix an error code in sev_gmem_post_populate()
        KVM: SVM: Fix uninitialized variable bug
        KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
        KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
        KVM: arm64: Tidying up PAuth code in KVM
        KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
        KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
        s390/uv: Panic for set and remove shared access UVC errors
        KVM: s390: fix validity interception issue when gisa is switched off
        docs: KVM: Fix register ID of SPSR_FIQ
        KVM: arm64: vgic: fix unexpected unlock sparse warnings
        KVM: arm64: fix kdoc warnings in W=1 builds
        KVM: arm64: fix override-init warnings in W=1 builds
        ...
    • netfilter: allow ipv6 fragments to arrive on different devices · 3cd740b9
      Tom Hughes authored
      Commit 264640fc ("ipv6: distinguish frag queues by device
      for multicast and link-local packets") modified the ipv6 fragment
      reassembly logic to distinguish frag queues by device for multicast
      and link-local packets but in fact only the main reassembly code
      limits the use of the device to those address types and the netfilter
      reassembly code uses the device for all packets.
      
      This means that if fragments of a packet arrive on different interfaces
      then netfilter will fail to reassemble them and the fragments will be
      expired without going any further through the filters.
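
      The intended keying rule can be sketched as follows. frag_key_iif() is
      a hypothetical helper, not a kernel function, and it assumes the
      destination address is what decides the scope; it uses the standard
      IN6_IS_ADDR_* macros from <netinet/in.h>.

```c
#include <netinet/in.h>

/*
 * Interface index to store in the reassembly key.
 * 0 means "not device-scoped": fragments match regardless of the
 * interface they arrived on.
 */
int frag_key_iif(const struct in6_addr *daddr, int iif)
{
    /* Only multicast and link-local traffic is tied to a device. */
    if (IN6_IS_ADDR_MULTICAST(daddr) || IN6_IS_ADDR_LINKLOCAL(daddr))
        return iif;

    return 0;
}
```

      With this rule, two fragments of a global unicast packet that arrive
      on different interfaces produce the same key and land in the same
      reassembly queue.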
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: Tom Hughes <tom@compton.nu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG · 1c0e5881
      Amit Shah authored
      "INVALID" is misspelt in "SEV_RET_INAVLID_CONFIG". Since this is part of
      the UAPI, keep the current definition and add a new one with the fix.
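
      One way to keep a misspelt UAPI name while adding the corrected one is
      an enum alias. The sketch below is illustrative only: the enum name,
      the EXAMPLE_* members, and all numeric values are assumptions and do
      not reflect the real header.

```c
/* Illustrative enum; only the two *_CONFIG spellings come from the commit. */
enum example_ret_code {
    EXAMPLE_RET_SUCCESS = 0,

    /* Misspelt name, kept so existing userspace keeps compiling. */
    SEV_RET_INAVLID_CONFIG,

    /* Corrected spelling, defined as an alias with the same value. */
    SEV_RET_INVALID_CONFIG = SEV_RET_INAVLID_CONFIG,

    /* Later members continue numbering from the aliased value. */
    EXAMPLE_RET_NEXT,
};
```

      Because the alias reuses the existing value, the numbering of the
      members that follow does not shift.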
      
      Fix-suggested-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Amit Shah <amit.shah@amd.com>
      Message-ID: <20240814083113.21622-1-amit@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) · 66155de9
      Sean Christopherson authored
      Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
      directly emulate instructions for ES/SNP, and instead the guest must
      explicitly request emulation.  Unless the guest explicitly requests
      emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
      SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
      
      But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
      because except for ES/SNP, doing so requires setting reserved bits in the
      SPTE, i.e. the SPTE can't be readable while also generating a #VC on
      writes.  Because KVM never creates MMIO SPTEs and jumps directly to
      emulation, the guest never gets a #VC.  And since KVM simply resumes the
      guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
      into an infinite #NPF loop if the vCPU attempts to write read-only memory.
      
      Disallow read-only memory for all VMs with protected state, i.e. for
      upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
      to support read-only memory, as TDX uses EPT Violation #VE to reflect the
      fault into the guest, e.g. KVM could configure read-only SPTEs with RX
      protections and SUPPRESS_VE=0.  But there is no strong use case for
      supporting read-only memslots on TDX, e.g. the main historical usage is
      to emulate option ROMs, but TDX disallows executing from shared memory.
      And if someone comes along with a legitimate, strong use case, the
      restriction can always be lifted for TDX.
      
      Don't bother trying to retroactively apply the restriction to SEV-ES
      VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
      possibly work for SEV-ES, i.e. disallowing such memslots really just
      means reporting an error to userspace instead of silently hanging vCPUs.
      Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
      isn't worth the marginal benefit it would provide userspace.
      
      Fixes: 26c44aa9 ("KVM: SEV: define VM types for SEV and SEV-ES")
      Fixes: 1dfe571c ("KVM: SEV: Add initial SEV-SNP support")
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Ackerly Tng <ackerleytng@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240809190319.1710470-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 9d590679
      Linus Torvalds authored
      Pull selinux fixes from Paul Moore:
      
       - Fix a xperms counting problem where we were adding to the xperms count
         even if we failed to add the xperm.
      
       - Propagate errors from avc_add_xperms_decision() back to the caller so
         that we can trigger the proper cleanup and error handling.
      
       - Revert our use of vma_is_initial_heap() in favor of our older logic
         as vma_is_initial_heap() doesn't correctly handle the no-heap case
         and it is causing issues with the SELinux process/execheap access
         control. While the older SELinux logic may not be perfect, it
         restores the expected user visible behavior.
      
         Hopefully we will be able to resolve the problem with the
         vma_is_initial_heap() macro with the mm folks, but we need to fix
         this in the meantime.
      
      * tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: revert our use of vma_is_initial_heap()
        selinux: add the processing of the failure of avc_add_xperms_decision()
        selinux: fix potential counting error in avc_add_xperms_decision()
    • Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 4ac0f08f
      Linus Torvalds authored
      Pull vfs fixes from Christian Brauner:
       "VFS:
      
         - Fix the name of file lease slab cache. When file leases were split
           out of file locks the name of the file lock slab cache was used for
           the file leases slab cache as well.
      
         - Fix a typo in the take_fd() helper comment.
      
         - Fix infinite directory iteration for stable offsets in tmpfs.
      
         - When the icache is pruned all reclaimable inodes are marked with
           I_FREEING and other processes that try to lookup such inodes will
           block.
      
           But some filesystems like ext4 can trigger lookups in their inode
           evict callback causing deadlocks. Ext4 does such lookups if the
           ea_inode feature is used whereby a separate inode may be used to
           store xattrs.
      
           Introduce I_LRU_ISOLATING which pins the inode while its pages are
           reclaimed. This avoids inode deletion during inode_lru_isolate()
           avoiding the deadlock and evict is made to wait until
           I_LRU_ISOLATING is done.
      
        netfs:
      
         - Fault in smaller chunks for non-large folio mappings for
           filesystems that haven't been converted to large folios yet.
      
         - Fix the CONFIG_NETFS_DEBUG config option. The config option was
           renamed a short while ago and that introduced two minor issues.
           First, it depended on CONFIG_NETFS whereas it wants to depend on
           CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
           does. Second, the documentation for the config option wasn't fixed
           up.
      
         - Revert the removal of the PG_private_2 writeback flag as ceph is
           using it and fix how that flag is handled in netfs.
      
         - Fix DIO reads on 9p. A program watching a file on a 9p mount
           wouldn't see any changes in the size of the file being exported by
           the server if the file was changed directly in the source
           filesystem. Fix this by attempting to read the full size specified
           when a DIO read is requested.
      
         - Fix a NULL pointer dereference bug due to a data race where a
           cachefiles cookie was retired even though it was still in use.
           Check the cookie's n_accesses counter before discarding it.
      
        nsfs:
      
         - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
           the kernel is writing to userspace.
      
        pidfs:
      
         - Prevent the creation of pidfds for kthreads until we have a
           use-case for it and we know the semantics we want. It also confuses
           userspace why they can get pidfds for kthreads.
      
        squashfs:
      
         - Fix an uninitialized value bug reported by KMSAN caused by a
           corrupted symbolic link size read from disk. Check that the
           symbolic link size is not larger than expected"
      
      * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        Squashfs: sanity check symbolic link size
        9p: Fix DIO read through netfs
        vfs: Don't evict inode under the inode lru traversing context
        netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
        netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
        file: fix typo in take_fd() comment
        pidfd: prevent creation of pidfds for kthreads
        netfs: clean up after renaming FSCACHE_DEBUG config
        libfs: fix infinite directory reads for offset dir
        nsfs: fix ioctl declaration
        fs/netfs/fscache_cookie: add missing "n_accesses" check
        filelock: fix name of file_lease slab cache
        netfs: Fault in smaller chunks for non-large folio mappings
    • Merge tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 02f8ca3d
      Linus Torvalds authored
      Pull bpf fixes from Alexei Starovoitov:
      
       - Fix bpftrace regression from Kyle Huey.
      
         Tracing bpf prog was called with perf_event input arguments causing
         bpftrace to produce garbage output.
      
       - Fix verifier crash in stacksafe() from Yonghong Song.
      
         Daniel Hodges reported verifier crash when playing with sched-ext.
         The stack depth in the known verifier state was larger than the stack
         depth in the state being explored, causing an out-of-bounds access.
      
       - Fix update of freplace prog in prog_array from Leon Hwang.
      
         freplace prog type wasn't recognized correctly.
      
      * tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        perf/bpf: Don't call bpf_overflow_handler() for tracing events
        selftests/bpf: Add a test to verify previous stacksafe() fix
        bpf: Fix a kernel verifier crash in stacksafe()
        bpf: Fix updating attached freplace prog in prog_array map
    • Revert "ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error" · fa0db8e5
      Niklas Cassel authored
      This reverts commit 28ab9769.
      
      Sense data can be in either fixed format or descriptor format.
      
      SAT-6 revision 1, "10.4.6 Control mode page", defines the D_SENSE bit:
      "The SATL shall support this bit as defined in SPC-5 with the following
      exception: if the D_SENSE bit is set to zero (i.e., fixed format sense
      data), then the SATL should return fixed format sense data for ATA
      PASS-THROUGH commands."
      
      The libata SATL has always kept D_SENSE set to zero by default. (It is
      however possible to change the value using a MODE SELECT SG_IO command.)
      
      Failed ATA PASS-THROUGH commands correctly respected the D_SENSE bit,
      however, successful ATA PASS-THROUGH commands incorrectly returned the
      sense data in descriptor format (regardless of the D_SENSE bit).
      
      Commit 28ab9769 ("ata: libata-scsi: Honor the D_SENSE bit for
      CK_COND=1 and no error") fixed this bug for successful ATA PASS-THROUGH
      commands.
      
      However, after commit 28ab9769 ("ata: libata-scsi: Honor the D_SENSE
      bit for CK_COND=1 and no error"), there were bug reports that hdparm,
      hddtemp, and udisks were no longer working as expected.
      
      These applications incorrectly assume the returned sense data is in
      descriptor format, without even looking at the RESPONSE CODE field in the
      returned sense data (to see which format the returned sense data is in).
      
      Considering that there will be broken versions of these applications around
      roughly forever, we are stuck with being bug compatible with older kernels.
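
      For reference, the check these applications skip is small. In SPC, the
      RESPONSE CODE occupies bits 6..0 of byte 0 of the sense data: 70h and
      71h indicate fixed format, 72h and 73h indicate descriptor format. The
      helper names below are hypothetical; this is a sketch, not code from
      libata or from any of the tools mentioned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Extract the RESPONSE CODE field (byte 0, bits 6..0); 0 if no data. */
static uint8_t sense_response_code(const uint8_t *sense, size_t len)
{
    if (sense == NULL || len < 1)
        return 0;
    return sense[0] & 0x7f;
}

/* 72h (current) and 73h (deferred) mean descriptor format. */
bool sense_is_descriptor_format(const uint8_t *sense, size_t len)
{
    uint8_t code = sense_response_code(sense, len);
    return code == 0x72 || code == 0x73;
}

/* 70h (current) and 71h (deferred) mean fixed format. */
bool sense_is_fixed_format(const uint8_t *sense, size_t len)
{
    uint8_t code = sense_response_code(sense, len);
    return code == 0x70 || code == 0x71;
}
```

      A caller would branch on these two predicates before interpreting the
      remaining bytes, since the layouts differ after byte 0.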
      
      Cc: stable@vger.kernel.org # 4.19+
      Reported-by: Stephan Eisvogel <eisvogel@seitics.de>
      Reported-by: Christian Heusel <christian@heusel.eu>
      Closes: https://lore.kernel.org/linux-ide/0bf3f2f0-0fc6-4ba5-a420-c0874ef82d64@heusel.eu/
      Fixes: 28ab9769 ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Link: https://lore.kernel.org/r/20240813131900.1285842-2-cassel@kernel.org
      Signed-off-by: Niklas Cassel <cassel@kernel.org>
    • tcp: Update window clamping condition · a2cbb160
      Subash Abhinov Kasiviswanathan authored
      This patch is based on the discussions between Neal Cardwell and
      Eric Dumazet in the link
      https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/
      
      It was correctly pointed out that tp->window_clamp would not be
      updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
      (copied <= tp->rcvq_space.space). While it is expected for most
      setups to leave the sysctl enabled, the latter condition may
      not end up hitting depending on the TCP receive queue size and
      the pattern of arriving data.
      
      The updated check should be hit only on initial MSS update from
      TCP_MIN_MSS to measured MSS value and subsequently if there was
      an update to a larger value.
      
      Fixes: 05f76b2d ("tcp: Adjust clamping window for applications specifying SO_RCVBUF")
      Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
      Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • media: atomisp: Fix streaming no longer working on BYT / ISP2400 devices · 63de936b
      Hans de Goede authored
      Commit a0821ca1 ("media: atomisp: Remove test pattern generator (TPG)
      support") broke BYT support because it removed a seemingly unused field
      from struct sh_css_sp_config and a seemingly unused value from enum
      ia_css_input_mode.
      
      But these are part of the ABI between the kernel and firmware on ISP2400
      and this part of the TPG support removal changes broke ISP2400 support.
      
      ISP2401 support was not affected because on ISP2401 only a part of
      struct sh_css_sp_config is used.
      
      Restore the removed field and enum value to fix this.
      
      Fixes: a0821ca1 ("media: atomisp: Remove test pattern generator (TPG) support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
    • mptcp: correct MPTCP_SUBFLOW_ATTR_SSN_OFFSET reserved size · 655111b8
      Eugene Syromiatnikov authored
      ssn_offset field is u32 and is placed into the netlink response with
      nla_put_u32(), but only 2 bytes are reserved for the attribute payload
      in subflow_get_info_size() (even though it makes no difference
      in the end, as it is aligned up to 4 bytes).  Supply the correct
      argument to the relevant nla_total_size() call to make it less
      confusing.
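
      The arithmetic behind "it makes no difference in the end" can be
      checked with a userspace re-implementation of the size calculation.
      The EX_* macros and ex_nla_total_size() are hypothetical names that
      mirror the netlink rule of a 4-byte attribute header followed by the
      payload, rounded up to a 4-byte boundary.

```c
/* Netlink attributes are aligned to 4 bytes. */
#define EX_NLA_ALIGNTO 4
#define EX_NLA_ALIGN(len) \
    (((len) + EX_NLA_ALIGNTO - 1) & ~(EX_NLA_ALIGNTO - 1))

/* Attribute header: 16-bit length plus 16-bit type. */
#define EX_NLA_HDRLEN 4

/* Total space one attribute occupies, including padding. */
int ex_nla_total_size(int payload)
{
    return EX_NLA_ALIGN(EX_NLA_HDRLEN + payload);
}
```

      A 2-byte payload needs 4 + 2 = 6 bytes, which rounds up to 8; a 4-byte
      payload needs exactly 8. Both reservations are therefore identical,
      which is why the wrong argument was harmless in practice.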
      
      Fixes: 5147dfb5 ("mptcp: allow dumping subflow context to userspace")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20240812065024.GA19719@asgard.redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 13 Aug, 2024 2 commits
    • Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 6b0f8db9
      Linus Torvalds authored
      Pull execve fixes from Kees Cook:
      
       - binfmt_flat: Fix corruption when not offsetting data start
      
       - exec: Fix ToCToU between perm check and set-uid/gid usage
      
      * tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        exec: Fix ToCToU between perm check and set-uid/gid usage
        binfmt_flat: Fix corruption when not offsetting data start
    • exec: Fix ToCToU between perm check and set-uid/gid usage · f50733b4
      Kees Cook authored
      When opening a file for exec via do_filp_open(), permission checking is
      done against the file's metadata at that moment, and on success, a file
      pointer is passed back. Much later in the execve() code path, the file
      metadata (specifically mode, uid, and gid) is used to determine if/how
      to set the uid and gid. However, those values may have changed since the
      permissions check, meaning the execution may gain unintended privileges.
      
      For example, if a file could change permissions from executable and not
      set-id:
      
      ---------x 1 root root 16048 Aug  7 13:16 target
      
      to set-id and non-executable:
      
      ---S------ 1 root root 16048 Aug  7 13:16 target
      
      it is possible to gain root privileges when execution should have been
      disallowed.
      
      While this race condition is rare in real-world scenarios, it has been
      observed (and proven exploitable) when package managers are updating
      the setuid bits of installed programs. Such files start with being
      world-executable but then are adjusted to be group-exec with a set-uid
      bit. For example, "chmod o-x,u+s target" makes "target" executable only
      by uid "root" and gid "cdrom", while also becoming setuid-root:
      
      -rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target
      
      becomes:
      
      -rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target
      
      But racing the chmod means users without group "cdrom" membership can
      get the permission to execute "target" just before the chmod, and when
      the chmod finishes, the exec reaches bprm_fill_uid(), and performs the
      setuid to root, violating the expressed authorization of "only cdrom
      group members can setuid to root".
      
      Re-check that we still have execute permissions in case the metadata
      has changed. It would be better to keep a copy from the perm-check time,
      but until we can do that refactoring, the least-bad option is to do a
      full inode_permission() call (under inode lock). It is understood that
      this is safe against dead-locks, but hardly optimal.
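
      The race and the re-check can be modelled in userspace. The sketch
      below is a simplification under stated assumptions: struct file_meta,
      other_may_exec() and exec_euid() are hypothetical, and the permission
      test looks only at the "other" execute bit, whereas the real
      inode_permission() also considers owner, group, ACLs and LSM hooks.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical snapshot of the metadata exec cares about. */
struct file_meta {
    mode_t mode;
    uid_t uid;
    gid_t gid;
};

/* Simplified check: may an unrelated user execute this file? */
static bool other_may_exec(const struct file_meta *m)
{
    return (m->mode & S_IXOTH) != 0;
}

/*
 * Effective uid the new program would run with, or -1 if exec is refused.
 * at_open: metadata when the permission check ran.
 * at_fill: metadata when the set-uid decision is made.
 */
long exec_euid(const struct file_meta *at_open,
               const struct file_meta *at_fill,
               uid_t caller, bool recheck)
{
    if (!other_may_exec(at_open))
        return -1;

    /* The fix: verify execute permission again on the current metadata. */
    if (recheck && !other_may_exec(at_fill))
        return -1;

    if (at_fill->mode & S_ISUID)
        return (long)at_fill->uid;

    return (long)caller;
}
```

      With the metadata changing from mode 0755 to 04754 between the two
      points, the model without the re-check hands uid 0 to a caller who is
      no longer allowed to execute the file; with the re-check, the exec is
      refused.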
      
      Reported-by: Marco Vanotti <mvanotti@google.com>
      Tested-by: Marco Vanotti <mvanotti@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Kees Cook <kees@kernel.org>