1. 21 Jun, 2021 23 commits
    • btrfs: don't set the full sync flag when truncation does not touch extents · 0d7d3165
      Filipe Manana authored
      At btrfs_truncate() where we truncate the inode either to the same size
      or to a smaller size, we always set the full sync flag on the inode.
      
      This is needed in case the truncation drops or trims any file extent items
      that start beyond or cross the new inode size, so that the next fsync
      drops all inode items from the log and scans again the fs/subvolume tree
      to find all items that must be logged.
      
      However if the truncation does not drop or trim any file extent items, we
      do not need to set the full sync flag and force the next fsync to use the
      slow code path. So do not set the full sync flag in such cases.
      
      One use case where it is frequent to do truncations that do not change
      the inode size and do not drop any extents (no prealloc extents beyond
      i_size) is when running Microsoft's SQL Server inside a Docker container.
      One example workload is the one Philipp Fent reported recently, in the
      thread with a link below. In this workload a large number of fsyncs are
      preceded by such truncate operations.
      
      After this change I consistently see the runtime of that workload from
      Philipp reduced by about 12%, for example from 184 seconds down to
      162 seconds.
      
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix misleading and incomplete comment of btrfs_truncate() · 4f7e6737
      Filipe Manana authored
      The comment at the top of btrfs_truncate() mentions that csum items are
      dropped or truncated to the new i_size, but this is wrong and nonsense,
      as they are unrelated to the i_size and are located in the csums tree and
      not on a tree with inode items (fs/subvolume tree or a log tree). Instead
      that claim applies to file extent items, so fix the comment to refer to
      them instead.
      
      While at it make the whole comment for the function more descriptive and
      follow the kernel doc style.
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction if we fail to update the delayed inode · 04587ad9
      Josef Bacik authored
      If we fail to update the delayed inode we need to abort the transaction,
      because we could leave an inode with the improper counts or some other
      such corruption behind.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error handling in __btrfs_update_delayed_inode · bb385bed
      Josef Bacik authored
      If we get an error while looking up the inode item we'll simply bail
      without cleaning up the delayed node.  This results in this style of
      warning happening on commit:
      
        WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
        CPU: 0 PID: 76403 Comm: fsstress Tainted: G        W         5.13.0-rc1+ #373
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
        RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
        RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
        RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
        R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
        R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
        FS:  00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
        Call Trace:
         btrfs_commit_transaction+0x43c/0xb00
         ? finish_wait+0x80/0x80
         ? vfs_fsync_range+0x90/0x90
         iterate_supers+0x8c/0x100
         ksys_sync+0x50/0x90
         __do_sys_sync+0xa/0x10
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens because the iref isn't dropped, which leaves an elevated
      node->count, so any release just re-queues it onto the delayed inodes
      list.  Fix this by going to the out label to handle the proper cleanup
      of the delayed node.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_release_delayed_iref handle the !iref case · a4cb90dc
      Josef Bacik authored
      Right now we only clean up the delayed iref if we have
      BTRFS_DELAYED_NODE_DEL_IREF set on the node.  However we have some error
      conditions that need to clean up the iref if it still exists, so to make
      this code cleaner move the test_bit into btrfs_release_delayed_iref
      itself and unconditionally call it in each of the cases instead.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: per-device bandwidth control · eb3b5053
      David Sterba authored
      Add sysfs interface to limit io during scrub. We relied on the ionice
      interface to do that, e.g. the idle class left the system usable while
      scrub was running. This changed when mq-deadline became widespread, as
      it does not implement the scheduling classes. That was a CFQ thing that
      got deleted. We've got numerous complaints from users about degraded
      performance.
      
      Currently only BFQ supports that but it's not a common scheduler and we
      can't ask everybody to switch to it.
      
      Alternatively the cgroup io limiting can be used but that is also a
      non-trivial setup (v2 required, the controller must be enabled on the
      system). This can still be used if desired.
      
      Other ideas that have been explored: piggy-back on ionice (that is set
      per-process and is accessible) and interpret the class and classdata as
      bandwidth limits, but this does not have enough flexibility as there are
      only 8 allowed and we'd have to map fixed limits to each value. Also
      adjusting the value would need to lookup the process that currently runs
      scrub on the given device, and the value is not sticky so would have to
      be adjusted each time scrub runs.
      
      Running out of options, sysfs does not look that bad:
      
      - it's accessible from scripts, or udev rules
      - the name is similar to what MD-RAID has
        (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
      - the value is sticky at least for filesystem mount time
      - adjusting the value has immediate effect
      - sysfs is available in constrained environments (eg. system rescue)
      - the limit also applies to device replace
      
      Sysfs:
      
      - raw value is in bytes
      - values written to the file accept suffixes like K, M
      - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
      - 0 means use default priority of IO
      
      The scheduler is a simple deadline one and the accuracy is up to nearest
      128K.
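
      As a rough illustration of the suffix handling described above, the
      following userspace sketch converts a string such as "100M" into a raw
      byte count. It is not the kernel's parser: parse_limit() is a
      hypothetical helper, and treating K, M and G as powers of 1024 is an
      assumption.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper: parse "512K", "100M", "1G" or a plain
 * number into bytes. Assumes the suffixes are powers of 1024.
 */
static uint64_t parse_limit(const char *s)
{
	char *end;
	uint64_t val = strtoull(s, &end, 10);

	switch (toupper((unsigned char)*end)) {
	case 'G':
		val <<= 10;
		/* fall through */
	case 'M':
		val <<= 10;
		/* fall through */
	case 'K':
		val <<= 10;
		break;
	default:
		/* No suffix: the value is already in bytes. */
		break;
	}
	return val;
}
```

      A value computed this way corresponds to the raw number the
      scrub_speed_max file would report, with 0 meaning the default IO
      priority.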
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: factor out zoned device lookup · e7ff9e6b
      Johannes Thumshirn authored
      To be able to construct a zone append bio we need to look up the
      btrfs_device. The code doing the chunk map lookup to get the device is
      present in btrfs_submit_compressed_write and submit_extent_page.
      
      Factor out the lookup calls into a helper and use it in the submission
      paths.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return EAGAIN if defrag is canceled · 50535db8
      Tian Tao authored
      When inode defrag is canceled, the error is set to EAGAIN but then
      overwritten by number of defragmented bytes. As this would hide the
      error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
      will print the error and continue to next file (as it does for any
      other error).
      Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove io_failure_record::in_validation · 1245835d
      Qu Wenruo authored
      The io_failure_record::in_validation was introduced to handle a failed
      bio which crosses several sectors.  In such a case, we still need to
      verify which sectors are corrupted.

      But since we've changed how we handle corrupted sectors, by only
      submitting repair for each corrupted sector, there is no need for extra
      validation any more.

      This patch will clean up all io_failure_record::in_validation related
      code.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: submit read time repair only for each corrupted sector · 150e4b05
      Qu Wenruo authored
      Currently btrfs_submit_read_repair() has some extra check on whether the
      failed bio needs extra validation for repair.  But we can avoid all
      these extra mechanisms if we submit the repair for each sector.
      
      By this, each read repair can be easily handled without the need to
      verify which sector is corrupted.
      
      This will also benefit subpage, as one subpage bvec can contain several
      sectors, making the extra verification more complex.
      
      So this patch will:
      
      - Introduce repair_one_sector()
        The main code submitting repair, which is more or less the same as old
        btrfs_submit_read_repair().
        But this time, it only repairs one sector.
      
      - Make btrfs_submit_read_repair() handle sectors differently
        There are 3 different cases:
      
        * Good sector
          We need to release the page and extent, set the range uptodate.
      
        * Bad sector and failed to submit repair bio
          We need to release the page and extent, but not set the range
          uptodate.
      
        * Bad sector but repair bio submitted
          The page and extent release will be handled by the submitted repair
          bio. Nothing needs to be done.
      
        Since btrfs_submit_read_repair() will handle the page and extent
        release now, we need to skip to next bvec even if we hit some error.
      
      - Change the lifespan of @uptodate in end_bio_extent_readpage()
        Since now btrfs_submit_read_repair() will handle the full bvec
        which contains any corruption, we don't need to bother updating
        @uptodate bit anymore.
        Just let @uptodate be a local variable inside the main loop,
        so that any error from one bvec won't affect later bvec.
      
      - Only export btrfs_repair_one_sector(), unexport
        btrfs_submit_read_repair()
        The only outside caller for read repair is DIO, which already submits
        its repair for just one sector.
        Only export btrfs_repair_one_sector() for DIO.
      
      This patch will focus on the change on the repair path, the extra
      validation code is still kept as is, and will be cleaned up later.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_verify_data_csum() to return a bitmap · 08508fea
      Qu Wenruo authored
      This will provide the basis for later per-sector repair for subpage,
      while still keeping the existing code happy.
      
      As if all csums match, the return value will be 0, same as now.
      Only when csum mismatches, the return value is different.
      
      The new return value will be a bitmap. For 4K sectorsize and 4K page
      size, it will be 1 on mismatch, instead of -EIO (which is not used
      directly by the callers, so no effective change).
      
      But for 4K sectorsize and 64K page size, aka subpage case, since the
      bvec can contain multiple sectors, knowing which sectors are corrupted
      will allow us to submit repair only for corrupted sectors.
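
      The idea of returning a per-sector bitmap can be sketched in plain C as
      follows. This is an illustration only: verify_sectors() and toy_csum()
      are hypothetical names, the checksum is a toy stand-in rather than the
      filesystem's real csum, and at most 32 sectors per buffer are assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTORSIZE 4096u

/* Toy checksum used only for this sketch (assumption: not the real csum). */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + data[i];
	return sum;
}

/*
 * Check each sector of buf against expected[] and return a bitmap in
 * which bit i is set when sector i mismatches. 0 means every sector
 * matched, so callers that only test for non-zero keep working.
 */
static uint32_t verify_sectors(const uint8_t *buf, unsigned int nr_sectors,
			       const uint32_t *expected)
{
	uint32_t bad = 0;

	for (unsigned int i = 0; i < nr_sectors; i++) {
		const uint8_t *sector = buf + (size_t)i * SECTORSIZE;

		if (toy_csum(sector, SECTORSIZE) != expected[i])
			bad |= 1u << i;
	}
	return bad;
}
```

      With one sector per buffer the result is either 0 or 1; with 16 sectors
      (4K sectorsize in a 64K page) the set bits identify exactly which
      sectors would need repair.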
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename check_async_write and let it return bool · f4dcfb30
      Johannes Thumshirn authored
      The 'check_async_write' function is a helper used in
      'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
      used for metadata.
      
      Make the function return bool and get rid of the local variable async in
      btrfs_submit_metadata_bio storing the result of check_async_write's
      tests.
      
      As this is touching all function call sites, also rename it to
      should_async_write as this is more in line with the naming we use.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: bail out if we can't read a reliable write pointer · 06e1e7f4
      Johannes Thumshirn authored
      If we can't read a reliable write pointer from a sequential zone, fail
      creating the block group with an I/O error.
      
      Also if the read write pointer is beyond the end of the respective zone,
      fail the creation of the block group on this zone with an I/O error.
      
      While this could also happen in real world scenarios with misbehaving
      drives, this change addresses a problem uncovered by fstests' test case
      generic/475.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: print message when zone sanity check type fails · 47cdfb5e
      Naohiro Aota authored
      This extends patch 784daf2b ("btrfs: zoned: sanity check zone
      type"), the message was supposed to be there but was lost during merge.
      We want to make the error noticeable so add it.
      
      Fixes: 784daf2b ("btrfs: zoned: sanity check zone type")
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle preemptive delalloc flushing slightly differently · 385f421f
      Josef Bacik authored
      If we decide to flush delalloc from the preemptive flusher, we really do
      not want to wait on ordered extents, as it gains us nothing.  However
      there was logic to go ahead and wait on ordered extents if there were
      more ordered bytes than delalloc bytes.  We do not want this behavior,
      so pass through whether this flushing is for preemption, and do not wait
      for ordered extents if that's the case.  Also break out of the shrink
      loop after the first flushing, as we just want to one shot shrink
      delalloc.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only ignore delalloc if delalloc is much smaller than ordered · 3e101569
      Josef Bacik authored
      While testing heavy delalloc workloads I noticed that sometimes we'd
      just stop preemptively flushing when we had loads of delalloc available
      to flush.  This is because we skip preemptive flushing if delalloc <=
      ordered.  However if we start with say 4gib of delalloc, and we flush
      2gib of that, we'll stop flushing there, when we still have 2gib of
      delalloc to flush.
      
      Instead adjust the ordered bytes down by half, this way if 2/3 of our
      outstanding delalloc reservations are tied up by ordered extents we
      don't bother preemptive flushing, as we're getting close to the state
      where we need to wait on ordered extents.
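
      The effect of halving the ordered bytes can be checked with a small
      standalone sketch. The helper names are hypothetical and the logic is
      reduced to the bare comparison described in this message; the real
      decision in the space reservation code involves more state.

```c
#include <stdbool.h>
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Old behavior: ignore delalloc as soon as delalloc <= ordered. */
static bool old_ignore_delalloc(uint64_t delalloc, uint64_t ordered)
{
	return delalloc <= ordered;
}

/*
 * New behavior: halve the ordered bytes before comparing, so delalloc
 * is only ignored once ordered is at least twice as large, i.e. about
 * 2/3 of the outstanding total is tied up in ordered extents.
 */
static bool new_ignore_delalloc(uint64_t delalloc, uint64_t ordered)
{
	ordered >>= 1;
	return delalloc <= ordered;
}
```

      For the example in this message, after flushing 2GiB out of 4GiB the
      old check returns true for (2 * GIB, 2 * GIB) and flushing stops, while
      the new check returns false and flushing continues.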
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't include the global rsv size in the preemptive used amount · 30acce4e
      Josef Bacik authored
      When deciding if we should preemptively flush space, we will add in the
      amount of space used by all block rsvs.  However this also includes the
      global block rsv, which isn't flushable so shouldn't be accounted for in
      this calculation.  If we decide to use ->bytes_may_use in our used
      calculation we need to subtract the global rsv size from this amount so
      it most closely matches the flushable space.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the global rsv size in the preemptive thresh calculation · 1239e2da
      Josef Bacik authored
      We calculate the amount of "free" space available for normal
      reservations by taking the total space and subtracting out the hard used
      space, which is readonly, used, and reserved space.
      
      However we weren't taking into account the global block rsv, which is
      essentially hard used space.  Handle this by subtracting it from the
      available free space, so that our threshold more closely mirrors
      reality.
      
      We need to do the check because it's possible that the global_rsv_size +
      used is > total_bytes: sometimes the global reserve can end up being
      calculated as larger than the available size (think small filesystems
      where we only have the original 8MiB chunk of metadata).  It doesn't
      usually happen, but that can get us into trouble so this is safer.
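
      The guarded subtraction described above might look roughly like the
      sketch below. free_for_thresh() is a hypothetical helper, and returning
      0 when the global reserve plus used space exceeds the total is an
      assumption about how the check resolves, not a copy of the kernel code.

```c
#include <stdint.h>

/*
 * Free space usable for the preemptive threshold: total minus hard
 * used space minus the global block reserve. The comparisons run
 * before the subtraction so the unsigned arithmetic can never wrap
 * when used + global_rsv_size is larger than total_bytes.
 */
static uint64_t free_for_thresh(uint64_t total_bytes, uint64_t used,
				uint64_t global_rsv_size)
{
	if (used >= total_bytes || global_rsv_size >= total_bytes - used)
		return 0;
	return total_bytes - used - global_rsv_size;
}
```

      On a tiny filesystem with a single 8MiB metadata chunk, 5MiB used and a
      6MiB global reserve, the unguarded subtraction would wrap around to a
      huge value; the guarded version yields 0.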
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: take into account global rsv in need_preemptive_reclaim · 610a6ef4
      Josef Bacik authored
      Global rsv can't be used for normal allocations, and for very full file
      systems we can decide to try and async flush constantly even though
      there's really not a lot of space to reclaim.  Deal with this by
      including the global block rsv size in the "total used" calculation.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only clamp the first time we have to start flushing · 0aae4ca9
      Josef Bacik authored
      We were clamping the threshold for preemptive reclaim any time we added
      a ticket to wait on, which if we have a lot of threads means we'd
      essentially max out the clamp the first time we start to flush.
      
      Instead of doing this, simply do it every time we have to start
      flushing; this will make us ramp up gradually instead of going to max
      clamping as soon as we start needing to do flushing.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check worker before need_preemptive_reclaim · ed738ba7
      Josef Bacik authored
      need_preemptive_reclaim() does some calculations, which aren't heavy,
      but if we're already running preemptive reclaim there's no reason to do
      them at all, so re-order the checks so that we don't do the calculation
      if we're already doing reclaim.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale comment for argument seed of btrfs_find_device · 94358c35
      Su Yue authored
      Commit b2598edf ("btrfs: remove unused argument seed from
      btrfs_find_device") removed the argument seed from btrfs_find_device
      but forgot the comment, so remove it.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage() · dc56219f
      Goldwyn Rodrigues authored
      try_lock_extent() returns 1 on success or 0 for failure and not an error
      code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
      zero indicating subpage extent read success.
      
      Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
      extent.
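
      The bug pattern can be reproduced outside the kernel with a stub. In
      this sketch try_lock_stub(), read_buggy() and read_fixed() are
      hypothetical stand-ins; only the return convention (1 on success, 0 on
      failure) is taken from the description above.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for a trylock that returns 1 on success and 0 on failure. */
static int try_lock_stub(bool lock_available)
{
	return lock_available ? 1 : 0;
}

/* Buggy pattern: the trylock result is passed on as if it were an errno. */
static int read_buggy(bool nowait, bool lock_available)
{
	int ret;

	if (nowait) {
		ret = try_lock_stub(lock_available);
		if (!ret)
			return ret;	/* 0: the caller sees success */
	}
	return 0;
}

/* Fixed pattern: a failed trylock is translated into -EAGAIN. */
static int read_fixed(bool nowait, bool lock_available)
{
	if (nowait && !try_lock_stub(lock_available))
		return -EAGAIN;
	return 0;
}
```

      When the lock is unavailable, read_buggy() returns 0 and the caller
      cannot tell that nothing was read, whereas read_fixed() returns
      -EAGAIN.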
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 20 Jun, 2021 4 commits
  3. 19 Jun, 2021 13 commits
    • Merge tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · b84a7c28
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Fix initrd corruption caused by our recent change to use relative jump
        labels.
      
        Fix a crash using perf record on systems without a hardware PMU
        backend.
      
        Rework our 64-bit signal handling slightly to make it more closely
        match the old behaviour, after the recent change to use unsafe user
        accessors.
      
        Thanks to Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel
        Axtens, Greg Kurz, and Roman Bolshakov"
      
      * tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set
        powerpc: Fix initrd corruption with relative jump labels
        powerpc/signal64: Copy siginfo before changing regs->nip
        powerpc/mem: Add back missing header to fix 'no previous prototype' error
    • Merge tag 'perf-tools-fixes-for-v5.13-2021-06-19' of... · 913ec3c2
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix refcount usage when processing PERF_RECORD_KSYMBOL.
      
       - 'perf stat' metric group fixes.
      
       - Fix 'perf test' non-bash issue with stat bpf counters.
      
       - Update unistd, in.h and socket.h with the kernel sources, silencing
         perf build warnings.
      
      * tag 'perf-tools-fixes-for-v5.13-2021-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        tools headers UAPI: Sync linux/in.h copy with the kernel sources
        tools headers UAPI: Sync asm-generic/unistd.h with the kernel original
        perf beauty: Update copy of linux/socket.h with the kernel sources
        perf test: Fix non-bash issue with stat bpf counters
        perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL
        perf metricgroup: Return error code from metricgroup__add_metric_sys_event_iter()
        perf metricgroup: Fix find_evsel_group() event selector
    • Merge tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · d9403d30
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - A build fix to always build modules with the 'medany' code model, as
         the module loader doesn't support 'medlow'.
      
       - A Kconfig warning fix for the SiFive errata.
      
       - A pair of fixes for regressions in the recent memory layout
         changes.
      
       - A fix for the FU740 device tree.
      
      * tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: dts: fu740: fix cache-controller interrupts
        riscv: Ensure BPF_JIT_REGION_START aligned with PMD size
        riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name
        riscv: sifive: fix Kconfig errata warning
        riscv32: Use medany C model for modules
    • Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · e14c779a
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix zcrypt ioctl hang due to AP queue msg counter dropping below 0
         when pending requests are purged.
      
       - Two fixes for the machine check handler in the entry code.
      
      * tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/ap: Fix hanging ioctl caused by wrong msg counter
        s390/mcck: fix invalid KVM guest condition check
        s390/mcck: fix calculation of SIE critical section size
    • tools headers UAPI: Sync linux/in.h copy with the kernel sources · 1792a59e
      Arnaldo Carvalho de Melo authored
      To pick the changes in:
      
        32182747 ("icmp: don't send out ICMP messages with a source address of 0.0.0.0")
      
      That doesn't result in any change in tooling, as INADDR_ are not used to
      generate id->string tables used by 'perf trace'.
      
      This addresses this build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
        diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • tools headers UAPI: Sync asm-generic/unistd.h with the kernel original · 17d27fc3
      Arnaldo Carvalho de Melo authored
      To pick the changes in:
      
        8b1462b6 ("quota: finish disable quotactl_path syscall")
      
      Those headers are used in some arches to generate the syscall table used
      in 'perf trace' to translate syscall numbers into strings.
      
      This addresses this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
        diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Marcin Juszkiewicz <marcin@juszkiewicz.com.pl>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf beauty: Update copy of linux/socket.h with the kernel sources · ef83f9ef
      Arnaldo Carvalho de Melo authored
      To pick the changes in:
      
        ea6932d7 ("net: make get_net_ns return error if NET_NS is disabled")
      
      That doesn't result in any changes in the tables generated from that
      header.
      
      This silences this perf build warning:
      
        Warning: Kernel ABI header at 'tools/perf/trace/beauty/include/linux/socket.h' differs from latest version at 'include/linux/socket.h'
        diff -u tools/perf/trace/beauty/include/linux/socket.h include/linux/socket.h
      
      Cc: Changbin Du <changbin.du@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf test: Fix non-bash issue with stat bpf counters · 482698c2
      Ian Rogers authored
      $(( .. )) is a bash feature but the test's interpreter is #!/bin/sh,
      switch the code to use expr.
      Signed-off-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: bpf@vger.kernel.org
      Link: http://lore.kernel.org/lkml/20210617184216.2075588-1-irogers@google.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL · c087e948
      Riccardo Mancini authored
      ASan reported a memory leak of BPF-related ksymbols map and dso. The
      leak is caused by refcount never reaching 0, due to missing __put calls
      in the function machine__process_ksymbol_register.
      
      Once the dso is inserted in the map, dso__put() should be called
      (map__new2() increases the refcount to 2).
      
      The same thing applies for the map when it's inserted into maps
      (maps__insert() increases the refcount to 2).
      
        $ sudo ./perf record -- sleep 5
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ]
      
        =================================================================
        ==297735==ERROR: LeakSanitizer: detected memory leaks
      
        Direct leak of 6992 byte(s) in 19 object(s) allocated from:
            #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
            #1 0x8e4e53 in map__new2 /home/user/linux/tools/perf/util/map.c:216:20
            #2 0x8cf68c in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:778:10
            [...]
      
        Indirect leak of 8702 byte(s) in 19 object(s) allocated from:
            #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
            #1 0x8728d7 in dso__new_id /home/user/linux/tools/perf/util/dso.c:1256:20
            #2 0x872015 in dso__new /home/user/linux/tools/perf/util/dso.c:1295:9
            #3 0x8cf623 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:774:21
            [...]
      
        Indirect leak of 1520 byte(s) in 19 object(s) allocated from:
            #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
            #1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23
            #2 0x888954 in map__process_kallsym_symbol /home/user/linux/tools/perf/util/symbol.c:710:8
            [...]
      
        Indirect leak of 1406 byte(s) in 19 object(s) allocated from:
            #0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
            #1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23
            #2 0x8cfbd8 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:803:8
            [...]

      Signed-off-by: Riccardo Mancini <rickyman7@gmail.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tommi Rantala <tommi.t.rantala@nokia.com>
      Link: http://lore.kernel.org/lkml/20210612173751.188582-1-rickyman7@gmail.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      c087e948
    • perf metricgroup: Return error code from metricgroup__add_metric_sys_event_iter() · fe7a98b9
      John Garry authored
      The error code is not set at all in the sys event iter function.
      
      This may lead to an uninitialized value of "ret" in
      metricgroup__add_metric() when no CPU metric is added.
      
      Fix by properly setting the error code.
      
      It is not necessary to init "ret" to 0 in metricgroup__add_metric(), as
      if we have no CPU or sys event metric matching, then "has_match" should
      be 0 and "ret" is set to -EINVAL.
      
      However gcc cannot detect that it may not have been set after the
      map_for_each_metric() loop for CPU metrics, which is strange.
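
      A minimal sketch of the pattern, assuming an iterator callback that
      reports its status through a pointer in its private data. All names
      here (struct metric, struct iter_data, sys_event_iter(),
      add_metric_by_name(), sample_metrics) are hypothetical and only
      model the control flow described above, not the perf code itself.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; add_status fakes the result of adding it. */
struct metric {
	const char *name;
	int add_status;
};

struct iter_data {
	const char *wanted;
	bool *has_match;
	int *ret;  /* where the callback publishes its status */
};

static const struct metric sample_metrics[] = {
	{ "ok_metric", 0 },
	{ "bad_metric", -ENOMEM },
};

static int sys_event_iter(const struct metric *m, void *data)
{
	struct iter_data *d = data;
	int ret;

	if (strcmp(m->name, d->wanted) != 0)
		return 0;          /* no match: leave *d->ret untouched */

	*d->has_match = true;
	ret = m->add_status;       /* stand-in for the real "add metric" call */
	*d->ret = ret;             /* publish the status for the caller */
	return ret;
}

static int add_metric_by_name(const struct metric *tbl, size_t n,
			      const char *wanted)
{
	bool has_match = false;
	int ret;                   /* not initialised on purpose */
	struct iter_data d = { wanted, &has_match, &ret };
	size_t i;

	for (i = 0; i < n; i++)
		if (sys_event_iter(&tbl[i], &d))
			break;

	if (!has_match)
		ret = -EINVAL;     /* nothing matched, callback never set ret */
	return ret;
}
```

      Every path sets "ret" before it is returned: a match stores the
      callback's status, and no match falls back to -EINVAL. Dropping the
      store in the callback would leave "ret" unset on the match path.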
      
      Fixes: be335ec2 ("perf metricgroup: Support adding metrics for system PMUs")
      Signed-off-by: John Garry <john.garry@huawei.com>
      Acked-by: Ian Rogers <irogers@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/1623335580-187317-3-git-send-email-john.garry@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fe7a98b9
    • perf metricgroup: Fix find_evsel_group() event selector · fc96ec4d
      John Garry authored
      The following command aborts on my x86 Broadwell:
      
        $ ./perf stat  -M frontend_bound,retiring,backend_bound,bad_speculation sleep 1
        WARNING: grouped events cpus do not match, disabling group:
          anon group { raw 0x10e }
          anon group { raw 0x10e }
        perf: util/evsel.c:1596: get_group_fd: Assertion `!(!leader->core.fd)' failed.
        Aborted (core dumped)
      
      The issue shows itself as a use-after-free in evlist__check_cpu_maps(),
      whereby the leader of an event selector (evsel) has been deleted (yet we
      still attempt to verify for an evsel).
      
      Fundamentally the problem comes from metricgroup__setup_events() ->
      find_evsel_group(), and has developed from the previous fix attempt in
      commit 9c880c24 ("perf metricgroup: Fix for metrics containing
      duration_time").
      
      The problem now is that the logic in checking if an evsel is in the same
      group is subtly broken for the "cycles" event. For the "cycles" event,
      the pmu_name is NULL; however the logic in find_evsel_group() may set an
      event matched against "cycles" as used, when it should not be.
      
      This leads to a condition where an evsel is set, yet its leader is not.
      
      Fix the check for evsel pmu_name by not matching evsels when either has a
      NULL pmu_name.
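
      The intended comparison rule can be sketched as below. same_pmu()
      is a hypothetical helper written for illustration; the actual check
      lives in find_evsel_group() and may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: two evsels are only treated as belonging to the
 * same PMU when both names are present and equal.
 */
static bool same_pmu(const char *a, const char *b)
{
	if (!a || !b)
		return false;  /* a NULL name never matches, even another NULL */
	return strcmp(a, b) == 0;
}
```

      With this rule an event such as "cycles", whose pmu_name is NULL,
      is never matched through the pmu_name check.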
      
      There is still a pre-existing metric issue whereby the ordering of the
      metrics may break the 'stat' function, as discussed at:
      https://lore.kernel.org/lkml/49c6fccb-b716-1bf0-18a6-cace1cdb66b9@huawei.com/
      
      Fixes: 9c880c24 ("perf metricgroup: Fix for metrics containing duration_time")
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> # On a Thinkpad T450S
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lore.kernel.org/lkml/1623335580-187317-2-git-send-email-john.garry@huawei.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fc96ec4d
    • riscv: dts: fu740: fix cache-controller interrupts · 7ede12b0
      David Abdurachmanov authored
      The order of interrupt numbers is incorrect.
      
      The order for FU740 is: DirError, DataError, DataFail, DirFail
      
      From SiFive FU740-C000 Manual:
      19 - L2 Cache DirError
      20 - L2 Cache DirFail
      21 - L2 Cache DataError
      22 - L2 Cache DataFail

      Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      7ede12b0
    • riscv: Ensure BPF_JIT_REGION_START aligned with PMD size · 3a02764c
      Jisheng Zhang authored
      Andreas reported that commit fc850476 ("riscv: bpf: Avoid breaking W^X")
      breaks booting with one kind of defconfig. I reproduced a kernel panic
      with that defconfig:
      
      [    0.138553] Unable to handle kernel paging request at virtual address ffffffff81201220
      [    0.139159] Oops [#1]
      [    0.139303] Modules linked in:
      [    0.139601] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc5-default+ #1
      [    0.139934] Hardware name: riscv-virtio,qemu (DT)
      [    0.140193] epc : __memset+0xc4/0xfc
      [    0.140416]  ra : skb_flow_dissector_init+0x1e/0x82
      [    0.140609] epc : ffffffff8029806c ra : ffffffff8033be78 sp : ffffffe001647da0
      [    0.140878]  gp : ffffffff81134b08 tp : ffffffe001654380 t0 : ffffffff81201158
      [    0.141156]  t1 : 0000000000000002 t2 : 0000000000000154 s0 : ffffffe001647dd0
      [    0.141424]  s1 : ffffffff80a43250 a0 : ffffffff81201220 a1 : 0000000000000000
      [    0.141654]  a2 : 000000000000003c a3 : ffffffff81201258 a4 : 0000000000000064
      [    0.141893]  a5 : ffffffff8029806c a6 : 0000000000000040 a7 : ffffffffffffffff
      [    0.142126]  s2 : ffffffff81201220 s3 : 0000000000000009 s4 : ffffffff81135088
      [    0.142353]  s5 : ffffffff81135038 s6 : ffffffff8080ce80 s7 : ffffffff80800438
      [    0.142584]  s8 : ffffffff80bc6578 s9 : 0000000000000008 s10: ffffffff806000ac
      [    0.142810]  s11: 0000000000000000 t3 : fffffffffffffffc t4 : 0000000000000000
      [    0.143042]  t5 : 0000000000000155 t6 : 00000000000003ff
      [    0.143220] status: 0000000000000120 badaddr: ffffffff81201220 cause: 000000000000000f
      [    0.143560] [<ffffffff8029806c>] __memset+0xc4/0xfc
      [    0.143859] [<ffffffff8061e984>] init_default_flow_dissectors+0x22/0x60
      [    0.144092] [<ffffffff800010fc>] do_one_initcall+0x3e/0x168
      [    0.144278] [<ffffffff80600df0>] kernel_init_freeable+0x1c8/0x224
      [    0.144479] [<ffffffff804868a8>] kernel_init+0x12/0x110
      [    0.144658] [<ffffffff800022de>] ret_from_exception+0x0/0xc
      [    0.145124] ---[ end trace f1e9643daa46d591 ]---
      
      After some investigation, I think I found the root cause: commit
      2bfc6cd8 ("move kernel mapping outside of linear mapping") moves the
      BPF JIT region after the kernel:
      
      | #define BPF_JIT_REGION_START	PFN_ALIGN((unsigned long)&_end)
      
      &_end is unlikely to be aligned to PMD size, so the start of the BPF JIT
      region shares one PMD-sized mapping with part of the kernel .data
      section. Because the kernel is mapped at PMD granularity, when
      bpf_jit_binary_lock_ro() is called to make the first BPF JIT program
      ROX, part of the kernel .data section becomes RO too. A later write to
      that part of .data, for example via memset(), then makes the MMU raise
      a store page fault.
      
      To fix the issue, we need to ensure the BPF JIT region is PMD size
      aligned. This patch achieves this by restoring the BPF JIT region to
      its original position, i.e. the 128MB before the kernel .text section.
      The modification to kasan_init.c is inspired by Alexandre.
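
      The alignment problem can be illustrated with plain address
      arithmetic. This is a sketch under stated assumptions: 4 KiB pages
      and 2 MiB PMD mappings, an example end-of-kernel address, and helper
      names (align_up(), align_down(), jit_shares_pmd_with_kernel()) that
      are hypothetical. It shows why page alignment is not enough; it is
      not the patch itself, which moves the region instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE (1ULL << 12)  /* 4 KiB, assumed */
#define SKETCH_PMD_SIZE  (1ULL << 21)  /* 2 MiB, assumed */

/* Round addr up to the next multiple of size (size is a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
	return (addr + size - 1) & ~(size - 1);
}

/* Round addr down to a multiple of size (size is a power of two). */
static uint64_t align_down(uint64_t addr, uint64_t size)
{
	return addr & ~(size - 1);
}

/*
 * True when the PMD-sized block that contains jit_start also contains
 * kernel bytes, i.e. it begins below kernel_end.
 */
static bool jit_shares_pmd_with_kernel(uint64_t kernel_end, uint64_t jit_start)
{
	return align_down(jit_start, SKETCH_PMD_SIZE) < kernel_end;
}
```

      For an example kernel end of 0xffffffff81200f00, page alignment
      gives a JIT start of 0xffffffff81201000, which still sits in the
      2 MiB block starting at 0xffffffff81200000 together with kernel
      data. A PMD-aligned start of 0xffffffff81400000 does not.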
      
      Fixes: fc850476 ("riscv: bpf: Avoid breaking W^X")
      Reported-by: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      3a02764c