- 08 Dec, 2020 24 commits
-
-
Goldwyn Rodrigues authored
The read and write DIO don't have anything in common except for the call to iomap_dio_rw. Extract the write call into a new function to get rid of conditional statements for direct write. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Add /sys/fs/btrfs/UUID/read_policy attribute so that the read policy for the raid1, raid1c34 and raid10 can be tuned. When this attribute is read, it will show all available policies, with active policy in [ ]. The read_policy attribute can be written using one of the items listed in there. For example: $ cat /sys/fs/btrfs/UUID/read_policy [pid] $ echo pid > /sys/fs/btrfs/UUID/read_policy Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
As of now, we use the pid method to read striped mirrored data, which means process id determines the stripe id to read. This type of routing typically helps in a system with many small independent processes tying to read random data. On the other hand, the pid based read IO policy is inefficient because if there is a single process trying to read a large file, the overall disk bandwidth remains underutilized. So this patch introduces a read policy framework so that we could add more read policies, such as IO routing based on the device's wait-queue or manual when we have a read-preferred device or a policy based on the target storage caching. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Add a generic helper to match the string in a given buffer, and ignore the leading and trailing whitespace. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename variables, add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We do not need anymore to start writeback for delalloc of roots that are being snapshotted and wait for it to complete. This was done in commit 609e804d ("Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes") to fix a type of file corruption where files in a snapshot end up having their i_size updated in a non-ordered way, leaving implicit file holes, when buffered IO writes that increase a file's size are followed by direct IO writes that also increase the file's size. This is not needed anymore because we now have a more generic mechanism to prevent a non-ordered i_size update since commit 9ddc959e ("btrfs: use the file extent tree infrastructure"), which addresses this scenario involving snapshots as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Historically we've implemented our own locking because we wanted to be able to selectively spin or sleep based on what we were doing in the tree. For instance, if all of our nodes were in cache then there's rarely a reason to need to sleep waiting for node locks, as they'll likely become available soon. At the time this code was written the rw_semaphore didn't do adaptive spinning, and thus was orders of magnitude slower than our home grown locking. However now the opposite is the case. There are a few problems with how we implement blocking locks, namely that we use a normal waitqueue and simply wake everybody up in reverse sleep order. This leads to some suboptimal performance behavior, and a lot of context switches in highly contended cases. The rw_semaphores actually do this properly, and also have adaptive spinning that works relatively well. The locking code is also a bit of a bear to understand, and we lose the benefit of lockdep for the most part because the blocking states of the lock are simply ad-hoc and not mapped into lockdep. So rework the locking code to drop all of this custom locking stuff, and simply use a rw_semaphore for everything. This makes the locking much simpler for everything, as we can now drop a lot of cruft and blocking transitions. The performance numbers vary depending on the workload, because generally speaking there doesn't tend to be a lot of contention on the btree. However, on my test system which is an 80 core single socket system with 256GiB of RAM and a 2TiB NVMe drive I get the following results (with all debug options off): dbench 200 baseline Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms dbench 200 with patch Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms Previously we also used fs_mark to test this sort of contention, and those results are far less impressive, mostly because there's not enough tasks to really stress the locking fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16 baseline Average Files/sec: 160166.7 p50 Files/sec: 165832 p90 Files/sec: 123886 p99 Files/sec: 123495 real 3m26.527s user 2m19.223s sys 48m21.856s patched Average Files/sec: 164135.7 p50 Files/sec: 171095 p90 Files/sec: 122889 p99 Files/sec: 113819 real 3m29.660s user 2m19.990s sys 44m12.259s Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Just open code it in its sole caller and remove a level of indirection. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Now that we have the building blocks for some better recovery options with corrupted file systems, add a rescue=all option to enable all of the relevant rescue options. This will allow distros to simply default to rescue=all for the "oh dear lord the world's on fire" recovery without needing to know all the different options that we have and may add in the future. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
There are cases where you can end up with bad data csums because of misbehaving applications. This happens when an application modifies a buffer in-flight when doing an O_DIRECT write. In order to recover the file we need a way to turn off data checksums so you can copy the file off, and then you can delete the file and restore it properly later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
The standalone option usebackuproot was intended as one-time use and it was not necessary to keep it in the option list. Now that we're going to have more rescue options, it's desirable to keep them intact as it could be confusing why the option disappears. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ remove the btrfs_clear_opt part from open_ctree ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We're going to have a lot of rescue options, add a helper to collapse the /proc/mounts output to rescue=option1:option2:option3 format. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We're going to be adding a variety of different rescue options, we should advertise which ones we support to make user spaces life easier in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
When we move to being able to handle NULL csum_roots it'll be cleaner to just check in btrfs_lookup_bio_sums instead of at all of the caller locations, so push the NODATASUM check into it as well so it's unified. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We're going to be adding more options that require RDONLY, so add a helper to do the check and error out if we don't have RDONLY set. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When scrubbing a stripe of a block group we always start readahead for the checksums btree and wait for it to complete, however when the blockgroup is not a data block group (or a mixed block group) it is a waste of time to do it, since there are no checksums for metadata extents in that btree. So skip that when the block group does not have the data flag set, saving some time doing memory allocations, queueing a job in the readahead work queue, waiting for it to complete and potentially avoiding some IO as well (when csum tree extents are not in memory already). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When we drop the last reference of a zone, we end up releasing it through the callback reada_zone_release(), which deletes the zone from a device's reada_zones radix tree. This tree is protected by the global readahead lock at fs_info->reada_lock. Currently all places that are sure that they are dropping the last reference on a zone, are calling kref_put() in a critical section delimited by this lock, while all other places that are sure they are not dropping the last reference, do not bother calling kref_put() while holding that lock. When working on the previous fix for hangs and use-after-frees in the readahead code, my initial attempts were different and I actually ended up having reada_zone_release() called when not holding the lock, which resulted in weird and unexpected problems. So just add an assertion there to detect such problem more quickly and make the dependency more obvious. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Goldwyn Rodrigues authored
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Goldwyn Rodrigues authored
round_down looks prettier than the bit mask operations. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Goldwyn Rodrigues authored
While using compression, a submitted bio is mapped with a compressed bio which performs the read from disk, decompresses and returns uncompressed data to original bio. The original bio must reflect the uncompressed size (iosize) of the I/O to be performed, or else the page just gets the decompressed I/O length of data (disk_io_size). The compressed bio checks the extent map and gets the correct length while performing the I/O from disk. This came up in subpage work when only compressed length of the original bio was filled in the page. This worked correctly for pagesize == sectorsize because both compressed and uncompressed data are at pagesize boundaries, and would end up filling the requested page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Goldwyn Rodrigues authored
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables such as num_pages and reserve_bytes once we are sure of the value of write_bytes so there is no need to re-calculate. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
If transaction_kthread is woken up before btrfs_fs_info::commit_interval seconds have elapsed it will sleep for a fixed period of 5 seconds. This is not a problem per-se but is not accurate. Instead the code should sleep for an interval which guarantees on next wakeup commit_interval would have passed. Since time tracking is not precise subtract 1 second from delta to ensure the delay we end up waiting will be longer than than the wake up period. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Rename 'now' to 'delta' and store there the delta between transaction start time and current time. This is in preparation for optimising the sleep logic in the next patch. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
The value obtained from ktime_get_seconds() is guaranteed to be monotonically increasing since it's taken from CLOCK_MONOTONIC. As transaction_kthread obtains a reference to the currently running transaction under holding btrfs_fs_info::trans_lock it's guaranteed to: a) see an initialized 'cur', whose start_time is guaranteed to be smaller than 'now' or b) not obtain a 'cur' and simply go to sleep. Given this remove the unnecessary check, if it sees now < cur->start_time this would imply there are far greater problems on the machine. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 07 Dec, 2020 2 commits
-
-
Nikolay Borisov authored
The kernel provides easy to understand helpers to convert from human understandable units to the kernel-friendly 'jiffies'. So let's use those to make the code easier to understand. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Matching with the information that's available from the ioctl FS_INFO, add generation to the per-filesystem directory /sys/fs/btrfs/UUID/generation, which could be used by scripts. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
- 06 Dec, 2020 14 commits
-
-
Linus Torvalds authored
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-miscLinus Torvalds authored
Pull char/misc driver fixes from Greg KH: "Here are some small driver fixes, and one "large" revert, for 5.10-rc7. They include: - revert mei patch from 5.10-rc1 that was using a reserved userspace value. It will be resubmitted once the proper id has been assigned by the virtio people. - habanalabs fixes found by the fall-through audit from Gustavo - speakup driver fixes for reported issues - fpga config build fix for reported issue. All of these except the revert have been in linux-next with no reported issues. The revert is "clean" and just removes a previously-added driver, so no real issue there" * tag 'char-misc-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: Revert "mei: virtio: virtualization frontend driver" fpga: Specify HAS_IOMEM dependency for FPGA_DFL habanalabs: put devices before driver removal habanalabs: free host huge va_range if not used speakup: Reject setting the speakup line discipline outside of speakup
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/ttyLinus Torvalds authored
Pull tty fixes from Greg KH: "Here are two tty core fixes for 5.10-rc7. They resolve some reported locking issues in the tty core. While they have not been in a released linux-next yet, they have passed all of the 0-day bot testing as well as the submitter's testing" * tag 'tty-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: Fix ->session locking tty: Fix ->pgrp locking in tiocspgrp()
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usbLinus Torvalds authored
Pull USB fixes from Greg KH: "Here are some small USB fixes for 5.10-rc7 that resolve a number of reported issues, and add some new device ids. Nothing major here, but these solve some problems that people were having with the 5.10-rc tree: - reverts for USB storage dma settings that broke working devices - thunderbolt use-after-free fix - cdns3 driver fixes - gadget driver userspace copy fix - new device ids All of these except for the reverts have been in linux-next with no reported issues. The reverts are "clean" and were tested by Hans, as well as passing the 0-day tests" * tag 'usb-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: gadget: f_fs: Use local copy of descriptors for userspace copy usb: ohci-omap: Fix descriptor conversion Revert "usb-storage: fix sdev->host->dma_dev" Revert "uas: fix sdev->host->dma_dev" Revert "uas: bump hw_max_sectors to 2048 blocks for SS or faster drives" USB: serial: kl5kusb105: fix memleak on open USB: serial: ch341: sort device-id entries USB: serial: ch341: add new Product ID for CH341A USB: serial: option: fix Quectel BG96 matching usb: cdns3: core: fix goto label for error path usb: cdns3: gadget: clear trb->length as zero after preparing every trb usb: cdns3: Fix hardware based role switch USB: serial: option: add support for Thales Cinterion EXS82 USB: serial: option: add Fibocom NL668 variants thunderbolt: Fix use-after-free in remove_unplugged_switch()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Make the AMD L3 QoS code and data priorization enable/disable mechanism work correctly. The control bit was only set/cleared on one of the CPUs in a L3 domain, but it has to be modified on all CPUs in the domain. The initial documentation was not clear about this, but the updated one from Oct 2020 spells it out. - Fix an off by one in the UV platform detection code which causes the UV hubs to be identified wrongly. The chip revisions start at 1 not at 0. - Fix a long standing bug in the evaluation of prefixes in the uprobes code which fails to handle repeated prefixes properly. The aggregate size of the prefixes can be larger than the bytes array but the code blindly iterated over the aggregate size beyond the array boundary. Add a macro to handle this case properly and use it at the affected places" * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes x86/platform/uv: Fix UV4 hub revision adjustment x86/resctrl: Fix AMD L3 QOS CDP enable/disable
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull perf fixes from Thomas Gleixner: "Two fixes for performance monitoring on X86: - Add recursion protection to another callchain invoked from x86_pmu_stop() which can recurse back into x86_pmu_stop(). The first attempt to fix this missed this extra code path. - Use the already filtered status variable to check for PEBS counter overflow bits and not the unfiltered full status read from IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which would be evaluated incorrectly" * tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Check PEBS status correctly perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull irq fixes from Thomas Gleixner: "A set of updates for the interrupt subsystem: - Make multiqueue devices which use the managed interrupt affinity infrastructure work on PowerPC/Pseries. PowerPC does not use the generic infrastructure for setting up PCI/MSI interrupts and the multiqueue changes failed to update the legacy PCI/MSI infrastructure. Make this work by passing the affinity setup information down to the mapping and allocation functions. - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is bouncing and he's not reachable. We hope all is well with him and say thanks for his work over the years" * tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: powerpc/pseries: Pass MSI affinity to irq_create_mapping() genirq/irqdomain: Add an irq_create_mapping_affinity() function MAINTAINERS: Move Jason Cooper to CREDITS
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull intel_idle build fix from Thomas Gleixner: "A tiny build fix for a recent change in the intel_idle driver which missed a CONFIG dependency and broke the build for certain configurations" * tag 'locking-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intel_idle: Build fix
-
Linus Torvalds authored
Merge tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Move -Wcast-align to W=3, which tends to be false-positive and there is no tree-wide solution. - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a preprocessor option and makes sense for .S files as well. - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings. - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings. - Fix undesirable line breaks in *.mod files. * tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: avoid split lines in .mod files kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1 kbuild: Hoist '--orphan-handling' into Kconfig Kbuild: do not emit debug info for assembly with LLVM_IAS=1 kbuild: use -fmacro-prefix-map for .S sources Makefile.extrawarn: move -Wcast-align to W=3
-
Linus Torvalds authored
Merge misc fixes from Andrew Morton: "12 patches. Subsystems affected by this patch series: mm (memcg, zsmalloc, swap, mailmap, selftests, pagecache, hugetlb, pagemap), lib, and coredump" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/mmap.c: fix mmap return value when vma is merged after call_mmap() hugetlb_cgroup: fix offline of hugetlb cgroup with reservations mm/filemap: add static for function __add_to_page_cache_locked userfaultfd: selftests: fix SIGSEGV if huge mmap fails tools/testing/selftests/vm: fix build error mailmap: add two more addresses of Uwe Kleine-König mm/swapfile: do not sleep with a spin lock held mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING mm: list_lru: set shrinker map bit when child nr_items is not zero mm: memcg/slab: fix obj_cgroup_charge() return value handling coredump: fix core_pattern parse error zlib: export S390 symbols for zlib modules
-
Liu Zixian authored
On success, mmap should return the begin address of newly mapped area, but patch "mm: mmap: merge vma after call_mmap() if possible" set vm_start of newly merged vma to return value addr. Users of mmap will get wrong address if vma is merged after call_mmap(). We fix this by moving the assignment to addr before merging vma. We have a driver which changes vm_flags, and this bug is found by our testcases. Fixes: d70cec89 ("mm: mmap: merge vma after call_mmap() if possible") Signed-off-by: Liu Zixian <liuzixian4@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Hongxiang Lou <louhongxiang@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mike Kravetz authored
Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload using hugetlbfs. In this environment the issue is reproduced by: - Start a simple pod that uses the recently added HugePages medium feature (pod yaml attached) - Start a DPDK app. It doesn't need to run successfully (as in transfer packets) nor interact with real hardware. It seems just initializing the EAL layer (which handles hugepage reservation and locking) is enough to trigger the issue - Delete the Pod (or let it "Complete"). This would result in a kworker thread going into a tight loop (top output): 1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy 'perf top -g' reports: - 63.28% 0.01% [kernel] [k] worker_thread - 49.97% worker_thread - 52.64% process_one_work - 62.08% css_killed_work_fn - hugetlb_cgroup_css_offline 41.52% _raw_spin_lock - 2.82% _cond_resched rcu_all_qs 2.66% PageHuge - 0.57% schedule - 0.57% __schedule We are spinning in the do-while loop in hugetlb_cgroup_css_offline. Worse yet, we are holding the master cgroup lock (cgroup_mutex) while infinitely spinning. Little else can be done on the system as the cgroup_mutex can not be acquired. Do note that the issue can be reproduced by simply offlining a hugetlb cgroup containing pages with reservation counts. The loop in hugetlb_cgroup_css_offline is moving page counts from the cgroup being offlined to the parent cgroup. This is done for each hstate, and is repeated until hugetlb_cgroup_have_usage returns false. The routine moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts. The routine hugetlb_cgroup_have_usage is checking for both 'usage' and 'reservation' counts. Discussion about what to do with reservation counts when reparenting was discussed here: https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/ The decision was made to leave a zombie cgroup for with reservation counts. Unfortunately, the code checking reservation counts was incorrectly added to hugetlb_cgroup_have_usage. To fix the issue, simply remove the check for reservation counts. While fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed. The hstate index is not reinitialized each time through the do-while loop. Fix this as well. Fixes: 1adc4d41 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations") Reported-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Adrian Moreno <amorenoz@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Alex Shi authored
mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Axel Rasmussen authored
The error handling in hugetlb_allocate_area() was incorrect for the hugetlb_shared test case. Previously the behavior was: - mmap a hugetlb area - If this fails, set the pointer to NULL, and carry on - mmap an alias of the same hugetlb fd - If this fails, munmap the original area If the original mmap failed, it's likely the second one did too. If both failed, we'd blindly try to munmap a NULL pointer, causing a SIGSEGV. Instead, "goto fail" so we return before trying to mmap the alias. This issue can be hit "in real life" by forgetting to set /proc/sys/vm/nr_hugepages (leaving it at 0), and then trying to run the hugetlb_shared test. Another small improvement is, when the original mmap fails, don't just print "it failed": perror(), so we can see *why*. :) Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Alan Gilbert <dgilbert@redhat.com> Link: https://lkml.kernel.org/r/20201204203443.2714693-1-axelrasmussen@google.comSigned-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-