Commits · 688b0d8536e0e937f608a93cb6909e14389a0c45 · Kirill Smelkov / linux

13 Mar, 2022 3 commits

doc: fixed a typo in ext4 documentation · 688b0d85

lianzhi chang authored Mar 10, 2022

The unit of file system size should be TiB, not PiB
Signed-off-by: lianzhi chang <changlianzhi@uniontech.com>
Link: https://lore.kernel.org/r/20220310014415.29937-1-changlianzhi@uniontech.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

688b0d85

ext4: make mb_optimize_scan performance mount option work with extents · 077d0c2c

Ojaswin Mujoo authored Mar 08, 2022

Currently mb_optimize_scan scan feature which improves filesystem
performance heavily (when FS is fragmented), seems to be not working
with files with extents (ext4 by default has files with extents).

This patch fixes that and makes mb_optimize_scan feature work
for files with extents.

Below are some performance numbers obtained when allocating a 10M and 100M
file with and w/o this patch on a filesytem with no 1M contiguous block.

<perf numbers>
===============
Workload: dd if=/dev/urandom of=test conv=fsync bs=1M count=10/100

Time taken
=====================================================
no.     Size   without-patch     with-patch    Diff(%)
1       10M      0m8.401s         0m5.623s     33.06%
2       100M     1m40.465s        1m14.737s    25.6%

<debug stats>
=============
w/o patch:
  mballoc:
    reqs: 17056
    success: 11407
    groups_scanned: 13643
    cr0_stats:
            hits: 37
            groups_considered: 9472
            useless_loops: 36
            bad_suggestions: 0
    cr1_stats:
            hits: 11418
            groups_considered: 908560
            useless_loops: 1894
            bad_suggestions: 0
    cr2_stats:
            hits: 1873
            groups_considered: 6913
            useless_loops: 21
    cr3_stats:
            hits: 21
            groups_considered: 5040
            useless_loops: 21
    extents_scanned: 417364
            goal_hits: 3707
            2^n_hits: 37
            breaks: 1873
            lost: 0
    buddies_generated: 239/240
    buddies_time_used: 651080
    preallocated: 705
    discarded: 478

with patch:
  mballoc:
    reqs: 12768
    success: 11305
    groups_scanned: 12768
    cr0_stats:
            hits: 1
            groups_considered: 18
            useless_loops: 0
            bad_suggestions: 0
    cr1_stats:
            hits: 5829
            groups_considered: 50626
            useless_loops: 0
            bad_suggestions: 0
    cr2_stats:
            hits: 6938
            groups_considered: 580363
            useless_loops: 0
    cr3_stats:
            hits: 0
            groups_considered: 0
            useless_loops: 0
    extents_scanned: 309059
            goal_hits: 0
            2^n_hits: 1
            breaks: 1463
            lost: 0
    buddies_generated: 239/240
    buddies_time_used: 791392
    preallocated: 673
    discarded: 446

Fixes: 196e402a (ext4: improve cr 0 / cr 1 group scanning)
Cc: stable@kernel.org
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Suggested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/fc9a48f7f8dcfc83891a8b21f6dd8cdf056ed810.1646732698.git.ojaswin@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

077d0c2c

ext4: make mb_optimize_scan option work with set/unset mount cmd · 27b38686

Ojaswin Mujoo authored Mar 08, 2022

After moving to the new mount API, mb_optimize_scan mount option
handling was not working as expected due to the parsed value always
being overwritten by default. Refactor and fix this to the expected
behavior described below:

*  mb_optimize_scan=1 - On
*  mb_optimize_scan=0 - Off
*  mb_optimize_scan not passed - On if no. of BGs > threshold else off
*  Remounts retain previous value unless we explicitly pass the option
   with a new value

Fixes: cebe85d5 ("ext4: switch to the new mount api")
Cc: stable@kernel.org
Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c98970fe99f26718586d02e942f293300fb48ef3.1646732698.git.ojaswin@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

27b38686

03 Mar, 2022 7 commits

ext4: don't BUG if someone dirty pages without asking ext4 first · cc509574

Theodore Ts'o authored Mar 03, 2022

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu

cc509574

ext4: remove redundant assignment to variable split_flag1 · 6b71b69d

Colin Ian King authored Mar 01, 2022

Variable split_flag1 is being assigned a value that is never read,
it is being re-assigned a new value in the following code block.
The assignment is redundant and can be removed.

Cleans up clang scan build warning:
fs/ext4/extents.c:3371:2: warning: Value stored to 'split_flag1' is
never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220301121644.997833-1-colin.i.king@gmail.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

6b71b69d

ext4: fix underflow in ext4_max_bitmap_size() · 5c93e8ec

Zhang Yi authored Mar 01, 2022

when ext4 filesystem is created with 64k block size, ^extent and
^huge_file features. the upper_limit would underflow during the
computations in ext4_max_bitmap_size(). The problem is the size of block
index tree for such large block size is more than i_blocks can carry.
So fix the computation to count with this possibility. After this fix,
the 'res' cannot overflow loff_t on the extreme case of filesystem with
huge_files and 64K block size, so this patch also revert commit
75ca6ad4 ("ext4: fix loff_t overflow in ext4_max_bitmap_size()").
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220301111704.2153829-1-yi.zhang@huawei.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

5c93e8ec

ext4: fix ext4_mb_clear_bb() kernel-doc comment · fd9b6fad

Yang Li authored Mar 01, 2022

Remove the excess description of @bh in ext4_mb_clear_bb() kernel-doc
comment to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.

fs/ext4/mballoc.c:5895: warning: Excess function parameter 'bh'
description in 'ext4_mb_clear_bb'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220301092136.34764-1-yang.lee@linux.alibaba.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

fd9b6fad

ext4: fix fs corruption when tring to remove a non-empty directory with IO error · 7aab5c84

Ye Bin authored Feb 28, 2022

We inject IO error when rmdir non empty direcory, then got issue as follows:
step1: mkfs.ext4 -F /dev/sda
step2: mount /dev/sda  test
step3: cd test
step4: mkdir -p 1/2
step5: rmdir 1
	[  110.920551] ext4_empty_dir: inject fault
	[  110.921926] EXT4-fs warning (device sda): ext4_rmdir:3113: inode #12:
	comm rmdir: empty directory '1' has too many links (3)
step6: cd ..
step7: umount test
step8: fsck.ext4 -f /dev/sda
	e2fsck 1.42.9 (28-Dec-2013)
	Pass 1: Checking inodes, blocks, and sizes
	Pass 2: Checking directory structure
	Entry '..' in .../??? (13) has deleted/unused inode 12.  Clear<y>? yes
	Pass 3: Checking directory connectivity
	Unconnected directory inode 13 (...)
	Connect to /lost+found<y>? yes
	Pass 4: Checking reference counts
	Inode 13 ref count is 3, should be 2.  Fix<y>? yes
	Pass 5: Checking group summary information

	/dev/sda: ***** FILE SYSTEM WAS MODIFIED *****
	/dev/sda: 12/131072 files (0.0% non-contiguous), 26157/524288 blocks

ext4_rmdir
	if (!ext4_empty_dir(inode))
		goto end_rmdir;
ext4_empty_dir
	bh = ext4_read_dirblock(inode, 0, DIRENT_HTREE);
	if (IS_ERR(bh))
		return true;
Now if read directory block failed, 'ext4_empty_dir' will return true, assume
directory is empty. Obviously, it will lead to above issue.
To solve this issue, if read directory block failed 'ext4_empty_dir' just
return false. To avoid making things worse when file system is already
corrupted, 'ext4_empty_dir' also return false.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20220228024815.3952506-1-yebin10@huawei.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

7aab5c84

ext4: use time_is_before_jiffies() instead of open coding it · a861fb9f

Wang Qing authored Feb 27, 2022

Use the helper function time_is_{before,after}_jiffies() to improve
code readability.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Link: https://lore.kernel.org/r/1646018120-61462-1-git-send-email-wangqing@vivo.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

a861fb9f

ext4: improve fast_commit performance and scalability · b3998b3b

Ritesh Harjani authored Feb 21, 2022

Currently ext4_fc_commit_dentry_updates() is of quadratic time
complexity, which is causing performance bottlenecks with high
threads/file/dir count with fs_mark.

This patch makes commit dentry updates (and hence ext4_fc_commit()) path
to linear time complexity. Hence improves the performance of workloads
which does fsync on multiple threads/open files one-by-one.

Absolute numbers in avg file creates per sec (from fs_mark in 1K order)
=======================================================================
no.     Order   without-patch(K)   with-patch(K)   Diff(%)
1       1        16.90              17.51           +3.60
2       2,2      32.08              31.80           -0.87
3       3,3      53.97              55.01           +1.92
4       4,4      78.94              76.90           -2.58
5       5,5      95.82              95.37           -0.46
6       6,6      87.92              103.38          +17.58
7       6,10      0.73              126.13          +17178.08
8       6,14      2.33              143.19          +6045.49

workload type
==============
For e.g. 7th row order of 6,10 (2^6 == 64 && 2^10 == 1024)
echo /run/riteshh/mnt/{1..64} |sed -E 's/[[:space:]]+/ -d /g' \
  | xargs -I {} bash -c "sudo fs_mark -L 100 -D 1024 -n 1024 -s0 -S5 -d {}"

Perf profile
(w/o patches)
=============================
87.15%  [kernel]  [k] ext4_fc_commit           --> Heavy contention/bottleneck
 1.98%  [kernel]  [k] perf_event_interrupt
 0.96%  [kernel]  [k] power_pmu_enable
 0.91%  [kernel]  [k] update_sd_lb_stats.constprop.0
 0.67%  [kernel]  [k] ktime_get
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/930f35d4fd5f83e2673c868781d9ebf15e91bf4e.1645426817.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

b3998b3b

26 Feb, 2022 13 commits

ext4: add extra check in ext4_mb_mark_bb() to prevent against possible corruption · 8c91c579

Ritesh Harjani authored Feb 16, 2022

This patch adds an extra checks in ext4_mb_mark_bb() function
to make sure we mark & report error if we were to mark/clear any
of the critical FS metadata specific bitmaps (&bail out) to prevent
from any accidental corruption.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/53cbb6f2573db162a57f935365050d8b1df202ee.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

8c91c579

ext4: add strict range checks while freeing blocks · a00b482b

Ritesh Harjani authored Feb 16, 2022

Currently ext4_mb_clear_bb() & ext4_group_add_blocks() only checks
whether the given block ranges (which is to be freed) belongs to any FS
metadata blocks or not, of the block's respective block group.
But to detect any FS error early, it is better to add more strict
checkings in those functions which checks whether the given blocks
belongs to any critical FS metadata or not within system-zone.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ddd9143d064774e32d6364a99667817c6e8bfdc0.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

a00b482b

ext4: add ext4_sb_block_valid() refactored out of ext4_inode_block_valid() · 6bc6c2bd

Ritesh Harjani authored Feb 16, 2022

This API will be needed at places where we don't have an inode
for e.g. while freeing blocks in ext4_group_add_blocks()
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/dd34a236543ad5ae7123eeebe0cb69e6bdd44f34.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

6bc6c2bd

ext4: no need to test for block bitmap bits in ext4_mb_mark_bb() · bd8247ee

Ritesh Harjani authored Feb 16, 2022

We don't need the return value of mb_test_and_clear_bits() in ext4_mb_mark_bb()
So simply use mb_clear_bits() instead.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a971935306dafb124da0193c7dad1aa485210b62.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

bd8247ee

ext4: rename ext4_set_bits to mb_set_bits · 123e3016

Ritesh Harjani authored Feb 16, 2022

ext4_set_bits() should actually be mb_set_bits() for uniform API naming
convention.
This is via below cmd -

grep -nr "ext4_set_bits" fs/ext4/ | cut -d ":" -f 1 | xargs sed -i 's/ext4_set_bits/mb_set_bits/g'
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f1f6ece1405b76a7a987e9145d1adfaf71e30695.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

123e3016

ext4: use in_range() for range checking in ext4_fc_replay_check_excluded · dbaafbad

Ritesh Harjani authored Feb 16, 2022

Instead of open coding it, use in_range() function instead.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/8e5526ef14150778871ac7c937c8993c6a09cd3e.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

dbaafbad

ext4: refactor ext4_free_blocks() to pull out ext4_mb_clear_bb() · 8ac3939d

Ritesh Harjani authored Feb 16, 2022

ext4_free_blocks() function became too long and confusing, this patch
just pulls out the ext4_mb_clear_bb() function logic from it
which clears the block bitmap and frees it.

No functionality change in this patch
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/22c30fbb26ba409cf8aa5f0c7912970272c459e8.1644992610.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

8ac3939d

ext4: fix ext4_mb_mark_bb() with flex_bg with fast_commit · bfdc502a

Ritesh Harjani authored Feb 16, 2022

In case of flex_bg feature (which is by default enabled), extents for
any given inode might span across blocks from two different block group.
ext4_mb_mark_bb() only reads the buffer_head of block bitmap once for the
starting block group, but it fails to read it again when the extent length
boundary overflows to another block group. Then in this below loop it
accesses memory beyond the block group bitmap buffer_head and results
into a data abort.

	for (i = 0; i < clen; i++)
		if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state)
			already++;

This patch adds this functionality for checking block group boundary in
ext4_mb_mark_bb() and update the buffer_head(bitmap_bh) for every different
block group.

w/o this patch, I was easily able to hit a data access abort using Power platform.

<...>
[   74.327662] EXT4-fs error (device loop3): ext4_mb_generate_buddy:1141: group 11, block bitmap and bg descriptor inconsistent: 21248 vs 23294 free clusters
[   74.533214] EXT4-fs (loop3): shut down requested (2)
[   74.536705] Aborting journal on device loop3-8.
[   74.702705] BUG: Unable to handle kernel data access on read at 0xc00000005e980000
[   74.703727] Faulting instruction address: 0xc0000000007bffb8
cpu 0xd: Vector: 300 (Data Access) at [c000000015db7060]
    pc: c0000000007bffb8: ext4_mb_mark_bb+0x198/0x5a0
    lr: c0000000007bfeec: ext4_mb_mark_bb+0xcc/0x5a0
    sp: c000000015db7300
   msr: 800000000280b033
   dar: c00000005e980000
 dsisr: 40000000
  current = 0xc000000027af6880
  paca    = 0xc00000003ffd5200   irqmask: 0x03   irq_happened: 0x01
    pid   = 5167, comm = mount
<...>
enter ? for help
[c000000015db7380] c000000000782708 ext4_ext_clear_bb+0x378/0x410
[c000000015db7400] c000000000813f14 ext4_fc_replay+0x1794/0x2000
[c000000015db7580] c000000000833f7c do_one_pass+0xe9c/0x12a0
[c000000015db7710] c000000000834504 jbd2_journal_recover+0x184/0x2d0
[c000000015db77c0] c000000000841398 jbd2_journal_load+0x188/0x4a0
[c000000015db7880] c000000000804de8 ext4_fill_super+0x2638/0x3e10
[c000000015db7a40] c0000000005f8404 get_tree_bdev+0x2b4/0x350
[c000000015db7ae0] c0000000007ef058 ext4_get_tree+0x28/0x40
[c000000015db7b00] c0000000005f6344 vfs_get_tree+0x44/0x100
[c000000015db7b70] c00000000063c408 path_mount+0xdd8/0xe70
[c000000015db7c40] c00000000063c8f0 sys_mount+0x450/0x550
[c000000015db7d50] c000000000035770 system_call_exception+0x4a0/0x4e0
[c000000015db7e10] c00000000000c74c system_call_common+0xec/0x250
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/2609bc8f66fc15870616ee416a18a3d392a209c4.1644992609.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

bfdc502a

ext4: correct cluster len and clusters changed accounting in ext4_mb_mark_bb · a5c0e2fd

Ritesh Harjani authored Feb 16, 2022

ext4_mb_mark_bb() currently wrongly calculates cluster len (clen) and
flex_group->free_clusters. This patch fixes that.

Identified based on code review of ext4_mb_mark_bb() function.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/a0b035d536bafa88110b74456853774b64c8ac40.1644992609.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

a5c0e2fd

jbd2: remove CONFIG_JBD2_DEBUG to update t_max_wait · 2d442920

Ritesh Harjani authored Feb 16, 2022

CONFIG_JBD2_DEBUG and jbd2_journal_enable_debug knobs were added in
update_t_max_wait(), since earlier it used to take a spinlock for
updating t_max_wait, which could cause a bottleneck while starting a
txn (start_this_handle()).

Since in previous patch, we have killed t_handle_lock completely, we
could get rid of this debug config and knob to let t_max_wait be
updated by default again.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/ad7319a601fd501079310747ce87d908e0944763.1644992076.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

2d442920

jbd2: kill t_handle_lock transaction spinlock · f7f497cb

Ritesh Harjani authored Feb 16, 2022

This patch kills t_handle_lock transaction spinlock completely from
jbd2.

To explain the reasoning, currently there were three sites at which
this spinlock was used.

1. jbd2_journal_wait_updates()
   a. Based on careful code review it can be seen that, we don't need this
      lock here. This is since we wait for any currently ongoing updates
      based on a atomic variable t_updates. And we anyway don't take any
      t_handle_lock while in stop_this_handle().
      i.e.

	write_lock(&journal->j_state_lock()
	jbd2_journal_wait_updates() 			stop_this_handle()
		while (atomic_read(txn->t_updates) { 		|
		DEFINE_WAIT(wait); 				|
		prepare_to_wait(); 				|
		if (atomic_read(txn->t_updates) 		if (atomic_dec_and_test(txn->t_updates))
			write_unlock(&journal->j_state_lock);
			schedule();					wake_up()
			write_lock(&journal->j_state_lock);
		finish_wait();
	   }
	txn->t_state = T_COMMIT
	write_unlock(&journal->j_state_lock);

   b.  Also note that between atomic_inc(&txn->t_updates) in
       start_this_handle() and jbd2_journal_wait_updates(), the
       synchronization happens via read_lock(journal->j_state_lock) in
       start_this_handle();

2. jbd2_journal_extend()
   a. jbd2_journal_extend() is called with the handle of each process from
      task_struct. So no lock required in updating member fields of handle_t

   b. For member fields of h_transaction, all updates happens only via
      atomic APIs (which is also within read_lock()).
      So, no need of this transaction spinlock.

3. update_t_max_wait()
   Based on Jan suggestion, this can be carefully removed using atomic
   cmpxchg API.
   Note that there can be several processes which are waiting for a new
   transaction to be allocated and started. For doing this only one
   process will succeed in taking write_lock() and allocating a new txn.
   After that all of the process will be updating the t_max_wait (max
   transaction wait time). This can be done via below method w/o taking
   any locks using atomic cmpxchg.
   For more details refer [1]

	   new = get_new_val();
	   old = READ_ONCE(ptr->max_val);
	   while (old < new)
		old = cmpxchg(&ptr->max_val, old, new);

[1]: https://lwn.net/Articles/849237/Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d89e599658b4a1f3893a48c6feded200073037fc.1644992076.git.riteshh@linux.ibm.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

f7f497cb

jbd2: fix use-after-free of transaction_t race · cc16eeca

Ritesh Harjani authored Feb 10, 2022

jbd2_journal_wait_updates() is called with j_state_lock held. But if
there is a commit in progress, then this transaction might get committed
and freed via jbd2_journal_commit_transaction() ->
jbd2_journal_free_transaction(), when we release j_state_lock.
So check for journal->j_running_transaction everytime we release and
acquire j_state_lock to avoid use-after-free issue.

Link: https://lore.kernel.org/r/948c2fed518ae739db6a8f7f83f1d58b504f87d0.1644497105.git.ritesh.list@gmail.com
Fixes: 4f981868 ("jbd2: refactor wait logic for transaction updates into a common function")
Cc: stable@kernel.org
Reported-and-tested-by: syzbot+afa2ca5171d93e44b348@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

cc16eeca

ext4: fix remount with 'abort' option · e3952fcc

Lukas Czerner authored Feb 01, 2022

After commit 6e47a3cc ("ext4: get rid of super block and sbi from
handle_mount_ops()") the 'abort' options stopped working. This is
because we're using ctx_set_mount_flags() helper that's expecting an
argument with the appropriate bit set, but instead got
EXT4_MF_FS_ABORTED which is a bit position. ext4_set_mount_flag() is
using set_bit() while ctx_set_mount_flags() was using bitwise OR.

Create a separate helper ctx_set_mount_flag() to handle setting the
mount_flags correctly.

While we're at it clean up the EXT4_SET_CTX macros so that we're only
creating helpers that we actually use to avoid warnings.

Fixes: 6e47a3cc ("ext4: get rid of super block and sbi from handle_mount_ops()")
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Ye Bin <yebin10@huawei.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220201131345.77591-1-lczerner@redhat.comSigned-off-by: Theodore Ts'o <tytso@mit.edu>

e3952fcc

20 Feb, 2022 13 commits

Linux 5.17-rc5 · cfb92440
Linus Torvalds authored Feb 20, 2022

cfb92440

Merge tag 'locking_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3324e6e8

Linus Torvalds authored Feb 20, 2022

Pull locking fix from Borislav Petkov:
 "Fix a NULL ptr dereference when dumping lockdep chains through
  /proc/lockdep_chains"

* tag 'locking_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Correct lock_classes index mapping

3324e6e8

Merge tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 22217739

Linus Torvalds authored Feb 20, 2022

Pull x86 fixes from Borislav Petkov:

 - Fix the ptrace regset xfpregs_set() callback to behave according to
   the ABI

 - Handle poisoned pages properly in the SGX reclaimer code

* tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing
  x86/sgx: Fix missing poison handling in reclaimer

22217739

Merge tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0b0894ff

Linus Torvalds authored Feb 20, 2022

Pull scheduler fix from Borislav Petkov:
 "Fix task exposure order when forking tasks"

* tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix yet more sched_fork() races

0b0894ff

Merge tag 'edac_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras · 6e8e752f

Linus Torvalds authored Feb 20, 2022

Pull EDAC fix from Borislav Petkov:
 "Fix a long-standing struct alignment bug in the EDAC struct allocation
  code"

* tag 'edac_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC: Fix calculation of returned address and next offset in edac_align_ptr()

6e8e752f

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · e268d708

Linus Torvalds authored Feb 20, 2022

Pull SCSI fixes from James Bottomley:
 "Three fixes, all in drivers.

  The ufs and qedi fixes are minor; the lpfc one is a bit bigger because
  it involves adding a heuristic to detect and deal with common but not
  standards compliant behaviour"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Fix divide by zero in ufshcd_map_queues()
  scsi: lpfc: Fix pt2pt NVMe PRLI reject LOGO loop
  scsi: qedi: Fix ABBA deadlock in qedi_process_tmf_resp() and qedi_process_cmd_cleanup_resp()

e268d708

Merge tag 'dmaengine-fix-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · 77478077

Linus Torvalds authored Feb 20, 2022

Pull dmaengine fixes from Vinod Koul:
 "A bunch of driver fixes for:

   - ptdma error handling in init

   - lock fix in at_hdmac

   - error path and error num fix for sh dma

   - pm balance fix for stm32"

* tag 'dmaengine-fix-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
  dmaengine: shdma: Fix runtime PM imbalance on error
  dmaengine: sh: rcar-dmac: Check for error num after dma_set_max_seg_size
  dmaengine: stm32-dmamux: Fix PM disable depth imbalance in stm32_dmamux_probe
  dmaengine: sh: rcar-dmac: Check for error num after setting mask
  dmaengine: at_xdmac: Fix missing unlock in at_xdmac_tasklet()
  dmaengine: ptdma: Fix the error handling path in pt_core_init()

77478077

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · dacec3e7

Linus Torvalds authored Feb 20, 2022

Pull i2c fixes from Wolfram Sang:
 "Some driver updates, a MAINTAINERS fix, and additions to COMPILE_TEST
  (so we won't miss build problems again)"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  MAINTAINERS: remove duplicate entry for i2c-qcom-geni
  i2c: brcmstb: fix support for DSL and CM variants
  i2c: qup: allow COMPILE_TEST
  i2c: imx: allow COMPILE_TEST
  i2c: cadence: allow COMPILE_TEST
  i2c: qcom-cci: don't put a device tree node before i2c_add_adapter()
  i2c: qcom-cci: don't delete an unregistered adapter
  i2c: bcm2835: Avoid clock stretching timeouts

dacec3e7

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 961af9db

Linus Torvalds authored Feb 20, 2022

Pull input fixes from Dmitry Torokhov:

 - a fix for Synaptics touchpads in RMI4 mode failing to suspend/resume
   properly because I2C client devices are now being suspended and
   resumed asynchronously which changed the ordering

 - a change to make sure we do not set right and middle buttons
   capabilities on touchpads that are "buttonpads" (i.e. do not have
   separate physical buttons)

 - a change to zinitix touchscreen driver adding more compatible
   strings/IDs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: psmouse - set up dependency between PS/2 and SMBus companions
  Input: zinitix - add new compatible strings
  Input: clear BTN_RIGHT/MIDDLE on buttonpads

961af9db

Merge tag 'for-v5.17-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply · 70d2bec7

Linus Torvalds authored Feb 20, 2022

Pull power supply fixes from Sebastian Reichel:
 "Three regression fixes for the 5.17 cycle:

   - build warning fix for power-supply documentation

   - pointer size fix in cw2015 battery driver

   - OOM handling in bq256xx charger driver"

* tag 'for-v5.17-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply:
  power: supply: bq256xx: Handle OOM correctly
  power: supply: core: fix application of sizeof to pointer
  power: supply: fix table problem in sysfs-class-power

70d2bec7

Merge tag 'fs.mount_setattr.v5.17-rc4' of... · 7f25f041

Linus Torvalds authored Feb 20, 2022

Merge tag 'fs.mount_setattr.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr test/doc fixes from Christian Brauner:
 "This contains a fix for one of the selftests for the mount_setattr
  syscall to create idmapped mounts, an entry for idmapped mounts for
  maintainers, and missing kernel documentation for the helper we split
  out some time ago to get and yield write access to a mount when
  changing mount properties"

* tag 'fs.mount_setattr.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: add kernel doc for mnt_{hold,unhold}_writers()
  MAINTAINERS: add entry for idmapped mounts
  tests: fix idmapped mount_setattr test

7f25f041

Merge tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · c1034d24

Linus Torvalds authored Feb 20, 2022

Pull pidfd fix from Christian Brauner:
 "This fixes a problem reported by lockdep when installing a pidfd via
  fd_install() with siglock and the tasklisk write lock held in
  copy_process() when calling clone()/clone3() with CLONE_PIDFD.

  Originally a pidfd was created prior to holding any of these locks but
  this required a call to ksys_close(). So quite some time ago in
  6fd2fe49 ("copy_process(): don't use ksys_close() on cleanups") we
  switched to a get_unused_fd_flags() + fd_install() model.

  As part of that we moved fd_install() as late as possible. This was
  done for two main reasons. First, because we needed to ensure that we
  call fd_install() past the point of no return as once that's called
  the fd is live in the task's file table. Second, because we tried to
  ensure that the fd is visible in /proc/<pid>/fd/<pidfd> right when the
  task is visible.

  This fix moves the fd_install() to an even later point which means
  that a task will be visible in proc while the pidfd isn't yet under
  /proc/<pid>/fd/<pidfd>.

  While this is a user visible change it's very unlikely that this will
  have any impact. Nobody should be relying on that and if they do we
  need to come up with something better but again, it's doubtful this is
  relevant"

* tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  copy_process(): Move fd_install() out of sighand->siglock critical section

c1034d24

Merge branch 'ucount-rlimit-fixes-for-v5.17' of... · 2d3409eb

Linus Torvalds authored Feb 20, 2022

Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ucounts fixes from Eric Biederman:
 "Michal Koutný recently found some bugs in the enforcement of
  RLIMIT_NPROC in the recent ucount rlimit implementation.

  In this set of patches I have developed a very conservative approach
  changing only what is necessary to fix the bugs that I can see
  clearly. Cleanups and anything that is making the code more consistent
  can follow after we have the code working as it has historically.

  The problem is not so much inconsistencies (although those exist) but
  that it is very difficult to figure out what the code should be doing
  in the case of RLIMIT_NPROC.

  All other rlimits are only enforced where the resource is acquired
  (allocated). RLIMIT_NPROC by necessity needs to be enforced in an
  additional location, and our current implementation stumbled it's way
  into that implementation"

* 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ucounts: Handle wrapping in is_ucounts_overlimit
  ucounts: Move RLIMIT_NPROC handling after set_user
  ucounts: Base set_cred_ucounts changes on the real user
  ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1
  rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user

2d3409eb

19 Feb, 2022 4 commits

MAINTAINERS: remove duplicate entry for i2c-qcom-geni · 2428766e

Wolfram Sang authored Feb 18, 2022

The driver is already covered in the ARM/QUALCOMM section. Also, Akash
Asthana's email bounces meanwhile and Mukesh Savaliya has never
responded to mails regarding this driver.
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>

2428766e

sched: Fix yet more sched_fork() races · b1e82065

Peter Zijlstra authored Feb 14, 2022

Where commit 4ef0c5c6 ("kernel/sched: Fix sched_fork() access an
invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
race vs syscalls by not placing the task on the runqueue before it
gets exposed through the pidhash.

Commit 13765de8 ("sched/fair: Fix fault in reweight_entity") is
trying to fix a single instance of this, instead fix the whole class
of issues, effectively reverting this commit.

Fixes: 4ef0c5c6 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net

b1e82065

Merge tag 'nfs-for-5.17-3' of git://git.linux-nfs.org/projects/anna/linux-nfs · 4f12b742

Linus Torvalds authored Feb 18, 2022

Pull NFS client bugfixes from Anna Schumaker:

 - Fix unnecessary changeattr revalidations

 - Fix resolving symlinks during directory lookups

 - Don't report writeback errors in nfs_getattr()

* tag 'nfs-for-5.17-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Do not report writeback errors in nfs_getattr()
  NFS: LOOKUP_DIRECTORY is also ok with symlinks
  NFS: Remove an incorrect revalidation in nfs4_update_changeattr_locked()

4f12b742

Merge tag 'acpi-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 1c2a33d0

Linus Torvalds authored Feb 18, 2022

Pull ACPI fixes from Rafael Wysocki:
 "These make an excess warning message go away and fix a recently
  introduced boot failure on a vintage machine.

  Specifics:

   - Change the log level of the "table not found" message in
     acpi_table_parse_entries_array() to debug to prevent it from
     showing up in the logs unnecessarily (Dan Williams)

   - Add a C-state limit quirk for 32-bit ThinkPad T40 to prevent it
     from crashing on boot after recent changes in the ACPI processor
     driver (Woody Suwalski)"

* tag 'acpi-5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: idle: fix lockup regression on 32-bit ThinkPad T40
  ACPI: tables: Quiet ACPI table not found warning

1c2a33d0