Commits · fd1a154cad6c6a16960fa9c2c9c6427da129e461 · Kirill Smelkov / linux

14 Dec, 2020 38 commits

libceph: make sure our addr->port is zero and addr->nonce is non-zero · fd1a154c

Ilya Dryomov authored Nov 05, 2020

Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port.  However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved.  Zero it.

Conversely, make sure our addr->nonce is non-zero.  A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

fd1a154c
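
A minimal userspace sketch of the two rules described in the commit above, not the kernel code itself. The struct and function names are hypothetical, and `rand()` is only a stand-in for whatever random source the messenger actually uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the client's own address. */
struct client_addr {
	uint16_t port;
	uint32_t nonce;
};

/*
 * Zero the port (a client never listens on one) and pick a non-zero
 * nonce, because a zero nonce combined with a zero port is reserved
 * for blocklisting the entire ip.
 */
void sanitize_client_addr(struct client_addr *addr)
{
	addr->port = 0;
	do {
		addr->nonce = (uint32_t)rand();
	} while (addr->nonce == 0);
}
```

Whatever port the user supplied via the ip option is dropped, and the loop retries until the nonce is non-zero.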

libceph: factor out ceph_con_get_out_msg() · 771294fe

Ilya Dryomov authored Nov 18, 2020

Move the logic of grabbing the next message from the queue into its own
function. Like ceph_con_in_msg_alloc(), this is protocol independent.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

771294fe

libceph: change ceph_con_in_msg_alloc() to take hdr · fc4c128e

Ilya Dryomov authored Nov 16, 2020

ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific.  While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

fc4c128e

libceph: change ceph_msg_data_cursor_init() to take cursor · 8ee8abf7

Ilya Dryomov authored Nov 04, 2020

Make it possible to have local cursors and embed them outside struct
ceph_msg.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

8ee8abf7

libceph: handle discarding acked and requeued messages separately · 02471928

Ilya Dryomov authored Oct 13, 2020

Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

02471928

libceph: drop msg->ack_stamp field · 5cd8da3a

Ilya Dryomov authored Oct 13, 2020

It is set in process_ack() but never used.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

5cd8da3a

libceph: remove redundant session reset log message · d3c1248c

Ilya Dryomov authored Nov 11, 2020

Stick with pr_info message because session reset isn't an error most of
the time.  When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d3c1248c

libceph: clear con->peer_global_seq on RESETSESSION · a3da057b

Ilya Dryomov authored Nov 11, 2020

con->peer_global_seq is part of session state.  Clear it when
the server tells us to reset, not just in ceph_con_close().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

a3da057b

libceph: rename reset_connection() to ceph_con_reset_session() · 5963c3d0

Ilya Dryomov authored Nov 06, 2020

With just session reset bits left, rename appropriately.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

5963c3d0

libceph: split protocol reset bits out of reset_connection() · 3596f4c1

Ilya Dryomov authored Nov 06, 2020

Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.

Note that con->out_skip is now reset on faults.  This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

3596f4c1

libceph: don't call reset_connection() on version/feature mismatches · 90b6561a

Ilya Dryomov authored Nov 06, 2020

A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.

This changed a long time ago, see commits 6384bb8b ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc6 ("libceph: fix protocol
feature mismatch failure path").  Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced).  reset_connection() calls
there are rather confusing because reset_connection() resets a session
together with an individual instance of the protocol.  This is cleaned up
in the next patch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

90b6561a

libceph: lower exponential backoff delay · 418af5b3

Ilya Dryomov authored Oct 29, 2020

The current setting allows the backoff to climb up to 5 minutes.  This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.

In userspace, ms_max_backoff is defaulted to 15 seconds.  Let's do the
same.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

418af5b3
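
A rough sketch of a capped exponential backoff like the one the commit above tunes. The 15 second cap comes from the commit message; the 250 ms base delay and the function name are assumptions made for illustration only.

```c
/* Assumed base delay; the commit message only states the 15 second cap. */
#define BASE_DELAY_MS 250UL
#define MAX_DELAY_MS  15000UL

/* Return the delay to use for the next reconnect attempt. */
unsigned long next_backoff_ms(unsigned long cur_ms)
{
	unsigned long next;

	if (cur_ms == 0)
		return BASE_DELAY_MS;	/* first fault: start at the base */

	next = cur_ms * 2;		/* double on each further fault */
	return next > MAX_DELAY_MS ? MAX_DELAY_MS : next;
}
```

Under these assumptions the delays run 250, 500, 1000, 2000, 4000, 8000, 15000 ms and then stay at the cap.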

libceph: include middle_len in process_message() dout · b77f8f0e

Ilya Dryomov authored Nov 05, 2020

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b77f8f0e

ceph: implement updated ceph_mds_request_head structure · 4f1ddb1e

Jeff Layton authored Dec 09, 2020

When we added the btime feature in mainline ceph, we had to extend
struct ceph_mds_request_args so that it could be set. Implement the same
in the kernel client.

Rename ceph_mds_request_head with a _old extension, and add a union
ceph_mds_request_args_ext to allow for the extended size of the new
header format.

Add the appropriate code to handle both formats in struct
create_request_message and key the behavior on whether the peer supports
CEPH_FEATURE_FS_BTIME.

The gid_list field in the payload is now populated from the saved
credential. For now, we don't add any support for setting the btime via
setattr, but this does enable us to add that in the future.

[ idryomov: break unnecessarily long lines ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4f1ddb1e

ceph: clean up argument lists to __prepare_send_request and __send_request · 396bd62c

Jeff Layton authored Dec 09, 2020

We can always get the mdsc from the session, so there's no need to pass
it in as a separate argument. Pass the session to __prepare_send_request
as well, to prepare for later patches that will need to access it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

396bd62c

ceph: take a cred reference instead of tracking individual uid/gid · 7fe0cdeb

Jeff Layton authored Dec 08, 2020

Replace req->r_uid/r_gid with an r_cred pointer and take a reference to
that at the point where we previously would sample the two.  Use that to
populate the uid and gid in the header and release the reference when
the request is freed.

This should enable us to later add support for sending supplementary
group lists in MDS requests.

[ idryomov: break unnecessarily long lines ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

7fe0cdeb

ceph: don't reach into request header for readdir info · 0f51a983

Jeff Layton authored Dec 09, 2020

We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

0f51a983

ceph: set osdmap epoch for setxattr · 968cd14e

Xiubo Li authored Dec 09, 2020

Setting the file/dir layout may require data pool info, so the MDS
needs to check the osdmap. At present, if the MDS doesn't find the
specified data pool, it tries to get the latest osdmap. If we pass
the osdmap epoch with setxattr, the MDS only has to check that
epoch of the osdmap.

URL: https://tracker.ceph.com/issues/48504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

968cd14e

ceph: remove redundant assignment to variable i · 4a756db2

Colin Ian King authored Dec 04, 2020

The variable i is being initialized with a value that is never read
and it is being updated later with a new value in a for-loop.  The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4a756db2

ceph: add ceph.caps vxattr · dd980fc0

Luis Henriques authored Nov 23, 2020

Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

[ jlayton: change format delimiter to '/' ]
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

dd980fc0

ceph: when filling trace, call ceph_get_inode outside of mutexes · bca9fc14

Jeff Layton authored Nov 12, 2020

Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
   while the read is in flight

2) the same inode is evicted from the cache, and this task gets stuck
   waiting for the page lock because of the above readahead

3) another task is processing a reply trace, and looks up the inode
   being evicted while holding the s_mutex. That ends up waiting for the
   eviction to complete

4) a write reply for an unrelated inode is then processed in the
   ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
   caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to "1" is stuck behind the write reply in "4", so we deadlock
at that point.

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.

[ idryomov: break unnecessarily long lines ]

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

bca9fc14

Revert "ceph: allow rename operation under different quota realms" · 6646ea1c

Luis Henriques authored Nov 12, 2020

This reverts commit dffdcd71.

When doing a rename across quota realms, there's a corner case that isn't
handled correctly.  Here's a testcase:

  mkdir files limit
  truncate files/file -s 10G
  setfattr limit -n ceph.quota.max_bytes -v 1000000
  mv files limit/

The above will succeed because ftruncate(2) won't immediately notify the
MDSs with the new file size, and thus the quota realms stats won't be
updated.

Since the possible fixes for this issue would have a huge performance impact,
the solution for now is to simply revert to returning -EXDEV when doing a cross
quota realms rename.

URL: https://tracker.ceph.com/issues/48203
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

6646ea1c
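
A simplified sketch of the behaviour the revert above restores: a rename whose source and destination fall under different quota realms is refused with -EXDEV. The struct and function names are hypothetical, and realm identity is modelled here as plain pointer equality.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a quota realm; NULL means "no quota realm". */
struct quota_realm {
	unsigned long ino;
};

/* Return 0 if the rename stays within one realm, -EXDEV otherwise. */
int check_cross_realm_rename(const struct quota_realm *old_realm,
			     const struct quota_realm *new_realm)
{
	if (old_realm != new_realm)
		return -EXDEV;
	return 0;
}
```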

ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails · 68cbb805

Jeff Layton authored Nov 12, 2020

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

68cbb805

ceph: downgrade warning from mdsmap decode to debug · ccd1acdf

Luis Henriques authored Nov 12, 2020

While the MDS cluster is unstable and changing state the client may get
mdsmap updates that will trigger warnings:

  [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay)
  [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay)

This patch downgrades these warnings to debug, as they may flood the logs
if the cluster is unstable for a while.
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ccd1acdf

ceph: fix race in concurrent __ceph_remove_cap invocations · e5cafce3

Luis Henriques authored Nov 12, 2020

A NULL pointer dereference may occur in __ceph_remove_cap with some of the
callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and
remove_session_caps_cb. Those callers hold the session->s_mutex, so they
are prevented from concurrent execution, but ceph_evict_inode does not.

Since the callers of this function hold the i_ceph_lock, the fix is simply
a matter of returning immediately if caps->ci is NULL.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/43272
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

e5cafce3
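
A toy model of the guard described in the commit above, assuming (as the message states) that every caller already holds the lock protecting the cap. The types and the function name are hypothetical and the locking itself is not modelled.

```c
#include <stddef.h>

/* Hypothetical, stripped-down inode and cap structures. */
struct inode_info {
	int ncaps;
};

struct cap {
	struct inode_info *ci;	/* NULL once the cap has been removed */
};

/*
 * Remove a cap from its inode. Returns 1 if this call removed it and
 * 0 if another path got there first, in which case nothing is touched.
 */
int remove_cap(struct cap *cap)
{
	struct inode_info *ci = cap->ci;

	if (!ci)
		return 0;	/* already removed: bail out early */

	ci->ncaps--;
	cap->ci = NULL;
	return 1;
}
```

A second call on the same cap returns without dereferencing the NULL pointer, which is the case the fix covers.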

ceph: pass down the flags to grab_cache_page_write_begin · 4a357f50

Jeff Layton authored Nov 10, 2020

write_begin operations are passed a flags parameter that we need to
mirror here, so that we don't (e.g.) recurse back into filesystem code
inappropriately.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4a357f50

ceph: add ceph.{cluster_fsid/client_id} vxattrs · 5a9e2f5d

Xiubo Li authored Nov 11, 2020

These two vxattrs exist only on the local client side. With them we
can easily tell which mountpoint a file belongs to, and they also
help locate the debugfs path quickly.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

5a9e2f5d

ceph: add status debugfs file · 247b1f19

Xiubo Li authored Nov 11, 2020

This will help list some useful client side info, like the client
entity address/name and blocklisted status, etc.

URL: https://tracker.ceph.com/issues/48057
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

247b1f19

libceph: remove unused port macros · 36c9478d

Liu, Changcheng authored Nov 10, 2020

1. monitor's default port is defined by CEPH_MON_PORT
2. CEPH_PORT_START and CEPH_PORT_LAST are not needed.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

36c9478d

ceph: ensure we have Fs caps when fetching dir link count · 04fabb11

Jeff Layton authored Nov 09, 2020

The link count for a directory is defined as inode->i_subdirs + 2
(for "." and ".."). i_subdirs is only populated when Fs caps are held.
Ensure we grab Fs caps when fetching the link count for a directory.

[ idryomov: break unnecessarily long line ]

URL: https://tracker.ceph.com/issues/48125
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

04fabb11

ceph: send dentry lease metrics to MDS daemon · 8ba3b8c7

Xiubo Li authored Nov 05, 2020

An older ceph version that receives a metric message containing the
dentry lease metric info will just ignore it.

URL: https://tracker.ceph.com/issues/43423
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

8ba3b8c7

ceph: acquire Fs caps when getting dir stats · 81048c00

Jeff Layton authored Nov 03, 2020

We only update the inode's dirstats when we have Fs caps from the MDS.

Declare a new VXATTR_FLAG_DIRSTAT that we set on all dirstats, and have
the vxattr handling code acquire those caps when it's set.

URL: https://tracker.ceph.com/issues/48104
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

81048c00

ceph: fix up some warnings on W=1 builds · 06a1ad43

Jeff Layton authored Sep 29, 2020

Convert some decodes into unused variables into skips, and fix up some
non-kerneldoc comment headers to not start with "/**".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

06a1ad43

ceph: queue MDS requests to REJECTED sessions when CLEANRECOVER is set · 4ae3713f

Jeff Layton authored Sep 21, 2020

Ilya noticed that the first access to a blacklisted mount would often
get back -EACCES, but then subsequent calls would be OK. The problem is
in __do_request. If the session is marked as REJECTED, a hard error is
returned instead of waiting for a new session to come into being.

When the session is REJECTED and the mount was done with
recover_session=clean, queue the request to the waiting_for_map queue,
which will be awoken after tearing down the old session. We can only
do this for sync requests though, so check for async ones first and
just let the callers redrive a sync request.

URL: https://tracker.ceph.com/issues/47385
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4ae3713f
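
The decision described in the commit above can be sketched as a small pure function. The enum and function names are hypothetical. The only facts taken from the message are the REJECTED check, the recover_session=clean condition, the sync/async split, the waiting_for_map queue and the -EACCES hard error.

```c
#include <stdbool.h>

enum request_action {
	REQ_SEND,		/* session is usable: send normally */
	REQ_FAIL_EACCES,	/* hard error, as before the change */
	REQ_QUEUE_FOR_MAP,	/* park on the waiting_for_map queue */
	REQ_CALLER_RETRIES,	/* async: caller redrives it as sync */
};

/* Decide what to do with a request given the session and mount state. */
enum request_action rejected_session_action(bool session_rejected,
					    bool clean_recover,
					    bool is_async)
{
	if (!session_rejected)
		return REQ_SEND;
	if (!clean_recover)
		return REQ_FAIL_EACCES;
	if (is_async)
		return REQ_CALLER_RETRIES;	/* async requests cannot be queued */
	return REQ_QUEUE_FOR_MAP;
}
```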

ceph: remove timeout on allowing reconnect after blocklisting · dbeec07b

Jeff Layton authored Sep 25, 2020

30 minutes is a long time to wait, and this makes it difficult to test
the feature by manually blocklisting clients. Remove the timeout
infrastructure and just allow the client to reconnect at will.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

dbeec07b

ceph: add new RECOVER mount_state when recovering session · 50c9132d

Jeff Layton authored Sep 25, 2020

When recovering a session (à la recover_session=clean), we want to do
all of the operations that we do on a forced umount, but changing the
mount state to SHUTDOWN can cause queued MDS requests to fail when
the session comes back. Most of those can idle until the session is
recovered in this situation.

Reserve SHUTDOWN state for forced umount, and make a new RECOVER state
for the forced reconnect situation. Change several tests for equality with
SHUTDOWN to test for that or RECOVER.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

50c9132d

ceph: make fsc->mount_state an int · aa5c7910

Jeff Layton authored Oct 06, 2020

This field is an unsigned long currently, which is a bit of a waste on
most arches since this just holds an enum. Make it (signed) int instead.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

aa5c7910

ceph: don't WARN when removing caps due to blocklisting · dc167e38

Jeff Layton authored Sep 25, 2020

We expect to remove dirty caps when the client is blocklisted. Don't
throw a warning in that case.

[ idryomov: break unnecessarily long line ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

dc167e38

13 Dec, 2020 2 commits

Linux 5.10 · 2c85ebc5

Linus Torvalds authored Dec 13, 2020

2c85ebc5

Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ec6f5e0e

Linus Torvalds authored Dec 13, 2020

Pull x86 fixes from Thomas Gleixner:
 "A set of x86 and membarrier fixes:

   - Correct a few problems in the x86 and the generic membarrier
     implementation. Small corrections for assumptions about visibility
     which have turned out not to be true.

   - Make the PAT bits for memory encryption correct vs 4K and 2M/1G
     page table entries as they are at a different location.

   - Fix a concurrency issue in the local bandwidth readout of
     resource control leading to incorrect values

   - Fix the ordering of allocating a vector for an interrupt. The order
     failed to respect the provided cpumask when the first attempt of
     allocating node local in the mask fails. It then tries the node
     instead of trying the full provided mask first. This leads to
     erroneous error messages and breaking the (user) supplied affinity
     request. Reorder it.

   - Make the INT3 padding detection in optprobe work correctly"

* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Fix optprobe to detect INT3 padding correctly
  x86/apic/vector: Fix ordering in vector assignment
  x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
  x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
  membarrier: Execute SYNC_CORE on the calling thread
  membarrier: Explicitly sync remote cores when SYNC_CORE is requested
  membarrier: Add an actual barrier before rseq_preempt()
  x86/membarrier: Get rid of a dubious optimization

ec6f5e0e