Commits · cd1a677cad994021b19665ed476aea63f5d54f31 · Kirill Smelkov / linux

14 Dec, 2020 40 commits

libceph, ceph: implement msgr2.1 protocol (crc and secure modes) · cd1a677c

Ilya Dryomov authored Nov 19, 2020

Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5.  msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and is therefore
considered deprecated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce connection modes and ms_mode option · 00498b99

Ilya Dryomov authored Nov 19, 2020

msgr2 supports two connection modes: crc (plain) and secure (on-wire
encryption).  Connection mode is picked by server based on input from
client.

Introduce ms_mode option:

  ms_mode=legacy        - msgr1 (default)
  ms_mode=crc           - crc mode, if denied fail
  ms_mode=secure        - secure mode, if denied fail
  ms_mode=prefer-crc    - crc mode, if denied agree to secure mode
  ms_mode=prefer-secure - secure mode, if denied agree to crc mode

ms_mode affects all connections; we don't separate connections to mons
like it's done in userspace with ms_client_mode vs ms_mon_client_mode.

For now the default is legacy, to be flipped to prefer-crc after some
time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
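
The fallback semantics of the five ms_mode values can be modelled in a few lines of C. This is only an illustrative userspace sketch: the enum, `struct mode_pref` and `pick_con_mode()` are hypothetical names, and in the real protocol the server makes the choice during the msgr2 handshake based on what the client offers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; the kernel uses its own constants. */
enum con_mode { MODE_NONE, MODE_CRC, MODE_SECURE };

struct mode_pref {
    const char *name;         /* value given to ms_mode= */
    enum con_mode preferred;  /* MODE_NONE: legacy, no msgr2 mode */
    enum con_mode fallback;   /* MODE_NONE: "if denied fail" */
};

static const struct mode_pref mode_table[] = {
    { "legacy",        MODE_NONE,   MODE_NONE   },
    { "crc",           MODE_CRC,    MODE_NONE   },
    { "secure",        MODE_SECURE, MODE_NONE   },
    { "prefer-crc",    MODE_CRC,    MODE_SECURE },
    { "prefer-secure", MODE_SECURE, MODE_CRC    },
};

static bool mode_allowed(enum con_mode mode, bool allow_crc, bool allow_secure)
{
    return (mode == MODE_CRC && allow_crc) ||
           (mode == MODE_SECURE && allow_secure);
}

/*
 * Model the outcome for a given ms_mode and the modes the server is
 * willing to grant.  MODE_NONE means no msgr2 mode was agreed on
 * (legacy, unknown option, or the request was denied).
 */
static enum con_mode pick_con_mode(const char *ms_mode,
                                   bool allow_crc, bool allow_secure)
{
    size_t i;

    for (i = 0; i < sizeof(mode_table) / sizeof(mode_table[0]); i++) {
        const struct mode_pref *p = &mode_table[i];

        if (strcmp(p->name, ms_mode) != 0)
            continue;
        if (mode_allowed(p->preferred, allow_crc, allow_secure))
            return p->preferred;
        if (mode_allowed(p->fallback, allow_crc, allow_secure))
            return p->fallback;
        return MODE_NONE;
    }
    return MODE_NONE;
}
```

In this model, ms_mode=crc against a secure-only server fails, while ms_mode=prefer-crc against the same server ends up in secure mode.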

libceph, rbd: ignore addr->type while comparing in some cases · 313771e8

Ilya Dryomov authored Nov 25, 2020

For libceph, this ensures that libceph instance sharing (share option)
continues to work.  For rbd, this avoids blocklisting alive lock owners
(locker addr is always LEGACY, while watcher addr is ANY in nautilus).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
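
To illustrate why the address type gets in the way, here is a minimal sketch of a type-aware and a type-agnostic comparison. The struct layout, the enum and both function names are assumptions made for the example, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative type tags; the values are not taken from the commit. */
enum addr_type { ADDR_LEGACY = 1, ADDR_MSGR2 = 2, ADDR_ANY = 3 };

/* Simplified stand-in for an entity address (IPv4 only). */
struct entity_addr_model {
    enum addr_type type;
    uint32_t nonce;
    uint32_t ipv4;
    uint16_t port;
};

/* Compare everything except the type field. */
static bool addr_equal_no_type(const struct entity_addr_model *a,
                               const struct entity_addr_model *b)
{
    return a->nonce == b->nonce && a->ipv4 == b->ipv4 &&
           a->port == b->port;
}

/* Strict comparison: the type has to match as well. */
static bool addr_equal(const struct entity_addr_model *a,
                       const struct entity_addr_model *b)
{
    return a->type == b->type && addr_equal_no_type(a, b);
}
```

With a locker address tagged LEGACY and a watcher address tagged ANY for the same client, the strict comparison reports a mismatch and the lock owner would look dead; the type-agnostic one matches.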

libceph, ceph: get and handle cluster maps with addrvecs · a5cbd5fc

Ilya Dryomov authored Oct 30, 2020

In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.

MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them. Decoding is unconditional, see commit
23c625ce ("libceph: assume argonaut on the server side").

SERVER_MIMIC doesn't bear any meaning for the kernel client.

Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: factor out finish_auth() · 8921f251

Ilya Dryomov authored Oct 14, 2020

In preparation for msgr2, factor out finish_auth() so it is suitable
for both existing MAuth message based authentication and upcoming msgr2
authentication exchange.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: drop ac->ops->name field · c1c0ce78

Ilya Dryomov authored Oct 26, 2020

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: amend cephx init_protocol() and build_request() · 59711f9e

Ilya Dryomov authored Oct 26, 2020

In msgr2, initial authentication happens with an exchange of msgr2
control frames -- MAuth message and struct ceph_mon_request_header
aren't used.  Make that optional.

Stop reporting cephx protocol as "x".  Use "cephx" instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph, ceph: incorporate nautilus cephx changes · 285ea34f

Ilya Dryomov authored Oct 26, 2020

- request service tickets together with auth ticket.  Currently we get
  auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
  tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
  Since nautilus, desired service tickets are shared together with auth
  ticket in CEPHX_GET_AUTH_SESSION_KEY reply.

- propagate session key and connection secret, if any.  In preparation
  for msgr2, update handle_reply() and verify_authorizer_reply() auth
  ops to propagate session key and connection secret.  Since nautilus,
  if secure mode is negotiated, connection secret is shared either in
  CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
  reply (for osds and mdses).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: safer en/decoding of cephx requests and replies · 6610fff2

Ilya Dryomov authored Oct 12, 2020

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: more insight into ticket expiry and invalidation · f79e25b0

Ilya Dryomov authored Nov 27, 2020

Make it clear that "need" is a union of "missing" and "have, but up
for renewal" and dout when the ticket goes missing due to expiry or
invalidation by client.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
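
The relationship between "need", "missing" and "up for renewal" can be written as bitmask arithmetic. The sketch below assumes one bit per service and uses a hypothetical `struct ticket_state`; it is not the kernel's data structure.

```c
#include <stdint.h>

/* Hypothetical per-client view of service tickets, one bit per service. */
struct ticket_state {
    uint32_t want;   /* services we want tickets for */
    uint32_t have;   /* services with a currently valid ticket */
    uint32_t renew;  /* services whose ticket is close to expiry */
};

/* Wanted, but no valid ticket at all. */
static uint32_t tickets_missing(const struct ticket_state *s)
{
    return s->want & ~s->have;
}

/* Wanted and valid, but should be refreshed soon. */
static uint32_t tickets_up_for_renewal(const struct ticket_state *s)
{
    return s->want & s->have & s->renew;
}

/* "need" is the union of the two sets above. */
static uint32_t tickets_needed(const struct ticket_state *s)
{
    return tickets_missing(s) | tickets_up_for_renewal(s);
}
```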

libceph: move msgr1 protocol specific fields to its own struct · a56dd9bf

Ilya Dryomov authored Nov 12, 2020

A couple whitespace fixups, no functional changes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: move msgr1 protocol implementation to its own file · 2f713615

Ilya Dryomov authored Nov 12, 2020

A pure move, no other changes.

Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved.  msgr2 will bring its own, more efficient,
variants based on iov_iter.  Switching msgr1 to them was considered
but decided against to avoid subtle regressions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: separate msgr1 protocol implementation · 566050e1

Ilya Dryomov authored Nov 12, 2020

In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops.  The semantics are exactly the same as they are
now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
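
One way to picture such an internal interface is a table of function pointers that the protocol-independent code calls through. This is a sketch under assumptions: the op names come from the message above, but the argument lists, `struct connection`, the `v1_*` stubs and `con_work()` are invented for illustration, and the kernel may well dispatch differently (for example with direct calls).

```c
#include <stdbool.h>

struct connection;

/* Hypothetical ops table; argument lists are simplified guesses. */
struct protocol_ops {
    int  (*try_read)(struct connection *con);
    int  (*try_write)(struct connection *con);
    void (*revoke)(struct connection *con);
    void (*revoke_incoming)(struct connection *con);
    bool (*opened)(struct connection *con);
    void (*reset_session)(struct connection *con);
    void (*reset_protocol)(struct connection *con);
};

struct connection {
    const struct protocol_ops *ops;
    bool open;
    int reads;    /* counters so the stubs have visible effects */
    int writes;
    int pending;  /* stand-in for partially sent or received state */
};

/* Stub "msgr1" implementation of each op. */
static int v1_try_read(struct connection *con) { con->reads++; return 0; }
static int v1_try_write(struct connection *con) { con->writes++; return 0; }
static void v1_revoke(struct connection *con) { con->pending = 0; }
static void v1_revoke_incoming(struct connection *con) { (void)con; }
static bool v1_opened(struct connection *con) { return con->open; }
static void v1_reset_session(struct connection *con)
{
    con->reads = con->writes = 0;
}
static void v1_reset_protocol(struct connection *con) { con->pending = 0; }

static const struct protocol_ops msgr1_ops = {
    .try_read = v1_try_read,
    .try_write = v1_try_write,
    .revoke = v1_revoke,
    .revoke_incoming = v1_revoke_incoming,
    .opened = v1_opened,
    .reset_session = v1_reset_session,
    .reset_protocol = v1_reset_protocol,
};

/* Protocol-independent caller: it only goes through the ops table. */
static int con_work(struct connection *con)
{
    int ret;

    if (!con->ops->opened(con))
        return -1;
    ret = con->ops->try_read(con);
    if (ret < 0)
        return ret;
    return con->ops->try_write(con);
}
```

A second protocol would supply its own table while `con_work()` stays unchanged.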

libceph: export remaining protocol independent infrastructure · 6503e0b6

Ilya Dryomov authored Nov 09, 2020

In preparation for msgr2, make all protocol independent functions
in messenger.c global.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: export zero_page · 699921d9

Ilya Dryomov authored Nov 09, 2020

In preparation for msgr2, make zero_page global.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: rename and export con->flags bits · 3fefd43e

Ilya Dryomov authored Nov 09, 2020

In preparation for msgr2, move the defines to the header file.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: rename and export con->state states · 6d7f62bf

Ilya Dryomov authored Nov 09, 2020

In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.

Also drop state transition comments.  They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make con->state an int · 30be780a

Ilya Dryomov authored Nov 09, 2020

unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc. Save a bit of memory.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: don't export ceph_messenger_{init,fini}() to modules · 2f687380

Ilya Dryomov authored Nov 05, 2020

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make sure our addr->port is zero and addr->nonce is non-zero · fd1a154c

Ilya Dryomov authored Nov 05, 2020

Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port.  However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved.  Zero it.

Conversely, make sure our addr->nonce is non-zero.  A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
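
A minimal userspace sketch of the two invariants described above. `struct client_addr` and `sanitize_client_addr()` are hypothetical, and `rand()` merely stands in for whatever random source the kernel uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down view of the client's own address. */
struct client_addr {
    uint16_t port;
    uint32_t nonce;
};

/*
 * Enforce both invariants: a client never listens, so its port is
 * zero, and its nonce must never be zero because a zero nonce plus
 * a zero port means "the entire ip" when blocklisting.
 */
static void sanitize_client_addr(struct client_addr *addr)
{
    addr->port = 0;  /* drop any port supplied by the user */

    do {
        addr->nonce = (uint32_t)rand();  /* retry on the rare zero */
    } while (addr->nonce == 0);
}
```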

libceph: factor out ceph_con_get_out_msg() · 771294fe

Ilya Dryomov authored Nov 18, 2020

Move the logic of grabbing the next message from the queue into its own
function. Like ceph_con_in_msg_alloc(), this is protocol independent.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: change ceph_con_in_msg_alloc() to take hdr · fc4c128e

Ilya Dryomov authored Nov 16, 2020

ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific.  While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: change ceph_msg_data_cursor_init() to take cursor · 8ee8abf7

Ilya Dryomov authored Nov 04, 2020

Make it possible to have local cursors and embed them outside struct
ceph_msg.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: handle discarding acked and requeued messages separately · 02471928

Ilya Dryomov authored Oct 13, 2020

Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: drop msg->ack_stamp field · 5cd8da3a

Ilya Dryomov authored Oct 13, 2020

It is set in process_ack() but never used.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: remove redundant session reset log message · d3c1248c

Ilya Dryomov authored Nov 11, 2020

Stick with pr_info message because session reset isn't an error most of
the time.  When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: clear con->peer_global_seq on RESETSESSION · a3da057b

Ilya Dryomov authored Nov 11, 2020

con->peer_global_seq is part of session state.  Clear it when
the server tells us to reset, not just in ceph_con_close().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: rename reset_connection() to ceph_con_reset_session() · 5963c3d0

Ilya Dryomov authored Nov 06, 2020

With just session reset bits left, rename appropriately.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: split protocol reset bits out of reset_connection() · 3596f4c1

Ilya Dryomov authored Nov 06, 2020

Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.

Note that con->out_skip is now reset on faults.  This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: don't call reset_connection() on version/feature mismatches · 90b6561a

Ilya Dryomov authored Nov 06, 2020

A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.

This changed a long time ago, see commits 6384bb8b ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc6 ("libceph: fix protocol
feature mismatch failure path").  Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced).  reset_connection() calls
there are rather confusing because reset_connection() resets a session
together with an individual instance of the protocol.  This is cleaned up
in the next patch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: lower exponential backoff delay · 418af5b3

Ilya Dryomov authored Oct 29, 2020

The current setting allows the backoff to climb up to 5 minutes.  This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.

In userspace, ms_max_backoff is defaulted to 15 seconds.  Let's do the
same.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
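
The effect of the lower cap is easy to see in a small model of doubling backoff. The 15 second ceiling comes from the message above; the 250 ms starting delay and the function name are assumptions made for the example.

```c
#define BASE_DELAY_MS 250u    /* assumed starting delay, illustrative */
#define MAX_DELAY_MS  15000u  /* 15 seconds, matching ms_max_backoff */

/* Return the delay to use after another failed connection attempt. */
static unsigned int next_backoff_ms(unsigned int delay_ms)
{
    unsigned int doubled;

    if (delay_ms == 0)
        return BASE_DELAY_MS;  /* first failure */

    doubled = delay_ms * 2;
    return doubled > MAX_DELAY_MS ? MAX_DELAY_MS : doubled;
}
```

Starting from zero, the delay in this model goes 250, 500, 1000, 2000, 4000, 8000 ms and then stays at 15000 ms, instead of continuing to climb toward 5 minutes.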

libceph: include middle_len in process_message() dout · b77f8f0e

Ilya Dryomov authored Nov 05, 2020

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: implement updated ceph_mds_request_head structure · 4f1ddb1e

Jeff Layton authored Dec 09, 2020

When we added the btime feature in mainline ceph, we had to extend
struct ceph_mds_request_args so that it could be set. Implement the same
in the kernel client.

Rename ceph_mds_request_head with a _old extension, and add a union
ceph_mds_request_args_ext to allow for the extended size of the new
header format.

Add the appropriate code to handle both formats in
create_request_message() and key the behavior on whether the peer
supports CEPH_FEATURE_FS_BTIME.

The gid_list field in the payload is now populated from the saved
credential. For now, we don't add any support for setting the btime via
setattr, but this does enable us to add that in the future.

[ idryomov: break unnecessarily long lines ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: clean up argument lists to __prepare_send_request and __send_request · 396bd62c

Jeff Layton authored Dec 09, 2020

We can always get the mdsc from the session, so there's no need to pass
it in as a separate argument. Pass the session to __prepare_send_request
as well, to prepare for later patches that will need to access it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: take a cred reference instead of tracking individual uid/gid · 7fe0cdeb

Jeff Layton authored Dec 08, 2020

Replace req->r_uid/r_gid with an r_cred pointer and take a reference to
that at the point where we previously would sample the two.  Use that to
populate the uid and gid in the header and release the reference when
the request is freed.

This should enable us to later add support for sending supplementary
group lists in MDS requests.

[ idryomov: break unnecessarily long lines ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
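
The ownership pattern described here (take a reference when the request is set up, read uid/gid through it, drop it when the request is freed) can be sketched in userspace. Everything below is a model: `struct cred_model` and the helper functions are hypothetical stand-ins, not the kernel's cred API or the ceph request structures.

```c
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for a reference-counted credential object. */
struct cred_model {
    int refcount;
    uint32_t uid;
    uint32_t gid;
};

static struct cred_model *cred_model_get(struct cred_model *c)
{
    c->refcount++;
    return c;
}

static void cred_model_put(struct cred_model *c)
{
    if (--c->refcount == 0)
        free(c);
}

/* Hypothetical request: one pointer replaces separate uid/gid fields. */
struct request_model {
    struct cred_model *r_cred;
};

struct header_model {
    uint32_t caller_uid;
    uint32_t caller_gid;
};

/* Take the reference where uid/gid used to be sampled. */
static void request_init(struct request_model *req,
                         struct cred_model *current_cred)
{
    req->r_cred = cred_model_get(current_cred);
}

/* Populate the header from the saved credential. */
static void request_fill_header(const struct request_model *req,
                                struct header_model *h)
{
    h->caller_uid = req->r_cred->uid;
    h->caller_gid = req->r_cred->gid;
}

/* Release the reference when the request goes away. */
static void request_free(struct request_model *req)
{
    cred_model_put(req->r_cred);
    req->r_cred = NULL;
}
```

Holding the whole credential rather than two integers is what makes it possible to reach other fields, such as supplementary groups, later on.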

ceph: don't reach into request header for readdir info · 0f51a983

Jeff Layton authored Dec 09, 2020

We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: set osdmap epoch for setxattr · 968cd14e

Xiubo Li authored Dec 09, 2020

Setting the file/dir layout may require data pool info, so the mds
server needs to check the osdmap. At present, if the mds doesn't find
the specified data pool, it tries to get the latest osdmap. If we pass
the osdmap epoch with setxattr, the mds server can check just that
epoch of the osdmap.

URL: https://tracker.ceph.com/issues/48504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: remove redundant assignment to variable i · 4a756db2

Colin Ian King authored Dec 04, 2020

The variable i is being initialized with a value that is never read
and it is being updated later with a new value in a for-loop.  The
initialization is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: add ceph.caps vxattr · dd980fc0

Luis Henriques authored Nov 23, 2020

Add a new vxattr that allows userspace to list the caps for a specific
directory or file.

[ jlayton: change format delimiter to '/' ]
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: when filling trace, call ceph_get_inode outside of mutexes · bca9fc14

Jeff Layton authored Nov 12, 2020

Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
   while the read is in flight

2) the same inode is evicted from the cache, and this task gets stuck
   waiting for the page lock because of the above readahead

3) another task is processing a reply trace, and looks up the inode
   being evicted while holding the s_mutex. That ends up waiting for the
   eviction to complete

4) a write reply for an unrelated inode is then processed in the
   ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
   caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to "1" is stuck behind the write reply in "4", so we deadlock
at that point.

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.

[ idryomov: break unnecessarily long lines ]

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
