Commits · 069f3222ca96acfe8c59937e98c401bda5475b48 · Kirill Smelkov / linux

07 Jul, 2017 39 commits

crush: implement weight and id overrides for straw2 · 069f3222

Ilya Dryomov authored Jun 22, 2017

bucket_straw2_choose needs to use weights that may be different from
item_weights, for instance to compensate for an uneven distribution
caused by a low number of values, or to fix the probability bias
introduced by conditional probabilities (see
http://tracker.ceph.com/issues/15653 for more information).

We introduce a weight_set for each straw2 bucket to set the desired
weight for a given item at a given position. The weight of a given item
when picking the first replica (first position) may be different from
its weight when picking the second replica (second position). For
instance, the weight matrix for a given bucket containing items 3, 7 and
13 could be as follows:

          position 0   position 1

item 3     0x10000      0x100000
item 7     0x40000       0x10000
item 13    0x40000       0x10000

When crush_do_rule picks the first of two replicas (position 0), items 7
and 13 are each four times more likely to be chosen by
bucket_straw2_choose than item 3. When choosing the second replica
(position 1), item 3 is sixteen times more likely to be chosen than
item 7 or item 13.

By default the weight_set of each bucket exactly matches the content of
item_weights for each position to ensure backward compatibility.

bucket_straw2_choose compares items by using their id. The same ids are
also used to index buckets and they must be unique. For each item in a
bucket an array of ids can be provided for placement purposes and they
are used instead of the ids. If no replacement ids are provided, the
legacy behavior is preserved.

Reflects ceph.git commit 19537a450fd5c5a0bb8b7830947507a76db2ceca.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
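
As a rough illustration of the two overrides described in this commit, the following Python sketch picks one item from a straw2 bucket. It is not the kernel code: `_hash16` is a SHA-256 stand-in for CRUSH's own hash, `math.log` replaces the fixed-point logarithm, and the names `straw2_choose`, `weight_set` and `ids` as function parameters, as well as the clamping of `position` to the last weight-set row, are assumptions made for the example.

```python
import hashlib
import math


def _hash16(x, item_id, r):
    """Deterministic 16-bit hash (stand-in for CRUSH's own hash)."""
    digest = hashlib.sha256(f"{x}:{item_id}:{r}".encode()).digest()
    return int.from_bytes(digest[:2], "little")


def straw2_choose(items, item_weights, x, r, position=0,
                  weight_set=None, ids=None):
    """Pick one item; weights are 16.16 fixed point (0x10000 == 1.0)."""
    if weight_set is None:
        weights = item_weights  # legacy behaviour: no per-position weights
    else:
        # Assumption: positions past the last row reuse the last row.
        weights = weight_set[min(position, len(weight_set) - 1)]
    hash_ids = items if ids is None else ids  # legacy: hash the real ids

    best, best_draw = items[0], -math.inf
    for item, hash_id, weight in zip(items, hash_ids, weights):
        if weight == 0:
            continue  # a zero-weight item never wins a draw
        u = _hash16(x, hash_id, r)
        # ln of a uniform value in (0, 1], divided by the weight: the
        # largest (least negative) draw wins.
        draw = math.log((u + 1) / 65536.0) / weight
        if draw > best_draw:
            best, best_draw = item, draw
    return best


items = [3, 7, 13]
item_weights = [0x10000, 0x10000, 0x10000]
# One row per position, one column per item (transposed from the table).
weight_set = [
    [0x10000, 0x40000, 0x40000],   # position 0
    [0x100000, 0x10000, 0x10000],  # position 1
]
```

Calling `straw2_choose(items, item_weights, x, 0, position=1, weight_set=weight_set)` over many inputs `x` should return item 3 in the large majority of cases, while omitting `weight_set` and `ids` falls back to the plain `item_weights` and the real ids.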

libceph: apply_upmap() · 1c2e7b45

Ilya Dryomov authored Jun 21, 2017

Previously, pg_to_raw_osds() didn't filter for existent OSDs because
raw_to_up_osds() would filter for "up" ("up" is predicated on "exists")
and raw_to_up_osds() was called directly after pg_to_raw_osds().  Now,
with the apply_upmap() call in there, nonexistent OSDs in pg_to_raw_osds()
output can affect apply_upmap().  Introduce remove_nonexistent_osds()
to deal with that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
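
A minimal sketch of such a filtering step, written in Python rather than kernel C. It assumes that pools whose positions carry no meaning compact the list, while position-sensitive pools keep each slot and mark it empty; the `exists` callback and the `can_shift` flag are placeholders for the corresponding osdmap lookups.

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # marker for "no OSD in this slot"


def remove_nonexistent_osds(raw, exists, can_shift):
    """Drop OSDs that are not in the map from a raw PG mapping.

    raw       -- list of OSD ids produced by the CRUSH step
    exists    -- callable: osd id -> bool (placeholder for the osdmap check)
    can_shift -- True for pools where positions carry no meaning
    """
    if can_shift:
        # Close the gaps left by nonexistent OSDs.
        return [osd for osd in raw if exists(osd)]
    # Keep every slot in place and mark the missing ones as empty.
    return [osd if exists(osd) else CRUSH_ITEM_NONE for osd in raw]
```

Running the filter before the upmap step means later stages only ever see OSD ids that are present in the map, or an explicit empty marker.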

libceph: compute actual pgid in ceph_pg_to_up_acting_osds() · 463bb8da

Ilya Dryomov authored Jun 21, 2017

Move raw_pg_to_pg() call out of get_temp_osds() and into
ceph_pg_to_up_acting_osds(), for upcoming apply_upmap().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: pg_upmap[_items] infrastructure · 6f428df4

Ilya Dryomov authored Jun 21, 2017

pg_temp and pg_upmap encodings are the same (PG -> array of osds),
except for the incremental remove: it's an empty mapping in new_pg_temp
for pg_temp and a separate old_pg_upmap set for pg_upmap.  (This isn't
to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't
looked at as an example for pg_upmap encoding.)

Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap.
__decode_pg_temp() stores into pg_temp union member, but since pg_upmap
union member is identical, reading through pg_upmap later is OK.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
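
The difference in how removals are expressed can be sketched as follows. This is an illustrative Python model, not the libceph decoder: the mappings are plain dicts keyed by a hypothetical `(pool, seed)` tuple, and the function names are made up for the example.

```python
def apply_pg_temp_incremental(pg_temp, new_pg_temp):
    """pg_temp: an empty mapping in new_pg_temp means "remove this PG"."""
    for pgid, osds in new_pg_temp.items():
        if osds:
            pg_temp[pgid] = list(osds)
        else:
            pg_temp.pop(pgid, None)


def apply_pg_upmap_incremental(pg_upmap, new_pg_upmap, old_pg_upmap):
    """pg_upmap: removals arrive in a separate old_pg_upmap set."""
    for pgid, osds in new_pg_upmap.items():
        pg_upmap[pgid] = list(osds)
    for pgid in old_pg_upmap:
        pg_upmap.pop(pgid, None)
```

Because both kinds of entry are a PG mapped to an array of OSDs, the same decoding routine can serve both; only the removal path differs.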

libceph: ceph_decode_skip_* helpers · 278b1d70

Ilya Dryomov authored Jun 21, 2017

Some of these won't be as efficient as they could be (e.g.
ceph_decode_skip_set(... 32 ...) could advance by len * sizeof(u32)
once instead of advancing by sizeof(u32) len times), but that's fine
and not worth a bunch of extra macro code.

Replace skip_name_map() with ceph_decode_skip_map as an example.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
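
The trade-off mentioned in this commit can be illustrated with a small Python sketch. The helper names below are hypothetical and only model the idea of skipping a length-prefixed set of 32-bit values; they are not the kernel macros.

```python
import struct

U32 = struct.Struct("<I")  # little-endian 32-bit unsigned integer


def _need(buf, pos, nbytes):
    """Bounds check: fail if fewer than nbytes remain at pos."""
    if pos + nbytes > len(buf):
        raise ValueError("truncated buffer")


def skip_u32(buf, pos):
    _need(buf, pos, U32.size)
    return pos + U32.size


def skip_set_u32(buf, pos):
    """Skip a length-prefixed set of u32, one element at a time."""
    _need(buf, pos, U32.size)
    (count,) = U32.unpack_from(buf, pos)
    pos += U32.size
    for _ in range(count):
        pos = skip_u32(buf, pos)  # simple, but one bounds check per element
    return pos


def skip_set_u32_fast(buf, pos):
    """Same result with a single advance of count * sizeof(u32)."""
    _need(buf, pos, U32.size)
    (count,) = U32.unpack_from(buf, pos)
    _need(buf, pos + U32.size, count * U32.size)
    return pos + U32.size + count * U32.size
```

Both variants end at the same offset; the element-wise one is simply easier to generate generically for any element type.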

libceph: kill __{insert,lookup,remove}_pg_mapping() · ab75144b

Ilya Dryomov authored Jun 21, 2017

Switch to DEFINE_RB_FUNCS2-generated {insert,lookup,erase}_pg_mapping().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce and switch to decode_pg_mapping() · a303bb0e

Ilya Dryomov authored Jun 21, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: don't pass pgid by value · 33333d10

Ilya Dryomov authored Jun 21, 2017

Make __{lookup,remove}_pg_mapping() look like their ceph_spg_mapping
counterparts: take const struct ceph_pg *.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: respect RADOS_BACKOFF backoffs · a02a946d

Ilya Dryomov authored Jun 19, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make DEFINE_RB_* helpers more general · 76f827a7

Ilya Dryomov authored Jun 19, 2017

Initially for ceph_pg_mapping, ceph_spg_mapping and ceph_hobject_id,
compared with ceph_pg_compare(), ceph_spg_compare() and hoid_compare()
respectively.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: avoid unnecessary pi lookups in calc_target() · df28152d

Ilya Dryomov authored Jun 15, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: use target pi for calc_target() calculations · 6d637a54

Ilya Dryomov authored Jun 15, 2017

For luminous and beyond we are encoding the actual spgid, which
requires operating with the correct pg_num, i.e. that of the target
pool.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
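
Why the pg_num matters can be seen from a small Python sketch of the stable-mod folding that turns a raw placement seed into an actual PG. The function names are modeled on the libceph helpers but are written here only for illustration, and the example seed is arbitrary.

```python
def pg_num_mask(pg_num):
    """Smallest (2**n - 1) mask that covers pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x, b, bmask):
    """Fold x into [0, b) so that few values move when b grows."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def raw_pg_to_pg(seed, pg_num):
    """Map a raw placement seed to the actual PG for a given pg_num."""
    return stable_mod(seed, pg_num, pg_num_mask(pg_num))
```

With seed 29, a pool with 8 PGs yields PG 5 while a pool with 16 PGs yields PG 13, so computing the actual spgid with the wrong pool's pg_num would address the wrong PG.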

libceph: always populate t->target_{oid,oloc} in calc_target() · db098ec4

Ilya Dryomov authored Jun 15, 2017

need_check_tiering logic doesn't make a whole lot of sense. Drop it
and apply tiering unconditionally on every calc_target() call instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: make sure need_resend targets reflect latest map · 04c7d789

Ilya Dryomov authored Jun 15, 2017

Otherwise we may miss events like PG splits, pool deletions, etc. when
we get multiple incremental maps at once.  Because check_pool_dne() can
now be fed an unlinked request, finish_request() needed to be taught to
handle unlinked requests.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: delete from need_resend_linger before check_linger_pool_dne() · a10bcb19

Ilya Dryomov authored Jun 15, 2017

When processing a map update consisting of multiple incrementals, we
may end up running check_linger_pool_dne() on a lingering request that
was previously added to need_resend_linger list.  If it is concluded
that the target pool doesn't exist, the request is killed off while
still on need_resend_linger list, which leads to a crash on a NULL
lreq->osd in kick_requests():

    libceph: linger_id 18446462598732840961 pool does not exist
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: ceph_osdc_handle_map+0x4ae/0x870
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: resend on PG splits if OSD has RESEND_ON_SPLIT · 7de030d6

Ilya Dryomov authored Jun 15, 2017

Note that ceph_osd_request_target fields are updated regardless of
RESEND_ON_SPLIT.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: drop need_resend from calc_target() · 84ed45df

Ilya Dryomov authored Jun 15, 2017

Replace it with more fine-grained bools to separate updating
ceph_osd_request_target fields and the decision to resend.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: MOSDOp v8 encoding (actual spgid + full hash) · 8cb441c0

Ilya Dryomov authored Jun 15, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: ceph_connection_operations::reencode_message() method · 98ad5ebd

Ilya Dryomov authored Jun 15, 2017

Give upper layers a chance to reencode the message after the connection
is negotiated and ->peer_features is set. OSD client will use this to
support both luminous and pre-luminous OSDs (in a single cluster): the
former need MOSDOp v8; the latter will continue to be sent MOSDOp v4.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
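
The idea of a re-encode hook can be sketched in Python as below. Everything here is a toy model: the feature bit is a hypothetical value, the "encodings" are dicts rather than wire formats, and `queue_message` is an invented helper. It only shows the control flow of encoding optimistically and downgrading once the peer's features are known.

```python
FEATURE_LUMINOUS_OSD = 1 << 0  # hypothetical bit, not the real feature value


def encode_mosdop(op, version):
    """Toy encoder; the real v4 and v8 layouts differ far more than this."""
    if version == 8:
        return {"version": 8, "spgid": op["spgid"],
                "hash": op["hash"], "oid": op["oid"]}
    return {"version": 4, "pgid": op["raw_pgid"], "oid": op["oid"]}


def queue_message(op):
    """Encode as v8 up front; the peer's features are not known yet."""
    return {"op": op, "payload": encode_mosdop(op, 8)}


def reencode_message(msg, peer_features):
    """Hook run after negotiation, once peer_features is set."""
    if not peer_features & FEATURE_LUMINOUS_OSD:
        # Older peer: fall back to the v4 encoding.
        msg["payload"] = encode_mosdop(msg["op"], 4)
    return msg
```

Keeping the original request alongside the payload is what makes the downgrade possible without involving the caller a second time.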

libceph: encode_{pgid,oloc}() helpers · 2e59ffd1

Ilya Dryomov authored Jun 15, 2017

Factor out encode_{pgid,oloc}() and use ceph_encode_string() for oid.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: introduce ceph_spg, ceph_pg_to_primary_shard() · dc98ff72

Ilya Dryomov authored Jun 15, 2017

Store both raw pgid and actual spgid in ceph_osd_request_target.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: new pi->last_force_request_resend · 8e48cf00

Ilya Dryomov authored Jun 05, 2017

The old (v15) pi->last_force_request_resend has been repurposed to
make pre-RESEND_ON_SPLIT clients that don't check for PG splits but do
obey pi->last_force_request_resend resend on splits.  See ceph.git
commit 189ca7ec6420 ("mon/OSDMonitor: make pre-luminous clients resend
ops on split").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: fold [l]req->last_force_resend into ceph_osd_request_target · dc93e0e2

Ilya Dryomov authored Jun 05, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: support SERVER_JEWEL feature bits · 220abf5a

Ilya Dryomov authored Jun 05, 2017

Only MON_STATEFUL_SUB, really.  MON_ROUTE_OSDMAP and
OSDSUBOP_NO_SNAPCONTEXT are irrelevant.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: advertise support for OSD_POOLRESEND · 2d7522e0

Ilya Dryomov authored Jun 05, 2017

The code has been in place since commit 63244fa1 ("libceph:
introduce ceph_osd_request_target, calc_target()"), and, with the
ceph_{oloc,oid}_copy() issue fixed in the previous commit, is now
in working order.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: handle non-empty dest in ceph_{oloc,oid}_copy() · ca35ffea

Ilya Dryomov authored Jun 05, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: new features macros · f179d3ba

Ilya Dryomov authored Jun 05, 2017

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: remove ceph_sanitize_features() workaround · dcbbd97c

Ilya Dryomov authored Jun 05, 2017

Reflects ceph.git commit ff1959282826ae6acd7134e1b1ede74ffd1cc04a.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: update ceph_dentry_info::lease_session when necessary · 481f001f

Yan, Zheng authored Jul 03, 2017

Current code does not update ceph_dentry_info::lease_session once
it is set. If the auth MDS of the corresponding dentry changes, the
dentry lease stays in an invalid state.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: new mount option that specifies fscache uniquifier · 1d8f8360

Yan, Zheng authored Jun 27, 2017

Currently ceph uses the FSID as the primary index key of fscache data.
This allows ceph to retain cached data across remounts, but it causes
problems (kernel oops; fscache does not support sharing data) when
a filesystem gets mounted several times (with fscache enabled, with
different mount options).

The fix is to add a new mount option that specifies a uniquifier
for fscache.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
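
The effect of a uniquifier on the cache index key can be modeled with a short Python sketch. The class below is purely illustrative: the names are invented, and rejecting a duplicate key outright is an assumption about how a clash would be handled, not something this commit message states.

```python
class FscacheRegistry:
    """Toy model of per-mount cache indexes keyed by (fsid, uniquifier)."""

    def __init__(self):
        self._keys = set()

    def register(self, fsid, uniquifier=""):
        key = (fsid, uniquifier)
        if key in self._keys:
            # Assumption: two mounts must not share one cache index.
            raise ValueError("cache index already in use; pick another uniquifier")
        self._keys.add(key)
        return key
```

With only the FSID in the key, a second mount of the same filesystem would collide with the first; adding a per-mount uniquifier gives each mount its own index while a remount with the same uniquifier can still find its old data.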

ceph: avoid accessing freeing inode in ceph_check_delayed_caps() · 4b9f2042

Yan, Zheng authored Jun 27, 2017

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: avoid invalid memory dereference in the middle of umount · 62a65f36

Yan, Zheng authored Jun 22, 2017

extra_mon_dispatch() and debugfs' foo_show functions dereference
fsc->mdsc. We should clean up fsc->client->extra_mon_dispatch
and debugfs before destroying fsc->mdsc.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: getattr before read on ceph.* xattrs · 1684dd03

Yan, Zheng authored Jun 14, 2017

Previously we were returning values for quota, layout
xattrs without any kind of update -- the user just got
whatever happened to be in our cache.

Clearly this extra round trip has a cost, but reads of
these xattrs are fairly rare, happening on admin
intervention rather than in normal operation.

Link: http://tracker.ceph.com/issues/17939
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: don't re-send interrupted flock request · 92e57e62

Yan, Zheng authored Jun 05, 2017

Don't re-send an interrupted flock request on MDS failover or on
receiving a request forward. Because the corresponding 'lock intr'
request may already have finished, it won't get re-sent.

Link: http://tracker.ceph.com/issues/20170
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: cleanup writepage_nounlock() · 43986881

Yan, Zheng authored May 23, 2017

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: redirty page when writepage_nounlock() skips unwritable page · fa71fefb

Yan, Zheng authored May 23, 2017

Ceph needs to flush dirty pages in the order of the snap contexts
they belong to. Dirty pages that belong to an older snap context
should be flushed earlier. If writepage_nounlock() cannot flush a
page, it should redirty the page.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: remove useless page->mapping check in writepage_nounlock() · f2b0c45f

Yan, Zheng authored May 23, 2017

Callers of writepage_nounlock() have already ensured non-null
page->mapping.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: update the 'approaching max_size' code · efb0ca76

Yan, Zheng authored May 22, 2017

The old 'approaching max_size' code expects the MDS to set max_size to
'2 * reported_size'. This is no longer true. The new code reports the
file size when half of the previous max_size increment has been used.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
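
One way to read the new rule is sketched below in Python. It assumes that the "increment" is the gap between the last reported size and the current max_size; the function name and this interpretation are assumptions made for the example, not a transcription of the kernel code.

```python
def should_report_size(size, reported_size, max_size):
    """True once half of the last max_size increment has been used.

    The increment is taken to be max_size - reported_size, so the test
    size - reported_size >= (max_size - reported_size) / 2
    is rewritten without a division as below.
    """
    return 2 * size >= max_size + reported_size
```

For a file last reported at 100 bytes with a max_size of 200, the client would report again once the file reaches 150 bytes, regardless of how max_size relates to twice the reported size.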

ceph: re-request max size after importing caps · 84eea8c7

Yan, Zheng authored May 16, 2017

The 'wanted max size' could have been sent to the inode's old auth MDS.
Re-send it to the inode's new auth MDS if necessary; otherwise the write
syscall may hang.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

02 Jul, 2017 1 commit

Linux 4.12 · 6f7da290

Linus Torvalds authored Jul 02, 2017