Commits · 9abd4db713704aac146395e079224ddd716e9b95 · Kirill Smelkov / linux

25 May, 2016 40 commits

ceph: don't use truncate_pagecache() to invalidate read cache · 9abd4db7

Yan, Zheng authored May 18, 2016

truncate_pagecache() drops dirty pages, so it is dangerous to use it
to invalidate the read cache. Besides, we shouldn't start invalidating
the read cache while there are buffer writers, because buffer writers
may add dirty pages later.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

9abd4db7

ceph: SetPageError() for writeback pages if writepages fails · b109eec6

Yan, Zheng authored May 13, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

b109eec6

ceph: handle interrupted ceph_writepage() · ad15ec06

Yan, Zheng authored May 13, 2016

writepage() can be interrupted when it is called by the direct memory
reclaimer (i.e. the direct memory reclaimer is killed). To avoid losing
data, we redirty the page.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

ad15ec06

ceph: make ceph_update_writeable_page() uninterruptible · a78bbd4b

Yan, Zheng authored May 13, 2016

ceph_update_writeable_page() is used by ceph_write_begin(). It breaks
the atomicity of the write operation if it is interruptible.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

a78bbd4b

libceph: make ceph_osdc_wait_request() uninterruptible · 0e76abf2

Yan, Zheng authored May 13, 2016

ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is to use a killable
wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <zyan@redhat.com>

0e76abf2

ceph: handle -EAGAIN returned by ceph_update_writeable_page() · f0b33df5

Yan, Zheng authored May 10, 2016

When ceph_update_writeable_page() returns -EAGAIN, the caller should
lock the page and call ceph_update_writeable_page() again.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

f0b33df5
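
The lock-and-retry pattern this message describes can be sketched in userspace C. This is only an illustration: `struct fake_page`, `lock_page_stub()` and `update_writeable_page_stub()` are hypothetical stand-ins, not the kernel functions, and the sketch assumes the helper leaves the page unlocked whenever it returns -EAGAIN.

```c
#include <errno.h>

/* Hypothetical stand-in for a page; 'busy' is how many attempts will fail. */
struct fake_page {
	int locked;
	int busy;
};

/* Hypothetical stand-in for locking a page. */
static void lock_page_stub(struct fake_page *page)
{
	page->locked = 1;
}

/*
 * Hypothetical stand-in for ceph_update_writeable_page(). In this sketch
 * it unlocks the page before returning -EAGAIN, so the caller has to
 * take the lock again before retrying.
 */
static int update_writeable_page_stub(struct fake_page *page)
{
	if (page->busy > 0) {
		page->busy--;
		page->locked = 0;
		return -EAGAIN;
	}
	return 0;
}

/* Lock the page and call the helper again for as long as it asks us to. */
static int write_begin_sketch(struct fake_page *page)
{
	int ret;

	do {
		lock_page_stub(page);
		ret = update_writeable_page_stub(page);
	} while (ret == -EAGAIN);

	return ret;
}
```

Any error other than -EAGAIN ends the loop and is returned to the caller unchanged.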

ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM · 6ce026e4

Yan, Zheng authored May 10, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

6ce026e4

ceph: block non-fatal signals for fault/page_mkwrite · 4f7e89f6

Yan, Zheng authored May 10, 2016

Fault and page_mkwrite are supposed to be uninterruptible, but they
call ceph functions that are interruptible. So they should block
signals before calling those functions.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

4f7e89f6
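
A userspace analogue of "block everything except the fatal signal, do the work, restore the mask" might look like the sketch below. It uses the POSIX `sigprocmask()` API rather than the in-kernel helpers, and the function names are hypothetical. It illustrates the idea and does not reproduce the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Block every signal except SIGKILL; the previous mask is saved in *oldset. */
static int block_nonfatal_signals(sigset_t *oldset)
{
	sigset_t mask;

	sigfillset(&mask);
	sigdelset(&mask, SIGKILL);	/* SIGKILL cannot be blocked anyway */
	return sigprocmask(SIG_BLOCK, &mask, oldset);
}

/* Put back the mask that was in effect before block_nonfatal_signals(). */
static int restore_signals(const sigset_t *oldset)
{
	return sigprocmask(SIG_SETMASK, oldset, NULL);
}

/* Return 1 if signo is currently blocked, 0 if not, -1 on error. */
static int signal_is_blocked(int signo)
{
	sigset_t cur;

	if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
		return -1;
	return sigismember(&cur, signo);
}
```

The interruptible call would sit between `block_nonfatal_signals()` and `restore_signals()`. Signals that arrive in between stay pending and are delivered once the old mask is restored.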

ceph: make logical calculation functions return bool · 3b33f692

Zhang Zhuoyu authored Mar 25, 2016

This patch makes several logical calculation functions return bool to
improve readability, since these particular functions only use 0/1
as their return value.

No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>

3b33f692

ceph: tolerate bad i_size for symlink inode · 224a7542

Yan, Zheng authored May 05, 2016

An MDS bug can cause a symlink's size to be truncated to zero.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

224a7542

ceph: improve fragtree change detection · 1b1bc16d

Yan, Zheng authored May 04, 2016

Check whether the number of splits in i_fragtree is equal to the number
of splits in the mds reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

1b1bc16d

ceph: keep leaf frag when updating fragtree · a4b7431f

Yan, Zheng authored May 04, 2016

Nodes in i_fragtree are sorted according to ceph_frag_compare().
This means a frag node in i_fragtree always follows its direct parent
node. To check if a leaf node is valid, we just need to check if
it is a child of the previous split node.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

a4b7431f

ceph: fix dir_auth check in ceph_fill_dirfrag() · 42172119

Yan, Zheng authored May 03, 2016

-1 is CDIR_AUTH_PARENT, which means the dir's auth mds is the same as
the inode's auth mds.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

42172119

ceph: don't assume frag tree splits in mds reply are sorted · a407846e

Yan, Zheng authored May 03, 2016

The algorithm that updates i_fragtree relies on the frag tree
splits in the mds reply being in the same order as i_fragtree. This is not
true, because the current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t, but nodes in i_fragtree are sorted according to
ceph_frag_compare().

The fix is to sort the frag tree splits first, then update i_fragtree.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

a407846e
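
The difference between the two orderings can be seen in a small userspace sketch. It assumes a frag is a 32-bit value with the number of split bits in the top 8 bits and the frag value in the low 24 bits, and that the comparison orders by value first and by bit count second. The helper names are hypothetical and only modelled on the description above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed encoding: split-bit count in the top 8 bits, value in the low 24. */
static uint32_t frag_bits(uint32_t f)
{
	return f >> 24;
}

static uint32_t frag_value(uint32_t f)
{
	return f & 0xffffffu;
}

/* Order by value first, then by bit count, so a child sorts after its parent. */
static int frag_compare(uint32_t a, uint32_t b)
{
	if (frag_value(a) != frag_value(b))
		return frag_value(a) < frag_value(b) ? -1 : 1;
	if (frag_bits(a) != frag_bits(b))
		return frag_bits(a) < frag_bits(b) ? -1 : 1;
	return 0;
}

/* Adapter so that qsort() can use frag_compare(). */
static int frag_compare_qsort(const void *pa, const void *pb)
{
	return frag_compare(*(const uint32_t *)pa, *(const uint32_t *)pb);
}

/* Sort the splits from a reply into the same order the local tree uses. */
static void sort_frag_splits(uint32_t *frags, size_t n)
{
	qsort(frags, n, sizeof(*frags), frag_compare_qsort);
}
```

With this encoding, the splits `0x00000000, 0x01000000, 0x01800000, 0x02000000` are in ascending unsigned order, but `sort_frag_splits()` moves `0x02000000` ahead of `0x01800000` because its value part is smaller.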

ceph: fix inode reference leak · 209ae762

Yan, Zheng authored Apr 29, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

209ae762

ceph: using hash value to compose dentry offset · f3c4ebe6

Yan, Zheng authored Apr 29, 2016

If the MDS sorts dentries in a dirfrag in hash order, we use the hash value
to compose the dentry offset. The dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry has hash collision)

This offset is stable across directory fragmentation. This also means
there is no need to reset the readdir offset if the directory gets
fragmented in the middle of readdir.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

f3c4ebe6
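
The offset layout given in this message can be written out as a short C sketch. The macro and function names are hypothetical. Only the bit positions (a 0xff tag at bit 52, a 24-bit hash at bit 28, and the collision index in the low 28 bits) come from the formula above.

```c
#include <stdint.h>

#define HASH_ORDER_TAG	((uint64_t)0xff << 52)	/* marks a hash-ordered offset */
#define HASH_SHIFT	28
#define HASH_MASK	0xffffffu		/* 24 bits of hash */
#define NTH_MASK	((1u << HASH_SHIFT) - 1)	/* low 28 bits */

/* Compose an offset from a 24-bit hash and the index among colliding entries. */
static uint64_t make_hash_offset(uint32_t hash, uint32_t nth)
{
	return HASH_ORDER_TAG |
	       ((uint64_t)(hash & HASH_MASK) << HASH_SHIFT) |
	       (uint64_t)(nth & NTH_MASK);
}

/* Recover the hash part of an offset built by make_hash_offset(). */
static uint32_t offset_hash(uint64_t off)
{
	return (uint32_t)(off >> HASH_SHIFT) & HASH_MASK;
}

/* Recover the collision index of an offset built by make_hash_offset(). */
static uint32_t offset_nth(uint64_t off)
{
	return (uint32_t)off & NTH_MASK;
}
```

For example, `make_hash_offset(0xabcdef, 3)` yields `0x0ffabcdef0000003`. Because the value depends only on the name hash and the collision index, it does not change when the directory is split into different fragments.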

ceph: don't forbid marking directory complete after forward seek · 076c40f1

Yan, Zheng authored Apr 28, 2016

A forward seek within the same frag does not update fi->last_name and
does not affect the contents of later readdir replies, so there is no
need to forbid marking the directory complete.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

076c40f1

ceph: record 'offset' for each entry of readdir result · 8974eebd

Yan, Zheng authored Apr 28, 2016

This is preparation for using the hash value as the dentry 'offset'.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

8974eebd

ceph: define 'end/complete' in readdir reply as bit flags · 956d39d6

Yan, Zheng authored Apr 27, 2016

Set a flag in the readdir request which indicates that the client
interprets 'end/complete' as bit flags, so that the mds can send
additional flags in the readdir reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

956d39d6
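
Treating 'end/complete' as bit flags leaves room for further flags in the same field. The sketch below shows the decoding side. The flag names and bit values are hypothetical and chosen only for illustration; they are not the values used on the wire.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real protocol values may differ. */
#define REPLY_FLAG_END		(1u << 0)
#define REPLY_FLAG_COMPLETE	(1u << 1)

struct readdir_reply_state {
	bool end;	/* no more entries in this frag */
	bool complete;	/* the reply covers the whole frag */
	uint32_t extra;	/* any additional flag bits sent by the server */
};

/* Split a flags word into the two known bits and whatever else was set. */
static struct readdir_reply_state decode_reply_flags(uint32_t flags)
{
	struct readdir_reply_state st;

	st.end = (flags & REPLY_FLAG_END) != 0;
	st.complete = (flags & REPLY_FLAG_COMPLETE) != 0;
	st.extra = flags & ~(REPLY_FLAG_END | REPLY_FLAG_COMPLETE);
	return st;
}
```

A client that announces bit-flag support in its request can decode replies this way, and a server can later set new bits without changing the layout of the reply.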

ceph: define struct for dir entry in readdir reply · 2a5beea3

Yan, Zheng authored Apr 28, 2016

This avoids defining multiple arrays for entries in the readdir reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

2a5beea3

ceph: simplify 'offset in frag' · a78600e7

Yan, Zheng authored Apr 27, 2016

Don't distinguish the leftmost frag from other frags. Always use 2 as
the first entry's offset.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

a78600e7

ceph: remove unnecessary checks in __dcache_readdir · 1cd42a42

Yan, Zheng authored Apr 29, 2016

We never add the snapdir or the hidden .ceph dir to the readdir cache.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

1cd42a42

ceph: search cache position for dcache readdir · c530cd24

Yan, Zheng authored Apr 28, 2016

Use binary search to find the cache index that corresponds to the
readdir position.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

c530cd24
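
A minimal sketch of such a lookup follows. It assumes the cached entries can be viewed as a flat array of offsets sorted in ascending order, which simplifies however the kernel actually stores the cache. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first cached entry whose offset is >= pos,
 * or n if every cached offset is smaller than pos.
 * 'offsets' must be sorted in ascending order.
 */
static size_t find_cache_index(const uint64_t *offsets, size_t n, uint64_t pos)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (offsets[mid] < pos)
			lo = mid + 1;	/* answer lies to the right of mid */
		else
			hi = mid;	/* mid itself may be the answer */
	}
	return lo;
}
```

The search takes O(log n) steps, so resuming a readdir at an arbitrary position does not require walking the cache from the start.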

ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr · 04303d8a

Yan, Zheng authored Apr 21, 2016

Setxattr with a NULL value and the XATTR_REPLACE flag should be equivalent
to removexattr. But the current MDS does not support deleting vxattrs
through an MDS_OP_SETXATTR request. The workaround is to send an
MDS_OP_RMXATTR request if setxattr actually removes an xattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

04303d8a
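
The decision described above comes down to one check on the value pointer and the flags. In the sketch below, the enum, the flag constant and the function name are hypothetical placeholders for the real request opcodes and xattr flags.

```c
#include <stddef.h>

/* Hypothetical stand-in for the XATTR_REPLACE flag bit. */
#define SKETCH_XATTR_REPLACE	0x2

/* Hypothetical stand-ins for the two request types. */
enum sketch_xattr_op {
	SKETCH_OP_SETXATTR,
	SKETCH_OP_RMXATTR,
};

/*
 * A setxattr call with no value and the replace flag set removes the
 * attribute, so it is sent as a remove request. Everything else is
 * sent as an ordinary set request.
 */
static enum sketch_xattr_op choose_xattr_op(const void *value, int flags)
{
	if (value == NULL && (flags & SKETCH_XATTR_REPLACE))
		return SKETCH_OP_RMXATTR;
	return SKETCH_OP_SETXATTR;
}
```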

ceph: report mount root in session metadata · 3f384954

Yan, Zheng authored Apr 21, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

3f384954

ceph: don't show symlink target in debugfs/mdsc · aeda081c

Yan, Zheng authored Apr 18, 2016

The symlink target is useless for debugging and can be very long. It's
annoying to show it in debugfs/mdsc.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

aeda081c

ceph: don't call truncate_pagecache in ceph_writepages_start · 6c93df5d

Yan, Zheng authored Apr 15, 2016

truncate_pagecache() may decrease the inode's reference count. This can
cause a deadlock if the inode's last reference is dropped and iput_final()
wants to evict the inode (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).

The fix is to use a work thread to truncate dirty pages. Also add a 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages from getting dirty.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

6c93df5d

ceph: renew caps for read/write if mds session got killed. · 77310320

Yan, Zheng authored Apr 08, 2016

When the mds session gets killed, read/write operations may hang.
The client waits for Frw caps, but the mds does not know what caps the
client wants. To recover from this, the client sends an open request to
the mds. The request tells the mds what caps the client wants.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

77310320

ceph: CEPH_FEATURE_MDSENC support · d463a43d

Yan, Zheng authored Mar 31, 2016

Signed-off-by: Yan, Zheng <zyan@redhat.com>

d463a43d

ceph: multiple filesystem support · 235a0982

Yan, Zheng authored Mar 30, 2016

To access a non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for the mds
namespace id.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

235a0982

libceph: support for subscribing to "mdsmap.<id>" maps · 737cc81e

Ilya Dryomov authored May 26, 2016

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

737cc81e

libceph: replace ceph_monc_request_next_osdmap() · 7cca78c9

Ilya Dryomov authored Apr 28, 2016

... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

7cca78c9

libceph: take osdc->lock in osdmap_show() and dump flags in hex · b4f34795

Ilya Dryomov authored Apr 28, 2016

There are now about a dozen CEPH_OSDMAP_* flags.  This is a debugging
interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b4f34795

libceph: pool deletion detection · 4609245e

Ilya Dryomov authored Apr 28, 2016

This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

4609245e

libceph: async MON client generic requests · d0b19705

Ilya Dryomov authored Apr 28, 2016

For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

d0b19705

libceph: support for checking on status of watch · b07d3c4b

Ilya Dryomov authored Apr 28, 2016

Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

b07d3c4b

libceph: support for sending notifies · 19079203

Ilya Dryomov authored Apr 28, 2016

Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

19079203

libceph, rbd: ceph_osd_linger_request, watch/notify v2 · 922dab61

Ilya Dryomov authored May 26, 2016

This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

922dab61

rbd: rbd_dev_header_unwatch_sync() variant · c525f036

Ilya Dryomov authored Apr 28, 2016

Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks.  This is for the new rados_watcherrcb_t, which would be
called from a notify callback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

c525f036

libceph: wait_request_timeout() · 42b06965

Ilya Dryomov authored Apr 28, 2016

The unwatch timeout is currently implemented in rbd.  With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

42b06965