Commits · 792c3a914910bd34302c5345578f85cfcb5e2c01 · nexedi / linux

14 Oct, 2014 39 commits

rbd: rbd workqueues need a resque worker · 792c3a91

Ilya Dryomov authored Oct 10, 2014

Our workqueues need WQ_MEM_RECLAIM to prevent I/O lockups under memory
pressure, because we sit on the memory reclaim path.

Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>

libceph: ceph-msgr workqueue needs a resque worker · f9865f06

Ilya Dryomov authored Oct 10, 2014

Commit f363e45f ("net/ceph: make ceph_msgr_wq non-reentrant")
effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq.  This is
wrong - libceph is very much a memory reclaim path, so restore it.

Cc: stable@vger.kernel.org # needs backporting for < 3.12
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>

ceph: fix bool assignments · ab6c2c3e

Fabian Frederick authored Oct 09, 2014

Fix some coccinelle warnings:
fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
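
The warnings above ask for `true`/`false` instead of `1`/`0` when assigning to a `bool`. A minimal userspace sketch of that pattern follows; the struct and function names are hypothetical and are not taken from fs/ceph/caps.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a few boolean flags; not the real caps.c code. */
struct demo_cap_flags {
	bool wake;
	bool queue_trunc;
};

/* Initialise bool fields with false rather than 0. */
void demo_cap_flags_init(struct demo_cap_flags *f)
{
	assert(f != NULL);
	f->wake = false;		/* instead of: f->wake = 0; */
	f->queue_trunc = false;		/* instead of: f->queue_trunc = 0; */
}

/* Set a bool field with true rather than 1. */
void demo_cap_flags_mark_wake(struct demo_cap_flags *f)
{
	assert(f != NULL);
	f->wake = true;			/* instead of: f->wake = 1; */
}
```

The generated code is the same either way; the change only makes the intent of each assignment explicit.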

libceph: separate multiple ops with commas in debugfs output · 25f89777

Ilya Dryomov authored Oct 06, 2014

For requests with multiple ops, separate ops with commas instead of \t,
which is a field separator here.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
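
To illustrate why the separator matters, here is a small userspace sketch that keeps `\t` between fields and uses commas between ops inside one field. The function name, the field layout, and the op strings are assumptions made for illustration; the actual debugfs code is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format one request line: fields are tab-separated, while the ops inside
 * the last field are comma-separated. Returns the number of characters
 * written (excluding the NUL), or -1 if the buffer is too small.
 */
int demo_format_request_line(char *buf, size_t len, unsigned long tid,
			     const char *const *ops, size_t num_ops)
{
	size_t i;
	int n, used;

	assert(buf != NULL && len > 0);

	used = snprintf(buf, len, "%lu\t", tid);
	if (used < 0 || (size_t)used >= len)
		return -1;

	for (i = 0; i < num_ops; i++) {
		/* A comma goes before every op except the first. */
		n = snprintf(buf + used, len - (size_t)used, "%s%s",
			     i ? "," : "", ops[i]);
		if (n < 0 || (size_t)n >= len - (size_t)used)
			return -1;
		used += n;
	}
	return used;
}
```

A consumer that splits each line on tabs then always sees the same number of fields, however many ops a request carries.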

libceph: sync osd op definitions in rados.h · 70b5bfa3

Ilya Dryomov authored Oct 02, 2014

Bring in missing osd ops and strings, use macros to eliminate multiple
points of maintenance.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
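
One common way to "use macros to eliminate multiple points of maintenance" is a single list macro that is expanded once for the enum and once for the name lookup. The sketch below shows that general technique; every identifier, value, and string in it is an illustrative placeholder rather than a real rados.h definition.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One list, expanded twice. The op names, values, and strings below are
 * illustrative placeholders, not the real rados.h definitions.
 */
#define DEMO_FORALL_OSD_OPS(f)		\
	f(READ,  1, "read")		\
	f(STAT,  2, "stat")		\
	f(WRITE, 3, "write")

/* Expansion 1: the enum of op codes. */
#define DEMO_GENERATE_ENUM_ENTRY(op, opcode, str)	DEMO_OSD_OP_##op = (opcode),
enum demo_osd_op {
	DEMO_FORALL_OSD_OPS(DEMO_GENERATE_ENUM_ENTRY)
};
#undef DEMO_GENERATE_ENUM_ENTRY

/* Expansion 2: the code-to-name lookup built from the same list. */
const char *demo_osd_op_name(int code)
{
#define DEMO_GENERATE_CASE(op, opcode, str)	case DEMO_OSD_OP_##op: return (str);
	switch (code) {
	DEMO_FORALL_OSD_OPS(DEMO_GENERATE_CASE)
	default:
		return "???";
	}
#undef DEMO_GENERATE_CASE
}
```

Adding an op then means adding one line to the list, and the enum and the string table cannot drift apart.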

libceph: remove redundant declaration · eb179d39

Fabian Frederick authored Sep 30, 2014

ceph_release_page_vector() was declared twice in libceph.h.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>

ceph: additional debugfs output · 14ed9703

John Spray authored Sep 12, 2014

MDS session state and client global ID are
useful instrumentation when testing.
Signed-off-by: John Spray <john.spray@redhat.com>

ceph: export ceph_session_state_name function · a687ecaf

John Spray authored Sep 19, 2014

...so that it can be used from the ceph debugfs
code when dumping session info.
Signed-off-by: John Spray <john.spray@redhat.com>

ceph: include the initial ACL in create/mkdir/mknod MDS requests · b1ee94aa

Yan, Zheng authored Sep 16, 2014

The current code sets a new file or directory's initial ACL in a
non-atomic manner.
The client first sends a request to the MDS to create the new file or
directory, then sets the initial ACL after the creation succeeds.

The fix is to include the initial ACL in create/mkdir/mknod MDS requests,
so the MDS can create the file or directory and set its initial ACL in
one request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

ceph: use pagelist to present MDS request data · 25e6bae3

Yan, Zheng authored Sep 16, 2014

The current code uses a page array to present MDS request data. Pages in
the array are allocated and freed by the caller of ceph_mdsc_do_request().
If the request is interrupted, the pages can be freed while they are still
being used by the request message.

The fix is to use a pagelist to present MDS request data. A pagelist is
reference counted.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

libceph: reference counting pagelist · e4339d28

Yan, Zheng authored Sep 16, 2014

This allows a pagelist to present data that may be sent multiple times.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
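
A reference count lets each user of the data (for example the caller and the in-flight message) drop its hold independently, with the memory freed only when the last hold goes away. The following is a minimal userspace sketch of that idea using C11 atomics; the `demo_` type and functions are hypothetical and much smaller than the real pagelist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, much-reduced stand-in for a reference-counted pagelist. */
struct demo_pagelist {
	atomic_int refcnt;
	size_t length;
};

struct demo_pagelist *demo_pagelist_alloc(void)
{
	struct demo_pagelist *pl = calloc(1, sizeof(*pl));

	if (pl)
		atomic_init(&pl->refcnt, 1);	/* the creator holds one reference */
	return pl;
}

/* Take an additional reference, e.g. for a message that may be resent. */
void demo_pagelist_get(struct demo_pagelist *pl)
{
	atomic_fetch_add(&pl->refcnt, 1);
}

/* Drop one reference; returns 1 if this call freed the pagelist. */
int demo_pagelist_put(struct demo_pagelist *pl)
{
	int old = atomic_fetch_sub(&pl->refcnt, 1);

	assert(old > 0);
	if (old == 1) {
		free(pl);
		return 1;
	}
	return 0;
}
```

With this shape, an interrupted caller can drop its reference early without freeing data that a message still points at.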

ceph: fix llistxattr on symlink · 0abb43dc

Yan, Zheng authored Sep 18, 2014

Only regular files and directories have vxattrs.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

ceph: send client metadata to MDS · dbd0c8bf

John Spray authored Sep 09, 2014

Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax,
which includes additional client metadata to allow
the MDS to report on clients by user-sensible names
like hostname.
Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>

ceph: remove redundant code for max file size verification · a4483e8a

Chao Yu authored Sep 17, 2014

Both ceph_update_writeable_page and ceph_setattr verify the file size
against the maximum size ceph supports.
There are two callers of ceph_update_writeable_page: ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, the size has already been verified in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, there is no
chance to change the file size through mmap. Likewise, the size has already
been verified in inode_change_ok when ceph_setattr is called.
So remove the redundant code for max file size verification.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>

ceph: remove redundant io_iter_advance() · 3b70b388

Yan, Zheng authored Sep 17, 2014

ceph_sync_read and generic_file_read_iter() have already advanced the
IO iterator.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

ceph: move ceph_find_inode() outside the s_mutex · 6cd3bcad

Yan, Zheng authored Sep 17, 2014

ceph_find_inode() may wait on a freeing inode, so using it inside the
s_mutex may cause a deadlock (the freeing inode is waiting for an OSD read
reply, but the dispatch thread is blocked by the s_mutex).
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

ceph: request xattrs if xattr_version is zero · 508b32d8

Yan, Zheng authored Sep 16, 2014

The following sequence of events can happen:
  - The client releases an inode and queues a cap release message.
  - A 'lookup' reply brings the same inode back, but the reply
    doesn't contain xattrs because the MDS didn't receive the cap release
    message and thought the client already had up-to-date xattrs.

The fix is to force sending a getattr request to the MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so the MDS knows the
client does not have xattrs.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

rbd: set the remaining discard properties to enable support · b76f8239

Josh Durgin authored Apr 07, 2014

max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: use helpers to handle discard for layered images correctly · d3246fb0

Josh Durgin authored Apr 07, 2014

Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes.  Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: extract a method for adding object operations · 3b434a2a

Josh Durgin authored Apr 04, 2014

rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.

For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.

op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: make discard trigger copy-on-write · 1c220881

Josh Durgin authored Apr 04, 2014

Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: tolerate -ENOENT for discard operations · d0265de7

Josh Durgin authored Apr 07, 2014

Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
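
The rule described above can be sketched as a small result filter: a discard that finds no object has nothing left to remove, so -ENOENT becomes success, while everything else passes through. The function below is a hypothetical userspace illustration and does not reproduce the rbd completion path or its layered-image check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical completion filter: a discard that finds no object to delete
 * has nothing left to do, so -ENOENT is reported as success. Other
 * request types and other errors pass through unchanged.
 */
int demo_filter_discard_result(int result, bool is_discard)
{
	assert(result <= 0);	/* 0 or a negative errno */

	if (is_discard && result == -ENOENT)
		return 0;
	return result;
}
```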

rbd: fix snapshot context reference count for discards · bef95455

Josh Durgin authored Apr 04, 2014

Discards take a reference to the snapshot context of an image when
they are created.  This reference needs to be cleaned up when the
request is done just as it is for regular writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: read image size for discard check safely · 3c5df893

Josh Durgin authored Apr 04, 2014

In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: initial discard bits from Guangliang Zhao · 90e98c52

Guangliang Zhao authored Apr 01, 2014

This patch adds discard support to the rbd driver.

There are three types of operation in the driver:
1. Objects are removed if they are completely contained
   within the discard range.
2. Objects are truncated if they are partly contained within
   the discard range and the range aligns with their boundary.
3. Others are zeroed.

A discard request from blkdev_issue_discard() is defined as one with
both REQ_WRITE and REQ_DISCARD set and no data, so we must
check REQ_DISCARD first when getting the request type.

This resolves:
	http://tracker.ceph.com/issues/190

[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
  commits by Josh Durgin for refinements and fixes which weren't
  folded in to preserve authorship. ]
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
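
The three cases listed in the commit message above can be sketched as a per-object classification. This is a simplified userspace illustration under stated assumptions: offsets are relative to one object, layered images and the end-of-image case handled by the follow-up commits are ignored, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum demo_discard_action {
	DEMO_DISCARD_DELETE,	/* whole object falls inside the range */
	DEMO_DISCARD_TRUNCATE,	/* range runs to the end of the object */
	DEMO_DISCARD_ZERO,	/* anything else */
};

/*
 * Classify a discard of [off, off + len) within a single object of
 * obj_size bytes. off and len are relative to the start of the object.
 */
enum demo_discard_action demo_classify_discard(uint64_t off, uint64_t len,
					       uint64_t obj_size)
{
	assert(obj_size > 0 && len > 0);
	assert(off < obj_size && len <= obj_size - off);

	if (off == 0 && len == obj_size)
		return DEMO_DISCARD_DELETE;
	if (off + len == obj_size)
		return DEMO_DISCARD_TRUNCATE;
	return DEMO_DISCARD_ZERO;
}
```

For a 4096-byte object, a discard of the whole object maps to delete, a discard of the last 3072 bytes maps to truncate, and a discard of 1024 bytes in the middle maps to zero.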

rbd: extend the operation type · 6d2940c8

Guangliang Zhao authored Mar 13, 2014

The operation type can only describe read and write operations now;
extend it for the coming discard support.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

rbd: skip the copyup when an entire object writing · c622d226

Guangliang Zhao authored Apr 01, 2014

A layered write needs to copy up the parent's content first, but a write
that covers an entire object would overwrite it anyway, so skip the copyup.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

rbd: add img_obj_request_simple() helper · 70d045f6

Ilya Dryomov authored Sep 12, 2014

To clarify the conditions and make it easier to add new ones.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

rbd: access snapshot context and mapping size safely · 4e752f0a

Josh Durgin authored Apr 08, 2014

These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.

Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

rbd: do not return -ERANGE on auth failures · 7dd440c9

Ilya Dryomov authored Sep 11, 2014

Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors.  (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
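
The class of bug named above, an unsigned vs signed comparison, can be reproduced in a few lines. The sketch below is a generic userspace illustration: the commit message does not show the affected line, so both functions and the 4-byte length threshold are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ret is either a negative errno or the number of reply bytes.
 * Buggy form: comparing ret against a sizeof() value converts ret to
 * size_t, so a negative errno becomes a huge positive number and the
 * check passes. The explicit cast spells out that implicit conversion.
 */
int demo_has_payload_buggy(int ret)
{
	return (size_t)ret > sizeof(uint32_t);
}

/* Fixed form: reject negative values before looking at the length. */
int demo_has_payload_fixed(int ret)
{
	return ret >= 0 && (size_t)ret > sizeof(uint32_t);
}
```

In the buggy form an error such as -EPERM is treated as a valid reply length, and whatever error the later decoding step produces replaces the original one.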

libceph: don't try checking queue_work() return value · 91883cd2

Ilya Dryomov authored Sep 11, 2014

queue_work() doesn't "fail to queue", it returns false if work was
already on a queue, which can't happen here since we allocate
event_work right before we queue it.  So don't bother at all.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

ceph: make sure request isn't in any waiting list when kicking request. · 03974e81

Yan, Zheng authored Sep 11, 2014

We may corrupt the waiting list if a request in the waiting list is kicked.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

ceph: protect kick_requests() with mdsc->mutex · 656e4382

Yan, Zheng authored Sep 11, 2014

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

libceph: Convert pr_warning to pr_warn · b9a67899

Joe Perches authored Sep 09, 2014

Use the more common pr_warn.

Other miscellanea:

o Coalesce formats
o Realign arguments
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

ceph: trim unused inodes before reconnecting to recovering MDS · 5d23371f

Yan, Zheng authored Sep 10, 2014

So the recovering MDS does not need to fetch these unused inodes during
cache rejoin. This may reduce MDS recovery time.
Signed-off-by: Yan, Zheng <zyan@redhat.com>

libceph: fix a use after free issue in osdmap_set_max_osd · 589506f1

Li RongQing authored Sep 07, 2014

If the state variable is krealloc'ed successfully, the old map->osd_state
is freed. If one of the following two reallocations then fails, the
function exits without resetting map->osd_state, and map->osd_state
becomes a wild pointer.

Fix it by resetting each pointer right after its krealloc succeeds.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
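
The pattern behind this fix can be shown in userspace with realloc(): store each new pointer back as soon as its reallocation succeeds, so a later failure cannot leave the structure pointing at freed memory. The struct and function below are hypothetical and use two arrays instead of the three mentioned in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical two-array map, loosely modelled on the description above. */
struct demo_map {
	uint8_t *state;
	uint32_t *weight;
	size_t max;
};

/*
 * Grow both arrays to hold max entries. Each pointer is stored back into
 * the map as soon as its realloc() succeeds, so an error on the second
 * call cannot leave map->state pointing at memory realloc() already freed.
 */
int demo_map_set_max(struct demo_map *map, size_t max)
{
	uint8_t *state;
	uint32_t *weight;

	assert(map != NULL && max > 0);

	state = realloc(map->state, max * sizeof(*state));
	if (!state)
		return -ENOMEM;
	map->state = state;		/* commit immediately */

	weight = realloc(map->weight, max * sizeof(*weight));
	if (!weight)
		return -ENOMEM;		/* map->state is still valid */
	map->weight = weight;

	map->max = max;
	return 0;
}
```

The buggy shape keeps all new pointers in locals and assigns them only at the end, which is safe only if every reallocation succeeds.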

libceph: select CRYPTO_CBC in addition to CRYPTO_AES · dc220db0

Ilya Dryomov authored Sep 05, 2014

We want "cbc(aes)" algorithm, so select CRYPTO_CBC too, not just
CRYPTO_AES.  Otherwise on !CRYPTO_CBC kernels we fail rbd map/mount
with

    libceph: error -2 building auth method x request
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

libceph: resend lingering requests with a new tid · 2cc6128a

Ilya Dryomov authored Sep 03, 2014

Both not yet registered (r_linger && list_empty(&r_linger_item)) and
registered linger requests should use the new tid on resend to avoid
the dup op detection logic on the OSDs, yet we were doing this only for
"registered" case. Factor out and simplify the "registered" logic and
use the new helper for "not registered" case as well.

Fixes: http://tracker.ceph.com/issues/8806
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

libceph: abstract out ceph_osd_request enqueue logic · f671b581

Ilya Dryomov authored Sep 02, 2014

Introduce __enqueue_request() and switch to it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>

05 Oct, 2014 1 commit

Linux 3.17 · bfe01a5b

Linus Torvalds authored Oct 05, 2014