- 30 Jun, 2015 2 commits
-
-
Ilya Dryomov authored
rbd_obj_request_create() is called on the main I/O path, so we need to use GFP_NOIO to make sure allocation doesn't blow back on us. Not all callers need this, but I'm still hardcoding the flag inside rather than making it a parameter because a) this is going to stable, and b) those callers shouldn't really use rbd_obj_request_create() and will be fixed in the future. More memory allocation fixes will follow. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
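A minimal sketch of the allocation being described, assuming the rbd_obj_request_cache slab and the usual kmem_cache_zalloc() call (abbreviated; the real function also initializes the request, and the previous flag is assumed):

    /* main I/O path: GFP_NOIO so reclaim can't recurse back into block I/O */
    struct rbd_obj_request *obj_request;

    obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
    if (!obj_request)
            return NULL;    /* previously allocated with a reclaim-unsafe flag */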
-
Ilya Dryomov authored
struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe() should be used. -Wconversion catches this, but I guess it went unnoticed in all the noise it spews. The actual problem (at least for common crushmaps) isn't the u32 -> u8 truncation though - it's the advancement by 4 bytes instead of 1 in the crushmap buffer. Fixes: http://tracker.ceph.com/issues/2759 Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
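A sketch of the decode fix, assuming the ceph_decode_*_safe() helpers from include/linux/ceph/decode.h and a struct crush_bucket_tree pointer b:

    /* num_nodes is a u8 on the wire: decode 1 byte, advance p by 1 */
    ceph_decode_8_safe(p, end, b->num_nodes, bad);

    /* the old ceph_decode_32_safe() both truncated u32 -> u8 and, worse,
     * advanced the buffer pointer by 4 bytes instead of 1 */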
-
- 29 Jun, 2015 1 commit
-
-
Benoît Canet authored
From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h: bool last_piece; /* current is last piece */ In ceph_msg_data_next(): *last_piece = cursor->last_piece; A call to ceph_msg_data_next() is followed by: ret = ceph_tcp_sendpage(con->sock, page, page_offset, length, last_piece); while ceph_tcp_sendpage() is: static int ceph_tcp_sendpage(struct socket *sock, struct page *page, int offset, size_t size, bool more) The logic is inverted: correct it. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
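A sketch of the corrected call site (shape assumed from the quotes above): last_piece says the current piece is the last one, while ceph_tcp_sendpage()'s final argument means "more data follows", so the flag has to be negated:

    page = ceph_msg_data_next(cursor, &page_offset, &length, &last_piece);
    ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
                            !last_piece);   /* was: last_piece */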
-
- 25 Jun, 2015 37 commits
-
-
Benoît Canet authored
ceph_tcp_sendpage already does the work of mapping/unmapping the zero page if needed. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much irrelevant in blk-mq case because each driver sets its own max depth that it can handle and that's the number of tags that gets preallocated on setup. Users can't increase queue depth beyond that value via writing to nr_requests. For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most cases but we want to give users the opportunity to increase it. Introduce a new per-device queue_depth option to do just that: $ sudo rbd map -o queue_depth=1024 ... Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
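In blk-mq the effective depth is the number of tags preallocated for the device, so a hedged sketch of where such an option lands is the tag set setup (field names from struct blk_mq_tag_set; the rbd option plumbing and ops name are assumptions):

    rbd_dev->tag_set.ops = &rbd_mq_ops;                        /* driver's mq ops (assumed name) */
    rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth; /* default BLKDEV_MAX_RQ (128) */
    rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
    rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;

    err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);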
-
Ilya Dryomov authored
Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Also nuke useless Opt_last_bool and don't break lines unnecessarily. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
-
Yan, Zheng authored
Before a page gets locked, someone else can write data to the page and increase the i_size. So we should re-check the i_size after pages are locked. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Ilya Dryomov authored
The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128) unnecessarily limits bio sizes to 512k (assuming 4k pages). rbd, being a virtual block device, doesn't have any restrictions on the number of physical segments, so bump max_segments to max_hw_sectors, in theory allowing a sector per segment (although the only case where this matters that I can think of is some readv/writev style thing). In practice this is going to give us 1M bios - the number of segments in a bio is limited in bio_get_nr_vecs() by BIO_MAX_PAGES = 256. Note that this doesn't result in any improvement on a typical direct sequential test. This is because on a box with not too badly fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice rbd-object-sized requests. The only difference is the size of the bios being merged - 512k vs 1M for something like $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
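A sketch of the queue-limits change described above (setup shape assumed, in the spirit of rbd_init_disk()):

    segment_size = rbd_obj_bytes(&rbd_dev->header);           /* rbd object size */
    blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);  /* largest request */
    blk_queue_max_segments(q, segment_size / SECTOR_SIZE);    /* was BLK_MAX_SEGMENTS (128) */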
-
Yan, Zheng authored
Previously our dcache readdir code relied on child dentries in a directory dentry's d_subdir list being sorted by dentry offset in descending order. When adding dentries to the dcache, if a dentry already exists, our readdir code moves it to the head of the directory dentry's d_subdir list. This design relies on dcache internals. Al Viro suggests using ncpfs's approach: keeping an array of pointers to dentries in the page cache of the directory inode. The validity of those pointers is indicated by the directory inode's complete and ordered flags. When a dentry gets pruned, we clear the directory inode's complete flag in the d_prune() callback. Before moving a dentry to another directory, we clear the ordered flag for both the old and new directory. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Ilya Dryomov authored
.. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff between kernel and userspace"). This fixes a bunch of recently pulled coding style issues and makes includes a bit cleaner. A patch "crush:Make the function crush_ln static" from Nicholas Krause <xerofoify@gmail.com> is folded in as crush_ln() has been made static in userspace as well. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Verify that the 'take' argument is a valid device or bucket. Otherwise ignore it (do not add the value to the working vector). Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Yan, Zheng authored
GFP_NOFS memory allocation is required for the page writeback path, but there is no need to use GFP_NOFS in the syscall and readpage paths. Signed-off-by: Yan, Zheng <zyan@redhat.com>
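A minimal illustration of the idea, with a hypothetical helper (not a function from the patch):

    /* GFP_NOFS only where reclaim must not re-enter the filesystem */
    static struct page **alloc_page_vec(int num_pages, bool for_writeback)
    {
            gfp_t gfp = for_writeback ? GFP_NOFS : GFP_KERNEL;

            return kcalloc(num_pages, sizeof(struct page *), gfp);
    }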
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
If flushing caps were revoked, we should re-send the cap flush in the client reconnect stage. This guarantees that the MDS processes the cap flush message before issuing the flushing caps to another client. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Based on this information, the MDS can trim its completed caps flush list (which is used to detect duplicated cap flushes). Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
So we know the TID of the oldest pending cap flush. A later patch will send this information to the MDS, so that the MDS can trim its completed caps flush list. Tracking pending cap flushes globally also simplifies the syncfs code. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Previously we did not track an accurate TID for flushing caps. When the MDS fails over, we have no choice but to re-send all flushing caps with a new TID. This can cause problems because the MDS may have already flushed some caps and issued the same caps to another client. The re-sent cap flush has a new TID, which makes the MDS unable to detect whether it has already processed the cap flush. This patch adds code to track pending cap flushes accurately. When re-sending a cap flush is needed, we use its original flush TID. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Hong Zhiguo authored
modinfo libceph prints the module description "Ceph filesystem for Linux", which is the same as that of the real fs module, ceph. It's confusing. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
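The fix is a one-line change to libceph's module metadata (replacement wording assumed):

    MODULE_DESCRIPTION("Ceph core library");   /* was: "Ceph filesystem for Linux" */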
-
Yan, Zheng authored
fsync() on a directory should flush dirty caps and wait for any uncommitted directory operations to commit. But ceph_dir_fsync() only waits for uncommitted directory operations. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Currently ceph_fsync() only flushes dirty caps and waits for them to be flushed. It doesn't wait for caps that are already being flushed. This patch makes ceph_fsync() wait for pending flushing caps too. It also makes caps_are_flushed() properly handle TID wrapping. Signed-off-by: Yan, Zheng <zyan@redhat.com>
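For the TID wrap handling, the usual trick is a signed-difference comparison in the spirit of time_before(); a hypothetical helper as a sketch:

    /* true if tid a was issued before tid b, even across u64 wrap-around */
    static inline bool tid_before(u64 a, u64 b)
    {
            return (s64)(a - b) < 0;
    }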
-
Yan, Zheng authored
When copying files to cephfs, file data may stay in the page cache after the corresponding file is closed. Cached data uses the Fc capability. If we include the Fc capability in cap_wanted, the MDS will treat files with cached data as open files, and journal them in an EOpen event when trimming the log segment. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Ilya Dryomov authored
As part of the unmap sequence, the kernel client has to talk to the OSDs to tear down the watch on the header object. If none of the OSDs are available it would hang forever, until interrupted by a signal - when that happens we follow through with the rest of the unmap procedure (i.e. unregister the device and put all the data structures) and the unmap is still considered successful (the rbd cli tool exits with 0). The watch on the userspace side should eventually time out, so that's fine. This isn't very nice, because various userspace tools (the pacemaker rbd resource agent, for example) then have to worry about setting up their own timeouts. Time it out with mount_timeout (60 seconds by default). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Sage Weil <sage@redhat.com>
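A hedged sketch of what a bounded wait looks like (the actual wait is on the OSD request that tears down the watch; the names below are illustrative, not the patch's own):

    long ret;

    ret = wait_for_completion_killable_timeout(&done,
                    ceph_timeout_jiffies(opts->mount_timeout));
    if (ret == 0)
            ret = -ETIMEDOUT;       /* don't hang forever if all OSDs are down */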
-
Ilya Dryomov authored
No need to bifurcate wait now that we've got ceph_timeout_jiffies(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Yan, Zheng <zyan@redhat.com>
-
Ilya Dryomov authored
- return -ETIMEDOUT instead of -EIO in case of timeout - wait_event_interruptible_timeout() returns the time left until timeout, and since it can be almost LONG_MAX we had better assign it to a long Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
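A sketch of the corrected pattern (wq, condition and timeout are placeholders, not names from the patch):

    long ret;   /* can be up to MAX_SCHEDULE_TIMEOUT, so not an int */

    ret = wait_event_interruptible_timeout(wq, condition, timeout);
    if (ret < 0)
            return ret;             /* interrupted by a signal */
    if (ret == 0)
            return -ETIMEDOUT;      /* timed out; not -EIO */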
-
Ilya Dryomov authored
There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it, and so __ceph_open_session(), for example, will busy loop without much chance of being interrupted if none of the ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
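A sketch matching the description above; the helper is small, and the option-parsing cap is an assumed shape:

    /* 0 stored in jiffies means "no timeout", which wait.h spells MAX_SCHEDULE_TIMEOUT */
    static inline unsigned long ceph_timeout_jiffies(unsigned long timeout)
    {
            return timeout ?: MAX_SCHEDULE_TIMEOUT;
    }

    /* option parsing: reject bad input and store capped jiffies (shape assumed) */
    opt->mount_timeout = msecs_to_jiffies(intval * 1000);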
-
Ilya Dryomov authored
Unused since ceph got merged into mainline I guess. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
-
Yan, Zheng authored
setfilelock requests can block for a long time, which can prevent the client from advancing its oldest tid. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Previously we pre-allocated a cap release message for each cap. This wastes lots of memory when there is a large number of caps. This patch makes the code not pre-allocate the cap release messages. Instead, we add the corresponding ceph_cap struct to a list when releasing a cap. Later, when a flush of cap releases is needed, we allocate the cap release messages dynamically. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
When the ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps() accesses the snap realm's cached_context. So we need to take the read lock of snap_rwsem. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
When a snap notification contains no new snapshot, we can avoid sending a FLUSHSNAP message to the MDS. But we still need to create a cap_snap in some cases because it's required by the write path and the page writeback path. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
In most cases where a snap context is needed, we are holding a reference to CEPH_CAP_FILE_WR. So we can set the ceph inode's i_head_snapc when getting the CEPH_CAP_FILE_WR reference, and make the code get the snap context from i_head_snapc. This makes the code simpler. Another benefit of this change is that we can handle snap notifications more elegantly, especially when the snap context is updated while someone else is doing a write. The old queue-cap_snap code may set a cap_snap's context to either the old context or the new snap context, depending on whether i_head_snapc is set. The new queue-cap_snap code always sets a cap_snap's context to the old snap context. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
cached_context in ceph_snap_realm is directly accessed by uninline_data() and get_pool_perm(). This is racy in theory. Both uninline_data() and get_pool_perm() do not modify an existing object; they only create a new object. So we can pass the empty snap context to them. Unlike cached_context in ceph_snap_realm, the empty snap context does not need to be protected. Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Ilya Dryomov authored
This one sneaked in through vfs tree with commit 2b777c9d ("ceph_sync_read: stop poking into iov_iter guts"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
-
Yan, Zheng authored
Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
-