- 31 Dec, 2013 31 commits
-
-
Ilya Dryomov authored
Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well. Note that we have slightly different behavior here than with indep: If the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh. In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep. Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools). (We should do the same for the other tunables, by the way!) Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values). Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code. Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec9 (or earlier)... 2007. Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 86e978036a4ecbac4c875e7c00f6c5bbe37282d3. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
For firstn mode, if we fail to make a valid placement choice, we just continue and return a short result to the caller. For indep mode, however, we need to make the position stable, and return an undefined value on failed placements to avoid shifting later results to the left. Reflects ceph.git commit b1d4dd4eb044875874a1d01c01c7d766db5d0a80. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
This is only present to size the temporary scratch arrays that we put on the stack. Let the caller allocate them as they wish and remove the limitation. Reflects ceph.git commit 1cfe140bf2dab99517589a82a916f4c75b9492d1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 3cef755428761f2481b1dd0e0fbd0464ac483fc5. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit e7d47827f0333c96ad43d257607fb92ed4176550. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Reflects ceph.git commit 43a01c9973c4b83f2eaa98be87429941a227ddde. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Pass the size of the weight vector into crush_do_rule() to ensure that we don't access values past the end. This can happen if the caller misbehaves and passes a weight vector that is smaller than max_devices. Currently the monitor tries to prevent that from happening, but this will gracefully tolerate previous bad osdmaps that got into this state. It's also a bit more defensive. Reflects ceph.git commit 5922e2c2b8335b5e46c9504349c3a55b7434c01a. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
This updates ceph_features.h so that it has all feature bits defined in ceph.git. In the interim since the last update, ceph.git crossed the "32 feature bits" point, and, the addition of the 33rd bit wasn't handled correctly. The work-around is squashed into this commit and reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
In preparation for ceph_features.h update, change all features fields from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this point.) Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Tear down watch request if rbd_dev_device_setup() fails. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more clear what is going on. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
I no longer have direct access to my Inktank e-mail. I still pay attention to rbd, so update its entry in MAINTAINERS accordingly. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
If single-major device number allocation scheme is turned on, instead of reserving 256 minors per device, which imposes a limit of 4096 images mapped at once, reserve 16 minors per device and enable extended devt feature. This results in a theoretical limit of 65536 images mapped at once, and still allows to have more than 15 partititions: partitions starting with 16th are mapped under major 259 (Block Extended Major): $ rbd showmapped id pool image snap device 0 rbd b5 - /dev/rbd0 # no partitions 1 rbd b2 - /dev/rbd1 # 40 partitions 2 rbd b3 - /dev/rbd2 # 2 partitions $ cat /proc/partitions 251 0 1024 rbd0 251 16 1024 rbd1 251 17 0 rbd1p1 251 18 0 rbd1p2 ... 251 30 0 rbd1p14 251 31 0 rbd1p15 259 0 0 rbd1p16 259 1 0 rbd1p17 ... 259 23 0 rbd1p39 259 24 0 rbd1p40 251 32 1024 rbd2 251 33 0 rbd2p1 251 34 0 rbd2p2 (major 251 was assigned dynamically at module load time) Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Li Wang authored
Currently, if one new page allocated into fscache in readpage(), however, with no data read into due to error encountered during reading from OSDs, the slot in fscache is not uncached. This patch fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Reviewed-by: Milosz Tanski <milosz@adfin.com>
-
Li Wang authored
Signed-off-by: Li Wang <liwang@ubuntukylin.com> Reviewed-by: Milosz Tanski <milosz@adfin.com>
-
Guangliang Zhao authored
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Li Wang <li.wang@ubuntykylin.com> Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
-
Yan, Zheng authored
Adds cap check to the page fault handler. The check prevents page fault handler from adding new page to the page cache while Fcb caps are being revoked. This solves Fc revoking hang in multiple clients mmap IO workload. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Ilya Dryomov authored
Currently each rbd device is allocated its own major number, which leads to a hard limit of 230-250 images mapped at once. This commit adds support for a new single-major device number allocation scheme, which is hidden behind a new single_major boolean module parameter and is disabled by default for backwards compatibility reasons. (Old userspace cannot correctly unmap images mapped under single-major scheme and would essentially just unmap a random image, if that.) $ rbd showmapped id pool image snap device 0 rbd b100 - /dev/rbd0 1 rbd b101 - /dev/rbd1 2 rbd b102 - /dev/rbd2 3 rbd b103 - /dev/rbd3 Old scheme (modprobe rbd): $ ls -l /dev/rbd* brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0 brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1 brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1 brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2 brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3 brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2 brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1 brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3 New scheme (modprobe rbd single_major=Y): $ ls -l /dev/rbd* brw-rw---- 1 root disk 253, 0 Dec 10 12:30 /dev/rbd0 brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1 brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1 brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2 brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3 brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2 brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1 brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3 (major 253 was assigned dynamically at module load time) The new limit is 4096 images mapped at once, and it comes from the fact that, as before, 256 minor numbers are reserved for each mapping. (A follow-up commit changes the number of minors reserved and the way we deal with partitions over that number.) If single_major is set to true, two new sysfs interfaces show up: /sys/bus/rbd/{add,remove}_single_major. These are to be used instead of /sys/bus/rbd/{add,remove}, which are disabled for backwards compatibility reasons outlined above. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
In preparation for single-major device number allocation scheme, wire up attribute_group::is_visible() callback for rbd bus. This allows us to make the new single-major attributes conditional. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting rbd whole disk minor numbers. This is a step towards single-major device number allocation scheme, but also a good thing on its own. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Currently rbd ids are allocated using an atomic variable that keeps track of the highest id currently in use and each new id is simply one more than the value of that variable. That's nice and cheap, but it does mean that rbd ids are allowed to grow boundlessly, and, more importantly, it's completely unpredictable. So, in preparation for single-major device number allocation scheme, which is going to establish and rely on a constant mapping between rbd ids and device numbers, switch to ida for rbd id assignments. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Refactor rbd_init() a bit to make it more clear what's going on. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Tweak "loaded" message, so that it looks like [ 30.184235] rbd: loaded instead of [ 38.056564] rbd: loaded rbd (rados block device) Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors are next to each other in modinfo output. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
rbd_device::dev_id is an int, format it as such. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
- 13 Dec, 2013 9 commits
-
-
Josh Durgin authored
With the current full handling, there is a race between osds and clients getting the first map marked full. If the osd wins, it will return -ENOSPC to any writes, but the client may already have writes in flight. This results in the client getting the error and propagating it up the stack. For rbd, the block layer turns this into EIO, which can cause corruption in filesystems above it. To avoid this race, osds are being changed to drop writes that came from clients with an osdmap older than the last osdmap marked full. In order for this to work, clients must resend all writes after they encounter a full -> not full transition in the osdmap. osds will wait for an updated map instead of processing a request from a client with a newer map, so resent writes will not be dropped by the osd unless there is another not full -> full transition. This approach requires both osds and clients to be fixed to avoid the race. Old clients talking to osds with this fix may hang instead of returning EIO and potentially corrupting an fs. New clients talking to old osds have the same behavior as before if they encounter this race. Fixes: http://tracker.ceph.com/issues/6938Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
-
Josh Durgin authored
The PAUSEWR and PAUSERD flags are meant to stop the cluster from processing writes and reads, respectively. The FULL flag is set when the cluster determines that it is out of space, and will no longer process writes. PAUSEWR and PAUSERD are purely client-side settings already implemented in userspace clients. The osd does nothing special with these flags. When the FULL flag is set, however, the osd responds to all writes with -ENOSPC. For cephfs, this makes sense, but for rbd the block layer translates this into EIO. If a cluster goes from full to non-full quickly, a filesystem on top of rbd will not behave well, since some writes succeed while others get EIO. Fix this by blocking any writes when the FULL flag is set in the osd client. This is the same strategy used by userspace, so apply it by default. A follow-on patch makes this configurable. __map_request() is called to re-target osd requests in case the available osds changed. Add a paused field to a ceph_osd_request, and set it whenever an appropriate osd map flag is set. Avoid queueing paused requests in __map_request(), but force them to be resent if they become unpaused. Also subscribe to the next osd map from the monitor if any of these flags are set, so paused requests can be unblocked as soon as possible. Fixes: http://tracker.ceph.com/issues/6079Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
-
Libo Chen authored
Signed-off-by: Libo Chen <clbchenlibo.chen@huawei.com> Signed-off-by: Sage Weil <sage@inktank.com>
-
Li Wang authored
Wake up possible waiters, invoke the call back if any, unregister the request Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
-
Li Wang authored
Clean up if error occurred rather than going through normal process Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
-
majianpeng authored
For readv/preadv sync-operatoin, ceph only do the first iov. Now implement this. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
-
majianpeng authored
For writev/pwritev sync-operatoin, ceph only do the first iov. I divided the write-sync-operation into two functions. One for direct-write, other for none-direct-sync-write. This is because for none-direct-sync-write we can merge iovs to one. But for direct-write, we can't merge iovs. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com>
-
Yan, Zheng authored
Positve dentry and corresponding inode are always accompanied in MDS reply. So no need to keep inode in the cache after dropping all its aliases. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
-
Li Wang authored
If the length of data to be read in readpage() is exactly PAGE_CACHE_SIZE, the original code does not flush d-cache for data consistency after finishing reading. This patches fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
-