- 02 May, 2013 40 commits
-
-
Alex Elder authored
Hold off setting the EXISTS rbd device flag until just before we announce the disk as available for use. There's no point in doing so any earlier than that, and at that point the device truly is fully set up and ready to use. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
This just tweaks a few things in the routines that implement rbd sysfs files. All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/ will represent information whose valid values are known by the time they are accessible. Right now we get the size of the mapped image by a call to get_capacity(). There's no need to do this, because that will return what we last set the capacity to, which is just the size recorded for the mapping. So just show that value instead. We also get this under protection of the header semaphore, in order to provide a precisely correct value. This isn't really necessary; these files are really informational only and it's not necessary to be so careful. Finally, print a special value in case the major device number is not recorded. Right now that won't matter much but soon the parent images won't have devices associated with them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
When a snapshot context update occurs, rbd_update_mapping_size() is called to set the capacity of the disk to record the updated size of the image in case it has changed. There's a bug though. The mapping size is in units of *bytes*. The code that updates the mapping size field is assigning a value that has been scaled down to *sectors*. Fix that. Also, check to see if the size has actually changed, and don't bother updating things (specifically, calling set_capacity()) if it has not. This resolves: http://tracker.ceph.com/issues/4833Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Fairly straightforward refactoring of rbd_dev_probe_update_spec(). The name is changed to rbd_dev_spec_update(). Rearrange it so nothing gets assigned to the spec until all of the names have been successfully acquired. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Rename rbd_dev_probe() to be rbd_dev_image_probe(). Its purpose will eventually be to probe for the existence of a valid rbd image for the rbd device--focusing only on the ceph side and not the Linux device side of initialization. For now the two "sides" are not fully separated, and this function is still the entry point for initializing the full rbd device. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Currently, rbd_dev_destroy() does more than just the inverse of what rbd_dev_create() does. Stop doing that, and move the two extra things it does into the three call sites. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Encapsulate the creation of a snapshot context for rbd in a new function rbd_snap_context_create(). Define rbd wrappers for getting and dropping references to them once they're created. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Change some calls to WARN_ON() so they use rbd_warn() instead, so we get consistent messaging. A few remain but they can probably just go away eventually. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
This commit added fetching if fancy striping parameters: 09186ddb rbd: get and check striping parameters They are almost unused, but the two fields storing the information really belonged in the rbd_image_header structure. This patch moves them there. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Make the names and image id in an rbd_spec be pointers to constant data. This required the use of a local variable to hold the snapshot name in rbd_add_parse_args() to avoid a warning. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Set the rbd spec's snapshot id for an image getting mapped in rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping(). This is the more logical place for that to happen (even though it means we might look up the snapshot by name twice). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
A function called snap_by_name() ought to just look up a snapshot by name. It does that, but then it assigns some stuff to the rbd device structure as well. Change the function to do just the lookup, and have the caller do the assignments that follow. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
If a format 2 image id is found for an image being mapped, but the subsequent probe of the image fails, rbd_dev_probe() quits without freeing the image id. Fix that. Also drop a redundant hunk of code in rbd_dev_image_id(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Currently, rbd_dev_probe() assumes that any error returned by rbd_dev_image_id() is most likely -ENOENT, and responds by calling the format 1 probe routine, rbd_dev_v1_probe(). Then, at the top of rbd_dev_v1_probe(), an empty string is allocated for the image id. This is sort of unbalanced. Fix this by having rbd_dev_image_id() look for -ENOENT from its "get_id" method call. If that is seen, have it allocate the empty string there rather than depending on rbd_dev_v1_probe() to do it. Given that this is effectively defining the format of the image, set rbd_dev->image_format inside rbd_dev_image_id() rather than in the format-specific probe routines. Also drop a redundant hunk of code in rbd_dev_image_id(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
I found during some failure injection testing that the call to rbd_free_disk() in the error path of rbd_dev_probe_finish() was dropping an extra reference to the disk queue. The problem occurred when put_disk tried to drop a reference to the disk's queue. A call to blk_cleanup_queue() just prior to that will have also dropped a reference to the queue. The problem is that the reference dropped by put_disk() is assumed to have been taken by add_disk(). Our code has error paths that can occur after the disk and its queue are initialized, but before the call to add_disk(), and in those paths we won't have that extra reference. The fix is easy though. In rbd_free_disk() we're already checking the disk's GENHD_FL_UP flag. That flag is an indication that add_disk() has been called, so just call blk_cleanup_queue() conditional on that flag being set. This resolves: http://tracker.ceph.com/issues/4800Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Now that rbd_obj_method_sync() returns the number of bytes returned by the method call, that value should be used by callers to ensure we don't overrun the valid portion of the buffer. Fix the two spots that remained that weren't doing that, rbd_dev_image_name() and rbd_dev_v2_snap_name(). Rearrange the error path slightly in rbd_dev_v2_snap_name(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
When the snapshot context for an rbd device gets updated (or the initial one is recorded) a a list of snapshot structures is created to represent them, one entry per snapshot. Each entry includes a dynamically-allocated copy of the snapshot name. Currently the name is allocated in rbd_snap_create(), as a duplicate of the passed-in name. For format 1 images, the snapshot name provided is just a pointer to an existing name. But for format 2 images, the passed-in name is already dynamically allocated, and in the the process of duplicating it here we are leaking the passed-in name. Fix this by dynamically allocating the name for format 1 snapshots also, and then stop allocating a duplicate in rbd_snap_create(). Change rbd_dev_v1_snap_info() so none of its parameters is side-effected unless it's going to return success. This is part of: http://tracker.ceph.com/issues/4803Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Rename __rbd_add_snap_dev() to be rbd_snap_create(). We no longer have devices for non-mapped snapshots, and we're not actually "adding" it to the list in this function, just creating it. Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons similar to the above. Stop having this function delete the snapshot from its list (to be symmetrical with its create counterpart) and do that in the caller instead. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Change rbd_dev_v2_snap_info() so it only ever sets values of the size and features parameters if looking up the snapshot name was successful. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Only one of the two callers of _rbd_dev_v2_snap_size() needs the order value returned. So make that an optional argument--a null pointer if the caller doesn't need it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
When an rbd image is initially mapped, its snapshot context is collected, and then a list of snapshot entries representing the snapshots in that context is created. The list is created using rbd_dev_snaps_update(). (This function also supports updating an existing snapshot list based on a new snapshot context.) If an error occurs, updating the list is aborted, and the list is currently left as-is, in an inconsistent state. At that point, there may be a partially-constructed list, but the calling functions (rbd_dev_probe_finish() from rbd_dev_probe() from rbd_add()) never clean them up. So this constitutes a leak. A snapshot list that is inconsistent with the current snapshot context is of no use, and might even be actively bad. So rather than just having the caller clean it up, have rbd_dev_snaps_update() just clear out the entire snapshot list in the event an error occurs. The other place rbd_dev_snaps_update() is used is when a refresh is triggered, either because of a watch callback or via a write to the /sys/bus/rbd/devices/<id>/refresh interface. An error while updating the snapshots has no substantive effect in either of those cases, but one of them issues a warning. Move that warning to the common rbd_dev_refresh() function so it gets issued regardless of how it got initiated. This is part of: http://tracker.ceph.com/issues/4803Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
When an rbd image gets mapped a device entry gets created for it under /sys/bus/rbd/devices/<id>/. Inside that directory there are sysfs files that contain information about the image: its size, feature bits, major device number, and so on. Additionally, if that image has any snapshots, a device entry gets created for each of those as a "child" of the mapped device. Each of these is a subdirectory of the mapped device, and each directory contains a few files with information about the snapshot (its snapshot id, size, and feature mask). There is no clear benefit to having those device entries for the snapshots. The information provided via sysfs of of little real value--and all of it is available via rbd CLI commands. If we still wanted to see the kernel's view of this information it could be done much more simply by including it in a single sysfs file for the mapped image. But there *is* a clear cost to supporting them. Every time a snapshot context changes, these entries need to be updated (deleted snapshots removed, new snapshots created). The rbd driver is notified of changes to the snapshot context via callbacks from an osd, and care must be taken to coordinate removal of snapshot data structures with the possibility of one these notifications occurring. Things would be considerably simpler if we just didn't have to maintain device entries for the snapshots. So get rid of them. The ability to map a snapshot of an rbd image will remain; the only thing lost will be the ability to query these sysfs directories for information about snapshots of mapped images. This resolves: http://tracker.ceph.com/issues/4796Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
A WATCH op includes an object version. The version that's supplied is incorrectly byte-swapped osd_req_op_watch_init() where it's first assigned (it's been this way since that code was first added). The result is that the version sent to the osd is wrong, because that value gets byte-swapped again in osd_req_encode_op(). This is the source of a sparse warning related to improper byte order in the assignment. The approach of using the version to avoid a race is deprecated (see http://tracker.ceph.com/issues/3871), and the watch parameter is no longer even examined by the osd. So fix the assignment in osd_req_op_watch_init() so it no longer does the byte swap. This resolves: http://tracker.ceph.com/issues/3847Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Now that we have most everything in place to support layered rbd images, enable support for them in the kernel client. Issue a warning to the log that the support is considered experimental whenever a format 2 layered image is mapped. Note that we also have to claim to support the STRIPINGV2 feature, due to a mistake in the way the rbd CLI set up those flags. This feature can work if it has the right parameters, and safeguards have been put in place to reject those images that do not have compatible parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
If an rbd format 2 image indicates it supports the STRIPINGV2 feature we need to find out its stripe unit and stripe count in order to know whether we can use it. We don't yet support fancy striping fully, but if the default parameters are used the behavior is indistinguishible from non-fancy striping. This is necessary because some images require the STRIPINGV2 feature even if they use the default parameters. (Which is to say the feature bit was erroneously set even if the feature was not used.) This resolves: http://tracker.ceph.com/issues/4709Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Callers of rbd_obj_method_sync() don't know how many bytes of data got returned by the class method call. As a result, they have been assuming enough got returned to decode whatever was expected. This isn't safe. We know how many bytes got transferred, so have rbd_obj_method_sync() return that amount (rather than just 0) if the call is successful. Change all callers to use this return value to ensure decoding of the results is done safely. On the other hand, most callers of rbd_obj_method_sync() only indicate success or failure, so all of *their* callers can simply test for non-zero result. This resolves: http://tracker.ceph.com/issues/4773Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Make the inbound and outbound data parameters have void rather than character type for rbd_obj_method_sync(). This makes it more clear they don't expect typed data, and eliminates the need for some silly type casts. One more unrelated change: define the features buffer used in _rbd_dev_v2_snap_features() to be a packed data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Make the buf parameter into which the data is to be read have type void pointer. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
A ceph timespec contains 32-bit unsigned values for its seconds and nanoseconds components. For a standard timespec, both fields are signed, and the seconds field is almost surely 64 bits. Add some explicit casts so the fact that this conversion is taking place is obvious. Also trip a bug if we ever try to put out of range (negative or too big) values into a ceph timespec. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Flesh out the limits defined in <linux/ceph/decode.h> to include the maximum and minimum values for signed type S8, S16, S32, and S64. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
A clone image has a defined overlap point with its parent image. That is the byte offset beyond which the parent image has no defined data to back the clone, and anything thereafter can be viewed as being zero-filled by the clone image. This is needed because a clone image can be resized. If it gets resized larger than the snapshot it is based on, the overlap defines the original size. If the clone gets resized downward below the original size the new clone size defines the overlap. If the clone is subsequently resized to be larger, the overlap won't be increased because the previous resize invalidated any parent data beyond that point. This resolves: http://tracker.ceph.com/issues/4724Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
This implements the main copyup functionality for layered writes. Here we add a copyup_pages field to the object request, which is used only for copyup requests to keep track of the page array containing data read from the parent image. A copyup request is currently the only request rbd has that requires two osd operations. Because of this we handle copyup specially. All image object requests get an osd request allocated when they are created. For a write request, if a copyup is required, the osd request originally allocated is released, and a new one (with room for two osd ops) is allocated to replace it. A new function rbd_osd_req_create_copyup() allocates an osd request suitable for a copyup request. The first op is then filled with a copyup object class method call, supplying the array of pages containing data read from the parent. The second op is filled in with the original write request. The original request otherwise remains intact, and it describes the original write request (found in the second osd op). The presence of the copyup op is sort of implicit; a non-null copyup_pages field could be used to distinguish between a "normal" write request and a request containing both a copyup call and a write. This resolves: http://tracker.ceph.com/issues/3419Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
As a step toward implementing layered writes, implement reading the data for a target object from the parent image for a write request whose target object is known to not exist. Add a copyup_pages field to an image request to track the page array used (only) for such a request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Laurent Barbe authored
If rbd disk is open and rbd resize is done, new size is not visible by filesystem. Like is done in virtio-blk and dm driver, revalidate_disk() permits to update the bd_inode size. Signed-off-by: Laurent Barbe <laurent@ksperis.com> Reviewed-by: Alex Elder <elder@inktank.com>
-
Alex Elder authored
This patch adds the ability to build an image request whose data will be written from or read into memory described by a page array. (Previously only bio lists were supported.) Originally this was going to define a new function for this purpose but it was largely identical to the rbd_img_request_fill_bio(). So instead, rbd_img_request_fill_bio() has been generalized to handle both types of image request. For the moment we still only fill image requests with bio data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Define a new function zero_pages() that zeroes a range of memory defined by a page array, along the lines of zero_bio_chain(). It saves and the irq flags like bvec_kmap_irq() does, though I'm not sure at this point that it's necessary. Update rbd_img_obj_request_read_callback() to use the new function if the object request contains page rather than bio data. For the moment, only bio data is used for osd READ ops. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Object requests that are part of an image request are subject to some additional handling. Define rbd_img_obj_request_submit() to encapsulate that, and use it when initially submitting an image object request, and when re-submitting it during callback of an object existence check. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Separate rbd_osd_req_format() into two functions, one for read requests and the other for write requests. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
Add the ability to provide an array of pages as outbound request data for object class method calls. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-
Alex Elder authored
This patch makes four small changes in the ceph messenger. While getting copyup functionality working I found two bugs in the messenger. Existing paths through the code did not trigger these problems, but they're fixed here: - In ceph_msg_data_pagelist_cursor_init(), the cursor's last_piece field was being checked against the length supplied. This was OK until this commit: ccba6d98 libceph: implement multiple data items in a message That commit changed the cursor init routines to allow lengths to be supplied that exceeded the size of the current data item. Because of this, we have to use the assigned cursor resid field rather than the provided length in determining whether the cursor points to the last piece of a data item. - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously catching attempts to add page data to a message if the message already had data assigned to it. That was OK until that same commit, at which point it was fine for messages to have multiple data items. It slipped through because that BUG_ON() call was present twice in that function. (You can never be too careful.) In addition two other minor things are changed: - In ceph_msg_data_cursor_init(), the local variable "data" was getting assigned twice. - In ceph_msg_data_advance(), it was assumed that the type-specific advance routine would set new_piece to true after it advanced past the last piece. That may have been fine, but since we check for that case we might as well set it explicitly in ceph_msg_data_advance(). This resolves: http://tracker.ceph.com/issues/4762Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
-