1. 28 Feb, 2010 11 commits
    • exofs: groups support · 50a76fd3
      Boaz Harrosh authored
      * _calc_stripe_info() changes to accommodate grouping
        calculations and returns additional information.
      
      * old _prepare_pages() becomes _prepare_one_group()
        which stores pages belonging to one device group.
      
      * New _prepare_for_striping() iterates over all groups, calling
        _prepare_one_group() for each.
      
      * Enable mounting of groups data_maps (group_width != 0)
      
      [QUESTION]
      Which is faster, A or B?
      A.	x += stride;
      	x = x % width + first_x;

      B.	x += stride;
      	if (x > last_x)
      		x = first_x;
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
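
The two variants from the [QUESTION] above can be put side by side in a small user-space sketch. This is not the exofs code: next_mod() and next_cmp() are hypothetical names, the window is assumed to be [first_x, first_x + width), variant A is written relative to first_x so that the result stays inside the window, and variant B is taken to wrap once x moves past last_x.

```c
#include <assert.h>

/* Variant A: modulo wrap. x is taken relative to first_x so the result
 * always stays inside [first_x, first_x + width). */
static unsigned next_mod(unsigned x, unsigned stride,
			 unsigned first_x, unsigned width)
{
	assert(width > 0 && x >= first_x);
	return (x - first_x + stride) % width + first_x;
}

/* Variant B: compare and reset to the start of the window. */
static unsigned next_cmp(unsigned x, unsigned stride,
			 unsigned first_x, unsigned width)
{
	unsigned last_x = first_x + width - 1;

	x += stride;
	if (x > last_x)
		x = first_x;
	return x;
}
```

With a stride of 1 the two variants visit the same sequence. With larger strides they can differ, because A keeps the remainder after wrapping while B always restarts at first_x. B trades the division in A for a branch; which one is faster depends on the CPU and compiler, so it would have to be measured.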
    • exofs: Prepare for groups · b367e78b
      Boaz Harrosh authored
      * Rename _offset_dev_unit_off() to _calc_stripe_info()
        and receive a struct for the output params
      
      * In _prepare_for_striping we only need to call
        _calc_stripe_info() once. The other components
        are easy to calculate from that. This code
        was inspired by what's done in truncate.
      
      * Some code shifts that make sense now but will make
        more sense when group support is added.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
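
As an illustration of returning the stripe calculation through a struct of output params, here is a minimal user-space sketch. It assumes plain RAID0 arithmetic with no groups and no mirrors; struct stripe_info, its field names, and calc_stripe_info() are hypothetical and are not taken from the exofs sources.

```c
#include <stdint.h>

/* Hypothetical output struct; field names are illustrative only. */
struct stripe_info {
	uint64_t obj_offset;	/* byte offset inside the per-device object */
	uint32_t dev;		/* index of the device within the stripe */
	uint32_t unit_off;	/* byte offset inside the current stripe unit */
};

/* Map a file offset to (device, object offset) for simple striping. */
static void calc_stripe_info(uint64_t file_offset, uint32_t stripe_unit,
			     uint32_t group_width, struct stripe_info *si)
{
	uint64_t unit_no = file_offset / stripe_unit;	/* global stripe-unit number */
	uint64_t stripe_no = unit_no / group_width;	/* full stripes before this one */

	si->unit_off = (uint32_t)(file_offset % stripe_unit);
	si->dev = (uint32_t)(unit_no % group_width);
	si->obj_offset = stripe_no * stripe_unit + si->unit_off;
}
```

Once this is computed for the first byte of an IO, the values for the following stripe units follow by stepping dev and obj_offset, which is why a single call is enough.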
    • exofs: Error recovery if object is missing from storage · 96391e2b
      Boaz Harrosh authored
      If an object is referenced by a directory but does not
      exist on a target, it is a very serious corruption. It can
      come about in two ways:
      1. A power failure, with a very slim chance of it happening,
        because the directory update is always submitted long after
        object creation. But if a directory is written to one device
        and the object creation to another, it might theoretically
        happen.
      2. It only ever happened to me while developing, with BUGs
        causing file corruption. Crashes could also cause it but
        they are more like case 1.
      
      Either way the object does not exist, so data is surely lost.
      If there is a mix-up in the obj-id or data-map, then lost objects
      can be salvaged by off-line fsck. The only recoverable information
      is the directory name. By letting it appear as a regular empty file,
      with date==0 (1970 Jan 1st) and ownership by root, we enable recovery
      of the only useful information, and also enable deletion or over-write.
      I can't see how this can hurt.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: convert io_state to use pages array instead of bio at input · 86093aaf
      Boaz Harrosh authored
      * inode.c operations are full-pages based, and not actually
        true scatter-gather
      * Lets us use more pages at once, up to 512 (from 249) on 64-bit
      * Brings us much closer to being able to use exofs's io_state engine
        from the objlayout driver. (Once I decide where to put the common code)
      
      After the RAID0 patch the outer (input) bio was never used as a bio, but
      was simply a page carrier into the raid engine. Even in the simple
      mirror/single-dev arrangement the pages info was copied into a second bio.
      It is now easier to just pass a pages array into the io_state and prepare
      the bio(s) once.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: RAID0 support · 5d952b83
      Boaz Harrosh authored
      We now support striping over mirror devices, including a
      variable-sized stripe_unit.
      
      Some limits:
      * stripe_unit must be a multiple of PAGE_SIZE
      * stripe_unit * stripe_count must fit in 32 bits (up to 4 GB)
      
      Tested RAID0 over mirrors, RAID0 only, mirrors only. All check.
      
      Design notes:
      * I'm not using a vectored raid-engine mechanism yet. Following the
        pnfs-objects-layout data-map structure, "Mirror" is just a special
        case of "group_width" == 1, and RAID0 is a special case of
        "Mirrors" == 1. The performance loss of the general case compared
        with a dedicated special-case optimization is negligible, also
        considering the extra code size.
      
      * In general I added a prepare_stripes() stage that divides the
        to-be-io pages among the participating devices. The previous
        exofs_ios_write/read now becomes _write/read_mirrors, and a new
        write/read upper layer loops over all devices calling
        _write/read_mirrors. Effectively the prepare_stripes stage is the
        whole secret.
        Also, truncate needs fixing to accommodate striping.
      
      * In a RAID0 arrangement, in a regular usage scenario, if all inode
        layouts start at the same device, small files fill up the first
        device and the later devices stay empty; the farther the device,
        the emptier it is.

        To fix that, each inode will start at a different stripe_unit,
        according to its obj_id modulo number-of-stripe-units. It will
        then span all stripe-units in the same incrementing order,
        wrapping back to the beginning of the device table. We call it
        a stripe-units moving window.

        Special consideration was taken to keep all devices in a mirror
        arrangement identical. So a broken osd-device could just be cloned
        from one of the mirrors and no FS scrubbing is needed. (We do that
        by rotating a stripe-unit at a time and not a single device at a time.)
      
      TODO:
       We no longer verify object_length == inode->i_size in exofs_iget.
       (since i_size is striped across multiple objects now).
       I should introduce a multiple-device attribute reading, and use
       it in exofs_iget.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
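
The stripe-units moving window described in the design notes above can be sketched in user-space C. This is only an illustration under stated assumptions: the device table is assumed to be laid out as stripe_count consecutive sets of mirrors devices each, mirrors here counts all copies (1 means no mirroring), and first_dev_of_unit() is a hypothetical name that does not appear in the exofs sources.

```c
#include <assert.h>
#include <stdint.h>

/* Return the device-table index of the first mirror that holds stripe
 * unit number unit_no of object obj_id. The window start depends on
 * obj_id, and rotation happens a whole stripe-unit slot (all of its
 * mirrors) at a time, so the mirrors of one slot stay identical. */
static uint32_t first_dev_of_unit(uint64_t obj_id, uint64_t unit_no,
				  uint32_t stripe_count, uint32_t mirrors)
{
	uint64_t start, slot;

	assert(stripe_count > 0 && mirrors > 0);
	start = obj_id % stripe_count;			/* per-inode starting slot */
	slot = (start + unit_no % stripe_count) % stripe_count;
	return (uint32_t)slot * mirrors;		/* first device of that slot */
}
```

For example, with stripe_count = 3 and mirrors = 2, an object with obj_id 4 starts at slot 1 (devices 2 and 3), continues on slot 2 (devices 4 and 5), and then wraps to slot 0 (devices 0 and 1).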
    • exofs: Define on-disk per-inode optional layout attribute · d9c740d2
      Boaz Harrosh authored
      * Layouts describe the way a file is spread over multiple devices.
        The layout information is stored in the object attribute introduced
        in this patch.
      
      * There can be multiple generating functions for the layout.
        Currently defined:
          - No attribute present - use the moving window below on the
            global device table, across all devices.
            (This is the only one currently used in exofs)
          - An obj_id generated moving window - the obj_id is a randomizing
            factor in the otherwise global map layout.
          - An explicit layout stored, including a data_map and a device
            index list.
          - More might be defined in future ...
      
      * There are two attributes defined of the same structure:
        A-data-files-layout - This layout is used by data-files. If present
                              at a directory, all files of that directory will
                              be created with this layout.
        A-meta-data-layout - This layout is used by a directory and other
                             meta-data information. Also inherited at creation
                             of subdirectories.
      
      * At creation time inodes are created with the layout specified above.
        A usermode utility may change the creation layout on a given directory
        or file. In the case of directories, the change will also apply to
        newly created files/subdirectories that are children of that directory.
        In the simple unaltered case of a newly created exofs, no layout
        attributes are present, and all layouts adhere to the layout specified
        at the device-table.
      
      * In case a future file system is loaded in an old exofs driver:
        at iget(), the generating_function is inspected and, if it is not
        supported, an IO error is returned to the application and the inode
        is not loaded, so as not to damage any data.
        Note: After this patch we do not yet support any type of layout;
              only the RAID0 patch that enables striping at the super-block
              level will add support for the RAID0 layouts above. This way we
              are past and future compatible and fully bisectable.
      
      * Access to the device table is done by an accessor since
        it will change according to above information.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
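
The forward-compatibility check described above (inspect the generating function at iget() and fail with an IO error if it is unknown) could look roughly like the following user-space sketch. The enum, its values, and check_layout_supported() are hypothetical and are not the on-disk definitions from this patch; the only behavior taken from the commit message is that an unsupported generating function results in an IO error and the inode is not loaded.

```c
#include <errno.h>

/* Hypothetical generating-function codes, for illustration only. */
enum layout_gen_func {
	LAYOUT_NONE = 0,		/* no attribute: global device table */
	LAYOUT_OBJ_ID_WINDOW = 1,	/* obj_id generated moving window */
	LAYOUT_EXPLICIT = 2,		/* explicit data_map + device index list */
};

/* Return 0 if this (sketched) driver understands the layout, or -EIO so
 * the caller can refuse to load the inode rather than risk damaging data. */
static int check_layout_supported(unsigned int gen_func)
{
	switch (gen_func) {
	case LAYOUT_NONE:
		return 0;
	default:
		/* Defined by a newer file system, or simply unknown. */
		return -EIO;
	}
}
```

Rejecting everything that is not explicitly listed keeps an old driver from misreading a layout written by a newer one.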
    • exofs: unindent exofs_sbi_read · 46f4d973
      Boaz Harrosh authored
      The original idea was that a mirror read can be sub-divided
      across multiple devices. But this has very little gain and only
      at very large IOs, so it's not going to be implemented soon.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: Move layout related members to a layout structure · 45d3abcb
      Boaz Harrosh authored
      * Abstract away those members in exofs_sb_info that are related/needed
        by a layout into a new exofs_layout structure. Embed it in exofs_sb_info.
      
      * exofs_io_state now receives/keeps a pointer to an exofs_layout. No need
        for an exofs_sb_info pointer; everything we need is in exofs_layout.
      
      * Change any usage of above exofs_sb_info members to their new name.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: Recover in the case of read-passed-end-of-file · 22ddc556
      Boaz Harrosh authored
      In check_io, implement the case of reading past end of
      file, by clearing the pages and recovering with no error. In
      a raid arrangement this can become a legitimate situation
      in case of holes in the file.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
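
The recovery idea above (treat a read past end of file as success and hand back zeroed data) can be illustrated with a plain buffer instead of kernel pages. This is a user-space sketch only; recover_short_read() is a hypothetical helper and is not the check_io code.

```c
#include <stddef.h>
#include <string.h>

/* A device returned only `got` of the `requested` bytes because the read
 * ran past the end of its object (for example a hole in a striped file).
 * Zero-fill the part that was not read and report success.
 * Returns -1 only for the nonsensical case of got > requested. */
static int recover_short_read(unsigned char *buf, size_t requested, size_t got)
{
	if (got > requested)
		return -1;
	memset(buf + got, 0, requested - got);
	return 0;
}
```

The caller then sees a full-length buffer whose tail reads as zeros, which is what a hole in a file is expected to look like.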
    • exofs: Micro-optimize exofs_i_info · 518f167a
      Boaz Harrosh authored
      Optimize the exofs_i_info struct usage by moving the embedded
      vfs_inode to be first. A compiler might optimize away an "add"
      operation with constant zero, which it cannot do with other constants.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
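
The effect of placing the embedded inode first can be seen with offsetof in a small user-space sketch. The structs below are stand-ins and not the real exofs_i_info or struct inode, and CONTAINER_OF is a simplified local version of the kernel's container_of() macro.

```c
#include <stddef.h>

struct vfs_inode_stub { long i_ino; };	/* stand-in for struct inode */

struct i_info_first {			/* embedded inode is the first member */
	struct vfs_inode_stub vfs_inode;
	unsigned long flags;
};

struct i_info_last {			/* embedded inode after another member */
	unsigned long flags;
	struct vfs_inode_stub vfs_inode;
};

/* Simplified container_of: step back from a member to its container. */
#define CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The offset is 0 here, so the subtraction can be folded away entirely. */
static struct i_info_first *to_info_first(struct vfs_inode_stub *inode)
{
	return CONTAINER_OF(inode, struct i_info_first, vfs_inode);
}

/* The offset is non-zero here, so a pointer adjustment is needed. */
static struct i_info_last *to_info_last(struct vfs_inode_stub *inode)
{
	return CONTAINER_OF(inode, struct i_info_last, vfs_inode);
}
```

Both helpers return the containing struct; only the first-member layout makes the conversion a pure pointer cast.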
    • exofs: debug print even less · 34ce4e7c
      Boaz Harrosh authored
      * The last debug trimming left in some stupid prints; remove them.
        Fix up some other prints.
      * Shift printing from inode.c to ios.c
      * Add a couple of prints when memory allocation fails.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
  2. 29 Jan, 2010 13 commits
  3. 28 Jan, 2010 16 commits