    nvme-fabrics: reject I/O to offline device · 8c4dfea9
    Victor Gladkov authored
    Commands get stuck while the host NVMe-oF controller is in the reconnect
    state.  The controller enters the reconnect state when it loses its
    connection with the target.  It tries to reconnect every 10 seconds
    (default) until a reconnect succeeds or the reconnect timeout is reached.
    The default reconnect timeout is 10 minutes.
    
    Applications expect commands to complete with success or error within a
    certain timeout (30 seconds by default).  The NVMe host enforces that
    timeout while it is connected, but during reconnect the timeout is not
    enforced and commands may get stuck for a long period or even forever.
    
    To avoid this long delay, introduce a new "fast_io_fail_tmo" session
    parameter.  The timeout is measured in seconds from the moment the
    controller starts reconnecting, and any command beyond that timeout is
    rejected.  The new parameter value may be passed during 'connect'.
    The default value of -1 means no timeout (the same as the current
    behavior).
    Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
    Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Chao Leng <lengchao@huawei.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
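
The fast-fail rule described in the commit message can be illustrated with a small userspace model. This is only a sketch of the decision, not the code in fabrics.c: the struct, the enum, and the function name `should_fail_fast` are hypothetical and chosen for illustration, and the time source is a plain seconds counter supplied by the caller. It assumes the timeout is counted from the moment the controller starts reconnecting.

```c
#include <stdbool.h>

/* Hypothetical model of a controller; not the kernel's data structures. */
enum ctrl_state { CTRL_LIVE, CTRL_CONNECTING };

struct ctrl_model {
    enum ctrl_state state;
    int fast_io_fail_tmo;   /* seconds; -1 means no fast-fail timeout */
    long reconnect_start;   /* time in seconds when reconnecting began */
};

/*
 * Return true if a command seen at time `now` (seconds) should be
 * rejected instead of waiting for the reconnect to finish.
 */
static bool should_fail_fast(const struct ctrl_model *c, long now)
{
    /* While connected, the regular I/O timeout applies instead. */
    if (c->state != CTRL_CONNECTING)
        return false;

    /* Default of -1: keep the old behavior and never fail fast. */
    if (c->fast_io_fail_tmo < 0)
        return false;

    /* Reject once the fast-fail window since reconnect start has elapsed. */
    return now - c->reconnect_start >= c->fast_io_fail_tmo;
}
```

With `fast_io_fail_tmo` set to 5 and reconnecting started at t=100, a command seen at t=102 keeps waiting while one seen at t=105 is rejected; with the default of -1 nothing is rejected during reconnect.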
fabrics.c 31.1 KB