    net: ipv4: Consider failed nexthops in multipath routes · a6db4494
    David Ahern authored
    Multipath route lookups should consider knowledge about next hops and not
    select a hop that is known to be failed.
    
    Example:
    
                         [h2]                   [h3]   15.0.0.5
                          |                      |
                         3|                     3|
                        [SP1]                  [SP2]--+
                         1  2                   1     2
                         |  |     /-------------+     |
                         |   \   /                    |
                         |     X                      |
                         |    / \                     |
                         |   /   \---------------\    |
                         1  2                     1   2
             12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                         4                         4
                          \                       /
                            \                    /
                             \                  /
                              -------|   |-----/
                                     1   2
                                    [TOR3]
                                      3|
                                       |
                                      [h1]  12.0.0.1
    
    host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
    
        root@h1:~# ip ro ls
        ...
        12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
        15.0.0.0/16
                nexthop via 12.0.0.2  dev swp1 weight 1
                nexthop via 12.0.0.3  dev swp1 weight 1
        ...
    
    If the link between tor3 and tor1 is down and the link between tor1
    and tor2 is also down, then tor1 is effectively cut off from h1. Yet route
    lookups in h1 alternate between the 2 nexthops: ping 15.0.0.5 gets one and
    ssh 15.0.0.5 gets the other. Connections that attempt to use the
    12.0.0.2 nexthop fail since that neighbor is not reachable:
    
        root@h1:~# ip neigh show
        ...
        12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
        12.0.0.2 dev swp1  FAILED
        ...
    
    The failed path can be avoided by considering known neighbor information
    when selecting next hops. If the neighbor lookup fails we have no
    knowledge about the nexthop, so give it a shot. If there is an entry
    then only select the nexthop if the state is sane. This is similar to
    what fib_detect_death does.
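
The selection rule described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel implementation: the `neigh_state` enum, `struct nexthop`, `nexthop_usable()` and `select_nexthop()` are hypothetical names, the states are a simplified stand-in for the kernel's NUD_* flags, and falling back to the original candidate when no hop looks usable is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified neighbor states (the kernel uses NUD_* bit flags). */
enum neigh_state {
    NEIGH_NONE,        /* no neighbor entry: nothing is known about the hop */
    NEIGH_REACHABLE,
    NEIGH_STALE,
    NEIGH_INCOMPLETE,
    NEIGH_FAILED,
};

/* Hypothetical nexthop record: gateway address plus what we know about it. */
struct nexthop {
    const char *gateway;
    enum neigh_state state;
};

/*
 * A hop is usable if neighbor checks are disabled, if there is no neighbor
 * entry (no knowledge, so give it a shot), or if the entry is in a state
 * this sketch treats as sane.
 */
static bool nexthop_usable(const struct nexthop *nh, bool use_neigh)
{
    if (!use_neigh)
        return true;

    switch (nh->state) {
    case NEIGH_NONE:
    case NEIGH_REACHABLE:
    case NEIGH_STALE:
        return true;
    default:
        return false;
    }
}

/*
 * Starting from the candidate index `start` (e.g. chosen by a flow hash),
 * return the index of the first usable hop. If none is usable, fall back
 * to `start` (an assumption of this sketch). Returns -1 for an empty list.
 */
static int select_nexthop(const struct nexthop *nhs, size_t n,
                          size_t start, bool use_neigh)
{
    if (n == 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;

        if (nexthop_usable(&nhs[idx], use_neigh))
            return (int)idx;
    }
    return (int)(start % n);
}
```

With the two hops from the example (12.0.0.2 FAILED, 12.0.0.3 REACHABLE), a lookup whose candidate is 12.0.0.2 would be steered to 12.0.0.3 when the `use_neigh` argument is true, and would keep 12.0.0.2 when it is false, which matches the old behavior.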
    
    To maintain backward compatibility, use of the neighbor information is
    controlled by a new sysctl, fib_multipath_use_neigh.
    
    Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
    Reviewed-by: Julian Anastasov <ja@ssi.bg>
    Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl_net_ipv4.c 25 KB