1. 03 Jul, 2019 40 commits
    • bpf: fix unconnected udp hooks · 591c18e3
      Daniel Borkmann authored
      commit 983695fa upstream.
      
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently to the application,
      this needs reverse translation on the recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
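
      The reverse-NAT bookkeeping described above can be sketched in plain
      userspace C. This is only a model under stated assumptions: the struct
      and function names are hypothetical, and a small round-robin table
      stands in for the BPF LRU map keyed by socket cookie.

```c
#include <stdint.h>

/* Userspace model only: a real program would keep this state in a BPF LRU
 * hash map. All names here are hypothetical. */
struct addr4 {
    uint32_t ip;    /* IPv4 address, host byte order for readability */
    uint16_t port;
};

struct rev_nat_entry {
    uint64_t cookie;        /* socket cookie of the sending socket */
    struct addr4 backend;   /* where sendmsg was redirected to */
    struct addr4 frontend;  /* service address the application used */
    int used;
};

#define REV_NAT_SLOTS 4
static struct rev_nat_entry rev_nat[REV_NAT_SLOTS];
static unsigned int rev_nat_next;

/* sendmsg side: record the translation, reusing slots round-robin. */
static void rev_nat_record(uint64_t cookie, struct addr4 backend,
                           struct addr4 frontend)
{
    struct rev_nat_entry *e = &rev_nat[rev_nat_next++ % REV_NAT_SLOTS];

    e->cookie = cookie;
    e->backend = backend;
    e->frontend = frontend;
    e->used = 1;
}

/* recvmsg side: rewrite *src to the service address if this socket has a
 * recorded translation for that backend. Returns 1 if rewritten. */
static int rev_nat_apply(uint64_t cookie, struct addr4 *src)
{
    for (int i = 0; i < REV_NAT_SLOTS; i++) {
        const struct rev_nat_entry *e = &rev_nat[i];

        if (e->used && e->cookie == cookie &&
            e->backend.ip == src->ip && e->backend.port == src->port) {
            *src = e->frontend;
            return 1;
        }
    }
    return 0;
}
```

      With such state in place, a reply arriving from 8.8.8.8:53 on the
      recorded socket would be reported to the application as coming from
      the service IP, which is what the msg_name checks in nslookup and dig
      expect.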
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: fix nested bpf tracepoints with per-cpu data · 2a9fedc1
      Matt Mullins authored
      commit 9594dc3c upstream.
      
      BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
      they do not increment bpf_prog_active while executing.
      
      This enables three levels of nesting, to support
        - a kprobe or raw tp or perf event,
        - another one of the above that irq context happens to call, and
        - another one in nmi context
      (at most one of which may be a kprobe or perf event).
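
      One way to picture the three nesting levels is a small stack of
      scratch buffers indexed by a nesting counter. The sketch below is a
      single-CPU userspace model with hypothetical names, not the kernel
      code; in the kernel both the buffers and the counter would be per-CPU.

```c
#include <stddef.h>

#define MAX_NEST 3  /* three levels of nesting, as described above */

struct scratch {
    unsigned long regs[8];
};

/* Stand-ins for per-CPU variables; only one "CPU" is modelled. */
static struct scratch scratch_bufs[MAX_NEST];
static int nest_level;

/* Hand out a buffer private to the current nesting level, or NULL when
 * the program is already nested MAX_NEST deep. */
static struct scratch *scratch_get(void)
{
    if (nest_level >= MAX_NEST)
        return NULL;
    return &scratch_bufs[nest_level++];
}

/* Release the most recently acquired buffer. Call only after a
 * successful scratch_get(). */
static void scratch_put(void)
{
    if (nest_level > 0)
        nest_level--;
}
```

      Each nested invocation gets its own buffer, so an inner program can no
      longer overwrite the data of the one it interrupted.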
      
      Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data")
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: lpm_trie: check left child of last leftmost node for NULL · 7cec8976
      Jonathan Lemon authored
      commit da2577fd upstream.
      
      If the leftmost parent node of the tree does not have a child
      on the left side, then trie_get_next_key (and bpftool map dump) will
      not look at the child on the right.  This leads to the traversal
      missing elements.
      
      Lookup is not affected.
      
      Update selftest to handle this case.
      
      Reproducer:
      
       bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
           value 1 entries 256 name test_lpm flags 1
       bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
       bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
       bpftool map dump   pinned /sys/fs/bpf/lpm
      
      Returns only 1 element. (2 expected)
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf: simplify definition of BPF_FIB_LOOKUP related flags · 7108c835
      Martynas Pumputis authored
      commit b1d6c15b upstream.
      
      Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI
      were defined with the help of BIT macro. This had the following issues:
      
      - In order to use any of the flags, a user was required to depend
        on <linux/bits.h>.
      - No other flag in bpf.h uses the macro, so it seems that an unwritten
        convention is to use (1 << (nr)) to define BPF-related flags.
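
      The convention mentioned in the second point can be illustrated with
      a plain shift, which needs no helper header. The flag names below are
      hypothetical stand-ins, not the UAPI definitions themselves.

```c
#include <stdint.h>

/* Illustrative flags defined with a plain shift instead of a BIT() macro,
 * so users need no extra header to consume them. */
#define EXAMPLE_LOOKUP_DIRECT (1U << 0)
#define EXAMPLE_LOOKUP_OUTPUT (1U << 1)

/* Returns 1 if `flag` is set in `flags`, 0 otherwise. */
static int example_has_flag(uint32_t flags, uint32_t flag)
{
    return (flags & flag) != 0;
}
```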
      
      Fixes: 87f5fc7e ("bpf: Provide helper to do forwarding lookups in kernel FIB table")
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: aquantia: fix vlans not working over bridged network · 03c3e507
      Dmitry Bogdanov authored
      [ Upstream commit 48dd73d0 ]
      
      In configuration of vlan over bridge over aquantia device
      it was found that vlan tagged traffic is dropped on chip.
      
      The reason is that bridge device enables promisc mode,
      but in atlantic chip vlan filters will still apply.
      So we have to correlate promisc settings with vlan configuration.
      
      The solution is to track in a separate state variable the
      need of vlan forced promisc. And also consider generic
      promisc configuration when doing vlan filter config.
      
      Fixes: 7975d2af ("net: aquantia: add support of rx-vlan-filter offload")
      Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com>
      Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tun: wake up waitqueues after IFF_UP is set · 9590d1d1
      Fei Li authored
      [ Upstream commit 72b319dc ]
      
      Currently after setting tap0 link up, the tun code wakes up the tx/rx
      wait queues in tun_net_open() when .ndo_open() is called, however the
      IFF_UP flag has not been set yet. If there's already a wait queue, it
      would fail to transmit when checking the IFF_UP flag in tun_sendmsg().
      Then the saving vhost_poll_start() will add the wq into wqh until it
      is woken up again. Although this works when the IFF_UP flag has already
      been set by the time tun_chr_poll checks it, this is not true if the
      IFF_UP flag has not been set at that time. Sadly the latter case is a
      fatal error, as the wq will never be woken up in future unless the link
      is later manually set up on purpose.
      
      Fix this by moving the wakeup process into the NETDEV_UP event
      notifying process; this makes sure IFF_UP has been set before all
      wait queues are woken up.
      Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tipc: check msg->req data len in tipc_nl_compat_bearer_disable · a54c0c1d
      Xin Long authored
      [ Upstream commit 4f07b80c ]
      
      This patch is to fix an uninit-value issue, reported by syzbot:
      
        BUG: KMSAN: uninit-value in memchr+0xce/0x110 lib/string.c:981
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x191/0x1f0 lib/dump_stack.c:113
          kmsan_report+0x130/0x2a0 mm/kmsan/kmsan.c:622
          __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:310
          memchr+0xce/0x110 lib/string.c:981
          string_is_valid net/tipc/netlink_compat.c:176 [inline]
          tipc_nl_compat_bearer_disable+0x2a1/0x480 net/tipc/netlink_compat.c:449
          __tipc_nl_compat_doit net/tipc/netlink_compat.c:327 [inline]
          tipc_nl_compat_doit+0x3ac/0xb00 net/tipc/netlink_compat.c:360
          tipc_nl_compat_handle net/tipc/netlink_compat.c:1178 [inline]
          tipc_nl_compat_recv+0x1b1b/0x27b0 net/tipc/netlink_compat.c:1281
      
      TLV_GET_DATA_LEN() may return a negative int value, which will be
      used as size_t (becoming a big unsigned long) passed into memchr,
      causing this issue.
      
      Similar to what it does in tipc_nl_compat_bearer_enable(), this
      fix is to return -EINVAL when TLV_GET_DATA_LEN() is negative in
      tipc_nl_compat_bearer_disable(), as well as in
      tipc_nl_compat_link_stat_dump() and tipc_nl_compat_link_reset_stats().
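
      The hazard and the shape of the check can be sketched in userspace C.
      The helper name below is hypothetical; it only illustrates rejecting a
      bad length before it is converted to size_t for memchr. Zero is also
      rejected in this sketch, since an empty field cannot hold a terminated
      string.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: verify that a name field with claimed length `len`
 * contains a terminating NUL. A negative `len` would turn into a huge
 * size_t if passed to memchr() unchecked, so reject it first. */
static int check_name(const char *buf, int len)
{
    if (len <= 0)
        return -EINVAL;

    return memchr(buf, '\0', (size_t)len) ? 0 : -EINVAL;
}
```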
      
      v1->v2:
        - add the missing Fixes tags per Eric's request.
      
      Fixes: 0762216c ("tipc: fix uninit-value in tipc_nl_compat_bearer_enable")
      Fixes: 8b66fee7 ("tipc: fix uninit-value in tipc_nl_compat_link_reset_stats")
      Reported-by: syzbot+30eaa8bf392f7fafffaf@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tipc: change to use register_pernet_device · ec7fafa6
      Xin Long authored
      [ Upstream commit c492d4c7 ]
      
      This patch is to fix a dst refcnt leak, which can be reproduced by doing:
      
        # ip net a c; ip net a s; modprobe tipc
        # ip net e s ip l a n eth1 type veth peer n eth1 netns c
        # ip net e c ip l s lo up; ip net e c ip l s eth1 up
        # ip net e s ip l s lo up; ip net e s ip l s eth1 up
        # ip net e c ip a a 1.1.1.2/8 dev eth1
        # ip net e s ip a a 1.1.1.1/8 dev eth1
        # ip net e c tipc b e m udp n u1 localip 1.1.1.2
        # ip net e s tipc b e m udp n u1 localip 1.1.1.1
        # ip net d c; ip net d s; rmmod tipc
      
      and it will get stuck and keep logging the error:
      
        unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      The cause is that a dst is held by the udp sock's sk_rx_dst set on udp rx
      path with udp_early_demux == 1, and this dst (eventually holding lo dev)
      can't be released as bearer's removal in tipc pernet .exit happens after
      lo dev's removal, default_device pernet .exit.
      
       "There are two distinct types of pernet_operations recognized: subsys and
        device.  At creation all subsys init functions are called before device
        init functions, and at destruction all device exit functions are called
        before subsys exit function."
      
      So by calling register_pernet_device instead to register tipc_net_ops, the
      pernet .exit() will be invoked earlier than loopback dev's removal when a
      netns is being destroyed, as fou/gue does.
      
      Note that vxlan and geneve udp tunnels don't have this issue, as the udp
      sock is released in their device ndo_stop().
      
      This fix is also necessary for tipc dst_cache, which will hold dsts on tx
      path and which I will introduce in my next patch.
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • team: Always enable vlan tx offload · a061216a
      YueHaibing authored
      [ Upstream commit ee429742 ]
      
      We should rather have vlan_tci filled all the way down
      to the transmitting netdevice and let it do the hw/sw
      vlan implementation.
      Suggested-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sctp: change to hold sk after auth shkey is created successfully · 6c616a13
      Xin Long authored
      [ Upstream commit 25bff6d5 ]
      
      Now in sctp_endpoint_init(), it holds the sk then creates auth
      shkey. But when the creation fails, it doesn't release the sk,
      which causes a sk refcnt leak.
      
      Here to fix it by only holding the sk when auth shkey is created
      successfully.
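
      The ordering change can be pictured with a toy reference count. This
      is a userspace sketch with hypothetical names, not the sctp code: the
      reference is taken only after the step that can fail has succeeded, so
      the error path has nothing to undo.

```c
#include <errno.h>

/* Toy stand-in for a socket with a reference count. */
struct sock_model {
    int refcnt;
};

/* Hypothetical fallible step, standing in for creating the auth key. */
static int create_key(int should_fail)
{
    return should_fail ? -ENOMEM : 0;
}

/* Take the reference only once initialization can no longer fail. */
static int endpoint_init(struct sock_model *sk, int key_fails)
{
    int err = create_key(key_fails);

    if (err)
        return err;     /* nothing held yet, nothing to release */

    sk->refcnt++;       /* hold sk after the key was created */
    return 0;
}
```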
      
      Fixes: a29a5bd4 ("[SCTP]: Implement SCTP-AUTH initializations.")
      Reported-by: syzbot+afabda3890cc2f765041@syzkaller.appspotmail.com
      Reported-by: syzbot+276ca1c77a19977c0130@syzkaller.appspotmail.com
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/tls: fix page double free on TX cleanup · 0962d139
      Dirk van der Merwe authored
      [ Upstream commit 9354544c ]
      
      With commit 94850257 ("tls: Fix tls_device handling of partial records")
      a new path was introduced to cleanup partial records during sk_proto_close.
      This path does not handle the SW KTLS tx_list cleanup.
      
      This is unnecessary though since the free_resources calls for both
      SW and offload paths will cleanup a partial record.
      
      The visible effect is the following warning, but this bug also causes
      a page double free.
      
          WARNING: CPU: 7 PID: 4000 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
          RIP: 0010:sk_stream_kill_queues+0x103/0x110
          RSP: 0018:ffffb6df87e07bd0 EFLAGS: 00010206
          RAX: 0000000000000000 RBX: ffff8c21db4971c0 RCX: 0000000000000007
          RDX: ffffffffffffffa0 RSI: 000000000000001d RDI: ffff8c21db497270
          RBP: ffff8c21db497270 R08: ffff8c29f4748600 R09: 000000010020001a
          R10: ffffb6df87e07aa0 R11: ffffffff9a445600 R12: 0000000000000007
          R13: 0000000000000000 R14: ffff8c21f03f2900 R15: ffff8c21f03b8df0
          Call Trace:
           inet_csk_destroy_sock+0x55/0x100
           tcp_close+0x25d/0x400
           ? tcp_check_oom+0x120/0x120
           tls_sk_proto_close+0x127/0x1c0
           inet_release+0x3c/0x60
           __sock_release+0x3d/0xb0
           sock_close+0x11/0x20
           __fput+0xd8/0x210
           task_work_run+0x84/0xa0
           do_exit+0x2dc/0xb90
           ? release_sock+0x43/0x90
           do_group_exit+0x3a/0xa0
           get_signal+0x295/0x720
           do_signal+0x36/0x610
           ? SYSC_recvfrom+0x11d/0x130
           exit_to_usermode_loop+0x69/0xb0
           do_syscall_64+0x173/0x180
           entry_SYSCALL_64_after_hwframe+0x3d/0xa2
          RIP: 0033:0x7fe9b9abc10d
          RSP: 002b:00007fe9b19a1d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
          RAX: fffffffffffffe00 RBX: 0000000000000006 RCX: 00007fe9b9abc10d
          RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007fe948003430
          RBP: 00007fe948003410 R08: 00007fe948003430 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000246 R12: 00005603739d9080
          R13: 00007fe9b9ab9f90 R14: 00007fe948003430 R15: 0000000000000000
      
      Fixes: 94850257 ("tls: Fix tls_device handling of partial records")
      Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: stmmac: set IC bit when transmitting frames with HW timestamp · a6902fe4
      Roland Hii authored
      [ Upstream commit d0bb82fd ]
      
      When transmitting certain PTP frames, e.g. SYNC and DELAY_REQ, the
      PTP daemon, e.g. ptp4l, is polling the driver for the frame transmit
      hardware timestamp. The polling will most likely time out if tx
      coalescing is enabled, because the Interrupt-on-Completion (IC) bit is
      not set in the tx descriptor for those frames.
      
      This patch will ignore the tx coalesce parameter and set the IC bit
      when transmitting PTP frames which need to report out the frame
      transmit hardware timestamp to user space.
      
      Fixes: f748be53 ("net: stmmac: Rework coalesce timer and fix multi-queue races")
      Signed-off-by: Roland Hii <roland.king.guan.hii@intel.com>
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: stmmac: fixed new system time seconds value calculation · ac086d4c
      Roland Hii authored
      [ Upstream commit a1e5388b ]
      
      When ADDSUB bit is set, the system time seconds field is calculated as
      the complement of the seconds part of the update value.
      
      For example, if 3.000000001 seconds need to be subtracted from the
      system time, this field is calculated as
      2^32 - 3 = 4294967296 - 3 = 0x100000000 - 3 = 0xFFFFFFFD
      
      Previously, the 0x100000000 was mistakenly written as 100000000.
      
      This is further simplified from
        sec = (0x100000000ULL - sec);
      to
        sec = -sec;
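
      The arithmetic above can be checked with a few lines of userspace C.
      The function names are hypothetical; the sketch assumes the seconds
      field is a 32-bit unsigned value, as the 2^32 in the example implies.

```c
#include <stdint.h>

/* Explicit form: 2^32 - sec, truncated to the 32-bit seconds field. */
static uint32_t sub_seconds_explicit(uint32_t sec)
{
    return (uint32_t)(0x100000000ULL - sec);
}

/* Simplified form: unsigned negation wraps modulo 2^32, so this yields
 * the same value as the explicit form. */
static uint32_t sub_seconds_negate(uint32_t sec)
{
    return -sec;
}

/* The earlier mistake: decimal 100000000 instead of hex 0x100000000. */
static uint32_t sub_seconds_buggy(uint32_t sec)
{
    return 100000000U - sec;
}
```

      For sec = 3, the first two functions return 0xFFFFFFFD, while the
      buggy variant returns 99999997.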
      
      Fixes: ba1ffd74 ("stmmac: fix PTP support for GMAC4")
      Signed-off-by: Roland Hii <roland.king.guan.hii@intel.com>
      Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: remove duplicate fetch in sock_getsockopt · 505c9258
      JingYi Hou authored
      [ Upstream commit d0bae4a0 ]
      
      In sock_getsockopt(), 'optlen' is fetched the first time from userspace.
      'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is
      fetched the second time from userspace.
      
      Changing it between the two fetches may cause security problems or unexpected
      behavior, and there is no reason to fetch it a second time.
      
      To fix this, we need to remove the second fetch.
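
      The single-fetch pattern can be modelled in userspace C as follows.
      This is a sketch under assumptions: user memory is represented by a
      volatile int, and the function names are hypothetical. The point is
      that the value is read once, validated, and only the local copy is
      used afterwards.

```c
#include <errno.h>

/* Model of reading a value from user-controlled memory. */
static int fetch_user(const volatile int *uptr)
{
    return *uptr;
}

/* Fetch the length once, validate it, and keep using the local copy.
 * Returns the clamped length, or -EINVAL for a negative length. */
static int getopt_single_fetch(const volatile int *user_len, int needed)
{
    int len = fetch_user(user_len);

    if (len < 0)
        return -EINVAL;

    /* No second fetch here: a concurrent writer cannot change `len`
     * after it has been checked. */
    if (len > needed)
        len = needed;

    return len;
}
```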
      Signed-off-by: JingYi Hou <houjingyi647@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/packet: fix memory leak in packet_set_ring() · 65b2a804
      Eric Dumazet authored
      [ Upstream commit 55655e3d ]
      
      syzbot found we can leak memory in packet_set_ring(), if user application
      provides buggy parameters.
      
      Fixes: 7f953ab2 ("af_packet: TX_RING support for TPACKET_V3")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop · c79ab459
      Stephen Suryaputra authored
      [ Upstream commit 38c73529 ]
      
      In commit 19e4e768 ("ipv4: Fix raw socket lookup for local
      traffic"), the dif argument to __raw_v4_lookup() is coming from the
      returned value of inet_iif() but the change was done only for the first
      lookup. Subsequent lookups in the while loop still use skb->dev->ifindex.
      
      Fixes: 19e4e768 ("ipv4: Fix raw socket lookup for local traffic")
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bonding: Always enable vlan tx offload · bc4fdb7d
      YueHaibing authored
      [ Upstream commit 30d8177e ]
      
      We build vlan on top of a bonding interface whose vlan offload
      is off, bond mode is 802.3ad (LACP) and xmit_hash_policy is
      BOND_XMIT_POLICY_ENCAP34.
      
      Because vlan tx offload is off, vlan tci is cleared and the vlan
      header is pushed into the skb in validate_xmit_vlan() while sending
      from vlan devices. Then in bond_xmit_hash, __skb_flow_dissect() fails
      to get information from protocol headers encapsulated within vlan,
      because 'nhoff' points to the IP header, so bond hashing is based
      on layer 2 info, which fails to distribute packets across slaves.
      
      This patch always enables bonding's vlan tx offload and passes the
      vlan packets to the slave devices with vlan tci, letting them handle
      the vlan implementation.
      
      Fixes: 278339a4 ("bonding: propogate vlan_features to bonding master")
      Suggested-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET · c14a0de9
      Neil Horman authored
      [ Upstream commit 89ed5b51 ]
      
      When an application is run that:
      a) Sets its scheduler to be SCHED_FIFO
      and
      b) Opens a memory mapped AF_PACKET socket, and sends frames with the
      MSG_DONTWAIT flag cleared, it's possible for the application to hang
      forever in the kernel.  This occurs because when waiting, the code in
      tpacket_snd calls schedule, which under normal circumstances allows
      other tasks to run, including ksoftirqd, which in some cases is
      responsible for freeing the transmitted skb (which in AF_PACKET calls a
      destructor that flips the status bit of the transmitted frame back to
      available, allowing the transmitting task to complete).
      
      However, when the calling application is SCHED_FIFO, its priority is
      such that the schedule call immediately places the task back on the cpu,
      preventing ksoftirqd from freeing the skb, which in turn prevents the
      transmitting task from detecting that the transmission is complete.
      
      We can fix this by converting the schedule call to a completion
      mechanism.  By using a completion queue, we force the calling task, when
      it detects there are no more frames to send, to schedule itself off the
      cpu until such time as the last transmitted skb is freed, allowing
      forward progress to be made.
      
      Tested by myself and the reporter, with good results
      
      Change Notes:
      
      V1->V2:
      	Enhance the sleep logic to support being interruptible and
      allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
      
      V2->V3:
      	Rearrange the point at which we wait for the completion queue, to
      avoid needing to check for ph/skb being null at the end of the loop.
      Also move the complete call to the skb destructor to avoid needing to
      modify __packet_set_status.  Also gate calling complete on
      packet_read_pending returning zero to avoid multiple calls to complete.
      (Willem de Bruijn)
      
      	Move timeo computation within loop, to re-fetch the socket
      timeout since we also use the timeo variable to record the return code
      from the wait_for_complete call (Neil Horman)
      
      V3->V4:
      	Willem has requested that the control flow be restored to the
      previous state.  Doing so lets us eliminate the need for the
      po->wait_on_complete flag variable, and lets us get rid of the
      packet_next_frame function, but introduces another complexity.
      Specifically, by using the packet pending count, if an
      application calls sendmsg multiple times with MSG_DONTWAIT set, each
      set of transmitted frames, when complete, will cause
      tpacket_destruct_skb to issue a complete call, for which there will
      never be a wait_for_completion call.  This imbalance will lead to any
      future call to wait_for_completion here returning early, when the frames
      they sent may not have completed.  To correct this, we need to re-init
      the completion queue on every call to tpacket_snd before we enter the
      loop so as to ensure we wait properly for the frames we send in this
      iteration.
      
      	Change the timeout and interrupted gotos to out_put rather than
      out_status so that we don't try to free a non-existent skb
      	Clean up some extra newlines (Willem de Bruijn)
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • irqchip/mips-gic: Use the correct local interrupt map registers · 0730644e
      Paul Burton authored
      commit 6d4d367d upstream.
      
      The MIPS GIC contains a block of registers used to map local interrupts
      to a particular CPU interrupt pin. Since these registers are found at a
      consecutive range of addresses we access them using an index, via the
      (read|write)_gic_v[lo]_map accessor functions. We currently use values
      from enum mips_gic_local_interrupt as those indices.
      
      Unfortunately whilst enum mips_gic_local_interrupt provides the correct
      offsets for bits in the pending & mask registers, the ordering of the
      map registers is subtly different... Compared with the ordering of
      pending & mask bits, the map registers move the FDC from the end of the
      list to index 3 after the timer interrupt. As a result the performance
      counter & software interrupts are therefore at indices 4-6 rather than
      indices 3-5.
      
      Notably this causes problems with performance counter interrupts being
      incorrectly mapped on some systems, and presumably will also cause
      problems for FDC interrupts.
      
      Introduce a function to map from enum mips_gic_local_interrupt to the
      index of the corresponding map register, and use it to ensure we access
      the map registers for the correct interrupts.
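A minimal sketch of such a mapping function, based only on the ordering described above. The enumerator and function names are hypothetical, and the two entries placed before the timer are assumptions made so that the timer lands at index 2.

```c
/* Hypothetical mirror of the pending/mask bit ordering: FDC is last. */
enum local_intr {
    LOCAL_INT_WD = 0,      /* assumed name for index 0 */
    LOCAL_INT_COMPARE,     /* assumed name for index 1 */
    LOCAL_INT_TIMER,       /* index 2 */
    LOCAL_INT_PERFCTR,     /* index 3 in pending/mask */
    LOCAL_INT_SWINT0,      /* index 4 in pending/mask */
    LOCAL_INT_SWINT1,      /* index 5 in pending/mask */
    LOCAL_INT_FDC,         /* index 6 in pending/mask */
};

/* Translate a pending/mask bit position into a map register index. */
static unsigned int local_intr_to_map_index(enum local_intr intr)
{
    /* Everything up to and including the timer is in the same place. */
    if (intr <= LOCAL_INT_TIMER)
        return (unsigned int)intr;

    /* FDC moves from the end of the list to just after the timer. */
    if (intr == LOCAL_INT_FDC)
        return LOCAL_INT_TIMER + 1;

    /* Performance counter and software interrupts shift up by one. */
    return (unsigned int)intr + 1;
}
```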
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Fixes: a0dc5cb5 ("irqchip: mips-gic: Simplify gic_local_irq_domain_map()")
      Fixes: da61fcf9 ("irqchip: mips-gic: Use irq_cpu_online to (un)mask all-VP(E) IRQs")
      Reported-and-tested-by: Archer Yan <ayan@wavecomp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0730644e
    • Trond Myklebust's avatar
      SUNRPC: Fix up calculation of client message length · 5dfe49ca
      Trond Myklebust authored
      commit 7e3d3620 upstream.
      
      In the case where a record marker was used, xs_sendpages() needs
      to return the length of the payload + record marker so that we
      operate correctly in the case of a partial transmission.
      When the callers check return value, they therefore need to
      take into account the record marker length.
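The caller-side check can be sketched as follows. The helper names are hypothetical, and the sketch assumes the standard 4-byte RPC record marker; it is not the actual SUNRPC code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes that must reach the wire for one request: the payload plus,
 * when record marking is in use, the 4-byte record marker. */
static size_t wire_length(size_t payload_len, bool has_record_marker)
{
    return payload_len + (has_record_marker ? sizeof(uint32_t) : 0);
}

/* A send is only complete once the marker bytes are counted as well. */
static bool send_complete(size_t bytes_sent, size_t payload_len,
                          bool has_record_marker)
{
    return bytes_sent >= wire_length(payload_len, has_record_marker);
}
```

Comparing the sent byte count against the payload length alone would declare a partial transmission complete four bytes too early.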
      
      Fixes: 06b5fc3a ("Merge tag 'nfs-rdma-for-5.1-1'...")
      Cc: stable@vger.kernel.org # 5.1+
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5dfe49ca
    • Geert Uytterhoeven's avatar
      cpu/speculation: Warn on unsupported mitigations= parameter · b187fae6
      Geert Uytterhoeven authored
      commit 1bf72720 upstream.
      
      Currently, if the user specifies an unsupported mitigation strategy on the
      kernel command line, it will be ignored silently.  The code will fall back
      to the default strategy, possibly leaving the system more vulnerable than
      expected.
      
      This may happen due to e.g. a simple typo, or, for a stable kernel release,
      because not all mitigation strategies have been backported.
      
      Inform the user by printing a message.
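As a rough userspace illustration of the behaviour, the sketch below parses the option and warns on anything it does not recognise. The enum, the function name, and the `warned` out-parameter are hypothetical; the accepted strings are the ones documented for `mitigations=`, and the message wording is an assumption.

```c
#include <stdio.h>
#include <string.h>

enum mitigations {
    MITIGATIONS_OFF,
    MITIGATIONS_AUTO,
    MITIGATIONS_AUTO_NOSMT,
};

/* Parse the option value; on an unknown value, warn and keep the default. */
static enum mitigations parse_mitigations(const char *arg, int *warned)
{
    *warned = 0;
    if (!strcmp(arg, "off"))
        return MITIGATIONS_OFF;
    if (!strcmp(arg, "auto"))
        return MITIGATIONS_AUTO;
    if (!strcmp(arg, "auto,nosmt"))
        return MITIGATIONS_AUTO_NOSMT;

    /* Previously this case was silent; now the user is told about it. */
    fprintf(stderr, "Unsupported mitigations=%s, system may still be vulnerable\n", arg);
    *warned = 1;
    return MITIGATIONS_AUTO;
}
```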
      
      Fixes: 98af8452 ("cpu/speculation: Add 'mitigations=' cmdline option")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b187fae6
    • Trond Myklebust's avatar
      NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O · 82d0f7b6
      Trond Myklebust authored
      commit 68f46159 upstream.
      
      Fix a typo where we're confusing the default TCP retrans value
      (NFS_DEF_TCP_RETRANS) for the default TCP timeout value.
      
      Fixes: 15d03055 ("pNFS/flexfiles: Set reasonable default ...")
      Cc: stable@vger.kernel.org # 4.8+
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      82d0f7b6
    • Ard Biesheuvel's avatar
      efi/memreserve: deal with memreserve entries in unmapped memory · b5961eca
      Ard Biesheuvel authored
      commit 18df7577 upstream.
      
      Ensure that the EFI memreserve entries can be accessed, even if they
      are located in memory that the kernel (e.g., a crashkernel) omits from
      the linear map.
      
      Fixes: 80424b02 ("efi: Reduce the amount of memblock reservations ...")
      Cc: <stable@vger.kernel.org> # 5.0+
      Reported-by: Jonathan Richardson <jonathan.richardson@broadcom.com>
      Reviewed-by: Jonathan Richardson <jonathan.richardson@broadcom.com>
      Tested-by: Jonathan Richardson <jonathan.richardson@broadcom.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5961eca
    • Johannes Weiner's avatar
      mm: fix page cache convergence regression · 994f9a52
      Johannes Weiner authored
      commit 7b785645 upstream.
      
      Since a2833486 ("page cache: Finish XArray conversion"), on most
      major Linux distributions, the page cache doesn't correctly transition
      when the hot data set is changing, and leaves the new pages thrashing
      indefinitely instead of kicking out the cold ones.
      
      On a freshly booted, freshly ssh'd into virtual machine with 1G RAM
      running stock Arch Linux:
      
      [root@ham ~]# ./reclaimtest.sh
      + dd of=workingset-a bs=1M count=0 seek=600
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + ./mincore workingset-a
      153600/153600 workingset-a
      + dd of=workingset-b bs=1M count=0 seek=600
      + cat workingset-b
      + cat workingset-b
      + cat workingset-b
      + cat workingset-b
      + ./mincore workingset-a workingset-b
      104029/153600 workingset-a
      120086/153600 workingset-b
      + cat workingset-b
      + cat workingset-b
      + cat workingset-b
      + cat workingset-b
      + ./mincore workingset-a workingset-b
      104029/153600 workingset-a
      120268/153600 workingset-b
      
      workingset-b is a 600M file on a 1G host that is otherwise entirely
      idle. No matter how often it's being accessed, it won't get cached.
      
      While investigating, I noticed that the non-resident information gets
      aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is
      a problem because a workingset transition like this relies on the
      non-resident information tracked in the page cache tree of evicted
      file ranges: when the cache faults are refaults of recently evicted
      cache, we challenge the existing active set, and that allows a new
      workingset to establish itself.
      
      Tracing the shrinker that maintains this memory revealed that all page
      cache tree nodes were allocated to the root cgroup. This is a problem,
      because 1) the shrinker sizes the amount of non-resident information
      it keeps to the size of the cgroup's other memory and 2) on most major
      Linux distributions, only kernel threads live in the root cgroup and
      everything else gets put into services or session groups:
      
      [root@ham ~]# cat /proc/self/cgroup
      0::/user.slice/user-0.slice/session-c1.scope
      
      As a result, we basically maintain no non-resident information for the
      workloads running on the system, thus breaking the caching algorithm.
      
      Looking through the code, I found the culprit in the above-mentioned
      patch: when switching from the radix tree to xarray, it dropped the
      __GFP_ACCOUNT flag from the tree node allocations - the flag that
      makes sure the allocated memory gets charged to and tracked by the
      cgroup of the calling process - in this case, the one doing the fault.
      
      To fix this, allow xarray users to specify a per-tree flag that makes
      xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache
      tree annotation to request such cgroup tracking for the cache nodes.
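The idea of a per-tree accounting flag can be sketched in a few lines. All names and bit values below are hypothetical placeholders, not the kernel's definitions; the sketch only shows the tree flag being folded into the allocation flags used for nodes.

```c
/* Hypothetical flag values, for illustration only. */
#define TOY_GFP_KERNEL        0x1u
#define TOY_GFP_ACCOUNT       0x2u  /* charge the allocation to the caller's cgroup */
#define TOY_XA_FLAGS_ACCOUNT  0x4u  /* per-tree: account node allocations */

struct toy_xarray {
    unsigned int flags;
};

/* Compute the allocation flags used for a new tree node. */
static unsigned int toy_node_gfp(const struct toy_xarray *xa, unsigned int gfp)
{
    if (xa->flags & TOY_XA_FLAGS_ACCOUNT)
        gfp |= TOY_GFP_ACCOUNT;
    return gfp;
}
```

A tree created without the flag keeps allocating unaccounted nodes, so only users that opt in, such as the page cache, get cgroup tracking.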
      
      With this patch applied, the page cache correctly converges on new
      workingsets again after just a few iterations:
      
      [root@ham ~]# ./reclaimtest.sh
      + dd of=workingset-a bs=1M count=0 seek=600
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + cat workingset-a
      + ./mincore workingset-a
      153600/153600 workingset-a
      + dd of=workingset-b bs=1M count=0 seek=600
      + cat workingset-b
      + ./mincore workingset-a workingset-b
      124607/153600 workingset-a
      87876/153600 workingset-b
      + cat workingset-b
      + ./mincore workingset-a workingset-b
      81313/153600 workingset-a
      133321/153600 workingset-b
      + cat workingset-b
      + ./mincore workingset-a workingset-b
      63036/153600 workingset-a
      153600/153600 workingset-b
      
      Cc: stable@vger.kernel.org # 4.20+
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      994f9a52
    • Reinette Chatre's avatar
      x86/resctrl: Prevent possible overrun during bitmap operations · 9b901ec9
      Reinette Chatre authored
      commit 32f010de upstream.
      
      While the DOC at the beginning of lib/bitmap.c explicitly states that
      "The number of valid bits in a given bitmap does _not_ need to be an
      exact multiple of BITS_PER_LONG.", some of the bitmap operations do
      indeed access BITS_PER_LONG portions of the provided bitmap no matter
      the size of the provided bitmap.
      
      For example, if find_first_bit() is provided with an 8 bit bitmap the
      operation will access BITS_PER_LONG bits from the provided bitmap. While
      the operation ensures that these extra bits do not affect the result,
      the memory is still accessed.
      
      The capacity bitmasks (CBMs) are typically stored in u32 since they
      can never exceed 32 bits. A few instances exist where a bitmap_*
      operation is performed on a CBM by simply pointing the bitmap operation
      to the stored u32 value.
      
      The consequence of this pattern is that some bitmap_* operations will
      access out-of-bounds memory when interacting with the provided CBM.
      
      This same issue has previously been addressed with commit 49e00eee
      ("x86/intel_rdt: Fix out-of-bounds memory access in CBM tests")
      but at that time not all instances of the issue were fixed.
      
      Fix this by using an unsigned long to store the capacity bitmask data
      that is passed to bitmap functions.
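The pattern behind the fix can be sketched in userspace. `toy_find_first_bit` is a hypothetical stand-in that, like the helpers described above, always loads a whole `unsigned long`; the point is that the CBM is widened into an `unsigned long` first, so that load stays inside the object.

```c
#include <limits.h>
#include <stdint.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Toy bitmap helper: reads a full unsigned long regardless of nbits. */
static unsigned int toy_find_first_bit(const unsigned long *map,
                                       unsigned int nbits)
{
    unsigned long word = *map; /* loads sizeof(unsigned long) bytes */

    for (unsigned int i = 0; i < nbits && i < BITS_PER_ULONG; i++)
        if (word & (1UL << i))
            return i;
    return nbits; /* no bit set within the first nbits */
}

/* Safe caller: copy the 32-bit CBM into an unsigned long before use,
 * instead of pointing the bitmap helper at the u32 itself. */
static unsigned int cbm_first_bit(uint32_t cbm, unsigned int cbm_len)
{
    unsigned long val = cbm;

    return toy_find_first_bit(&val, cbm_len);
}
```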
      
      Fixes: e6519011 ("x86/intel_rdt: Introduce "bit_usage" to display cache allocations details")
      Fixes: f4e80d67 ("x86/intel_rdt: Resctrl files reflect pseudo-locked information")
      Fixes: 95f0b77e ("x86/intel_rdt: Initialize new resource group with sane defaults")
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/58c9b6081fd9bf599af0dfc01a6fdd335768efef.1560975645.git.reinette.chatre@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9b901ec9
    • Thomas Gleixner's avatar
      x86/microcode: Fix the microcode load on CPU hotplug for real · 3c762ccd
      Thomas Gleixner authored
      commit 5423f5ce upstream.
      
      A recent change moved the microcode loader hotplug callback into the early
      startup phase which is running with interrupts disabled. It missed that
      the callbacks invoke sysfs functions which might sleep causing nice 'might
      sleep' splats with proper debugging enabled.
      
      Split the callbacks and only load the microcode in the early startup phase
      and move the sysfs handling back into the later threaded and preemptible
      bringup phase where it was before.
      
      Fixes: 78f4e932 ("x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906182228350.1766@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c762ccd
    • Alejandro Jimenez's avatar
      x86/speculation: Allow guests to use SSBD even if host does not · 6aec2bbd
      Alejandro Jimenez authored
      commit c1f7fec1 upstream.
      
      The bits set in x86_spec_ctrl_mask are used to calculate the guest's value
      of SPEC_CTRL that is written to the MSR before VMENTRY, and control which
      mitigations the guest can enable.  In the case of SSBD, unless the host has
      enabled SSBD always on mode (by passing "spec_store_bypass_disable=on" in
      the kernel parameters), the SSBD bit is not set in the mask and the guest
      can not properly enable the SSBD always on mitigation mode.
      
      This has been confirmed by running the SSBD PoC on a guest using the SSBD
      always on mitigation mode (booted with kernel parameter
      "spec_store_bypass_disable=on"), and verifying that the guest is vulnerable
      unless the host is also using SSBD always on mode. In addition, the guest
      OS incorrectly reports the SSB vulnerability as mitigated.
      
      Always set the SSBD bit in x86_spec_ctrl_mask when the host CPU supports
      it, allowing the guest to use SSBD whether or not the host has chosen to
      enable the mitigation in any of its modes.
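A minimal sketch of the mask computation after the change. The function name and parameters are hypothetical; SSBD is bit 2 of the SPEC_CTRL MSR, and the decision depends only on whether the host CPU supports it, not on the host's chosen mitigation mode.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_SSBD (1ULL << 2) /* Speculative Store Bypass Disable */

/* Build the mask of SPEC_CTRL bits a guest is allowed to control. */
static uint64_t build_spec_ctrl_mask(uint64_t base_mask, bool cpu_has_ssbd)
{
    /* Expose SSBD whenever the CPU supports it, even if the host
     * itself has not enabled the mitigation. */
    if (cpu_has_ssbd)
        base_mask |= SPEC_CTRL_SSBD;
    return base_mask;
}
```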
      
      Fixes: be6fcb54 ("x86/bugs: Rework spec_ctrl base and mask logic")
      Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: bp@alien8.de
      Cc: rkrcmar@redhat.com
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1560187210-11054-1-git-send-email-alejandro.j.jimenez@oracle.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6aec2bbd
    • Jan Kara's avatar
      scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck() · e5fb2093
      Jan Kara authored
      commit 240b4cc8 upstream.
      
      Once we unlock adapter->hw_lock in pvscsi_queue_lck(), nothing prevents the
      just-queued scsi_cmnd from completing and freeing the request. Thus the
      cmd->cmnd[0] dereference can access an already freed request, leading to
      kernel crashes or other issues (which one of our customers observed). Store
      cmd->cmnd[0] in a local variable before unlocking adapter->hw_lock to fix
      the issue.
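The snapshot-before-release pattern looks roughly like this in a userspace model. `toy_cmd`, `toy_unlock_and_complete()` and `toy_queue()` are hypothetical; the free inside the unlock helper stands in for a completion that races in as soon as the lock is dropped.

```c
#include <stdlib.h>

struct toy_cmd {
    unsigned char cmnd[16];
};

/* Models dropping the lock: completion may run at once and free cmd. */
static void toy_unlock_and_complete(struct toy_cmd *cmd)
{
    free(cmd);
}

static unsigned char toy_queue(struct toy_cmd *cmd)
{
    /* Take the copy while the command is still guaranteed to exist. */
    unsigned char op = cmd->cmnd[0];

    toy_unlock_and_complete(cmd);

    /* cmd must not be touched from here on; only the local copy is safe. */
    return op;
}
```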
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5fb2093
    • Jens Axboe's avatar
      io_uring: ensure req->file is cleared on allocation · ddae0798
      Jens Axboe authored
      commit 60c112b0 upstream.
      
      Stephen reports:
      
      I hit the following General Protection Fault when testing io_uring via
      the io_uring engine in fio. This was on a VM running 5.2-rc5 and the
      latest version of fio. The issue occurs for both null_blk and fake NVMe
      drives. I have not tested bare metal or real NVMe SSDs. The fio script
      used is given below.
      
      [io_uring]
      time_based=1
      runtime=60
      filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
      ioengine=io_uring
      bs=4k
      rw=readwrite
      direct=1
      fixedbufs=1
      sqthread_poll=1
      sqthread_poll_cpu=0
      
      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      RIP: 0010:fput_many+0x7/0x90
      Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
      
      RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
      RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
      RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
      R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
      R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
      FS:  0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? fput+0x13/0x20
       io_free_req+0x20/0x40
       io_put_req+0x1b/0x20
       io_submit_sqe+0x40a/0x680
       ? __switch_to_asm+0x34/0x70
       ? __switch_to_asm+0x40/0x70
       io_submit_sqes+0xb9/0x160
       ? io_submit_sqes+0xb9/0x160
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       ? __switch_to_asm+0x34/0x70
       io_sq_thread+0x1af/0x470
       ? __switch_to_asm+0x34/0x70
       ? wait_woken+0x80/0x80
       ? __switch_to+0x85/0x410
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread_destroy_worker+0x50/0x50
       ret_from_fork+0x35/0x40
      
      which occurs because using a kernel side submission thread isn't valid
      without using fixed files (registered through io_uring_register()). This
      causes io_uring to put the request after logging an error, but before
      the file field is set in the request. If it happens to be non-zero, we
      attempt to fput() garbage.
      
      Fix this by ensuring that req->file is initialized when the request is
      allocated.
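The failure mode and the fix can be modelled in userspace as below. The structures and helpers are hypothetical simplifications: the release path drops a file reference only when the pointer is set, so the pointer has to be cleared at allocation time for an early error path to be safe.

```c
#include <stdlib.h>

struct toy_file {
    int refs;
};

struct toy_req {
    struct toy_file *file;
    int result;
};

/* Allocate a request with its file pointer explicitly cleared, so an
 * early error path never sees leftover heap contents. */
static struct toy_req *toy_alloc_req(void)
{
    struct toy_req *req = malloc(sizeof(*req));

    if (!req)
        return NULL;
    req->file = NULL;
    req->result = 0;
    return req;
}

/* Free a request; returns 1 if a file reference was dropped. */
static int toy_free_req(struct toy_req *req)
{
    int dropped = 0;

    if (req->file) {
        req->file->refs--;
        dropped = 1;
    }
    free(req);
    return dropped;
}
```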
      
      Cc: stable@vger.kernel.org # 5.1+
      Reported-by: Stephen Bates <sbates@raithlin.com>
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ddae0798
    • zhangyi (F)'s avatar
      dm log writes: make sure super sector log updates are written in order · 25df4ce3
      zhangyi (F) authored
      commit 211ad4b7 upstream.
      
      Currently, although we submit super bios in order (and super.nr_entries
      is incremented by each logged entry), submit_bio() is async so each
      super sector may not be written to log device in order and then the
      final nr_entries may be smaller than it should be.
      
      This problem can be reproduced by the xfstests generic/455 with ext4:
      
        QA output created by 455
       -Silence is golden
       +mark 'end' does not exist
      
      Fix this by serializing submission of super sectors to make sure each
      is written to the log disk in order.
      
      Fixes: 0e9cebe7 ("dm: add log writes target")
      Cc: stable@vger.kernel.org
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      25df4ce3
    • Gen Zhang's avatar
      dm init: fix incorrect uses of kstrndup() · 6e17b11f
      Gen Zhang authored
      commit dec7e649 upstream.
      
      Fix 2 kstrndup() calls with incorrect argument order.
      
      Fixes: 6bbc923d ("dm: add support to directly boot to a mapped device")
      Cc: stable@vger.kernel.org # v5.1
      Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e17b11f
    • Huang Ying's avatar
      mm, swap: fix THP swap out · e3d6fe0b
      Huang Ying authored
      commit 1a5f439c upstream.
      
      0-Day test system reported some OOM regressions for several THP
      (Transparent Huge Page) swap test cases.  These regressions are bisected
      to 68614289 ("block: always define BIO_MAX_PAGES as 256").  In the
      commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled.  So the
      bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
      THP.  That causes the OOM.
      
      As in the patch description of 68614289 ("block: always define
      BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
      to swap space.  So the issue is fixed via doing that in get_swap_bio().
      
      BTW: I remember checking the THP swap code when 68614289
      ("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
      THP swap code needn't be changed.  But apparently, I was wrong.  I
      should have done this at that time.
      
      Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
      Fixes: 68614289 ("block: always define BIO_MAX_PAGES as 256")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3d6fe0b
    • Colin Ian King's avatar
      mm/page_idle.c: fix oops because end_pfn is larger than max_pfn · 00553cdd
      Colin Ian King authored
      commit 7298e3b0 upstream.
      
      Currently the calculation of end_pfn can round up the pfn number to more
      than the actual maximum number of pfns, causing an Oops.  Fix this by
      ensuring end_pfn is never more than max_pfn.
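A minimal sketch of the clamp, assuming one bitmap bit per pfn so that a request of `count_bytes` bytes covers `count_bytes * 8` pfns. The function name and signature are hypothetical.

```c
/* Compute the exclusive end pfn for a bitmap access of count_bytes bytes
 * starting at pfn, never going past max_pfn. */
static unsigned long clamp_end_pfn(unsigned long pfn,
                                   unsigned long count_bytes,
                                   unsigned long max_pfn)
{
    unsigned long end_pfn = pfn + count_bytes * 8; /* 8 pfns per byte */

    if (end_pfn > max_pfn)
        end_pfn = max_pfn;
    return end_pfn;
}
```

With `max_pfn` of 60, an 8-byte access starting at pfn 0 would otherwise reach pfn 64; the clamp stops it at 60.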
      
      This can be easily triggered on systems where the end_pfn gets
      rounded up to more than max_pfn using the idle-page stress-ng stress test:
      
      sudo stress-ng --idle-page 0
      
        BUG: unable to handle kernel paging request at 00000000000020d8
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
        RIP: 0010:page_idle_get_page+0xc8/0x1a0
        Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48
        RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202
        RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f
        RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700
        RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276
        R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080
        R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400
        FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0
        Call Trace:
          page_idle_bitmap_write+0x8c/0x140
          sysfs_kf_bin_write+0x5c/0x70
          kernfs_fop_write+0x12e/0x1b0
          __vfs_write+0x1b/0x40
          vfs_write+0xab/0x1b0
          ksys_write+0x55/0xc0
          __x64_sys_write+0x1a/0x20
          do_syscall_64+0x5a/0x110
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com
      Fixes: 33c3fc71 ("mm: introduce idle page tracking")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      00553cdd
    • Naoya Horiguchi's avatar
      mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge · 59d44003
      Naoya Horiguchi authored
      commit faf53def upstream.
      
      madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
      for hugepages with overcommitting enabled.  That is caused by suboptimal
      logic in the current soft-offline code.  See the following part:
      
          ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                                  MIGRATE_SYNC, MR_MEMORY_FAILURE);
          if (ret) {
                  ...
          } else {
                  /*
                   * We set PG_hwpoison only when the migration source hugepage
                   * was successfully dissolved, because otherwise hwpoisoned
                   * hugepage remains on free hugepage list, then userspace will
                   * find it as SIGBUS by allocation failure. That's not expected
                   * in soft-offlining.
                   */
                  ret = dissolve_free_huge_page(page);
                  if (!ret) {
                          if (set_hwpoison_free_buddy_page(page))
                                  num_poisoned_pages_inc();
                  }
          }
          return ret;
      
      Here dissolve_free_huge_page() returns -EBUSY if the migration source page
      was freed into buddy in migrate_pages(), but even in that case we actually
      have a chance that set_hwpoison_free_buddy_page() succeeds.  So that means
      the current code gives up offlining too early.
      
      dissolve_free_huge_page() checks that a given hugepage is suitable for
      dissolving, where we should return success for !PageHuge() case because
      the given hugepage is considered as already dissolved.
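The changed return convention can be modelled as follows. `toy_page` and its two flags are hypothetical simplifications of the real page state; the only point shown is that a page which is no longer a hugepage is reported as success rather than -EBUSY.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_page {
    bool is_huge; /* still a hugepage? */
    bool is_free; /* free, i.e. not in use */
};

static int toy_dissolve_free_huge_page(struct toy_page *page)
{
    /* Already freed into buddy: treat as dissolved, not as busy. */
    if (!page->is_huge)
        return 0;

    /* A hugepage that is still in use cannot be dissolved. */
    if (!page->is_free)
        return -EBUSY;

    page->is_huge = false;
    return 0;
}
```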
      
      This change also affects other callers of dissolve_free_huge_page(), which
      are cleaned up together.
      
      [n-horiguchi@ah.jp.nec.com: v3]
        Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.com
      Link: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
      Fixes: 6bc9b564 ("mm: fix race on soft-offlining")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
      Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
      Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      59d44003
    • Naoya Horiguchi's avatar
      mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails · 897b17e0
      Naoya Horiguchi authored
      commit b38e5962 upstream.
      
      The pass/fail of soft offline should be judged by checking whether the
      raw error page was finally contained or not (i.e.  the result of
      set_hwpoison_free_buddy_page()), but the current code does not work like
      that.  It might lead us to misjudge the test result when
      set_hwpoison_free_buddy_page() fails.
      
      Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
      not offline the original page and will not return an error.
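The intended result logic can be sketched like this. The function and its parameters are hypothetical: `dissolve_ret` stands for the return value of the dissolve step and `buddy_poisoned` for the outcome of set_hwpoison_free_buddy_page().

```c
#include <errno.h>
#include <stdbool.h>

/* Decide the soft-offline result from the two sub-steps. */
static int soft_offline_result(int dissolve_ret, bool buddy_poisoned,
                               unsigned long *num_poisoned)
{
    if (dissolve_ret)
        return dissolve_ret;

    if (buddy_poisoned) {
        (*num_poisoned)++; /* the raw error page was contained */
        return 0;
    }

    /* The page was not contained: report failure instead of success. */
    return -EBUSY;
}
```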
      
      Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Fixes: 6bc9b564 ("mm: fix race on soft-offlining")
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
      Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      897b17e0
    • Ville Syrjälä's avatar
      drm/i915: Skip modeset for cdclk changes if possible · ec73bed9
      Ville Syrjälä authored
      commit 59f9e9ca upstream.
      
      If we have only a single active pipe and the cdclk change only requires
      the cd2x divider to be updated, bxt+ can do the update without forcing a
      full modeset on the pipe. Try to hook that up.
      
      v2:
      - Wait for vblank after an optimized CDCLK change.
      - Avoid optimization if the pipe needs a modeset (or was disabled).
      - Split CDCLK change to a pre/post plane update step.
      v3:
      - Use correct version of CDCLK state as old state. (Ville)
      - Remove unused intel_cdclk_can_skip_modeset()
      v4:
      - For consistency call intel_set_cdclk_post_plane_update() only during
        modesets (and not fastsets).
      v5:
      - Remove the logic to update the CD2X divider on-the-fly on ICL, since
        only a divider of 1 is supported there. Clint also noticed that the
        pipe select bits in CDCLK_CTL are oddly defined on ICL, it's not clear
        yet whether that's only an error in the specification.
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Abhay Kumar <abhay.kumar@intel.com>
      Tested-by: Abhay Kumar <abhay.kumar@intel.com>
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190327101321.3095-1-imre.deak@intel.com
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec73bed9
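The conditions listed in the commit above (one active pipe, only the CD2X divider changing, no modeset already pending on the pipe) can be combined into a single predicate. The sketch below is a simplified model under assumptions: the struct, its field names, and both function names are hypothetical and do not reproduce the i915 driver code.

```c
#include <stdbool.h>

/* Hypothetical, simplified CDCLK configuration; field names are assumptions. */
struct cdclk_config {
    unsigned int cdclk_khz; /* resulting CDCLK frequency */
    unsigned int vco_khz;   /* PLL VCO frequency */
    unsigned int ref_khz;   /* reference clock frequency */
};

/* True when the PLL setup is unchanged and only the output frequency differs. */
bool only_cd2x_divider_changes(const struct cdclk_config *old_cfg,
                               const struct cdclk_config *new_cfg)
{
    return old_cfg->cdclk_khz != new_cfg->cdclk_khz &&
           old_cfg->vco_khz == new_cfg->vco_khz &&
           old_cfg->ref_khz == new_cfg->ref_khz;
}

/* Combine the conditions named in the commit message. */
bool can_skip_modeset(unsigned int active_pipes, bool pipe_needs_modeset,
                      const struct cdclk_config *old_cfg,
                      const struct cdclk_config *new_cfg)
{
    return active_pipes == 1 &&
           !pipe_needs_modeset &&
           only_cd2x_divider_changes(old_cfg, new_cfg);
}
```

Any of the three conditions failing falls back to the full modeset path, which matches the v2 note about avoiding the optimization when the pipe needs a modeset.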
    • Imre Deak's avatar
      drm/i915: Remove redundant store of logical CDCLK state · 994f9ddb
      Imre Deak authored
      commit 2b21dfbe upstream.
      
      We copied the original state into the atomic state already earlier in
      the function, so no need to do it a second time.
      
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-3-imre.deak@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      994f9ddb
    • Imre Deak's avatar
      drm/i915: Save the old CDCLK atomic state · ca2d6659
      Imre Deak authored
      commit 48d9f87d upstream.
      
      The old state will be needed by an upcoming patch to determine if the
      commit increases or decreases CDCLK, so move the old state to the atomic
      state (while keeping the new one in dev_priv). cdclk.logical and
      cdclk.actual in the atomic state aren't used anywhere after the
      atomic check phase at the moment, so this should be safe.
      
      v2:
      - Use swap() instead of opencoding it. (Ville)
      Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-2-imre.deak@intel.com
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ca2d6659
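The v2 note in the commit above prefers a swap helper over an open-coded exchange. The sketch below shows the pattern in plain C, assuming a GCC or Clang compiler for `__typeof__`. The struct layout, the `swap_values` macro, and the function name are hypothetical; the kernel's own swap() helper follows the same idea.

```c
/* Hypothetical state layout; only the swap pattern is the point here. */
struct cdclk_state {
    unsigned int logical_khz;
    unsigned int actual_khz;
};

/* Generic swap in the style of the kernel helper (GCC/Clang __typeof__). */
#define swap_values(a, b) \
    do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/*
 * Before the call the atomic state holds the new values and the device
 * state holds the old ones; afterwards the old values live in the atomic
 * state and the new ones in the device state.
 */
void exchange_cdclk_state(struct cdclk_state *atomic_state,
                          struct cdclk_state *device_state)
{
    swap_values(*atomic_state, *device_state);
}
```

A single swap keeps both copies intact, which is what lets a later patch compare old and new CDCLK.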
    • Ville Syrjälä's avatar
      drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled · 64e3d1c9
      Ville Syrjälä authored
      commit 905801fe upstream.
      
      CDCLK has to be at least twice the BCLK regardless of audio. The audio
      driver has to probe using this hook and increase the clock even in
      the absence of any display.
      
      v2: Use atomic refcount for get_power, put_power so that we can
          call each once (Abhay).
      v3: Reset power well 2 to avoid any transaction on iDisp link
          during cdclk change (Abhay).
      v4: Remove Power well 2 reset workaround (Ville).
      v5: Remove unwanted Power well 2 register defined in v4 (Abhay).
      v6:
      - Use a dedicated flag instead of state->modeset for min CDCLK changes
      - Make get/put audio power domain symmetric
      - Rebased on top of intel_wakeref tracking changes.
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Abhay Kumar <abhay.kumar@intel.com>
      Tested-by: Abhay Kumar <abhay.kumar@intel.com>
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190320135439.12201-1-imre.deak@intel.com
      Cc: <stable@vger.kernel.org> # 5.1.x
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=203623
      Buglink: https://bugs.freedesktop.org/show_bug.cgi?id=110916
      Link: https://www.spinics.net/lists/stable/msg310910.html
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      64e3d1c9
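The constraint in the commit above amounts to a floor on CDCLK of 2 * 96 MHz = 192 MHz while audio power is held. A minimal sketch of that arithmetic follows, assuming the 96 MHz figure from the commit subject is the audio bit clock; the macro and function names are hypothetical, not the driver's.

```c
#include <stdbool.h>

/* 96 MHz, taken from the commit subject; assumed to be the audio bit clock. */
#define AUDIO_BCLK_KHZ 96000u

/*
 * Return the minimum CDCLK in kHz. 'display_min_khz' is whatever the
 * display configuration alone would require (0 with no display at all).
 */
unsigned int min_cdclk_khz(unsigned int display_min_khz, bool audio_power_enabled)
{
    unsigned int audio_min_khz = 2u * AUDIO_BCLK_KHZ;

    if (audio_power_enabled && display_min_khz < audio_min_khz)
        return audio_min_khz;
    return display_min_khz;
}
```

Passing 0 for the display requirement models the "absence of any display" case: the floor still applies once audio power is enabled.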
    • Dinh Nguyen's avatar
      clk: socfpga: stratix10: fix divider entry for the emac clocks · 261b9429
      Dinh Nguyen authored
      commit 74684cce upstream.
      
      The fixed dividers for the emac clocks should be 2 not 4.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      261b9429
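The effect of the divider fix above is simple arithmetic: a fixed divider of 2 instead of 4 doubles the reported emac clock rate. The sketch below illustrates this with a hypothetical helper; the function name is an assumption and it is not the socfpga clock driver code.

```c
/* Rate of a fixed-divider clock: parent rate divided by a constant. */
unsigned long fixed_divider_rate(unsigned long parent_hz, unsigned int divider)
{
    return parent_hz / divider;
}
```

For an illustrative 400 MHz parent (a made-up figure), the wrong divider of 4 would report 100 MHz, while the corrected divider of 2 reports 200 MHz.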