Commits · 407c85c7ddd6b84d3cbdd2275616f70c27c17913 · Kirill Smelkov / linux

24 Nov, 2020 7 commits

tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7

Alexander Duyck authored Nov 20, 2020

When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomainSigned-off-by: Jakub Kicinski <kuba@kernel.org>

407c85c7

devlink: Fix reload stats structure · 5204bb68

Moshe Shemesh authored Nov 23, 2020

Fix reload stats structure exposed to the user. Change stats structure
hierarchy to have the reload action as a parent of the stat entry and
then stat entry includes value per limit. This will also help to avoid
string concatenation on iproute2 output.

Reload stats structure before this fix:
"stats": {
    "reload": {
        "driver_reinit": 2,
        "fw_activate": 1,
        "fw_activate_no_reset": 0
     }
}

After this fix:
"stats": {
    "reload": {
        "driver_reinit": {
            "unspecified": 2
        },
        "fw_activate": {
            "unspecified": 1,
            "no_reset": 0
        }
}

Fixes: a254c264 ("devlink: Add reload stats")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5204bb68

aquantia: Remove the build_skb path · 9bd2702d

Lincoln Ramsay authored Nov 23, 2020

When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these
2 code paths. When napi_alloc_skb creates an SKB, there is a certain
amount of headroom reserved. However, this is not done in the
build_skb codepath.

As the hardware buffer that build_skb is built around does not
handle the presence of the SKB header, this code path is being
removed and the napi_alloc_skb path will always be used. This code
path does have to copy the packet header into the SKB, but it adds
the packet data as a frag.

Fixes: 018423e9 ("net: ethernet: aquantia: Add ring support code")
Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
Link: https://lore.kernel.org/r/MWHPR1001MB23184F3EAFA413E0D1910EC9E8FC0@MWHPR1001MB2318.namprd10.prod.outlook.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9bd2702d

net/packet: fix packet receive on L3 devices without visible hard header · d5496990

Eyal Birger authored Nov 21, 2020

In the patchset merged by commit b9fcf0a0
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.

Fixes: b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d5496990

i40e: Fix removing driver while bare-metal VFs pass traffic · 2980cbd4

Sylwester Dziedziuch authored Nov 20, 2020

Prevent VFs from resetting when PF driver is being unloaded:
- introduce new pf state: __I40E_VF_RESETS_DISABLED;
- check if pf state has __I40E_VF_RESETS_DISABLED state set,
  if so, disable any further VFLR event notifications;
- when i40e_remove (rmmod i40e) is called, disable any resets on
  the VFs;

Previously if there were bare-metal VFs passing traffic and PF
driver was removed, there was a possibility of VFs triggering a Tx
timeout right before iavf_remove. This was causing iavf_close to
not be called because there is a check in the beginning of  iavf_remove
that bails out early if adapter->state < IAVF_DOWN_PENDING. This
makes it so some resources do not get cleaned up.

Fixes: 6a9ddb36 ("i40e: disable IOV before freeing resources")
Signed-off-by: Slawomir Laba <slawomirx.laba@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20201120180640.3654474-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2980cbd4

vsock/virtio: discard packets only when socket is really closed · 3fe356d5

Stefano Garzarella authored Nov 20, 2020

Starting from commit 8692cefc ("virtio_vsock: Fix race condition
in virtio_transport_recv_pkt"), we discard packets in
virtio_transport_recv_pkt() if the socket has been released.

When the socket is connected, we schedule a delayed work to wait the
RST packet from the other peer, also if SHUTDOWN_MASK is set in
sk->sk_shutdown.
This is done to complete the virtio-vsock shutdown algorithm, releasing
the port assigned to the socket definitively only when the other peer
has consumed all the packets.

If we discard the RST packet received, the socket will be closed only
when the VSOCK_CLOSE_TIMEOUT is reached.

Sergio discovered the issue while running ab(1) HTTP benchmark using
libkrun [1] and observing a latency increase with that commit.

To avoid this issue, we discard packet only if the socket is really
closed (SOCK_DONE flag is set).
We also set SOCK_DONE in virtio_transport_release() when we don't need
to wait any packets from the other peer (we didn't schedule the delayed
work). In this case we remove the socket from the vsock lists, releasing
the port assigned.

[1] https://github.com/containers/libkrun

Fixes: 8692cefc ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt")
Cc: justin.he@arm.com
Reported-by: Sergio Lopez <slp@redhat.com>
Tested-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Jia He <justin.he@arm.com>
Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3fe356d5

tcp: fix race condition when creating child sockets from syncookies · 01770a16

Ricardo Dias authored Nov 20, 2020

When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.
Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lanSigned-off-by: Jakub Kicinski <kuba@kernel.org>

01770a16

23 Nov, 2020 1 commit

Merge tag 'wireless-drivers-2020-11-23' of... · 1eae77bf

Jakub Kicinski authored Nov 23, 2020

Merge tag 'wireless-drivers-2020-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for v5.10

First set of fixes for v5.10. One fix for iwlwifi kernel panic, others
less notable.

rtw88

* fix a bogus test found by clang

iwlwifi

* fix long memory reads causing soft lockup warnings

* fix kernel panic during Channel Switch Announcement (CSA)

* other smaller fixes

MAINTAINERS

* email address updates

* tag 'wireless-drivers-2020-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers:
  iwlwifi: mvm: fix kernel panic in case of assert during CSA
  iwlwifi: pcie: set LTR to avoid completion timeout
  iwlwifi: mvm: write queue_sync_state only for sync
  iwlwifi: mvm: properly cancel a session protection for P2P
  iwlwifi: mvm: use the HOT_SPOT_CMD to cancel an AUX ROC
  iwlwifi: sta: set max HE max A-MPDU according to HE capa
  MAINTAINERS: update maintainers list for Cypress
  MAINTAINERS: update Yan-Hsuan's email address
  iwlwifi: pcie: limit memory read spin time
  rtw88: fix fw_fifo_addr check
====================

Link: https://lore.kernel.org/r/20201123161037.C11D1C43460@smtp.codeaurora.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1eae77bf

21 Nov, 2020 17 commits

Merge branch 'ibmvnic-fixes-in-reset-path' · f9b03653

Jakub Kicinski authored Nov 21, 2020

Lijun Pan says:

====================
ibmvnic: fixes in reset path

Patch 1/3 and 2/3 notify peers in failover and migration reset.
Patch 3/3 skips timeout reset if it is already resetting.
====================

Link: https://lore.kernel.org/r/20201120224013.46891-1-ljp@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f9b03653

ibmvnic: skip tx timeout reset while in resetting · 855a631a

Lijun Pan authored Nov 20, 2020

Sometimes it takes longer than 5 seconds (watchdog timeout) to complete
failover, migration, and other resets. In stead of scheduling another
timeout reset, we wait for the current one to complete.
Suggested-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

855a631a

ibmvnic: notify peers when failover and migration happen · 98025bce

Lijun Pan authored Nov 20, 2020

Commit 61d3e1d9 ("ibmvnic: Remove netdev notify for failover resets")
excluded the failover case for notify call because it said
netdev_notify_peers() can cause network traffic to stall or halt.
Current testing does not show network traffic stall
or halt because of the notify call for failover event.
netdev_notify_peers may be used when a device wants to inform the
rest of the network about some sort of a reconfiguration
such as failover or migration.

It is unnecessary to call that in other events like
FATAL, NON_FATAL, CHANGE_PARAM, and TIMEOUT resets
since in those scenarios the hardware does not change.
If the driver must do a hard reset, it is necessary to notify peers.

Fixes: 61d3e1d9 ("ibmvnic: Remove netdev notify for failover resets")
Suggested-by: Brian King <brking@linux.vnet.ibm.com>
Suggested-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

98025bce

ibmvnic: fix call_netdevice_notifiers in do_reset · 83935975

Lijun Pan authored Nov 20, 2020

When netdev_notify_peers was substituted in
commit 986103e7 ("net/ibmvnic: Fix RTNL deadlock during device reset"),
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev) was missed.
Fix it now.

Fixes: 986103e7 ("net/ibmvnic: Fix RTNL deadlock during device reset")
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

83935975

tun: honor IOCB_NOWAIT flag · 5aac0390

Jens Axboe authored Nov 20, 2020

tun only checks the file O_NONBLOCK flag, but it should also be checking
the iocb IOCB_NOWAIT flag. Any fops using ->read/write_iter() should check
both, otherwise it breaks users that correctly expect O_NONBLOCK semantics
if IOCB_NOWAIT is set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/e9451860-96cc-c7c7-47b8-fe42cadd5f4c@kernel.dkSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5aac0390

net/af_iucv: set correct sk_protocol for child sockets · c5dab094

Julian Wiedmann authored Nov 20, 2020

Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*),
instead of the PF_IUCV protocol that the parent was created with in
iucv_sock_create().

We're currently not using sk->sk_protocol ourselves, so this shouldn't
have much impact (except eg. getting the output in skb_dump() right).

Fixes: eac3731b ("[S390]: Add AF_IUCV socket support")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c5dab094

usbnet: ipheth: fix connectivity with iOS 14 · f33d9e2b

Yves-Alexis Perez authored Nov 19, 2020

Starting with iOS 14 released in September 2020, connectivity using the
personal hotspot USB tethering function of iOS devices is broken.

Communication between the host and the device (for example ICMP traffic
or DNS resolution using the DNS service running in the device itself)
works fine, but communication to endpoints further away doesn't work.

Investigation on the matter shows that no UDP and ICMP traffic from the
tethered host is reaching the Internet at all. For TCP traffic there are
exchanges between tethered host and server but packets are modified in
transit leading to impossible communication.

After some trials Matti Vuorela discovered that reducing the URB buffer
size by two bytes restored the previous behavior. While a better
solution might exist to fix the issue, since the protocol is not
publicly documented and considering the small size of the fix, let's do
that.
Tested-by: Matti Vuorela <matti.vuorela@bitfactor.fi>
Signed-off-by: Yves-Alexis Perez <corsac@corsac.net>
Link: https://lore.kernel.org/linux-usb/CAAn0qaXmysJ9vx3ZEMkViv_B19ju-_ExN8Yn_uSefxpjS6g4Lw@mail.gmail.com/
Link: https://github.com/libimobiledevice/libimobiledevice/issues/1038
Link: https://lore.kernel.org/r/20201119172439.94988-1-corsac@corsac.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f33d9e2b

cxgb4: Fix build failure when CONFIG_TLS=m · 659fbdcf

Tom Seewald authored Nov 20, 2020

After commit 9d2e5e9e ("cxgb4/ch_ktls: decrypted bit is not enough")
whenever CONFIG_TLS=m and CONFIG_CHELSIO_T4=y, the following build
failure occurs:

ld: drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.o: in function
`cxgb_select_queue':
cxgb4_main.c:(.text+0x2dac): undefined reference to `tls_validate_xmit_skb'

Fix this by ensuring that if TLS is set to be a module, CHELSIO_T4 will
also be compiled as a module. As otherwise the cxgb4 driver will not be
able to access TLS' symbols.

Fixes: 9d2e5e9e ("cxgb4/ch_ktls: decrypted bit is not enough")
Signed-off-by: Tom Seewald <tseewald@gmail.com>
Link: https://lore.kernel.org/r/20201120192528.615-1-tseewald@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

659fbdcf

bonding: wait for sysfs kobject destruction before freeing struct slave · b9ad3e9f

Jamie Iles authored Nov 20, 2020

syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, releasing a
struct slave device could result in the following splat:

  kobject: 'bonding_slave' (00000000cecdd4fe): kobject_release, parent 0000000074ceb2b2 (delayed 1000)
  bond0 (unregistering): (slave bond_slave_1): Releasing backup interface
  ------------[ cut here ]------------
  ODEBUG: free active (active state 0) object type: timer_list hint: workqueue_select_cpu_near kernel/workqueue.c:1549 [inline]
  ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98 kernel/workqueue.c:1600
  WARNING: CPU: 1 PID: 842 at lib/debugobjects.c:485 debug_print_object+0x180/0x240 lib/debugobjects.c:485
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 1 PID: 842 Comm: kworker/u4:4 Tainted: G S                5.9.0-rc8+ #96
  Hardware name: linux,dummy-virt (DT)
  Workqueue: netns cleanup_net
  Call trace:
   dump_backtrace+0x0/0x4d8 include/linux/bitmap.h:239
   show_stack+0x34/0x48 arch/arm64/kernel/traps.c:142
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x174/0x1f8 lib/dump_stack.c:118
   panic+0x360/0x7a0 kernel/panic.c:231
   __warn+0x244/0x2ec kernel/panic.c:600
   report_bug+0x240/0x398 lib/bug.c:198
   bug_handler+0x50/0xc0 arch/arm64/kernel/traps.c:974
   call_break_hook+0x160/0x1d8 arch/arm64/kernel/debug-monitors.c:322
   brk_handler+0x30/0xc0 arch/arm64/kernel/debug-monitors.c:329
   do_debug_exception+0x184/0x340 arch/arm64/mm/fault.c:864
   el1_dbg+0x48/0xb0 arch/arm64/kernel/entry-common.c:65
   el1_sync_handler+0x170/0x1c8 arch/arm64/kernel/entry-common.c:93
   el1_sync+0x80/0x100 arch/arm64/kernel/entry.S:594
   debug_print_object+0x180/0x240 lib/debugobjects.c:485
   __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
   debug_check_no_obj_freed+0x200/0x430 lib/debugobjects.c:998
   slab_free_hook mm/slub.c:1536 [inline]
   slab_free_freelist_hook+0x190/0x210 mm/slub.c:1577
   slab_free mm/slub.c:3138 [inline]
   kfree+0x13c/0x460 mm/slub.c:4119
   bond_free_slave+0x8c/0xf8 drivers/net/bonding/bond_main.c:1492
   __bond_release_one+0xe0c/0xec8 drivers/net/bonding/bond_main.c:2190
   bond_slave_netdev_event drivers/net/bonding/bond_main.c:3309 [inline]
   bond_netdev_event+0x8f0/0xa70 drivers/net/bonding/bond_main.c:3420
   notifier_call_chain+0xf0/0x200 kernel/notifier.c:83
   __raw_notifier_call_chain kernel/notifier.c:361 [inline]
   raw_notifier_call_chain+0x44/0x58 kernel/notifier.c:368
   call_netdevice_notifiers_info+0xbc/0x150 net/core/dev.c:2033
   call_netdevice_notifiers_extack net/core/dev.c:2045 [inline]
   call_netdevice_notifiers net/core/dev.c:2059 [inline]
   rollback_registered_many+0x6a4/0xec0 net/core/dev.c:9347
   unregister_netdevice_many.part.0+0x2c/0x1c0 net/core/dev.c:10509
   unregister_netdevice_many net/core/dev.c:10508 [inline]
   default_device_exit_batch+0x294/0x338 net/core/dev.c:10992
   ops_exit_list.isra.0+0xec/0x150 net/core/net_namespace.c:189
   cleanup_net+0x44c/0x888 net/core/net_namespace.c:603
   process_one_work+0x96c/0x18c0 kernel/workqueue.c:2269
   worker_thread+0x3f0/0xc30 kernel/workqueue.c:2415
   kthread+0x390/0x498 kernel/kthread.c:292
   ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:925

This is a potential use-after-free if the sysfs nodes are being accessed
whilst removing the struct slave, so wait for the object destruction to
complete before freeing the struct slave itself.

Fixes: 07699f9a ("bonding: add sysfs /slave dir for bond slave devices.")
Fixes: a068aab4 ("bonding: Fix reference count leak in bond_sysfs_slave_add.")
Cc: Qiushi Wu <wu000273@umn.edu>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20201120142827.879226-1-jamie@nuviainc.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b9ad3e9f

Merge branch 's390-qeth-fixes-2020-11-20' · 207d0bfc

Jakub Kicinski authored Nov 20, 2020

Julian Wiedmann says:

====================
s390/qeth: fixes 2020-11-20

This brings several fixes for qeth's af_iucv-specific code paths.

Also one fix by Alexandra for the recently added BR_LEARNING_SYNC
support. We want to trust the feature indication bit, so that HW can
mask it out if there's any issues on their end.
====================

Link: https://lore.kernel.org/r/20201120090939.101406-1-jwi@linux.ibm.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

207d0bfc

s390/qeth: fix tear down of async TX buffers · 7ed10e16

Julian Wiedmann authored Nov 20, 2020

When qeth_iqd_tx_complete() detects that a TX buffer requires additional
async completion via QAOB, it might fail to replace the queue entry's
metadata (and ends up triggering recovery).

Assume now that the device gets torn down, overruling the recovery.
If the QAOB notification then arrives before the tear down has
sufficiently progressed, the buffer state is changed to
QETH_QDIO_BUF_HANDLED_DELAYED by qeth_qdio_handle_aob().

The tear down code calls qeth_drain_output_queue(), where
qeth_cleanup_handled_pending() will then attempt to replace such a
buffer _again_. If it succeeds this time, the buffer ends up dangling in
its replacement's ->next_pending list ... where it will never be freed,
since there's no further call to qeth_cleanup_handled_pending().

But the second attempt isn't actually needed, we can simply leave the
buffer on the queue and re-use it after a potential recovery has
completed. The qeth_clear_output_buffer() in qeth_drain_output_queue()
will ensure that it's in a clean state again.

Fixes: 72861ae7 ("qeth: recovery through asynchronous delivery")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7ed10e16

s390/qeth: fix af_iucv notification race · 8908f36d

Julian Wiedmann authored Nov 20, 2020

The two expected notification sequences are
1. TX_NOTIFY_PENDING with a subsequent TX_NOTIFY_DELAYED_*, when
   our TX completion code first observed the pending TX and the QAOB
   then completes at a later time; or
2. TX_NOTIFY_OK, when qeth_qdio_handle_aob() picked up the QAOB
   completion before our TX completion code even noticed that the TX
   was pending.

But as qeth_iqd_tx_complete() and qeth_qdio_handle_aob() can run
concurrently, we may end up with a race that results in a sequence of
TX_NOTIFY_DELAYED_* followed by TX_NOTIFY_PENDING. Which would confuse
the af_iucv code in its tracking of pending transmits.

Rework the notification code, so that qeth_qdio_handle_aob() defers its
notification if the TX completion code is still active.

Fixes: b3332930 ("qeth: add support for af_iucv HiperSockets transport")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8908f36d

s390/qeth: make af_iucv TX notification call more robust · 34c7f50f

Julian Wiedmann authored Nov 20, 2020

Calling into socket code is ugly already, at least check whether we are
dealing with the expected sk_family. Only looking at skb->protocol is
bound to cause troubles (consider eg. af_packet).

Fixes: b3332930 ("qeth: add support for af_iucv HiperSockets transport")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

34c7f50f

s390/qeth: Remove pnso workaround · 0d0e2b53

Alexandra Winter authored Nov 20, 2020

Remove workaround that supported early hardware implementations
of PNSO OC3. Rely on the 'enarf' feature bit instead.

Fixes: fa115adf ("s390/qeth: Detect PNSO OC3 capability")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
[jwi: use logical instead of bit-wise AND]
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0d0e2b53

Merge branch 'tcp-address-issues-with-ect0-not-being-set-in-dctcp-packets' · e10823c7

Jakub Kicinski authored Nov 20, 2020

Alexander Duyck says:

====================
tcp: Address issues with ECT0 not being set in DCTCP packets

This patch set is meant to address issues seen with SYN/ACK packets not
containing the ECT0 bit when DCTCP is configured as the congestion control
algorithm for a TCP socket.

A simple test using "tcpdump" and "test_progs -t bpf_tcp_ca" makes the
issue obvious. Looking at the packets will result in the SYN/ACK packet
with an ECT0 bit that does not match the other packets for the flow when
the congestion control agorithm is switch from the default. So for example
going from non-DCTCP to a DCTCP congestion control algorithm we will see
the SYN/ACK IPV6 header will not have ECT0 set while the other packets in
the flow will. Likewise if we switch from a default of DCTCP to cubic we
will see the ECT0 bit set in the SYN/ACK while the other packets in the
flow will not.
====================

Link: https://lore.kernel.org/r/160582070138.66684.11785214534154816097.stgit@localhost.localdomainSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e10823c7

tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control · 55472017

Alexander Duyck authored Nov 19, 2020

When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.

Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca

Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.

Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

55472017

tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5

Alexander Duyck authored Nov 19, 2020

An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

861602b5

20 Nov, 2020 8 commits

dpaa2-eth: select XGMAC_MDIO for MDIO bus support · d2624e70

Ioana Ciornei authored Nov 19, 2020

Explicitly enable the FSL_XGMAC_MDIO Kconfig option in order to have
MDIO access to internal and external PHYs.

Fixes: 71947923 ("dpaa2-eth: add MAC/PHY support through phylink")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20201119145106.712761-1-ciorneiioana@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d2624e70

cxgb4: fix the panic caused by non smac rewrite · bff45392

Raju Rangoju authored Nov 18, 2020

SMT entry is allocated only when loopback Source MAC
rewriting is requested. Accessing SMT entry for non
smac rewrite cases results in kernel panic.

Fix the panic caused by non smac rewrite

Fixes: 937d8420 ("cxgb4: set up filter action after rewrites")
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Link: https://lore.kernel.org/r/20201118143213.13319-1-rajur@chelsio.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bff45392

net/tls: missing received data after fast remote close · 20ffc7ad

Vadim Fedorenko authored Nov 19, 2020

In case when tcp socket received FIN after some data and the
parser haven't started before reading data caller will receive
an empty buffer. This behavior differs from plain TCP socket and
leads to special treating in user-space.
The flow that triggers the race is simple. Server sends small
amount of data right after the connection is configured to use TLS
and closes the connection. In this case receiver sees TLS Handshake
data, configures TLS socket right after Change Cipher Spec record.
While the configuration is in process, TCP socket receives small
Application Data record, Encrypted Alert record and FIN packet. So
the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with
SK_DONE bit set. The received data is not parsed upon arrival and is
never sent to user-space.

Patch unpauses parser directly if we have unparsed data in tcp
receive queue.

Fixes: fcf4793e ("tls: check RCV_SHUTDOWN in tls_wait_data")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ruSigned-off-by: Jakub Kicinski <kuba@kernel.org>

20ffc7ad

bnxt_en: Release PCI regions when DMA mask setup fails during probe. · c54bc3ce

Michael Chan authored Nov 20, 2020

Jump to init_err_release to cleanup. bnxt_unmap_bars() will also be
called but it will do nothing if the BARs are not mapped yet.

Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/1605858271-8209-1-git-send-email-michael.chan@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c54bc3ce

rose: Fix Null pointer dereference in rose_send_frame() · 3b3fd068

Anmol Karn authored Nov 20, 2020

rose_send_frame() dereferences `neigh->dev` when called from
rose_transmit_clear_request(), and the first occurrence of the
`neigh` is in rose_loopback_timer() as `rose_loopback_neigh`,
and it is initialized in rose_add_loopback_neigh() as NULL.
i.e when `rose_loopback_neigh` used in rose_loopback_timer()
its `->dev` was still NULL and rose_loopback_timer() was calling
rose_rx_call_request() without checking for NULL.

- net/rose/rose_link.c
This bug seems to get triggered in this line:

rose_call = (ax25_address *)neigh->dev->dev_addr;

Fix it by adding NULL checking for `rose_loopback_neigh->dev`
in rose_loopback_timer().

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3Signed-off-by: Anmol Karn <anmol.karan123@gmail.com>
Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3b3fd068

MAINTAINERS: Change Solarflare maintainers · f46e79aa

Martin Habets authored Nov 20, 2020

Email from solarflare.com will stop working. Update the maintainers.
A replacement for linux-net-drivers@solarflare.com is not working yet,
for now remove it.
Signed-off-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Link: https://lore.kernel.org/r/20201120113207.GA1605547@mh-desktopSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f46e79aa

bnxt_en: fix error return code in bnxt_init_board() · 3383176e

Zhang Changzhong authored Nov 19, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Link: https://lore.kernel.org/r/1605792621-6268-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3383176e

bnxt_en: fix error return code in bnxt_init_one() · b5f796b6

Zhang Changzhong authored Nov 18, 2020

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: c213eae8 ("bnxt_en: Improve VF/PF link change logic.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Link: https://lore.kernel.org/r/1605701851-20270-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b5f796b6

19 Nov, 2020 7 commits

Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 4d02da97

Linus Torvalds authored Nov 19, 2020

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.10-rc5, including fixes from the WiFi
  (mac80211), can and bpf (including the strncpy_from_user fix).

  Current release - regressions:

   - mac80211: fix memory leak of filtered powersave frames

   - mac80211: free sta in sta_info_insert_finish() on errors to avoid
     sleeping in atomic context

   - netlabel: fix an uninitialized variable warning added in -rc4

  Previous release - regressions:

   - vsock: forward all packets to the host when no H2G is registered,
     un-breaking AWS Nitro Enclaves

   - net: Exempt multicast addresses from five-second neighbor lifetime
     requirement, decreasing the chances neighbor tables fill up

   - net/tls: fix corrupted data in recvmsg

   - qed: fix ILT configuration of SRC block

   - can: m_can: process interrupt only when not runtime suspended

  Previous release - always broken:

   - page_frag: Recover from memory pressure by not recycling pages
     allocating from the reserves

   - strncpy_from_user: Mask out bytes after NUL terminator

   - ip_tunnels: Set tunnel option flag only when tunnel metadata is
     present, always setting it confuses Open vSwitch

   - bpf, sockmap:
      - Fix partial copy_page_to_iter so progress can still be made
      - Fix socket memory accounting and obeying SO_RCVBUF

   - net: Have netpoll bring-up DSA management interface

   - net: bridge: add missing counters to ndo_get_stats64 callback

   - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt

   - enetc: Workaround MDIO register access HW bug

   - net/ncsi: move netlink family registration to a subsystem init,
     instead of tying it to driver probe

   - net: ftgmac100: unregister NC-SI when removing driver to avoid
     crash

   - lan743x:
      - prevent interrupt storm on open
      - fix freeing skbs in the wrong context

   - net/mlx5e: Fix socket refcount leak on kTLS RX resync

   - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097

   - fix 21 unset return codes and other mistakes on error paths, mostly
     detected by the Hulk Robot"

* tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits)
  fail_function: Remove a redundant mutex unlock
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
  lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
  net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
  net/smc: fix matching of existing link groups
  ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
  libbpf: Fix VERSIONED_SYM_COUNT number parsing
  net/mlx4_core: Fix init_hca fields offset
  atm: nicstar: Unmap DMA on send error
  page_frag: Recover from memory pressure
  net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset
  mlxsw: core: Use variable timeout for EMAD retries
  mlxsw: Fix firmware flashing
  net: Have netpoll bring-up DSA management interface
  atl1e: fix error return code in atl1e_probe()
  atl1c: fix error return code in atl1c_probe()
  ah6: fix error return code in ah6_input()
  net: usb: qmi_wwan: Set DTR quirk for MR400
  can: m_can: process interrupt only when not runtime suspended
  can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery
  ...

4d02da97

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 3be28e93

Linus Torvalds authored Nov 19, 2020

Pull rdma fixes from Jason Gunthorpe:
 "The last two weeks have been quiet here, just the usual smattering of
  long standing bug fixes.

  A collection of error case bug fixes:

   - Improper nesting of spinlock types in cm

   - Missing error codes and kfree()

   - Ensure dma_virt_ops users have the right kconfig symbols to work
     properly

   - Compilation failure of tools/testing"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  tools/testing/scatterlist: Fix test to compile and run
  IB/hfi1: Fix error return code in hfi1_init_dd()
  RMDA/sw: Don't allow drivers using dma_virt_ops on highmem configs
  RDMA/pvrdma: Fix missing kfree() in pvrdma_register_device()
  RDMA/cm: Make the local_id_table xarray non-irq

3be28e93

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e6ea60ba

Jakub Kicinski authored Nov 19, 2020

Alexei Starovoitov says:

====================
1) libbpf should not attempt to load unused subprogs, from Andrii.

2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.

3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.

4) Fix several sockmap issues, from John.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  fail_function: Remove a redundant mutex unlock
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
  lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
  libbpf: Fix VERSIONED_SYM_COUNT number parsing
  bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
  bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
  bpf, sockmap: Use truesize with sk_rmem_schedule()
  bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
  bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
  selftests/bpf: Fix error return code in run_getsockopt_test()
  bpf: Relax return code check for subprograms
  tools, bpftool: Add missing close before bpftool net attach exit
  MAINTAINERS/bpf: Update Andrii's entry.
  selftests/bpf: Fix unused attribute usage in subprogs_unused test
  bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
  bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
  libbpf: Don't attempt to load unused subprog as an entry-point BPF program
====================

Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e6ea60ba

fail_function: Remove a redundant mutex unlock · 2801a5da

Luo Meng authored Nov 18, 2020

Fix a mutex_unlock() issue where before copy_from_user() is
not called mutex_locked.

Fixes: 4b1a29a7 ("error-injection: Support fault injection framework")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/160570737118.263807.8358435412898356284.stgit@devnote2

2801a5da

Merge branch 'Fix bpf_probe_read_user_str() overcopying' · 14d6d86c

Alexei Starovoitov authored Nov 19, 2020

Daniel Xu says:

====================

6ae08ae3 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
kernel}_str helpers") introduced a subtle bug where
bpf_probe_read_user_str() would potentially copy a few extra bytes after
the NUL terminator.

This issue is particularly nefarious when strings are used as map keys,
as seemingly identical strings can occupy multiple entries in a map.

This patchset fixes the issue and introduces a selftest to prevent
future regressions.

v6 -> v7:
* Add comments

v5 -> v6:
* zero-pad up to sizeof(unsigned long) after NUL

v4 -> v5:
* don't read potentially uninitialized memory

v3 -> v4:
* directly pass userspace pointer to prog
* test more strings of different length

v2 -> v3:
* set pid filter before attaching prog in selftest
* use long instead of int as bpf_probe_read_user_str() retval
* style changes

v1 -> v2:
* add Fixes: tag
* add selftest
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

14d6d86c

selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL · c8a36aed

Daniel Xu authored Nov 17, 2020

Previously, bpf_probe_read_user_str() could potentially overcopy the
trailing bytes after the NUL due to how do_strncpy_from_user() does the
copy in long-sized strides. The issue has been fixed in the previous
commit.

This commit adds a selftest that ensures we don't regress
bpf_probe_read_user_str() again.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/4d977508fab4ec5b7b574b85bdf8b398868b6ee9.1605642949.git.dxu@dxuuu.xyz

c8a36aed

lib/strncpy_from_user.c: Mask out bytes after NUL terminator. · 6fa6d280

Daniel Xu authored Nov 17, 2020

do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
destination buffer. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit masks out the bytes following the NUL while preserving
long-sized stride in the fast path.

Fixes: 6ae08ae3 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/21efc982b3e9f2f7b0379eed642294caaa0c27a7.1605642949.git.dxu@dxuuu.xyz

6fa6d280