1. 30 Jul, 2014 4 commits
  2. 29 Jul, 2014 36 commits
    • ARC: Implement ptrace(PTRACE_GET_THREAD_AREA) · 70e52877
      Anton Kolesov authored
      commit a4b6cb73 upstream.
      
      This patch adds an implementation of the GET_THREAD_AREA ptrace request
      type, which GDB requires in order to debug NPTL applications.
      Signed-off-by: Anton Kolesov <Anton.Kolesov@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ARM: dts: imx: Add alias for ethernet controller · 67aa6a14
      Marek Vasut authored
      commit 22970070 upstream.
      
      Add an alias for FEC ethernet on i.MX to allow bootloaders (like U-Boot)
      to patch in the MAC address for FEC using this alias.
      Signed-off-by: Marek Vasut <marex@denx.de>
      Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • aio: protect reqs_available updates from changes in interrupt handlers · 60714352
      Benjamin LaHaise authored
      commit 263782c1 upstream.
      
      As of commit f8567a38 it is now possible to
      have put_reqs_available() called from irq context.  While put_reqs_available()
      is per cpu, it did not protect itself from interrupts on the same CPU.  This
      led to aio_complete() corrupting the available io requests count when run
      under heavy O_DIRECT workloads, as reported by Robert Elliott.  Fix this by
      disabling interrupts around the per cpu batch updates of reqs_available.
      
      Many thanks to Robert and folks for testing and tracking this down.
      Reported-by: Robert Elliot <Elliott@hp.com>
      Tested-by: Robert Elliot <Elliott@hp.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • sched: Fix possible divide by zero in avg_atom() calculation · 9f8d4874
      Mateusz Guzik authored
      commit b0ab99e7 upstream.
      
      proc_sched_show_task() does:
      
        if (nr_switches)
      	do_div(avg_atom, nr_switches);
      
      nr_switches is unsigned long and do_div truncates it to 32 bits, which
      means it can test non-zero on e.g. x86-64 and be truncated to zero for
      division.
      
      Fix the problem by using div64_ul() instead.
      
      As a side effect calculations of avg_atom for big nr_switches are now correct.
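The truncation described above can be reproduced in user space. The following is a minimal sketch, not the kernel code: the helper names are hypothetical, and uint64_t stands in for the 64-bit unsigned long found on x86-64.

```c
#include <stdint.h>

/*
 * Models what happens to the divisor under do_div(): only the low
 * 32 bits survive. (Hypothetical helper, for illustration only.)
 */
static uint32_t truncated_divisor(uint64_t nr_switches)
{
    return (uint32_t)nr_switches;
}

/*
 * Models the div64_ul() approach: the divisor keeps all 64 bits, so the
 * non-zero test and the division see the same value.
 */
static uint64_t avg_atom_full(uint64_t sum_exec_runtime, uint64_t nr_switches)
{
    return nr_switches ? sum_exec_runtime / nr_switches : 0;
}
```

With nr_switches equal to 1ULL << 32, the non-zero test passes but the truncated divisor is 0, which is the divide-by-zero case the commit describes.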
      Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • locking/mutex: Disable optimistic spinning on some architectures · 91b2716a
      Peter Zijlstra authored
      commit 4badad35 upstream.
      
      The optimistic spin code assumes regular stores and cmpxchg() play nice;
      this is found to not be true for at least: parisc, sparc32, tile32,
      metag-lock1, arc-!llsc and hexagon.
      
      There is further wreckage, but this in particular seemed easy to
      trigger, so blacklist this.
      
      Opt in for known good archs.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Waiman Long <waiman.long@hp.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • PM / sleep: Fix request_firmware() error at resume · d5f32654
      Takashi Iwai authored
      commit 4320f6b1 upstream.
      
      The commit [247bc037: PM / Sleep: Mitigate race between the freezer
      and request_firmware()] introduced the finer state control, but it
      also leads to a new bug; for example, a bug report regarding the
      firmware loading of intel BT device at suspend/resume:
        https://bugzilla.novell.com/show_bug.cgi?id=873790
      
      The root cause seems to be a small window between the process resume
      and the clear of usermodehelper lock.  The request_firmware() function
      checks the UMH lock and gives up when it's in UMH_DISABLE state.  This
      avoids invalid f/w loading during the suspend/resume phase.
      The problem is, however, that usermodehelper_enable() is called at the
      end of thaw_processes().  Thus, a thawed process in between can kick
      off the f/w loader code path (in this case, via btusb_setup_intel())
      even before the call of usermodehelper_enable().  Then
      usermodehelper_read_trylock() returns an error and request_firmware()
      spews WARN_ON() in the end.
      
      This oneliner patch fixes the issue just by setting to UMH_FREEZING
      state again before restarting tasks, so that the call of
      request_firmware() will be blocked until the end of this function
      instead of returning an error.
      
      Fixes: 247bc037 (PM / Sleep: Mitigate race between the freezer and request_firmware())
      Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • dm cache metadata: do not allow the data block size to change · e0779220
      Mike Snitzer authored
      commit 048e5a07 upstream.
      
      The block size for the dm-cache's data device must remain fixed for
      the life of the cache.  Disallow any attempt to change the cache's data
      block size.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • dm thin metadata: do not allow the data block size to change · c355b2e3
      Mike Snitzer authored
      commit 9aec8629 upstream.
      
      The block size for the thin-pool's data device must remain fixed for
      the life of the thin-pool.  Disallow any attempt to change the
      thin-pool's data block size.
      
      It should be noted that attempting to change the data block size via
      thin-pool table reload will be ignored as a side-effect of the thin-pool
      handover that the thin-pool target does during thin-pool table reload.
      
      Here is an example outcome of attempting to load a thin-pool table that
      reduced the thin-pool's data block size from 1024K to 512K.
      
      Before:
      kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks
      
      After:
      kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported
      kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object
      kernel: device-mapper: ioctl: error adding target to table
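The numbers in the "After" log (2048 and 1024) match the 1024K and 512K block sizes if they are counted in 512-byte sectors. The sketch below illustrates that conversion and the kind of check the patch describes. It is an assumption-laden model, not the dm-thin code: the function names and the -EINVAL return value are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* assumed sector size for the log values */

/* Convert a block size given in KiB to 512-byte sectors. */
static uint32_t kib_to_sectors(uint32_t kib)
{
    return kib * 1024u / SECTOR_SIZE;
}

/*
 * Hypothetical check: refuse a table load whose data block size differs
 * from the one already recorded in the metadata.
 */
static int check_data_block_size(uint32_t stored_sectors,
                                 uint32_t requested_sectors)
{
    return stored_sectors == requested_sectors ? 0 : -EINVAL;
}
```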
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mtd: devices: elm: fix elm_context_save() and elm_context_restore() functions · b2a13535
      Ted Juan authored
      commit 6938ad40 upstream.
      
      The switch statements in these two functions lack a 'break', which makes
      them always return an error.
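A simplified model of this class of bug is sketched below. The enum, the function names, and the counter are hypothetical and only illustrate how a missing 'break' lets control fall into an error-returning default case; this is not the driver code.

```c
#include <errno.h>

enum demo_bch_type { DEMO_BCH4, DEMO_BCH8 };  /* hypothetical values */

/* Buggy shape: every valid case falls into 'default' and returns an error. */
static int demo_save_buggy(enum demo_bch_type type, int *saved)
{
    switch (type) {
    case DEMO_BCH8:
        *saved += 8;
        /* falls through - the larger mode also saves the smaller part */
    case DEMO_BCH4:
        *saved += 4;
        /* falls through - this is the bug: 'break' is missing here */
    default:
        return -EINVAL;
    }
    return 0;
}

/* Fixed shape: 'break' stops valid cases from reaching 'default'. */
static int demo_save_fixed(enum demo_bch_type type, int *saved)
{
    switch (type) {
    case DEMO_BCH8:
        *saved += 8;
        /* falls through - the larger mode also saves the smaller part */
    case DEMO_BCH4:
        *saved += 4;
        break;
    default:
        return -EINVAL;
    }
    return 0;
}
```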
      Signed-off-by: Ted Juan <ted.juan@gmail.com>
      Acked-by: Pekon Gupta <pekon@ti.com>
      Signed-off-by: Brian Norris <computersforpeace@gmail.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • alarmtimer: Fix bug where relative alarm timers were treated as absolute · bde32a05
      John Stultz authored
      commit 16927776 upstream.
      
      Sharvil noticed that with the POSIX timer_settime() interface, when using
      the CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, a timer that the
      user specified as relative would incorrectly be treated as absolute,
      regardless of the state of the flags argument.
      
      This patch corrects this, properly checking the absolute/relative flag,
      as well as adds further error checking that no invalid flag bits are set.
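The two checks described above can be sketched in user space as follows. This is an illustrative model under stated assumptions, not the alarmtimer code: the helper names are hypothetical, and times are plain nanosecond counts.

```c
#include <errno.h>
#include <time.h>

#ifndef TIMER_ABSTIME
#define TIMER_ABSTIME 1  /* fallback; matches the value used on Linux */
#endif

/* Reject any flag bit other than TIMER_ABSTIME. */
static int demo_check_timer_flags(int flags)
{
    if (flags & ~TIMER_ABSTIME)
        return -EINVAL;
    return 0;
}

/*
 * Compute the expiry time: an absolute value is used as-is, a relative
 * value is added to the current time.
 */
static long long demo_expiry_ns(int flags, long long now_ns, long long value_ns)
{
    return (flags & TIMER_ABSTIME) ? value_ns : now_ns + value_ns;
}
```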
      Reported-by: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • drm/radeon: avoid leaking edid data · 9fc6b111
      Alex Deucher authored
      commit 0ac66eff upstream.
      
      In some cases we fetch the edid in the detect() callback
      in order to determine what sort of monitor is connected.
      If that happens, don't fetch the edid again in the get_modes()
      callback or we will leak the edid.
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • drm/qxl: return IRQ_NONE if it was not our irq · 5cdebe88
      Jason Wang authored
      commit fbb60fe3 upstream.
      
      Return IRQ_NONE if it was not our irq. This is necessary for the case
      when qxl is sharing an irq line with a device A in a crash kernel. If qxl
      is initialized before A and A's irq is raised during this gap,
      returning IRQ_HANDLED will cause this irq to be raised
      again after EOI, since the kernel thinks it was handled when in fact it
      was not.
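The pattern for a handler on a shared interrupt line can be sketched as below. The enum mirrors the kernel's IRQ_NONE and IRQ_HANDLED values, but the handler and its 'pending' argument are hypothetical stand-ins for reading the device's own interrupt status.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's irqreturn_t values. */
enum demo_irqreturn { DEMO_IRQ_NONE = 0, DEMO_IRQ_HANDLED = 1 };

/*
 * 'pending' models the device's own pending-interrupt bits. When none
 * are set, the interrupt came from another device on the shared line,
 * so the handler must not claim it.
 */
static enum demo_irqreturn demo_irq_handler(uint32_t pending)
{
    if (!pending)
        return DEMO_IRQ_NONE;

    /* ... acknowledge and process the device's events here ... */
    return DEMO_IRQ_HANDLED;
}
```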
      
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • drm/radeon: set default bl level to something reasonable · ef34ede9
      Alex Deucher authored
      commit 201bb624 upstream.
      
      If the value in the scratch register is 0, set it to the
      max level.  This fixes an issue where the console fb blanking
      code calls back into the backlight driver on unblank and then
      sets the backlight level to 0 after the driver has already
      set the mode and enabled the backlight.
      
      bugs:
      https://bugs.freedesktop.org/show_bug.cgi?id=81382
      https://bugs.freedesktop.org/show_bug.cgi?id=70207
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Tested-by: David Heidelberger <david.heidelberger@ixit.cz>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • irqchip: gic: Fix core ID calculation when topology is read from DT · 5ef74eaa
      Tomasz Figa authored
      commit 29e697b1 upstream.
      
      Certain GIC implementations, namely those found on earlier, single
      cluster, Exynos SoCs, have registers mapped without per-CPU banking,
      which means that the driver needs to use a different offset for each CPU.

      Currently the driver calculates the offset by multiplying the value returned
      by cpu_logical_map() by the CPU offset parsed from DT. This is correct when
      CPU topology is not specified in DT and the aforementioned function returns
      the core ID alone. However, when DT contains CPU topology, the function
      returns the cluster ID as well, which is non-zero on the mentioned
      SoCs and so breaks the calculation in the GIC driver.
      
      This patch fixes this by masking out cluster ID in CPU offset
      calculation so that only core ID is considered. Multi-cluster Exynos
      SoCs already have banked GIC implementations, so this simple fix should
      be enough.
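The effect of masking out the cluster ID can be shown with a small calculation. This sketch assumes the core ID sits in the low 8 bits of the value returned by cpu_logical_map() (affinity level 0) with the cluster ID in the bits above it; the function names and the example values are hypothetical.

```c
#include <stdint.h>

#define DEMO_CORE_ID_MASK 0xffu  /* assumed: core ID in bits [7:0] */

/* Old calculation: the cluster ID bits leak into the offset. */
static uint32_t demo_offset_buggy(uint32_t logical_map, uint32_t percpu_offset)
{
    return logical_map * percpu_offset;
}

/* New calculation: only the core ID contributes to the offset. */
static uint32_t demo_offset_fixed(uint32_t logical_map, uint32_t percpu_offset)
{
    return (logical_map & DEMO_CORE_ID_MASK) * percpu_offset;
}
```

For a CPU reported as cluster 1, core 2 (0x102) and a per-CPU offset of 0x8000, the fixed variant yields 2 * 0x8000, while the buggy variant multiplies the whole 0x102.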
      Reported-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Tomasz Figa <t.figa@samsung.com>
      Fixes: db0d4db2 ("ARM: gic: allow GIC to support non-banked setups")
      Link: https://lkml.kernel.org/r/1405610624-18722-1-git-send-email-t.figa@samsung.com
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • irqchip: gic: Add support for cortex a7 compatible string · 26edd8dd
      Matthias Brugger authored
      commit a97e8027 upstream.
      
      Patch 0a68214b "ARM: DT: Add binding for GIC virtualization extentions (VGIC)" added
      the "arm,cortex-a7-gic" compatible string, but the corresponding IRQCHIP_DECLARE
      was never added to the gic driver.
      
      To let real Cortex-A7 SoCs use it, add the necessary declaration to the device driver.
      Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
      Link: https://lkml.kernel.org/r/1404388732-28890-1-git-send-email-matthias.bgg@gmail.com
      Fixes: 0a68214b ("ARM: DT: Add binding for GIC virtualization extentions (VGIC)")
      Signed-off-by: Jason Cooper <jason@lakedaemon.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ring-buffer: Fix polling on trace_pipe · 9b1829d6
      Martin Lau authored
      commit 97b8ee84 upstream.
      
      ring_buffer_poll_wait() should always put the poll_table on its wait_queue,
      even if there is data immediately available.  Otherwise, the following epoll
      and read sequence will eventually hang forever:
      
      1. Put some data to make the trace_pipe ring_buffer read ready first
      2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
      3. epoll_wait()
      4. read(trace_pipe_fd) till EAGAIN
      5. Add some more data to the trace_pipe ring_buffer
      6. epoll_wait() -> this epoll_wait() will block forever
      
      ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
        ring_buffer_poll_wait() returns immediately without adding poll_table,
        which has poll_table->_qproc pointing to ep_poll_callback(), to its
        wait_queue.
      ~ During the epoll_wait() call in step 3 and step 6,
        ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
        because the poll_table->_qproc is NULL and it is how epoll works.
      ~ When there is new data available in step 6, ring_buffer does not know
        it has to call ep_poll_callback() because it is not in its wait queue.
        Hence, block forever.
      
      Other poll implementations seem to call poll_wait() unconditionally as the
      very first thing they do.  For example, tcp_poll() in tcp.c.
      
      Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
      
      Fixes: 2a2cc8f7 "ftrace: allow the event pipe to be polled"
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Martin Lau <kafai@fb.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mwifiex: fix Tx timeout issue · a113b6a5
      Amitkumar Karwar authored
      commit d76744a9 upstream.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=70191
      https://bugzilla.kernel.org/show_bug.cgi?id=77581
      
      It is observed that sometimes Tx packet is downloaded without
      adding driver's txpd header. This results in firmware parsing
      garbage data as packet length. Sometimes firmware is unable
      to read the packet if length comes out as invalid. This stops
      further traffic and timeout occurs.
      
      The root cause is that uninitialized fields in the packet's tx_info
      (skb->cb) could hold garbage values. If the
      MWIFIEX_BUF_FLAG_REQUEUED_PKT flag happened to be set as a result, the
      txpd header was skipped. This patch makes sure that tx_info is
      correctly initialized to fix the problem.
      Reported-by: Andrew Wiley <wiley.andrew.j@gmail.com>
      Reported-by: Linus Gasser <list@markas-al-nour.org>
      Reported-by: Michael Hirsch <hirsch@teufel.de>
      Tested-by: Xinming Hu <huxm@marvell.com>
      Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
      Signed-off-by: Maithili Hinge <maithili@marvell.com>
      Signed-off-by: Avinash Patil <patila@marvell.com>
      Signed-off-by: Bing Zhao <bzhao@marvell.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • perf/x86/intel: ignore CondChgd bit to avoid false NMI handling · d4f6852e
      HATAYAMA Daisuke authored
      commit b292d7a1 upstream.
      
      Currently, any NMI is falsely handled by the NMI watchdog's NMI handler
      if the CondChgd bit in the MSR_CORE_PERF_GLOBAL_STATUS MSR is set.

      For example, we use an external NMI to make the system panic and get a
      crash dump, but in this case the external NMI is falsely handled due to
      this issue.
      
      This commit deals with the issue simply by ignoring CondChgd bit.
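The idea of ignoring the CondChgd bit can be sketched as a mask applied before testing the status word. The bit position (63) comes from the table excerpt quoted in this message; the macro and function names are hypothetical and this is not the perf code itself.

```c
#include <stdint.h>

#define DEMO_COND_CHGD (1ULL << 63)  /* CondChgd, bit 63 of the status MSR */

/*
 * Decide whether an NMI belongs to the performance monitoring handler:
 * only the remaining status bits count, CondChgd alone does not.
 */
static int demo_nmi_is_ours(uint64_t global_status)
{
    return (global_status & ~DEMO_COND_CHGD) != 0;
}
```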
      
      Here is explanation in detail.
      
      On x86 NMI watchdog uses performance monitoring feature to
      periodically signal NMI each time performance counter gets overflowed.
      
      intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
      handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
      owner of a given NMI by looking at overflow status bits in
      MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
      handles the given NMI as its own NMI.
      
      The problem is that the intel_pmu_handle_irq() doesn't distinguish
      CondChgd bit from other bits. Unlike the other status bits, CondChgd
      bit doesn't represent overflow status for performance counters. Thus,
      CondChgd bit cannot be thought of as a mark indicating a given NMI is
      NMI watchdog's.
      
      As a result, if CondChgd bit is set, any NMI is falsely handled by the
      NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
      is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
      action is never performed until CondChgd bit is cleared.
      
      I noticed this behavior on systems with Ivy Bridge processors: Intel
      Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
      CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
      in the beginning at boot. Then the CondChgd bit is immediately cleared
      by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
      0.
      
      On the other hand, on older processors such as Nehalem, Xeon E7540,
      CondChgd bit is not set in the beginning at boot.
      
      I'm not sure about exact behavior of CondChgd bit, in particular when
      this bit is set. Although I read Intel System Programmer's Manual to
      figure out that, the descriptions I found are:
      
        In 18.9.1:
      
        "The MSR_PERF_GLOBAL_STATUS MSR also provides a 'sticky bit' to
         indicate changes to the state of performance monitoring hardware"
      
        In Table 35-2 IA-32 Architectural MSRs
      
        63 CondChg: status bits of this register has changed.
      
      These are different from the behaviour I see on the actual system as I
      explained above.
      
      At least, I think ignoring CondChgd bit should be enough for NMI
      watchdog perspective.
      Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ipv4: fix buffer overflow in ip_options_compile() · 5ac7d165
      Eric Dumazet authored
      [ Upstream commit 10ec9472 ]
      
      There is a benign buffer overflow in ip_options_compile spotted by
      AddressSanitizer[1] :
      
      It's benign because we can always access one extra byte in skb->head
      (because header is followed by struct skb_shared_info), and in this case
      this byte is not even used.
      
      [28504.910798] ==================================================================
      [28504.912046] AddressSanitizer: heap-buffer-overflow in ip_options_compile
      [28504.913170] Read of size 1 by thread T15843:
      [28504.914026]  [<ffffffff81802f91>] ip_options_compile+0x121/0x9c0
      [28504.915394]  [<ffffffff81804a0d>] ip_options_get_from_user+0xad/0x120
      [28504.916843]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.918175]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.919490]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.920835]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.922208]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.923459]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.924722]
      [28504.925106] Allocated by thread T15843:
      [28504.925815]  [<ffffffff81804995>] ip_options_get_from_user+0x35/0x120
      [28504.926884]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.927975]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.929175]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.930400]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.931677]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.932851]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.934018]
      [28504.934377] The buggy address ffff880026382828 is located 0 bytes to the right
      [28504.934377]  of 40-byte region [ffff880026382800, ffff880026382828)
      [28504.937144]
      [28504.937474] Memory state around the buggy address:
      [28504.938430]  ffff880026382300: ........ rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.939884]  ffff880026382400: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.941294]  ffff880026382500: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.942504]  ffff880026382600: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.943483]  ffff880026382700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.944511] >ffff880026382800: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.945573]                         ^
      [28504.946277]  ffff880026382900: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.094949]  ffff880026382a00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.096114]  ffff880026382b00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.097116]  ffff880026382c00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.098472]  ffff880026382d00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.099804] Legend:
      [28505.100269]  f - 8 freed bytes
      [28505.100884]  r - 8 redzone bytes
      [28505.101649]  . - 8 allocated bytes
      [28505.102406]  x=1..7 - x allocated bytes + (8-x) redzone bytes
      [28505.103637] ==================================================================
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • dns_resolver: assure that dns_query() result is null-terminated · d48784b0
      Manuel Schölling authored
      [ Upstream commit 84a7c0b1 ]
      
      dns_query() credulously assumes that keys are null-terminated and
      returns a copy of a memory block that is off by one.
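A safe way to copy a length-delimited payload into a C string is sketched below. This is a user-space illustration of the general technique, with a hypothetical function name, and not the dns_query() code: it allocates one extra byte and writes the terminator itself instead of trusting the source to contain one.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy 'len' bytes from 'data' and always append a NUL terminator.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static char *demo_copy_terminated(const void *data, size_t len)
{
    char *out = malloc(len + 1);

    if (!out)
        return NULL;
    memcpy(out, data, len);   /* copy exactly the payload, no more */
    out[len] = '\0';          /* terminate explicitly */
    return out;
}
```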
      Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • sunvnet: clean up objects created in vnet_new() on vnet_exit() · 54a445fe
      Sowmini Varadhan authored
      [ Upstream commit a4b70a07 ]
      
      Nothing cleans up the objects created by
      vnet_new(); they are completely leaked.
      
      vnet_exit(), after doing the vio_unregister_driver() to clean
      up ports, should call a helper function that iterates over vnet_list
      and cleans up those objects. This includes unregister_netdevice()
      as well as free_netdev().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: Karl Volz <karl.volz@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • net: pppoe: use correct channel MTU when using Multilink PPP · b06db24c
      Christoph Schulz authored
      [ Upstream commit a8a3e41c ]
      
      The PPP channel MTU is used with Multilink PPP when ppp_mp_explode() (see
      ppp_generic module) tries to determine how big a fragment might be. According
      to RFC 1661, the MTU excludes the 2-byte PPP protocol field, see the
      corresponding comment and code in ppp_mp_explode():
      
      		/*
      		 * hdrlen includes the 2-byte PPP protocol field, but the
      		 * MTU counts only the payload excluding the protocol field.
      		 * (RFC1661 Section 2)
      		 */
      		mtu = pch->chan->mtu - (hdrlen - 2);
      
      However, the pppoe module *does* include the PPP protocol field in the channel
      MTU, which is wrong as it causes the PPP payload to be 1-2 bytes too big under
      certain circumstances (one byte if PPP protocol compression is used, two
      otherwise), causing the generated Ethernet packets to be dropped. So the pppoe
      module has to subtract two bytes from the channel MTU. This error only
      manifests itself when using Multilink PPP, as otherwise the channel MTU is not
      used anywhere.
      
      In the following, I will describe how to reproduce this bug. We configure two
      pppd instances for multilink PPP over two PPPoE links, say eth2 and eth3, with
      an MTU of 1492 bytes for each link and an MRRU of 2976 bytes. (This MRRU is
      computed by adding the two link MTUs and subtracting the MP header twice, which
      is 4 bytes long.) The necessary pppd statements on both sides are "multilink
      mtu 1492 mru 1492 mrru 2976". On the client side, we additionally need "plugin
      rp-pppoe.so eth2" and "plugin rp-pppoe.so eth3", respectively; on the server
      side, we additionally need to start two pppoe-server instances to be able to
      establish two PPPoE sessions, one over eth2 and one over eth3. We set the MTU
      of the PPP network interface to the MRRU (2976) on both sides of the connection
      in order to make use of the higher bandwidth. (If we didn't do that, IP
      fragmentation would kick in, which we want to avoid.)
      
      Now we send an ICMPv4 echo request with a payload of 2948 bytes from client to
      server over the PPP link. This results in the following network packet:
      
         2948 (echo payload)
       +    8 (ICMPv4 header)
       +   20 (IPv4 header)
      ---------------------
         2976 (PPP payload)
      
      These 2976 bytes do not exceed the MTU of the PPP network interface, so the
      IP packet is not fragmented. Now the multilink PPP code in ppp_mp_explode()
      prepends one protocol byte (0x21 for IPv4), making the packet one byte bigger
      than the negotiated MRRU. So this packet would have to be divided into three
      fragments. But this does not happen as each link MTU is assumed to be two bytes
      larger. So this packet is divided into two fragments only, one of size 1489 and
      one of size 1488. Now we have for that bigger fragment:
      
         1489 (PPP payload)
       +    4 (MP header)
       +    2 (PPP protocol field for the MP payload (0x3d))
       +    6 (PPPoE header)
      --------------------------
         1501 (Ethernet payload)
      
      This packet exceeds the link MTU and is discarded.
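The arithmetic in this walkthrough can be checked with a few lines of C. The header sizes are the ones stated in the message; the Ethernet MTU of 1500 bytes is an assumption (the usual default), and the function names are hypothetical.

```c
enum {
    DEMO_ECHO_PAYLOAD = 2948,  /* ICMP echo payload */
    DEMO_ICMP_HDR     = 8,
    DEMO_IPV4_HDR     = 20,
    DEMO_MP_HDR       = 4,     /* multilink header */
    DEMO_PPP_PROTO    = 2,     /* PPP protocol field (0x3d) */
    DEMO_PPPOE_HDR    = 6,
    DEMO_ETH_MTU      = 1500   /* assumed Ethernet MTU */
};

/* Size of the IP packet handed to the PPP interface. */
static int demo_ppp_payload(void)
{
    return DEMO_ECHO_PAYLOAD + DEMO_ICMP_HDR + DEMO_IPV4_HDR;
}

/* MRRU: the sum of the link MTUs minus one MP header per link. */
static int demo_mrru(int link_mtu, int links)
{
    return links * (link_mtu - DEMO_MP_HDR);
}

/* Ethernet payload produced by one multilink fragment. */
static int demo_eth_payload(int fragment)
{
    return fragment + DEMO_MP_HDR + DEMO_PPP_PROTO + DEMO_PPPOE_HDR;
}
```

The 1489-byte fragment yields an Ethernet payload of 1501 bytes, one more than the assumed 1500-byte MTU, while the 1488-byte fragment fits exactly.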
      
      If one configures the link MTU on the client side to 1501, one can see the
      discarded Ethernet frames with tcpdump running on the client. A
      
      ping -s 2948 -c 1 192.168.15.254
      
      leads to the smaller fragment that is correctly received on the server side:
      
      (tcpdump -vvvne -i eth3 pppoes and ppp proto 0x3d)
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x3] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [end], length 1492
      
      and to the bigger fragment that is not received on the server side:
      
      (tcpdump -vvvne -i eth2 pppoes and ppp proto 0x3d)
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1515: PPPoE  [ses 0x5] MLPPP (0x003d), length 1495: seq 0x000,
        Flags [begin], length 1493
      
      With the patch below, we correctly obtain three fragments:
      
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [begin], length 1492
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [none], length 1492
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 27: PPPoE  [ses 0x1] MLPPP (0x003d), length 7: seq 0x000,
        Flags [end], length 5
      
      And the ICMPv4 echo request is successfully received at the server side:
      
      IP (tos 0x0, ttl 64, id 21925, offset 0, flags [DF], proto ICMP (1),
        length 2976)
          192.168.222.2 > 192.168.15.254: ICMP echo request, id 30530, seq 0,
            length 2956
      
      The bug was introduced in commit c9aa6895
      ("[PPPOE]: Advertise PPPoE MTU") from the very beginning. This patch applies
      to 3.10 upwards but the fix can be applied (with minor modifications) to
      kernels as old as 2.6.32.
      Signed-off-by: Christoph Schulz <develop@kristov.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      b06db24c
    • Daniel Borkmann's avatar
      net: sctp: fix information leaks in ulpevent layer · a96dcc00
      Daniel Borkmann authored
      [ Upstream commit 8f2e5ae4 ]
      
      While working on some other SCTP code, I noticed that some
      structures shared with user space are leaking uninitialized
      stack or heap buffer. In particular, struct sctp_sndrcvinfo
      has a 2 bytes hole between .sinfo_flags and .sinfo_ppid that
      remains unfilled by us in sctp_ulpevent_read_sndrcvinfo() when
      putting this into cmsg. But also struct sctp_remote_error
      contains a 2 bytes hole that we don't fill but place into a skb
      through skb_copy_expand() via sctp_ulpevent_make_remote_error().
      
      Both structures are defined by the IETF in RFC6458:
      
      * Section 5.3.2. SCTP Header Information Structure:
      
        The sctp_sndrcvinfo structure is defined below:
      
        struct sctp_sndrcvinfo {
          uint16_t sinfo_stream;
          uint16_t sinfo_ssn;
          uint16_t sinfo_flags;
          <-- 2 bytes hole  -->
          uint32_t sinfo_ppid;
          uint32_t sinfo_context;
          uint32_t sinfo_timetolive;
          uint32_t sinfo_tsn;
          uint32_t sinfo_cumtsn;
          sctp_assoc_t sinfo_assoc_id;
        };
      
      * 6.1.3. SCTP_REMOTE_ERROR:
      
        A remote peer may send an Operation Error message to its peer.
        This message indicates a variety of error conditions on an
        association. The entire ERROR chunk as it appears on the wire
        is included in an SCTP_REMOTE_ERROR event. Please refer to the
        SCTP specification [RFC4960] and any extensions for a list of
        possible error formats. An SCTP error notification has the
        following format:
      
        struct sctp_remote_error {
          uint16_t sre_type;
          uint16_t sre_flags;
          uint32_t sre_length;
          uint16_t sre_error;
          <-- 2 bytes hole  -->
          sctp_assoc_t sre_assoc_id;
          uint8_t  sre_data[];
        };
      
      Fix this by setting both to 0 before filling them out. We also
      have other structures shared between user and kernel space in
      SCTP that contain holes (e.g. struct sctp_paddrthlds), but we
      copy that buffer over from user space first and thus don't need
      to care about it in those cases.
      
      While at it, we can also remove lengthy comments copied from
      the draft; instead, we update the comment with the correct RFC
      number where one can look it up.
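
      A minimal userspace sketch of the "zero first, then fill" pattern
      follows. The struct mirrors the RFC 6458 layout quoted above, but the
      names ending in _demo, the 32-bit association id type and the helper
      functions are assumptions for illustration only; the size of the hole
      depends on the ABI (2 bytes on common ones).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the association id is a 32-bit integer in this sketch. */
typedef int32_t demo_assoc_t;

struct sndrcvinfo_demo {
    uint16_t sinfo_stream;
    uint16_t sinfo_ssn;
    uint16_t sinfo_flags;
    /* the compiler typically inserts 2 padding bytes here */
    uint32_t sinfo_ppid;
    uint32_t sinfo_context;
    uint32_t sinfo_timetolive;
    uint32_t sinfo_tsn;
    uint32_t sinfo_cumtsn;
    demo_assoc_t sinfo_assoc_id;
};

/* Size of the hole between sinfo_flags and sinfo_ppid. */
size_t hole_size_demo(void)
{
    return offsetof(struct sndrcvinfo_demo, sinfo_ppid) -
           (offsetof(struct sndrcvinfo_demo, sinfo_flags) + sizeof(uint16_t));
}

/* Zero the whole object first so the padding never carries stale data. */
void fill_sndrcvinfo_demo(struct sndrcvinfo_demo *out, uint16_t stream,
                          uint16_t ssn, uint32_t ppid)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    out->sinfo_stream = stream;
    out->sinfo_ssn = ssn;
    out->sinfo_ppid = ppid;
}
```

      Without the memset(), whatever happened to be in the padding bytes
      would be copied out along with the named fields.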
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      a96dcc00
    • Jon Paul Maloy's avatar
      tipc: clear 'next'-pointer of message fragments before reassembly · 0515cc26
      Jon Paul Maloy authored
      [ Upstream commit 99941754 ]
      
      If the 'next' pointer of the last fragment buffer in a message is not
      zeroed before reassembly, we risk ending up with a corrupt message,
      since the reassembly function itself isn't doing this.
      
      Currently, when a buffer is retrieved from the deferred queue of the
      broadcast link, the next pointer is not cleared, with the result as
      described above.
      
      This commit corrects this, and thereby fixes a bug that may occur when
      long broadcast messages are transmitted across dual interfaces. The bug
      has been present since 40ba3cdf ("tipc:
      message reassembly using fragment chain").
      
      This commit should be applied to both net and net-next.
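
      The idea can be sketched in plain C with a singly linked queue. The
      struct and function names are hypothetical and only model the
      behaviour described above; they are not the TIPC code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a message buffer on a deferred queue. */
struct buf_demo {
    struct buf_demo *next;
    int len;
};

/* Detach the first buffer from the queue and clear its 'next' link,
 * so later code cannot follow it into buffers that are still queued. */
struct buf_demo *dequeue_demo(struct buf_demo **head)
{
    struct buf_demo *buf;

    assert(head != NULL);
    buf = *head;
    if (buf) {
        *head = buf->next;
        buf->next = NULL;
    }
    return buf;
}

/* Stand-in for reassembly: sums every buffer reachable via 'next'. */
int chain_len_demo(const struct buf_demo *frag)
{
    int total = 0;

    for (; frag; frag = frag->next)
        total += frag->len;
    return total;
}
```

      If dequeue_demo() left buf->next untouched, chain_len_demo() would
      also count the buffers that are still waiting on the queue.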
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      0515cc26
    • Suresh Reddy's avatar
      be2net: set EQ DB clear-intr bit in be_open() · a67b16e2
      Suresh Reddy authored
      [ Upstream commit 4cad9f3b ]
      
      On BE3, if the clear-interrupt bit of the EQ doorbell is not set the first
      time it is armed, we have occasionally observed that the EQ doesn't raise
      any more interrupts even if it is in armed state.
      This patch fixes this by setting the clear-interrupt bit when EQs are
      armed for the first time in be_open().
      Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      a67b16e2
    • Ben Pfaff's avatar
      netlink: Fix handling of error from netlink_dump(). · 8df19ce7
      Ben Pfaff authored
      [ Upstream commit ac30ef83 ]
      
      netlink_dump() returns a negative errno value on error.  Until now,
      netlink_recvmsg() directly recorded that negative value in sk->sk_err, but
      that's wrong since sk_err takes positive errno values.  (This manifests as
      userspace receiving a positive return value from the recv() system call,
      falsely indicating success.) This bug was introduced in the commit that
      started checking the netlink_dump() return value, commit b44d211e (netlink:
      handle errors from netlink_dump()).
      
      Multithreaded Netlink dumps are one way to trigger this behavior in
      practice, as described in the commit message for the userspace workaround
      posted here:
          http://openvswitch.org/pipermail/dev/2014-June/042339.html
      
      This commit also fixes the same bug in netlink_poll(), introduced in commit
      cd1df525 (netlink: add flow control for memory mapped I/O).
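
      The sign convention at stake can be sketched as follows. The struct
      and both functions are hypothetical stand-ins, not the kernel's
      definitions; only the rule "functions return negative errno, sk_err
      stores positive errno" is taken from the text above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the socket field discussed above. */
struct sock_demo {
    int sk_err; /* holds a positive errno value, or 0 */
};

/* Stand-in for a dump routine: 0 on success, negative errno on error. */
int dump_demo(int fail)
{
    return fail ? -ENOBUFS : 0;
}

/* Record a dump result on the socket, flipping the sign on error. */
void record_dump_result_demo(struct sock_demo *sk, int ret)
{
    assert(ret <= 0);
    if (ret)
        sk->sk_err = -ret;
}
```

      Storing ret unchanged would leave a negative value in sk_err, which
      is the mismatch described in this message.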
      Signed-off-by: Ben Pfaff <blp@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      8df19ce7
    • Thomas Petazzoni's avatar
      net: mvneta: fix operation in 10 Mbit/s mode · 477e05a6
      Thomas Petazzoni authored
      [ Upstream commit 4d12bc63 ]
      
      As reported by Maggie Mae Roxas, the mvneta driver doesn't behave
      properly in 10 Mbit/s mode. This is due to a misconfiguration of the
      MVNETA_GMAC_AUTONEG_CONFIG register: bit MVNETA_GMAC_CONFIG_MII_SPEED
      must be set for a 100 Mbit/s speed, but cleared for a 10 Mbit/s speed,
      which the driver was not properly doing. This commit adjusts that by
      setting the MVNETA_GMAC_CONFIG_MII_SPEED bit only in 100 Mbit/s mode,
      and relying on the fact that all the speed related bits of this
      register are cleared at the beginning of the mvneta_adjust_link()
      function.
      
      This problem exists since c5aff182 ("net: mvneta: driver for
      Marvell Armada 370/XP network unit") which is the commit that
      introduced the mvneta driver in the kernel.
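
      A rough sketch of the speed selection described above. The bit
      positions, the DEMO_ names and the separate gigabit bit are
      assumptions for illustration; consult the driver source for the
      actual register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only, not taken from the datasheet. */
#define DEMO_CONFIG_MII_SPEED  (1u << 5) /* set: 100 Mbit/s, clear: 10 Mbit/s */
#define DEMO_CONFIG_GMII_SPEED (1u << 6) /* assumed bit for 1000 Mbit/s */

/* Return the register value with the speed bits set for speed_mbps. */
uint32_t set_speed_bits_demo(uint32_t val, int speed_mbps)
{
    assert(speed_mbps == 10 || speed_mbps == 100 || speed_mbps == 1000);

    /* Clear all speed-related bits first, as the text above relies on. */
    val &= ~(DEMO_CONFIG_MII_SPEED | DEMO_CONFIG_GMII_SPEED);

    if (speed_mbps == 1000)
        val |= DEMO_CONFIG_GMII_SPEED;
    else if (speed_mbps == 100)
        val |= DEMO_CONFIG_MII_SPEED;
    /* 10 Mbit/s: both bits stay cleared. */

    return val;
}
```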
      
      Cc: <stable@vger.kernel.org> # v3.8+
      Fixes: c5aff182 ("net: mvneta: driver for Marvell Armada 370/XP network unit")
      Reported-by: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
      Cc: Maggie Mae Roxas <maggie.mae.roxas@gmail.com>
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      477e05a6
    • Andrey Utkin's avatar
      appletalk: Fix socket referencing in skb · aabe442e
      Andrey Utkin authored
      [ Upstream commit 36beddc2 ]
      
      Setting just skb->sk without taking its reference and setting a
      destructor is invalid. However, in the places where this was done, skb
      is used in a way not requiring skb->sk setting. So drop the setting
      of skb->sk.
      Thanks to Eric Dumazet <eric.dumazet@gmail.com> for the correct solution.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=79441
      Reported-by: Ed Martin <edman007@edman007.com>
      Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      aabe442e
    • Yuchung Cheng's avatar
      tcp: fix false undo corner cases · 06fc671a
      Yuchung Cheng authored
      [ Upstream commit 6e08d5e3 ]
      
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmits something
      2) never has the retransmission fail locally (e.g., qdisc drop)

      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.

      When the assumption is broken because TCP's cwnd is too small to
      retransmit or the retransmit fails locally, the next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may enter the recovery state. The sender repeatedly enters and
      (incorrectly) exits recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
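
      The counting rule can be modelled with a small state sketch. The
      struct, the function names and the simplified undo condition are
      assumptions for illustration; the real undo checks in the TCP code
      involve further conditions not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced model of the two fields discussed above. */
struct undo_demo {
    bool undo_marker;  /* set while an undo is still possible */
    int undo_retrans;  /* -1 until the first retransmission attempt */
};

void enter_recovery_demo(struct undo_demo *tp)
{
    tp->undo_marker = true;
    tp->undo_retrans = -1; /* nothing retransmitted yet */
}

/* Count every attempt, including those that fail locally. */
void on_retransmit_attempt_demo(struct undo_demo *tp, int segs)
{
    assert(segs > 0);
    if (tp->undo_retrans < 0)
        tp->undo_retrans = 0;
    tp->undo_retrans += segs;
}

/* A DSACK accounts for one spurious retransmission. */
void on_dsack_demo(struct undo_demo *tp)
{
    if (tp->undo_retrans > 0)
        tp->undo_retrans--;
}

/* Simplified: undo only once all counted retransmissions are accounted for. */
bool may_undo_demo(const struct undo_demo *tp)
{
    return tp->undo_marker && tp->undo_retrans == 0;
}
```

      Because the counter starts at -1 rather than 0, a (DUP)ACK that
      arrives before any retransmission attempt cannot trigger an undo
      in this model.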
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      06fc671a
    • dingtianhong's avatar
      igmp: fix the problem when mc leave group · 31ae2665
      dingtianhong authored
      [ Upstream commit 52ad353a ]
      
      The problem was triggered by these steps:
      
      1) create socket, bind and then setsockopt for add mc group.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
         setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
      
      2) drop the mc group for this socket.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
         setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
      
      3) then drop the socket; the mc group was still used by the dev:
      
         netstat -g
      
         Interface       RefCnt Group
         --------------- ------ ---------------------
         eth2		   1	  255.0.0.37
      
      Normally, even if IP_DROP_MEMBERSHIP returns an error, the mc group still needs
      to be released for the netdev when the socket is dropped. This process was broken
      when the default route is NULL, for the following reason:

      ip_mc_leave_group() chooses the in_dev by imr_interface.s_addr. If the input addr
      is NULL, the default route dev is chosen, the ifindex is taken from that dev, and
      the function then walks inet->mc_list and returns -ENODEV. But if the default route
      dev is NULL, in_dev and ifindex are both NULL; while walking inet->mc_list, the mc
      group is released from the mc_list, but the dev does not decrement the refcnt for
      this mc group. So when the socket is dropped, the mc_list is NULL and the dev still
      keeps this group.

      v1->v2: Following Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
      	so I added a check for the in_dev before walking the mc_list, to make sure that when
      	we remove the mc group, we decrement the refcnt on the real dev which was using the
      	mc address. The problem should not happen again.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      31ae2665
    • Loic Prylli's avatar
      net: Fix NETDEV_CHANGE notifier usage causing spurious arp flush · 4c824ea2
      Loic Prylli authored
      [ Upstream commit 54951194 ]
      
      A bug was introduced in NETDEV_CHANGE notifier sequence causing the
      arp table to be sometimes spuriously cleared (including manual arp
      entries marked permanent), upon network link carrier changes.
      
      The changed argument for the notifier was applied only to a single
      caller of NETDEV_CHANGE, missing among others netdev_state_change().
      So upon net_carrier events induced by the network, which are
      triggering a call to netdev_state_change(), arp_netdev_event() would
      decide whether to clear or not arp cache based on random/junk stack
      values (a kind of read buffer overflow).
      
      Fixes: be9efd36 ("net: pass changed flags along with NETDEV_CHANGE event")
      Fixes: 6c8b4e3f ("arp: flush arp cache on IFF_NOARP change")
      Signed-off-by: Loic Prylli <loicp@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4c824ea2
    • Bjørn Mork's avatar
      net: qmi_wwan: add two Sierra Wireless/Netgear devices · 196162de
      Bjørn Mork authored
      [ Upstream commit 53433300 ]
      
      Add two device IDs found in an out-of-tree driver downloadable
      from Netgear.
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      196162de
    • Bernd Wachter's avatar
      net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2 · 0b92a040
      Bernd Wachter authored
      [ Upstream commit 8dcb4b15 ]
      
      There's a new version of the Telewell 4G modem working with, but not
      recognized by this driver.
      Signed-off-by: Bernd Wachter <bernd.wachter@jolla.com>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      0b92a040
    • Edward Allcutt's avatar
      ipv4: icmp: Fix pMTU handling for rare case · 1f947b07
      Edward Allcutt authored
      [ Upstream commit 68b7107b ]
      
      Some older router implementations still send Fragmentation Needed
      errors with the Next-Hop MTU field set to zero. This is explicitly
      described as an eventuality that hosts must deal with by the
      standard (RFC 1191) since older standards specified that those
      bits must be zero.
      
      Linux had a generic (for all of IPv4) implementation of the algorithm
      described in the RFC for searching a list of MTU plateaus for a good
      value. Commit 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      removed this as part of the changes to remove the routing cache.
      Subsequently any Fragmentation Needed packet with a zero Next-Hop
      MTU has been discarded without being passed to the per-protocol
      handlers or notifying userspace for raw sockets.
      
      When there is a router which does not implement RFC 1191 on an
      MTU limited path then this results in stalled connections since
      large packets are discarded and the local protocols are not
      notified so they never attempt to lower the pMTU.
      
      One example I have seen is an OpenBSD router terminating IPSec
      tunnels. It's worth pointing out that this case is distinct from
      the BSD 4.2 bug which incorrectly calculated the Next-Hop MTU
      since the commit in question dismissed that as a valid concern.
      
      All of the per-protocol handlers implement the simple approach from
      RFC 1191 of immediately falling back to the minimum value. Although
      this is sub-optimal, it is vastly preferable to connections hanging
      indefinitely.
      
      Remove the Next-Hop MTU != 0 check and allow such packets
      to follow the normal path.
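
      As a sketch of the fallback idea: RFC 1191 suggests that a host
      receiving a Fragmentation Needed message without a usable Next-Hop
      MTU may simply assume the smaller of its current estimate and 576.
      The function below models only that rule; its name is hypothetical,
      and the kernel's per-protocol handlers apply their own minimum,
      which is not necessarily 576.

```c
#include <assert.h>

/* Fallback value from the simple strategy in RFC 1191. */
#define RFC1191_FALLBACK_MTU 576u

/* New path MTU estimate after a Fragmentation Needed message.
 * A next_hop_mtu of 0 means the router did not fill in the field. */
unsigned int pmtu_after_frag_needed_demo(unsigned int current_pmtu,
                                         unsigned int next_hop_mtu)
{
    unsigned int candidate;

    assert(current_pmtu > 0);
    candidate = next_hop_mtu ? next_hop_mtu : RFC1191_FALLBACK_MTU;

    /* Never raise the estimate in response to an error message. */
    return candidate < current_pmtu ? candidate : current_pmtu;
}
```

      Dropping the message instead, as happened before this change, leaves
      the estimate at its old value and the connection stalled.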
      
      Fixes: 46517008 ("ipv4: Kill ip_rt_frag_needed().")
      Signed-off-by: Edward Allcutt <edward.allcutt@openmarket.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      1f947b07
    • Christoph Paasch's avatar
      tcp: Fix divide by zero when pushing during tcp-repair · 9f22c5c1
      Christoph Paasch authored
      [ Upstream commit 5924f17a ]
      
      When in repair-mode and TCP_RECV_QUEUE is set, we end up calling
      tcp_push with mss_now being 0. If data is in the send-queue and
      tcp_set_skb_tso_segs gets called, we crash because it will divide by
      mss_now:
      
      [  347.151939] divide error: 0000 [#1] SMP
      [  347.152907] Modules linked in:
      [  347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4
      [  347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [  347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000
      [  347.152907] EIP: 0060:[<c1601359>] EFLAGS: 00210246 CPU: 1
      [  347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0
      [  347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000
      [  347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00
      [  347.152907]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
      [  347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0
      [  347.152907] Stack:
      [  347.152907]  c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080
      [  347.152907]  f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85
      [  347.152907]  c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8
      [  347.152907] Call Trace:
      [  347.152907]  [<c167f9d9>] ? apic_timer_interrupt+0x2d/0x34
      [  347.152907]  [<c16013e6>] tcp_init_tso_segs+0x36/0x50
      [  347.152907]  [<c1603b5a>] tcp_write_xmit+0x7a/0xbf0
      [  347.152907]  [<c10a0188>] ? up+0x28/0x40
      [  347.152907]  [<c10acc85>] ? console_unlock+0x295/0x480
      [  347.152907]  [<c10ad24f>] ? vprintk_emit+0x1ef/0x4b0
      [  347.152907]  [<c1605716>] __tcp_push_pending_frames+0x36/0xd0
      [  347.152907]  [<c15f4860>] tcp_push+0xf0/0x120
      [  347.152907]  [<c15f7641>] tcp_sendmsg+0xf1/0xbf0
      [  347.152907]  [<c116d920>] ? kmem_cache_free+0xf0/0x120
      [  347.152907]  [<c106a682>] ? __sigqueue_free+0x32/0x40
      [  347.152907]  [<c106a682>] ? __sigqueue_free+0x32/0x40
      [  347.152907]  [<c114f0f0>] ? do_wp_page+0x3e0/0x850
      [  347.152907]  [<c161c36a>] inet_sendmsg+0x4a/0xb0
      [  347.152907]  [<c1150269>] ? handle_mm_fault+0x709/0xfb0
      [  347.152907]  [<c15a006b>] sock_aio_write+0xbb/0xd0
      [  347.152907]  [<c1180b79>] do_sync_write+0x69/0xa0
      [  347.152907]  [<c1181023>] vfs_write+0x123/0x160
      [  347.152907]  [<c1181d55>] SyS_write+0x55/0xb0
      [  347.152907]  [<c167f0d8>] sysenter_do_call+0x12/0x28
      
      This can easily be reproduced with the following packetdrill-script (the
      "magic" with netem, sk_pacing and limit_output_bytes is done to prevent
      the kernel from pushing all segments, because hitting the limit without
      doing this is not so easy with packetdrill):
      
      0   socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      
      +0  bind(3, ..., ...) = 0
      +0  listen(3, 1) = 0
      
      +0  < S 0:0(0) win 32792 <mss 1460>
      +0  > S. 0:0(0) ack 1 <mss 1460>
      +0.1  < . 1:1(0) ack 1 win 65000
      
      +0  accept(3, ..., ...) = 4
      
      // This forces that not all segments of the snd-queue will be pushed
      +0 `tc qdisc add dev tun0 root netem delay 10ms`
      +0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2`
      +0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0
      
      +0 write(4,...,10000) = 10000
      +0 write(4,...,10000) = 10000
      
      // Set tcp-repair stuff, particularly TCP_RECV_QUEUE
      +0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0
      +0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0
      
      // This now will make the write push the remaining segments
      +0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0
      +0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000`
      
      // Now we will crash
      +0 write(4,...,1000) = 1000
      
      This happens since ec342325 (tcp: fix retransmission in repair
      mode). Prior to that, the call to tcp_push was prevented by a check for
      tp->repair.
      
      The patch fixes it by adding the new goto-label out_nopush. When exiting
      tcp_sendmsg and a push is not required, which is the case for tp->repair,
      we go to this label.
      
      When repairing and calling send() with TCP_RECV_QUEUE, the data is
      actually put in the receive-queue. So, no push is required because no
      data has been added to the send-queue.
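
      The control flow can be sketched in userspace C. Only the label
      name out_nopush is taken from the text above; the struct, the
      function names and the segment computation are simplified
      assumptions, not the tcp_sendmsg() code.

```c
#include <assert.h>

/* Hypothetical record of what a push would have done. */
struct push_stats_demo {
    unsigned int segs;
    int pushed;
};

/* Stand-in for the push path: divides by mss_now, so it must be non-zero. */
void push_demo(struct push_stats_demo *st, unsigned int queued,
               unsigned int mss_now)
{
    assert(mss_now != 0);
    st->segs = (queued + mss_now - 1) / mss_now;
    st->pushed = 1;
}

/* Simplified send path: in repair mode with the receive queue selected,
 * the data does not go to the send queue, so the push is skipped. */
int sendmsg_demo(struct push_stats_demo *st, int repair_recv_queue,
                 unsigned int len, unsigned int queued, unsigned int mss_now)
{
    int copied = (int)len;

    if (repair_recv_queue)
        goto out_nopush;

    if (copied)
        push_demo(st, queued + len, mss_now);
out_nopush:
    return copied;
}
```

      With repair_recv_queue set, sendmsg_demo() returns without touching
      mss_now, so a value of 0 is harmless on that path.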
      
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Fixes: ec342325 (tcp: fix retransmission in repair mode)
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Acked-by: Andrew Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      9f22c5c1
    • Eric Dumazet's avatar
      bnx2x: fix possible panic under memory stress · 7ec8d47a
      Eric Dumazet authored
      [ Upstream commit 07b0f009 ]
      
      While it is legal to kfree(NULL), it is not wise to use:
      put_page(virt_to_head_page(NULL))
      
       BUG: unable to handle kernel paging request at ffffeba400000000
       IP: [<ffffffffc01f5928>] virt_to_head_page+0x36/0x44 [bnx2x]
      Reported-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ariel Elior <ariel.elior@qlogic.com>
      Fixes: d46d132c ("bnx2x: use netdev_alloc_frag()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      7ec8d47a