1. 19 May, 2013 31 commits
    • audit: vfs: fix audit_inode call in O_CREAT case of do_last · 93d927e2
      Jeff Layton authored
      commit 33e2208a upstream.
      
      Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
      In older kernels, creating a file with open(..., O_CREAT) created
      audit_name records that looked like this:
      
      type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      ...in recent kernels though, they look like this:
      
      type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
      
      Richard bisected to determine that the problems started with commit
      bfcec708, but the log messages have changed with some later
      audit-related patches.
      
      The problem is that this audit_inode call is passing in the parent of
      the dentry being opened, but audit_inode is being called with the parent
      flag false. This causes later audit_inode and audit_inode_child calls to
      match the wrong entry in the audit_names list.
      
      This patch simply sets the flag to properly indicate that this inode
      represents the parent. With this, the audit_names entries are back to
      looking like they did before.
      Reported-by: Jiri Jaburek <jjaburek@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Tested-by: Richard Guy Briggs <rbriggs@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • audit: Syscall rules are not applied to existing processes on non-x86 · 16f0b63b
      Anton Blanchard authored
      commit cdee3904 upstream.
      
      Commit b05d8447 (audit: inline audit_syscall_entry to reduce
      burden on archs) changed audit_syscall_entry to check for a dummy
      context before calling __audit_syscall_entry. Unfortunately the dummy
      context state is maintained in __audit_syscall_entry so once set it
      never gets cleared, even if the audit rules change.
      
      As a result, if there are no auditing rules when a process starts
      then it will never be subject to any rules added later. x86 doesn't
      see this because it has an assembly fast path that calls directly into
      __audit_syscall_entry.
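
      The stale-flag behaviour described above can be modelled in a few
      lines of userspace C. This is only a sketch under stated assumptions:
      the struct, the rule counter and all function names are hypothetical
      stand-ins, not the kernel's definitions, and the "always" variant is
      one way to avoid the problem rather than a copy of the patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a per-task audit context. */
struct audit_ctx_sketch {
    bool dummy;   /* true when no rules existed at the last evaluation */
    int records;  /* syscalls that would have produced an audit record */
};

/* Hypothetical model of the global number of loaded audit rules. */
int n_rules_sketch;

/* Slow path: the only place where the dummy flag is re-evaluated. */
void slow_entry_sketch(struct audit_ctx_sketch *ctx)
{
    ctx->dummy = (n_rules_sketch == 0);
    if (!ctx->dummy)
        ctx->records++;
}

/* Problematic wrapper: once dummy is set, the slow path is never
 * reached again, so the flag can never be cleared. */
void entry_buggy_sketch(struct audit_ctx_sketch *ctx)
{
    if (!ctx->dummy)
        slow_entry_sketch(ctx);
}

/* Wrapper that always reaches the slow path, so rules added later
 * are noticed on the next syscall. */
void entry_always_sketch(struct audit_ctx_sketch *ctx)
{
    slow_entry_sketch(ctx);
}
```

      A context that starts as dummy and then sees a rule added stays
      silent with the first wrapper and produces a record with the second.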
      
      I noticed this issue when working on audit performance optimisations.
      I wrote a set of simple test cases available at:
      
      http://ozlabs.org/~anton/junkcode/audit_tests.tar.gz
      
      02_new_rule.py fails without the patch and passes with it. The
      test case clears all rules, starts a process, adds a rule then
      verifies the process produces a syscall audit record.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • SCSI: sd: fix array cache flushing bug causing performance problems · ccb2c9da
      James Bottomley authored
      commit 39c60a09 upstream.
      
      Some arrays synchronize their full non-volatile cache when the sd driver sends
      a SYNCHRONIZE CACHE command.  Unfortunately, they can have terabytes of this
      and we send a SYNCHRONIZE CACHE for every barrier if an array reports it has a
      writeback cache.  This leads to massive slowdowns on journalled filesystems.
      
      The fix is to allow userspace to turn off the writeback cache setting as a
      temporary measure (i.e. without doing the MODE SELECT to write it back to the
      device), so even though the device reported it has a writeback cache, the
      user, knowing that the cache is non-volatile and all they care about is
      filesystem correctness, can turn that bit off in the kernel and avoid the
      performance-ruinous (and safety-irrelevant) SYNCHRONIZE CACHE commands.
      
      The way to do this is to add a 'temporary' prefix when performing the usual
      cache-setting operations, for example:
      
      echo temporary write through > /sys/class/scsi_disk/<disk>/cache_type
      Reported-by: Ric Wheeler <rwheeler@redhat.com>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xen/vcpu/pvhvm: Fix vcpu hotplugging hanging. · db9f69dc
      Konrad Rzeszutek Wilk authored
      commit 7f1fc268 upstream.
      
      If a user did:
      
      	echo 0 > /sys/devices/system/cpu/cpu1/online
      	echo 1 > /sys/devices/system/cpu/cpu1/online
      
      we would (this is a build with DEBUG enabled) get to:
      smpboot: ++++++++++++++++++++=_---CPU UP  1
      .. snip..
      smpboot: Stack at about ffff880074c0ff44
      smpboot: CPU1: has booted.
      
      and hang. The RCU mechanism would kick in and try to IPI CPU1,
      but the IPIs (and all other interrupts) would never arrive at
      CPU1. At first glance, at least. A bit of digging in the hypervisor
      trace (using xenanalyze) shows that:
      
      [vla] d4v1 vec 243 injecting
         0.043163027 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
      ]  0.043163639 --|x d4v1 vmentry cycles 1468
      ]  0.043164913 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
         0.043164913 --|x d4v1 inj_virq vec 243  real
        [vla] d4v1 vec 243 injecting
         0.043164913 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
      ]  0.043165526 --|x d4v1 vmentry cycles 1472
      ]  0.043166800 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
         0.043166800 --|x d4v1 inj_virq vec 243  real
        [vla] d4v1 vec 243 injecting
      
      there is a pending event (subsequent debugging shows it is the IPI
      from the VCPU0 when smpboot.c on VCPU1 has done
      "set_cpu_online(smp_processor_id(), true)") and the guest VCPU1 is
      interrupted with the callback IPI (0xf3 aka 243) which ends up calling
      __xen_evtchn_do_upcall.
      
      The __xen_evtchn_do_upcall seems to do *something* but not acknowledge
      the pending events. And the moment the guest does a 'cli' (that is the
      ffffffff81673254 in the log above) the hypervisor is invoked again to
      inject the IPI (0xf3) to tell the guest it has pending interrupts.
      This repeats itself forever.
      
      The culprit was the per_cpu(xen_vcpu, cpu) pointer. At bootup
      we set each per_cpu(xen_vcpu, cpu) to point to
      shared_info->vcpu_info[vcpu], but later on use VCPUOP_register_vcpu_info
      to register per-CPU structures (xen_vcpu_setup).
      This is used to allow events for more than 32 VCPUs and for performance
      optimization reasons.
      
      When the user performs the VCPU hotplug we end up calling
      xen_vcpu_setup once more. We make the hypercall, which returns
      -EINVAL as it does not allow multiple registration calls (and
      already has re-assigned where the events are being set). We pick
      the fallback case and set per_cpu(xen_vcpu, cpu) to point to
      shared_info->vcpu_info[vcpu] (which is a good fallback during bootup).
      However, the hypervisor is still setting events in the registered
      per-CPU structure (per_cpu(xen_vcpu_info, cpu)).
      
      As such, when the events are set by the hypervisor (such as the timer one),
      and when we iterate in __xen_evtchn_do_upcall we end up reading stale
      events from the shared_info->vcpu_info[vcpu] instead of the
      per_cpu(xen_vcpu_info, cpu) structures. Hence we never acknowledge the
      events that the hypervisor has set and the hypervisor keeps on reminding
      us to ack the events which we never do.
      
      The fix is simple: when xen_vcpu_setup is called a second time,
      don't overwrite per_cpu(xen_vcpu, cpu) if it already points to
      per_cpu(xen_vcpu_info, cpu).
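
      A userspace sketch of the idea, for illustration only. The arrays
      model shared_info->vcpu_info[], per_cpu(xen_vcpu_info, cpu) and
      per_cpu(xen_vcpu, cpu); every name ending in _sketch, the array
      size and the fake registration call are assumptions, not Xen or
      kernel code.

```c
/* Hypothetical stand-in for struct vcpu_info. */
struct vcpu_info_sketch {
    int pending;
};

#define NR_CPUS_SKETCH 4

struct vcpu_info_sketch shared_vcpu_info[NR_CPUS_SKETCH];  /* shared_info area */
struct vcpu_info_sketch percpu_vcpu_info[NR_CPUS_SKETCH];  /* registered area  */
struct vcpu_info_sketch *xen_vcpu_ptr[NR_CPUS_SKETCH];     /* pointer in use   */

int registered_sketch[NR_CPUS_SKETCH];

/* Fake hypercall: succeeds once per CPU, then fails like -EINVAL. */
int register_vcpu_info_sketch(int cpu)
{
    if (registered_sketch[cpu])
        return -22;
    registered_sketch[cpu] = 1;
    return 0;
}

/* Problematic variant: a failed re-registration falls back to the
 * shared area even though events still land in the per-CPU area. */
void vcpu_setup_buggy_sketch(int cpu)
{
    if (register_vcpu_info_sketch(cpu) == 0)
        xen_vcpu_ptr[cpu] = &percpu_vcpu_info[cpu];
    else
        xen_vcpu_ptr[cpu] = &shared_vcpu_info[cpu];
}

/* Guarded variant: leave the pointer alone if it already refers to
 * the registered per-CPU area. */
void vcpu_setup_sketch(int cpu)
{
    if (xen_vcpu_ptr[cpu] == &percpu_vcpu_info[cpu])
        return;
    vcpu_setup_buggy_sketch(cpu);
}
```

      Calling the guarded variant twice for the same CPU, as a hotplug
      cycle would, keeps the pointer on the per-CPU structure.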
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • shm: fix null pointer deref when userspace specifies invalid hugepage size · 159590f2
      Li Zefan authored
      commit 091d0d55 upstream.
      
      Dave reported an oops triggered by trinity:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: newseg+0x10d/0x390
        PGD cf8c1067 PUD cf8c2067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67
        ...
        Call Trace:
          ipcget+0x182/0x380
          SyS_shmget+0x5a/0x60
          tracesys+0xdd/0xe2
      
      This bug was introduced by commit af73e4d9 ("hugetlbfs: fix mmap
      failure in unaligned size request").
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Li Zefan <lizfan@huawei.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • hp_accel: Ignore the error from lis3lv02d_poweron() at resume · 7b44587e
      Shuah Khan authored
      commit 77838199 upstream.
      
      The error in lis3lv02d_poweron() is harmless in the resume path, so
      we should ignore it. This is in line with the other usages of lis3lv02d_poweron()
      and matches the 3.0 code for this routine. This patch is in suse git and
      might have missed making it into the mainline.
      opensuse - commit id: 66ccdac87c322cf7af12bddba8c805af640b1cff
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Shuah Khan <shuah.khan@hp.com>
      Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nfsd: fix oops when legacy_recdir_name_error is passed a -ENOENT error · 59d7914f
      Jeff Layton authored
      commit 7255e716 upstream.
      
      Toralf reported the following oops to the linux-nfs mailing list:
      
          -----------------[snip]------------------
          NFSD: unable to generate recoverydir name (-2).
          NFSD: disabling legacy clientid tracking. Reboot recovery will not function correctly!
          BUG: unable to handle kernel NULL pointer dereference at 000003c8
          IP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          *pdpt = 000000002ba33001 *pde = 0000000000000000
          Oops: 0000 [#1] SMP
          Modules linked in: loop nfsd auth_rpcgss ipt_MASQUERADE xt_owner xt_multiport ipt_REJECT xt_tcpudp xt_recent xt_conntrack nf_conntrack_ftp xt_limit xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables af_packet pppoe pppox ppp_generic slhc bridge stp llc tun arc4 iwldvm mac80211 coretemp kvm_intel uvcvideo sdhci_pci sdhci mmc_core videobuf2_vmalloc videobuf2_memops usblp videobuf2_core i915 iwlwifi psmouse videodev cfg80211 kvm fbcon bitblit cfbfillrect acpi_cpufreq mperf evdev softcursor font cfbimgblt i2c_algo_bit cfbcopyarea intel_agp intel_gtt drm_kms_helper snd_hda_codec_conexant drm agpgart fb fbdev tpm_tis thinkpad_acpi tpm nvram e1000e rfkill thermal ptp wmi pps_core tpm_bios 8250_pci processor 8250 ac snd_hda_intel snd_hda_codec snd_pcm battery video i2c_i801 snd_page_alloc snd_timer button serial_core i2c_core snd soundcore thermal_sys hwmon aesni_intel ablk_helper cryptd lrw aes_i586 xts gf128mul cbc fuse nfs lockd sunrpc dm_crypt dm_mod hid_monterey hid_microsoft hid_logitech hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech hid_generic usbhid hid sr_mod cdrom sg [last unloaded: microcode]
          Pid: 6374, comm: nfsd Not tainted 3.9.1 #6 LENOVO 4180F65/4180F65
          EIP: 0060:[<f90a3d91>] EFLAGS: 00010202 CPU: 0
          EIP is at nfsd4_client_tracking_exit+0x11/0x50 [nfsd]
          EAX: 00000000 EBX: fffffffe ECX: 00000007 EDX: 00000007
          ESI: eb9dcb00 EDI: eb2991c0 EBP: eb2bde38 ESP: eb2bde34
          DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
          CR0: 80050033 CR2: 000003c8 CR3: 2ba80000 CR4: 000407f0
          DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
          DR6: ffff0ff0 DR7: 00000400
          Process nfsd (pid: 6374, ti=eb2bc000 task=eb2711c0 task.ti=eb2bc000)
          Stack:
          fffffffe eb2bde4c f90a3e0c f90a7754 fffffffe eb0a9c00 eb2bdea0 f90a41ed
          eb2991c0 1b270000 eb2991c0 eb2bde7c f9099ce9 eb2bde98 0129a020 eb29a020
          eb2bdecc eb2991c0 eb2bdea8 f9099da5 00000000 eb9dcb00 00000001 67822f08
          Call Trace:
          [<f90a3e0c>] legacy_recdir_name_error+0x3c/0x40 [nfsd]
          [<f90a41ed>] nfsd4_create_clid_dir+0x15d/0x1c0 [nfsd]
          [<f9099ce9>] ? nfsd4_lookup_stateid+0x99/0xd0 [nfsd]
          [<f9099da5>] ? nfs4_preprocess_seqid_op+0x85/0x100 [nfsd]
          [<f90a4287>] nfsd4_client_record_create+0x37/0x50 [nfsd]
          [<f909d6ce>] nfsd4_open_confirm+0xfe/0x130 [nfsd]
          [<f90980b1>] ? nfsd4_encode_operation+0x61/0x90 [nfsd]
          [<f909d5d0>] ? nfsd4_free_stateid+0xc0/0xc0 [nfsd]
          [<f908fd0b>] nfsd4_proc_compound+0x41b/0x530 [nfsd]
          [<f9081b7b>] nfsd_dispatch+0x8b/0x1a0 [nfsd]
          [<f857b85d>] svc_process+0x3dd/0x640 [sunrpc]
          [<f908165d>] nfsd+0xad/0x110 [nfsd]
          [<f90815b0>] ? nfsd_destroy+0x70/0x70 [nfsd]
          [<c1054824>] kthread+0x94/0xa0
          [<c1486937>] ret_from_kernel_thread+0x1b/0x28
          [<c1054790>] ? flush_kthread_work+0xd0/0xd0
          Code: 86 b0 00 00 00 90 c5 0a f9 c7 04 24 70 76 0a f9 e8 74 a9 3d c8 eb ba 8d 76 00 55 89 e5 53 66 66 66 66 90 8b 15 68 c7 0a f9 85 d2 <8b> 88 c8 03 00 00 74 2c 3b 11 77 28 8b 5c 91 08 85 db 74 22 8b
          EIP: [<f90a3d91>] nfsd4_client_tracking_exit+0x11/0x50 [nfsd] SS:ESP 0068:eb2bde34
          CR2: 00000000000003c8
          ---[ end trace 09e54015d145c9c6 ]---
      
      The problem appears to be a regression that was introduced in commit
      9a9c6478 "nfsd: make NFSv4 recovery client tracking options per net".
      Prior to that commit, it was safe to pass a NULL net pointer to
      nfsd4_client_tracking_exit in the legacy recdir case, and
      legacy_recdir_name_error did so. After that commit, the net pointer must
      be valid.
      
      This patch just fixes legacy_recdir_name_error to pass in a valid net
      pointer to that function.
      Reported-and-tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nfsd4: don't allow owner override on 4.1 CLAIM_FH opens · faad5f5c
      J. Bruce Fields authored
      commit 9f415eb2 upstream.
      
      The Linux client is using CLAIM_FH to implement regular opens, not just
      recovery cases, so it depends on the server to check permissions
      correctly.
      
      Therefore the owner override, which may make sense in the delegation
      recovery case, isn't right in the CLAIM_FH case.
      
      Symptoms: on a client with 49f9a0fa
      "NFSv4.1: Enable open-by-filehandle", Bryan noticed this:
      
      	touch test.txt
      	chmod 000 test.txt
      	echo test > test.txt
      
      succeeding.
      Reported-by: Bryan Schumaker <bjschuma@netapp.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: Avoid prev->stime underflow · 6bc7f6ef
      Stanislaw Gruszka authored
      commit 68aa8efc upstream.
      
      Dave Hansen reported strange utime/stime values on his system:
      https://lkml.org/lkml/2013/4/4/435
      
      This happens because the prev->stime value is bigger than the rtime
      value. The root of the problem is non-monotonic rtime values (i.e.
      the current rtime is smaller than the previous rtime), and that
      should be debugged and fixed.
      
      But since the problem did not manifest itself before commit
      62188451 "cputime: Avoid multiplication overflow on utime
      scaling", it should be treated as a regression, which we can
      easily fix in the cputime_adjust() function.
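
      To make the underflow concrete, here is a small userspace sketch.
      It assumes 64-bit unsigned counters; the function names are
      hypothetical, and the second function shows one way to avoid the
      wrap-around, not necessarily the exact change made by this patch.

```c
#include <stdint.h>

uint64_t max_u64_sketch(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Unguarded form: if prev_stime > rtime, the unsigned subtraction
 * wraps around and yields a huge bogus utime. */
uint64_t utime_unguarded_sketch(uint64_t prev_utime, uint64_t prev_stime,
                                uint64_t rtime)
{
    return max_u64_sketch(prev_utime, rtime - prev_stime);
}

/* Guarded form: derive utime from the freshly scaled stime and clamp
 * at zero, so a stale larger prev_stime cannot cause a wrap. */
uint64_t utime_guarded_sketch(uint64_t prev_utime, uint64_t stime,
                              uint64_t rtime)
{
    uint64_t utime = (stime <= rtime) ? rtime - stime : 0;

    return max_u64_sketch(prev_utime, utime);
}
```

      With prev_stime = 100 and rtime = 90 the unguarded form returns a
      value close to 2^64, which matches the kind of strange utime
      reported above.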
      
      For now, let's apply this fix, but further work is needed to fix
      the root of the problem.
      Reported-and-tested-by: Dave Hansen <dave@sr71.net>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: rostedt@goodmis.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "math64: New div64_u64_rem helper" · 859a8c0d
      Stanislaw Gruszka authored
      commit f3002134 upstream.
      
      This reverts commit f7926850.
      
      The cputime scaling code was changed/fixed and does not need the
      div64_u64_rem() primitive anymore. It has no other users, so let's
      remove it.
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: rostedt@goodmis.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1367314507-9728-4-git-send-email-sgruszka@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: Do not account bogus utime · f25d7d1c
      Stanislaw Gruszka authored
      commit 772c808a upstream.
      
      Due to rounding in scale_stime(), for big numbers, scaled stime
      values will grow in chunks. Since rtime grows in jiffies and we
      calculate utime like below:
      
      	prev->stime = max(prev->stime, stime);
      	prev->utime = max(prev->utime, rtime - prev->stime);
      
      we could erroneously account stime values as utime. To prevent
      that, only update the prev->{u,s}time values when they are smaller
      than the current rtime.
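
      One possible reading of that rule, as a userspace sketch: skip the
      update whenever the already exported utime + stime is not smaller
      than rtime. The struct and function names are hypothetical and the
      exact condition used by the patch may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for the previously exported cputime values. */
struct cputime_sketch {
    uint64_t utime;
    uint64_t stime;
};

/* Update the exported values only when real execution time (rtime)
 * has moved past what was already reported. */
void cputime_update_sketch(struct cputime_sketch *prev, uint64_t stime,
                           uint64_t rtime)
{
    uint64_t utime;

    /* Already exported at least rtime: keep the old values. */
    if (prev->stime + prev->utime >= rtime)
        return;

    /* Clamp so the subtraction below cannot wrap (sketch-only guard). */
    if (stime > rtime)
        stime = rtime;
    utime = rtime - stime;

    /* Exported values must never go backwards. */
    if (stime > prev->stime)
        prev->stime = stime;
    if (utime > prev->utime)
        prev->utime = utime;
}
```

      If an earlier rounding step exported stime = 110 while rtime is
      only 105, the call leaves both values untouched instead of
      deriving a utime from inconsistent numbers.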
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: rostedt@goodmis.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: Avoid cputime scaling overflow · 434c4913
      Stanislaw Gruszka authored
      commit 55eaa7c1 upstream.
      
      Here is a patch that adds Linus's cputime scaling algorithm to the
      kernel.
      
      This is a follow-up (well, a fix) to commit
      d9a3c982 ("sched: Lower chances
      of cputime scaling overflow"), which tried to avoid
      multiplication overflow but did not guarantee that the overflow
      would not happen.
      
      Linus created a different algorithm, which completely avoids the
      multiplication overflow by dropping precision when numbers are
      big.
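
      The precision-dropping idea can be sketched in userspace as
      follows. This is an illustration written for this summary, not the
      kernel function: the name, the plain C types and the total == 0
      guard are assumptions. It approximates stime * rtime / total by
      shifting bits away until both factors fit in 32 bits, so the final
      product fits in 64 bits.

```c
#include <stdint.h>

uint64_t scale_stime_sketch(uint64_t stime, uint64_t rtime, uint64_t total)
{
    for (;;) {
        /* Keep rtime as the larger of the two factors. */
        if (stime > rtime) {
            uint64_t tmp = rtime;

            rtime = stime;
            stime = tmp;
        }

        /* The divisor must fit in 32 bits. */
        if (total >> 32)
            goto drop_precision;

        /* Both factors fit in 32 bits: the product fits in 64 bits. */
        if (!(rtime >> 32))
            break;

        /* stime has no headroom left, so precision must be dropped. */
        if (stime >> 31)
            goto drop_precision;

        /* Move one bit from rtime to stime; total is untouched. */
        stime <<= 1;
        rtime >>= 1;
        continue;

drop_precision:
        /* Halve the larger factor and the divisor together. */
        rtime >>= 1;
        total >>= 1;
    }

    if (total == 0)
        return 0; /* sketch-only guard against division by zero */

    return (stime * rtime) / total;
}
```

      For small inputs the loop exits immediately and the result is the
      exact quotient; for large inputs the result is approximate, which
      is the trade-off the commit message describes.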
      
      It was tested by me and gives a good relative error for
      scaled numbers. The testing method is described here:
      http://marc.info/?l=linux-kernel&m=136733059505406&w=2
      
      Originally-From: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: rostedt@goodmis.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched: Lower chances of cputime scaling overflow · 96fc7a7d
      Frederic Weisbecker authored
      commit d9a3c982 upstream.
      
      Some users have reported that after running a process with
      hundreds of threads on intensive CPU-bound loads, the cputime
      of the group started to freeze after a few days.
      
      This is due to how we scale the tick-based cputime against
      the scheduler precise execution time value.
      
      We add the values of all threads in the group and we multiply
      that against the sum of the scheduler exec runtime of the whole
      group.
      
      This easily overflows after a few days/weeks of execution.
      
      A proposed solution to solve this was to compute that multiplication
      on stime instead of utime:
         62188451
         ("cputime: Avoid multiplication overflow on utime scaling")
      
      The rationale behind that was that it's easy for a thread to
      spend most of its time in userspace under an intensive CPU-bound workload,
      but it's much harder to sustain a long, intensive CPU-bound run in the kernel.
      
      This postulate got defeated when a user recently reported he was still
      seeing cputime freezes after the above patch. The workload that
      triggers this issue relates to intensive networking workloads where
      most of the cputime is consumed in the kernel.
      
      To further reduce the opportunities for multiplication overflow,
      let's reduce the multiplication factors to the remainders of the division
      between sched exec runtime and cputime. Assuming the difference between
      these shouldn't ever be that large, it could work in many situations.
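
      The remainder trick rests on a simple identity: writing
      rtime = q * total + rem, the value stime * rtime / total equals
      stime * q + (stime * rem) / total, and only the much smaller
      stime * rem has to be multiplied before dividing. A userspace
      sketch of that identity, with a hypothetical function name and a
      sketch-only choice for total == 0:

```c
#include <stdint.h>

/* Compute stime * rtime / total via quotient and remainder, so the
 * large product stime * rtime is never formed directly. The partial
 * product stime * rem can still overflow if rem is large, which is
 * why this only lowers the chance of overflow. */
uint64_t scale_by_remainder_sketch(uint64_t stime, uint64_t rtime,
                                   uint64_t total)
{
    uint64_t q, rem;

    if (total == 0)
        return rtime; /* sketch-only choice for the degenerate case */

    q = rtime / total;
    rem = rtime % total;

    return stime * q + (stime * rem) / total;
}
```

      For example, stime = 2^40, rtime = 3 * 2^40 and total = 2^40 give
      3 * 2^40 here, while the direct product stime * rtime would not
      fit in 64 bits.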
      
      This gets the same results as in the upstream scaling code except for
      a small difference: the upstream code always rounds the results to
      the nearest integer not greater than what would be the precise result.
      The new code rounds to the nearest integer, either greater or not
      greater. In practice this difference probably shouldn't matter, but
      it's worth mentioning.
      
      If this solution appears not to be enough in the end, we'll
      need to partly revert back to the behaviour prior to commit
           0cf55e1e
           ("sched, cputime: Introduce thread_group_times()")
      
      Back then, the scaling was done at exit() time before adding the cputime
      of an exiting thread to the signal struct. And then we'll need to
      scale the cputime of the live threads one by one in thread_group_cputime(). The
      drawback may be slightly slower code at exit time.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • math64: New div64_u64_rem helper · c459e23a
      Frederic Weisbecker authored
      commit f7926850 upstream.
      
      Provide an extended version of div64_u64() that
      also returns the remainder of the division.
      
      We are going to need this to refine the cputime
      scaling code.
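
      For readers outside the kernel, the contract of such a helper can
      be shown with ordinary C operators. This is a userspace stand-in
      with a hypothetical name; the in-kernel helper exists because
      64-bit division is not directly available on every architecture.

```c
#include <stdint.h>

/* Return dividend / divisor and store dividend % divisor through
 * the remainder pointer. The divisor must be non-zero. */
uint64_t div_rem_u64_sketch(uint64_t dividend, uint64_t divisor,
                            uint64_t *remainder)
{
    *remainder = dividend % divisor;
    return dividend / divisor;
}
```

      Quotient and remainder always satisfy
      dividend == quotient * divisor + remainder.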
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm cache: fix error return code in cache_create · 3dc73aa4
      Wei Yongjun authored
      commit fa4d683a upstream.
      
      Return -ENOMEM if memory allocation fails in cache_create
      instead of 0 (to avoid NULL pointer dereference).
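
      The pattern behind this kind of bug is a shared error label that
      returns a status variable which was never set on one of the
      failure paths. A userspace sketch, with hypothetical names and an
      injectable allocation failure added purely for demonstration:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object with one sub-allocation. */
struct cache_sketch {
    void *buffer;
};

/* Hypothetical allocator that can be told to fail. */
void *alloc_sketch(size_t n, int simulate_failure)
{
    return simulate_failure ? NULL : malloc(n);
}

int cache_create_sketch(int simulate_failure, struct cache_sketch **result)
{
    int r = 0;
    struct cache_sketch *cache = calloc(1, sizeof(*cache));

    if (!cache)
        return -ENOMEM;

    cache->buffer = alloc_sketch(64, simulate_failure);
    if (!cache->buffer) {
        /* Without this assignment the function would return 0 and
         * the caller would go on to use an unset *result. */
        r = -ENOMEM;
        goto bad;
    }

    *result = cache;
    return 0;

bad:
    free(cache);
    return r;
}
```

      A caller that checks only the return value is safe with this
      version, because failure can no longer be reported as success.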
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm snapshot: fix error return code in snapshot_ctr · 62253ab0
      Wei Yongjun authored
      commit 09e8b813 upstream.
      
      Return -ENOMEM instead of success if unable to allocate pending
      exception mempool in snapshot_ctr.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm bufio: avoid a possible __vmalloc deadlock · 8f9341a6
      Mikulas Patocka authored
      commit 502624bd upstream.
      
      This patch uses memalloc_noio_save to avoid a possible deadlock in
      dm-bufio.  (It could happen only with a large block size, at most
      PAGE_SIZE << MAX_ORDER, typically 8MiB.)
      
      __vmalloc doesn't fully respect gfp flags. The specified gfp flags are
      used for allocation of requested pages, structures vmap_area, vmap_block
      and vm_struct and the radix tree nodes.
      
      However, the kernel pagetables are allocated always with GFP_KERNEL.
      Thus the allocation of pagetables can recurse back to the I/O layer and
      cause a deadlock.
      
      This patch uses the function memalloc_noio_save to set per-process
      PF_MEMALLOC_NOIO flag and the function memalloc_noio_restore to restore
      it. When this flag is set, all allocations in the process are done with
      implied GFP_NOIO flag, thus the deadlock can't happen.
      
      This should be backported to stable kernels, but they don't have the
      PF_MEMALLOC_NOIO flag and memalloc_noio_save/memalloc_noio_restore
      functions. So, PF_MEMALLOC should be set and restored instead.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm stripe: fix regression in stripe_width calculation · ce397c5e
      Mike Snitzer authored
      commit d793e684 upstream.
      
      Fix a regression in the calculation of the stripe_width in the
      dm stripe target which led to incorrect processing of device limits.
      
      The stripe_width is the stripe device length divided by the number of
      stripes.  The group of commits in the range f14fa693 ("dm stripe: fix
      size test") to eb850de6 ("dm stripe: support for non power of 2
      chunksize") interfered with each other (a merging error) and led to the
      stripe_width being set incorrectly to the stripe device length divided by
      chunk_size * stripe_count.
      
      For example, a stripe device's table with: 0 33553920 striped 3 512 ...
      should result in a stripe_width of 11184640 (33553920 / 3), but due to
      the bug it was getting set to 21845 (33553920 / (512 * 3)).
      
      The impact of this bug is that device topologies that previously worked
      fine with the stripe target are no longer considered valid.  In
      particular, there is a higher risk of seeing this issue if one of the
      stripe devices has a 4K logical block size, resulting in an error
      message like this:
      "device-mapper: table: 253:4: len=21845 not aligned to h/w logical block size 4096 of dm-1"
      
      The fix is to swap the order of the divisions and to use a temporary
      variable for the second one, so that width retains the intended
      value.
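
      The arithmetic from the example table can be checked with a short
      userspace sketch. The function and parameter names are
      hypothetical; only the numbers (33553920 sectors, 3 stripes, chunk
      size 512) come from the description above.

```c
#include <stdint.h>

/* Problematic order: the chunk-size division is applied to width
 * itself, so width ends up as len / (chunk_size * stripes). */
uint64_t stripe_width_buggy_sketch(uint64_t len, uint32_t stripes,
                                   uint32_t chunk_size)
{
    uint64_t width = len;

    width /= chunk_size;
    width /= stripes;
    return width;
}

/* Intended order: divide by the stripe count first and keep that as
 * the width; the chunk-size division goes into a temporary. */
uint64_t stripe_width_fixed_sketch(uint64_t len, uint32_t stripes,
                                   uint32_t chunk_size,
                                   uint64_t *chunks_per_stripe)
{
    uint64_t width = len / stripes;
    uint64_t tmp_len = width / chunk_size;

    *chunks_per_stripe = tmp_len;
    return width;
}
```

      For the table above, the first function yields 21845 and the
      second yields 11184640, matching the two values quoted in the
      commit message.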
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm table: fix write same support · e861593f
      Mike Snitzer authored
      commit dc019b21 upstream.
      
      If device_not_write_same_capable() returns true then the iterate_devices
      loop in dm_table_supports_write_same() should return false.
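
      The intended logic is "supported only if no underlying device is
      incapable". A userspace sketch of that rule; the struct, its field
      and the function names are hypothetical simplifications, not the
      device-mapper API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device description: zero means WRITE SAME is not
 * supported by this device. */
struct dev_sketch {
    unsigned int max_write_same_sectors;
};

bool not_write_same_capable_sketch(const struct dev_sketch *dev)
{
    return dev->max_write_same_sectors == 0;
}

/* The table supports WRITE SAME only if every device does: the
 * first incapable device makes the whole answer false. */
bool table_supports_write_same_sketch(const struct dev_sketch *devs,
                                      size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (not_write_same_capable_sketch(&devs[i]))
            return false;
    }
    return true;
}
```

      Inverting the test inside the loop would report support exactly
      when some device lacks it, which is the mistake the commit
      message describes.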
      Reported-by: Bharata B Rao <bharata.rao@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • DMA: OF: Check properties value before running be32_to_cpup() on it · 82d72f05
      Viresh Kumar authored
      commit 9a188eb1 upstream.
      
      In the of_dma_controller_register() routine we are calling of_get_property() as a
      parameter to be32_to_cpup(). If the property doesn't exist, we will get a
      crash.
      
      This patch changes this code to check if we got a valid property first and then
      runs be32_to_cpup() on it.
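
      The same check-before-decode pattern, sketched in userspace. The
      property is modelled as a pointer to four big-endian bytes that
      may be NULL when the property is missing; all names here are
      hypothetical and no device tree API is used.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value from four bytes. */
uint32_t be32_decode_sketch(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Check that the property exists before decoding it. Returns 0 and
 * stores the value on success, -1 if the property is missing. */
int read_cells_sketch(const unsigned char *prop, uint32_t *out)
{
    if (!prop)
        return -1;

    *out = be32_decode_sketch(prop);
    return 0;
}
```

      Passing the lookup result straight into the decoder, without the
      NULL check, is what leads to the crash described above.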
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Vinod Koul <vinod.koul@intel.com>
      Signed-off-by: Robert Richter <robert.richter@calxeda.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: hda - Fix 3.9 regression of EAPD init on Conexant codecs · 7a6ff79b
      Takashi Iwai authored
      commit ff359b14 upstream.
      
      The older Conexant codecs have up to two EAPDs, and these are supposed
      to be turned on more or less statically.  The new generic parser code
      assumes dynamic on/off usage per path, which resulted in silent
      output on some machines.

      This patch fixes the problem by simply assuming static EAPD on for
      such old Conexant codecs, as we did up to the 3.8 kernel.
      Reported-and-tested-by: Christopher K. <c.krooss@gmail.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: HDA: Fix Oops caused by dereference NULL pointer · 9437f0b0
      Wang YanQing authored
      commit 2195b063 upstream.
      
      The interrupt handler azx_interrupt calls azx_update_rirb, which may
      call snd_hda_queue_unsol_event, which in turn dereferences the
      chip->bus pointer.

      The problem is that chip->bus is allocated in azx_codec_create, which
      is called only after azx_probe has enabled the IRQ and unsolicited
      events.

      This causes an Oops due to a NULL pointer dereference. I hit it myself, good luck :)
      
      [Rearranged the NULL check before the tracepoint and added another
       NULL check of bus->workq -- tiwai]
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Revert "ALSA: hda - Don't set up active streams twice" · 06856c2e
      Takashi Iwai authored
      commit 6c35ae3c upstream.
      
      This reverts commit affdb62b.
      
      The commit introduced a regression with AD codecs where the stream is
      always cleaned up.  Since the patch is just a minor optimization and
      reverting the commit fixes the issue, let's just revert it.
      Reported-and-tested-by: Michael Burian <michael.burian@sbg.at>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ACPICA: Fix possible buffer overflow during a field unit read operation · a58e4edc
      Bob Moore authored
      commit 61388f9e upstream.
      
      Can only happen under these conditions: 1) The DSDT version is 1,
      meaning integers are 32-bits.  2) The field is between 33 and 64
      bits long.
      
      It applies cleanly back to ACPICA 20100806+ (Linux v2.6.37+).
      Signed-off-by: Bob Moore <robert.moore@intel.com>
      Signed-off-by: Lv Zheng <lv.zheng@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ASoC: wm8994: missing break in wm8994_aif3_hw_params() · ccd72f9a
      Dan Carpenter authored
      commit 4495e46f upstream.
      
      The missing break here means that we always return early and the
      function is a no-op.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
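
To see how a missing break can turn a whole function into a no-op, here is a small sketch that assumes a nested switch. The driver source is not part of this log, so the shape of the code, the enum, the function name `pick_aif3_reg()` and the register value are all hypothetical.

```c
#include <assert.h>

enum codec_type_sketch { CODEC_OLD, CODEC_NEW };    /* hypothetical */

/*
 * Returns the register that would be programmed, or 0 when the function
 * bails out early without doing anything.
 */
static int pick_aif3_reg(int dai_id, enum codec_type_sketch type)
{
    int reg;

    assert(dai_id >= 1);
    switch (dai_id) {
    case 3:
        switch (type) {
        case CODEC_NEW:
            reg = 0x520;    /* made-up register address */
            break;          /* leaves only the inner switch */
        default:
            return 0;
        }
        break;  /* without this, control falls through into "default" below */
    default:
        return 0;
    }
    return reg;
}
```

If the outer `break` is removed, every call ends in `return 0`, which matches the "always return early" symptom the commit message describes.
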
    • ARM: OMAP: RX-51: change probe order of touchscreen and panel SPI devices · 2016e20a
      Aaro Koskinen authored
      commit e65f131a upstream.
      
      Commit 9fdca9df (spi: omap2-mcspi: convert to module_platform_driver)
      broke the SPI display/panel driver probe on RX-51/N900. The exact cause is
      not fully understood, but it seems to be related to the probe order. SPI
      communication to the panel driver (spi1.2) fails unless the touchscreen
      (spi1.0) has been probed/initialized first. Once the omap2-mcspi driver
      was converted to a platform driver, the devices came to be probed
      immediately after the board registers them, in the order they are
      listed in the board file.
      
      Fix the issue by moving the touchscreen before the panel in the SPI
      device list.
      
      The patch fixes the following failure:
      
      [    1.260955] acx565akm spi1.2: invalid display ID
      [    1.265899] panel-acx565akm display0: acx_panel_probe panel detect error
      [    1.273071] omapdss CORE error: driver probe failed: -19
      Tested-by: Sebastian Reichel <sre@debian.org>
      Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Cc: Pali Rohár <pali.rohar@gmail.com>
      Cc: Joni Lapilainen <joni.lapilainen@gmail.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • HID: reintroduce fix-up for certain Sony RF receivers · 6defe2bd
      Fernando Luis Vazquez Cao authored
      commit c1e0ac19 upstream.
      
      It looks like the manual merge 0d69a3c7 ("Merge
      branches 'for-3.9/sony' and 'for-3.9/steelseries' into for-linus") accidentally
      removed Sony RF receiver with USB product id 0x0374 from the "have special
      driver" list, effectively nullifying a4649184
      ("HID: add support for Sony RF receiver with USB product id 0x0374"). Add the
      device back to the list.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: emulator: emulate SALC · c2b49720
      Paolo Bonzini authored
      commit 326f578f upstream.
      
      This is an almost-undocumented instruction available in 32-bit mode.
      I say "almost" undocumented because AMD documents it in their opcode
      maps just to say that it is unavailable in 64-bit mode (sections
      "A.2.1 One-Byte Opcodes" and "B.3 Invalid and Reassigned Instructions
      in 64-Bit Mode").
      
      It is roughly equivalent to "sbb %al, %al" except it does not
      set the flags.  Use fastop to emulate it, but do not use the opcode
      directly because it would fail if the host is 64-bit!
      Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
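
The SALC semantics described above (AL becomes all ones or all zeroes depending on the carry flag, with the flags left alone) can be modelled in a few lines of C. This is a user-space model, not the KVM fastop implementation; the function names are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SALC: AL <- 0xFF if the carry flag is set, else 0x00; flags are untouched. */
static uint8_t salc(bool carry_flag)
{
    return carry_flag ? 0xFF : 0x00;
}

/* The AL result of "sbb %al, %al": al - al - CF, truncated to 8 bits. */
static uint8_t sbb_al_al(uint8_t al, bool carry_flag)
{
    return (uint8_t)(al - al - (carry_flag ? 1 : 0));
}

/* Usage: both forms agree for every AL value and either carry state. */
static void salc_selfcheck(void)
{
    for (int al = 0; al < 256; al++) {
        assert(sbb_al_al((uint8_t)al, false) == salc(false));
        assert(sbb_al_al((uint8_t)al, true) == salc(true));
    }
}
```

The model only captures the value written to AL; the difference the commit message points out, that SALC does not modify the flags while SBB does, is outside what this sketch represents.
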
    • KVM: emulator: emulate XLAT · e3ed61f1
      Paolo Bonzini authored
      commit 7fa57952 upstream.
      
      SGABIOS uses this instruction, so KVM breaks with
      emulate_invalid_guest_state=1 unless it is emulated.
      It is just a MOV in disguise, with a funny source address.
      Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
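
As a rough illustration of the "MOV in disguise" remark, XLAT loads AL from the byte at BX plus the zero-extended AL. The sketch below models that over a flat byte array; `xlat()`, `mem` and `mem_size` are names invented here, and segment overrides and address-size wraparound are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * XLAT: AL <- memory[BX + zero-extended AL].
 * `mem` and `mem_size` model the data segment as a flat array.
 */
static uint8_t xlat(const uint8_t *mem, size_t mem_size, uint32_t bx, uint8_t al)
{
    size_t addr = (size_t)bx + al;  /* the "funny" source address */

    assert(mem != NULL && addr < mem_size);
    return mem[addr];               /* an ordinary byte load into AL */
}
```
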
    • KVM: emulator: emulate AAM · a58a4482
      Paolo Bonzini authored
      commit a035d5c6 upstream.
      
      SGABIOS uses this instruction, so KVM breaks with
      emulate_invalid_guest_state=1 unless it is emulated.
      
      AAM needs the source operand to be unsigned; do the same in AAD as well
      for consistency, even though it does not affect the result.
      Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
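
A small model of AAM shows why the immediate has to be treated as unsigned: the instruction divides AL by an 8-bit immediate and stores the quotient in AH and the remainder in AL. The function `aam()` below is a hypothetical user-space model, not the emulator code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* AAM imm8: AH <- AL / imm8, AL <- AL % imm8, all as unsigned bytes. */
static uint16_t aam(uint8_t al, uint8_t imm8)
{
    assert(imm8 != 0);  /* real hardware raises #DE for a zero divisor */

    uint8_t ah = (uint8_t)(al / imm8);
    uint8_t rem = (uint8_t)(al % imm8);

    return (uint16_t)((ah << 8) | rem);     /* AX = AH:AL */
}
```

With the usual base-10 operand, `aam(0x2F, 10)` splits 47 into AH = 4 and AL = 7. An immediate of 0x80 or above is where signedness matters: as an unsigned byte it is 128 or more, whereas a sign-extended reading would make the divisor negative.
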
    • KVM: VMX: fix halt emulation while emulating invalid guest state · f7f76899
      Gleb Natapov authored
      commit 8d76c49e upstream.
      
      The invalid guest state emulation loop does not check halt_request,
      which causes a 100% CPU loop while the guest is halted and in an invalid
      state. A more serious issue is that this leaves halt_request set, so a
      random instruction emulated on a vm86 #GP exit can be interpreted
      as a halt, which hangs the guest. Fix both problems by handling
      halt_request in the emulation loop.
      Reported-by: Tomas Papan <tomas.papan@gmail.com>
      Tested-by: Tomas Papan <tomas.papan@gmail.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 11 May, 2013 9 commits
    • Linux 3.9.2 · 57049bb1
      Greg Kroah-Hartman authored
    • kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees() · bf9ccddf
      Chen Gang authored
      commit 12b2f117 upstream.
      
      audit_trim_trees() calls get_tree().  If a failure occurs we must call
      put_tree().
      
      [akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
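
The get/put pairing that the audit_tree commit restores can be illustrated with a toy reference counter. Only the names get_tree() and put_tree() come from the commit message; `struct tree_sketch`, `trim_one()` and the `lookup_fails` flag are hypothetical and exist only to make the failure path visible.

```c
#include <assert.h>
#include <stdbool.h>

struct tree_sketch { int count; };  /* hypothetical refcounted object */

static void get_tree(struct tree_sketch *t)
{
    t->count++;
}

static void put_tree(struct tree_sketch *t)
{
    assert(t->count > 0);
    t->count--;
}

/*
 * Hypothetical trimming step: takes a reference and may fail part-way.
 * Returns 0 on success, -1 on failure; the reference is dropped either way.
 */
static int trim_one(struct tree_sketch *t, bool lookup_fails)
{
    get_tree(t);
    if (lookup_fails) {
        put_tree(t);    /* the failure path must not keep the reference */
        return -1;
    }
    /* ... normal work would happen here ... */
    put_tree(t);
    return 0;
}
```

Dropping the `put_tree()` on the failure branch would leave the count one too high after every failed call, which is the leak the commit title refers to.
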
    • NFSv4.x: Fix handling of partially delegated locks · dc994243
      Trond Myklebust authored
      commit c5a2a15f upstream.
      
      If an NFS client receives a delegation for a file after it has taken
      a lock on that file, we can currently end up in a situation where
      we mistakenly skip unlocking that file.
      
      The following patch swaps an erroneous check in nfs4_proc_unlck for
      whether or not the file has a delegation to one which checks whether
      or not we hold a lock stateid for that file.
      Reported-by: Chuck Lever <Chuck.Lever@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Tested-by: Chuck Lever <Chuck.Lever@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • qmi_wwan/cdc_ether: add device IDs for Dell 5804 (Novatel E371) WWAN card · 8441a6f4
      Dan Williams authored
      commit 7fdb7846 upstream.
      
      A rebranded Novatel E371 for AT&T's LTE bands.  qmi_wwan should drive this
      device, while cdc_ether should ignore it.  Even though the USB descriptors
      are plain CDC-ETHER, that USB interface is a QMI interface.
      Signed-off-by: Dan Williams <dcbw@redhat.com>
      Acked-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • PCI: Delay final fixups until resources are assigned · bb22b760
      Yinghai Lu authored
      commit e253aaf0 upstream.
      
      Commit 4f535093 "PCI: Put pci_dev in device tree as early as possible"
      moved final fixups from pci_bus_add_device() to pci_device_add().  But
      pci_device_add() happens before resource assignment, so BARs may not be
      valid yet.
      
      Typical flow for hot-add:
      
          pciehp_configure_device
            pci_scan_slot
              pci_scan_single_device
                pci_device_add
                  pci_fixup_device(pci_fixup_final, dev)  # previous location
            # resource assignment happens here
            pci_bus_add_devices
              pci_bus_add_device
                pci_fixup_device(pci_fixup_final, dev)    # new location
      
      [bhelgaas: changelog, move fixups to pci_bus_add_device()]
      Reference: https://lkml.kernel.org/r/20130415182614.GB9224@xanatos
      Reported-by: David Bulkow <David.Bulkow@stratus.com>
      Tested-by: David Bulkow <David.Bulkow@stratus.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • EDAC: Don't give write permission to read-only files · f2e426a4
      Srivatsa S. Bhat authored
      commit c8c64d16 upstream.
      
      I get the following warning on boot:
      
      ------------[ cut here ]------------
      WARNING: at drivers/base/core.c:575 device_create_file+0x9a/0xa0()
      Hardware name:  -[8737R2A]-
      Write permission without 'store'
      ...
      </snip>
      
      Drilling down, this is related to dynamic channel ce_count attribute
      files sporting a S_IWUSR mode without a ->store() function. Looking
      around, it appears that they aren't supposed to have a ->store()
      function. So remove the bogus write permission to get rid of the
      warning.
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
      [ shorten commit message ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix extent logging with O_DIRECT into prealloc · d2775711
      Josef Bacik authored
      commit eb384b55 upstream.
      
      This is the same as the fix from commit
      
      Btrfs: fix bad extent logging
      
      but for O_DIRECT.  I missed this when I fixed the problem originally: we were
      still using the em for the orig_start and orig_block_len, which would be the
      merged extent.  We need to use the actual extent from the on-disk file extent
      item, which we have to look up anyway to make sure it's ok to nocow, so just pass
      in some pointers to hold this info.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: compare relevant parts of delayed tree refs · a2d8e3c7
      Josef Bacik authored
      commit 41b0fc42 upstream.
      
      A user reported a panic while running a balance.  What was happening was he was
      relocating a block, which added the reference to the relocation tree.  Then
      relocation would walk through the relocation tree and drop that reference and
      free that block, and then it would walk down a snapshot which referenced the
      same block and add another ref to the block.  The problem is that this was all
      happening in the same transaction, so the parent block was freed when we
      dropped our reference, became immediately available for allocation, and
      was then used _again_ to add a reference for the same block from a different
      snapshot.  This resulted in something like this in the delayed ref tree:
      
      add ref to 90234880, parent=2067398656, ref_root 1766, level 1
      del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
      add ref to 90234880, parent=2067398656, ref_root 1767, level 1
      
      As you can see, the ref_roots don't match, because when we inc the ref we use
      the header owner, which is the original tree the block belonged to, instead of
      the data reloc tree.  Then when we remove the extent we use the reloc tree
      objectid.  But none of this matters, since it is a shared reference which means
      only the parent matters.  When the delayed ref stuff runs it adds all the
      increments first, and then does all the drops, to make sure that we don't delete
      the ref if we net a positive ref count.  But tree blocks aren't allowed to have
      multiple refs from the same block, so this panics when it tries to add the
      second ref.  We need the add and the drop to cancel each other out in memory so
      we only do the final add.
      
      So to fix this we need to adjust how the delayed refs are added to the tree.
      Only the ref_root matters when it is a normal backref, and only the parent
      matters when it is a shared backref.  So make our decision based on what ref
      type we have.  This allows us to keep the ref_root in memory in case anybody
      wants to use it for something else, and it allows the delayed refs to be merged
      properly so we don't end up with this panic.
      
      With this patch the user's image no longer panics on mount, and it has a clean
      fsck after a normal mount/umount cycle.  Thanks,
      Reported-by: Roman Mamedov <rm@romanrm.ru>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
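
The comparison rule from the Btrfs commit above (root decides for normal backrefs, parent decides for shared backrefs) can be sketched as a comparator. The struct and function names below are hypothetical reductions for illustration, not the definitions from fs/btrfs; the numbers in the usage note are taken from the delayed ref listing in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced delayed tree ref. */
struct tree_ref_sketch {
    uint64_t root;      /* ref_root */
    uint64_t parent;
};

static int cmp_u64(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}

/*
 * Compare two refs of the same type. Normal backrefs are keyed by root,
 * shared backrefs by parent; the other field is ignored.
 */
static int comp_tree_refs_sketch(const struct tree_ref_sketch *a,
                                 const struct tree_ref_sketch *b,
                                 bool shared)
{
    assert(a != NULL && b != NULL);
    return shared ? cmp_u64(a->parent, b->parent)
                  : cmp_u64(a->root, b->root);
}
```

Taking the first two entries of the listing (ref_root 1766 versus 18446744073709551608, both with parent 2067398656), the shared comparison treats them as equal, so the add and the drop can be merged in memory, while a comparison on root would keep them apart.
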
    • tracing: Fix ftrace_dump() · 67d9d1c1
      Steven Rostedt (Red Hat) authored
      commit 7fe70b57 upstream.
      
      ftrace_dump() had a lot of issues. What ftrace_dump() does is this: when
      ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
      dumps the ftrace buffers to the console when an oops, a panic, or a
      sysrq-z occurs.
      
      This was written a long time ago when ftrace was fragile to recursion.
      But it wasn't written well even for that.
      
      There's a possible deadlock that can occur if a ftrace_dump() is happening
      and an NMI triggers another dump. This is because it grabs a lock
      before checking if the dump ran.
      
      It also totally disables ftrace and tracing for no good reason.

      As the ring_buffer now checks whether it is being read via an oops or NMI,
      where there's a chance that the buffer gets corrupted, it will disable
      itself. No need to have ftrace_dump() do the same.
      
      ftrace_dump() is now cleaned up so that it uses an atomic counter to
      make sure only one dump happens at a time. A simple atomic_inc_return()
      is all that is needed to cover both other CPUs and NMIs. No need for
      a spinlock: if one CPU is running the dump, no other CPU needs
      to do it too.
      
      The tracing_on variable is turned off and not turned back on. The original
      code did this, but it wasn't pretty. By just disabling this variable
      we get the result of not seeing traces that happen between crashes.
      
      For sysrq-z, it doesn't get turned on, but the user can always write
      a '1' to the tracing_on file. If they are using sysrq-z, then they should
      know about tracing_on.
      
      The new code is much easier to read and less error prone. No more
      deadlock possibility when an NMI triggers here.
      Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
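
The "atomic counter instead of a lock" idea from the ftrace_dump() commit can be sketched with C11 atomics. This is a user-space approximation: `atomic_fetch_add(...) + 1` stands in for the kernel's atomic_inc_return(), and `dump_once_at_a_time()`, `dumps_done` and the `simulate_nested` flag are hypothetical names used only to imitate an NMI arriving in the middle of a dump.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int dump_running; /* zero-initialized: no dump in progress */
static int dumps_done;          /* counts dumps actually performed */

/* Returns true if this caller performed the dump, false if one was running. */
static bool dump_once_at_a_time(bool simulate_nested)
{
    bool did_dump = false;

    /* atomic_fetch_add() returns the old value, so +1 gives the new count. */
    if (atomic_fetch_add(&dump_running, 1) + 1 == 1) {
        if (simulate_nested) {
            /* Stands in for an NMI arriving mid-dump: it must back off. */
            bool nested = dump_once_at_a_time(false);
            assert(!nested);
            (void)nested;
        }
        dumps_done++;
        did_dump = true;
    }
    atomic_fetch_sub(&dump_running, 1);
    return did_dump;
}
```

Because the second entrant never blocks and simply backs off, there is nothing for a nested NMI to deadlock on, which is the property the commit message is after.
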