1. 18 Mar, 2015 17 commits
  2. 06 Mar, 2015 23 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.19.1 · 5392bc6b
      Greg Kroah-Hartman authored
      5392bc6b
    • Christian Borntraeger's avatar
      ppc/kvm: Replace ACCESS_ONCE with READ_ONCE · 3c201105
      Christian Borntraeger authored
      commit 5ee07612 upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Change the ppc/kvm code to replace ACCESS_ONCE with READ_ONCE.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarAlexander Graf <agraf@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c201105
    • Christian Borntraeger's avatar
      ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE · 2a806d27
      Christian Borntraeger authored
      commit da1a288d upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Change the ppc/hugetlbfs code to replace ACCESS_ONCE with READ_ONCE.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a806d27
    • Christian Borntraeger's avatar
      mm/gup: Replace ACCESS_ONCE with READ_ONCE · dd4c1085
      Christian Borntraeger authored
      commit 38c5ce93 upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Fixup gup_pmd_range.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd4c1085
    • Guenter Roeck's avatar
      next: sh: Fix compile error · 74df55cf
      Guenter Roeck authored
      commit 378af02b upstream.
      
      Commit 927609d6 ("kernel: tighten rules for ACCESS ONCE") results in a
      compile failure for sh builds with CONFIG_X2TLB enabled.
      
      arch/sh/mm/gup.c: In function 'gup_get_pte':
      arch/sh/mm/gup.c:20:2: error: invalid initializer
      make[1]: *** [arch/sh/mm/gup.o] Error 1
      
      Replace ACCESS_ONCE with READ_ONCE to fix the problem.
      
      Fixes: 927609d6 ("kernel: tighten rules for ACCESS ONCE")
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74df55cf
    • Jan Kara's avatar
      quota: Store maximum space limit in bytes · e1adf6d7
      Jan Kara authored
      commit b10a0819 upstream.
      
      Currently maximum space limit quota format supports is in blocks however
      since we store space limits in bytes, this is somewhat confusing. So
      store the maximum limit in bytes as well. Also rename the field to match
      the new unit and related inode field to match the new naming scheme.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1adf6d7
    • Christian Borntraeger's avatar
      x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE · 24c94559
      Christian Borntraeger authored
      commit 1760f1eb upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Change the p2m code to replace ACCESS_ONCE with READ_ONCE.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Acked-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24c94559
    • Raghavendra K T's avatar
      x86/spinlocks/paravirt: Fix memory corruption on unlock · dcc2e51a
      Raghavendra K T authored
      commit d6abfdb2 upstream.
      
      Paravirt spinlock clears slowpath flag after doing unlock.
      As explained by Linus currently it does:
      
                      prev = *lock;
                      add_smp(&lock->tickets.head, TICKET_LOCK_INC);
      
                      /* add_smp() is a full mb() */
      
                      if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                              __ticket_unlock_slowpath(lock, prev);
      
      which is *exactly* the kind of things you cannot do with spinlocks,
      because after you've done the "add_smp()" and released the spinlock
      for the fast-path, you can't access the spinlock any more.  Exactly
      because a fast-path lock might come in, and release the whole data
      structure.
      
      Linus suggested that we should not do any writes to lock after unlock(),
      and we can move slowpath clearing to fastpath lock.
      
      So this patch implements the fix with:
      
       1. Moving slowpath flag to head (Oleg):
          Unlocked locks don't care about the slowpath flag; therefore we can keep
          it set after the last unlock, and clear it again on the first (try)lock.
          -- this removes the write after unlock. note that keeping slowpath flag would
          result in unnecessary kicks.
          By moving the slowpath flag from the tail to the head ticket we also avoid
          the need to access both the head and tail tickets on unlock.
      
       2. use xadd to avoid read/write after unlock that checks the need for
          unlock_kick (Linus):
          We further avoid the need for a read-after-release by using xadd;
          the prev head value will include the slowpath flag and indicate if we
          need to do PV kicking of suspended spinners -- on modern chips xadd
          isn't (much) more expensive than an add + load.
      
      Result:
       setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
       benchmark overcommit %improve
       kernbench  1x           -0.13
       kernbench  2x            0.02
       dbench     1x           -1.77
       dbench     2x           -0.63
      
      [Jeremy: Hinted missing TICKET_LOCK_INC for kick]
      [Oleg: Moved slowpath flag to head, ticket_equals idea]
      [PeterZ: Added detailed changelog]
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: a.ryabinin@samsung.com
      Cc: dave@stgolabs.net
      Cc: hpa@zytor.com
      Cc: jasowang@redhat.com
      Cc: jeremy@goop.org
      Cc: paul.gortmaker@windriver.com
      Cc: riel@redhat.com
      Cc: tglx@linutronix.de
      Cc: waiman.long@hp.com
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dcc2e51a
    • Linus Torvalds's avatar
      kernel: make READ_ONCE() valid on const arguments · b6888abd
      Linus Torvalds authored
      commit dd369297 upstream.
      
      The use of READ_ONCE() causes lots of warnings witht he pending paravirt
      spinlock fixes, because those ends up having passing a member to a
      'const' structure to READ_ONCE().
      
      There should certainly be nothing wrong with using READ_ONCE() with a
      const source, but the helper function __read_once_size() would cause
      warnings because it would drop the 'const' qualifier, but also because
      the destination would be marked 'const' too due to the use of 'typeof'.
      
      Use a union of types in READ_ONCE() to avoid this issue.
      
      Also make sure to use parenthesis around the macro arguments to avoid
      possible operator precedence issues.
      Tested-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b6888abd
    • Christian Borntraeger's avatar
      kernel: Fix sparse warning for ACCESS_ONCE · fec1418d
      Christian Borntraeger authored
      commit c5b19946 upstream.
      
      Commit 927609d6 ("kernel: tighten rules for ACCESS ONCE") results in
      sparse warnings like "Using plain integer as NULL pointer" - Let's add a
      type cast to the dummy assignment.
      To avoid warnings lik "sparse: warning: cast to restricted __hc32" we also
      use __force on that cast.
      
      Fixes: 927609d6 ("kernel: tighten rules for ACCESS ONCE")
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fec1418d
    • Christian Borntraeger's avatar
      kernel: tighten rules for ACCESS ONCE · 02c3106b
      Christian Borntraeger authored
      commit 927609d6 upstream.
      
      Now that all non-scalar users of ACCESS_ONCE have been converted
      to READ_ONCE or ASSIGN once, lets tighten ACCESS_ONCE to only
      work on scalar types.
      This variant was proposed by Alexei Starovoitov.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      02c3106b
    • Andy Shevchenko's avatar
      x86: pmc-atom: Assign debugfs node as soon as possible · db8b5886
      Andy Shevchenko authored
      commit 1b43d712 upstream.
      
      pmc_dbgfs_unregister() will be called when pmc->dbgfs_dir is unconditionally
      NULL on error path in pmc_dbgfs_register(). To prevent this we move the
      assignment to where is should be.
      
      Fixes: f855911c (x86/pmc_atom: Expose PMC device state and platform sleep state)
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Aubrey Li <aubrey.li@linux.intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Kumar P. Mahesh <mahesh.kumar.p@intel.com>
      Link: http://lkml.kernel.org/r/1421253575-22509-2-git-send-email-andriy.shevchenko@linux.intel.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db8b5886
    • Jiang Liu's avatar
      x86/irq: Fix regression caused by commit b568b860 · 58557add
      Jiang Liu authored
      commit 1ea76fba upstream.
      
      Commit b568b860 ("Treat SCI interrupt as normal GSI interrupt")
      accidently removes support of legacy PIC interrupt when fixing a
      regression for Xen, which causes a nasty regression on HP/Compaq
      nc6000 where we fail to register the ACPI interrupt, and thus
      lose eg. thermal notifications leading a potentially overheated
      machine.
      
      So reintroduce support of legacy PIC based ACPI SCI interrupt.
      Reported-by: default avatarVille Syrjälä <syrjala@sci.fi>
      Tested-by: default avatarVille Syrjälä <syrjala@sci.fi>
      Signed-off-by: default avatarJiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarPavel Machek <pavel@ucw.cz>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: linux-pm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1424052673-22974-1-git-send-email-jiang.liu@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58557add
    • Hector Marco-Gisbert's avatar
      x86, mm/ASLR: Fix stack randomization on 64-bit systems · 2f455ae1
      Hector Marco-Gisbert authored
      commit 4e7c22d4 upstream.
      
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
                 return PAGE_ALIGN(stack_top) + random_variable;
                 return PAGE_ALIGN(stack_top) - random_variable;
        }
      
      Note that, it declares the "random_variable" variable as "unsigned int".
      Since the result of the shifting operation between STACK_RND_MASK (which
      is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
      
      	  random_variable <<= PAGE_SHIFT;
      
      then the two leftmost bits are dropped when storing the result in the
      "random_variable". This variable shall be at least 34 bits long to hold
      the (22+12) result.
      
      These two dropped bits have an impact on the entropy of process stack.
      Concretely, the total stack entropy is reduced by four: from 2^28 to
      2^30 (One fourth of expected entropy).
      
      This patch restores back the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: default avatarHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: default avatarIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.netSigned-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f455ae1
    • Matt Fleming's avatar
      x86/efi: Avoid triple faults during EFI mixed mode calls · c4f934bc
      Matt Fleming authored
      commit 96738c69 upstream.
      
      Andy pointed out that if an NMI or MCE is received while we're in the
      middle of an EFI mixed mode call a triple fault will occur. This can
      happen, for example, when issuing an EFI mixed mode call while running
      perf.
      
      The reason for the triple fault is that we execute the mixed mode call
      in 32-bit mode with paging disabled but with 64-bit kernel IDT handlers
      installed throughout the call.
      
      At Andy's suggestion, stop playing the games we currently do at runtime,
      such as disabling paging and installing a 32-bit GDT for __KERNEL_CS. We
      can simply switch to the __KERNEL32_CS descriptor before invoking
      firmware services, and run in compatibility mode. This way, if an
      NMI/MCE does occur the kernel IDT handler will execute correctly, since
      it'll jump to __KERNEL_CS automatically.
      
      However, this change is only possible post-ExitBootServices(). Before
      then the firmware "owns" the machine and expects for its 32-bit IDT
      handlers to be left intact to service interrupts, etc.
      
      So, we now need to distinguish between early boot and runtime
      invocations of EFI services. During early boot, we need to restore the
      GDT that the firmware expects to be present. We can only jump to the
      __KERNEL32_CS code segment for mixed mode calls after ExitBootServices()
      has been invoked.
      
      A liberal sprinkling of comments in the thunking code should make the
      differences in early and late environments more apparent.
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Tested-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4f934bc
    • Thadeu Lima de Souza Cascardo's avatar
      blk-throttle: check stats_cpu before reading it from sysfs · 84cf4333
      Thadeu Lima de Souza Cascardo authored
      commit 045c47ca upstream.
      
      When reading blkio.throttle.io_serviced in a recently created blkio
      cgroup, it's possible to race against the creation of a throttle policy,
      which delays the allocation of stats_cpu.
      
      Like other functions in the throttle code, just checking for a NULL
      stats_cpu prevents the following oops caused by that race.
      
      [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
      [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
      [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
      [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
      [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
      [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
      [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
      [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
      [ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 42008884  XER: 20000000
      [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
      GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
      GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
      GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
      GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
      GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
      GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
      [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
      [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
      [ 1137.734943] Call Trace:
      [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
      [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
      [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
      [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
      [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
      [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
      [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
      [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
      [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
      [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
      [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
      [ 1137.735383] Instruction dump:
      [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
      [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
      
      And here is one code that allows to easily reproduce this, although this
      has first been found by running docker.
      
      void run(pid_t pid)
      {
      	int n;
      	int status;
      	int fd;
      	char *buffer;
      	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
      	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
      	fd = open(CGPATH "/test/tasks", O_WRONLY);
      	write(fd, buffer, n);
      	close(fd);
      	if (fork() > 0) {
      		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
      		read(fd, buffer, 512);
      		close(fd);
      		wait(&status);
      	} else {
      		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
      		n = read(fd, buffer, BUFFER_SIZE);
      		close(fd);
      	}
      	free(buffer);
      	exit(0);
      }
      
      void test(void)
      {
      	int status;
      	mkdir(CGPATH "/test", 0666);
      	if (fork() > 0)
      		wait(&status);
      	else
      		run(getpid());
      	rmdir(CGPATH "/test");
      }
      
      int main(int argc, char **argv)
      {
      	int i;
      	for (i = 0; i < NR_TESTS; i++)
      		test();
      	return 0;
      }
      Reported-by: default avatarRicardo Marin Matinata <rmm@br.ibm.com>
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84cf4333
    • Filipe Manana's avatar
      Btrfs: fix fsync data loss after adding hard link to inode · c5966bf7
      Filipe Manana authored
      commit 1a4bcf47 upstream.
      
      We have a scenario where after the fsync log replay we can lose file data
      that had been previously fsync'ed if we added an hard link for our inode
      and after that we sync'ed the fsync log (for example by fsync'ing some
      other file or directory).
      
      This is because when adding an hard link we updated the inode item in the
      log tree with an i_size value of 0. At that point the new inode item was
      in memory only and a subsequent fsync log replay would not make us lose
      the file data. However if after adding the hard link we sync the log tree
      to disk, by fsync'ing some other file or directory for example, we ended
      up losing the file data after log replay, because the inode item in the
      persisted log tree had an an i_size of zero.
      
      This is easy to reproduce, and the following excerpt from my test for
      xfstests shows this:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create one file with data and fsync it.
        # This made the btrfs fsync log persist the data and the inode metadata with
        # a correct inode->i_size (4096 bytes).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now add one hard link to our file. This made the btrfs code update the fsync
        # log, in memory only, with an inode metadata having a size of 0.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now force persistence of the fsync log to disk, for example, by fsyncing some
        # other file.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # Before a power loss or crash, we could read the 4Kb of data from our file as
        # expected.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After the fsync log replay, because the fsync log had a value of 0 for our
        # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
        # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      Another alternative test, that doesn't need to fsync an inode in the same
      transaction it was created, is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with some data.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Make sure the file is durably persisted.
        sync
      
        # Append some data to our file, to increase its size.
        $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
             $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, so from this point on if a crash/power failure happens, our
        # new data is guaranteed to be there next time the fs is mounted.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Add one hard link to our file. This made btrfs write into the in memory fsync
        # log a special inode with generation 0 and an i_size of 0 too. Note that this
        # didn't update the inode in the fsync log on disk.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link
      
        # Now make sure the in memory fsync log is durably persisted.
        # Creating and fsync'ing another file will do it.
        touch $SCRATCH_MNT/bar
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
      
        # As expected, before the crash/power failure, we should be able to read the
        # 12Kb of file data.
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After mounting the fs again, the fsync log was replayed.
        # The btrfs fsync log replay code didn't update the i_size of the persisted
        # inode because the inode item in the log had a special generation with a
        # value of 0 (and it couldn't know the correct i_size, since that inode item
        # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
        # effectively lost.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This isn't a new issue/regression. This problem has been around since the
      log tree code was added in 2008:
      
        Btrfs: Add a write ahead tree log to optimize synchronous operations
        (commit e02119d5)
      
      Test cases for xfstests follow soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5966bf7
    • David Sterba's avatar
      btrfs: fix leak of path in btrfs_find_item · 46da9e76
      David Sterba authored
      commit 381cf658 upstream.
      
      If btrfs_find_item is called with NULL path it allocates one locally but
      does not free it. Affected paths are inserting an orphan item for a file
      and for a subvol root.
      
      Move the path allocation to the callers.
      
      Fixes: 3f870c28 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      46da9e76
    • David Sterba's avatar
      btrfs: set proper message level for skinny metadata · 1c034fa3
      David Sterba authored
      commit 5efa0490 upstream.
      
      This has been confusing people for too long, the message is really just
      informative.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c034fa3
    • Ilya Dryomov's avatar
      libceph: fix double __remove_osd() problem · 1c9f7cbf
      Ilya Dryomov authored
      commit 7eb71e03 upstream.
      
      It turns out it's possible to get __remove_osd() called twice on the
      same OSD.  That doesn't sit well with rb_erase() - depending on the
      shape of the tree we can get a NULL dereference, a soft lockup or
      a random crash at some point in the future as we end up touching freed
      memory.  One scenario that I was able to reproduce is as follows:
      
                  <osd3 is idle, on the osd lru list>
      <con reset - osd3>
      con_fault_finish()
        osd_reset()
                                    <osdmap - osd3 down>
                                    ceph_osdc_handle_map()
                                      <takes map_sem>
                                      kick_requests()
                                        <takes request_mutex>
                                        reset_changed_osds()
                                          __reset_osd()
                                            __remove_osd()
                                        <releases request_mutex>
                                      <releases map_sem>
          <takes map_sem>
          <takes request_mutex>
          __kick_osd_requests()
            __reset_osd()
              __remove_osd() <-- !!!
      
      A case can be made that osd refcounting is imperfect and reworking it
      would be a proper resolution, but for now Sage and I decided to fix
      this by adding a safe guard around __remove_osd().
      
      Fixes: http://tracker.ceph.com/issues/8087
      
      Cc: Sage Weil <sage@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      Reviewed-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c9f7cbf
    • Hans de Goede's avatar
      samsung-laptop: Add use_native_backlight quirk, and enable it on some models · c0accdab
      Hans de Goede authored
      commit 4690555e upstream.
      
      Since kernel 3.14 the backlight control has been broken on various Samsung
      Atom based netbooks. This has been bisected and this problem happens since
      commit b35684b8 ("drm/i915: do full backlight setup at enable time")
      
      This has been reported and discussed in detail here:
      http://lists.freedesktop.org/archives/intel-gfx/2014-July/049395.html
      
      Unfortunately no-one has been able to fix this. This only affects Samsung
      Atom netbooks, and the Linux kernel and the BIOS of those laptops have never
      worked well together. All affected laptops already have a quirk to avoid using
      the standard acpi-video interface and instead use the samsung specific SABI
      interface which samsung-laptop uses. It seems that recent fixes to the i915
      driver have also broken backlight control through the SABI interface.
      
      The intel_backlight driver OTOH works fine, and also allows for finer grained
      backlight control. So add a new use_native_backlight quirk, and replace the
      broken_acpi_video quirk with this quirk for affected models. This new quirk
      disables acpi-video as before and also stops samsung-laptop from registering
      the SABI based samsung_laptop backlight interface, leaving only the working
      intel_backlight interface.
      
      This commit enables this new quirk for 3 models which are known to be affected,
      chances are that it needs to be used on other models too.
      
      BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1094948 # N145P
      BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1115713 # N250P
      Reported-by: Bertrik Sikken <bertrik@sikken.nl> # N150P
      Cc: stable@vger.kernel.org # 3.16
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c0accdab
    • Chen Jie's avatar
      jffs2: fix handling of corrupted summary length · c84a6855
      Chen Jie authored
      commit 164c2406 upstream.
      
      sm->offset maybe wrong but magic maybe right, the offset do not have CRC.
      
      Badness at c00c7580 [verbose debug info unavailable]
      NIP: c00c7580 LR: c00c718c CTR: 00000014
      REGS: df07bb40 TRAP: 0700   Not tainted  (2.6.34.13-WR4.3.0.0_standard)
      MSR: 00029000 <EE,ME,CE>  CR: 22084f84  XER: 00000000
      TASK = df84d6e0[908] 'mount' THREAD: df07a000
      GPR00: 00000001 df07bbf0 df84d6e0 00000000 00000001 00000000 df07bb58 00000041
      GPR08: 00000041 c0638860 00000000 00000010 22084f88 100636c8 df814ff8 00000000
      GPR16: df84d6e0 dfa558cc c05adb90 00000048 c0452d30 00000000 000240d0 000040d0
      GPR24: 00000014 c05ae734 c05be2e0 00000000 00000001 00000000 00000000 c05ae730
      NIP [c00c7580] __alloc_pages_nodemask+0x4d0/0x638
      LR [c00c718c] __alloc_pages_nodemask+0xdc/0x638
      Call Trace:
      [df07bbf0] [c00c718c] __alloc_pages_nodemask+0xdc/0x638 (unreliable)
      [df07bc90] [c00c7708] __get_free_pages+0x20/0x48
      [df07bca0] [c00f4a40] __kmalloc+0x15c/0x1ec
      [df07bcd0] [c01fc880] jffs2_scan_medium+0xa58/0x14d0
      [df07bd70] [c01ff38c] jffs2_do_mount_fs+0x1f4/0x6b4
      [df07bdb0] [c020144c] jffs2_do_fill_super+0xa8/0x260
      [df07bdd0] [c020230c] jffs2_fill_super+0x104/0x184
      [df07be00] [c0335814] get_sb_mtd_aux+0x9c/0xec
      [df07be20] [c033596c] get_sb_mtd+0x84/0x1e8
      [df07be60] [c0201ed0] jffs2_get_sb+0x1c/0x2c
      [df07be70] [c0103898] vfs_kern_mount+0x78/0x1e8
      [df07bea0] [c0103a58] do_kern_mount+0x40/0x100
      [df07bec0] [c011fe90] do_mount+0x240/0x890
      [df07bf10] [c0120570] sys_mount+0x90/0xd8
      [df07bf40] [c00110d8] ret_from_syscall+0x0/0x4
      
      === Exception: c01 at 0xff61a34
          LR = 0x100135f0
      Instruction dump:
      38800005 38600000 48010f41 4bfffe1c 4bfc2d15 4bfffe8c 72e90200 4082fc28
      3d20c064 39298860 8809000d 68000001 <0f000000> 2f800000 419efc0c 38000001
      mount: mounting /dev/mtdblock3 on /common failed: Input/output error
      Signed-off-by: default avatarChen Jie <chenjie6@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c84a6855
    • Daniel J Blueman's avatar
      EDAC, amd64_edac: Prevent OOPS with >16 memory controllers · 6d209d9a
      Daniel J Blueman authored
      commit 0c510cc8 upstream.
      
      When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
      the kernel fatally dereferences unallocated structures, see splat below;
      this occurs on at least NumaConnect systems.
      
      Fix by checking if a memory controller info structure was found.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
      IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
      Oops: 0000 [#2] SMP
      Modules linked in:
      CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
      Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
      task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
      RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
      RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
      RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
      RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
      R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
      R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
      FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
      Stack:
       0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
       000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
       ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
      Call Trace:
       <IRQ>
       ? vprintk_default
       ? printk
       amd_decode_mce
       notifier_call_chain
       atomic_notifier_call_chain
       mce_log
       machine_check_poll
       mce_timer_fn
       ? mce_cpu_restart
       call_timer_fn.isra.29
       run_timer_softirq
       __do_softirq
       irq_exit
       smp_apic_timer_interrupt
       apic_timer_interrupt
       <EOI>
       ? down_read_trylock
       __do_page_fault
       ? __schedule
       do_page_fault
       page_fault
      Signed-off-by: default avatarDaniel J Blueman <daniel@numascale.com>
      Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
      [ Boris: massage commit message ]
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d209d9a