1. 07 Feb, 2018 7 commits
    • bcache: return attach error when no cache set exist · 7f4fc93d
      Tang Junhui authored
      When I attach a back-end device to a cache set that is not registered
      yet, the back-end device is not attached, but no error is returned:
      [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach
      [root]#
      
      In sysfs_attach(), the return value "v" is initialized to "size" at
      the beginning. If no cache set exists in bch_cache_sets, "v" is never
      changed and is returned to sysfs, which regards it as success because
      "size" is a positive number.

      This patch fixes the issue by initializing "v" to "-ENOENT".
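
      The return-value convention described above can be sketched in user
      space roughly as follows. This is only an illustration, not the bcache
      code: cache_set_stub and attach_store() are hypothetical names, and the
      lookup is reduced to a string comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for a registered cache set; not the bcache struct. */
struct cache_set_stub {
	const char *uuid;
};

/*
 * Mimics the shape of a sysfs store handler: returns "size" on success
 * and a negative errno on failure.
 */
static ssize_t attach_store(const struct cache_set_stub *sets, size_t nsets,
			    const char *uuid, size_t size)
{
	ssize_t v = -ENOENT;	/* starting from "size" would read as success */
	size_t i;

	assert(uuid != NULL);

	for (i = 0; i < nsets; i++) {
		if (strcmp(sets[i].uuid, uuid) == 0) {
			v = (ssize_t)size;	/* attached: report bytes consumed */
			break;
		}
	}

	return v;	/* still -ENOENT when no cache set matched */
}
```

      With an empty list of cache sets the function returns -ENOENT, so the
      writer sees the failure instead of a silent success.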
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set writeback_rate_update_seconds in range [1, 60] seconds · 7a5e3ecb
      Coly Li authored
      dc->writeback_rate_update_seconds can be set via sysfs, and its value can
      be anywhere in [1, ULONG_MAX].  Such a large value does not make sense;
      60 seconds is long enough, considering that the default of 5 seconds has
      worked well for a long time.
      
      Because dc->writeback_rate_update is a special delayed work that re-arms
      itself inside the delayed work routine update_writeback_rate(), stopping
      it with cancel_delayed_work_sync() needs a timeout to wait and make sure
      the re-armed delayed work is stopped too. A small maximum value of
      dc->writeback_rate_update_seconds also helps in choosing a reasonably
      small timeout.
      
      This patch limits the sysfs interface to set
      dc->writeback_rate_update_seconds in the range [1, 60] seconds, and
      replaces the hand-coded numbers with macros.
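
      A minimal sketch of the range limit, assuming out-of-range input is
      clamped. The macro and function names are hypothetical; the message
      only states that macros replace the literals and that the range is
      [1, 60] seconds with a default of 5.

```c
#include <assert.h>

/* Hypothetical macro names for the values named in the message. */
#define RATE_UPDATE_SECS_MIN     1UL
#define RATE_UPDATE_SECS_DEFAULT 5UL
#define RATE_UPDATE_SECS_MAX     60UL

/* Clamp a value written by the user into [1, 60] seconds. */
static unsigned long clamp_rate_update_secs(unsigned long requested)
{
	/* The default must itself lie inside the allowed range. */
	assert(RATE_UPDATE_SECS_MIN <= RATE_UPDATE_SECS_DEFAULT &&
	       RATE_UPDATE_SECS_DEFAULT <= RATE_UPDATE_SECS_MAX);

	if (requested < RATE_UPDATE_SECS_MIN)
		return RATE_UPDATE_SECS_MIN;
	if (requested > RATE_UPDATE_SECS_MAX)
		return RATE_UPDATE_SECS_MAX;
	return requested;
}
```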
      
      Changelog:
      v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
      v1: initial version.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix for allocator and register thread race · 682811b3
      Tang Junhui authored
      After running random small write I/O for a long time, I rebooted the
      machine. After it powered on, I found that bcache was stuck. The stacks
      are:
      [root@ceph153 ~]# cat /proc/2510/task/*/stack
      [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache]
      [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache]
      [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache]
      [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache]
      [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache]
      [<ffffffff810a631f>] kthread+0xcf/0xe0
      [<ffffffff8164c318>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      [root@ceph153 ~]# cat /proc/2038/task/*/stack
      [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
      [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache]
      [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache]
      [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache]
      [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache]
      [<ffffffff812f702f>] kobj_attr_store+0xf/0x20
      [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140
      [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0
      [<ffffffff811e069f>] SyS_write+0x7f/0xe0
      [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1
      The stacks show that the register thread and the allocator thread
      got stuck while registering the cache device.

      I rebooted the machine several times, and the issue always
      exists on this machine.

      I debugged the code and found the call trace below:
      register_bcache()
         ==>run_cache_set()
            ==>bch_journal_replay()
               ==>bch_btree_insert()
                  ==>__bch_btree_map_nodes()
                     ==>btree_insert_fn()
                        ==>btree_split() //node need split
                           ==>btree_check_reserve()
      btree_check_reserve() checks whether there are enough buckets of
      RESERVE_BTREE type. Since the allocator thread has not run yet, no
      buckets of RESERVE_BTREE type have been allocated, so the register
      thread waits on c->btree_cache_wait and goes to sleep.

      Then the allocator thread is initialized; its call trace is below:
      bch_allocator_thread()
      ==>bch_prio_write()
         ==>bch_journal_meta()
            ==>bch_journal()
               ==>journal_wait_for_write()
      journal_wait_for_write() checks whether the journal is full by calling
      journal_full(), and the long-running random small write I/O has
      exhausted the journal buckets (journal.blocks_free=0). To release
      journal buckets, the allocator calls btree_flush_write() to flush keys
      to btree nodes, and waits on c->journal.wait until the btree node
      writes finish or some journal bucket space becomes available; then the
      allocator thread goes to sleep. But in btree_flush_write(), since
      bch_journal_replay() has not finished, no btree node has a journal
      (the condition "if (btree_current_write(b)->journal)" is never
      satisfied), so there is no btree node to flush, no journal bucket is
      released, and the allocator sleeps forever.
      
      From the above analysis, we can see that:
      1) The register thread waits for the allocator thread to allocate
         buckets of RESERVE_BTREE type;
      2) The allocator thread waits for the register thread to replay the
         journal, so that it can flush btree nodes and get a journal bucket.
      So both threads are stuck waiting for each other.
      
      Hua Rui provided a patch for me that allocates some buckets of
      RESERVE_BTREE type in advance, so the register thread can get a bucket
      when a btree node splits and does not need to wait for the allocator
      thread. I tested it, and it helps: the register thread runs a step
      further, but in the end both threads still get stuck. The reason is that
      only 8 buckets of RESERVE_BTREE type were allocated, and in
      bch_journal_replay(), after 2 btree node splits, only 4 buckets of
      RESERVE_BTREE type are left. Then btree_check_reserve() is no longer
      satisfied, so the register thread goes to sleep again; at the same
      time, the allocator thread has not flushed enough btree nodes to release
      a journal bucket, so they both get stuck again.
      
      So we need to allocate more buckets of RESERVE_BTREE type in advance,
      but how many is enough?  From experience and testing, I think it should
      be as many as the journal buckets. I modified the code as in this patch,
      tested it on the machine, and it works.

      This patch is based on Hua Rui's patch, and allocates more buckets of
      RESERVE_BTREE type in advance so that the register thread and the
      allocator thread do not wait for each other.

      [patch v2] ca->sb.njournal_buckets is 0 the first time after cache
      creation, when no journal exists, so just 8 btree buckets is OK.
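
      The sizing rule from the message (as many reserve buckets as journal
      buckets, or 8 when no journal exists yet) can be sketched as below.
      The function and macro names are hypothetical, and the actual patch
      may express the rule differently.

```c
#include <assert.h>

/* The fallback of 8 buckets mentioned in the [patch v2] note. */
#define MIN_BTREE_RESERVE 8U

/* How many RESERVE_BTREE buckets to set aside in advance. */
static unsigned int btree_reserve_size(unsigned int njournal_buckets)
{
	unsigned int n = njournal_buckets ? njournal_buckets
					  : MIN_BTREE_RESERVE;

	/* An empty reserve would put the register thread back to sleep. */
	assert(n > 0);
	return n;
}
```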
      Signed-off-by: Hua Rui <huarui.dev@gmail.com>
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set error_limit correctly · 7ba0d830
      Coly Li authored
      Struct cache uses io_errors for two purposes:
      - Error decay: when cache set error_decay is set, io_errors is used to
        generate a small delay when an I/O error happens.
      - I/O errors counter: in order to generate a big enough value for error
        decay, the I/O errors counter value is stored left-shifted by 20 bits
        (a.k.a. IO_ERROR_SHIFT).

      In function bch_count_io_errors(), if the I/O errors counter reaches the
      cache set error limit, bch_cache_set_error() is called to retire the
      whole cache set. But the current code is problematic when checking the
      error limit; see the following code from bch_count_io_errors(),
      
       90     if (error) {
       91             char buf[BDEVNAME_SIZE];
       92             unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
       93                                                 &ca->io_errors);
       94             errors >>= IO_ERROR_SHIFT;
       95
       96             if (errors < ca->set->error_limit)
       97                     pr_err("%s: IO error on %s, recovering",
       98                            bdevname(ca->bdev, buf), m);
       99             else
      100                     bch_cache_set_error(ca->set,
      101                                         "%s: too many IO errors %s",
      102                                         bdevname(ca->bdev, buf), m);
      103     }
      
      At line 94, errors is right-shifted by IO_ERROR_SHIFT bits, so it is
      the real error count when compared at line 96. But ca->set->error_limit
      is initialized with an amplified value in bch_cache_set_alloc(),
      1545         c->error_limit  = 8 << IO_ERROR_SHIFT;
      
      This means that by default, in bch_count_io_errors(), bch_cache_set_error()
      won't be called to retire the problematic cache device before 8<<20
      errors have happened. If the average request size is 64KB, bcache won't
      handle the failed device until 512GB of data has been requested. This is
      too large to be an I/O threshold, so I believe the correct error limit
      should be much smaller.
      
      This patch sets the default cache set error limit to 8, so that in
      bch_count_io_errors(), when the errors counter reaches 8 (if it is the
      default value), bch_cache_set_error() is called to retire the whole
      cache set. This patch also removes the bit shifting when storing or
      showing the io_error_limit value via the sysfs interface.
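
      The arithmetic behind the 512GB figure can be checked with a small
      user-space sketch. The helper names are hypothetical; only
      IO_ERROR_SHIFT, the limit of 8 and the 64KB request size come from the
      message, and 64KB and 512GB are taken as binary units.

```c
#include <assert.h>
#include <stdint.h>

#define IO_ERROR_SHIFT 20

/* Default limit before the patch: amplified by the shift. */
static uint64_t old_default_limit(void)
{
	return (uint64_t)8 << IO_ERROR_SHIFT;
}

/* Default limit after the patch: a plain error count. */
static uint64_t new_default_limit(void)
{
	return 8;
}

/*
 * Amount of failed I/O, in bytes, needed to reach the limit when every
 * failed request is request_bytes long.
 */
static uint64_t bytes_until_retire(uint64_t limit, uint64_t request_bytes)
{
	assert(request_bytes > 0);
	return limit * request_bytes;
}
```

      With 64KB requests the old default works out to 8 * 2^20 * 64KB = 512GB,
      while the new default of 8 is reached after 512KB of failed requests.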
      
      Nowadays most SSDs handle internal flash failures automatically by
      remapping LBA addresses. If an I/O error can be observed by upper layer
      code, it is a notable error, because the SSD cannot remap the
      problematic LBA address to an available flash block. This situation
      indicates that the whole SSD will fail very soon. Therefore setting 8 as
      the default I/O error limit makes sense; it is enough for most cache
      devices.
      
      Changelog:
      v2: add reviewed-by from Hannes.
      v1: initial version for review.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: properly set task state in bch_writeback_thread() · 99361bbf
      Coly Li authored
      Kernel thread routine bch_writeback_thread() has the following code block,
      
      447         down_write(&dc->writeback_lock);
      448~450     if (check conditions) {
      451                 up_write(&dc->writeback_lock);
      452                 set_current_state(TASK_INTERRUPTIBLE);
      453
      454                 if (kthread_should_stop())
      455                         return 0;
      456
      457                 schedule();
      458                 continue;
      459         }
      
      If the condition check is true, the task state is set to
      TASK_INTERRUPTIBLE and schedule() is called to wait for others to wake
      it up.
      
      There are 2 issues in the current code:
      1, The task state is set to TASK_INTERRUPTIBLE after the condition
         checks. If another process changes the condition and calls
         wake_up_process(dc->writeback_thread), then at line 452 the task state
         is set back to TASK_INTERRUPTIBLE, and the writeback kernel thread
         loses a chance to be woken up.
      2, At line 454, if kthread_should_stop() is true, the writeback kernel
         thread returns to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE
         and calls do_exit(). It is not good to enter do_exit() with task state
         TASK_INTERRUPTIBLE: in the following code path might_sleep() is called
         and a warning message is reported by __might_sleep(): "WARNING: do not
         call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]".
      
      For the first issue, the task state should be set before the condition
      checks. Indeed, because dc->writeback_lock is required when modifying
      any of the conditions, calling set_current_state() inside the code block
      where dc->writeback_lock is held is safe. But this is quite implicit, so
      I still move set_current_state() before all the condition checks.
      
      For the second issue, frankly speaking it does not hurt when a kernel
      thread exits in TASK_INTERRUPTIBLE state, but the warning message scares
      users and makes them feel there might be something risky in bcache that
      could hurt their data.  Setting the task state to TASK_RUNNING before
      returning fixes this problem.
      
      alloc.c:allocator_wait() has a similar issue, which is also fixed in
      this patch.
      
      Changelog:
      v3: merge two similar fixes into one patch
      v2: fix the race issue in v1 patch.
      v1: initial buggy fix.
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Michael Lyle <mlyle@lyle.org>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix high CPU occupancy during journal · c4dc2497
      Tang Junhui authored
      After running small write I/O for a long time, we found that CPU usage
      was very high and I/O performance had dropped by about half:
      
      [root@ceph151 internal]# top
      top - 15:51:05 up 1 day,2:43,  4 users,  load average: 16.89, 15.15, 16.53
      Tasks: 2063 total,   4 running, 2059 sleeping,   0 stopped,   0 zombie
      %Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa,  0.0 hi,  0.5 si,  0.0 st
      KiB Mem : 65450044 total, 24586420 free, 38909008 used,  1954616 buff/cache
      KiB Swap: 65667068 total, 65667068 free,        0 used. 25136812 avail Mem
      
        PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
       2023 root 20  0       0      0      0 S 55.1  0.0   0:04.42 kworker/11:191
      14126 root 20  0       0      0      0 S 42.9  0.0   0:08.72 kworker/10:3
       9292 root 20  0       0      0      0 S 30.4  0.0   1:10.99 kworker/6:1
       8553 ceph 20  0 4242492 1.805g  18804 S 30.0  2.9 410:07.04 ceph-osd
      12287 root 20  0       0      0      0 S 26.7  0.0   0:28.13 kworker/7:85
      31019 root 20  0       0      0      0 S 26.1  0.0   1:30.79 kworker/22:1
       1787 root 20  0       0      0      0 R 25.7  0.0   5:18.45 kworker/8:7
      32169 root 20  0       0      0      0 S 14.5  0.0   1:01.92 kworker/23:1
      21476 root 20  0       0      0      0 S 13.9  0.0   0:05.09 kworker/1:54
       2204 root 20  0       0      0      0 S 12.5  0.0   1:25.17 kworker/9:10
      16994 root 20  0       0      0      0 S 12.2  0.0   0:06.27 kworker/5:106
      15714 root 20  0       0      0      0 R 10.9  0.0   0:01.85 kworker/19:2
       9661 ceph 20  0 4246876 1.731g  18800 S 10.6  2.8 403:00.80 ceph-osd
      11460 ceph 20  0 4164692 2.206g  18876 S 10.6  3.5 360:27.19 ceph-osd
       9960 root 20  0       0      0      0 S 10.2  0.0   0:02.75 kworker/2:139
      11699 ceph 20  0 4169244 1.920g  18920 S 10.2  3.1 355:23.67 ceph-osd
       6843 ceph 20  0 4197632 1.810g  18900 S  9.6  2.9 380:08.30 ceph-osd
      
      The kernel workers consumed a lot of CPU, and I found they were running
      journal work. The journal was reclaiming resources and flushing btree
      nodes with surprising frequency.
      
      Through further analysis, we found that in btree_flush_write(), we try
      to get the btree node with the smallest fifo index to flush by
      traversing all the btree nodes in c->bucket_hash. After we get it, since
      no lock protects it, this btree node may already have been written to
      the cache device by other workers; if that has occurred, we traverse
      c->bucket_hash again to get another btree node. When the problem
      occurred, the number of retries was very high, and we consumed a lot of
      CPU looking for an appropriate btree node.
      
      In this patch, we record the 128 btree nodes with the smallest fifo
      index in a heap, and pop them one by one when we need to flush a btree
      node. This greatly reduces the time the loop takes to find an
      appropriate btree node, and also reduces CPU usage.
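
      One way to keep only the N smallest indexes seen during a single
      traversal is a bounded max-heap, sketched below in user space. This is
      an illustration of the idea only: the small_set type and its helpers
      are hypothetical, plain unsigned integers stand in for btree nodes, and
      the patch itself may order and pop its heap differently.

```c
#include <assert.h>
#include <stddef.h>

struct small_set {
	unsigned int *v;	/* caller-provided storage of "cap" entries */
	size_t cap;		/* how many smallest values to keep, e.g. 128 */
	size_t used;
};

static void swap_u(unsigned int *a, unsigned int *b)
{
	unsigned int t = *a;

	*a = *b;
	*b = t;
}

/* Restore the max-heap property below index i within the first n entries. */
static void sift_down(unsigned int *v, size_t n, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < n && v[l] > v[m])
			m = l;
		if (r < n && v[r] > v[m])
			m = r;
		if (m == i)
			return;
		swap_u(&v[i], &v[m]);
		i = m;
	}
}

/* Offer one index; the set keeps only the "cap" smallest values seen. */
static void small_set_offer(struct small_set *s, unsigned int idx)
{
	assert(s->cap > 0);

	if (s->used < s->cap) {
		size_t i = s->used++;

		s->v[i] = idx;
		while (i > 0 && s->v[(i - 1) / 2] < s->v[i]) {
			swap_u(&s->v[(i - 1) / 2], &s->v[i]);
			i = (i - 1) / 2;
		}
	} else if (idx < s->v[0]) {
		/* Smaller than the largest kept value: replace the root. */
		s->v[0] = idx;
		sift_down(s->v, s->used, 0);
	}
}

/* Heap-sort in place so v[0..used) is ascending and can be consumed in order. */
static void small_set_sort(struct small_set *s)
{
	size_t n;

	for (n = s->used; n > 1; n--) {
		swap_u(&s->v[0], &s->v[n - 1]);
		sift_down(s->v, n - 1, 0);
	}
}
```

      After one pass that offers every candidate, the caller sorts once and
      then walks the array from the smallest index upward, instead of
      rescanning the whole hash table after each stale candidate.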
      
      [note by mpl: this triggers a checkpatch error because of adjacent,
      pre-existing style violations]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add journal statistic · a728eacb
      Tang Junhui authored
      Sometimes the journal takes up a lot of CPU, and we need statistics
      to know what the journal is doing. So this patch provides
      some journal statistics:
      1) reclaim: how many times the journal tried to reclaim resources,
         usually because the journal buckets and/or the pins are exhausted.
      2) flush_write: how many times the journal tried to flush a btree node
         to the cache device, usually because the journal buckets are exhausted.
      3) retry_flush_write: how many times the journal retried flushing
         the next btree node, usually because the previous btree node had been
         flushed by another thread.
      We show these statistics via the sysfs interface. Through these
      statistics we can see the status of the journal module when CPU usage
      is too high.
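
      The three counters could be modelled in user space as below, using C11
      atomics in place of the kernel's atomic types. The struct, the helper
      names and the "value plus newline" output format are assumptions made
      for illustration; only the three statistic names come from the message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* One counter per statistic named in the message. */
struct journal_stats {
	atomic_ulong reclaim;
	atomic_ulong flush_write;
	atomic_ulong retry_flush_write;
};

/* Bump a counter; safe to call from several threads at once. */
static void journal_stat_inc(atomic_ulong *counter)
{
	atomic_fetch_add(counter, 1UL);
}

/*
 * Format one counter as a decimal value followed by a newline, the way a
 * "show" callback for a text attribute might. Returns the length written.
 */
static int journal_stat_show(atomic_ulong *counter, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lu\n", atomic_load(counter));
}
```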
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 06 Feb, 2018 6 commits
    • block: Add should_fail_bio() for bpf error injection · 30abb3a6
      Howard McLauchlan authored
      The classic error injection mechanism, should_fail_request() does not
      support use cases where more information is required (from the entire
      struct bio, for example).
      
      To that end, this patch introduces should_fail_bio(), which calls
      should_fail_request() under the hood but provides a convenient
      place for kprobes to hook into if they require the entire struct bio.
      This patch also replaces some existing calls to should_fail_request()
      with should_fail_bio() with no degradation in performance.
      Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-wbt: account flush requests correctly · 5235553d
      Jens Axboe authored
      Mikulas reported a workload that saw bad performance, and figured
      out that it was due to various other types of requests being
      accounted as reads. Flush requests, for instance. Due to the
      high latency of those, we heavily throttle the writes to keep
      the latencies in balance. But they really should be accounted
      as writes.
      
      Fix this by checking the exact type of the request. If it's a
      read, account as a read, if it's a write or a flush, account
      as a write. Any other request we disregard. Previously everything
      would have been mistakenly accounted as reads.
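
      The classification rule can be sketched as follows. The enum values and
      function names are hypothetical and do not correspond to the block
      layer's request flags; the sketch only mirrors the rule stated above:
      reads count as reads, writes and flushes count as writes, and
      everything else is disregarded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request kinds, standing in for the real request types. */
enum req_kind { REQ_KIND_READ, REQ_KIND_WRITE, REQ_KIND_FLUSH, REQ_KIND_DISCARD };

/* How a request is accounted for throttling purposes. */
enum wbt_bucket { WBT_NONE = 0, WBT_AS_READ = 1, WBT_AS_WRITE = 2, WBT_NR = 3 };

static enum wbt_bucket wbt_classify(enum req_kind kind)
{
	switch (kind) {
	case REQ_KIND_READ:
		return WBT_AS_READ;
	case REQ_KIND_WRITE:
	case REQ_KIND_FLUSH:
		return WBT_AS_WRITE;	/* flushes are no longer seen as reads */
	default:
		return WBT_NONE;	/* any other request is disregarded */
	}
}

/* Tally a batch of requests into counts[WBT_NR]. */
static void wbt_account(const enum req_kind *reqs, size_t n,
			unsigned long counts[WBT_NR])
{
	size_t i;

	assert(reqs != NULL || n == 0);
	for (i = 0; i < n; i++)
		counts[wbt_classify(reqs[i])]++;
}
```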
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 68c5735e
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - videobuf2 was moved to a media/common dir, as it is now used by the
         DVB subsystem too
      
       - Digital TV core memory mapped support interface
      
       - new sensor driver: ov7740
      
       - several improvements at ddbridge driver
      
       - new V4L2 driver: IPU3 CIO2 CSI-2 receiver unit, found on some Intel
         SoCs
      
       - new tuner driver: tda18250
      
       - finally got rid of all LIRC staging drivers
      
       - as we don't have old lirc drivers anymore, restructure the lirc device
         code
      
       - add support for UVC metadata
      
       - add a new staging driver for NVIDIA Tegra Video Decoder Engine
      
       - DVB kAPI headers moved to include/media
      
       - synchronize the kAPI and uAPI for the DVB subsystem, removing the gap
         for non-legacy APIs
      
       - reduce the kAPI gap for V4L2
      
       - lots of other driver enhancements, cleanups, etc.
      
      * tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (407 commits)
        media: v4l2-compat-ioctl32.c: make ctrl_is_pointer work for subdevs
        media: v4l2-compat-ioctl32.c: refactor compat ioctl32 logic
        media: v4l2-compat-ioctl32.c: don't copy back the result for certain errors
        media: v4l2-compat-ioctl32.c: drop pr_info for unknown buffer type
        media: v4l2-compat-ioctl32.c: copy clip list in put_v4l2_window32
        media: v4l2-compat-ioctl32.c: fix ctrl_is_pointer
        media: v4l2-compat-ioctl32.c: copy m.userptr in put_v4l2_plane32
        media: v4l2-compat-ioctl32.c: avoid sizeof(type)
        media: v4l2-compat-ioctl32.c: move 'helper' functions to __get/put_v4l2_format32
        media: v4l2-compat-ioctl32.c: fix the indentation
        media: v4l2-compat-ioctl32.c: add missing VIDIOC_PREPARE_BUF
        media: v4l2-ioctl.c: don't copy back the result for -ENOTTY
        media: v4l2-ioctl.c: use check_fmt for enum/g/s/try_fmt
        media: vivid: fix module load error when enabling fb and no_error_inj=1
        media: dvb_demux: improve debug messages
        media: dvb_demux: Better handle discontinuity errors
        media: cxusb, dib0700: ignore XC2028_I2C_FLUSH
        media: ts2020: avoid integer overflows on 32 bit machines
        media: i2c: ov7740: use gpio/consumer.h instead of gpio.h
        media: entity: Add a nop variant of media_entity_cleanup
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 2246edfa
      Linus Torvalds authored
      Pull more rdma updates from Doug Ledford:
       "Items of note:
      
         - two patches fix a regression in the 4.15 kernel. The 4.14 kernel
           worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
           4.15. The fix is here.
      
         - one of the patches (the endian notation patch from Lijun) looks
           like a lot of lines of change, but it's mostly mechanical in
           nature. It amounts to the biggest chunk of change in it (it's about
           2/3rds of the overall pull request).
      
        Summary:
      
         - Clean up some function signatures in rxe for clarity
      
         - Tidy the RDMA netlink header to remove unimplemented constants
      
         - bnxt_re driver fixes, one is a regression this window.
      
         - Minor hns driver fixes
      
         - Various fixes from Dan Carpenter and his tool
      
         - Fix IRQ cleanup race in HFI1
      
         - HFI1 performance optimizations and a fix to report counters in the right units
      
         - Fix for an IPoIB startup sequence race with the external manager
      
         - Oops fix for the new kabi path
      
         - Endian cleanups for hns
      
         - Fix for mlx5 related to the new automatic affinity support"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
        net/mlx5: increase async EQ to avoid EQ overrun
        mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
        RDMA/hns: Fix the endian problem for hns
        IB/uverbs: Use the standard kConfig format for experimental
        IB: Update references to libibverbs
        IB/hfi1: Add 16B rcvhdr trace support
        IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
        IB/core: Avoid a potential OOPs for an unused optional parameter
        IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
        IB/ipoib: Fix for potential no-carrier state
        IB/hfi1: Show fault stats in both TX and RX directions
        IB/hfi1: Remove blind constants from 16B update
        IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
        IB/hfi1: Do not override given pcie_pset value
        IB/hfi1: Optimize process_receive_ib()
        IB/hfi1: Remove unnecessary fecn and becn fields
        IB/hfi1: Look up ibport using a pointer in receive path
        IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
        IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
        IB/hfi1: Remove dependence on qp->s_hdrwords
        ...
    • Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 3ff1b28c
      Linus Torvalds authored
      Pull libnvdimm updates from Ross Zwisler:
      
       - Require struct page by default for filesystem DAX to remove a number
         of surprising failure cases. This includes failures with direct I/O,
         gdb and fork(2).
      
       - Add support for the new Platform Capabilities Structure added to the
         NFIT in ACPI 6.2a. This new table tells us whether the platform
         supports flushing of CPU and memory controller caches on unexpected
         power loss events.
      
       - Revamp vmem_altmap and dev_pagemap handling to clean up code and
         better support future PCI P2P uses.
      
       - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
         become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
         spec, and instead rely on the generic ND_CMD_CALL approach used by
         the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
      
       - Enhance nfit_test so we can test some of the new things added in
         version 1.6 of the DSM specification. This includes testing firmware
         download and simulating the Last Shutdown State (LSS) status.
      
      * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
        libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
        acpi, nfit: fix register dimm error handling
        libnvdimm, namespace: make min namespace size 4K
        tools/testing/nvdimm: force nfit_test to depend on instrumented modules
        libnvdimm/nfit_test: adding support for unit testing enable LSS status
        libnvdimm/nfit_test: add firmware download emulation
        nfit-test: Add platform cap support from ACPI 6.2a to test
        libnvdimm: expose platform persistence attribute for nd_region
        acpi: nfit: add persistent memory control flag for nd_region
        acpi: nfit: Add support for detect platform CPU cache flush on power loss
        device-dax: Fix trailing semicolon
        libnvdimm, btt: fix uninitialized err_lock
        dax: require 'struct page' by default for filesystem dax
        ext2: auto disable dax instead of failing mount
        ext4: auto disable dax instead of failing mount
        mm, dax: introduce pfn_t_special()
        mm: Fix devm_memremap_pages() collision handling
        mm: Fix memory size alignment in devm_memremap_pages_release()
        memremap: merge find_dev_pagemap into get_dev_pagemap
        memremap: change devm_memremap_pages interface to use struct dev_pagemap
        ...
    • Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 105cf3c8
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
      
       - skip AER driver error recovery callbacks for correctable errors
         reported via ACPI APEI, as we already do for errors reported via the
         native path (Tyler Baicar)
      
       - fix DPC shared interrupt handling (Alex Williamson)
      
       - print full DPC interrupt number (Keith Busch)
      
       - enable DPC only if AER is available (Keith Busch)
      
       - simplify DPC code (Bjorn Helgaas)
      
       - calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
         Helgaas)
      
       - enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
         Helgaas)
      
       - move ASPM internal interfaces out of public header (Bjorn Helgaas)
      
       - allow hot-removal of VGA devices (Mika Westerberg)
      
       - speed up unplug and shutdown by assuming Thunderbolt controllers
         don't support Command Completed events (Lukas Wunner)
      
       - add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
         Jay Cornwall)
      
       - expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
      
       - clean up PCI DMA interface usage (Christoph Hellwig)
      
       - remove PCI pool API (replaced with DMA pool) (Romain Perier)
      
       - deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
         Kaya)
      
       - move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
      
       - add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
      
       - remove warnings on sysfs mmap failure (Bjorn Helgaas)
      
       - quiet ROM validation messages (Alex Deucher)
      
       - remove redundant memory alloc failure messages (Markus Elfring)
      
       - fill in types for compile-time VGA and other I/O port resources
         (Bjorn Helgaas)
      
       - make "pci=pcie_scan_all" work for Root Ports as well as Downstream
         Ports to help AmigaOne X1000 (Bjorn Helgaas)
      
       - add SPDX tags to all PCI files (Bjorn Helgaas)
      
       - quirk Marvell 9128 DMA aliases (Alex Williamson)
      
       - quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
      
       - fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
         Cassel)
      
       - use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
      
       - fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
      
       - fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
      
       - add support for ARTPEC-7 SoC (Niklas Cassel)
      
       - add endpoint-mode support for ARTPEC (Niklas Cassel)
      
       - add Cadence PCIe host and endpoint controller driver (Cyrille
         Pitchen)
      
       - handle multiple INTx status bits being set in dra7xx (Vignesh R)
      
       - translate dra7xx hwirq range to fix INTD handling (Vignesh R)
      
       - remove deprecated Exynos PHY initialization code (Jaehoon Chung)
      
       - fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
      
       - fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
      
       - fix Keystone interrupt-controller-node lookup (Johan Hovold)
      
       - constify qcom driver structures (Julia Lawall)
      
       - rework Tegra config space mapping to increase space available for
         endpoints (Vidya Sagar)
      
       - simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
      
       - remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
      
       - add support for Global Fabric Manager Server (GFMS) event to
         Microsemi Switchtec switch driver (Logan Gunthorpe)
      
       - add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
      
      * tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
        PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
        PCI: endpoint: Fix EPF device name to support multi-function devices
        PCI: endpoint: Add the function number as argument to EPC ops
        PCI: cadence: Add host driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
        PCI: Add vendor ID for Cadence
        PCI: Add generic function to probe PCI host controllers
        PCI: generic: fix missing call of pci_free_resource_list()
        PCI: OF: Add generic function to parse and allocate PCI resources
        PCI: Regroup all PCI related entries into drivers/pci/Makefile
        PCI/DPC: Reformat DPC register definitions
        PCI/DPC: Add and use DPC Status register field definitions
        PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
        PCI/DPC: Remove unnecessary RP PIO register structs
        PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
        PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
        PCI/DPC: Make RP PIO log size check more generic
        PCI/DPC: Rename local "status" to "dpc_status"
        PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
        ...
      105cf3c8
  3. 05 Feb, 2018 15 commits
  4. 04 Feb, 2018 12 commits
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 35277995
      Linus Torvalds authored
      Pull spectre/meltdown updates from Thomas Gleixner:
       "The next round of updates related to melted spectrum:
      
         - The initial set of spectre V1 mitigations:
      
             - Array index speculation blocker and its usage for syscall,
               fdtable and the nl80211 driver.
      
             - Speculation barrier and its usage in user access functions
      
         - Make indirect calls in KVM speculation safe
      
         - Blacklisting of known to be broken microcodes so IBPB/IBRS are not
           touched.
      
         - The initial IBPB support and its usage in context switch
      
         - The exposure of the new speculation MSRs to KVM guests.
      
         - A fix for a regression in x86/32 related to the cpu entry area
      
         - Proper whitelisting for known to be safe CPUs from the mitigations.
      
         - objtool fixes to deal properly with retpolines and alternatives
      
         - Exclude __init functions from retpolines which speeds up the boot
           process.
      
         - Removal of the syscall64 fast path and related cleanups and
           simplifications
      
         - Removal of the unpatched paravirt mode which is yet another source
           of indirect unprotected calls.
      
         - A new and undisputed version of the module mismatch warning
      
         - A couple of cleanup and correctness fixes all over the place
      
        Yet another step towards full mitigation. There are a few things still
        missing like the RSB underflow mitigation for Skylake and other small
        details, but that's being worked on.
      
        That said, I'm taking a belated christmas vacation for a week and hope
        that everything is magically solved when I'm back on Feb 12th"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
        KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
        KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
        KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
        KVM/x86: Add IBPB support
        KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
        x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
        x86/pti: Mark constant arrays as __initconst
        x86/spectre: Simplify spectre_v2 command line parsing
        x86/retpoline: Avoid retpolines for built-in __init functions
        x86/kvm: Update spectre-v1 mitigation
        KVM: VMX: make MSR bitmaps per-VCPU
        x86/paravirt: Remove 'noreplace-paravirt' cmdline option
        x86/speculation: Use Indirect Branch Prediction Barrier in context switch
        x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
        x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
        x86/spectre: Report get_user mitigation for spectre_v1
        nl80211: Sanitize array index in parse_txq_params
        vfs, fdtable: Prevent bounds-check bypass via speculative execution
        x86/syscall: Sanitize syscall table de-references under speculation
        x86/get_user: Use pointer masking to limit speculation
        ...
      35277995
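The array index speculation blocker mentioned in the pull message above can be illustrated with a small user-space sketch. It builds a branch-free mask that is all ones for an in-range index and zero otherwise, so that an index used under a mispredicted bounds check collapses to 0. The names `index_mask_nospec`, `clamp_index` and `table_lookup` are hypothetical and chosen for this example; the kernel's own helpers also rely on architecture-specific code and compiler barriers that plain C cannot express, so this is only an approximation of the idea.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (implementation-defined in C, true for GCC and Clang).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* Force an out-of-range index to 0 without a conditional branch. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Hypothetical caller: bounds check first, then clamp before the access. */
static int table_lookup(const int *table, unsigned long size,
                        unsigned long index)
{
    if (index >= size)
        return -1;
    index = clamp_index(index, size);
    return table[index];
}
```

The architectural bounds check still decides the result; the clamp only ensures that a load issued speculatively past that check cannot use an attacker-chosen index.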
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 0a646e9c
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "A small set of changes:
      
         - a fixup for kexec related to 5-level paging mode. That covers most
           of the cases except kexec from a 5-level kernel to a 4-level
           kernel. The latter needs more work and is going to come in 4.17
      
         - two trivial fixes for build warnings triggered by LTO and gcc-8"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/power: Fix swsusp_arch_resume prototype
        x86/dumpstack: Avoid uninitlized variable
        x86/kexec: Make kexec (mostly) work in 5-level paging mode
      0a646e9c
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f74a127f
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Two small changes:
      
         - a fix for an interrupt regression caused by the vector management
           changes in 4.15 affecting museum pieces which rely on interrupt
           probing for legacy (e.g. parallel port) devices.
      
           One of the startup calls in the autoprobe code was not changed to
           the new activate_and_startup() function resulting in a warning and
           as a consequence failing to discover the device interrupt.
      
         - a trivial update to the copyright/license header of the STM32 irq
           chip driver"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Make legacy autoprobing work again
        irqchip/stm32: Fix copyright
      f74a127f
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20180204' of git://git.kernel.dk/linux-block · 64b28683
      Linus Torvalds authored
      Pull more block updates from Jens Axboe:
       "Most of this is fixes and not new code/features:
      
         - skd fix from Arnd, fixing a build error dependent on slab allocator
           type.
      
         - blk-mq scheduler discard merging fixes, one from me and one from
           Keith. This fixes a segment miscalculation for blk-mq-sched, where
           we mistakenly think two segments are physically contiguous even
           though the request isn't carrying real data. Also fixes a bio-to-rq
           merge case.
      
         - Don't re-set a bit on the buffer_head flags, if it's already set.
           This can cause scalability concerns on bigger machines and
           workloads. From Kemi Wang.
      
         - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to
           distinguish between a local (device related) resource starvation
           and a global one. The latter might happen without IO being in
           flight, so it has to be handled a bit differently. From Ming"
      
      * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block:
        block: skd: fix incorrect linux/slab_def.h inclusion
        buffer: Avoid setting buffer bits that are already set
        blk-mq-sched: Enable merging discard bio into request
        blk-mq: fix discard merge with scheduler attached
        blk-mq: introduce BLK_STS_DEV_RESOURCE
      64b28683
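The buffer_head item above (do not re-set a flag bit that is already set) follows a common test-before-set pattern: an atomic read-modify-write takes the cache line exclusively even when it changes nothing, whereas a plain load lets many CPUs share it. A minimal user-space sketch using C11 atomics follows; `set_flag_if_clear` is a hypothetical name for this example, not the kernel's helper.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set `bit` in `*flags` only if it is not already set.
 * Returns true when an atomic write was actually issued.
 * The cheap relaxed load lets the common "already set" case skip the
 * read-modify-write, which would otherwise bounce the cache line
 * between CPUs.
 */
static bool set_flag_if_clear(atomic_ulong *flags, unsigned long bit)
{
    if (atomic_load_explicit(flags, memory_order_relaxed) & bit)
        return false;          /* already set: no write needed */
    atomic_fetch_or(flags, bit);
    return true;
}
```

The check and the set are not one atomic step, but that is harmless here: if another thread sets the bit in between, the `atomic_fetch_or` simply re-sets it.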
    • Linus Torvalds's avatar
      Merge tag 'ntb-4.16' of git://github.com/jonmason/ntb · d3658c22
      Linus Torvalds authored
      Pull NTB updates from Jon Mason:
       "Bug fixes galore, removal of the ntb atom driver, and updates to the
        ntb tools and tests to support the multi-port interface"
      
      * tag 'ntb-4.16' of git://github.com/jonmason/ntb: (37 commits)
        NTB: ntb_perf: fix cast to restricted __le32
        ntb_perf: Fix an error code in perf_copy_chunk()
        ntb_hw_switchtec: Make function switchtec_ntb_remove() static
        NTB: ntb_tool: fix memory leak on 'buf' on error exit path
        NTB: ntb_perf: fix printing of resource_size_t
        NTB: ntb_hw_idt: Set NTB_TOPO_SWITCH topology
        NTB: ntb_test: Update ntb_perf tests
        NTB: ntb_test: Update ntb_tool MW tests
        NTB: ntb_test: Add ntb_tool Message tests
        NTB: ntb_test: Update ntb_tool Scratchpad tests
        NTB: ntb_test: Update ntb_tool DB tests
        NTB: ntb_test: Update ntb_tool link tests
        NTB: ntb_test: Add ntb_tool port tests
        NTB: ntb_test: Safely use paths with whitespace
        NTB: ntb_perf: Add full multi-port NTB API support
        NTB: ntb_tool: Add full multi-port NTB API support
        NTB: ntb_pp: Add full multi-port NTB API support
        NTB: Fix UB/bug in ntb_mw_get_align()
        NTB: Set dma mask and dma coherent mask to NTB devices
        NTB: Rename NTB messaging API methods
        ...
      d3658c22
    • Linus Torvalds's avatar
      Merge tag 'mailbox-v4.16' of git://git.linaro.org/landing-teams/working/fujitsu/integration · 8ac4840a
      Linus Torvalds authored
      Pull mailbox updates from Jassi Brar:
       "Misc driver changes only:
      
         - TI-MsgMgr: Fix print format for a printk
      
         - TI-MsgMgr: SPDX license switch for the driver
      
         - QCOM-IPC: Convert driver to use regmap
      
         - QCOM-IPC: Spawn sibling clock device from mailbox driver"
      
      * tag 'mailbox-v4.16' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
        dt-bindings: mailbox: qcom: Document the APCS clock binding
        mailbox: qcom: Create APCS child device for clock controller
        mailbox: qcom: Convert APCS IPC driver to use regmap
        mailbox: ti-msgmgr: Use %zu for size_t print format
        mailbox: ti-msgmgr: Switch to SPDX Licensing
      8ac4840a
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 4141cf67
      Linus Torvalds authored
      Pull i2c updates from Wolfram Sang:
       "I2C has the following changes for you:
      
         - new flag to mark DMA safe buffers in i2c_msg. Also, some
           infrastructure around it. And docs.
      
         - huge refactoring of the at24 driver led by the new maintainer
           Bartosz
      
         - update I2C bus recovery to send STOP after recovery
      
         - conversion from gpio to gpiod for I2C bus recovery
      
         - adding a fault-injector to the i2c-gpio driver
      
         - lots of small driver improvements, and bigger ones to
           i2c-sh_mobile"
      
      * 'i2c/for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (99 commits)
        i2c: mv64xxx: Add myself as maintainer for this driver
        i2c: mv64xxx: Fix clock resource by adding an optional bus clock
        i2c: mv64xxx: Remove useless test before clk_disable_unprepare
        i2c: mxs: use true and false for boolean values
        i2c: meson: update doc description to fix build warnings
        i2c: meson: add configurable divider factors
        dt-bindings: i2c: update documentation for the Meson-AXG
        i2c: imx-lpi2c: add runtime pm support
        i2c: rcar: fix some trivial typos in comments
        i2c: davinci: fix the cpufreq transition
        i2c: rk3x: add proper kerneldoc header
        i2c: rk3x: account for const type of of_device_id.data
        i2c: acorn: remove outdated path from file header
        i2c: acorn: add MODULE_LICENSE tag
        i2c: rcar: implement bus recovery
        i2c: send STOP after successful bus recovery
        i2c: ensure SDA is released in recovery if SDA is controllable
        i2c: add 'set_sda' to bus_recovery_info
        i2c: add identifier in declarations for i2c_bus_recovery
        i2c: make kerneldoc about bus recovery more precise
        ...
      4141cf67
    • Linus Torvalds's avatar
      Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt · 3462ac57
      Linus Torvalds authored
      Pull fscrypt updates from Ted Ts'o:
       "Refactor support for encrypted symlinks to move common code to fscrypt"
      
      Ted also points out about the merge:
       "This makes the f2fs symlink code use the fscrypt_encrypt_symlink()
        from the fscrypt tree. This will end up dropping the kzalloc() ->
        f2fs_kzalloc() change, which means the fscrypt-specific allocation
        won't get tested by f2fs's kmalloc error injection system; which is
        fine"
      
      * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: (26 commits)
        fscrypt: fix build with pre-4.6 gcc versions
        fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
        fscrypt: document symlink length restriction
        fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
        fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
        fscrypt: calculate NUL-padding length in one place only
        fscrypt: move fscrypt_symlink_data to fscrypt_private.h
        fscrypt: remove fscrypt_fname_usr_to_disk()
        ubifs: switch to fscrypt_get_symlink()
        ubifs: switch to fscrypt ->symlink() helper functions
        ubifs: free the encrypted symlink target
        f2fs: switch to fscrypt_get_symlink()
        f2fs: switch to fscrypt ->symlink() helper functions
        ext4: switch to fscrypt_get_symlink()
        ext4: switch to fscrypt ->symlink() helper functions
        fscrypt: new helper function - fscrypt_get_symlink()
        fscrypt: new helper functions for ->symlink()
        fscrypt: trim down fscrypt.h includes
        fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
        fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
        ...
      3462ac57
    • Jason Gunthorpe's avatar
      IB/uverbs: Use the standard kConfig format for experimental · e9d1e389
      Jason Gunthorpe authored
      We really don't want people turning this on just yet, make it very
      clear with capital letters.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      e9d1e389
    • Jason Gunthorpe's avatar
      IB: Update references to libibverbs · 46adb179
      Jason Gunthorpe authored
      These days the userspace comes from rdma-core, revise references
      in the kernel to point to the current repository.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      46adb179
    • Georgi Djakov's avatar
      dt-bindings: mailbox: qcom: Document the APCS clock binding · 0ae7d327
      Georgi Djakov authored
      Update the binding documentation for APCS to mention that the APCS
      hardware block also exposes clock controller functionality.
      
      The APCS clock controller is a mux and half-integer divider. It has the
      main CPU PLL as an input and provides the clock for the application CPU.
      Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
      0ae7d327
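The "half-integer divider" in the commit message above can be illustrated with a little arithmetic. The sketch assumes the encoding used elsewhere in Qualcomm clock code, where a register value `div` stands for a divide ratio of `(div + 1) / 2`, so 1 means divide by 1, 2 means divide by 1.5, and 3 means divide by 2. The commit message does not state this encoding, and `hid_rate` and `hid_div_for` are hypothetical names for this example.

```c
/*
 * Output rate of a half-integer divider.
 * Assumption: `div` encodes a ratio of (div + 1) / 2, so
 * rate = parent_rate * 2 / (div + 1).
 */
static unsigned long hid_rate(unsigned long parent_rate, unsigned int div)
{
    return (unsigned long)(((unsigned long long)parent_rate * 2) / (div + 1));
}

/*
 * Smallest `div` value whose output does not exceed `rate`
 * (rounds the ratio up). `rate` must be non-zero.
 */
static unsigned int hid_div_for(unsigned long parent_rate, unsigned long rate)
{
    unsigned long long twice = (unsigned long long)parent_rate * 2;

    return (unsigned int)((twice + rate - 1) / rate) - 1;
}
```

With an 800 MHz parent, `hid_div_for(800000000, 400000000)` gives 3 under this assumed encoding, and `hid_rate(800000000, 3)` gives back 400000000.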
    • Georgi Djakov's avatar
      mailbox: qcom: Create APCS child device for clock controller · c815d769
      Georgi Djakov authored
      There is a clock controller functionality provided by the APCS hardware
      block of msm8916 devices. The device-tree would represent an APCS node
      with both mailbox and clock provider properties.
      Create a platform child device for the clock controller functionality so
      the driver can probe and use APCS as parent.
      Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
      Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
      c815d769