1. 02 May, 2013 14 commits
  2. 01 May, 2013 26 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next · 73287a43
      Linus Torvalds authored
      Pull networking updates from David Miller:
       "Highlights (1721 non-merge commits, this has to be a record of some
        sort):
      
         1) Add 'random' mode to team driver, from Jiri Pirko and Eric
            Dumazet.
      
         2) Make it so that any driver that supports configuration of multiple
            MAC addresses can provide the forwarding database add and del
            calls by providing a default implementation and hooking that up if
            the driver doesn't have an explicit set of handlers.  From Vlad
            Yasevich.
      
         3) Support GSO segmentation over tunnels and other encapsulating
            devices such as VXLAN, from Pravin B Shelar.
      
         4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
      
         5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
            Dukkipati.
      
         6) In the PHY layer, allow supporting wake-on-lan in situations where
            the PHY registers have to be written for it to be configured.
      
            Use it to support wake-on-lan in mv643xx_eth.
      
            From Michael Stapelberg.
      
         7) Significantly improve firewire IPV6 support, from YOSHIFUJI
            Hideaki.
      
         8) Allow multiple packets to be sent in a single transmission using
            network coding in batman-adv, from Martin Hundebøll.
      
         9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
      
        10) Generalize the VXLAN forwarding tables so that there is more
            flexibility in configuring various aspects of the endpoints.
            From David Stevens.
      
        11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver,
            from Dmitry Kravkov.
      
        12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
            Neira Ayuso.
      
        13) Start adding networking selftests.
      
        14) In situations of overload on the same AF_PACKET fanout socket, or
            per-cpu packet receive queue, minimize drop by distributing the
            load to other cpus/fanouts.  From Willem de Bruijn and Eric
            Dumazet.
      
        15) Add support for new payload offset BPF instruction, from Daniel
            Borkmann.
      
        16) Convert several drivers over to module_platform_driver(), from
            Sachin Kamat.
      
        17) Provide a minimal BPF JIT image disassembler userspace tool, from
            Daniel Borkmann.
      
        18) Rewrite F-RTO implementation in TCP to match the final
            specification of it in RFC4138 and RFC5682.  From Yuchung Cheng.
      
        19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
            you like netlink, so I implemented netlink dumping of netlink
            sockets.") From Andrey Vagin.
      
        20) Remove ugly passing of rtnetlink attributes into rtnl_doit
            functions, from Thomas Graf.
      
        21) Allow userspace to be able to see if a configuration change occurs
            in the middle of an address or device list dump, from Nicolas
            Dichtel.
      
        22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
            Frederic Sowa.
      
        23) Increase accuracy of packet length used by packet scheduler, from
            Jason Wang.
      
        24) Beginning set of changes to make ipv4/ipv6 fragment handling more
            scalable and less susceptible to overload and locking contention,
            from Jesper Dangaard Brouer.
      
        25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
            instead.  From Hong Zhiguo.
      
        26) Optimize route usage in IPVS by avoiding reference counting where
            possible, from Julian Anastasov.
      
        27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
      
        28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
            Eitzenberger.
      
        29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
            nfnetlink_log, and nfnetlink_queue.  From Gao feng.
      
        30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
      
        31) Support several new r8169 chips, from Hayes Wang.
      
        32) Support tokenized interface identifiers in ipv6, from Daniel
            Borkmann.
      
        33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
      
        34) Add 802.1ad vlan offload support, from Patrick McHardy.
      
        35) Support mmap() based netlink communication, also from Patrick
            McHardy.
      
        36) Support HW timestamping in mlx4 driver, from Amir Vadai.
      
        37) Rationalize AF_PACKET packet timestamping when transmitting, from
            Willem de Bruijn and Daniel Borkmann.
      
        38) Bring parity to what's provided by /proc/net/packet socket dumping
            and the info provided by netlink socket dumping of AF_PACKET
            sockets.  From Nicolas Dichtel.
      
        39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
            Poirier"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
        filter: fix va_list build error
        af_unix: fix a fatal race with bit fields
        bnx2x: Prevent memory leak when cnic is absent
        bnx2x: correct reading of speed capabilities
        net: sctp: attribute printl with __printf for gcc fmt checks
        netlink: kconfig: move mmap i/o into netlink kconfig
        netpoll: convert mutex into a semaphore
        netlink: Fix skb ref counting.
        net_sched: act_ipt forward compat with xtables
        mlx4_en: fix a build error on 32bit arches
        Revert "bnx2x: allow nvram test to run when device is down"
        bridge: avoid OOPS if root port not found
        drivers: net: cpsw: fix kernel warn on cpsw irq enable
        sh_eth: use random MAC address if no valid one supplied
        3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
        tg3: fix to append hardware time stamping flags
        unix/stream: fix peeking with an offset larger than data in queue
        unix/dgram: fix peeking with an offset larger than data in queue
        unix/dgram: peek beyond 0-sized skbs
        openvswitch: Remove unneeded ovs_netdev_get_ifindex()
        ...
      73287a43
    • filter: fix va_list build error · 20074f35
      Xi Wang authored
      This patch fixes the following build error.
      
      In file included from include/linux/filter.h:52:0,
                       from arch/arm/net/bpf_jit_32.c:14:
      include/linux/printk.h:54:2: error: unknown type name ‘va_list’
      include/linux/printk.h:105:21: error: unknown type name ‘va_list’
      include/linux/printk.h:108:30: error: unknown type name ‘va_list’
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20074f35
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 251df49d
      Linus Torvalds authored
      Pull input updates from Dmitry Torokhov:
       "Assorted fixes and cleanups to the existing drivers plus a new driver
        for IMS Passenger Control Unit device they use for their in-flight
        entertainment system."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (44 commits)
        Input: trackpoint - Optimize trackpoint init to use power-on reset
        Input: apbps2 - convert to devm_ioremap_resource()
        Input: ALPS - use %ph to print buffers
        ARM - shmobile: Armadillo800EVA: Move st1232 reset pin handling
        Input: st1232 - add reset pin handling
        Input: st1232 - convert to devm_* infrastructure
        Input: MT - handle semi-mt devices in core
        Input: adxl34x - use spi_get_drvdata()
        Input: ad7877 - use spi_get_drvdata() and spi_set_drvdata()
        Input: ads7846 - use spi_get_drvdata() and spi_set_drvdata()
        Input: ims-pcu - fix a memory leak on error
        Input: sysrq - supplement reset sequence with timeout functionality
        Input: tegra-kbc - support for defining row/columns based on SoC
        Input: imx_keypad - switch to using managed resources
        Input: arc_ps2 - add support for device tree
        Input: mma8450 - fix signed 12bits to 32bits conversion
        Input: eeti_ts - remove redundant null check
        Input: edt-ft5x06 - remove redundant null check before kfree
        Input: ad714x - add CONFIG_PM_SLEEP to suspend/resume functions
        Input: adxl34x - add CONFIG_PM_SLEEP to suspend/resume functions
        ...
      251df49d
    • af_unix: fix a fatal race with bit fields · 60bc851a
      Eric Dumazet authored
      Using bit fields is dangerous on ppc64/sparc64, as the compiler [1]
      uses 64bit instructions to manipulate them.
      If the 64bit word includes any atomic_t or spinlock_t, we can lose
      critical concurrent changes.
      
      This is happening in af_unix, where unix_sk(sk)->gc_candidate/
      gc_maybe_cycle/lock share the same 64bit word.
      
      This leads to fatal deadlock, as one/several cpus spin forever
      on a spinlock that will never be available again.
      
      A safer way would be to use a long to store flags.
      This way we are sure the compiler/arch won't do bad things.
      
      As we own unix_gc_lock spinlock when clearing or setting bits,
      we can use the non atomic __set_bit()/__clear_bit().
      
      recursion_level can share the same 64bit location with the spinlock,
      as it is set only with this spinlock held.
      
      [1] bug fixed in gcc-4.8.0 :
      http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080
      Reported-by: Ambrose Feinstein <ambrose@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60bc851a
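The idea of keeping the flags in a word of their own can be sketched in userspace C. This is only an illustration under stated assumptions: `struct gc_state`, the bit numbers, and the `flag_*` helpers are hypothetical names, not the kernel's, and the helpers stand in for what non-atomic bit operations do on a single word. The caller is assumed to hold the lock that guards the flags.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical flag bit numbers, named after the fields in the commit text. */
enum { GC_CANDIDATE = 0, GC_MAYBE_CYCLE = 1 };

/* Hypothetical stand-in for the socket state: the flags get a whole word of
 * their own, so updating them never rewrites a neighbouring field. */
struct gc_state {
    unsigned long gc_flags;
};

/* Non-atomic set; the caller is assumed to hold the lock guarding gc_flags,
 * which is why a plain read-modify-write is enough here. */
void flag_set(struct gc_state *s, unsigned int bit)
{
    assert(bit < sizeof(s->gc_flags) * CHAR_BIT);
    s->gc_flags |= 1UL << bit;
}

/* Non-atomic clear, under the same locking assumption. */
void flag_clear(struct gc_state *s, unsigned int bit)
{
    assert(bit < sizeof(s->gc_flags) * CHAR_BIT);
    s->gc_flags &= ~(1UL << bit);
}

/* Returns 1 if the bit is set, 0 otherwise. */
int flag_test(const struct gc_state *s, unsigned int bit)
{
    return (int)((s->gc_flags >> bit) & 1UL);
}
```

Because the flags occupy a full `unsigned long`, a wide store that updates them cannot overwrite an adjacent lock or counter, which is the failure mode the commit message describes for bit fields.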
    • Merge branch 'bnx2x' · c3b28ea3
      David S. Miller authored
      Yuval Mintz says:
      
      ====================
      This fixes 2 small bugs - one which may cause an unnecessary link flap,
      and the other is a small memory leak when unloading while cnic is not
      loaded.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3b28ea3
    • bnx2x: Prevent memory leak when cnic is absent · 05952246
      Yuval Mintz authored
      The bnx2x driver allocates searcher T2 tables, but during unload it
      releases that memory only if the cnic is loaded.
      Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
      Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05952246
    • bnx2x: correct reading of speed capabilities · b0261926
      Yaniv Rosner authored
      When the bnx2x driver reads the port configuration - mask irrelevant bits.
      
      Without this change, the unintended bits may cause the driver to needlessly
      toggle the link, as a comparison in the link flap avoidance flow will show
      that the old link did not advertise the same capabilities and thus cannot
      be retained.
      Signed-off-by: Yaniv Rosner <yanivr@broadcom.com>
      Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
      Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0261926
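The masking described above can be sketched as follows. The capability bit values and the names `SPEED_CAP_*`, `speed_caps()` and `link_can_be_retained()` are hypothetical and chosen only for illustration; the real driver's register layout is not shown in the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; the real driver's layout differs. */
#define SPEED_CAP_1G    0x0001u
#define SPEED_CAP_10G   0x0002u
#define SPEED_CAP_20G   0x0004u
#define SPEED_CAP_MASK  (SPEED_CAP_1G | SPEED_CAP_10G | SPEED_CAP_20G)

/* Keep only the bits that describe speed capabilities. */
uint32_t speed_caps(uint32_t raw_cfg)
{
    return raw_cfg & SPEED_CAP_MASK;
}

/* The existing link can be kept only if the old and new configurations
 * advertise the same capabilities once irrelevant bits are masked off. */
bool link_can_be_retained(uint32_t old_cfg, uint32_t new_cfg)
{
    return speed_caps(old_cfg) == speed_caps(new_cfg);
}
```

Without the mask, two configurations that differ only in unrelated bits would compare unequal and trigger a needless link toggle.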
    • net: sctp: attribute printl with __printf for gcc fmt checks · be3e4581
      Daniel Borkmann authored
      Let GCC check for format string errors in sctp's probe printl
      function. This patch fixes the warning when compiled with W=1:
      
      net/sctp/probe.c:73:2: warning: function might be possible candidate
      for 'gnu_printf' format attribute [-Wmissing-format-attribute]
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      be3e4581
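For readers unfamiliar with the attribute, here is a minimal userspace sketch of a variadic helper annotated for GCC format checking. The function `log_line()` is hypothetical; it is not the sctp probe function. In the kernel, `__printf(a, b)` is shorthand for `__attribute__((format(printf, a, b)))`.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical log helper. The format attribute tells GCC that argument 3 is
 * a printf-style format string and that the variadic arguments start at 4,
 * so mismatched format specifiers are reported at compile time. */
__attribute__((format(printf, 3, 4)))
int log_line(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);
    va_end(args);
    return n;
}
```

With the attribute in place, a call such as `log_line(buf, sizeof(buf), "%d", "text")` draws a format warning instead of silently misbehaving at run time.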
    • netlink: kconfig: move mmap i/o into netlink kconfig · ee1bec9b
      Daniel Borkmann authored
      Currently, in menuconfig, Netlink's new mmaped IO is the very first
      entry under the ``Networking support'' item and comes even before
      ``Networking options'':
      
        [ ]   Netlink: mmaped IO
        Networking options  --->
        ...
      
      Let's move this into ``Networking options'' under netlink's Kconfig,
      since this might be more appropriate. Introduced by commit ccdfcc39
      (``netlink: mmaped netlink: ring setup'').
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee1bec9b
    • netpoll: convert mutex into a semaphore · bd7c4b60
      Neil Horman authored
      Bart Van Assche recently reported a warning to me:
      
      <IRQ>  [<ffffffff8103d79f>] warn_slowpath_common+0x7f/0xc0
      [<ffffffff8103d7fa>] warn_slowpath_null+0x1a/0x20
      [<ffffffff814761dd>] mutex_trylock+0x16d/0x180
      [<ffffffff813968c9>] netpoll_poll_dev+0x49/0xc30
      [<ffffffff8136a2d2>] ? __alloc_skb+0x82/0x2a0
      [<ffffffff81397715>] netpoll_send_skb_on_dev+0x265/0x410
      [<ffffffff81397c5a>] netpoll_send_udp+0x28a/0x3a0
      [<ffffffffa0541843>] ? write_msg+0x53/0x110 [netconsole]
      [<ffffffffa05418bf>] write_msg+0xcf/0x110 [netconsole]
      [<ffffffff8103eba1>] call_console_drivers.constprop.17+0xa1/0x1c0
      [<ffffffff8103fb76>] console_unlock+0x2d6/0x450
      [<ffffffff8104011e>] vprintk_emit+0x1ee/0x510
      [<ffffffff8146f9f6>] printk+0x4d/0x4f
      [<ffffffffa0004f1d>] scsi_print_command+0x7d/0xe0 [scsi_mod]
      
      This resulted from my commit ca99ca14 which introduced a mutex_trylock
      operation in a path that could execute in interrupt context.  When mutex
      debugging is enabled, the above warns the user when we are in fact
      executing in interrupt context.
      
      After some discussion, it seems that a semaphore is the proper mechanism to use
      here.  While mutexes are defined to be unusable in interrupt context, no such
      condition exists for semaphores (save for the fact that the non blocking api
      calls, like up and down_trylock must be used when in irq context).
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      CC: Bart Van Assche <bvanassche@acm.org>
      CC: David Miller <davem@davemloft.net>
      CC: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd7c4b60
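The non-blocking semaphore calls mentioned above never sleep, which is what makes them usable where sleeping is not allowed. A rough userspace sketch of that try-acquire behaviour, using C11 atomics, is shown below. `struct try_sem` and its functions are hypothetical and are not the kernel's semaphore implementation; the return convention (0 on success, 1 on failure) follows the kernel's `down_trylock()`.

```c
#include <stdatomic.h>

/* Hypothetical userspace counting semaphore; not the kernel's struct semaphore. */
struct try_sem {
    atomic_int count;
};

void try_sem_init(struct try_sem *s, int value)
{
    atomic_init(&s->count, value);
}

/* Non-blocking acquire: returns 0 if the semaphore was taken, 1 if it is
 * busy. It never waits, so the caller must handle the busy case itself. */
int try_sem_down_trylock(struct try_sem *s)
{
    int old = atomic_load(&s->count);

    while (old > 0) {
        /* On failure, old is refreshed with the current count and we retry. */
        if (atomic_compare_exchange_weak(&s->count, &old, old - 1))
            return 0;
    }
    return 1;
}

/* Release: never blocks. */
void try_sem_up(struct try_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

A caller in a context that must not sleep would simply skip the protected work when `try_sem_down_trylock()` returns 1, rather than waiting for the holder.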
    • netlink: Fix skb ref counting. · ae6164ad
      Pravin B Shelar authored
      Commit f9c22888 (netlink: implement memory mapped recvmsg) incremented
      the skb->users ref count twice for a dump op, which does not look right.
      
      The following patch fixes that.
      
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae6164ad
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 8a72f382
      Linus Torvalds authored
      Pull tile arch changes from Chris Metcalf:
       "These are some minor new feature work and other changes that didn't
        merit getting pushed up after the 3.9 merge window closed.
      
        There should be a lot more activity in the 3.11 merge window"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
        arch/tile: Fix syscall return value passed to tracepoint
        tile: comment assumption about __insn_mtspr for <asm/irqflags.h>
        tile: ns2cycles should use __raw_get_cpu_var
        arch: remove KCORE_ELF again [tile]
        tile: remove two outdated Kconfig entries
        tile: support atomic64_dec_if_positive()
        tile: support TIF_SYSCALL_TRACEPOINT; select HAVE_SYSCALL_TRACEPOINTS
        tile: Add definition of NR_syscalls
        tile: move declaration of sys_call_table to <asm/syscall.h>
        arch/tile: Enable HAVE_ARCH_TRACEHOOK
        arch/tile: Call tracehook_report_syscall_{entry,exit} in syscall trace
      8a72f382
    • init: Do not warn on non-zero initcall return · bf5d770b
      Steven Rostedt authored
      Commit f91eb62f ("init: scream bloody murder if interrupts are
      enabled too early") added three new warnings.  The first two seemed
      reasonable, but the third included a warning when an initcall returned
      non-zero.  Although the third WARN() also covers an imbalanced preempt
      disable or disabled irqs, it shouldn't warn when an initcall merely
      returns non-zero.
      
      In fact, according to Linus, it shouldn't print at all, as it only
      prints with initcall_debug set, and that already shows enough
      information to fix things.
      
      Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf5d770b
    • net_sched: act_ipt forward compat with xtables · 0dcffd09
      Jamal Hadi Salim authored
      Deal with changes in newer xtables while maintaining backward
      compatibility. Thanks to Jan Engelhardt for suggestions.
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0dcffd09
    • Merge branch 'topic/omap3isp' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a49fe6d5
      Linus Torvalds authored
      Pull omap3isp clk support from Mauro Carvalho Chehab:
       "This patch was sent separately as it depends on a merge from the clock
        framework, which you merged in commit 362ed48d"
      
      * 'topic/omap3isp' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        [media] omap3isp: Use the common clock framework
      a49fe6d5
    • Merge branch 'next' into for-linus · bf61c884
      Dmitry Torokhov authored
      Prepare first set of updates for 3.10 merge window.
      bf61c884
    • Merge branch 'ipc-scalability' · 823e75f7
      Linus Torvalds authored
      Merge IPC cleanup and scalability patches from Andrew Morton.
      
      This cleans up many of the oddities in the IPC code, uses the list
      iterator helpers, splits out locking and adds per-semaphore locks for
      greater scalability of the IPC semaphore code.
      
      Most normal user-level locking by now uses futexes (ie pthreads, but
      also a lot of specialized locks), but SysV IPC semaphores are apparently
      still used in some big applications, either for portability reasons, or
      because they offer tracking and undo (and you don't need to have a
      special shared memory area for them).
      
      Our IPC semaphore scalability was pitiful.  We used to lock much too big
      ranges, and we used to have a single ipc lock per ipc semaphore array.
      Most loads never cared, but some do.  There are some numbers in the
      individual commits.
      
      * ipc-scalability:
        ipc: sysv shared memory limited to 8TiB
        ipc/msg.c: use list_for_each_entry_[safe] for list traversing
        ipc,sem: fine grained locking for semtimedop
        ipc,sem: have only one list in struct sem_queue
        ipc,sem: open code and rename sem_lock
        ipc,sem: do not hold ipc lock more than necessary
        ipc: introduce lockless pre_down ipcctl
        ipc: introduce obtaining a lockless ipc object
        ipc: remove bogus lock comment for ipc_checkid
        ipc/msgutil.c: use linux/uaccess.h
        ipc: refactor msg list search into separate function
        ipc: simplify msg list search
        ipc: implement MSG_COPY as a new receive mode
        ipc: remove msg handling from queue scan
        ipc: set EFAULT as default error in load_msg()
        ipc: tighten msg copy loops
        ipc: separate msg allocation from userspace copy
        ipc: clamp with min()
      823e75f7
    • ipc: sysv shared memory limited to 8TiB · d69f3bad
      Robin Holt authored
      While running an application that tried to place data into half of
      memory using shmget(), we found that a shmall value below 8EiB-8TiB
      prevented us from using more than 8TiB.  Setting kernel.shmall greater
      than 8EiB-8TiB made the job work.
      
      In the newseg() function, ns->shm_tot reaches INT_MAX at 8TiB:
      
      ipc/shm.c:
      static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
      {
      ...
              int numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
      ...
              if (ns->shm_tot + numpages > ns->shm_ctlall)
                      return -ENOSPC;
      
      [akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
      Signed-off-by: Robin Holt <holt@sgi.com>
      Reported-by: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d69f3bad
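The 8TiB threshold follows from the page arithmetic. The sketch below assumes 4KiB pages and a 32-bit `int`; the names `SKETCH_PAGE_SHIFT`, `pages_for_size()` and `page_count_fits_int()` are hypothetical and only mirror the rounding shown in the commit's excerpt.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size of 4KiB; other configurations shift the threshold. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round a byte size up to whole pages, as the numpages computation does. */
uint64_t pages_for_size(uint64_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/* True when the page count can still be held in an int-typed variable. */
bool page_count_fits_int(uint64_t pages)
{
    return pages <= (uint64_t)INT_MAX;
}
```

Under these assumptions, 8TiB is 2^43 bytes, which is 2^31 pages: exactly one more than a 32-bit `INT_MAX`, so an `int` page count can no longer represent it.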
    • ipc/msg.c: use list_for_each_entry_[safe] for list traversing · 41239fe8
      Nikola Pajkovsky authored
      The ipc/msg.c code does its list operations by hand and open-codes the
      accesses, instead of using list_for_each_entry_[safe].
      Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41239fe8
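To show what an entry iterator buys over open-coded traversal, here is a self-contained userspace sketch. The `sketch_*` macros and functions are simplified stand-ins written for this example, not the kernel's `<linux/list.h>` definitions, and `struct sketch_msg` is a hypothetical message type. The macro relies on the GCC/Clang `__typeof__` extension.

```c
#include <stddef.h>

/* Minimal doubly linked list node, shaped like the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Visit every entry on the list; the loop ends when we are back at the head. */
#define sketch_list_for_each_entry(pos, head, member)                        \
    for (pos = sketch_container_of((head)->next, __typeof__(*pos), member);  \
         &pos->member != (head);                                             \
         pos = sketch_container_of(pos->member.next, __typeof__(*pos), member))

/* Hypothetical message type with an embedded list node. */
struct sketch_msg {
    long type;
    struct list_head link;
};

void sketch_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

void sketch_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Sum the type field of every queued message using the iterator macro. */
long sketch_sum_types(struct list_head *head)
{
    struct sketch_msg *msg;
    long sum = 0;

    sketch_list_for_each_entry(msg, head, link)
        sum += msg->type;
    return sum;
}
```

The loop body works directly with `struct sketch_msg` pointers, so the pointer arithmetic that open-coded traversal repeats at every call site lives in one macro.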
    • ipc,sem: fine grained locking for semtimedop · 6062a8dc
      Rik van Riel authored
      Introduce finer grained locking for semtimedop, to handle the common case
      of a program wanting to manipulate one semaphore from an array with
      multiple semaphores.
      
      If the call is a semop manipulating just one semaphore in an array with
      multiple semaphores, only take the lock for that semaphore itself.
      
      If the call needs to manipulate multiple semaphores, or another caller is
      in a transaction that manipulates multiple semaphores, the sem_array lock
      is taken, as well as all the locks for the individual semaphores.
      
      On a 24 CPU system, performance numbers with the semop-multi
      test with N threads and N semaphores, look like this:
      
      threads  vanilla  Davidlohr's  Davidlohr's +   Davidlohr's +
                        patches      rwlock patches  v3 patches
      10       610652   726325       1783589         2142206
      20       341570   365699       1520453         1977878
      30       288102   307037       1498167         2037995
      40       290714   305955       1612665         2256484
      50       288620   312890       1733453         2650292
      60       289987   306043       1649360         2388008
      70       291298   306347       1723167         2717486
      80       290948   305662       1729545         2763582
      90       290996   306680       1736021         2757524
      100      292243   306700       1773700         3059159
      
      [davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
      [davidlohr.bueso@hp.com: make refcounter atomic]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Chegu Vinod <chegu_vinod@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Emmanuel Benisty <benisty.e@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6062a8dc
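The locking rule in this commit message boils down to a two-way decision, sketched below. The enum, the function name and the `complex_in_flight` parameter are hypothetical; the sketch shows only the choice of scope and leaves out the actual lock acquisition and any re-checking the real code needs after taking a lock.

```c
/* Which locks a semaphore operation needs, per the rule described above. */
enum sem_lock_scope {
    LOCK_SINGLE_SEM,   /* only the lock of the one semaphore involved */
    LOCK_WHOLE_ARRAY   /* the sem_array lock plus every per-semaphore lock */
};

/* nsops: number of semaphore operations in this call.
 * complex_in_flight: hypothetical count of callers currently inside a
 * transaction that manipulates multiple semaphores of the same array. */
enum sem_lock_scope choose_lock_scope(unsigned int nsops,
                                      unsigned int complex_in_flight)
{
    if (nsops == 1 && complex_in_flight == 0)
        return LOCK_SINGLE_SEM;
    return LOCK_WHOLE_ARRAY;
}
```

The common case of one operation on one semaphore takes the narrow path, which is where the throughput gains in the table above come from.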
    • ipc,sem: have only one list in struct sem_queue · 9f1bc2c9
      Rik van Riel authored
      Having only one list in struct sem_queue, and only queueing simple
      semaphore operations on the list for the semaphore involved, allows us to
      introduce finer grained locking for semtimedop.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Chegu Vinod <chegu_vinod@hp.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f1bc2c9
    • ipc,sem: open code and rename sem_lock · c460b662
      Rik van Riel authored
      Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
      later that only locks the sem_array and does nothing else.
      
      Open code the locking from ipc_lock() in sem_obtain_lock() so we can
      introduce finer grained locking for the sem_array in the next patch.
      
      [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Chegu Vinod <chegu_vinod@hp.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c460b662
    • ipc,sem: do not hold ipc lock more than necessary · 16df3674
      Davidlohr Bueso authored
      Instead of holding the ipc lock for permissions and security checks, among
      others, only acquire it when necessary.
      
      Some numbers....
      
      1) With Rik's semop-multi.c microbenchmark we can see the following
         results:
      
      Baseline (3.9-rc1):
      cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
      total operations: 151452270, ops/sec 5048409
      
      +  59.40%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
      +   6.14%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
      +   3.84%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
      +   3.64%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
      +   2.06%            a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
      +   1.86%            a.out  [kernel.kallsyms]  [k] ipc_lock
      
      With this patchset:
      cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
      total operations: 273156400, ops/sec 9105213
      
      +  18.54%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
      +  11.72%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
      +   7.70%            a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
      +   6.58%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
      +   6.54%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
      +   4.71%            a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check
      
      2) While on an Oracle swingbench DSS (data mining) workload the
         improvements are not as exciting as with Rik's benchmark, we can see
         some positive numbers.  For an 8 socket machine the following are the
         percentages of %sys time incurred in the ipc lock:
      
      Baseline (3.9-rc1):
      100 swingbench users: 8,74%
      400 swingbench users: 21,86%
      800 swingbench users: 84,35%
      
      With this patchset:
      100 swingbench users: 8,11%
      400 swingbench users: 19,93%
      800 swingbench users: 77,69%
      
      [riel@redhat.com: fix two locking bugs]
      [sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16df3674
    • ipc: introduce lockless pre_down ipcctl · 444d0f62
      Davidlohr Bueso authored
      Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
      check permissions, mostly for IPC_RMID and IPC_SET commands.
      
      Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
      The locking version is retained, yet modified to call the nolock version
      without affecting its semantics, thus transparent to all ipc callers.
      Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Chegu Vinod <chegu_vinod@hp.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      444d0f62
    • ipc: introduce obtaining a lockless ipc object · 4d2bff5e
      Davidlohr Bueso authored
      Through ipc_lock() and therefore ipc_lock_check() we currently return the
      locked ipc object.  This is not necessary for all situations and can,
      therefore, cause unnecessary ipc lock contention.
      
      Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
      functions that only lookup and return the ipc object.
      
      Both these functions must be called within the RCU read critical section.
      
      [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
      Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d2bff5e
    • ipc: remove bogus lock comment for ipc_checkid · 7bb4deff
      Davidlohr Bueso authored
      This series makes the sysv semaphore code more scalable, by reducing the
      time the semaphore lock is held, and making the locking more scalable for
      semaphore arrays with multiple semaphores.
      
      The first four patches were written by Davidlohr Bueso, and reduce the
      hold time of the semaphore lock.
      
      The last three patches change the sysv semaphore code locking to be more
      fine grained, providing a performance boost when multiple semaphores in a
      semaphore array are being manipulated simultaneously.
      
      On a 24 CPU system, performance numbers with the semop-multi
      test with N threads and N semaphores, look like this:
      
      threads  vanilla  Davidlohr's  Davidlohr's +   Davidlohr's +
                        patches      rwlock patches  v3 patches
      10       610652   726325       1783589         2142206
      20       341570   365699       1520453         1977878
      30       288102   307037       1498167         2037995
      40       290714   305955       1612665         2256484
      50       288620   312890       1733453         2650292
      60       289987   306043       1649360         2388008
      70       291298   306347       1723167         2717486
      80       290948   305662       1729545         2763582
      90       290996   306680       1736021         2757524
      100      292243   306700       1773700         3059159
      
      This patch:
      
      There is no reason to be holding the ipc lock while reading ipcp->seq,
      hence remove misleading comment.
      
      Also simplify the return value for the function.
      Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Chegu Vinod <chegu_vinod@hp.com>
      Cc: Emmanuel Benisty <benisty.e@gmail.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bb4deff
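As an illustration of what a simplified return value can look like for a check of this kind, the sketch below returns the comparison result directly rather than branching to return 1 or 0. Everything here is an assumption made for the example: the structure, the multiplier value, and the premise that an id encodes a sequence number as `id / multiplier`. It is not a copy of the kernel function.

```c
#include <stdbool.h>

/* Hypothetical multiplier separating the sequence part of an id from the
 * index part; the value is chosen only for this example. */
#define SKETCH_SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for the permission structure holding the sequence. */
struct sketch_ipc_perm {
    unsigned long seq;
};

/* Returns true when the id's sequence part does not match the object's,
 * meaning the id is stale. The comparison is returned directly. */
bool sketch_id_is_stale(const struct sketch_ipc_perm *ipcp, int id)
{
    return (unsigned long)(id / SKETCH_SEQ_MULTIPLIER) != ipcp->seq;
}
```

Reading `seq` here takes no lock, which matches the commit's point that the old comment about holding the ipc lock was misleading.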