1. 01 Apr, 2019 1 commit
    • Ilya Dryomov's avatar
      libceph: handle an empty authorize reply · 948da95e
      Ilya Dryomov authored
      BugLink: https://bugs.launchpad.net/bugs/1822271
      
      commit 0fd3fd0a upstream.
      
      The authorize reply can be empty, for example when the ticket used to
      build the authorizer is too old and TAG_BADAUTHORIZER is returned from
      the service.  Calling ->verify_authorizer_reply() results in an attempt
      to decrypt and validate (somewhat) random data in au->buf (most likely
      the signature block from calc_signature()), which fails and ends up in
      con_fault_finish() with !con->auth_retry.  The ticket isn't invalidated
      and the connection is retried again and again until a new ticket is
      obtained from the monitor:
      
        libceph: osd2 192.168.122.1:6809 bad authorize reply
        libceph: osd2 192.168.122.1:6809 bad authorize reply
        libceph: osd2 192.168.122.1:6809 bad authorize reply
        libceph: osd2 192.168.122.1:6809 bad authorize reply
      
      Let TAG_BADAUTHORIZER handler kick in and increment con->auth_retry.
      
      Cc: stable@vger.kernel.org
      Fixes: 5c056fdc ("libceph...
      948da95e
  2. 14 Mar, 2019 1 commit
  3. 23 May, 2018 1 commit
  4. 27 Apr, 2017 1 commit
    • Ilya Dryomov's avatar
      libceph: force GFP_NOIO for socket allocations · 4bd5240f
      Ilya Dryomov authored
      BugLink: http://bugs.launchpad.net/bugs/1681862
      
      commit 633ee407 upstream.
      
      sock_alloc_inode() allocates socket+inode and socket_wq with
      GFP_KERNEL, which is not allowed on the writeback path:
      
          Workqueue: ceph-msgr con_work [libceph]
          ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
          0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
          ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
          Call Trace:
          [<ffffffff816dd629>] schedule+0x29/0x70
          [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
          [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
          [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
          [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
          [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
          [<ffffffff81086335>] flush_work+0x165/0x250
          [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
          [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
          [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
          [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
          [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
          [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
          [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
          [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
          [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
          [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
          [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
          [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
          [<ffffffff8115af70>] shrink_slab+0x100/0x140
          [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
          [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
          [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
          [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
          [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
          [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
          [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
          [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
          [<ffffffff811d8566>] alloc_inode+0x26/0xa0
          [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
          [<ffffffff815b933e>] sock_alloc+0x1e/0x80
          [<ffffffff815ba855>] __sock_create+0x95/0x220
          [<ffffffff815baa04>] sock_create_kern+0x24/0x30
          [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
          [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
          [<ffffffff81084c19>] process_one_work+0x159/0x4f0
          [<ffffffff8108561b>] worker_thread+0x11b/0x530
          [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
          [<ffffffff8108b6f9>] kthread+0xc9/0xe0
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
          [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
      
      Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
      
      Link: http://tracker.ceph.com/issues/19309
      
      Reported-by: default avatarSergey Jerusalimov <wintchester@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: default avatarStefan Bader <stefan.bader@canonical.com>
      4bd5240f
  5. 20 Jan, 2017 1 commit
  6. 06 Apr, 2016 3 commits
  7. 02 Nov, 2015 5 commits
    • Ilya Dryomov's avatar
      libceph: clear msg->con in ceph_msg_release() only · 583d0fef
      Ilya Dryomov authored
      The following bit in ceph_msg_revoke_incoming() is unsafe:
      
          struct ceph_connection *con = msg->con;
          if (!con)
                  return;
          mutex_lock(&con->mutex);
          <more msg->con use>
      
      There is nothing preventing con from getting destroyed right after
      msg->con test.  One easy way to reproduce this is to disable message
      signing only on the server side and try to map an image.  The system
      will go into a
      
          libceph: read_partial_message ffff880073f0ab68 signature check failed
          libceph: osd0 192.168.255.155:6801 bad crc/signature
          libceph: read_partial_message ffff880073f0ab68 signature check failed
          libceph: osd0 192.168.255.155:6801 bad crc/signature
      
      loop which has to be interrupted with Ctrl-C.  Hit Ctrl-C and you are
      likely to end up with a random GP fault if the reset handler executes
      "within" ceph_msg_revoke_incoming():
      
                           <yet another reply w/o a signature>
                                         ...
                <Ctrl-C>
          rbd_obj_request_end
            ceph_osdc_cancel_request
              __unregister_request
                ceph_osdc_put_request
                  ceph_msg_revoke_incoming
                                         ...
                                      osd_reset
                                        __kick_osd_requests
                                          __reset_osd
                                            remove_osd
                                              ceph_con_close
                                                reset_connection
                                                  <clear con->in_msg->con>
                                                  <put con ref>
                                                    put_osd
                                                      <free osd/con>
                    <msg->con use> <-- !!!
      
      If ceph_msg_revoke_incoming() executes "before" the reset handler,
      osd/con will be leaked because ceph_msg_revoke_incoming() clears
      con->in_msg but doesn't put con ref, while reset_connection() only puts
      con ref if con->in_msg != NULL.
      
      The current msg->con scheme was introduced by commits 38941f80
      ("libceph: have messages point to their connection") and 92ce034b
      
      
      ("libceph: have messages take a connection reference"), which defined
      when messages get associated with a connection and when that
      association goes away.  Part of the problem is that this association is
      supposed to go away in much too many places; closing this race entirely
      requires either a rework of the existing or an addition of a new layer
      of synchronization.
      
      In lieu of that, we can make it *much* less likely to hit by
      disassociating messages only on their destruction and resend through
      a different connection.  This makes the code simpler and is probably
      a good thing to do regardless - this patch adds a msg_con_set() helper
      which is is called from only three places: ceph_con_send() and
      ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear
      it.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      583d0fef
    • Ilya Dryomov's avatar
      libceph: add nocephx_sign_messages option · a51983e4
      Ilya Dryomov authored
      
      Support for message signing was merged into 3.19, along with
      nocephx_require_signatures option.  But, all that option does is allow
      the kernel client to talk to clusters that don't support MSG_AUTH
      feature bit.  That's pretty useless, given that it's been supported
      since bobtail.
      
      Meanwhile, if one disables message signing on the server side with
      "cephx sign messages = false", it becomes impossible to use the kernel
      client since it expects messages to be signed if MSG_AUTH was
      negotiated.  Add nocephx_sign_messages option to support this use case.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      a51983e4
    • Ilya Dryomov's avatar
      libceph: stop duplicating client fields in messenger · 859bff51
      Ilya Dryomov authored
      
      supported_features and required_features serve no purpose at all, while
      nocrc and tcp_nodelay belong to ceph_options::flags.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      859bff51
    • Ilya Dryomov's avatar
      libceph: msg signing callouts don't need con argument · 79dbd1ba
      Ilya Dryomov authored
      
      We can use msg->con instead - at the point we sign an outgoing message
      or check the signature on the incoming one, msg->con is always set.  We
      wouldn't know how to sign a message without an associated session (i.e.
      msg->con == NULL) and being able to sign a message using an explicitly
      provided authorizer is of no use.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      79dbd1ba
    • Shraddha Barke's avatar
      libceph: use local variable cursor instead of &msg->cursor · 343128ce
      Shraddha Barke authored
      
      Use local variable cursor in place of &msg->cursor in
      read_partial_msg_data() and write_partial_msg_data().
      Signed-off-by: default avatarShraddha Barke <shraddha.6596@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      343128ce
  8. 17 Sep, 2015 1 commit
  9. 09 Sep, 2015 1 commit
    • Ilya Dryomov's avatar
      libceph: check data_len in ->alloc_msg() · d15f9d69
      Ilya Dryomov authored
      
      Only ->alloc_msg() should check data_len of the incoming message
      against the preallocated ceph_msg, doing it in the messenger is not
      right.  The contract is that either ->alloc_msg() returns a ceph_msg
      which will fit all of the portions of the incoming message, or it
      returns NULL and possibly sets skip, signaling whether NULL is due to
      an -ENOMEM.  ->alloc_msg() should be the only place where we make the
      skip/no-skip decision.
      
      I stumbled upon this while looking at con/osd ref counting.  Right now,
      if we get a non-extent message with a larger data portion than we are
      prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
      it in the messenger, we don't put the con/osd ref acquired in
      ceph_con_in_msg_alloc() (which is normally put in process_message()),
      so this also fixes a memory leak.
      
      An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
      corrupt random memory should a buggy ->alloc_msg() return an unfit
      ceph_msg.
      
      While at it, I changed the "unknown tid" dout() to a pr_warn() to make
      sure all skips are seen and unified format strings.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarAlex Elder <elder@linaro.org>
      d15f9d69
  10. 08 Sep, 2015 3 commits
  11. 09 Jul, 2015 2 commits
    • Ilya Dryomov's avatar
      libceph: treat sockaddr_storage with uninitialized family as blank · c44bd69c
      Ilya Dryomov authored
      
      addr_is_blank() should return true if family is neither AF_INET nor
      AF_INET6.  This is what its counterpart entity_addr_t::is_blank_ip() is
      doing and it is the right thing to do: in process_banner() we check if
      our address is blank and if it is "learn" it from our peer.  As it is,
      we never learn our address and always send out a blank one.  This goes
      way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
      some ipv6 support groundwork") from 2009.
      
      While at at, do not open-code ipv6_addr_any() and use INADDR_ANY
      constant instead of 0.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      c44bd69c
    • Ilya Dryomov's avatar
      libceph: enable ceph in a non-default network namespace · 757856d2
      Ilya Dryomov authored
      
      Grab a reference on a network namespace of the 'rbd map' (in case of
      rbd) or 'mount' (in case of ceph) process and use that to open sockets
      instead of always using init_net and bailing if network namespace is
      anything but init_net.  Be careful to not share struct ceph_client
      instances between different namespaces and don't add any code in the
      !CONFIG_NET_NS case.
      
      This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      757856d2
  12. 29 Jun, 2015 1 commit
  13. 25 Jun, 2015 1 commit
  14. 11 May, 2015 1 commit
  15. 20 Apr, 2015 1 commit
    • Ilya Dryomov's avatar
      libceph: don't overwrite specific con error msgs · 67c64eb7
      Ilya Dryomov authored
      - specific con->error_msg messages (e.g. "protocol version mismatch")
        end up getting overwritten by a catch-all "socket error on read
        / write", introduced in commit 3a140a0d
      
       ("libceph: report socket
        read/write error message")
      - "bad message sequence # for incoming message" loses to "bad crc" due
        to the fact that -EBADMSG is used for both
      
      Fix it, and tidy up con->error_msg assignments and pr_errs while at it.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      67c64eb7
  16. 07 Apr, 2015 1 commit
    • Ilya Dryomov's avatar
      Revert "libceph: use memalloc flags for net IO" · 6d7fdb0a
      Ilya Dryomov authored
      This reverts commit 89baaa57.
      
      Dirty page throttling should be sufficient for us in the general case
      so there is no need to use __GFP_MEMALLOC - it would be needed only in
      the swap-over-rbd case, which we currently don't support.  (It would
      probably take approximately the commit that is being reverted to add
      that support, but we would also need the "swap" option to distinguish
      from the general case and make sure swap ceph_client-s aren't shared
      with anything else.)  See ceph-devel threads [1] and [2] for the
      details of why enabling pfmemalloc reserves for all cases is a bad
      thing.
      
      On top of potential system lockups related to drained emergency
      reserves, this turned out to cause ceph lockups in case peers are on
      the same host and communicating via loopback due to sk_filter()
      dropping pfmemalloc skbs on the receiving side because the receiving
      loopback socket is not tagged with SOCK_MEMALLOC.
      
      [1] "SOCK_MEMALLOC vs loopback"
        ...
      6d7fdb0a
  17. 19 Feb, 2015 1 commit
  18. 17 Dec, 2014 2 commits
  19. 30 Oct, 2014 1 commit
    • Mike Christie's avatar
      libceph: use memalloc flags for net IO · 89baaa57
      Mike Christie authored
      
      This patch has ceph's lib code use the memalloc flags.
      
      If the VM layer needs to write data out to free up memory to handle new
      allocation requests, the block layer must be able to make forward progress.
      To handle that requirement we use structs like mempools to reserve memory for
      objects like bios and requests.
      
      The problem is when we send/receive block layer requests over the network
      layer, net skb allocations can fail and the system can lock up.
      To solve this, the memalloc related flags were added. NBD, iSCSI
      and NFS uses these flags to tell the network/vm layer that it should
      use memory reserves to fullfill allcation requests for structs like
      skbs.
      
      I am running ceph in a bunch of VMs in my laptop, so this patch was
      not tested very harshly.
      Signed-off-by: default avatarMike Christie <michaelc@cs.wisc.edu>
      Reviewed-by: default avatarIlya Dryomov <idryomov@redhat.com>
      89baaa57
  20. 14 Oct, 2014 3 commits
  21. 09 Aug, 2014 1 commit
    • Ilya Dryomov's avatar
      libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly · 5f740d7e
      Ilya Dryomov authored
      
      Determining ->last_piece based on the value of ->page_offset + length
      is incorrect because length here is the length of the entire message.
      ->last_piece set to false even if page array data item length is <=
      PAGE_SIZE, which results in invalid length passed to
      ceph_tcp_{send,recv}page() and causes various asserts to fire.
      
          # cat pages-cursor-init.sh
          #!/bin/bash
          rbd create --size 10 --image-format 2 foo
          FOO_DEV=$(rbd map foo)
          dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          # rbd_resize calls librbd rbd_resize(), size is in bytes
          ./rbd_resize bar $(((4 << 20) + 512))
          rbd resize --size 10 bar
          BAR_DEV=$(rbd map bar)
          # trigger a 512-byte copyup -- 512-byte page array data item
          dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5
      
      The problem exists only in ceph_msg_data_pages_cursor_init(),
      ceph_msg_data_pages_advance() does the right thing.  The size_t cast is
      unnecessary.
      
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      Reviewed-by: default avatarAlex Elder <elder@linaro.org>
      5f740d7e
  22. 08 Jul, 2014 2 commits
  23. 16 May, 2014 1 commit
  24. 11 Apr, 2014 1 commit
    • David S. Miller's avatar
      net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller authored
      
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->s_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But at the moment we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access is potentially
      to freed up memory.
      
      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.
      
      And finally, no actual implementation of this callback actually uses
      the length argument.  And since nobody actually cared about it's
      value, lots of call sites pass arbitrary values in such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      676d2369
  25. 05 Apr, 2014 1 commit
  26. 07 Feb, 2014 1 commit
  27. 26 Jan, 2014 1 commit
    • Ilya Dryomov's avatar
      libceph: add ceph_kv{malloc,free}() and switch to them · eeb0bed5
      Ilya Dryomov authored
      
      Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
      two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
      
      ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
      vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
      behaviour:
      
      - for buffers (ceph_buffer_new()), from trying to kmalloc() everything
        and using vmalloc() just as a fallback
      
      - for messages (ceph_msg_new()), from going to vmalloc() for anything
        bigger than a page
      
      - for messages (ceph_msg_new()), from disallowing vmalloc() to use high
        memory
      Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@inktank.com>
      eeb0bed5