  3. 09 Dec, 2021 1 commit
      x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node · 50468e43
      Jarkko Sakkinen authored
      == Problem ==
      
      The amount of SGX memory on a system is determined by the BIOS, and
      it varies wildly between systems.  It can be as small as a few
      dozen MB and as large as many GB on servers.  Just as applications
      need to know how much regular RAM is available, enclave builders
      need to know how much SGX memory an enclave can consume.
      
      == Solution ==
      
      Introduce a new sysfs file:
      
      	/sys/devices/system/node/nodeX/x86/sgx_total_bytes
      
      to enumerate the amount of SGX memory available in each NUMA node.
      This serves the same function for SGX as /proc/meminfo or
      /sys/devices/system/node/nodeX/meminfo does for normal RAM.
      
      'sgx_total_bytes' is needed today to help drive the SGX selftests.
      SGX-specific swap code is exercised by creating overcommitted enclaves
      which are larger than the physical SGX memory on the system.  They
      currently use a CPUID-based approach which can diverge from the actual
      amount of SGX memory available.  'sgx_total_bytes' ensures that the
      selftests can work efficiently and do not attempt stupid things like
      creating a 100,000 MB enclave on a system with 128 MB of SGX memory.
      
      == Implementation Details ==
      
      Introduce a CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an
      arch-specific attribute group, and add an attribute for the amount of
      SGX memory in bytes to each NUMA node.
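
      The plumbing has roughly the following shape.  This is a sketch
      reconstructed from the description above, not the verbatim diff;
      in particular the sgx_numa_nodes[].size field name is an
      assumption.

```c
/* Sketch: per-node sysfs attribute (array/field names are assumptions) */
static ssize_t sgx_total_bytes_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	/* dev->id is the NUMA node id the attribute hangs off of */
	return sysfs_emit(buf, "%lu\n", sgx_numa_nodes[dev->id].size);
}
static DEVICE_ATTR_RO(sgx_total_bytes);

static struct attribute *arch_node_dev_attrs[] = {
	&dev_attr_sgx_total_bytes.attr,
	NULL,
};

/* Picked up by the core node driver when CONFIG_HAVE_ARCH_NODE_DEV_GROUP=y */
const struct attribute_group arch_node_dev_group = {
	.name	= "x86",	/* creates the nodeX/x86/ subdirectory */
	.attrs	= arch_node_dev_attrs,
};
```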
      
      == ABI Design Discussion ==
      
      As opposed to the per-node ABI, a single, global ABI was considered.
      However, this would prevent enclaves from being able to size
      themselves so that they fit on a single NUMA node.  Essentially, a
      single value would rule out NUMA optimizations for enclaves.
      
      Create a new "x86/" directory inside each "nodeX/" sysfs directory.
      'sgx_total_bytes' is expected to be the first of at least a few
      sgx-specific files to be placed in the new directory.  Just scanning
      /proc/meminfo, these are the no-brainers that we have for RAM, but we
      need for SGX:
      
      	MemTotal:       xxxx kB // sgx_total_bytes (implemented here)
      	MemFree:        yyyy kB // sgx_free_bytes
      	SwapTotal:      zzzz kB // sgx_swapped_bytes
      
      So, at *least* three.  I think we will eventually end up needing
      something more along the lines of a dozen.  A new directory (as
      opposed to the nodeX/ "root" directory) avoids cluttering the
      root with several "sgx_*" files.
      
      Place the new file in a new "nodeX/x86/" directory because SGX is
      highly x86-specific.  It is very unlikely that any other architecture
      (or even non-Intel x86 vendor) will ever implement SGX.  Using "sgx/"
      as opposed to "x86/" was also considered.  But, there is a real chance
      this can get used for other arch-specific purposes.
      
      [ dhansen: rewrite changelog ]
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211116162116.93081-2-jarkko@kernel.org
  6. 16 Nov, 2021 1 commit
      x86/sgx: Fix free page accounting · ac5d272a
      Reinette Chatre authored
      The SGX driver maintains a single global free page counter,
      sgx_nr_free_pages, that reflects the number of free pages available
      across all NUMA nodes.  Correspondingly, a list of free pages is
      associated with each NUMA node, and sgx_nr_free_pages is updated
      every time a page is added to or removed from any of the free page
      lists.  The main user of sgx_nr_free_pages is the reclaimer, which
      runs when the counter drops below a watermark to ensure that some
      free pages are always available to, for example, support efficient
      page faults.
      
      With sgx_nr_free_pages accessed and modified from several places,
      it is essential that these accesses are done safely, but this is
      not the case.  sgx_nr_free_pages is read without any protection
      and updated under inconsistent protection: whichever spin lock is
      associated with the individual NUMA node being touched.
      For example:
      
            CPU_A                                 CPU_B
            -----                                 -----
       spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
       ...                                   ...
       sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
      
       spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
      
      Since sgx_nr_free_pages may be protected by different spin locks
      while being modified from different CPUs, the following scenario
      is possible:
      
            CPU_A                                CPU_B
            -----                                -----
      {sgx_nr_free_pages = 100}
       spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
       sgx_nr_free_pages--;                  sgx_nr_free_pages--;
       /* LOAD sgx_nr_free_pages = 100 */    /* LOAD sgx_nr_free_pages = 100 */
       /* sgx_nr_free_pages--          */    /* sgx_nr_free_pages--          */
       /* STORE sgx_nr_free_pages = 99 */    /* STORE sgx_nr_free_pages = 99 */
       spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
      
      In the above scenario, sgx_nr_free_pages is decremented from two
      CPUs, but instead of ending up two less than it started, it is only
      decremented by one while the number of free pages was actually
      reduced by two.  The consequence of sgx_nr_free_pages not being
      protected is that its value may not accurately reflect the actual
      number of free pages on the system, impacting the availability of
      free pages in many flows.
      
      The problematic scenario is when the reclaimer does not run because
      it believes there are sufficient free pages, while every attempt to
      allocate a page fails because no free pages are available.  In the
      SGX driver the reclaimer's watermark is only 32 pages, so after the
      above scenario occurs 32 times a user space hang is possible:
      page faults repeat indefinitely because no free pages are ever
      made available.
      
      The following flow was encountered:
      asm_exc_page_fault
       ...
         sgx_vma_fault()
           sgx_encl_load_page()
             sgx_encl_eldu() // Encrypted page needs to be loaded from backing
                             // storage into newly allocated SGX memory page
               sgx_alloc_epc_page() // Allocate a page of SGX memory
                 __sgx_alloc_epc_page() // Fails, no free SGX memory
                 ...
                 if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
                   wake_up(&ksgxd_waitq);
                 return -EBUSY; // Return -EBUSY giving reclaimer time to run
             return -EBUSY;
           return -EBUSY;
         return VM_FAULT_NOPAGE;
      
      The reclaimer is triggered in above flow with the following code:
      
      static bool sgx_should_reclaim(unsigned long watermark)
      {
              return sgx_nr_free_pages < watermark &&
                     !list_empty(&sgx_active_page_list);
      }
      
      In the problematic scenario there were no free pages available, yet
      the value of sgx_nr_free_pages was above the watermark.  The
      allocation of SGX memory thus always failed for lack of free pages,
      while no free pages were made available because the reclaimer was
      never started due to sgx_nr_free_pages' incorrect value.  The
      consequence was that user space kept encountering VM_FAULT_NOPAGE,
      which caused the same address to be accessed repeatedly with the
      same result.
      
      Change the global free page counter to an atomic type that
      ensures simultaneous updates are done safely. While doing so, move
      the updating of the variable outside of the spin lock critical
      section to which it does not belong.
      
      Cc: stable@vger.kernel.org
      Fixes: 901ddbb9 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
      Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com