    x86/sgx: Fix free page accounting · ac5d272a
    Reinette Chatre authored
    The SGX driver maintains a single global free page counter,
    sgx_nr_free_pages, that reflects the number of free pages available
    across all NUMA nodes. Each NUMA node has its own list of free
    pages, and sgx_nr_free_pages is updated every time a page is added
    to or removed from any of these lists. The main user of
    sgx_nr_free_pages is the reclaimer, which runs when the counter
    drops below a watermark so that some free pages are always
    available to, for example, support efficient page faults.
    
    Because sgx_nr_free_pages is accessed and modified from several
    places, these accesses must be done safely, but this is not the
    case. sgx_nr_free_pages is read without any protection and is
    updated under whichever per-node spin lock the caller happens to
    hold, so no single lock serializes the updates.
    For example:
    
          CPU_A                                 CPU_B
          -----                                 -----
     spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
     ...                                   ...
     sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
    
     spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
    
    Since sgx_nr_free_pages may be protected by different spin locks
    while being modified from different CPUs, the following scenario
    is possible:
    
          CPU_A                                CPU_B
          -----                                -----
    {sgx_nr_free_pages = 100}
     spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
     sgx_nr_free_pages--;                  sgx_nr_free_pages--;
     /* LOAD sgx_nr_free_pages = 100 */    /* LOAD sgx_nr_free_pages = 100 */
     /* sgx_nr_free_pages--          */    /* sgx_nr_free_pages--          */
     /* STORE sgx_nr_free_pages = 99 */    /* STORE sgx_nr_free_pages = 99 */
     spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
    
    In the above scenario, sgx_nr_free_pages is decremented from two CPUs,
    but instead of ending with a value that is two less than it started
    with, it is only decremented by one while the number of free pages
    was actually reduced by two. The consequence of sgx_nr_free_pages not
    being protected is that its value may not accurately reflect the
    actual number of free pages on the system, which affects every flow
    that depends on free pages being available.
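
    The interleaving above can be replayed deterministically in a single
    thread. This is only an illustrative sketch, not driver code: the
    function name replay_lost_update() and the local variables are
    hypothetical, and each "CPU" is modelled as a private copy of the
    counter.

```c
#include <assert.h>

/*
 * Hypothetical single-threaded replay of the lost update. "counter"
 * stands in for sgx_nr_free_pages; cpu_a and cpu_b model the private
 * registers of the two CPUs.
 */
long replay_lost_update(long start)
{
	long counter = start;

	long cpu_a = counter;	/* CPU_A: LOAD counter */
	long cpu_b = counter;	/* CPU_B: LOAD the same, not yet updated, value */

	cpu_a--;		/* CPU_A: decrement its private copy */
	cpu_b--;		/* CPU_B: decrement its private copy */

	counter = cpu_a;	/* CPU_A: STORE start - 1 */
	counter = cpu_b;	/* CPU_B: STORE start - 1 again, overwriting CPU_A */

	/* Two pages were taken, but the counter only dropped by one. */
	assert(counter == start - 1);
	return counter;
}
```

    Calling replay_lost_update(100) returns 99 even though two pages
    were removed from the free lists.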
    
    The problematic scenario is when the reclaimer does not run because it
    believes there are sufficient free pages, while every attempt to
    allocate a page fails because no free pages are available. In the SGX
    driver the reclaimer's watermark is only 32 pages, so after
    encountering the above example scenario 32 times a user space hang is
    possible: once no free pages remain, every page fault fails to
    allocate a page and is retried, yet no pages are ever freed.
    
    The following flow was encountered:
    asm_exc_page_fault
     ...
       sgx_vma_fault()
         sgx_encl_load_page()
           sgx_encl_eldu() // Encrypted page needs to be loaded from backing
                           // storage into newly allocated SGX memory page
             sgx_alloc_epc_page() // Allocate a page of SGX memory
               __sgx_alloc_epc_page() // Fails, no free SGX memory
               ...
               if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
                 wake_up(&ksgxd_waitq);
               return -EBUSY; // Return -EBUSY giving reclaimer time to run
           return -EBUSY;
         return -EBUSY;
       return VM_FAULT_NOPAGE;
    
    The reclaimer is triggered in the above flow with the following code:
    
    static bool sgx_should_reclaim(unsigned long watermark)
    {
            return sgx_nr_free_pages < watermark &&
                   !list_empty(&sgx_active_page_list);
    }
    
    In the problematic scenario there were no free pages available, yet
    the value of sgx_nr_free_pages was above the watermark. Every
    allocation of SGX memory thus failed for lack of free pages, and no
    pages were freed because the incorrect value of sgx_nr_free_pages
    kept the reclaimer from ever being started. The consequence was that
    user space kept encountering VM_FAULT_NOPAGE, which caused the same
    address to be accessed repeatedly with the same result.
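
    The watermark arithmetic behind this hang can be checked with a
    small user-space model. It is a sketch under stated assumptions:
    LOW_PAGES mirrors the 32-page watermark mentioned above, while
    stale_counter() and the two-argument should_reclaim() are
    hypothetical helpers that exist only in this example.

```c
#include <assert.h>
#include <stdbool.h>

#define LOW_PAGES 32	/* models the 32-page reclaimer watermark */

/*
 * Hypothetical helper: the value the counter shows when "actual_free"
 * pages really remain and "lost_updates" decrements were lost.
 */
long stale_counter(long actual_free, long lost_updates)
{
	assert(actual_free >= 0 && lost_updates >= 0);
	return actual_free + lost_updates;
}

/* Same shape as the check above, with the global state passed in. */
bool should_reclaim(long counter, bool have_active_pages)
{
	return counter < LOW_PAGES && have_active_pages;
}
```

    With no free pages left and 32 lost decrements, the counter reads 32,
    so should_reclaim(stale_counter(0, 32), true) is false and the
    reclaimer is never woken. With only 31 lost decrements the check
    still succeeds.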
    
    Change the global free page counter to an atomic type that
    ensures simultaneous updates are done safely. While doing so, move
    the updating of the variable outside of the spin lock critical
    section to which it does not belong.
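
    The shape of this change can be sketched in user space with C11
    atomics and pthreads. This is not the kernel patch: struct node,
    alloc_all(), run_demo() and the page counts are hypothetical
    stand-ins, and atomic_long here is the C11 type rather than the
    kernel's atomic type. Each thread drains its own node's list under
    that node's lock and updates the shared counter atomically, outside
    the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NR_NODES	2
#define PAGES_PER_NODE	100000L

/* Hypothetical stand-in for a per-NUMA-node free list. */
struct node {
	pthread_mutex_t lock;	/* protects nr_on_list only */
	long nr_on_list;
};

static struct node nodes[NR_NODES];
static atomic_long nr_free_pages;	/* global counter shared by all nodes */

/* Take every page from one node, one page per critical section. */
static void *alloc_all(void *arg)
{
	struct node *n = arg;

	for (;;) {
		pthread_mutex_lock(&n->lock);
		if (n->nr_on_list == 0) {
			pthread_mutex_unlock(&n->lock);
			break;
		}
		n->nr_on_list--;
		pthread_mutex_unlock(&n->lock);

		/* Atomic update, not covered by any per-node lock. */
		atomic_fetch_sub(&nr_free_pages, 1);
	}
	return NULL;
}

/* Returns the final counter value; 0 means no decrement was lost. */
long run_demo(void)
{
	pthread_t threads[NR_NODES];
	int i, rc;

	atomic_store(&nr_free_pages, NR_NODES * PAGES_PER_NODE);
	for (i = 0; i < NR_NODES; i++) {
		pthread_mutex_init(&nodes[i].lock, NULL);
		nodes[i].nr_on_list = PAGES_PER_NODE;
	}
	for (i = 0; i < NR_NODES; i++) {
		rc = pthread_create(&threads[i], NULL, alloc_all, &nodes[i]);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < NR_NODES; i++) {
		pthread_join(threads[i], NULL);
		pthread_mutex_destroy(&nodes[i].lock);
	}
	return atomic_load(&nr_free_pages);
}
```

    Because every decrement is an atomic read-modify-write, run_demo()
    returns 0 regardless of how the two threads interleave, even though
    they never share a lock.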
    
    Cc: stable@vger.kernel.org
    Fixes: 901ddbb9 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
    Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
    Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com