    init/calibrate.c: fix for critical bogoMIPS intermittent calculation failure · d2b46313
    Andrew Worsley authored
    This fixes the TSC (Time Stamp Counter) based bogoMIPS calculation used
    on secondary CPUs, which has two faults:
    
    1: The wrap of the lower 32 bits of the TSC is not handled on a 32-bit
       kernel (perhaps the TSC is not reset by a warm reset?).
    
    2: The TSC and jiffies do not increment together properly.  Either
       jiffies increment too quickly, or the TSC is not incremented during
       an SMI while the real-time clock is, so jiffies are still
       incremented.
    
    Case 1 can produce a value a factor of 16 too large, which makes udelay()
    values too small and can cause mysterious driver errors.  Case 2 appears
    to give smaller errors of 10-15% after averaging, but these are enough to
    cause occasional failures on my own board.
    
    I have tested this code on my own branch and attach a patch suitable for
    the current kernel code.  See below for examples of the failures and how
    the fix now handles these situations.
    
    I reported this issue earlier here:
         Intermittent problem with BogoMIPs calculation on Intel AP CPUs -
    http://marc.info/?l=linux-kernel&m=129947246316875&w=4
    
    I suspect this issue has been seen by others, but as it is intermittent
    and bogoMIPS values for secondary CPUs are no longer printed, it may
    have been difficult to identify as the cause.  Perhaps these unresolved
    issues, although quite old, are relevant, as this fault has possibly
    been around for a while.  In particular, Case 1 may only be relevant to
    32-bit kernels on newer hardware (most people run 64-bit kernels?).
    Since the earlier fix in this area, Case 2 is less dramatic, and it is
    also intermittent.
    
       Re: bogomips discrepancy on Intel Core2 Quad CPU -
    http://marc.info/?l=linux-kernel&m=118929277524298&w=4
       slow system and bogus bogomips  -
    http://marc.info/?l=linux-kernel&m=116791286716107&w=4
       Re: Re: [RFC-PATCH] clocksource: update lpj if clocksource has -
    http://marc.info/?l=linux-kernel&m=128952775819467&w=4
    
    This issue is masked somewhat by commit feae3203 ("timers, init:
    Limit the number of per cpu calibration bootup messages"), which prints
    only the first bogoMIPS value, making differing values on other CPUs
    much harder to notice.  Perhaps it should be changed to suppress them
    only when they are similar?
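
    Here is one possible shape for that suggestion, as a hypothetical
    sketch only. The helper name and the 1/16 threshold are assumptions for
    illustration, not existing kernel code. The helper lets a secondary
    CPU's value through only when it differs noticeably from the boot CPU's
    loops_per_jiffy.

```c
#include <stdbool.h>

/* Hypothetical helper: true if cpu_lpj differs from boot_lpj by more
 * than 1/16 (about 6%), i.e. the per-CPU message is worth printing. */
bool lpj_worth_reporting(unsigned long boot_lpj, unsigned long cpu_lpj)
{
	unsigned long diff = cpu_lpj > boot_lpj ? cpu_lpj - boot_lpj
						: boot_lpj - cpu_lpj;

	return diff > (boot_lpj >> 4);
}
```

    With the Case 1 values (31663540 on the boot CPU, 31666428 on the
    secondary), the helper would stay quiet. A value as far off as the
    864348 sample in Case 2 would be printed.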
    
    Here are some outputs showing faults occurring and the new code handling
    them properly.  See my earlier message for examples of the original
    failure.
    
        Case 1:   A Time Stamp Counter wrap:
    ...
    Calibrating delay loop (skipped), value calculated using timer frequency.. 6332.70 BogoMIPS (lpj=31663540)
    ....
    calibrate_delay_direct() timer_rate_max=31666493 timer_rate_min=31666151 pre_start=4170369255 pre_end=4202035539
    calibrate_delay_direct() timer_rate_max=2425955274 timer_rate_min=2425954941 pre_start=4265368533 pre_end=2396356387
    calibrate_delay_direct() ignoring timer_rate as we had a TSC wrap around start=4265368581 >=post_end=2396356511
    calibrate_delay_direct() timer_rate_max=31666274 timer_rate_min=31665942 pre_start=2440373374 pre_end=2472039515
    calibrate_delay_direct() timer_rate_max=31666492 timer_rate_min=31666160 pre_start=2535372139 pre_end=2567038422
    calibrate_delay_direct() timer_rate_max=31666455 timer_rate_min=31666207 pre_start=2630371084 pre_end=2662037415
    Calibrating delay using timer specific routine.. 6333.28 BogoMIPS (lpj=31666428)
    Total of 2 processors activated (12665.99 BogoMIPS).
    ....
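
    The BogoMIPS figures in these logs follow directly from lpj. This
    sketch assumes HZ=100, which is an inference: it is consistent with
    every lpj/BogoMIPS pair shown here. Under that assumption, the printed
    value appears to be lpj / (500000 / HZ), followed by two truncated
    decimals from (lpj / (5000 / HZ)) % 100. The total seems to be computed
    from the summed lpj. That would explain why it reads 12665.99 rather
    than 6332.70 + 6333.28 = 12665.98. A small sketch of that arithmetic:

```c
#include <stdio.h>

#define HZ 100 /* assumption: inferred from the logged lpj/BogoMIPS pairs */

/* Whole part of the BogoMIPS figure for a given loops_per_jiffy. */
unsigned long bogomips_whole(unsigned long lpj)
{
	return lpj / (500000 / HZ);
}

/* Two decimal digits, truncated rather than rounded. */
unsigned long bogomips_frac(unsigned long lpj)
{
	return (lpj / (5000 / HZ)) % 100;
}

/* Prints e.g. "6333.28 BogoMIPS (lpj=31666428)". */
void print_bogomips(unsigned long lpj)
{
	printf("%lu.%02lu BogoMIPS (lpj=%lu)\n",
	       bogomips_whole(lpj), bogomips_frac(lpj), lpj);
}
```

    For example, print_bogomips(31663540UL + 31666428UL) prints
    "12665.99 BogoMIPS (lpj=63329968)", which matches the Case 1 total.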
    
        Case 2:  Something (presumably an SMM interrupt?) causing a very
    small increase in the TSC over the DELAY_CALIBRATION_TICKS increase in
    jiffies:
    ...
    Calibrating delay loop (skipped), value calculated using timer frequency.. 6333.25 BogoMIPS (lpj=31666270)
    ...
    calibrate_delay_direct() timer_rate_max=31666483 timer_rate_min=31666074 pre_start=4199536526 pre_end=4231202809
    calibrate_delay_direct() timer_rate_max=864348 timer_rate_min=864016 pre_start=2405343672 pre_end=2406207897
    calibrate_delay_direct() timer_rate_max=31666483 timer_rate_min=31666179 pre_start=2469540464 pre_end=2501206823
    calibrate_delay_direct() timer_rate_max=31666511 timer_rate_min=31666122 pre_start=2564539400 pre_end=2596205712
    calibrate_delay_direct() timer_rate_max=31666084 timer_rate_min=31665685 pre_start=2659538782 pre_end=2691204657
    calibrate_delay_direct() dropping min bogoMips estimate 1 = 864348
    Calibrating delay using timer specific routine.. 6333.27 BogoMIPS (lpj=31666390)
    Total of 2 processors activated (12666.53 BogoMIPS).
    ...
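
    The Case 2 recovery can be illustrated with a simplified sketch of
    outlier rejection. It is a reconstruction from the logged numbers, not
    the patch itself. The function name, the five-sample count and the 1/8
    tolerance are assumptions. The sketch averages the surviving
    per-attempt timer_rate_max values. If max - min is within 1/8 of that
    average, it accepts the average. Otherwise it zeroes whichever extreme
    lies further from the average and tries again. Attempts rejected
    earlier, such as the TSC-wrap sample in Case 1, are stored as 0.

```c
#include <stddef.h>

#define NSAMPLES 5 /* assumption: five attempts, matching both logs */

/*
 * Simplified sketch of outlier rejection.  samples[] holds one
 * timer_rate_max per attempt, with 0 marking attempts already rejected.
 * Returns the averaged estimate, or 0 if fewer than two samples survive.
 * Rejected entries are zeroed in place.
 */
unsigned long filtered_estimate(unsigned long samples[NSAMPLES])
{
	for (;;) {
		unsigned long long sum = 0;
		unsigned long count = 0, avg;
		size_t i, lo = NSAMPLES, hi = NSAMPLES;

		/* Sum the survivors and find the smallest and largest. */
		for (i = 0; i < NSAMPLES; i++) {
			if (samples[i] == 0)
				continue;
			sum += samples[i];
			count++;
			if (lo == NSAMPLES || samples[i] < samples[lo])
				lo = i;
			if (hi == NSAMPLES || samples[i] > samples[hi])
				hi = i;
		}
		if (count < 2)
			return 0;

		avg = (unsigned long)(sum / count);

		/* Spread within 1/8 of the average: accept the estimate. */
		if (samples[hi] - samples[lo] < (avg >> 3))
			return avg;

		/* Otherwise drop the extreme furthest from the average. */
		if (samples[hi] - avg < avg - samples[lo])
			samples[lo] = 0;
		else
			samples[hi] = 0;
	}
}
```

    Given the five Case 2 rates, the sketch drops index 1 (864348) and
    returns 31666390. Given the Case 1 rates, with the wrapped sample
    stored as 0, it returns 31666428. Both results match the lpj values in
    the logs.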
    
    After 70 boots, I saw 2 variations of <1% slip through.
    
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix straggly printk mess]
    Signed-off-by: Andrew Worsley <amworsley@gmail.com>
    Reviewed-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>