RAS/CEC: Reduce offline page threshold for Intel systems
author Tony Luck <tony.luck@intel.com>
Tue, 2 Aug 2022 16:18:47 +0000 (09:18 -0700)
committer Borislav Petkov <bp@suse.de>
Mon, 22 Aug 2022 17:30:02 +0000 (19:30 +0200)
A large-scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

Set the threshold to "2" on Intel systems. AMD guidance is that this is
not necessary for their systems.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220607212015.175591-1-tony.luck@intel.com
Link: https://lore.kernel.org/r/YulOZ/Eso0bwUcC4@agluck-desk3.sc.intel.com
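
The effect of the lower threshold can be illustrated with a small user-space sketch. It is not the kernel code. The vendor enum, the helper names, and the default value of 1023 are assumptions made for illustration, as is the use of a ">=" comparison between a page's corrected-error count and the threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for vendor IDs; not the kernel's X86_VENDOR_* values. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Assumed default: a large per-page count that is rarely reached in practice. */
#define DEFAULT_ACTION_THRESHOLD 1023u

/* Mirrors the decision in the patch: Intel gets 2, other vendors keep the default. */
static uint64_t pick_action_threshold(enum cpu_vendor vendor)
{
	if (vendor == VENDOR_INTEL)
		return 2;
	return DEFAULT_ACTION_THRESHOLD;
}

/* Assumed policy: offline a page once its corrected-error count reaches the threshold. */
static bool should_offline_page(uint64_t ce_count, uint64_t threshold)
{
	return ce_count >= threshold;
}
```

Under these assumptions, a page on an Intel system would be taken offline on its second corrected error, while the same page on an AMD system would stay online until it reached the much larger default.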
drivers/ras/cec.c

index 42f2fc0bc8a98f993fdd61bfcb101a4b52ce0430..321af498ee119d62610b308901932848124b1230 100644
@@ -556,6 +556,14 @@ static int __init cec_init(void)
        if (ce_arr.disabled)
                return -ENODEV;
 
+       /*
+        * Intel systems may avoid uncorrectable errors
+        * if pages with corrected errors are aggressively
+        * taken offline.
+        */
+       if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+               action_threshold = 2;
+
        ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
        if (!ce_arr.array) {
                pr_err("Error allocating CE array page!\n");