Bjorn Helgaas [Tue, 24 May 2022 21:42:23 +0000 (16:42 -0500)]
Merge branch 'pci/virtualization'
- Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store() (Yicong
Yang, Jay Zhou)
* pci/virtualization:
PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
Bjorn Helgaas [Tue, 24 May 2022 21:42:22 +0000 (16:42 -0500)]
Merge branch 'pci/resource'
- Clip only host bridge windows for E820 regions and log what is clipped
(Bjorn Helgaas)
- Add kernel cmdline options to use/ignore E820 reserved regions (Hans de
Goede)
- Disable E820 reserved region clipping for IdeaPads, Yoga, Yoga Slip, Acer
Spin 5, Clevo Barebone systems where clipping leaves no usable address
space for touchpads, Thunderbolt devices, etc (Hans de Goede)
- Disable E820 reserved region clipping by default starting in 2023 (Hans
de Goede)
* pci/resource:
x86/PCI: Disable E820 reserved region clipping starting in 2023
x86/PCI: Disable E820 reserved region clipping via quirks
x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
x86/PCI: Clip only host bridge windows for E820 regions
x86: Log resource clipping for E820 regions
x86/PCI: Eliminate remove_e820_regions() common subexpressions
Bjorn Helgaas [Tue, 24 May 2022 21:42:22 +0000 (16:42 -0500)]
Merge branch 'pci/pm'
- Define pci_restore_standard_config() only for CONFIG_PM_SLEEP, since it's
not used otherwise (Krzysztof Kozlowski)
- Power up all devices during runtime resume (Rafael J. Wysocki)
- Resume subordinate bus in bus type callbacks (Rafael J. Wysocki)
- Drop pci_dev runtime_d3cold flag since no uses remain (Rafael J. Wysocki)
- Move power-up to D0 code to pci_power_up() and rename
pci_raw_set_power_state() to pci_set_low_power_state() (Rafael J.
Wysocki)
- Set current_state to D3cold if the device is not accessible (Rafael J.
Wysocki)
- Do not call pci_update_current_state() from pci_power_up() (Rafael J.
Wysocki)
- Write 0 to PMCSR in pci_power_up() in all cases (Rafael J. Wysocki)
- Split part of pci_power_up() off to pci_set_full_power_state() (Rafael J.
Wysocki)
- Do not restore BARs if device is not in D0 (Rafael J. Wysocki)
* pci/pm:
PCI/PM: Replace pci_set_power_state() in pci_pm_thaw_noirq()
PCI/PM: Rearrange pci_set_power_state()
PCI/PM: Clean up pci_set_low_power_state()
PCI/PM: Do not restore BARs if device is not in D0
PCI/PM: Split pci_power_up()
PCI/PM: Write 0 to PMCSR in pci_power_up() in all cases
PCI/PM: Do not call pci_update_current_state() from pci_power_up()
PCI/PM: Unfold pci_platform_power_transition() in pci_power_up()
PCI/PM: Set current_state to D3cold if the device is not accessible
PCI/PM: Relocate pci_set_low_power_state()
PCI/PM: Split pci_raw_set_power_state()
PCI/PM: Rearrange pci_update_current_state()
PCI/PM: Drop the runtime_d3cold device flag
PCI/PM: Resume subordinate bus in bus type callbacks
PCI/PM: Power up all devices during runtime resume
PCI/PM: Define pci_restore_standard_config() only for CONFIG_PM_SLEEP
Bjorn Helgaas [Tue, 24 May 2022 21:42:22 +0000 (16:42 -0500)]
Merge branch 'pci/p2pdma'
- Update pci_p2pdma_whitelist[] checking so we accept Skylake-E Root Ports
even if they're not at devfn 00.0 (Shlomo Pongratz)
* pci/p2pdma:
PCI/P2PDMA: Whitelist Intel Skylake-E Root Ports at any devfn
Bjorn Helgaas [Tue, 24 May 2022 21:42:21 +0000 (16:42 -0500)]
Merge branch 'pci/misc'
- Change pci_set_dma_mask() documentation references to dma_set_mask()
(Alex Williamson)
* pci/misc:
PCI/doc: Update obsolete pci_set_dma_mask() references
Bjorn Helgaas [Tue, 24 May 2022 21:42:21 +0000 (16:42 -0500)]
Merge branch 'pci/hotplug'
- Allow D3 only if Root Port can signal and wake from D3 so we don't miss
hotplug events on AMD Yellow Carp (Mario Limonciello)
- Clean up hotplug include files to enable future powerpc cleanup
(Christophe Leroy)
* pci/hotplug:
PCI: hotplug: Clean up include files
PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3
Bjorn Helgaas [Tue, 24 May 2022 21:42:21 +0000 (16:42 -0500)]
Merge branch 'pci/error'
- Clear AER "multiple errors" bits to avoid race that left them set forever
(Kuppuswamy Sathyanarayanan)
* pci/error:
PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
Bjorn Helgaas [Tue, 24 May 2022 21:42:20 +0000 (16:42 -0500)]
Merge branch 'pci/aspm'
- Quirk Intel DG2 ASPM L1 acceptable latency to be unlimited. The device
advertises that it can only tolerate 1us, which means L1 is rarely if
ever used (Mika Westerberg)
* pci/aspm:
PCI/ASPM: Make Intel DG2 L1 acceptable latency unlimited
Hans de Goede [Thu, 19 May 2022 15:21:50 +0000 (17:21 +0200)]
x86/PCI: Disable E820 reserved region clipping starting in 2023
Some firmware includes unusable space (host bridge registers, hidden PCI
device BARs, etc) in PCI host bridge _CRS. As far as we know, there's
nothing in the ACPI, UEFI, or PCI Firmware spec that requires the OS to
remove E820 reserved regions from _CRS, so this seems like a firmware
defect.
As a workaround,
4dc2287c1805 ("x86: avoid E820 regions when allocating
address space") has clipped out the unusable space in the past. This is
required for machines like the following:
- Dell Precision T3500 (the original motivator for
4dc2287c1805); see
https://bugzilla.kernel.org/show_bug.cgi?id=16228
- Asus C523NA (Coral) Chromebook; see
https://lore.kernel.org/all/
4e9fca2f-0af1-3684-6c97-
4c35befd5019@redhat.com/
- Lenovo ThinkPad X1 Gen 2; see:
https://bugzilla.redhat.com/show_bug.cgi?id=
2029207
But other firmware supplies E820 reserved regions that cover entire _CRS
windows, and clipping throws away the entire window, leaving none for
hot-added or uninitialized devices. This clipping breaks a whole range of
Lenovo IdeaPads, Yogas, Yoga Slims, and notebooks, as well as Acer Spin 5
and Clevo X170KM-G Barebone machines.
E820 reserved entries that cover a memory-mapped PCI host bridge, including
its registers and memory/IO windows, are probably *not* a firmware defect.
Per ACPI v5.4, sec 15.2, the E820 memory map may include:
Address ranges defined for baseboard memory-mapped I/O devices, such as
APICs, are returned as reserved.
Disable the E820 clipping by default for all post-2022 machines. We
already have quirks to disable clipping for pre-2023 machines, and we'll
likely need quirks to *enable* clipping for post-2022 machines that
incorrectly include unusable space in _CRS, including Chromebooks and
Lenovo ThinkPads.
Here's the rationale for doing this. If we do nothing, and continue
clipping by default:
- Future systems like the Lenovo IdeaPads, Yogas, etc, Acer Spin, and
Clevo Barebones will require new quirks to disable clipping.
- The problem here is E820 entries that cover entire _CRS windows that
should not be clipped out.
- I think these E820 entries are legal per spec, and it would be hard to
get BIOS vendors to change them.
- We will discover new systems that need clipping disabled piecemeal as
they are released.
- Future systems like Lenovo X1 Carbon and the Chromebooks (probably
anything using coreboot) will just work, even though their _CRS is
incorrect, so we will not notice new ones that rely on the clipping.
- BIOS updates will not require new quirks unless they change the DMI
model string.
If we add the date check in this commit that disables clipping, e.g., "no
clipping when date >= 2023":
- Future systems like Lenovo *IIL*, Acer Spin, and Clevo Barebones will
just work without new quirks.
- Future systems like Lenovo X1 Carbon and the Chromebooks will require
new quirks to *enable* clipping.
- The problem here is that _CRS contains regions that are not usable by
PCI devices, and we rely on the E820 kludge to clip them out.
- I think this use of E820 is clearly a firmware bug, so we have a
fighting chance of getting it changed eventually.
- BIOS updates after the cutoff date *will* require quirks, but only for
systems like Lenovo X1 Carbon and Chromebooks that we already think
have broken firmware.
It seems to me like it's better to add quirks for firmware that we think is
broken than for firmware that seems unusual but correct.
[bhelgaas: comment and commit log]
Link: https://lore.kernel.org/linux-pci/20220518220754.GA7911@bhelgaas/
Link: https://lore.kernel.org/r/20220519152150.6135-4-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Benoit Grégoire <benoitg@coeus.ca>
Cc: Hui Wang <hui.wang@canonical.com>
Hans de Goede [Thu, 19 May 2022 15:21:49 +0000 (17:21 +0200)]
x86/PCI: Disable E820 reserved region clipping via quirks
To avoid unusable space that some firmware includes in PCI host bridge
_CRS, Linux currently excludes E820 reserved regions from _CRS windows; see
4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
However, some systems supply E820 reserved regions that cover the entire
memory window from _CRS, so clipping them out leaves no space for hot-added
or uninitialized PCI devices.
For example, from a Lenovo IdeaPad 3 15IIL 81WE:
BIOS-e820: [mem 0x4bc50000-0xcfffffff] reserved
pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
pci 0000:00:15.0: BAR 0: [mem 0x00000000-0x00000fff 64bit]
pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
Add quirks to disable the E820 clipping for machines known to do this.
A single DMI_PRODUCT_VERSION "IIL" quirk matches all the below:
Lenovo IdeaPad 3 14IIL05
Lenovo IdeaPad 3 15IIL05
Lenovo IdeaPad 3 17IIL05
Lenovo IdeaPad 5 14IIL05
Lenovo IdeaPad 5 15IIL05
Lenovo IdeaPad Slim 7 14IIL05
Lenovo IdeaPad Slim 7 15IIL05
Lenovo IdeaPad S145-15IIL
Lenovo IdeaPad S340-14IIL
Lenovo IdeaPad S340-15IIL
Lenovo IdeaPad C340-15IIL
Lenovo BS145-15IIL
Lenovo V14-IIL
Lenovo V15-IIL
Lenovo V17-IIL
Lenovo Yoga C940-14IIL
Lenovo Yoga S740-14IIL
Lenovo Yoga Slim 7 14IIL05
Lenovo Yoga Slim 7 15IIL05
in addition to the following that don't actually need it because they have
no E820 reserved regions that overlap _CRS windows:
Lenovo IdeaPad Flex 5 14IIL05
Lenovo IdeaPad Flex 5 15IIL05
Lenovo ThinkBook 14-IIL
Lenovo ThinkBook 15-IIL
Lenovo Yoga S940-14IIL
Other quirks match these:
Acer Spin 5 (SP513-54N)
Clevo X170KM-G Barebone
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206459
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214259
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
Link: https://bugs.launchpad.net/bugs/1878279
Link: https://bugs.launchpad.net/bugs/1880172
Link: https://bugs.launchpad.net/bugs/1884232
Link: https://bugs.launchpad.net/bugs/1921649
Link: https://bugs.launchpad.net/bugs/1931715
Link: https://bugs.launchpad.net/bugs/1932069
Link: https://lore.kernel.org/r/20220519152150.6135-3-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Benoit Grégoire <benoitg@coeus.ca>
Cc: Hui Wang <hui.wang@canonical.com>
Hans de Goede [Thu, 19 May 2022 15:21:48 +0000 (17:21 +0200)]
x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
Some firmware supplies PCI host bridge _CRS that includes address space
unusable by PCI devices, e.g., space occupied by host bridge registers or
used by hidden PCI devices.
To avoid this unusable space, Linux currently excludes E820 reserved
regions from _CRS windows; see
4dc2287c1805 ("x86: avoid E820 regions when
allocating address space").
However, this use of E820 reserved regions to clip things out of _CRS is
not supported by ACPI, UEFI, or PCI Firmware specs, and some systems have
E820 reserved regions that cover the entire memory window from _CRS.
4dc2287c1805 clips the entire window, leaving no space for hot-added or
uninitialized PCI devices.
For example, from a Lenovo IdeaPad 3 15IIL 81WE:
BIOS-e820: [mem 0x4bc50000-0xcfffffff] reserved
pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
pci 0000:00:15.0: BAR 0: [mem 0x00000000-0x00000fff 64bit]
pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
Future patches will add quirks to enable/disable E820 clipping
automatically.
Add a "pci=no_e820" kernel command line option to disable clipping with
E820 reserved regions. Also add a matching "pci=use_e820" option to enable
clipping with E820 reserved regions if that has been disabled by default by
further patches in this patch-set.
Both options taint the kernel because they are intended for debugging and
workaround purposes until a quirk can set them automatically.
[bhelgaas: commit log, add printk]
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
Link: https://lore.kernel.org/r/20220519152150.6135-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Benoit Grégoire <benoitg@coeus.ca>
Cc: Hui Wang <hui.wang@canonical.com>
Kuppuswamy Sathyanarayanan [Mon, 18 Apr 2022 15:02:37 +0000 (15:02 +0000)]
PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
When a Root Port or Root Complex Event Collector receives an error Message
e.g., ERR_COR, it sets PCI_ERR_ROOT_COR_RCV in the Root Error Status
register and logs the Requester ID in the Error Source Identification
register. If it receives a second ERR_COR Message before software clears
PCI_ERR_ROOT_COR_RCV, hardware sets PCI_ERR_ROOT_MULTI_COR_RCV and the
Requester ID is lost.
In the following scenario, PCI_ERR_ROOT_MULTI_COR_RCV was never cleared:
- hardware receives ERR_COR message
- hardware sets PCI_ERR_ROOT_COR_RCV
- aer_irq() entered
- aer_irq(): status = pci_read_config_dword(PCI_ERR_ROOT_STATUS)
- aer_irq(): now status == PCI_ERR_ROOT_COR_RCV
- hardware receives second ERR_COR message
- hardware sets PCI_ERR_ROOT_MULTI_COR_RCV
- aer_irq(): pci_write_config_dword(PCI_ERR_ROOT_STATUS, status)
- PCI_ERR_ROOT_COR_RCV is cleared; PCI_ERR_ROOT_MULTI_COR_RCV is set
- aer_irq() entered again
- aer_irq(): status = pci_read_config_dword(PCI_ERR_ROOT_STATUS)
- aer_irq(): now status == PCI_ERR_ROOT_MULTI_COR_RCV
- aer_irq() exits because PCI_ERR_ROOT_COR_RCV not set
- PCI_ERR_ROOT_MULTI_COR_RCV is still set
The same problem occurred with ERR_NONFATAL/ERR_FATAL Messages and
PCI_ERR_ROOT_UNCOR_RCV and PCI_ERR_ROOT_MULTI_UNCOR_RCV.
Fix the problem by queueing an AER event and clearing the Root Error Status
bits when any of these bits are set:
PCI_ERR_ROOT_COR_RCV
PCI_ERR_ROOT_UNCOR_RCV
PCI_ERR_ROOT_MULTI_COR_RCV
PCI_ERR_ROOT_MULTI_UNCOR_RCV
See the bugzilla link for details from Eric about how to reproduce this
problem.
[bhelgaas: commit log, move repro details to bugzilla]
Fixes: e167bfcaa4cd ("PCI: aerdrv: remove magical ROOT_ERR_STATUS_MASKS")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215992
Link: https://lore.kernel.org/r/20220418150237.1021519-1-sathyanarayanan.kuppuswamy@linux.intel.com
Reported-by: Eric Badger <ebadger@purestorage.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Yicong Yang [Mon, 4 Apr 2022 06:25:39 +0000 (14:25 +0800)]
PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
The sysfs sriov_numvfs_store() path acquires the device lock before the
config space access lock:
sriov_numvfs_store
device_lock # A (1) acquire device lock
sriov_configure
vfio_pci_sriov_configure # (for example)
vfio_pci_core_sriov_configure
pci_disable_sriov
sriov_disable
pci_cfg_access_lock
pci_wait_cfg # B (4) wait for dev->block_cfg_access == 0
Previously, pci_dev_lock() acquired the config space access lock before the
device lock:
pci_dev_lock
pci_cfg_access_lock
dev->block_cfg_access = 1 # B (2) set dev->block_cfg_access = 1
device_lock # A (3) wait for device lock
Any path that uses pci_dev_lock(), e.g., pci_reset_function(), may
deadlock with sriov_numvfs_store() if the operations occur in the sequence
(1) (2) (3) (4).
Avoid the deadlock by reversing the order in pci_dev_lock() so it acquires
the device lock before the config space access lock, the same as the
sriov_numvfs_store() path.
[bhelgaas: combined and adapted commit log from Jay Zhou's independent
subsequent posting:
https://lore.kernel.org/r/
20220404062539.1710-1-jianjay.zhou@huawei.com]
Link: https://lore.kernel.org/linux-pci/1583489997-17156-1-git-send-email-yangyicong@hisilicon.com/
Also-posted-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:18:09 +0000 (20:18 +0200)]
PCI/PM: Replace pci_set_power_state() in pci_pm_thaw_noirq()
Calling pci_set_power_state() to put the given device into D0 in
pci_pm_thaw_noirq() may cause it to restore the device's BARs, which is
redundant before calling pci_restore_state(), so replace it with a direct
pci_power_up() call followed by pci_update_current_state() if it returns a
nonzero value, in analogy with pci_pm_default_resume_early().
Avoid code duplication by introducing a wrapper function to contain the
repeating pattern and calling it in both places.
Link: https://lore.kernel.org/r/3639079.MHq7AAxBmi@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:16:50 +0000 (20:16 +0200)]
PCI/PM: Rearrange pci_set_power_state()
The part of pci_set_power_state() related to transitions into
low-power states is unnecessary convoluted, so clearly divide it
into the D3cold special case and the general case covering all of
the other states.
Also fix a potential issue with calling pci_bus_set_current_state()
to set the current state of all devices on the subordinate bus to
D3cold without checking if the power state of the parent bridge has
really changed to D3cold.
Link: https://lore.kernel.org/r/2139440.Mh6RI2rZIc@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:15:34 +0000 (20:15 +0200)]
PCI/PM: Clean up pci_set_low_power_state()
Make the following assorted non-essential changes in
pci_set_low_power_state():
1. Drop two redundant checks from it (the caller takes care of these
conditions).
2. Change the log level of a messages printed by it to "debug",
because it only indicates a programming mistake.
Link: https://lore.kernel.org/r/2539071.Lt9SDvczpP@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:14:24 +0000 (20:14 +0200)]
PCI/PM: Do not restore BARs if device is not in D0
Do not attempt to restore the device's BARs in
pci_set_full_power_state() if the actual current
power state of the device is not D0.
Link: https://lore.kernel.org/r/1849718.CQOukoFCf9@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:13:00 +0000 (20:13 +0200)]
PCI/PM: Split pci_power_up()
One of the two callers of pci_power_up() invokes
pci_update_current_state() and pci_restore_state() right after calling
it, in which case running the part of it happening after the mandatory
transition delays is redundant, so move that part out of it into a new
function called pci_set_full_power_state() that will be invoked from
pci_set_power_state() which is the other caller of pci_power_up().
Link: https://lore.kernel.org/r/1942150.usQuhbGJ8B@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:10:43 +0000 (20:10 +0200)]
PCI/PM: Write 0 to PMCSR in pci_power_up() in all cases
Make pci_power_up() write 0 to the device's PCI_PM_CTRL register in
order to put it into D0 regardless of the power state returned by
the previous read from that register which should not matter.
Link: https://lore.kernel.org/r/5748066.MhkbZ0Pkbq@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:09:12 +0000 (20:09 +0200)]
PCI/PM: Do not call pci_update_current_state() from pci_power_up()
Notice that calling pci_update_current_state() from pci_power_up() is
redundant and may be harmful in some cases.
First, if the device is in a low-power state before pci_power_up()
gets called for it and platform_pci_set_power_state() successfully
changes its power state to D0, pci_update_current_state() will update
current_state to reflect that and pci_power_up() will return success
right away without restoring the device's BARs or reconfiguring ASPM
which may be necessary. This is arguably incorrect and definitely
inconsistent with the case when platform_pci_set_power_state() returns
an error (for example, because the device is not power-manageable by
the platform firmware).
Second, current_state should not be overwritten until the decision
whether or not to restore the device's BARs is made, because that
decision generally depends on its value. Again, calling
pci_update_current_state() in pci_power_up() is not consistent with
this observation.
Next, pci_power_up() attempts to read from the device's PCI_PM_CTRL
register regardless of the current_state value unless it is PCI_D0,
including the case when pci_update_current_state() sets current_state
to PCI_D3cold to indicate that the device is not accessible. If the
register read is not successful, current_state will be set to
PCI_D3cold anyway, so that pci_update_current_state() action is
redundant.
Further, if pci_update_current_state() reads the device's PCI_PM_CTRL
register, pci_power_up() will repeat that read going forward and
it is not necessary to update current_state in the meantime.
Finally, if pm_cap is not set (in which case the PCI_PM_CTRL register
is not present), the power state of the device should be determined
with the help of the platform firmware or set to D0 if that's not
possible and pci_update_current_state() does not do that.
Accordingly, rearrange pci_power_up() so as to address the above
shortcomings.
Link: https://lore.kernel.org/r/3695055.kQq0lBPeGt@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:05:15 +0000 (20:05 +0200)]
PCI/PM: Unfold pci_platform_power_transition() in pci_power_up()
Some actions carried out by pci_platform_power_transition(() in
pci_power_up() are redundant, but before dealing with them, make
pci_power_up() call the pci_platform_power_transition() code directly
(and avoid a redundant check when pm_cap is unset while at it).
Link: https://lore.kernel.org/r/1922486.PYKUYFuaPT@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:04:07 +0000 (20:04 +0200)]
PCI/PM: Set current_state to D3cold if the device is not accessible
Make pci_power_up() and pci_set_low_power_state() change current_state
to PCI_D3cold when the device is not accessible along the lines of
pci_update_current_state().
Link: https://lore.kernel.org/r/10104376.nUPlyArG6x@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:02:52 +0000 (20:02 +0200)]
PCI/PM: Relocate pci_set_low_power_state()
Because pci_set_power_state() is the only caller of
pci_set_low_power_state(), put the latter next to the former.
No functional impact.
Link: https://lore.kernel.org/r/3202976.44csPzL39Z@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 5 May 2022 18:00:33 +0000 (20:00 +0200)]
PCI/PM: Split pci_raw_set_power_state()
The transitions from low-power states to D0 and the other way around
are unnecessarily tangled in pci_raw_set_power_state() which makes it
rather hard to follow.
Moreover, the only caller of pci_raw_set_power_state() passing PCI_D0
as its state argument is pci_power_up(), so the code carrying out
transitions into D0 can be put directly into that function.
Accordingly, move the code handling transitions from low-power states
into D0 directly into pci_power_up() and rename the remaining part
of pci_raw_set_power_state() to pci_set_low_power_state(), because
it only handles transitions into low-power state now.
While at it, fix up some white space, update some comments and modify
messages printed by pci_power_up() and pci_set_low_power_state() to
be less confusing (which is the only expected functional impact of
this change).
Link: https://lore.kernel.org/r/13038676.uLZWGnKmhe@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 14 Apr 2022 13:07:24 +0000 (15:07 +0200)]
PCI/PM: Rearrange pci_update_current_state()
Save one config space access in pci_update_current_state() by testing the
retrieved PCI_PM_CTRL register value against PCI_POSSIBLE_ERROR() instead
of invoking pci_device_is_present() separately.
While at it, drop a pair of unnecessary parens.
No expected functional impact.
Link: https://lore.kernel.org/r/1917095.PYKUYFuaPT@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Rafael J. Wysocki [Thu, 14 Apr 2022 13:04:27 +0000 (15:04 +0200)]
PCI/PM: Drop the runtime_d3cold device flag
The runtime_d3cold flag is not needed any more, so drop it.
Link: https://lore.kernel.org/r/8077784.T7Z3S40VBb@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Rafael J. Wysocki [Thu, 14 Apr 2022 13:04:13 +0000 (15:04 +0200)]
PCI/PM: Resume subordinate bus in bus type callbacks
Calling pci_resume_bus() on the secondary bus from pci_power_up() as it is
done now is questionable, because it depends on the mandatory bridge
power-up delays that are only covered by the PCI bus type PM callbacks.
For this reason, move the subordinate bus resume to those callbacks too and
use the observation that if a bridge is being powered-up during resume from
system-wide suspend, it may be still desirable to runtime-resume its
subordinate bus after completing the system-wide transition (in case the
resume of the devices on that bus is skipped during it).
Link: https://lore.kernel.org/r/3190097.aeNJFYEL58@kreacher
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Rafael J. Wysocki [Fri, 8 Apr 2022 18:29:01 +0000 (20:29 +0200)]
PCI/PM: Power up all devices during runtime resume
Currently, endpoint devices may not be powered up entirely during runtime
resume that follows a D3hot -> D0 transition of the parent bridge.
Namely, even if the power state of an endpoint device, as indicated by its
PCI_PM_CTRL register, is D0 after powering up its parent bridge, it may be
still necessary to bring its ACPI companion into D0 and that should be done
before accessing it. However, the current code assumes that reading the
PCI_PM_CTRL register is sufficient to establish the endpoint device's power
state, which may lead to problems.
Address that by forcing a power-up of all PCI devices, including the
platform firmware part of it, during runtime resume.
Link: https://lore.kernel.org/linux-pm/11967527.O9o76ZdvQC@kreacher
Fixes: 5775b843a619 ("PCI: Restore config space on runtime resume despite being unbound")
Link: https://lore.kernel.org/r/2652115.mvXUDI8C0e@kreacher
Reported-by: Abhishek Sahu <abhsahu@nvidia.com>
Tested-by: Abhishek Sahu <abhsahu@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Krzysztof Kozlowski [Wed, 20 Apr 2022 14:11:35 +0000 (16:11 +0200)]
PCI/PM: Define pci_restore_standard_config() only for CONFIG_PM_SLEEP
pci_restore_standard_config() was defined under CONFIG_PM but called only
by pci_pm_resume() (defined under CONFIG_SUSPEND) and pci_pm_restore()
(defined under CONFIG_HIBERNATE_CALLBACKS). A configuration with only
CONFIG_PM leads to a warning:
drivers/pci/pci-driver.c:533:12: error: ‘pci_restore_standard_config’ defined but not used [-Werror=unused-function]
CONFIG_PM_SLEEP depends on CONFIG_SUSPEND and CONFIG_HIBERNATE_CALLBACKS,
so define pci_restore_standard_config() under that instead.
Link: https://lore.kernel.org/r/20220420141135.444820-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Alex Williamson [Wed, 20 Apr 2022 20:45:05 +0000 (14:45 -0600)]
PCI/doc: Update obsolete pci_set_dma_mask() references
The function is dma_set_mask(), fix a missed instance of the old
pci_set_dma_mask() and a reference to a function that doesn't exist.
Fixes: 05b0ebd06ae6 ("PCI/doc: cleanup references to the legacy PCI DMA API")
Link: https://lore.kernel.org/r/165048747271.2959320.13475081883467312497.stgit@omen
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Shlomo Pongratz [Sun, 10 Apr 2022 10:52:13 +0000 (13:52 +0300)]
PCI/P2PDMA: Whitelist Intel Skylake-E Root Ports at any devfn
In
7b94b53db34f ("PCI/P2PDMA: Add Intel Sky Lake-E Root Ports B, C, D to
the whitelist"), Andrew Maier added Skylake-E 2031, 2032, and 2033 Root
Ports to the pci_p2pdma_whitelist[], so we assume P2PDMA between devices
below these ports works.
Previously we only checked the whitelist for a device at devfn 00.0 on the
root bus, which is often a "host bridge". But these Skylake Root Ports may
be at any devfn and there may be no "host bridge" device.
Generalize pci_host_bridge_dev() so we check the first device on the root
bus, whether it is devfn 00.0 or a PCIe Root Port, against the whitelist.
[bhelgaas: commit log, comment]
Link: https://lore.kernel.org/r/20220410105213.690-2-shlomop@pliops.com
Tested-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Shlomo Pongratz <shlomop@pliops.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Andrew Maier <andrew.maier@eideticom.com>
Bjorn Helgaas [Thu, 3 Mar 2022 22:00:39 +0000 (16:00 -0600)]
x86/PCI: Clip only host bridge windows for E820 regions
ACPI firmware advertises PCI host bridge resources via PNP0A03 _CRS
methods. Some BIOSes include non-window address space in _CRS, and if we
allocate that non-window space for PCI devices, they don't work.
4dc2287c1805 ("x86: avoid E820 regions when allocating address space")
works around this issue by clipping out any regions mentioned in the E820
table in the allocate_resource() path, but the implementation has a couple
issues:
- The clipping is done for *all* allocations, not just those for PCI
address space, and
- The clipping is done at each allocation instead of being done once when
setting up the host bridge windows.
Rework the implementation so we only clip PCI host bridge windows, and we
do it once when setting them up.
Example output changes:
BIOS-e820: [mem 0x00000000b0000000-0x00000000c00fffff] reserved
+ acpi PNP0A08:00: clipped [mem 0xc0000000-0xfebfffff window] to [mem 0xc0100000-0xfebfffff window] for e820 entry [mem 0xb0000000-0xc00fffff]
- pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
+ pci_bus 0000:00: root bus resource [mem 0xc0100000-0xfebfffff window]
Link: https://lore.kernel.org/r/20220304035110.988712-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Bjorn Helgaas [Thu, 7 Apr 2022 22:42:02 +0000 (17:42 -0500)]
x86: Log resource clipping for E820 regions
When remove_e820_regions() clips a resource because an E820 region overlaps
it, log a note in dmesg to add in debugging.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Mika Westerberg [Tue, 5 Apr 2022 09:38:10 +0000 (12:38 +0300)]
PCI/ASPM: Make Intel DG2 L1 acceptable latency unlimited
Intel DG2 discrete graphics PCIe endpoints advertise L1 acceptable exit
latency to be < 1us even though they can actually tolerate unlimited exit
latencies just fine. Quirk the L1 acceptable exit latency for these
endpoints to be unlimited so ASPM L1 can be enabled.
[bhelgaas: use FIELD_GET/FIELD_PREP, wordsmith comment & commit log]
Link: https://lore.kernel.org/r/20220405093810.76613-1-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Christophe Leroy [Sat, 2 Apr 2022 10:11:56 +0000 (12:11 +0200)]
PCI: hotplug: Clean up include files
arch/powerpc/include/asm/prom.h includes some headers that it doesn't need
itself. Add the missing headers to files that include prom.h so we can
remove them from prom.h.
Link: https://lore.kernel.org/r/79201f5fae8d003164ac36ed3be7789db1bc5ab4.1648833421.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Bjorn Helgaas [Fri, 4 Mar 2022 00:04:43 +0000 (18:04 -0600)]
x86/PCI: Eliminate remove_e820_regions() common subexpressions
Add local variables to reduce repetition later. No functional change
intended.
Link: https://lore.kernel.org/r/20220304035110.988712-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Mario Limonciello [Fri, 1 Apr 2022 03:40:03 +0000 (22:40 -0500)]
PCI/ACPI: Allow D3 only if Root Port can signal and wake from D3
acpi_pci_bridge_d3(dev) returns "true" if "dev" is a hotplug bridge that
can handle hotplug events while in D3. Previously this meant either:
- "dev" has a _PS0 or _PR0 method (acpi_pci_power_manageable()), or
- The Root Port above "dev" has a _DSD with a "HotPlugSupportInD3"
property with value 1.
This did not consider _PRW, which tells us about wakeup GPEs (ACPI v6.4,
sec 7.3.13). Without a wakeup GPE, from an ACPI perspective the Root Port
has no way of generating wakeup signals, so hotplug events will be lost if
we use D3.
Similarly, it did not consider _S0W, which tells us the deepest D-state
from which a device can wake itself (sec 7.3.20). If _S0W tells us the
device cannot wake from D3, hotplug events will again be lost if we use D3.
Some platforms, e.g., AMD Yellow Carp, supply "HotPlugSupportInD3" without
_PRW or with an _S0W that says the Root Port cannot wake from D3. On those
platforms, we previously put bridges in D3hot, hotplug events were lost,
and hotplugged devices would not be recognized without manually rescanning.
Allow bridges to be put in D3 only if the Root Port can generate wakeup
GPEs (wakeup.flags.valid), it can wake from D3 (_S0W), AND it has the
"HotPlugSupportInD3" property.
Neither Windows 10 nor Windows 11 puts the bridge in D3 when the firmware
is configured this way, and this change aligns the handling of the
situation to be the same.
[bhelgaas: commit log, tidy "HotPlugSupportInD3" check and comment]
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/07_Power_and_Performance_Mgmt/device-power-management-objects.html?highlight=s0w#s0w-s0-device-wake-state
Link: https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-pcie-root-ports-supporting-hot-plug-in-d3
Link: https://lore.kernel.org/r/20220401034003.3166-1-mario.limonciello@amd.com
Fixes: 26ad34d510a87 ("PCI / ACPI: Whitelist D3 for more PCIe hotplug ports")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Linus Torvalds [Sun, 3 Apr 2022 21:08:21 +0000 (14:08 -0700)]
Linux 5.18-rc1
Linus Torvalds [Sun, 3 Apr 2022 19:26:01 +0000 (12:26 -0700)]
Merge tag 'trace-v5.18-2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- Rename the staging files to give them some meaning. Just
stage1,stag2,etc, does not show what they are for
- Check for NULL from allocation in bootconfig
- Hold event mutex for dyn_event call in user events
- Mark user events to broken (to work on the API)
- Remove eBPF updates from user events
- Remove user events from uapi header to keep it from being installed.
- Move ftrace_graph_is_dead() into inline as it is called from hot
paths and also convert it into a static branch.
* tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Move user_events.h temporarily out of include/uapi
ftrace: Make ftrace_graph_is_dead() a static branch
tracing: Set user_events to BROKEN
tracing/user_events: Remove eBPF interfaces
tracing/user_events: Hold event_mutex during dyn_event_add
proc: bootconfig: Add null pointer check
tracing: Rename the staging files for trace_events
Linus Torvalds [Sun, 3 Apr 2022 19:21:14 +0000 (12:21 -0700)]
Merge tag 'clk-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fix from Stephen Boyd:
"A single revert to fix a boot regression seen when clk_put() started
dropping rate range requests. It's best to keep various systems
booting so we'll kick this out and try again next time"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
Revert "clk: Drop the rate range on clk_put()"
Linus Torvalds [Sun, 3 Apr 2022 19:15:47 +0000 (12:15 -0700)]
Merge tag 'x86-urgent-2022-04-03' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 fixes and updates:
- Make the prctl() for enabling dynamic XSTATE components correct so
it adds the newly requested feature to the permission bitmap
instead of overwriting it. Add a selftest which validates that.
- Unroll string MMIO for encrypted SEV guests as the hypervisor
cannot emulate it.
- Handle supervisor states correctly in the FPU/XSTATE code so it
takes the feature set of the fpstate buffer into account. The
feature sets can differ between host and guest buffers. Guest
buffers do not contain supervisor states. So far this was not an
issue, but with enabling PASID it needs to be handled in the buffer
offset calculation and in the permission bitmaps.
- Avoid a gazillion of repeated CPUID invocations in by caching the
values early in the FPU/XSTATE code.
- Enable CONFIG_WERROR in x86 defconfig.
- Make the X86 defconfigs more useful by adapting them to Y2022
reality"
* tag 'x86-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Consolidate size calculations
x86/fpu/xstate: Handle supervisor states in XSTATE permissions
x86/fpu/xsave: Handle compacted offsets correctly with supervisor states
x86/fpu: Cache xfeature flags from CPUID
x86/fpu/xsave: Initialize offset/size cache early
x86/fpu: Remove unused supervisor only offsets
x86/fpu: Remove redundant XCOMP_BV initialization
x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO
x86/config: Make the x86 defconfigs a bit more usable
x86/defconfig: Enable WERROR
selftests/x86/amx: Update the ARCH_REQ_XCOMP_PERM test
x86/fpu/xstate: Fix the ARCH_REQ_XCOMP_PERM implementation
Linus Torvalds [Sun, 3 Apr 2022 19:08:26 +0000 (12:08 -0700)]
Merge tag 'core-urgent-2022-04-03' of git://git./linux/kernel/git/tip/tip
Pull RT signal fix from Thomas Gleixner:
"Revert the RT related signal changes. They need to be reworked and
generalized"
* tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
Linus Torvalds [Sun, 3 Apr 2022 17:31:00 +0000 (10:31 -0700)]
Merge tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping
Pull more dma-mapping updates from Christoph Hellwig:
- fix a regression in dma remap handling vs AMD memory encryption (me)
- finally kill off the legacy PCI DMA API (Christophe JAILLET)
* tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: move pgprot_decrypted out of dma_pgprot
PCI/doc: cleanup references to the legacy PCI DMA API
PCI: Remove the deprecated "pci-dma-compat.h" API
Linus Torvalds [Sun, 3 Apr 2022 17:17:48 +0000 (10:17 -0700)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
- avoid unnecessary rebuilds for library objects
- fix return value of __setup handlers
- fix invalid input check for "crashkernel=" kernel option
- silence KASAN warnings in unwind_frame
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9191/1: arm/stacktrace, kasan: Silence KASAN warnings in unwind_frame()
ARM: 9190/1: kdump: add invalid input check for 'crashkernel=0'
ARM: 9187/1: JIVE: fix return value of __setup handler
ARM: 9189/1: decompressor: fix unneeded rebuilds of library objects
Stephen Boyd [Sun, 3 Apr 2022 02:28:18 +0000 (19:28 -0700)]
Revert "clk: Drop the rate range on clk_put()"
This reverts commit
7dabfa2bc4803eed83d6f22bd6f045495f40636b. There are
multiple reports that this breaks boot on various systems. The common
theme is that orphan clks are having rates set on them when that isn't
expected. Let's revert it out for now so that -rc1 boots.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Tony Lindgren <tony@atomide.com>
Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/r/366a0232-bb4a-c357-6aa8-636e398e05eb@samsung.com
Cc: Maxime Ripard <maxime@cerno.tech>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Link: https://lore.kernel.org/r/20220403022818.39572-1-sboyd@kernel.org
Linus Torvalds [Sat, 2 Apr 2022 19:57:17 +0000 (12:57 -0700)]
Merge tag 'perf-tools-for-v5.18-2022-04-02' of git://git./linux/kernel/git/acme/linux
Pull more perf tools updates from Arnaldo Carvalho de Melo:
- Avoid SEGV if core.cpus isn't set in 'perf stat'.
- Stop depending on .git files for building PERF-VERSION-FILE, used in
'perf --version', fixing some perf tools build scenarios.
- Convert tracepoint.py example to python3.
- Update UAPI header copies from the kernel sources: socket,
mman-common, msr-index, KVM, i915 and cpufeatures.
- Update copy of libbpf's hashmap.c.
- Directly return instead of using local ret variable in
evlist__create_syswide_maps(), found by coccinelle.
* tag 'perf-tools-for-v5.18-2022-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf python: Convert tracepoint.py example to python3
perf evlist: Directly return instead of using local ret variable
perf cpumap: More cpu map reuse by merge.
perf cpumap: Add is_subset function
perf evlist: Rename cpus to user_requested_cpus
perf tools: Stop depending on .git files for building PERF-VERSION-FILE
tools headers cpufeatures: Sync with the kernel sources
tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools kvm headers arm64: Update KVM headers from the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
tools headers UAPI: Sync asm-generic/mman-common.h with the kernel
perf beauty: Update copy of linux/socket.h with the kernel sources
perf tools: Update copy of libbpf's hashmap.c
perf stat: Avoid SEGV if core.cpus isn't set
Linus Torvalds [Sat, 2 Apr 2022 19:33:31 +0000 (12:33 -0700)]
Merge tag 'kbuild-fixes-v5.18' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix empty $(PYTHON) expansion.
- Fix UML, which got broken by the attempt to suppress Clang warnings.
- Fix warning message in modpost.
* tag 'kbuild-fixes-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: restore the warning message for missing symbol versions
Revert "um: clang: Strip out -mno-global-merge from USER_CFLAGS"
kbuild: Remove '-mno-global-merge'
kbuild: fix empty ${PYTHON} in scripts/link-vmlinux.sh
kconfig: remove stale comment about removed kconfig_print_symbol()
Linus Torvalds [Sat, 2 Apr 2022 19:14:38 +0000 (12:14 -0700)]
Merge tag 'mips_5.18_1' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- build fix for gpio
- fix crc32 build problems
- check for failed memory allocations
* tag 'mips_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: crypto: Fix CRC32 code
MIPS: rb532: move GPIOD definition into C-files
MIPS: lantiq: check the return value of kzalloc()
mips: sgi-ip22: add a check for the return of kzalloc()
Linus Torvalds [Sat, 2 Apr 2022 19:09:02 +0000 (12:09 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
- Documentation improvements
- Prevent module exit until all VMs are freed
- PMU Virtualization fixes
- Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
- Other miscellaneous bugfixes
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
KVM: x86: fix sending PV IPI
KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
KVM: x86: Remove redundant vm_entry_controls_clearbit() call
KVM: x86: cleanup enter_rmode()
KVM: x86: SVM: fix tsc scaling when the host doesn't support it
kvm: x86: SVM: remove unused defines
KVM: x86: SVM: move tsc ratio definitions to svm.h
KVM: x86: SVM: fix avic spec based definitions again
KVM: MIPS: remove reference to trap&emulate virtualization
KVM: x86: document limitations of MSR filtering
KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
KVM: x86: Trace all APICv inhibit changes and capture overall status
KVM: x86: Add wrappers for setting/clearing APICv inhibits
KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
KVM: X86: Handle implicit supervisor access with SMAP
KVM: X86: Rename variable smap to not_smap in permission_fault()
...
Masahiro Yamada [Fri, 1 Apr 2022 15:56:10 +0000 (00:56 +0900)]
modpost: restore the warning message for missing symbol versions
This log message was accidentally chopped off.
I was wondering why this happened, but checking the ML log, Mark
precisely followed my suggestion [1].
I just used "..." because I was too lazy to type the sentence fully.
Sorry for the confusion.
[1]: https://lore.kernel.org/all/CAK7LNAR6bXXk9-ZzZYpTqzFqdYbQsZHmiWspu27rtsFxvfRuVA@mail.gmail.com/
Fixes: 4a6795933a89 ("kbuild: modpost: Explicitly warn about unprototyped symbols")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Linus Torvalds [Sat, 2 Apr 2022 18:03:03 +0000 (11:03 -0700)]
Merge tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block
Pull block driver fix from Jens Axboe:
"Got two reports on nbd spewing warnings on load now, which is a
regression from a commit that went into your tree yesterday.
Revert the problematic change for now"
* tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block:
Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
Linus Torvalds [Sat, 2 Apr 2022 17:54:52 +0000 (10:54 -0700)]
Merge tag 'pci-v5.18-changes-2' of git://git./linux/kernel/git/helgaas/pci
Pull pci fix from Bjorn Helgaas:
- Fix Hyper-V "defined but not used" build issue added during merge
window (YueHaibing)
* tag 'pci-v5.18-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: hv: Remove unused hv_set_msi_entry_from_desc()
Linus Torvalds [Sat, 2 Apr 2022 17:44:18 +0000 (10:44 -0700)]
Merge tag 'tag-chrome-platform-for-v5.18' of git://git./linux/kernel/git/chrome-platform/linux
Pull chrome platform updates from Benson Leung:
"cros_ec_typec:
- Check for EC device - Fix a crash when using the cros_ec_typec
driver on older hardware not capable of typec commands
- Make try power role optional
- Mux configuration reorganization series from Prashant
cros_ec_debugfs:
- Fix use after free. Thanks Tzung-bi
sensorhub:
- cros_ec_sensorhub fixup - Split trace include file
misc:
- Add new mailing list for chrome-platform development:
chrome-platform@lists.linux.dev
Now with patchwork!"
* tag 'tag-chrome-platform-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_debugfs: detach log reader wq from devm
platform: chrome: Split trace include file
platform/chrome: cros_ec_typec: Update mux flags during partner removal
platform/chrome: cros_ec_typec: Configure muxes at start of port update
platform/chrome: cros_ec_typec: Get mux state inside configure_mux
platform/chrome: cros_ec_typec: Move mux flag checks
platform/chrome: cros_ec_typec: Check for EC device
platform/chrome: cros_ec_typec: Make try power role optional
MAINTAINERS: platform-chrome: Add new chrome-platform@lists.linux.dev list
Jens Axboe [Sat, 2 Apr 2022 17:40:23 +0000 (11:40 -0600)]
Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
This reverts commit
6d35d04a9e18990040e87d2bbf72689252669d54.
Both Gabriel and Borislav report that this commit casues a regression
with nbd:
sysfs: cannot create duplicate filename '/dev/block/43:0'
Revert it before 5.18-rc1 and we'll investigage this separately in
due time.
Link: https://lore.kernel.org/all/YkiJTnFOt9bTv6A2@zn.tnic/
Reported-by: Gabriel L. Somlo <somlo@cmu.edu>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Dumazet [Mon, 28 Mar 2022 17:07:04 +0000 (18:07 +0100)]
watch_queue: Free the page array when watch_queue is dismantled
Commit
7ea1a0124b6d ("watch_queue: Free the alloc bitmap when the
watch_queue is torn down") took care of the bitmap, but not the page
array.
BUG: memory leak
unreferenced object 0xffff88810d9bc140 (size 32):
comm "syz-executor335", pid 3603, jiffies
4294946994 (age 12.840s)
hex dump (first 32 bytes):
40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00 @.@.............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
kmalloc_array include/linux/slab.h:621 [inline]
kcalloc include/linux/slab.h:652 [inline]
watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251
pipe_ioctl+0x82/0x140 fs/pipe.c:632
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
Reported-by: syzbot+25ea042ae28f3888727a@syzkaller.appspotmail.com
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20220322004654.618274-1-eric.dumazet@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven Rostedt (Google) [Fri, 1 Apr 2022 18:39:03 +0000 (14:39 -0400)]
tracing: mark user_events as BROKEN
After being merged, user_events become more visible to a wider audience
that have concerns with the current API.
It is too late to fix this for this release, but instead of a full
revert, just mark it as BROKEN (which prevents it from being selected in
make config). Then we can work finding a better API. If that fails,
then it will need to be completely reverted.
To not have the code silently bitrot, still allow building it with
COMPILE_TEST.
And to prevent the uapi header from being installed, then later changed,
and then have an old distro user space see the old version, move the
header file out of the uapi directory.
Surround the include with CONFIG_COMPILE_TEST to the current location,
but when the BROKEN tag is taken off, it will use the uapi directory,
and fail to compile. This is a good way to remind us to move the header
back.
Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steven Rostedt (Google) [Fri, 1 Apr 2022 18:39:03 +0000 (14:39 -0400)]
tracing: Move user_events.h temporarily out of include/uapi
While user_events API is under development and has been marked for broken
to not let the API become fixed, move the header file out of the uapi
directory. This is to prevent it from being installed, then later changed,
and then have an old distro user space update with a new kernel, where
applications see the user_events being available, but the old header is in
place, and then they get compiled incorrectly.
Also, surround the include with CONFIG_COMPILE_TEST to the current
location, but when the BROKEN tag is taken off, it will use the uapi
directory, and fail to compile. This is a good way to remind us to move
the header back.
Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Link: https://lkml.kernel.org/r/20220401143903.188384f3@gandalf.local.home
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Christophe Leroy [Wed, 30 Mar 2022 07:00:19 +0000 (09:00 +0200)]
ftrace: Make ftrace_graph_is_dead() a static branch
ftrace_graph_is_dead() is used on hot paths, it just reads a variable
in memory and is not worth suffering function call constraints.
For instance, at entry of prepare_ftrace_return(), inlining it avoids
saving prepare_ftrace_return() parameters to stack and restoring them
after calling ftrace_graph_is_dead().
While at it using a static branch is even more performant and is
rather well adapted considering that the returned value will almost
never change.
Inline ftrace_graph_is_dead() and replace 'kill_ftrace_graph' bool
by a static branch.
The performance improvement is noticeable.
Link: https://lkml.kernel.org/r/e0411a6a0ed3eafff0ad2bc9cd4b0e202b4617df.1648623570.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Wed, 30 Mar 2022 19:58:35 +0000 (15:58 -0400)]
tracing: Set user_events to BROKEN
After being merged, user_events become more visible to a wider audience
that have concerns with the current API. It is too late to fix this for
this release, but instead of a full revert, just mark it as BROKEN (which
prevents it from being selected in make config). Then we can work finding
a better API. If that fails, then it will need to be completely reverted.
Link: https://lore.kernel.org/all/2059213643.196683.1648499088753.JavaMail.zimbra@efficios.com/
Link: https://lkml.kernel.org/r/20220330155835.5e1f6669@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Beau Belgrave [Tue, 29 Mar 2022 17:30:51 +0000 (10:30 -0700)]
tracing/user_events: Remove eBPF interfaces
Remove eBPF interfaces within user_events to ensure they are fully
reviewed.
Link: https://lore.kernel.org/all/20220329165718.GA10381@kbox/
Link: https://lkml.kernel.org/r/20220329173051.10087-1-beaub@linux.microsoft.com
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Beau Belgrave [Mon, 28 Mar 2022 22:32:25 +0000 (15:32 -0700)]
tracing/user_events: Hold event_mutex during dyn_event_add
Make sure the event_mutex is properly held during dyn_event_add call.
This is required when adding dynamic events.
Link: https://lkml.kernel.org/r/20220328223225.1992-1-beaub@linux.microsoft.com
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Lv Ruyi [Tue, 29 Mar 2022 10:40:04 +0000 (10:40 +0000)]
proc: bootconfig: Add null pointer check
kzalloc is a memory allocation function which can return NULL when some
internal memory errors happen. It is safer to add null pointer check.
Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn
Cc: stable@vger.kernel.org
Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Tue, 29 Mar 2022 20:50:44 +0000 (16:50 -0400)]
tracing: Rename the staging files for trace_events
When looking for implementation of different phases of the creation of the
TRACE_EVENT() macro, it is pretty useless when all helper macro
redefinitions are in files labeled "stageX_defines.h". Rename them to
state which phase the files are for. For instance, when looking for the
defines that are used to create the event fields, seeing
"stage4_event_fields.h" gives the developer a good idea that the defines
are in that file.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Li RongQing [Wed, 9 Mar 2022 08:35:44 +0000 (16:35 +0800)]
KVM: x86: fix sending PV IPI
If apic_id is less than min, and (max - apic_id) is greater than
KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
the new apic_id does not fit the bitmask. In this case __send_ipi_mask
should send the IPI.
This is mostly theoretical, but it can happen if the apic_ids on three
iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.
Fixes: aaffcfd1e82 ("KVM: X86: Implement PV IPIs in linux guest")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <
1646814944-51801-1-git-send-email-lirongqing@baidu.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 29 Mar 2022 16:56:24 +0000 (12:56 -0400)]
KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
FNAME(cmpxchg_gpte) is an inefficient mess. It is at least decent if it
can go through get_user_pages_fast(), but if it cannot then it tries to
use memremap(); that is not just terribly slow, it is also wrong because
it assumes that the VM_PFNMAP VMA is contiguous.
The right way to do it would be to do the same thing as
hva_to_pfn_remapped() does since commit
add6a0cd1c5b ("KVM: MMU: try to
fix up page faults before giving up", 2016-07-05), using follow_pte()
and fixup_user_fault() to determine the correct address to use for
memremap(). To do this, one could for example extract hva_to_pfn()
for use outside virt/kvm/kvm_main.c. But really there is no reason to
do that either, because there is already a perfectly valid address to
do the cmpxchg() on, only it is a userspace address. That means doing
user_access_begin()/user_access_end() and writing the code in assembly
to handle exceptions correctly. Worse, the guest PTE can be 8-byte
even on i686 so there is the extra complication of using cmpxchg8b to
account for. But at least it is an efficient mess.
(Thanks to Linus for suggesting improvement on the inline assembly).
Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zhenzhong Duan [Fri, 11 Mar 2022 10:26:43 +0000 (18:26 +0800)]
KVM: x86: Remove redundant vm_entry_controls_clearbit() call
When emulating exit from long mode, EFER_LMA is cleared with
vmx_set_efer(). This will already unset the VM_ENTRY_IA32E_MODE control
bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE
again in exit_lmode() explicitly. In case EFER isn't supported by
hardware, long mode isn't supported, so exit_lmode() cannot be reached.
Note that, thanks to the shadow controls mechanism, this change doesn't
eliminate vmread or vmwrite.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <
20220311102643.807507-3-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zhenzhong Duan [Fri, 11 Mar 2022 10:26:42 +0000 (18:26 +0800)]
KVM: x86: cleanup enter_rmode()
vmx_set_efer() sets uret->data but, in fact if the value of uret->data
will be used vmx_setup_uret_msrs() will have rewritten it with the value
returned by update_transition_efer(). uret->data is consumed if and only
if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care
of (a) updating uret->data before setting uret->load_into_hardware to true
(b) setting uret->load_into_hardware to false if uret->data isn't updated.
Opportunistically use "vmx" directly instead of redoing to_vmx().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Message-Id: <
20220311102643.807507-2-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Tue, 22 Mar 2022 17:24:48 +0000 (19:24 +0200)]
KVM: x86: SVM: fix tsc scaling when the host doesn't support it
It was decided that when TSC scaling is not supported,
the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
value.
However in this case kvm_max_tsc_scaling_ratio is not set,
which breaks various assumptions.
Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
host support. For consistency, do the same for VMX.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20220322172449.235575-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Tue, 22 Mar 2022 17:24:47 +0000 (19:24 +0200)]
kvm: x86: SVM: remove unused defines
Remove some unused #defines from svm.c
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20220322172449.235575-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Tue, 22 Mar 2022 17:24:46 +0000 (19:24 +0200)]
KVM: x86: SVM: move tsc ratio definitions to svm.h
Another piece of SVM spec which should be in the header file
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20220322172449.235575-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Tue, 22 Mar 2022 17:24:45 +0000 (19:24 +0200)]
KVM: x86: SVM: fix avic spec based definitions again
Due to wrong rebase, commit
4a204f7895878 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255")
moved avic spec #defines back to avic.c.
Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12
bits as well (it will be used in nested avic)
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <
20220322172449.235575-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sun, 13 Mar 2022 14:05:22 +0000 (15:05 +0100)]
KVM: MIPS: remove reference to trap&emulate virtualization
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <
20220313140522.
1307751-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 15 Mar 2022 22:17:15 +0000 (18:17 -0400)]
KVM: x86: document limitations of MSR filtering
MSR filtering requires an exit to userspace that is hard to implement and
would be very slow in the case of nested VMX vmexit and vmentry MSR
accesses. Document the limitation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hou Wenlong [Mon, 7 Mar 2022 12:26:33 +0000 (20:26 +0800)]
KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
If MSR access is rejected by MSR filtering,
kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED,
and the return value is only handled well for rdmsr/wrmsr.
However, some instruction emulation and state transition also
use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger
some unexpected results if MSR access is rejected, E.g. RDPID
emulation would inject a #UD but RDPID wouldn't cause a exit
when RDPID is supported in hardware and ENABLE_RDTSCP is set.
And it would also cause failure when load MSR at nested entry/exit.
Since msr filtering is based on MSR bitmap, it is better to only
do MSR filtering for rdmsr/wrmsr.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <
2b2774154f7532c96a6f04d71c82a8bec7d9e80b.
1646655860.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hou Wenlong [Wed, 2 Mar 2022 13:15:14 +0000 (21:15 +0800)]
KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
When RDTSCP is supported but RDPID is not supported in host,
RDPID emulation is available. However, __kvm_get_msr() would
only fail when RDTSCP/RDPID both are disabled in guest, so
the emulator wouldn't inject a #UD when RDPID is disabled but
RDTSCP is enabled in guest.
Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <
1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.
1646226788.git.houwenlong.hwl@antgroup.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like Xu [Wed, 9 Mar 2022 08:42:57 +0000 (16:42 +0800)]
KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
HSW_IN_TX* bits are used in generic code which are not supported on
AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence
using HSW_IN_TX* bits unconditionally in generic code is resulting in
unintentional pmu behavior on AMD. For example, if EventSelect[11:8]
is 0x2, pmc_reprogram_counter() wrongly assumes that
HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0.
Also per the SDM, both bits 32 and 33 "may only be set if the processor
supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set
for IA32_PERFEVTSEL2."
Opportunistically eliminate code redundancy, because if the HSW_IN_TX*
bit is set in pmc->eventsel, it is already set in attr.config.
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Reported-by: Jim Mattson <jmattson@google.com>
Fixes: 103af0a98788 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5")
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <
20220309084257.88931-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Wed, 2 Mar 2022 10:24:57 +0000 (12:24 +0200)]
KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
It makes more sense to print new SPTE value than the
old value.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20220302102457.588450-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jim Mattson [Sat, 26 Feb 2022 23:41:31 +0000 (15:41 -0800)]
KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some
reserved bits are cleared, and some are not. Specifically, on
Zen3/Milan, bits 19 and 42 are not cleared.
When emulating such a WRMSR, KVM should not synthesize a #GP,
regardless of which bits are set. However, undocumented bits should
not be passed through to the hardware MSR. So, rather than checking
for reserved bits and synthesizing a #GP, just clear the reserved
bits.
This may seem pedantic, but since KVM currently does not support the
"Host/Guest Only" bits (41:40), it is necessary to clear these bits
rather than synthesizing #GP, because some popular guests (e.g Linux)
will set the "Host Only" bit even on CPUs that don't support
EFER.SVME, and they don't expect a #GP.
For example,
root@Ubuntu1804:~# perf stat -e r26 -a sleep 1
Performance counter stats for 'system wide':
0 r26
1.
001070977 seconds time elapsed
Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30)
Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379958] Call Trace:
Feb 23 03:59:58 Ubuntu1804 kernel: [ 405.379963] amd_pmu_disable_event+0x27/0x90
Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Lotus Fenn <lotusf@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Like Xu <likexu@tencent.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Message-Id: <
20220226234131.
2167175-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 11 Mar 2022 04:35:17 +0000 (04:35 +0000)]
KVM: x86: Trace all APICv inhibit changes and capture overall status
Trace all APICv inhibit changes instead of just those that result in
APICv being (un)inhibited, and log the current state. Debugging why
APICv isn't working is frustrating as it's hard to see why APICv is still
inhibited, and logging only the first inhibition means unnecessary onion
peeling.
Opportunistically drop the export of the tracepoint, it is not and should
not be used by vendor code due to the need to serialize toggling via
apicv_update_lock.
Note, using the common flow means kvm_apicv_init() switched from atomic
to non-atomic bitwise operations. The VM is unreachable at init, so
non-atomic is perfectly ok.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20220311043517.17027-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 11 Mar 2022 04:35:16 +0000 (04:35 +0000)]
KVM: x86: Add wrappers for setting/clearing APICv inhibits
Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well. Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.
For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20220311043517.17027-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 11 Mar 2022 04:35:15 +0000 (04:35 +0000)]
KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs". Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20220311043517.17027-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Fri, 11 Mar 2022 07:03:44 +0000 (15:03 +0800)]
KVM: X86: Handle implicit supervisor access with SMAP
There are two kinds of implicit supervisor access
implicit supervisor access when CPL = 3
implicit supervisor access when CPL < 3
Current permission_fault() handles only the first kind for SMAP.
But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.
So the second kind should be also supported.
The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.
But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.
The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.
This patch removes the call to ->get_cpl() for access mode is determined
by @access. Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP. The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it. The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <
20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Fri, 11 Mar 2022 07:03:43 +0000 (15:03 +0800)]
KVM: X86: Rename variable smap to not_smap in permission_fault()
Comments above the variable says the bit is set when SMAP is overridden
or the same meaning in update_permission_bitmask(): it is not subjected
to SMAP restriction.
Renaming it to reflect the negative implication and make the code better
readability.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <
20220311070346.45023-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Fri, 11 Mar 2022 07:03:42 +0000 (15:03 +0800)]
KVM: X86: Fix comments in update_permission_bitmask
The commit
09f037aa48f3 ("KVM: MMU: speedup update_permission_bitmask")
refactored the code of update_permission_bitmask() and change the
comments. It added a condition into a list to match the new code,
so the number/order for conditions in the comments should be updated
too.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <
20220311070346.45023-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lai Jiangshan [Fri, 11 Mar 2022 07:03:41 +0000 (15:03 +0800)]
KVM: X86: Change the type of access u32 to u64
Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().
The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.
And SMAP relies a factor for supervisor access: explicit or implicit.
So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.
Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
o Extra bits will be in the higher 32 bits, so that we can
easily obtain the traditional access mode (UWX) by converting
it to u32.
o Reuse the value for the access kind defined by SVM's nested
paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
@error_code in kvm_handle_page_fault().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <
20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Woodhouse [Thu, 3 Mar 2022 15:41:12 +0000 (15:41 +0000)]
KVM: Remove dirty handling from gfn_to_pfn_cache completely
It isn't OK to cache the dirty status of a page in internal structures
for an indefinite period of time.
Any time a vCPU exits the run loop to userspace might be its last; the
VMM might do its final check of the dirty log, flush the last remaining
dirty pages to the destination and complete a live migration. If we
have internal 'dirty' state which doesn't get flushed until the vCPU
is finally destroyed on the source after migration is complete, then
we have lost data because that will escape the final copy.
This problem already exists with the use of kvm_vcpu_unmap() to mark
pages dirty in e.g. VMX nesting.
Note that the actual Linux MM already considers the page to be dirty
since we have a writeable mapping of it. This is just about the KVM
dirty logging.
For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to
track which gfn_to_pfn_caches have been used and explicitly mark the
corresponding pages dirty before returning to userspace. But we would
have needed external tracking of that anyway, rather than walking the
full list of GPCs to find those belonging to this vCPU which are dirty.
So let's rely *solely* on that external tracking, and keep it simple
rather than laying a tempting trap for callers to fall into.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <
20220303154127.202856-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 3 Mar 2022 15:41:11 +0000 (15:41 +0000)]
KVM: Use enum to track if cached PFN will be used in guest and/or host
Replace the guest_uses_pa and kernel_map booleans in the PFN cache code
with a unified enum/bitmask. Using explicit names makes it easier to
review and audit call sites.
Opportunistically add a WARN to prevent passing garbage; instantating a
cache without declaring its usage is either buggy or pointless.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <
20220303154127.202856-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Peter Gonda [Fri, 4 Mar 2022 16:10:32 +0000 (08:10 -0800)]
KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode()
Include kvm_cache_regs.h to pick up the definition of is_guest_mode(),
which is referenced by nested_svm_virtualize_tpr() in svm.h. Remove
include from svm_onhpyerv.c which was done only because of lack of
include in svm.h.
Fixes: 883b0a91f41ab ("KVM: SVM: Move Nested SVM Implementation to nested.c")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>
Message-Id: <
20220304161032.
2270688-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jim Mattson [Tue, 8 Mar 2022 01:24:52 +0000 (17:24 -0800)]
KVM: x86/pmu: Use different raw event masks for AMD and Intel
The third nybble of AMD's event select overlaps with Intel's IN_TX and
IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel
platforms that support TSX.
Declare a raw_event_mask in the kvm_pmu structure, initialize it in
the vendor-specific pmu_refresh() functions, and use that mask for
PERF_TYPE_RAW configurations in reprogram_gp_counter().
Fixes: 710c47651431 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <
20220308012452.
3468611-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Wed, 23 Feb 2022 16:53:02 +0000 (16:53 +0000)]
KVM: Don't actually set a request when evicting vCPUs for GFN cache invd
Don't actually set a request bit in vcpu->requests when making a request
purely to force a vCPU to exit the guest. Logging a request but not
actually consuming it would cause the vCPU to get stuck in an infinite
loop during KVM_RUN because KVM would see the pending request and bail
from VM-Enter to service the request.
Note, it's currently impossible for KVM to set KVM_REQ_GPC_INVALIDATE as
nothing in KVM is wired up to set guest_uses_pa=true. But, it'd be all
too easy for arch code to introduce use of kvm_gfn_to_pfn_cache_init()
without implementing handling of the request, especially since getting
test coverage of MMU notifier interaction with specific KVM features
usually requires a directed test.
Opportunistically rename gfn_to_pfn_cache_invalidate_start()'s wake_vcpus
to evict_vcpus. The purpose of the request is to get vCPUs out of guest
mode, it's supposed to _avoid_ waking vCPUs that are blocking.
Opportunistically rename KVM_REQ_GPC_INVALIDATE to be more specific as to
what it wants to accomplish, and to genericize the name so that it can
used for similar but unrelated scenarios, should they arise in the future.
Add a comment and documentation to explain why the "no action" request
exists.
Add compile-time assertions to help detect improper usage. Use the inner
assertless helper in the one s390 path that makes requests without a
hardcoded request.
Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <
20220223165302.
3205276-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Woodhouse [Tue, 29 Mar 2022 17:11:47 +0000 (13:11 -0400)]
KVM: avoid double put_page with gfn-to-pfn cache
If the cache's user host virtual address becomes invalid, there
is still a path from kvm_gfn_to_pfn_cache_refresh() where __release_gpc()
could release the pfn but the gpc->pfn field has not been overwritten
with an error value. If this happens, kvm_gfn_to_pfn_cache_unmap will
call put_page again on the same page.
Cc: stable@vger.kernel.org
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Fri, 25 Mar 2022 23:03:48 +0000 (23:03 +0000)]
KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
flush when processing multiple roots (including nested TDP shadow roots).
Dropping the TLB flush resulted in random crashes when running Hyper-V
Server 2019 in a guest with KSM enabled in the host (or any source of
mmu_notifier invalidations, KSM is just the easiest to force).
This effectively revert commits
873dd122172f8cce329113cfb0dfe3d2344d80c0
and
fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit
cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top:
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <
20220325230348.
2587437-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yi Wang [Wed, 9 Mar 2022 11:30:25 +0000 (19:30 +0800)]
KVM: SVM: fix panic on out-of-bounds guest IRQ
As guest_irq is coming from KVM_IRQFD API call, it may trigger
crash in svm_update_pi_irte() due to out-of-bounds:
crash> bt
PID: 22218 TASK:
ffff951a6ad74980 CPU: 73 COMMAND: "vcpu8"
#0 [
ffffb1ba6707fa40] machine_kexec at
ffffffff8565b397
#1 [
ffffb1ba6707fa90] __crash_kexec at
ffffffff85788a6d
#2 [
ffffb1ba6707fb58] crash_kexec at
ffffffff8578995d
#3 [
ffffb1ba6707fb70] oops_end at
ffffffff85623c0d
#4 [
ffffb1ba6707fb90] no_context at
ffffffff856692c9
#5 [
ffffb1ba6707fbf8] exc_page_fault at
ffffffff85f95b51
#6 [
ffffb1ba6707fc50] asm_exc_page_fault at
ffffffff86000ace
[exception RIP: svm_update_pi_irte+227]
RIP:
ffffffffc0761b53 RSP:
ffffb1ba6707fd08 RFLAGS:
00010086
RAX:
ffffb1ba6707fd78 RBX:
ffffb1ba66d91000 RCX:
0000000000000001
RDX:
00003c803f63f1c0 RSI:
000000000000019a RDI:
ffffb1ba66db2ab8
RBP:
000000000000019a R8:
0000000000000040 R9:
ffff94ca41b82200
R10:
ffffffffffffffcf R11:
0000000000000001 R12:
0000000000000001
R13:
0000000000000001 R14:
ffffffffffffffcf R15:
000000000000005f
ORIG_RAX:
ffffffffffffffff CS: 0010 SS: 0018
#7 [
ffffb1ba6707fdb8] kvm_irq_routing_update at
ffffffffc09f19a1 [kvm]
#8 [
ffffb1ba6707fde0] kvm_set_irq_routing at
ffffffffc09f2133 [kvm]
#9 [
ffffb1ba6707fe18] kvm_vm_ioctl at
ffffffffc09ef544 [kvm]
RIP:
00007f143c36488b RSP:
00007f143a4e04b8 RFLAGS:
00000246
RAX:
ffffffffffffffda RBX:
00007f05780041d0 RCX:
00007f143c36488b
RDX:
00007f05780041d0 RSI:
000000004008ae6a RDI:
0000000000000020
RBP:
00000000000004e8 R8:
0000000000000008 R9:
00007f05780041e0
R10:
00007f0578004560 R11:
0000000000000246 R12:
00000000000004e0
R13:
000000000000001a R14:
00007f1424001c60 R15:
00007f0578003bc0
ORIG_RAX:
0000000000000010 CS: 0033 SS: 002b
Vmx have been fix this in commit
3a8b0677fc61 (KVM: VMX: Do not BUG() on
out-of-bounds guest IRQ), so we can just copy source from that to fix
this.
Co-developed-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Message-Id: <
20220309113025.44469-1-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 25 Mar 2022 16:42:52 +0000 (12:42 -0400)]
KVM: MMU: propagate alloc_workqueue failure
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Sat, 2 Apr 2022 02:57:03 +0000 (19:57 -0700)]
Merge branch 'work.misc' of git://git./linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"Assorted bits and pieces"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: drop needless assignment in aio_read()
clean overflow checks in count_mounts() a bit
seq_file: fix NULL pointer arithmetic warning
uml/x86: use x86 load_unaligned_zeropad()
asm/user.h: killed unused macros
constify struct path argument of finish_automount()/do_add_mount()
fs: Remove FIXME comment in generic_write_checks()
Linus Torvalds [Sat, 2 Apr 2022 02:35:56 +0000 (19:35 -0700)]
Merge tag 'vfs-5.18-merge-1' of git://git./fs/xfs/xfs-linux
Pull vfs fix from Darrick Wong:
"The erofs developers felt that FIEMAP should handle ranged requests
starting at s_maxbytes by returning EFBIG instead of passing the
filesystem implementation a nonsense 0-byte request.
Not sure why they keep tagging this 'iomap', but the VFS shouldn't be
asking for information about ranges of a file that the filesystem
already declared that it does not support.
- Fix a potential infinite loop in FIEMAP by fixing an off by one
error when comparing the requested range against s_maxbytes"
* tag 'vfs-5.18-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: fix an infinite loop in iomap_fiemap
Linus Torvalds [Sat, 2 Apr 2022 02:30:44 +0000 (19:30 -0700)]
Merge tag 'xfs-5.18-merge-4' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"This fixes multiple problems in the reserve pool sizing functions: an
incorrect free space calculation, a pointless infinite loop, and even
more braindamage that could result in the pool being overfilled. The
pile of patches from Dave fix myriad races and UAF bugs in the log
recovery code that much to our mutual surprise nobody's tripped over.
Dave also fixed a performance optimization that had turned into a
regression.
Dave Chinner is taking over as XFS maintainer starting Sunday and
lasting until 5.19-rc1 is tagged so that I can focus on starting a
massive design review for the (feature complete after five years)
online repair feature. From then on, he and I will be moving XFS to a
co-maintainership model by trading duties every other release.
NOTE: I hope very strongly that the other pieces of the (X)FS
ecosystem (fstests and xfsprogs) will make similar changes to spread
their maintenance load.
Summary:
- Fix an incorrect free space calculation in xfs_reserve_blocks that
could lead to a request for free blocks that will never succeed.
- Fix a hang in xfs_reserve_blocks caused by an infinite loop and the
incorrect free space calculation.
- Fix yet a third problem in xfs_reserve_blocks where multiple racing
threads can overfill the reserve pool.
- Fix an accounting error that lead to us reporting reserved space as
"available".
- Fix a race condition during abnormal fs shutdown that could cause
UAF problems when memory reclaim and log shutdown try to clean up
inodes.
- Fix a bug where log shutdown can race with unmount to tear down the
log, thereby causing UAF errors.
- Disentangle log and filesystem shutdown to reduce confusion.
- Fix some confusion in xfs_trans_commit such that a race between
transaction commit and filesystem shutdown can cause unlogged dirty
inode metadata to be committed, thereby corrupting the filesystem.
- Remove a performance optimization in the log as it was discovered
that certain storage hardware handle async log flushes so poorly as
to cause serious performance regressions. Recent restructuring of
other parts of the logging code mean that no performance benefit is
seen on hardware that handle it well"
* tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: drop async cache flushes from CIL commits.
xfs: shutdown during log recovery needs to mark the log shutdown
xfs: xfs_trans_commit() path must check for log shutdown
xfs: xfs_do_force_shutdown needs to block racing shutdowns
xfs: log shutdown triggers should only shut down the log
xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
xfs: shutdown in intent recovery has non-intent items in the AIL
xfs: aborting inodes on shutdown may need buffer lock
xfs: don't report reserved bnobt space as available
xfs: fix overfilling of reserve pool
xfs: always succeed at setting the reserve pool size
xfs: remove infinite loop when reserving free block pool
xfs: don't include bnobt blocks when reserving free block pool
xfs: document the XFS_ALLOC_AGFL_RESERVE constant
Linus Torvalds [Sat, 2 Apr 2022 02:19:56 +0000 (19:19 -0700)]
Merge tag 'riscv-for-linus-5.18-mw2' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fix from Palmer Dabbelt:
- Fix the RISC-V section of the generic CPU idle bindings to comply
with the recently tightened DT schema.
* tag 'riscv-for-linus-5.18-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
dt-bindings: Fix phandle-array issues in the idle-states bindings
Linus Torvalds [Fri, 1 Apr 2022 23:26:57 +0000 (16:26 -0700)]
Merge tag 'for-5.18/drivers-2022-04-01' of git://git.kernel.dk/linux-block
Pull block driver fixes from Jens Axboe:
"Followup block driver updates and fixes for the 5.18-rc1 merge window.
In detail:
- NVMe pull request
- Fix multipath hang when disk goes live over reconnect (Anton
Eidelman)
- fix RCU hole that allowed for endless looping in multipath
round robin (Chris Leech)
- remove redundant assignment after left shift (Colin Ian King)
- add quirks for Samsung X5 SSDs (Monish Kumar R)
- fix the read-only state for zoned namespaces with unsupposed
features (Pankaj Raghav)
- use a private workqueue instead of the system workqueue in
nvmet (Sagi Grimberg)
- allow duplicate NSIDs for private namespaces (Sungup Moon)
- expose use_threaded_interrupts read-only in sysfs (Xin Hao)"
- nbd minor allocation fix (Zhang)
- drbd fixes and maintainer addition (Lars, Jakob, Christoph)
- n64cart build fix (Jackie)
- loop compat ioctl fix (Carlos)
- misc fixes (Colin, Dongli)"
* tag 'for-5.18/drivers-2022-04-01' of git://git.kernel.dk/linux-block:
drbd: remove check of list iterator against head past the loop body
drbd: remove usage of list iterator variable after loop
nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
MAINTAINERS: add drbd co-maintainer
drbd: fix potential silent data corruption
loop: fix ioctl calls using compat_loop_info
nvme-multipath: fix hang when disk goes live over reconnect
nvme: fix RCU hole that allowed for endless looping in multipath round robin
nvme: allow duplicate NSIDs for private namespaces
nvmet: remove redundant assignment after left shift
nvmet: use a private workqueue instead of the system workqueue
nvme-pci: add quirks for Samsung X5 SSDs
nvme-pci: expose use_threaded_interrupts read-only in sysfs
nvme: fix the read-only state for zoned namespaces with unsupposed features
n64cart: convert bi_disk to bi_bdev->bd_disk fix build
xen/blkfront: fix comment for need_copy
xen-blkback: remove redundant assignment to variable i
Linus Torvalds [Fri, 1 Apr 2022 23:20:00 +0000 (16:20 -0700)]
Merge tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Either fixes or a few additions that got missed in the initial merge
window pull. In detail:
- List iterator fix to avoid leaking value post loop (Jakob)
- One-off fix in minor count (Christophe)
- Fix for a regression in how io priority setting works for an
exiting task (Jiri)
- Fix a regression in this merge window with blkg_free() being called
in an inappropriate context (Ming)
- Misc fixes (Ming, Tom)"
* tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block:
blk-wbt: remove wbt_track stub
block: use dedicated list iterator variable
block: Fix the maximum minor value is blk_alloc_ext_minor()
block: restore the old set_task_ioprio() behaviour wrt PF_EXITING
block: avoid calling blkg_free() in atomic context
lib/sbitmap: allocate sb->map via kvzalloc_node