powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race
authorAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tue, 5 May 2020 07:17:29 +0000 (12:47 +0530)
committerMichael Ellerman <mpe@ellerman.id.au>
Tue, 5 May 2020 11:20:16 +0000 (21:20 +1000)
commit75358ea359e7c0dfceb3c7b3d854570b4260cb7f
treeefa39d494826c9623198d66794b55f3cea021e7d
parente21dfbf01346ee4447d1533b1c57a003c773c6e3
powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race

MADV_DONTNEED holds mmap_sem in read mode and that implies a
parallel page fault is possible and the kernel can end up with a level 1 PTE
entry (THP entry) converted to a level 0 PTE entry without flushing
the THP TLB entry.

Most architectures including POWER have issues with kernel instantiating a level
0 PTE entry while holding level 1 TLB entries.

The code sequence I am looking at is

down_read(mmap_sem)                         down_read(mmap_sem)

zap_pmd_range()
 zap_huge_pmd()
  pmd lock held
  pmd_cleared
  table details added to mmu_gather
  pmd_unlock()
                                         insert a level 0 PTE entry()

tlb_finish_mmu().

Fix this by forcing a tlb flush before releasing pmd lock if this is
not a fullmm invalidate. We can safely skip this invalidate for
task exit case (fullmm invalidate) because in that case we are sure
there can be no parallel fault handlers.

This do change the Qemu guest RAM del/unplug time as below

128 core, 496GB guest:

Without patch:
munmap start: timer = 196449 ms, PID=6681
munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms

With patch:
munmap start: timer = 196345 ms, PID=6879
munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
arch/powerpc/include/asm/book3s/64/pgtable.h
arch/powerpc/mm/book3s64/pgtable.c