Anisse Astier [Wed, 12 Sep 2018 13:07:05 +0000 (15:07 +0200)]
 
HID: i2c-hid: disable runtime PM operations on hantick touchpad
This hantick HTIX5288 touchpad can quickly fall into a wrong state if
there are too many open/close operations. This will either make it stop
reporting any input, or will shift all the input reads by a few bytes,
making it impossible to decode.
Here, we never release the probed touchpad runtime pm while the driver
is loaded, which should disable all runtime pm suspend/resumes.
This fast repetition of sleep/wakeup is also more likely to happen when
using runtime PM, which is why the quirk is done there, and not for all
power downs, which would include suspend or module removal.
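The effect of never releasing the runtime PM reference can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct fake_dev`, `fake_pm_get()`, `fake_pm_put()`, `fake_probe()` and `QUIRK_NO_RUNTIME_PM` are hypothetical names invented here, not identifiers from the i2c-hid driver, which uses the kernel's runtime PM API instead.

```c
#include <assert.h>

/* Hypothetical quirk bit, for illustration only. */
#define QUIRK_NO_RUNTIME_PM 0x1u

struct fake_dev {
	unsigned int quirks;
	int usage_count;   /* runtime-PM style reference count */
	int suspend_count; /* how often the device was put to sleep */
};

static void fake_pm_get(struct fake_dev *d)
{
	d->usage_count++;
}

static void fake_pm_put(struct fake_dev *d)
{
	assert(d->usage_count > 0);
	if (--d->usage_count == 0)
		d->suspend_count++; /* last reference dropped: device sleeps */
}

/* Probe takes a reference; quirky devices never give it back. */
static void fake_probe(struct fake_dev *d)
{
	fake_pm_get(d);
	if (!(d->quirks & QUIRK_NO_RUNTIME_PM))
		fake_pm_put(d);
}

/* Each open/close pair takes and drops one reference. */
static void fake_open_close(struct fake_dev *d, int times)
{
	for (int i = 0; i < times; i++) {
		fake_pm_get(d);
		fake_pm_put(d);
	}
}
```

In this model a device without the quirk is suspended after every open/close pair, while a device with the quirk keeps a non-zero count for as long as it stays probed, so repeated open/close never triggers a suspend.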
Signed-off-by: Anisse Astier <anisse@astier.eu>
Cc: stable@vger.kernel.org
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Philip Müller <philm@manjaro.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Kai-Heng Feng [Thu, 6 Sep 2018 02:55:18 +0000 (10:55 +0800)]
 
HID: i2c-hid: Don't reset device upon system resume
Raydium touchscreen triggers interrupt storm after system-wide suspend:
	[ 179.085033] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/65535)
According to Raydium, the Windows driver does not reset the device after
system resume.
The HID over I2C spec does specify that a reset should be used at
initialization, but it doesn't specify whether a reset is required for
system suspend.
I tested this patch on the other i2c-hid touchpanels I have, and those
touchpanels do work after S3 without a reset. If any regression happens
with other touchpanel vendors, we can use a quirk for Raydium devices.
There's still one device that uses I2C_HID_QUIRK_RESEND_REPORT_DESCR, so
keep it there.
Cc: Aaron Ma <aaron.ma@canonical.com>
Cc: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Hans de Goede [Sat, 18 Aug 2018 08:12:08 +0000 (10:12 +0200)]
 
HID: sensor-hub: Restore fixup for Lenovo ThinkPad Helix 2 sensor hub report
Commit b0f847e16c1e ("HID: hid-sensor-hub: Force logical minimum to 1 for
power and report state") not only replaced the descriptor fixup done for
devices with the HID_SENSOR_HUB_ENUM_QUIRK with a generic fix, but also
accidentally removed the unrelated descriptor fixup for the Lenovo ThinkPad
Helix 2 sensor hub. This commit restores this fixup.
Restoring this fixup not only fixes the Lenovo ThinkPad Helix 2's sensors,
but also the Lenovo ThinkPad 8's sensors.
Fixes: b0f847e16c1e ("HID: hid-sensor-hub: Force logical minimum ...")
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Fernando D S Lima <fernandodsl@gmail.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
 
Gustavo A. R. Silva [Wed, 29 Aug 2018 15:22:09 +0000 (10:22 -0500)]
 
HID: core: fix NULL pointer dereference
There is a NULL pointer dereference in case memory resources
for *parse* are not successfully allocated.
Fix this by adding a new goto label and making the execution
path jump to it in case vzalloc() fails.
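The two-label cleanup pattern described above can be sketched in plain userspace C. This is an illustration of the technique, not the code from hid-core: `struct parser`, `parse_report()` and the allocator hooks are hypothetical, and `calloc()`/`free()` stand in for the kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct parser {
	int *stack; /* dynamically allocated working stack */
};

/* Allocator hook so the failure path can be exercised (hypothetical). */
typedef void *(*alloc_fn)(size_t size);

static void *zeroing_alloc(size_t size) { return calloc(1, size); }
static void *failing_alloc(size_t size) { (void)size; return NULL; }

static int parse_report(alloc_fn alloc, int simulate_parse_error)
{
	struct parser *parser;
	int ret;

	assert(alloc != NULL);

	parser = alloc(sizeof(*parser));
	if (!parser) {
		ret = -ENOMEM;
		goto alloc_err; /* must not touch parser->stack here */
	}

	parser->stack = malloc(16 * sizeof(*parser->stack));
	if (!parser->stack) {
		ret = -ENOMEM;
		goto err;
	}
	if (simulate_parse_error) {
		ret = -EINVAL;
		goto err;
	}

	ret = 0;
err:
	free(parser->stack); /* safe: parser is known to be non-NULL */
alloc_err:
	free(parser);        /* free(NULL) is a no-op */
	return ret;
}
```

The point of the second label is that the allocation-failure path skips every statement that dereferences `parser`, while the later error paths still fall through both cleanup steps.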
Addresses-Coverity-ID: 1473081 ("Dereference after null check")
Fixes: b2dd9f2e5a8a ("HID: core: fix memory leak on probe")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
 
Benjamin Tissoires [Tue, 4 Sep 2018 13:31:14 +0000 (15:31 +0200)]
 
HID: core: fix grouping by application
commit f07b3c1da92d ("HID: generic: create one input report per
application type") was effectively the same as MULTI_INPUT:
hidinput->report was never set, so hidinput_match_application()
always returned null.
Fix that by testing against the real application.
Note that this breaks some old eGalax touchscreens that expect MULTI_INPUT
instead of HID_QUIRK_INPUT_PER_APP. Enable this quirk for backward
compatibility on all non-Win8 touchscreens.
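Grouping by application can be illustrated with a minimal model in which each input node records the application it serves and a new node is created only when no existing node matches. The types and function names below are hypothetical and much simpler than the structures in hid-input; the usage values are the standard HID usages for a touch screen and a mouse.

```c
#include <assert.h>
#include <stddef.h>

struct input_node {
	unsigned int application; /* HID application usage this node serves */
};

/* Return the existing node for this application, or NULL if none. */
static struct input_node *match_application(struct input_node *nodes,
					     size_t count,
					     unsigned int application)
{
	for (size_t i = 0; i < count; i++)
		if (nodes[i].application == application)
			return &nodes[i];
	return NULL;
}

/* Assign reports to nodes; returns how many nodes were created. */
static size_t group_reports(const unsigned int *apps, size_t n,
			    struct input_node *nodes, size_t capacity)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		if (match_application(nodes, used, apps[i]))
			continue; /* reuse the node for this application */
		assert(used < capacity);
		nodes[used++].application = apps[i];
	}
	return used;
}
```

If the match function always returned NULL, as the message says happened before the fix, every report would get its own node, which is the MULTI_INPUT behaviour.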
link: https://bugzilla.kernel.org/show_bug.cgi?id=200847
link: https://bugzilla.kernel.org/show_bug.cgi?id=200849
link: https://bugs.archlinux.org/task/59699
link: https://github.com/NixOS/nixpkgs/issues/45165
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
 
Benjamin Tissoires [Tue, 4 Sep 2018 13:31:12 +0000 (15:31 +0200)]
 
HID: multitouch: fix Elan panels with 2 input modes declaration
When implementing commit 7f81c8db5489 ("HID: multitouch: simplify
the settings of the various features"), I wrongly removed a test
that made sure we never try to set the second InputMode feature
to anything other than 0.
This badly broke some recent Elan panels, which now forget to send the
click button in some areas of the touchpad.
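The restored test amounts to "write the mode into the first InputMode feature and leave any later one alone". A minimal sketch of that rule follows; `struct feature` and `apply_input_mode()` are hypothetical names for this illustration, and only the usage value 0x000d0052 (Digitizer page, usage 0x52) is taken from the HID usage tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Digitizer page (0x0d), usage 0x52: the input mode feature. */
#define USAGE_INPUT_MODE 0x000d0052u

struct feature {
	unsigned int usage;
	int value;
};

/*
 * Write mode into the first input-mode feature only. A second
 * declaration of the same feature keeps its current value.
 * Returns the number of features that were written.
 */
static size_t apply_input_mode(struct feature *features, size_t count,
			       int mode)
{
	bool inputmode_found = false;
	size_t written = 0;

	assert(features != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (features[i].usage != USAGE_INPUT_MODE)
			continue;
		if (inputmode_found)
			continue; /* second declaration: leave untouched */
		features[i].value = mode;
		inputmode_found = true;
		written++;
	}
	return written;
}
```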
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200899
Fixes: 7f81c8db5489 ("HID: multitouch: simplify the settings of the various features")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
 
Harry Mallon [Tue, 28 Aug 2018 21:51:29 +0000 (22:51 +0100)]
 
HID: hid-saitek: Add device ID for RAT 7 Contagion
Signed-off-by: Harry Mallon <hjmallon@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Stefan Agner [Tue, 28 Aug 2018 11:29:54 +0000 (13:29 +0200)]
 
HID: core: fix memory leak on probe
The dynamically allocated collection stack does not get freed in
all situations. Make sure to also free the collection stack when
using the parser in hid_open_report().
Fixes: 08a8a7cf1459 ("HID: core: do not upper bound the collection stack")
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Stefan Agner [Tue, 28 Aug 2018 11:29:55 +0000 (13:29 +0200)]
 
HID: input: fix leaking custom input node name
Make sure to free the custom input node name on disconnect.
Cc: stable@vger.kernel.org # v4.18+
Fixes: c554bb045511 ("HID: input: append a suffix matching the application")
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Sean O'Brien [Mon, 27 Aug 2018 20:02:15 +0000 (13:02 -0700)]
 
HID: add support for Apple Magic Keyboards
USB device
	Vendor 05ac (Apple)
	Device 026c (Magic Keyboard with Numeric Keypad)
Bluetooth devices
	Vendor 004c (Apple)
	Device 0267 (Magic Keyboard)
	Device 026c (Magic Keyboard with Numeric Keypad)
Support already exists for the Magic Keyboard over USB connection.
Add support for the Magic Keyboard over Bluetooth connection, and for
the Magic Keyboard with Numeric Keypad over Bluetooth and USB
connection.
Signed-off-by: Sean O'Brien <seobrien@chromium.org>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
AceLan Kao [Tue, 21 Aug 2018 08:55:13 +0000 (16:55 +0800)]
 
HID: i2c-hid: Fix flooded incomplete report after S3 on Rayd touchscreen
Incomplete reports flood the log after S3 and the touchscreen
malfunctions.
[ 1367.646244] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/18785)
[ 1367.649471] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/28743)
[ 1367.651092] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/26757)
[ 1367.652658] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/52280)
[ 1367.654287] i2c_hid i2c-CUST0000:00: i2c_hid_get_input: incomplete report (58/56059)
Add device ID 04F3:30CC to the quirk that re-sends the report descriptor
after resume.
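A per-device quirk of this kind is typically a table lookup keyed on vendor and product ID. The sketch below shows the idea only: the table layout, `lookup_quirks()` and the bit value of `QUIRK_RESEND_REPORT_DESCR` are assumptions made for this example, while the ID pair 04F3:30CC comes from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value for the "re-send report descriptor" quirk. */
#define QUIRK_RESEND_REPORT_DESCR 0x1u

struct quirk_entry {
	uint16_t vendor;
	uint16_t product;
	uint32_t quirks;
};

static const struct quirk_entry quirk_table[] = {
	{ 0x04F3, 0x30CC, QUIRK_RESEND_REPORT_DESCR },
};

/* Return the quirk bits for a device, or 0 if it has no entry. */
static uint32_t lookup_quirks(uint16_t vendor, uint16_t product)
{
	size_t n = sizeof(quirk_table) / sizeof(quirk_table[0]);

	for (size_t i = 0; i < n; i++)
		if (quirk_table[i].vendor == vendor &&
		    quirk_table[i].product == product)
			return quirk_table[i].quirks;
	return 0;
}

/* Small self-check showing the intended use. */
static void quirk_self_check(void)
{
	assert(lookup_quirks(0x04F3, 0x30CC) & QUIRK_RESEND_REPORT_DESCR);
	assert(lookup_quirks(0x04F3, 0x0001) == 0);
}
```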
Cc: stable@vger.kernel.org
Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Andreas Bosch [Fri, 17 Aug 2018 20:16:00 +0000 (22:16 +0200)]
 
HID: intel-ish-hid: Enable Sunrise Point-H ish driver
Added PCI ID for Sunrise Point-H ISH.
Signed-off-by: Andreas Bosch <linux@progandy.de>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Linus Torvalds [Mon, 20 Aug 2018 22:59:01 +0000 (15:59 -0700)]
 
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
 - touch_max detection improvements and quirk handling fixes in wacom
   driver from Jason Gerecke and Ping Cheng
 - Palm rejection from Dmitry Torokhov and _dial support from Benjamin
   Tissoires for hid-multitouch driver
 - Low voltage support for i2c-hid driver from Stephen Boyd
 - Guitar-Hero support from Nicolas Adenis-Lamarre
 - other assorted small fixes and device ID additions
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (40 commits)
  HID: intel_ish-hid: tx_buf memory leak on probe/remove
  HID: intel-ish-hid: Prevent loading of driver on Mehlow
  HID: cougar: Add support for the Cougar 500k Gaming Keyboard
  HID: cougar: make compare_device_paths reusable
  HID: intel-ish-hid: remove redundant variable num_frags
  HID: multitouch: handle palm for touchscreens
  HID: multitouch: touchscreens also use confidence reports
  HID: multitouch: report MT_TOOL_PALM for non-confident touches
  HID: microsoft: support the Surface Dial
  HID: core: do not upper bound the collection stack
  HID: input: enable Totem on the Dell Canvas 27
  HID: multitouch: remove one copy of values
  HID: multitouch: ditch mt_report_id
  HID: multitouch: store a per application quirks value
  HID: multitouch: Store per collection multitouch data
  HID: multitouch: make sure the static list of class is not changed
  input: add MT_TOOL_DIAL
  HID: elan: Add support for touchpad on the Toshiba Click Mini L9W
  HID: elan: Add USB-id for HP x2 10-n000nd touchpad
  HID: elan: Add a flag for selecting if the touchpad has a LED
  ...
Linus Torvalds [Mon, 20 Aug 2018 22:41:37 +0000 (15:41 -0700)]
 
Merge tag 'backlight-next-4.19' of git://git./linux/kernel/git/lee/backlight
Pull backlight updates from Lee Jones:
 "Core Framework:
   - Remove unused/obsolete code/comments
  New Functionality:
   - Allow less granular brightness specification for high-res PWMs; pwm_bl
   - Align brightness {inc,dec}rements with that perceived by the human-eye; pwm_bl
  Fix-ups:
   - Prepare for the introduction of -Wimplicit-fall-through; adp8860_bl
  Bug Fixes:
   - Fix uninitialised variable; pwm_bl"
* tag 'backlight-next-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: pwm_bl: Fix uninitialized variable
  backlight: adp8860: Mark expected switch fall-through
  backlight: Remove obsolete comment for ->state
  dt-bindings: pwm-backlight: Move brightness-levels to optional
  backlight: pwm_bl: Compute brightness of LED linearly to human eye
  dt-bindings: pwm-backlight: Add a num-interpolation-steps property
  backlight: pwm_bl: Linear interpolation between brightness-levels
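The "linear interpolation between brightness-levels" item in the list above can be pictured as expanding a short table of levels into a longer one with evenly spaced intermediate values. The function below is a hedged sketch of that idea and not the pwm_bl implementation; `interpolate_levels()` and its parameters are hypothetical, with `steps` playing the role of the number of interpolation steps between two adjacent levels.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expand levels[0..n-1] so that each adjacent pair is covered by
 * `steps` evenly spaced values, followed by the final level.
 * Returns the number of entries written to out.
 */
static size_t interpolate_levels(const unsigned int *levels, size_t n,
				 unsigned int steps,
				 unsigned int *out, size_t out_cap)
{
	size_t k = 0;

	if (n == 0 || steps == 0)
		return 0;

	for (size_t i = 0; i + 1 < n; i++) {
		long lo = (long)levels[i];
		long hi = (long)levels[i + 1];

		for (unsigned int s = 0; s < steps; s++) {
			assert(k < out_cap);
			out[k++] = (unsigned int)(lo + (hi - lo) * (long)s / (long)steps);
		}
	}
	assert(k < out_cap);
	out[k++] = levels[n - 1];
	return k;
}
```

With levels 0, 100, 200 and four steps per pair, the expanded table is 0, 25, 50, 75, 100, 125, 150, 175, 200.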
Linus Torvalds [Mon, 20 Aug 2018 22:38:44 +0000 (15:38 -0700)]
 
Merge tag 'mfd-next-4.19' of git://git./linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
 "New Drivers:
   - Add Cirrus Logic Madera Codec (CS47L35, CS47L85 and CS47L90/91) driver
   - Add ChromeOS EC CEC driver
   - Add ROHM BD71837 PMIC driver
  New Device Support:
   - Add support for Dialog Semi DA9063L PMIC variant to DA9063
   - Add support for Intel Ice Lake to Intel-PLSS-PCI
   - Add support for X-Powers AXP806 to AXP20x
  New Functionality:
   - Add support for USB Charging to the ChromeOS Embedded Controller
   - Add support for HDMI CEC to the ChromeOS Embedded Controller
   - Add support for HDMI CEC to Intel HDMI
   - Add support for accessory detection to Madera devices
   - Allow individual pins to be configured via DT; wlf,csnaddr-pd
   - Provide legacy platform specific EEPROM/Watchdog commands; rave-sp
  Fix-ups:
   - Trivial renaming/spelling fixes; cros_ec, da9063-*
   - Convert to Managed Resources (devm_*); da9063-*, ti_am335x_tscadc
   - Transition to helper macros/functions; da9063-*
   - Constify; kempld-core
   - Improve error path/messages; wm8994-core
   - Disable IRQs locally instead of relying on USB subsystem; dln2
   - Remove unused code; rave-sp
   - New exports; sec-core
  Bug Fixes:
   - Fix possible false I2C transaction error; arizona-core
   - Fix declared memory area size; hi655x-pmic
   - Fix checksum type; rave-sp
   - Fix incorrect default serial port configuration; rave-sp
   - Fix incorrect coherent DMA mask for sub-devices; sm501"
* tag 'mfd-next-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (60 commits)
  mfd: madera: Add register definitions for accessory detect
  mfd: sm501: Set coherent_dma_mask when creating subdevices
  mfd: bd71837: Devicetree bindings for ROHM BD71837 PMIC
  mfd: bd71837: Core driver for ROHM BD71837 PMIC
  media: platform: cros-ec-cec: Fix dependency on MFD_CROS_EC
  mfd: sec-core: Export OF module alias table
  mfd: as3722: Disable auto-power-on when AC OK
  mfd: axp20x: Support AXP806 in I2C mode
  mfd: axp20x: Add self-working mode support for AXP806
  dt-bindings: mfd: axp20x: Add "self-working" mode for AXP806
  mfd: wm8994: Allow to configure CS/ADDR Pulldown from dts
  mfd: wm8994: Allow to configure Speaker Mode Pullup from dts
  mfd: rave-sp: Emulate CMD_GET_STATUS on device that don't support it
  mfd: rave-sp: Add legacy watchdog ping command translation
  mfd: rave-sp: Add legacy EEPROM access command translation
  mfd: rave-sp: Initialize flow control and parity of the port
  mfd: rave-sp: Fix incorrectly specified checksum type
  mfd: rave-sp: Remove unused defines
  mfd: hi655x: Fix regmap area declared size for hi655x
  mfd: ti_am335x_tscadc: Fix struct clk memory leak
  ...
 
Linus Torvalds [Mon, 20 Aug 2018 22:28:54 +0000 (15:28 -0700)]
 
Merge tag 'edac_fixes_for_4.19' of git://git./linux/kernel/git/bp/bp
Pull EDAC fix from Borislav Petkov:
 "An urgent fix for a NULL ptr deref on machines with LRDDR4 DIMMs, from
  Takashi Iwai"
* tag 'edac_fixes_for_4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  EDAC: Add missing MEM_LRDDR4 entry in edac_mem_types[]
Joe Perches [Mon, 20 Aug 2018 20:15:26 +0000 (13:15 -0700)]
 
Raise the minimum required gcc version to 4.6
Various architectures fail to build properly with older versions of the
gcc compiler.
An example from Guenter Roeck in thread [1]:
>
>   In file included from ./include/linux/mm.h:17:0,
>                    from ./include/linux/pid_namespace.h:7,
>                    from ./include/linux/ptrace.h:10,
>                    from arch/openrisc/kernel/asm-offsets.c:32:
>   ./include/linux/mm_types.h:497:16: error: flexible array member in otherwise empty struct
>
> This is just an example with gcc 4.5.1 for or32. I have seen the problem
> with gcc 4.4 (for unicore32) as well.
So update the minimum required version of gcc to 4.6.
[1] https://lore.kernel.org/lkml/20180814170904.GA12768@roeck-us.net/
Miscellanea:
 - Update Documentation/process/changes.rst
 - Remove and consolidate version test blocks in compiler-gcc.h for
   versions lower than 4.6
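A minimum-version requirement like this is usually enforced with a preprocessor check on an encoded version number. The sketch below assumes the common major * 10000 + minor * 100 + patchlevel encoding; `VERSION_CODE`, `MIN_GCC_VERSION` and `compiler_is_supported()` are names made up for this example rather than the ones in compiler-gcc.h.

```c
#include <assert.h>

/* Encode major.minor.patch as a single comparable integer. */
#define VERSION_CODE(major, minor, patch) \
	((major) * 10000 + (minor) * 100 + (patch))

#define MIN_GCC_VERSION VERSION_CODE(4, 6, 0)

/* Refuse to build with a real GCC that is older than the minimum. */
#if defined(__GNUC__) && !defined(__clang__)
# if VERSION_CODE(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__) < MIN_GCC_VERSION
#  error "GCC older than 4.6 is not supported"
# endif
#endif

/* Runtime form of the same comparison, handy for illustration. */
static int compiler_is_supported(int major, int minor, int patch)
{
	return VERSION_CODE(major, minor, patch) >= MIN_GCC_VERSION;
}

/* Small self-check using the versions named in the message. */
static void version_self_check(void)
{
	assert(!compiler_is_supported(4, 5, 1)); /* the or32 example */
	assert(!compiler_is_supported(4, 4, 0)); /* the unicore32 example */
	assert(compiler_is_supported(4, 6, 0));
}
```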
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Tony Luck [Mon, 20 Aug 2018 16:31:04 +0000 (09:31 -0700)]
 
ia64: Fix kernel BUG at lib/ioremap.c:72!
Commit 0bbf47eab469 ("ia64: use asm-generic/io.h") results in a BUG
while booting ia64.  This is because asm-generic/io.h defines
PCI_IOBASE, which results in the function acpi_pci_root_remap_iospace()
doing a lot of unnecessary (and wrong) things.
I'd suggested an #if !CONFIG_IA64 in the function, but Arnd suggested
keeping the fix inside the arch/ia64 tree.
Fixes: 0bbf47eab469 ("ia64: use asm-generic/io.h")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Jiri Kosina [Mon, 20 Aug 2018 16:13:57 +0000 (18:13 +0200)]
 
Merge branch 'for-4.19/wiimote' into for-linus
Guitar-Hero devices support for hid-wiimote
Jiri Kosina [Mon, 20 Aug 2018 16:12:42 +0000 (18:12 +0200)]
 
Merge branch 'for-4.19/wacom' into for-linus
Wacom driver updates:
- touch_max detection improvements
- quirk handling cleanup
- get rid of wacom custom usages
Jiri Kosina [Mon, 20 Aug 2018 16:11:20 +0000 (18:11 +0200)]
 
Merge branch 'for-4.19/upstream' into for-linus
Assorted small driver/core fixes.
Jiri Kosina [Mon, 20 Aug 2018 16:10:33 +0000 (18:10 +0200)]
 
Merge branch 'for-4.19/sony' into for-linus
devm_* API conversion for hid-sony
Jiri Kosina [Mon, 20 Aug 2018 16:09:06 +0000 (18:09 +0200)]
 
Merge branch 'for-4.19/multitouch-multiaxis' into for-linus
Multitouch updates:
- Dial support
- Palm rejection for touchscreens
- a few small assorted fixes
Jiri Kosina [Mon, 20 Aug 2018 16:07:36 +0000 (18:07 +0200)]
 
Merge branch 'for-4.19/intel-ish' into for-linus
Device-specific fixes for hid-intel-ish
Jiri Kosina [Mon, 20 Aug 2018 16:07:01 +0000 (18:07 +0200)]
 
Merge branch 'for-4.19/i2c-hid' into for-linus
Low voltage support for i2c-hid
Jiri Kosina [Mon, 20 Aug 2018 16:06:30 +0000 (18:06 +0200)]
 
Merge branch 'for-4.19/elan' into for-linus
Resolution/pressure fixes and new device support for hid-elan
Jiri Kosina [Mon, 20 Aug 2018 16:05:17 +0000 (18:05 +0200)]
 
Merge branch 'for-4.19/cougar' into for-linus
New device support for hid-cougar
Linus Torvalds [Sun, 19 Aug 2018 23:23:03 +0000 (16:23 -0700)]
 
Merge branch 'for-next' of git://git./linux/kernel/git/gerg/m68knommu
Pull m68knommu updates from Greg Ungerer:
 "Only two changes.
  One cleans up warnings in the ColdFire DMA code, the other stubs out
  (with warnings) ColdFire clock api functions not normally used"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68knommu: Fix typos in Coldfire 5272 DMA debug code
  m68k: coldfire: Normalize clk API
Linus Torvalds [Sun, 19 Aug 2018 18:51:45 +0000 (11:51 -0700)]
 
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 1) Fix races in IPVS, from Tan Hu.
 2) Missing unbind in matchall classifier, from Hangbin Liu.
 3) Missing act_ife action release, from Vlad Buslov.
 4) Cure lockdep splats in ila, from Cong Wang.
 5) veth queue leak on link delete, from Toshiaki Makita.
 6) Disable isdn's IIOCDBGVAR ioctl, it exposes kernel addresses. From
    Kees Cook.
 7) RCU usage fixup in XDP, from Tariq Toukan.
 8) Two TCP ULP fixes from Daniel Borkmann.
 9) r8169 needs REALTEK_PHY as a Kconfig dependency, from Heiner
    Kallweit.
10) Always take tcf_lock with BH disabled, otherwise we can deadlock
    with rate estimator code paths. From Vlad Buslov.
11) Don't use MSI-X on RTL8106e r8169 chips, they don't resume properly.
    From Jian-Hong Pan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  ip6_vti: fix creating fallback tunnel device for vti6
  ip_vti: fix a null pointer deferrence when create vti fallback tunnel
  r8169: don't use MSI-X on RTL8106e
  net: lan743x_ptp: convert to ktime_get_clocktai_ts64
  net: sched: always disable bh when taking tcf_lock
  ip6_vti: simplify stats handling in vti6_xmit
  bpf: fix redirect to map under tail calls
  r8169: add missing Kconfig dependency
  tools/bpf: fix bpf selftest test_cgroup_storage failure
  bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
  bpf, sockmap: fix map elem deletion race with smap_stop_sock
  bpf, sockmap: fix leakage of smap_psock_map_entry
  tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
  tcp, ulp: add alias for all ulp modules
  bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
  samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
  net/xdp: Fix suspicious RCU usage warning
  net/mlx5e: Delete unneeded function argument
  Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb
  isdn: Disable IIOCDBGVAR
  ...
Haishuang Yan [Sun, 19 Aug 2018 07:05:05 +0000 (15:05 +0800)]
 
ip6_vti: fix creating fallback tunnel device for vti6
When fb_tunnels_only_for_init_net is set to 1, don't create the fallback
tunnel device for vti6 when a new namespace is created.
Tested:
[root@builder2 ~]# modprobe ip6_tunnel
[root@builder2 ~]# modprobe ip6_vti
[root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
[root@builder2 ~]# unshare -n
[root@builder2 ~]# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haishuang Yan [Sun, 19 Aug 2018 07:05:04 +0000 (15:05 +0800)]
 
ip_vti: fix a null pointer deferrence when create vti fallback tunnel
After setting fb_tunnels_only_for_init_net to 1, itn->fb_tunnel_dev will
be NULL, which causes the following crash:
[ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941
[ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0
[ 2742.852818] Oops: 0002 [#1] SMP PTI
[ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2
[ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014
[ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti]
[ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00
[ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246
[ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013
[ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980
[ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000
[ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980
[ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000
[ 2742.870473] FS:  00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000
[ 2742.872143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0
[ 2742.874821] Call Trace:
[ 2742.875358]  ops_init+0x38/0xf0
[ 2742.876078]  setup_net+0xd9/0x1f0
[ 2742.876789]  copy_net_ns+0xb7/0x130
[ 2742.877538]  create_new_namespaces+0x11a/0x1d0
[ 2742.878525]  unshare_nsproxy_namespaces+0x55/0xa0
[ 2742.879526]  ksys_unshare+0x1a7/0x330
[ 2742.880313]  __x64_sys_unshare+0xe/0x20
[ 2742.881131]  do_syscall_64+0x5b/0x180
[ 2742.881933]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reproduce:
echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
modprobe ip_vti
unshare -n
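The crash comes from initialising a fallback device that was never created for the new namespace. A minimal model of the guard is shown below; all names (`struct tunnel_dev`, `struct tunnel_net`, `has_fallback_tunnels()`, `tunnel_init_net()`) are hypothetical stand-ins for this sketch and do not reproduce the ip_vti code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tunnel_dev {
	int initialized;
};

struct tunnel_net {
	struct tunnel_dev *fb_tunnel_dev; /* NULL when no fallback exists */
};

/* Fallback tunnels exist in the initial namespace, or everywhere when
 * the "only for init_net" setting is off. */
static bool has_fallback_tunnels(bool is_init_net, int only_for_init_net)
{
	return is_init_net || !only_for_init_net;
}

/* Per-namespace init: skip fallback setup when there is no device. */
static int tunnel_init_net(struct tunnel_net *tn, struct tunnel_dev *storage,
			   bool is_init_net, int only_for_init_net)
{
	tn->fb_tunnel_dev = NULL;

	if (!has_fallback_tunnels(is_init_net, only_for_init_net))
		return 0; /* nothing to initialise, and nothing to dereference */

	tn->fb_tunnel_dev = storage;
	assert(tn->fb_tunnel_dev != NULL);
	tn->fb_tunnel_dev->initialized = 1;
	return 0;
}
```

Without the early return, the last assignment would write through a NULL pointer in exactly the situation the reproducer sets up: the setting is 1 and the namespace is not the initial one.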
Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
 
Jian-Hong Pan [Fri, 17 Aug 2018 05:07:35 +0000 (13:07 +0800)]
 
r8169: don't use MSI-X on RTL8106e
The Ethernet interface on the ASUS X441UAR doesn't come back on resume
from suspend when using MSI-X.  The chip is RTL8106e - version 39.
[   21.848357] libphy: r8169: probed
[   21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127
[   22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0
[   29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[   63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[  124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
Here is the ethernet controller in detail:
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07)
	Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f]
	Flags: bus master, fast devsel, latency 0, IRQ 16
	I/O ports at e000 [size=256]
	Memory at ef100000 (64-bit, non-prefetchable) [size=4K]
	Memory at e0000000 (64-bit, prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169
Falling back to MSI fixes the issue.
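Restricting one chip to MSI can be modelled as a per-chip choice of allowed interrupt types. The flag values, `enum chip_version` and `allowed_irq_types()` below are assumptions for illustration only; the r8169 driver has its own chip version constants and uses the PCI core's interrupt flags.

```c
#include <assert.h>

/* Illustrative flag values for the three interrupt delivery types. */
#define IRQ_LEGACY 0x1u
#define IRQ_MSI    0x2u
#define IRQ_MSIX   0x4u

/* Hypothetical chip identifiers for this sketch. */
enum chip_version { CHIP_GENERIC, CHIP_RTL8106E };

/* Return the set of interrupt types the driver may request for a chip. */
static unsigned int allowed_irq_types(enum chip_version chip)
{
	switch (chip) {
	case CHIP_RTL8106E:
		/* MSI-X does not survive suspend/resume on this chip. */
		return IRQ_LEGACY | IRQ_MSI;
	default:
		return IRQ_LEGACY | IRQ_MSI | IRQ_MSIX;
	}
}

/* Small self-check showing the intended use. */
static void irq_self_check(void)
{
	assert(!(allowed_irq_types(CHIP_RTL8106E) & IRQ_MSIX));
	assert(allowed_irq_types(CHIP_GENERIC) & IRQ_MSIX);
}
```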
Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling")
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
 
Arnd Bergmann [Wed, 15 Aug 2018 17:49:49 +0000 (19:49 +0200)]
 
net: lan743x_ptp: convert to ktime_get_clocktai_ts64
timekeeping_clocktai64() has been renamed to ktime_get_clocktai_ts64()
for consistency with the other ktime_get_* access functions.
Rename the new caller that has come up as well.
Question: this is the only ptp driver that sets the hardware time
to the current system time in TAI. Why does it do that?
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov [Tue, 14 Aug 2018 18:46:16 +0000 (21:46 +0300)]
 
net: sched: always disable bh when taking tcf_lock
Recently, ops->init() and ops->dump() of all actions were modified to
always obtain tcf_lock when accessing private action state. Actions that
don't depend on tcf_lock for synchronization with their data path use
non-bh locking API. However, tcf_lock is also used to protect rate
estimator stats in softirq context by timer callback.
Change ops->init() and ops->dump() of all actions to disable bh when using
tcf_lock to prevent the deadlock reported by the following lockdep warning:
[  105.470398] ================================
[  105.475014] WARNING: inconsistent lock state
[  105.479628] 4.18.0-rc8+ #664 Not tainted
[  105.483897] --------------------------------
[  105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0
[  105.509696] {SOFTIRQ-ON-W} state was registered at:
[  105.514925]   _raw_spin_lock+0x2c/0x40
[  105.519022]   tcf_bpf_init+0x579/0x820 [act_bpf]
[  105.523990]   tcf_action_init_1+0x4e4/0x660
[  105.528518]   tcf_action_init+0x1ce/0x2d0
[  105.532880]   tcf_exts_validate+0x1d8/0x200
[  105.537416]   fl_change+0x55a/0x268b [cls_flower]
[  105.542469]   tc_new_tfilter+0x748/0xa20
[  105.546738]   rtnetlink_rcv_msg+0x56a/0x6d0
[  105.551268]   netlink_rcv_skb+0x18d/0x200
[  105.555628]   netlink_unicast+0x2d0/0x370
[  105.559990]   netlink_sendmsg+0x3b9/0x6a0
[  105.564349]   sock_sendmsg+0x6b/0x80
[  105.568271]   ___sys_sendmsg+0x4a1/0x520
[  105.572547]   __sys_sendmsg+0xd7/0x150
[  105.576655]   do_syscall_64+0x72/0x2c0
[  105.580757]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  105.586243] irq event stamp: 489296
[  105.590084] hardirqs last  enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40
[  105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50
[  105.609277] softirqs last  enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0
[  105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190
[  105.626813]
               other info that might help us debug this:
[  105.633976]  Possible unsafe locking scenario:
[  105.640526]        CPU0
[  105.643325]        ----
[  105.646125]   lock(&(&p->tcfa_lock)->rlock);
[  105.650747]   <Interrupt>
[  105.653717]     lock(&(&p->tcfa_lock)->rlock);
[  105.658514]
                *** DEADLOCK ***
[  105.665349] 1 lock held by swapper/16/0:
[  105.669629]  #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550
[  105.678200]
               stack backtrace:
[  105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664
[  105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  105.698626] Call Trace:
[  105.701421]  <IRQ>
[  105.703791]  dump_stack+0x92/0xeb
[  105.707461]  print_usage_bug+0x336/0x34c
[  105.711744]  mark_lock+0x7c9/0x980
[  105.715500]  ? print_shortest_lock_dependencies+0x2e0/0x2e0
[  105.721424]  ? check_usage_forwards+0x230/0x230
[  105.726315]  __lock_acquire+0x923/0x26f0
[  105.730597]  ? debug_show_all_locks+0x240/0x240
[  105.735478]  ? mark_lock+0x493/0x980
[  105.739412]  ? check_chain_key+0x140/0x1f0
[  105.743861]  ? __lock_acquire+0x836/0x26f0
[  105.748323]  ? lock_acquire+0x12e/0x290
[  105.752516]  lock_acquire+0x12e/0x290
[  105.756539]  ? est_fetch_counters+0x3c/0xa0
[  105.761084]  _raw_spin_lock+0x2c/0x40
[  105.765099]  ? est_fetch_counters+0x3c/0xa0
[  105.769633]  est_fetch_counters+0x3c/0xa0
[  105.773995]  est_timer+0x87/0x390
[  105.777670]  ? est_fetch_counters+0xa0/0xa0
[  105.782210]  ? lock_acquire+0x12e/0x290
[  105.786410]  call_timer_fn+0x161/0x550
[  105.790512]  ? est_fetch_counters+0xa0/0xa0
[  105.795055]  ? del_timer_sync+0xd0/0xd0
[  105.799249]  ? __lock_is_held+0x93/0x110
[  105.803531]  ? mark_held_locks+0x20/0xe0
[  105.807813]  ? _raw_spin_unlock_irq+0x29/0x40
[  105.812525]  ? est_fetch_counters+0xa0/0xa0
[  105.817069]  ? est_fetch_counters+0xa0/0xa0
[  105.821610]  run_timer_softirq+0x3c4/0x9f0
[  105.826064]  ? lock_acquire+0x12e/0x290
[  105.830257]  ? __bpf_trace_timer_class+0x10/0x10
[  105.835237]  ? __lock_is_held+0x25/0x110
[  105.839517]  __do_softirq+0x11d/0x7bf
[  105.843542]  irq_exit+0x140/0x190
[  105.847208]  smp_apic_timer_interrupt+0xac/0x3b0
[  105.852182]  apic_timer_interrupt+0xf/0x20
[  105.856628]  </IRQ>
[  105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0
[  105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53
[  105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1
[  105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac
[  105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000
[  105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
[  105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b
[  105.929988]  ? trace_hardirqs_on_caller+0x141/0x2d0
[  105.935232]  do_idle+0x28e/0x320
[  105.938817]  ? arch_cpu_idle_exit+0x40/0x40
[  105.943361]  ? mark_lock+0x8c1/0x980
[  105.947295]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  105.952619]  cpu_startup_entry+0xc2/0xd0
[  105.956900]  ? cpu_in_idle+0x20/0x20
[  105.960830]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  105.966146]  ? trace_hardirqs_on_caller+0x141/0x2d0
[  105.971391]  start_secondary+0x2b5/0x360
[  105.975669]  ? set_cpu_sibling_map+0x1330/0x1330
[  105.980654]  secondary_startup_64+0xa5/0xb0
Taking tcf_lock in sample action with bh disabled causes lockdep to issue a
warning regarding possible irq lock inversion dependency between tcf_lock,
and psample_groups_lock that is taken when holding tcf_lock in sample init:
[  162.108959]  Possible interrupt unsafe locking scenario:
[  162.116386]        CPU0                    CPU1
[  162.121277]        ----                    ----
[  162.126162]   lock(psample_groups_lock);
[  162.130447]                                local_irq_disable();
[  162.136772]                                lock(&(&p->tcfa_lock)->rlock);
[  162.143957]                                lock(psample_groups_lock);
[  162.150813]   <Interrupt>
[  162.153808]     lock(&(&p->tcfa_lock)->rlock);
[  162.158608]
                *** DEADLOCK ***
In order to prevent potential lock inversion dependency between tcf_lock
and psample_groups_lock, extract call to psample_group_get() from tcf_lock
protected section in sample action init function.
Fixes: 4e232818bd32 ("net: sched: act_mirred: remove dependency on rtnl lock")
Fixes: 764e9a24480f ("net: sched: act_vlan: remove dependency on rtnl lock")
Fixes: 729e01260989 ("net: sched: act_tunnel_key: remove dependency on rtnl lock")
Fixes: d77284956656 ("net: sched: act_sample: remove dependency on rtnl lock")
Fixes: e8917f437006 ("net: sched: act_gact: remove dependency on rtnl lock")
Fixes: b6a2b971c0b0 ("net: sched: act_csum: remove dependency on rtnl lock")
Fixes: 2142236b4584 ("net: sched: act_bpf: remove dependency on rtnl lock")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
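
The reordering described in the commit message above can be illustrated outside the kernel. What follows is a minimal Python sketch, not the actual patch: `groups_lock`, `group_get()` and `SampleAction` are hypothetical stand-ins for psample_groups_lock, psample_group_get() and the sample action, and ordinary thread locks do not model interrupt context. It only shows the group reference being taken before the action lock instead of inside it, so the two locks are never nested.

```python
import threading

# Hypothetical stand-in for psample_groups_lock and the group table.
groups_lock = threading.Lock()
groups = {}  # group number -> reference count


def group_get(num):
    """Look up or create a group; takes groups_lock internally."""
    with groups_lock:
        groups[num] = groups.get(num, 0) + 1
        return num


class SampleAction:
    """Hypothetical stand-in for the sample action and its tcf_lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.group = None
        self.rate = 0

    def init(self, group_num, rate):
        # Take the group reference first, with no action lock held,
        # so groups_lock is never acquired inside self.lock.
        group = group_get(group_num)
        with self.lock:
            # Only plain field updates happen under the action lock.
            self.rate = rate
            self.group = group
```

Because `group_get()` runs before `self.lock` is taken, no code path in this sketch holds the action lock while waiting for the groups lock, which is the dependency lockdep complained about.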
 
Linus Torvalds [Sun, 19 Aug 2018 17:38:36 +0000 (10:38 -0700)]
 
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull first set of KVM updates from Paolo Bonzini:
 "PPC:
   - minor code cleanups
  x86:
   - PCID emulation and CR3 caching for shadow page tables
   - nested VMX live migration
   - nested VMCS shadowing
   - optimized IPI hypercall
   - some optimizations
  ARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
  kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
  KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
  KVM: X86: Implement PV IPIs in linux guest
  KVM: X86: Add kvm hypervisor init time platform setup callback
  KVM: X86: Implement "send IPI" hypercall
  KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
  KVM: x86: Skip pae_root shadow allocation if tdp enabled
  KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
  KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
  KVM: vmx: move struct host_state usage to struct loaded_vmcs
  KVM: vmx: compute need to reload FS/GS/LDT on demand
  KVM: nVMX: remove a misleading comment regarding vmcs02 fields
  KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
  KVM: vmx: add dedicated utility to access guest's kernel_gs_base
  KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
  KVM: vmx: refactor segmentation code in vmx_save_host_state()
  kvm: nVMX: Fix fault priority for VMX operations
  kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
  ...
Linus Torvalds [Sun, 19 Aug 2018 16:56:38 +0000 (09:56 -0700)]
 
Merge tag 'riscv-for-linus-4.19-mw0' of git://git./linux/kernel/git/palmer/riscv-linux
Pull RISC-V updates from Palmer Dabbelt:
 "This contains some major improvements to the RISC-V port, including
  the necessary interrupt controller and timer support to actually make
  it to userspace. Support for three devices has been added:
   - the ISA-mandated timers on RISC-V systems.
   - the ISA-mandated first-level interrupt controller on RISC-V
     systems, which is handled as part of our core arch code because
     it's very small and tightly tied to the ISA.
   - SiFive's platform-level interrupt controller, which talks to the
     actual devices.
  In addition to these new devices, there are a handful of cleanups all
  over the RISC-V tree:
   - build fixes for various configurations:
      * A fix to the vDSO build's makefile so it respects CFLAGS.
      * The addition of __lshrti3, a libgcc derived function necessary
        for some 32-bit configurations.
      * !SMP && PERF_EVENTS
   - Cleanups to the arch code to remove the remnants of old versions of
     the drivers that were just properly submitted.
      * Some dead code from the timer driver, most of which wasn't ever
        even compiled.
      * Cleanups of some interrupt #defines, which are now local to the
        interrupt handling code.
   - Fixes to ptrace(), which while not being sufficient to fully make
     GDB work are at least sufficient to get simple GDB tasks to work.
   - Early printk support via RISC-V's architecturally mandated SBI
     console device.
   - A fix to our early debug trap handler to ensure it's always
     aligned.
  These patches have all been through a fairly extensive review process,
  but as this enables a whole pile of functionality (ie, userspace) I'm
  confident we'll need to submit a few more patches. The only concrete
  issues I know about are the sys_riscv_flush_icache patches, but as I
  managed to screw those up on Friday I figured it'd be best to let them
  bake another week.
  This tag boots a Fedora root filesystem on QEMU's master branch for
  me, and before this morning's rebase (from 4.18-rc8 to 4.18) it booted
  on the HiFive Unleashed.
  Thanks to Christoph Hellwig and the other guys at WD for getting the
  new drivers in shape!"
* tag 'riscv-for-linus-4.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
  dt-bindings: interrupt-controller: SiFive Plaform Level Interrupt Controller
  dt-bindings: interrupt-controller: RISC-V local interrupt controller
  RISC-V: Fix !CONFIG_SMP compilation error
  irqchip: add a SiFive PLIC driver
  RISC-V: Add the directive for alignment of stvec's value
  clocksource: new RISC-V SBI timer driver
  RISC-V: implement low-level interrupt handling
  RISC-V: add a definition for the SIE SEIE bit
  RISC-V: remove INTERRUPT_CAUSE_* defines from asm/irq.h
  RISC-V: simplify software interrupt / IPI code
  RISC-V: remove timer leftovers
  RISC-V: Add early printk support via the SBI console
  RISC-V: Don't increment sepc after breakpoint.
  RISC-V: implement __lshrti3.
  RISC-V: Use KBUILD_CFLAGS instead of KCFLAGS when building the vDSO
Linus Torvalds [Sun, 19 Aug 2018 16:30:44 +0000 (09:30 -0700)]
 
Merge tag 'char-misc-4.19-rc1' of git://git./linux/kernel/git/gregkh/char-misc
Pull UIO fix from Greg KH:
 "Here is a single UIO fix that I forgot to send before 4.18-final came
  out. It reverts a UIO patch that went in the 4.18 development window
  that was causing problems.
  This patch has been in linux-next for a while with no problems, I just
  forgot to send it earlier, or as part of the larger char/misc patch
  series from yesterday, my fault"
* tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  Revert "uio: use request_threaded_irq instead"
Linus Torvalds [Sat, 18 Aug 2018 23:48:07 +0000 (16:48 -0700)]
 
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
 - a new driver for Rohm BU21029 touch controller
 - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free
 - updates to Atmel, eeti, pxrc and iforce drivers
 - assorted driver cleanups and fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
  MAINTAINERS: Add PhoenixRC Flight Controller Adapter
  Input: do not use WARN() in input_alloc_absinfo()
  Input: mark expected switch fall-throughs
  Input: raydium_i2c_ts - use true and false for boolean values
  Input: evdev - switch to bitmap API
  Input: gpio-keys - switch to bitmap_zalloc()
  Input: elan_i2c_smbus - cast sizeof to int for comparison
  bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
  md: Avoid namespace collision with bitmap API
  dm: Avoid namespace collision with bitmap API
  Input: pm8941-pwrkey - add resin entry
  Input: pm8941-pwrkey - abstract register offsets and event code
  Input: iforce - reorganize joystick configuration lists
  Input: atmel_mxt_ts - move completion to after config crc is updated
  Input: atmel_mxt_ts - don't report zero pressure from T9
  Input: atmel_mxt_ts - zero terminate config firmware file
  Input: atmel_mxt_ts - refactor config update code to add context struct
  Input: atmel_mxt_ts - config CRC may start at T71
  Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
  Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
  ...
Linus Torvalds [Sat, 18 Aug 2018 23:45:27 +0000 (16:45 -0700)]
 
Merge tag 'hwlock-v4.19' of git://github.com/andersson/remoteproc
Pull hwspinlock updates from Bjorn Andersson:
 "This introduces devres helpers and an API to request a lock by name,
  then migrates the sprd SPI driver to use these"
* tag 'hwlock-v4.19' of git://github.com/andersson/remoteproc:
  hwspinlock: Fix incorrect return pointers
  spi: sprd: Change to use devm_hwspin_lock_request_specific()
  spi: sprd: Replace of_hwspin_lock_get_id() with of_hwspin_lock_get_id_byname()
  hwspinlock: Fix one comment mistake
  hwspinlock: Remove redundant config
  hwspinlock: Add devm_xxx() APIs to register/unregister one hwlock controller
  hwspinlock: Add devm_xxx() APIs to request/free hwlock
  hwspinlock: Add one new API to support getting a specific hwlock by the name
Linus Torvalds [Sat, 18 Aug 2018 23:43:57 +0000 (16:43 -0700)]
 
Merge tag 'rpmsg-v4.19' of git://github.com/andersson/remoteproc
Pull rpmsg updates from Bjorn Andersson:
 "This fixes a few compile and kerneldoc warnings, allows rpmsg devices
  to handle power domains, allows labeling GLINK edges and supports
  compat for rpmsg_char"
* tag 'rpmsg-v4.19' of git://github.com/andersson/remoteproc:
  rpmsg: Add compat ioctl for rpmsg char driver
  rpmsg: glink: Store edge name for glink device
  dt-bindings: soc: qcom: Add label for GLINK bindings
  rpmsg: core: add support to power domains for devices
  rpmsg: smd: fix kerneldoc warnings
  rpmsg: glink: Fix various kerneldoc warnings.
  rpmsg: glink: correctly annotate intent members
  rpmsg: smd: Add missing include of sizes.h
Linus Torvalds [Sat, 18 Aug 2018 23:42:04 +0000 (16:42 -0700)]
 
Merge tag 'rproc-v4.19' of git://github.com/andersson/remoteproc
Pull remoteproc updates from Bjorn Andersson:
 "This adds support for pre-start and post-shutdown hooks for remoteproc
  subdevices, refactors the Qualcomm Hexagon support to allow reuse
  between several drivers, makes authentication in the MDT file loader
  optional, migrates a few format strings to use %pK and migrates the
  Davinci driver to use the reset framework"
* tag 'rproc-v4.19' of git://github.com/andersson/remoteproc:
  remoteproc/davinci: use the reset framework
  remoteproc/davinci: Mark error recovery as disabled
  remoteproc: st_slim: replace "%p" with "%pK"
  remoteproc: replace "%p" with "%pK"
  remoteproc: qcom: fix Q6V5_WCSS dependencies
  remoteproc: Reset table_ptr in rproc_start() failure paths
  remoteproc: qcom: q6v5-pil: fix modem hang on SDM845 after axis2 clk unvote
  remoteproc: qcom q6v5: fix modular build
  remoteproc: Introduce prepare and unprepare for subdevices
  remoteproc: rename subdev probe and remove functions
  remoteproc: Make client initialize ops in rproc_subdev
  remoteproc: Make start and stop in subdev optional
  remoteproc: Rename subdev functions to start/stop
  remoteproc: qcom: Introduce Hexagon V5 based WCSS driver
  remoteproc: qcom: q6v5-pil: Use common q6v5 helpers
  remoteproc: qcom: adsp: Use common q6v5 helpers
  remoteproc: q6v5: Extract common resource handling
  remoteproc: qcom: mdt_loader: Make the firmware authentication optional
Linus Torvalds [Sat, 18 Aug 2018 23:16:57 +0000 (16:16 -0700)]
 
Merge tag 'linux-watchdog-4.19-rc1' of git://linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
 - add MEN 16z069 IP-Core driver
 - renesas-wdt: add support for the R8A77990 wdt
 - stm32_iwdg: Add stm32mp1 support and pclk feature
 - sp805_wdt, orion_wdt, sprd_wdt: several improvements
 - imx2_wdt, stmp3xxx: switch to SPDX identifier
* tag 'linux-watchdog-4.19-rc1' of git://www.linux-watchdog.org/linux-watchdog:
  watchdog: fix dependencies of menz69_wdt.o
  watchdog: sp805: Add clock-frequency property
  watchdog: add driver for the MEN 16z069 IP-Core
  watchdog: sprd_wdt: Remove redundant dev_err call in sprd_wdt_probe()
  watchdog: stmp3xxx: Switch to SPDX identifier
  watchdog: imx2_wdt: Switch to SPDX identifier
  watchdog: sp805: set WDOG_HW_RUNNING when appropriate
  watchdog: sp805: add 'timeout-sec' DT property support
  dt-bindings: watchdog: Add optional 'timeout-sec' property for sp805
  dt-bindings: watchdog: Consolidate SP805 binding docs
  watchdog: orion_wdt: Mark watchdog as active when running at probe
  watchdog: stm32: add pclk feature for stm32mp1
  dt-bindings: watchdog: add stm32mp1 support
  dt-bindings: watchdog: renesas-wdt: Add support for the R8A77990 wdt
Linus Torvalds [Sat, 18 Aug 2018 22:55:59 +0000 (15:55 -0700)]
 
Merge tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull DMAengine updates from Vinod Koul:
 "This round brings couple of framework changes, a new driver and usual
  driver updates:
   - new managed helper for dmaengine framework registration
   - split dmaengine pause capability to pause and resume and allow
     drivers to report that individually
   - update dma_request_chan_by_mask() to handle deferred probing
   - move imx-sdma to use virt-dma
   - new driver for Actions Semi Owl family S900 controller
   - minor updates to intel, renesas, mv_xor, pl330 etc"
* tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (46 commits)
  dmaengine: Add Actions Semi Owl family S900 DMA driver
  dt-bindings: dmaengine: Add binding for Actions Semi Owl SoCs
  dmaengine: sh: rcar-dmac: Should not stop the DMAC by rcar_dmac_sync_tcr()
  dmaengine: mic_x100_dma: use the new helper to simplify the code
  dmaengine: add a new helper dmaenginem_async_device_register
  dmaengine: imx-sdma: add memcpy interface
  dmaengine: imx-sdma: add SDMA_BD_MAX_CNT to replace '0xffff'
  dmaengine: dma_request_chan_by_mask() to handle deferred probing
  dmaengine: pl330: fix irq race with terminate_all
  dmaengine: Revert "dmaengine: mv_xor_v2: enable COMPILE_TEST"
  dmaengine: mv_xor_v2: use {lower,upper}_32_bits to configure HW descriptor address
  dmaengine: mv_xor_v2: enable COMPILE_TEST
  dmaengine: mv_xor_v2: move unmap to before callback
  dmaengine: mv_xor_v2: convert callback to helper function
  dmaengine: mv_xor_v2: kill the tasklets upon exit
  dmaengine: mv_xor_v2: explicitly freeup irq
  dmaengine: sh: rcar-dmac: Add dma_pause operation
  dmaengine: sh: rcar-dmac: add a new function to clear CHCR.DE with barrier
  dmaengine: idma64: Support dmaengine_terminate_sync()
  dmaengine: hsu: Support dmaengine_terminate_sync()
  ...
Linus Torvalds [Sat, 18 Aug 2018 22:54:05 +0000 (15:54 -0700)]
 
Merge tag 'mmc-v4.19' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC updates from Ulf Hansson:
 "Updates for MMC for v4.19.
  MMC core:
   - Add some fine-grained hooks to further support HS400 tuning
   - Improve error path for bus width setting for HS400es
   - Use a common method when checking R1 status
  MMC host:
   - renesas_sdhi: Add r8a77990 support
   - renesas_sdhi: Add eMMC HS400 mode support
   - tmio/renesas_sdhi: Improve tuning/clock management
   - tmio: Add eMMC HS400 mode support
   - sunxi: Add support for 3.3V eMMC DDR mode
   - mmci: Initial support to manage variant specific callbacks
   - sdhci: Don't try 3.3V I/O voltage if not supported
   - sdhci-pci-dwc-mshc: Add driver to support Synopsys dwc mshc SDHCI PCI
   - sdhci-of-dwcmshc: Add driver to support Synopsys DWC MSHC SDHCI
   - sdhci-msm: Add support for new version sdcc V5
   - sdhci-pci-o2micro: Add support for O2 eMMC HS200 mode
   - sdhci-pci-o2micro: Add support for O2 hardware tuning
   - sdhci-pci-o2micro: Add MSI interrupt support for O2 SD host
   - sdhci-pci: Add support for Intel ICP
   - sdhci-tegra: Prevent ACMD23 and HS200 mode on Tegra 3
   - sdhci-tegra: Fix eMMC DDR52 mode
   - sdhci-tegra: Improve clock management
   - dw_mmc-rockchip: Document compatible string for px30
   - sdhci-esdhc-imx: Add support for 3.3V eMMC DDR mode
   - sdhci-of-esdhc: Set proper DMA mask for ls104x chips
   - sdhci-of-esdhc: Improve clock management
   - sdhci-of-arasan: Add a quirk to manage unstable clocks
   - dw_mmc-exynos: Address potential external abort during system resume
   - pxamci: Add support for common MMC DT bindings
   - pxamci: Several cleanups and improvements
   - pxamci: Merge immutable branch for pxa to switch to DMA slave maps"
* tag 'mmc-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (56 commits)
  mmc: core: improve reasonableness of bus width setting for HS400es
  mmc: tmio: remove unneeded variable in tmio_mmc_start_command()
  mmc: renesas_sdhi: Fix sampling clock position selecting
  mmc: tmio: Fix tuning flow
  mmc: sunxi: remove output of virtual base address
  dt-bindings: mmc: rockchip-dw-mshc: add description for px30
  mmc: renesas_sdhi: Add r8a77990 support
  mmc: sunxi: allow 3.3V DDR when DDR is available
  mmc: mmci: Add and implement a ->dma_setup() callback for qcom dml
  mmc: mmci: Initial support to manage variant specific callbacks
  mmc: tegra: Force correct divider calculation on DDR50/52
  mmc: sdhci: Add MSI interrupt support for O2 SD host
  mmc: sdhci: Add support for O2 hardware tuning
  mmc: sdhci: Export sdhci tuning function symbol
  mmc: sdhci: Change O2 Host HS200 mode clock frequency to 200MHz
  mmc: sdhci: Add support for O2 eMMC HS200 mode
  mmc: tegra: Add and use tegra_sdhci_get_max_clock()
  mmc: sdhci-esdhc-imx: fix indent
  mmc: sdhci-esdhc-imx: disable clocks before changing frequency
  mmc: tegra: prevent ACMD23 on Tegra 3
  ...
Haishuang Yan [Sat, 18 Aug 2018 14:43:48 +0000 (22:43 +0800)]
 
ip6_vti: simplify stats handling in vti6_xmit
Same as ip_vti, use iptunnel_xmit_stats to update stats in the tunnel xmit
code path.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 18 Aug 2018 19:30:42 +0000 (12:30 -0700)]
 
pcmcia: remove long deprecated pcmcia_request_exclusive_irq() function
This function was created as a deprecated fallback case back in 2010 by
commit eb14120f743d ("pcmcia: re-work pcmcia_request_irq()") for legacy
cases.
Actual in-kernel users haven't been around for a long while.  The last
in-kernel user was apparently removed four years ago by commit
5f5316fcd08e ("am2150: Update nmclan_cs.c to use update PCMCIA API").
Just remove it entirely.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Linus Torvalds [Sat, 18 Aug 2018 19:19:56 +0000 (12:19 -0700)]
 
deprecate the '__deprecated' attribute warnings entirely and for good
We haven't had lots of deprecation warnings lately, but the rdma use of
it made them flare up again.
They are not useful.  They annoy everybody, and nobody ever does
anything about them, because it's always "somebody else's problem".  And
when people start thinking that warnings are normal, they stop looking
at them, and the real warnings that mean something go unnoticed.
If you want to get rid of a function, just get rid of it.  Convert every
user to the new world order.
And if you can't do that, then don't annoy everybody else with your
marking that says "I couldn't be bothered to fix this, so I'll just spam
everybody else's build logs with warnings about my laziness".
Make a kernelnewbies wiki page about things that could be cleaned up,
write a blog post about it, or talk to people on the mailing lists.  But
don't add warnings to the kernel build about cleanup that you think
should happen but you aren't doing yourself.
Don't.  Just don't.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 18 Aug 2018 18:44:53 +0000 (11:44 -0700)]
 
Merge tag 'driver-core-4.19-rc1' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
 "Here are all of the driver core and related patches for 4.19-rc1.
  Nothing huge here, just a number of small cleanups and the ability to
  now stop the deferred probing after init happens.
  All of these have been in linux-next for a while with only a merge
  issue reported"
* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
  base: core: Remove WARN_ON from link dependencies check
  drivers/base: stop new probing during shutdown
  drivers: core: Remove glue dirs from sysfs earlier
  driver core: remove unnecessary function extern declare
  sysfs.h: fix non-kernel-doc comment
  PM / Domains: Stop deferring probe at the end of initcall
  iommu: Remove IOMMU_OF_DECLARE
  iommu: Stop deferring probe at end of initcalls
  pinctrl: Support stopping deferred probe after initcalls
  dt-bindings: pinctrl: add a 'pinctrl-use-default' property
  driver core: allow stopping deferred probe after init
  driver core: add a debugfs entry to show deferred devices
  sysfs: Fix internal_create_group() for named group updates
  base: fix order of OF initialization
  linux/device.h: fix kernel-doc notation warning
  Documentation: update firmware loader fallback reference
  kobject: Replace strncpy with memcpy
  drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
  kernfs: Replace strncpy with memcpy
  device: Add #define dev_fmt similar to #define pr_fmt
  ...
Linus Torvalds [Sat, 18 Aug 2018 18:04:51 +0000 (11:04 -0700)]
 
Merge tag 'char-misc-4.19-rc1' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
 "Here is the bit set of char/misc drivers for 4.19-rc1
  There is a lot here, much more than normal, seems like everyone is
  writing new driver subsystems these days... Anyway, major things here
  are:
   - new FSI driver subsystem, yet-another-powerpc low-level hardware
     bus
   - gnss, finally an in-kernel GPS subsystem to try to tame all of the
     crazy out-of-tree drivers that have been floating around for years,
     combined with some really hacky userspace implementations. This is
     only for GNSS receivers, but you have to start somewhere, and this
     is great to see.
  Other than that, there are new slimbus drivers, new coresight drivers,
  new fpga drivers, and loads of DT bindings for all of these and
  existing drivers.
  All of these have been in linux-next for a while with no reported
  issues"
* tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
  android: binder: Rate-limit debug and userspace triggered err msgs
  fsi: sbefifo: Bump max command length
  fsi: scom: Fix NULL dereference
  misc: mic: SCIF Fix scif_get_new_port() error handling
  misc: cxl: changed asterisk position
  genwqe: card_base: Use true and false for boolean values
  misc: eeprom: assignment outside the if statement
  uio: potential double frees if __uio_register_device() fails
  eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
  misc: ti-st: Fix memory leak in the error path of probe()
  android: binder: Show extra_buffers_size in trace
  firmware: vpd: Fix section enabled flag on vpd_section_destroy
  platform: goldfish: Retire pdev_bus
  goldfish: Use dedicated macros instead of manual bit shifting
  goldfish: Add missing includes to goldfish.h
  mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
  dt-bindings: mux: add adi,adgs1408
  Drivers: hv: vmbus: Cleanup synic memory free path
  Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
  Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
  ...
Linus Torvalds [Sat, 18 Aug 2018 18:00:00 +0000 (11:00 -0700)]
 
Merge tag 'staging-4.19-rc1' of git://git./linux/kernel/git/gregkh/staging
Pull staging and IIO updates from Greg KH:
 "Here are the big staging/iio patches for 4.19-rc1.
  Lots of churn here, with tons of cleanups happening in staging
  drivers, a removal of an old crypto driver that no one was using
  (skein), and the addition of some new IIO drivers. Also added was a
  "gasket" driver from Google that needs loads of work and the erofs
  filesystem.
  Even with adding all of the new drivers and a new filesystem, we are
  only adding about 1000 lines overall to the kernel linecount, which
  shows just how much cleanup happened, and how big the unused crypto
  driver was.
  All of these have been in the linux-next tree for a while now with no
  reported issues"
* tag 'staging-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (903 commits)
  staging:rtl8192u: Remove unused macro definitions - Style
  staging:rtl8192u: Add spaces around '+' operator - Style
  staging:rtl8192u: Remove stale comment - Style
  staging: rtl8188eu: remove unused mp_custom_oid.h
  staging: fbtft: Add spaces around / - Style
  staging: fbtft: Erases some repetitive usage of function name - Style
  staging: fbtft: Adjust some empty-line problems - Style
  staging: fbtft: Removes one nesting level to help readability - Style
  staging: fbtft: Changes gamma table to define.
  staging: fbtft: A bit more information on dev_err.
  staging: fbtft: Fixes some alignment issues - Style
  staging: fbtft: Puts macro arguments in parenthesis to avoid precedence issues - Style
  staging: rtl8188eu: remove unused array dB_Invert_Table
  staging: rtl8188eu: remove whitespace, add missing blank line
  staging: rtl8188eu: use is_multicast_ether_addr in rtw_sta_mgt.c
  staging: rtl8188eu: remove whitespace - style
  staging: rtl8188eu: cleanup block comment - style
  staging: rtl8188eu: use is_multicast_ether_addr in rtl8188eu_xmit.c
  staging: rtl8188eu: use is_multicast_ether_addr in recv_linux.c
  staging: rtlwifi: refactor rtl_get_tcb_desc
  ...
Linus Torvalds [Sat, 18 Aug 2018 17:50:41 +0000 (10:50 -0700)]
 
Merge tag 'tty-4.19-rc1' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
 "Here is the big tty and serial driver pull request for 4.19-rc1.
  It's not all that big, just a number of small serial driver updates
  and fixes, along with some better vt handling for unicode characters
  for those using braille terminals.
  All of these patches have been in linux-next for a long time with no
  reported issues"
* tag 'tty-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (73 commits)
  tty: serial: 8250: Revert NXP SC16C2552 workaround
  serial: 8250_exar: Read INT0 from slave device, too
  tty: rocket: Fix possible buffer overwrite on register_PCI
  serial: 8250_dw: Add ACPI support for uart on Broadcom SoC
  serial: 8250_dw: always set baud rate in dw8250_set_termios
  dt-bindings: serial: Add binding for uartlite
  tty: serial: uartlite: Add support for suspend and resume
  tty: serial: uartlite: Add clock adaptation
  tty: serial: uartlite: Add structure for private data
  serial: sh-sci: Improve support for separate TEI and DRI interrupts
  serial: sh-sci: Remove SCIx_RZ_SCIFA_REGTYPE
  serial: sh-sci: Allow for compressed SCIF address
  serial: sh-sci: Improve interrupts description
  serial: 8250: Use cached port name directly in messages
  serial: 8250_exar: Drop unused variable in pci_xr17v35x_setup()
  vt: drop unused struct vt_struct
  vt: avoid a VLA in the unicode screen scroll function
  vt: add /dev/vcsu* to devices.txt
  vt: coherence validation code for the unicode screen buffer
  vt: selection: take screen contents from uniscr if available
  ...
Linus Torvalds [Sat, 18 Aug 2018 17:21:49 +0000 (10:21 -0700)]
 
Merge tag 'usb-4.19-rc1' of git://git./linux/kernel/git/gregkh/usb
Pull USB/PHY updates from Greg KH:
 "Here is the big USB and phy driver patch set for 4.19-rc1.
  Nothing huge but there was a lot of work that happened this
  development cycle:
   - lots of type-c work, with drivers graduating out of staging, and
     displayport support being added.
   - new PHY drivers
   - the normal collection of gadget driver updates and fixes
   - code churn to work on the urb handling path, using irqsave()
     everywhere in anticipation of making this codepath a lot simpler in
     the future.
   - usbserial driver fixes and reworks
   - other misc changes
  All of these have been in linux-next with no reported issues for a
  while"
* tag 'usb-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (159 commits)
  USB: serial: pl2303: add a new device id for ATEN
  usb: renesas_usbhs: Kconfig: convert to SPDX identifiers
  usb: dwc3: gadget: Check MaxPacketSize from descriptor
  usb: dwc2: Turn on uframe_sched on "stm32f4x9_fsotg" platforms
  usb: dwc2: Turn on uframe_sched on "amlogic" platforms
  usb: dwc2: Turn on uframe_sched on "his" platforms
  usb: dwc2: Turn on uframe_sched on "bcm" platforms
  usb: dwc2: gadget: ISOC's starting flow improvement
  usb: dwc2: Make dwc2_readl/writel functions endianness-agnostic.
  usb: dwc3: core: Enable AutoRetry feature in the controller
  usb: dwc3: Set default mode for dwc_usb31
  usb: gadget: udc: renesas_usb3: Add register of usb role switch
  usb: dwc2: replace ioread32/iowrite32_rep with dwc2_readl/writel_rep
  usb: dwc2: Modify dwc2_readl/writel functions prototype
  usb: dwc3: pci: Intel Merrifield can be host
  usb: dwc3: pci: Supply device properties via driver data
  arm64: dts: dwc3: description of incr burst type
  usb: dwc3: Enable undefined length INCR burst type
  usb: dwc3: add global soc bus configuration reg0
  usb: dwc3: Describe 'wakeup_work' field of struct dwc3_pci
  ...
David S. Miller [Sat, 18 Aug 2018 17:02:49 +0000 (10:02 -0700)]
 
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2018-08-18
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
   restrictions, from Yonghong.
2) Fix a suspicious RCU rcu_dereference_check() warning triggered
   from removing a device's XDP memory allocator by using the correct
   rhashtable lookup function, from Tariq.
3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
   as well as enforcing module aliases for ULPs. Another fix for BPF
   map redirect to make them work again with tail calls, from Daniel.
4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 18 Aug 2018 16:59:19 +0000 (09:59 -0700)]
 
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Infinite loop in IPVS when net namespace is released, from
   Tan Hu.
2) Do not show negative timeouts in ip_vs_conn by using the new
   jiffies_delta_to_msecs(), patches from Matteo Croce.
3) Set F_IFACE flag for linklocal addresses in ip6t_rpfilter,
   from Florian Westphal.
4) Fix overflow in set size allocation, from Taehee Yoo.
5) Use netlink_dump_start() from ctnetlink to fix memleak from
   the error path, again from Florian.
6) Register nfnetlink_subsys in last place, otherwise netns
   init path may lose race and see net->nft uninitialized data.
   This also reverts the previous attempt to fix this by increasing
   the netns refcount, patches from Florian.
7) Remove conntrack entries on layer 4 protocol tracker module
   removal, from Florian.
8) Use GFP_KERNEL_ACCOUNT for xtables blob allocation, from
   Michal Hocko.
9) Get tproxy documentation in sync with existing codebase,
   from Mate Eckl.
10) Honor preset layer 3 protocol via ctx->family in the new nft_ct
    timeout infrastructure, from Harsha Sharma.
11) Let uapi nfnetlink_osf.h compile standalone with no errors,
    from Dmitry V. Levin.
12) Missing braces compilation warning in nft_tproxy, patch from
    Mate Eckl.
13) Disregard bogus check to bail out on non-anonymous sets from
    the dynamic set update extension.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
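
Item 2 in the list above concerns timeouts that have already expired being shown as negative values. The sketch below shows one way such a conversion can avoid that, assuming the helper simply clamps a negative tick delta to zero before converting. The name `delta_to_msecs` and the tick rate `HZ = 250` are illustrative assumptions, not taken from the patches.

```python
HZ = 250  # assumed tick rate, ticks per second


def delta_to_msecs(delta_ticks, hz=HZ):
    """Convert a signed tick delta to milliseconds, never below zero."""
    # A timer that has already expired yields a negative delta;
    # report it as 0 ms remaining instead of a negative timeout.
    return max(0, delta_ticks) * 1000 // hz
```

With this clamp, an expired entry is reported with a remaining time of zero instead of a negative number.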
Linus Torvalds [Sat, 18 Aug 2018 00:27:58 +0000 (17:27 -0700)]
 
Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
 "This contains mostly fixes (6 to be backported to stable) and a few
  changes, here is the breakdown:
   - rework how fids are attributed by replacing some custom tracking in
     a list by an idr
   - for packet-based transports (virtio/rdma) validate that the packet
     length matches what the header says
   - a few race condition fixes found by syzkaller
   - missing argument check when NULL device is passed in sys_mount
   - a few virtio fixes
   - some spelling and style fixes"
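A minimal sketch of the packet-length check mentioned in the list above,
assuming the standard 9P header layout (size[4] type[1] tag[2], with a
little-endian size that counts the whole message).  The names
SKETCH_9P_HDR_LEN and sketch_pdu_len_ok() are made up for illustration and
are not the functions touched by this pull:

```c
#include <stddef.h>
#include <stdint.h>

/* 9P header: size[4] type[1] tag[2]; size includes the header itself. */
#define SKETCH_9P_HDR_LEN 7

/*
 * Return 1 if the length claimed in the header equals the number of
 * bytes the packet-based transport actually delivered, 0 otherwise.
 */
static int sketch_pdu_len_ok(const uint8_t *buf, size_t received)
{
	uint32_t claimed;

	if (received < SKETCH_9P_HDR_LEN)
		return 0;
	/* Decode the little-endian 32-bit size field byte by byte. */
	claimed = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
		  ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
	return claimed == received;
}
```

A packet whose header claims 9 bytes but arrives with 8 would be rejected
by this check.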
* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
  net/9p/trans_virtio.c: add null terminal for mount tag
  9p/virtio: fix off-by-one error in sg list bounds check
  9p: fix whitespace issues
  9p: fix multiple NULL-pointer-dereferences
  fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
  9p: validate PDU length
  net/9p/trans_fd.c: fix race by holding the lock
  net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
  net/9p/virtio: Fix hard lockup in req_done
  net/9p/trans_virtio.c: fix some spell mistakes in comments
  9p/net: Fix zero-copy path in the 9p virtio transport
  9p: Embed wait_queue_head into p9_req_t
  9p: Replace the fidlist with an IDR
  9p: Change p9_fid_create calling convention
  9p: Fix comment on smp_wmb
  net/9p/client.c: version pointer uninitialized
  fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
  net/9p: fix error path of p9_virtio_probe
  9p/net/protocol.c: return -ENOMEM when kmalloc() failed
  net/9p/client.c: add missing '\n' at the end of p9_debug()
  ...
Linus Torvalds [Fri, 17 Aug 2018 23:49:31 +0000 (16:49 -0700)]
 
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 - a few misc things
 - a few Y2038 fixes
 - ntfs fixes
 - arch/sh tweaks
 - ocfs2 updates
 - most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  mm/hmm.c: remove unused variables align_start and align_end
  fs/userfaultfd.c: remove redundant pointer uwq
  mm, vmacache: hash addresses based on pmd
  mm/list_lru: introduce list_lru_shrink_walk_irq()
  mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
  mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
  mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
  mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
  mm/sparse: delete old sparse_init and enable new one
  mm/sparse: add new sparse_init_nid() and sparse_init()
  mm/sparse: move buffer init/fini to the common place
  mm/sparse: use the new sparse buffer functions in non-vmemmap
  mm/sparse: abstract sparse buffer allocations
  mm/hugetlb.c: don't zero 1GiB bootmem pages
  mm, page_alloc: double zone's batchsize
  mm/oom_kill.c: document oom_lock
  mm/hugetlb: remove gigantic page support for HIGHMEM
  mm, oom: remove sleep from under oom_lock
  kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
  mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
  ...
Colin Ian King [Fri, 17 Aug 2018 22:50:07 +0000 (15:50 -0700)]
 
mm/hmm.c: remove unused variables align_start and align_end
Variables align_start and align_end are being assigned but are never
used hence they are redundant and can be removed.
Cleans up clang warnings:
  warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
  warning: variable 'align_size' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Colin Ian King [Fri, 17 Aug 2018 22:50:01 +0000 (15:50 -0700)]
 
fs/userfaultfd.c: remove redundant pointer uwq
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
  warning: variable 'uwq' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
David Rientjes [Fri, 17 Aug 2018 22:49:58 +0000 (15:49 -0700)]
 
mm, vmacache: hash addresses based on pmd
When perf profiling a wide variety of different workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases.  This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().
VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached in for a particular address.  This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring pfn
5213.
vmacache_find() iterates through all four of current's vmacache slots
when looking up an address.  Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.
This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality.  This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.
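As a rough userspace sketch of the difference (not the kernel macro
itself), assume 4 KiB pages and 2 MiB pmds, i.e. shifts of 12 and 21 as on
x86-64.  The SKETCH_* names and both helpers are hypothetical:

```c
#include <stdint.h>

/* Assumed values for illustration only (x86-64, 4 KiB pages). */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PMD_SHIFT	21
#define SKETCH_VMACACHE_SIZE	4	/* four slots per task */
#define SKETCH_VMACACHE_MASK	(SKETCH_VMACACHE_SIZE - 1)

/* Old scheme: the slot is picked from the page frame number. */
static unsigned int slot_by_pfn(uint64_t addr)
{
	return (unsigned int)((addr >> SKETCH_PAGE_SHIFT) &
			      SKETCH_VMACACHE_MASK);
}

/* New scheme: the slot is picked from the pmd index of the address. */
static unsigned int slot_by_pmd(uint64_t addr)
{
	return (unsigned int)((addr >> SKETCH_PMD_SHIFT) &
			      SKETCH_VMACACHE_MASK);
}
```

With these assumed shifts, the addresses of pfn 5212 and pfn 5213 land in
slots 0 and 1 under slot_by_pfn(), but both land in the same slot under
slot_by_pmd() because they share a pmd.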
[rientjes@google.com: various updates]
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Sebastian Andrzej Siewior [Fri, 17 Aug 2018 22:49:55 +0000 (15:49 -0700)]
 
mm/list_lru: introduce list_lru_shrink_walk_irq()
Provide list_lru_shrink_walk_irq() and let it behave like
list_lru_walk_one() except that it locks the spinlock with
spin_lock_irq().  This is used by scan_shadow_nodes() because its lock
nests within the i_pages lock which is acquired with IRQs disabled.  This
change allows the use of proper locking primitives instead of a
hand-crafted local_irq_disable() plus spin_lock().
There is no EXPORT_SYMBOL provided because the current user is in-kernel
only.
Add list_lru_shrink_walk_irq() which acquires the spinlock with the
proper locking primitives.
Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Sebastian Andrzej Siewior [Fri, 17 Aug 2018 22:49:51 +0000 (15:49 -0700)]
 
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
the first two arguments.  Those two are only used to retrieve struct
list_lru_node.  Since this is already done by the caller of the function
for the locking, we can pass struct list_lru_node* directly and avoid
the dance around it.
Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Sebastian Andrzej Siewior [Fri, 17 Aug 2018 22:49:48 +0000 (15:49 -0700)]
 
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
Move the locking inside __list_lru_walk_one() to its caller.  This is a
preparation step in order to introduce list_lru_walk_one_irq() which
does spin_lock_irq() instead of spin_lock() for the locking.
Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Sebastian Andrzej Siewior [Fri, 17 Aug 2018 22:49:45 +0000 (15:49 -0700)]
 
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".
This series removes the local_irq_disable() around
list_lru_shrink_walk() (as used by mm/workingset) by adding
list_lru_shrink_walk_irq().
Vladimir Davydov preferred this over the `irq' argument which I added to
struct list_lru.
The initial post (of this series) received a Reviewed-by tag by Vladimir
Davydov which I added to each patch of the series.  The series applies
on top of akpm's tree which has Kirill's shrink_slab series and does not
clash with it (akpm asked me to wait a week or so and repost it then).
I tested the code paths by triggering the OOM-killer via memory
overcommit and lockdep did not complain (nor did I see any warnings).
This patch (of 4):
list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
memcg_idx parameter.  The same can be achieved by calling
list_lru_walk_one() and passing NULL as the memcg argument, which then gets
converted into -1.  This is a preparation step for lifting the spin_lock()
call to the caller of __list_lru_walk_one().  Invoke list_lru_walk_one()
instead of __list_lru_walk_one() when possible.
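The NULL-to-minus-one conversion can be sketched in userspace as follows.
Every sketch_* name is a hypothetical stand-in, and the walkers here only
report the index they would use instead of walking a list:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mem_cgroup. */
struct sketch_memcg {
	int kmemcg_id;
};

/* No memcg maps to index -1, as described in the text above. */
static int sketch_memcg_idx(const struct sketch_memcg *memcg)
{
	return memcg ? memcg->kmemcg_id : -1;
}

/* Stand-in for __list_lru_walk_one(): reports the index it was given. */
static int sketch_inner_walk(int memcg_idx)
{
	return memcg_idx;
}

/* Stand-in for list_lru_walk_one(): converts memcg, then delegates. */
static int sketch_walk_one(const struct sketch_memcg *memcg)
{
	return sketch_inner_walk(sketch_memcg_idx(memcg));
}

/* Stand-in for list_lru_walk_node() after this patch. */
static int sketch_walk_node(void)
{
	return sketch_walk_one(NULL);
}
```

Routing the node walk through the public wrapper means only one place
calls the inner function, which is what makes moving the locking into the
caller straightforward.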
Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Huang Ying [Fri, 17 Aug 2018 22:49:41 +0000 (15:49 -0700)]
 
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
to optimize swapping for THP (Transparent Huge Page) without basic
swapping support.
In the original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
split_swap_cluster() will not be built because it is in swapfile.c, but
it will be called in huge_memory.c.  This doesn't trigger a build error
in practice because the call site is enclosed by PageSwapCache(), which
is defined to be constant 0 when CONFIG_SWAP=n.  But this is fragile and
should be fixed.
The comments are fixed too to reflect the latest progress.
Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Pavel Tatashin [Fri, 17 Aug 2018 22:49:37 +0000 (15:49 -0700)]
 
mm/sparse: delete old sparse_init and enable new one
Rename new_sparse_init() to sparse_init() which enables it.  Delete old
sparse_init() and all the code that became obsolete with it.
[pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Pavel Tatashin [Fri, 17 Aug 2018 22:49:33 +0000 (15:49 -0700)]
 
mm/sparse: add new sparse_init_nid() and sparse_init()
sparse_init() needs to temporarily allocate two large buffers: usemap_map
and map_map.  Baoquan He has identified that these buffers are so large
that Linux is not bootable on small memory machines, such as a kdump boot.
The buffers are especially large when CONFIG_X86_5LEVEL is set, as they
are scaled to the maximum physical memory size.
Baoquan provided a fix, which reduces the sizes of these buffers, but it
is much better to get rid of them entirely.
Add a new way to initialize sparse memory: sparse_init_nid(), which only
operates within one memory node, and thus allocates memory either in one
large contiguous block or section by section.  This eliminates the
need for temporary buffers.
To simplify bisecting and review, temporarily name the new sparse_init()
new_sparse_init(); the new interface will be enabled and the old
code removed in the next patch.
Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Pavel Tatashin [Fri, 17 Aug 2018 22:49:30 +0000 (15:49 -0700)]
 
mm/sparse: move buffer init/fini to the common place
Now that both variants of sparse memory use the same buffers to populate
memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
common place.
Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Pavel Tatashin [Fri, 17 Aug 2018 22:49:26 +0000 (15:49 -0700)]
 
mm/sparse: use the new sparse buffer functions in non-vmemmap
Non-vmemmap sparse also allocates a large contiguous chunk of memory and,
if that fails, falls back to smaller allocations.  Use the same functions
to allocate the buffer as vmemmap sparse does.
Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Pavel Tatashin [Fri, 17 Aug 2018 22:49:21 +0000 (15:49 -0700)]
 
mm/sparse: abstract sparse buffer allocations
Patch series "sparse_init rewrite", v6.
In sparse_init() we allocate two large buffers to temporarily hold usemap
and memmap for the whole machine.  However, we can avoid doing that if
we change sparse_init() to operate on a per-node basis instead of doing
it on the whole machine beforehand.
As shown by Baoquan
  http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
the buffers are large enough to prevent the machine from booting on small
memory systems.
Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
This patch (of 5):
When struct pages are allocated for sparse-vmemmap VA layout, we first try
to allocate one large buffer, and then if that fails allocate struct pages
for each section as we go.
The code that allocates the buffer uses global variables and is spread
across several call sites.
Clean up the code by introducing three functions to handle the global
buffer:
sparse_buffer_init()	initialize the buffer
sparse_buffer_fini()	free the remaining part of the buffer
sparse_buffer_alloc()	alloc from the buffer, and if buffer is empty
			return NULL
Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.
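The shape of the three helpers can be sketched in userspace as a simple
bump allocator.  This is only an illustration: the sketch_* names are
hypothetical, malloc() stands in for the early-boot allocator, alignment
is ignored, and because malloc() cannot release just a tail, the fini
step here frees the whole region (invalidating earlier allocations)
rather than only the unused remainder:

```c
#include <stddef.h>
#include <stdlib.h>

static unsigned char *sketch_buf_base;	/* start of the region */
static unsigned char *sketch_buf_cur;	/* next free byte */
static unsigned char *sketch_buf_end;	/* one past the last byte */

/* Grab one large region up front. */
static void sketch_buffer_init(size_t size)
{
	sketch_buf_base = malloc(size);
	sketch_buf_cur = sketch_buf_base;
	sketch_buf_end = sketch_buf_base ? sketch_buf_base + size : NULL;
}

/* Carve size bytes off the region; NULL if absent or exhausted. */
static void *sketch_buffer_alloc(size_t size)
{
	void *ptr;

	if (!sketch_buf_cur ||
	    size > (size_t)(sketch_buf_end - sketch_buf_cur))
		return NULL;
	ptr = sketch_buf_cur;
	sketch_buf_cur += size;
	return ptr;
}

/* Tear down; the kernel version returns only the unused remainder. */
static void sketch_buffer_fini(void)
{
	free(sketch_buf_base);
	sketch_buf_base = sketch_buf_cur = sketch_buf_end = NULL;
}
```

A caller that gets NULL back from the alloc helper is expected to fall
back to allocating section by section, as described above.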
[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Cannon Matthews [Fri, 17 Aug 2018 22:49:17 +0000 (15:49 -0700)]
 
mm/hugetlb.c: don't zero 1GiB bootmem pages
When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
Zeroing out hundreds or thousands of GiB in a single core memset() call
is very slow, and can make early boot last upwards of 20-30 minutes on
multi TiB machines.
The memory does not need to be zero'd as the hugetlb pages are always
zero'd on page fault.
Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0, as opposed to the 25+ minutes
it would take before.
Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Aaron Lu [Fri, 17 Aug 2018 22:49:14 +0000 (15:49 -0700)]
 
mm, page_alloc: double zone's batchsize
To improve the page allocator's performance for order-0 pages, each CPU
has a Per-CPU-Pageset (PCP) per zone.  Whenever an order-0 page is needed,
the PCP is checked first before asking Buddy for pages.  When the PCP is
used up, a batch of pages is fetched from Buddy to improve
performance, and the size of the batch can affect performance.
The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
then, CPUs have evolved a lot and CPU cache sizes have also increased.
Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested that I do two things: first, use a page
allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size works with other workloads.
In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
From this phase's test data, two candidates are chosen: 63 and 127.
In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are run to see how these two
candidates work with these workloads, and a new default is decided
according to their results.
Most test results are flat.  will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop.  Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop.  This performance drop
might be mitigated by others' work on optimizing LRU lock.
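For a feel of the numbers, the pcp->high = 6 * pcp->batch relationship and
the "change" column of the phase-one results can be reproduced with two
small helpers.  Both function names are hypothetical and exist only for
this illustration:

```c
/* Relationship quoted in the text: pcp->high = 6 * pcp->batch. */
static unsigned int sketch_pcp_high(unsigned int batch)
{
	return 6 * batch;
}

/* Relative change of a score against a baseline score, in percent. */
static double sketch_change_pct(double baseline, double score)
{
	return (score - baseline) / baseline * 100.0;
}
```

Under that relationship, going from batch 31 to batch 63 moves the high
mark from 186 to 378 pages.  Feeding the Skylake-EX scores for batch 31
(15345900) and batch 63 (17992886) into sketch_change_pct() gives roughly
17.25, matching the +17.25% listed in the phase-one results.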
Another downside of increasing pcp->batch is that, when the PCP is used up
and needs to fetch a batch of pages from Buddy, the fetch can take longer
than before because the batch is larger.  My understanding is that this
doesn't affect the slowpath, where direct reclaim and compaction dominate.
For the fastpath, throughput is a win (according to
will-it-scale/page_fault1) but worst-case latency can be larger now.
Overall, I think doubling the batch size from 31 to 63 is relatively safe
and provides a good performance boost for high-core-count systems.
The two phases' test results are listed below (all tests are done with THP
disabled).
Phase one(will-it-scale/page_fault1) test results:
Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.
batch   score     change   zone_contention   lru_contention   total_contention
 31   15345900    +0.00%       64%                 8%           72%
 53   17903847   +16.67%       32%                38%           70%
 63   17992886   +17.25%       24%                45%           69%
 73   18022825   +17.44%       10%                61%           71%
119   18023401   +17.45%        4%                66%           70%
127   18029012   +17.48%        3%                66%           69%
137   18036075   +17.53%        4%                66%           70%
165   18035964   +17.53%        2%                67%           69%
188   18101105   +17.95%        2%                67%           69%
223   18130951   +18.15%        2%                67%           69%
255   18118898   +18.07%        2%                67%           69%
267   18101559   +17.96%        2%                67%           69%
299   18160468   +18.34%        2%                68%           70%
320   18139845   +18.21%        2%                67%           69%
393   18160869   +18.34%        2%                68%           70%
424   18170999   +18.41%        2%                68%           70%
458   18144868   +18.24%        2%                68%           70%
467   18142366   +18.22%        2%                68%           70%
498   18154549   +18.30%        1%                68%           69%
511   18134525   +18.17%        1%                69%           70%
Broadwell-EX: similar pattern as Skylake-EX.
batch   score     change   zone_contention   lru_contention   total_contention
 31   16703983    +0.00%       67%                 7%           74%
 53   18195393    +8.93%       43%                28%           71%
 63   18288885    +9.49%       38%                33%           71%
 73   18344329    +9.82%       35%                37%           72%
119   18535529   +10.96%       24%                46%           70%
127   18513596   +10.83%       23%                48%           71%
137   18514327   +10.84%       23%                48%           71%
165   18511840   +10.82%       22%                49%           71%
188   18593478   +11.31%       17%                53%           70%
223   18601667   +11.36%       17%                52%           69%
255   18774825   +12.40%       12%                58%           70%
267   18754781   +12.28%        9%                60%           69%
299   18892265   +13.10%        7%                63%           70%
320   18873812   +12.99%        8%                62%           70%
393   18891174   +13.09%        6%                64%           70%
424   18975108   +13.60%        6%                64%           70%
458   18932364   +13.34%        8%                62%           70%
467   18960891   +13.51%        5%                65%           70%
498   18944526   +13.41%        5%                64%           69%
511   18960839   +13.51%        5%                64%           69%
Skylake-EP: although the increased batch reduces zone->lock contention,
the effect is not as good as on EX: zone->lock contention is still as high
as 20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX.  Also, total_contention actually decreases with a higher
batch but that doesn't translate to a performance increase.
batch   score    change   zone_contention   lru_contention   total_contention
 31   9554867    +0.00%       66%                 3%           69%
 53   9855486    +3.15%       63%                 3%           66%
 63   9980145    +4.45%       62%                 4%           66%
 73   10092774   +5.63%       62%                 5%           67%
119   10310061   +7.90%       45%                19%           64%
127   10342019   +8.24%       42%                19%           61%
137   10358182   +8.41%       42%                21%           63%
165   10397060   +8.81%       37%                24%           61%
188   10341808   +8.24%       34%                26%           60%
223   10349135   +8.31%       31%                27%           58%
255   10327189   +8.08%       28%                29%           57%
267   10344204   +8.26%       27%                29%           56%
299   10325043   +8.06%       25%                30%           55%
320   10310325   +7.91%       25%                31%           56%
393   10293274   +7.73%       21%                31%           52%
424   10311099   +7.91%       21%                32%           53%
458   10321375   +8.02%       21%                32%           53%
467   10303881   +7.84%       21%                32%           53%
498   10332462   +8.14%       20%                33%           53%
511   10325016   +8.06%       20%                32%           52%
Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.
batch   score    change   zone_contention   lru_contention   total_contention
 31   10121178   +0.00%       19%                50%           69%
 53   10142366   +0.21%        6%                63%           69%
 63   10117984   -0.03%       11%                58%           69%
 73   10123330   +0.02%        7%                63%           70%
119   10108791   -0.12%        2%                67%           69%
127   10166074   +0.44%        3%                66%           69%
137   10141574   +0.20%        3%                66%           69%
165   10154499   +0.33%        2%                68%           70%
188   10124921   +0.04%        2%                67%           69%
223   10137399   +0.16%        2%                67%           69%
255   10143289   +0.22%        0%                68%           68%
267   10123535   +0.02%        1%                68%           69%
299   10140952   +0.20%        0%                68%           68%
320   10163170   +0.41%        0%                68%           68%
393   10000633   -1.19%        0%                69%           69%
424   10087998   -0.33%        0%                69%           69%
458   10187116   +0.65%        0%                69%           69%
467   10146790   +0.25%        0%                69%           69%
498   10197958   +0.76%        0%                69%           69%
511   10152326   +0.31%        0%                69%           69%
Haswell-EP: similar to Broadwell-EP.
batch   score   change   zone_contention   lru_contention   total_contention
 31   10442205   +0.00%       14%                48%           62%
 53   10442255   +0.00%        5%                57%           62%
 63   10452059   +0.09%        6%                57%           63%
 73   10482349   +0.38%        5%                59%           64%
119   10454644   +0.12%        3%                60%           63%
127   10431514   -0.10%        3%                59%           62%
137   10423785   -0.18%        3%                60%           63%
165   10481216   +0.37%        2%                61%           63%
188   10448755   +0.06%        2%                61%           63%
223   10467144   +0.24%        2%                61%           63%
255   10480215   +0.36%        2%                61%           63%
267   10484279   +0.40%        2%                61%           63%
299   10466450   +0.23%        2%                61%           63%
320   10452578   +0.10%        2%                61%           63%
393   10499678   +0.55%        1%                62%           63%
424   10481454   +0.38%        1%                62%           63%
458   10473562   +0.30%        1%                62%           63%
467   10484269   +0.40%        0%                62%           62%
498   10505599   +0.61%        0%                62%           62%
511   10483395   +0.39%        0%                62%           62%
Westmere-EP: contention is pretty small, so the results are not
interesting.  Note that too high a batch value could hurt performance.
batch   score   change   zone_contention   lru_contention   total_contention
 31   4831523   +0.00%        2%                 3%            5%
 53   4834086   +0.05%        2%                 4%            6%
 63   4834262   +0.06%        2%                 3%            5%
 73   4832851   +0.03%        2%                 4%            6%
119   4830534   -0.02%        1%                 3%            4%
127   4827461   -0.08%        1%                 4%            5%
137   4827459   -0.08%        1%                 3%            4%
165   4820534   -0.23%        0%                 4%            4%
188   4817947   -0.28%        0%                 3%            3%
223   4809671   -0.45%        0%                 3%            3%
255   4802463   -0.60%        0%                 4%            4%
267   4801634   -0.62%        0%                 3%            3%
299   4798047   -0.69%        0%                 3%            3%
320   4793084   -0.80%        0%                 3%            3%
393   4785877   -0.94%        0%                 3%            3%
424   4782911   -1.01%        0%                 3%            3%
458   4779346   -1.08%        0%                 3%            3%
467   4780306   -1.06%        0%                 3%            3%
498   4780589   -1.05%        0%                 3%            3%
511   4773724   -1.20%        0%                 3%            3%
Skylake-Desktop: similar to Westmere-EP, nothing interesting.
batch   score   change   zone_contention   lru_contention   total_contention
 31   3906608   +0.00%        2%                 3%            5%
 53   3940164   +0.86%        2%                 3%            5%
 63   3937289   +0.79%        2%                 3%            5%
 73   3940201   +0.86%        2%                 3%            5%
119   3933240   +0.68%        2%                 3%            5%
127   3930514   +0.61%        2%                 4%            6%
137   3938639   +0.82%        0%                 3%            3%
165   3908755   +0.05%        0%                 3%            3%
188   3905621   -0.03%        0%                 3%            3%
223   3903015   -0.09%        0%                 4%            4%
255   3889480   -0.44%        0%                 3%            3%
267   3891669   -0.38%        0%                 4%            4%
299   3898728   -0.20%        0%                 4%            4%
320   3894547   -0.31%        0%                 4%            4%
393   3875137   -0.81%        0%                 4%            4%
424   3874521   -0.82%        0%                 3%            3%
458   3880432   -0.67%        0%                 4%            4%
467   3888715   -0.46%        0%                 3%            3%
498   3888633   -0.46%        0%                 4%            4%
511   3875305   -0.80%        0%                 5%            5%
Haswell-Desktop: zone->lock contention is pretty low, as on other
desktops, though lru contention is higher than on other desktops.
batch   score   change   zone_contention   lru_contention   total_contention
 31   3511158   +0.00%        2%                 5%            7%
 53   3555445   +1.26%        2%                 6%            8%
 63   3561082   +1.42%        2%                 6%            8%
 73   3547218   +1.03%        2%                 6%            8%
119   3571319   +1.71%        1%                 7%            8%
127   3549375   +1.09%        0%                 6%            6%
137   3560233   +1.40%        0%                 6%            6%
165   3555176   +1.25%        2%                 6%            8%
188   3551501   +1.15%        0%                 8%            8%
223   3531462   +0.58%        0%                 7%            7%
255   3570400   +1.69%        0%                 7%            7%
267   3532235   +0.60%        1%                 8%            9%
299   3562326   +1.46%        0%                 6%            6%
320   3553569   +1.21%        0%                 8%            8%
393   3539519   +0.81%        0%                 7%            7%
424   3549271   +1.09%        0%                 8%            8%
458   3528885   +0.50%        0%                 8%            8%
467   3526554   +0.44%        0%                 7%            7%
498   3525302   +0.40%        0%                 9%            9%
511   3527556   +0.47%        0%                 8%            8%
Sandybridge-Desktop: the 0% contention isn't accurate; it is caused by
the dropped fractional part.  Since each contention path's contention is
under 1% here, arithmetic operations such as addition can make the final
deviation as large as 3%.
batch   score   change   zone_contention   lru_contention   total_contention
 31   1744495   +0.00%        0%                 0%            0%
 53   1755341   +0.62%        0%                 0%            0%
 63   1758469   +0.80%        0%                 0%            0%
 73   1759626   +0.87%        0%                 0%            0%
119   1770417   +1.49%        0%                 0%            0%
127   1768252   +1.36%        0%                 0%            0%
137   1767848   +1.34%        0%                 0%            0%
165   1765088   +1.18%        0%                 0%            0%
188   1766918   +1.29%        0%                 0%            0%
223   1767866   +1.34%        0%                 0%            0%
255   1768074   +1.35%        0%                 0%            0%
267   1763187   +1.07%        0%                 0%            0%
299   1765620   +1.21%        0%                 0%            0%
320   1767603   +1.32%        0%                 0%            0%
393   1764612   +1.15%        0%                 0%            0%
424   1758476   +0.80%        0%                 0%            0%
458   1758593   +0.81%        0%                 0%            0%
467   1757915   +0.77%        0%                 0%            0%
498   1753363   +0.51%        0%                 0%            0%
511   1755548   +0.63%        0%                 0%            0%
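The truncation caveat in the Sandybridge-Desktop note can be illustrated with a small sketch. The per-path values below are hypothetical and chosen only to show the effect; the commit does not list the real sub-1% figures, and the helper name is an assumption.

```python
import math

def truncated_percent(value: float) -> int:
    """Drop the fractional part, as the contention columns appear to do."""
    return math.floor(value)

# Hypothetical per-path contention percentages, each below 1%.
paths = [0.9, 0.8, 0.7]

reported = sum(truncated_percent(p) for p in paths)  # every path prints as 0%
actual = sum(paths)                                  # sum before truncation

print(f"reported total: {reported}%")
print(f"actual total:   {actual:.1f}%")
```

With three paths just under 1% each, the table would show 0% although the combined contention is close to 3%.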
Phase two test results:
Note: all percent change is against base(batch=31).
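As a sketch of how the percent columns relate to the batch=31 base, the following recomputes two entries from the ebizzy.throughput row for lkp-skl-2sp2. The helper name `percent_change` is an assumption made for illustration; it is not part of the test harness.

```python
def percent_change(base: float, value: float) -> str:
    """Relative change of value against the batch=31 base, one decimal place."""
    return f"{(value - base) / base * 100:+.1f}%"

# ebizzy.throughput on lkp-skl-2sp2: batch=31, batch=63, batch=127
base, batch63, batch127 = 1329674, 1345891, 1351056

print(percent_change(base, batch63))
print(percent_change(base, batch127))
```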
ebizzy.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
lkp-sb02          98937          98937    +0.0%       99030 +0.1%
kbuild.buildtime (less is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
lkp-bdw-ep2      365764        365933  +0.0%        365951 +0.1%
lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
lkp-sb02          29990         30137  +0.5%         30103 +0.4%
oltp.transactions (higher is better)
machine         batch=31      batch=63             batch=127
lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
pigz.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
will-it-scale.malloc1.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
lkp-sb02          589286       589284 -0.0%         588101 -0.2%
will-it-scale.malloc1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
will-it-scale.malloc2.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
will-it-scale.malloc2.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
will-it-scale.page_fault2.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
lkp-sb02          842616        837844  -0.6%         833921  -1.0%
will-it-scale.page_fault2.threads
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
lkp-sb02          750626        748615    -0.3%      746621    -0.5%
will-it-scale.page_fault3.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
will-it-scale.page_fault3.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
will-it-scale.read1.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
lkp-skl-d01     10969487      10972650     +0.0%    no data
lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
will-it-scale.read1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
will-it-scale.write1.processes (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
will-it-scale.write1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
lkp-skl-d01       8346174       8235695     -1.3%     no data
lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
vm-scalability.anon-r-rand.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
lkp-sb02          570572        567854     -0.5%     568449     -0.4%
vm-scalability.anon-r-rand-mt.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
lkp-sb02          567137        567346     +0.0%     566144     -0.2%
vm-scalability.lru-file-mmap-read.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
lkp-sb02          258939        258787  -0.1%        258199  -0.3%
Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Michal Hocko [Fri, 17 Aug 2018 22:49:10 +0000 (15:49 -0700)]
 
mm/oom_kill.c: document oom_lock
Add comments describing oom_lock's scope.
Requested-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Mike Kravetz [Fri, 17 Aug 2018 22:49:07 +0000 (15:49 -0700)]
 
mm/hugetlb: remove gigantic page support for HIGHMEM
This reverts ee8f248d266e ("hugetlb: add phys addr to struct
huge_bootmem_page").
At one time powerpc used this field and supporting code.  However that
was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support
for reserving gigantic huge pages via kernel command line").
There are no users of this field and supporting code, so remove it.
Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Michal Hocko [Fri, 17 Aug 2018 22:49:04 +0000 (15:49 -0700)]
 
mm, oom: remove sleep from under oom_lock
Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have a strong synchronization
between the oom_killer and victim's exiting because both have to take
the oom_lock.  Therefore the original heuristic to sleep for a short
time in out_of_memory doesn't serve the original purpose.
Moreover Tetsuo has noticed that the short sleep can be more harmful
than useful.  Hammering the system with many processes can lead to
starvation when the task holding the oom_lock blocks for a long time
(minutes) and so blocks any further progress, because the oom_reaper
depends on the oom_lock as well.
Drop the short sleep from out_of_memory when we hold the lock.  Keep the
sleep when the trylock fails to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g. sleep proportional
to the time spent in active reclaim etc.) but that is a much more
complex thing to achieve.  This is a quick fixup to remove stale code.
Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
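The locking behaviour described in "mm, oom: remove sleep from under oom_lock" can be sketched in userspace as follows. This is only a model under stated assumptions: `threading.Lock` stands in for the kernel's oom_lock mutex, and the function names and the `backoff_s` parameter are hypothetical, not kernel symbols.

```python
import threading
import time

oom_lock = threading.Lock()   # stands in for the kernel's oom_lock mutex

def out_of_memory_model(kill_victim) -> None:
    """Runs with oom_lock held; after the change it does not sleep here."""
    kill_victim()

def alloc_may_oom_model(kill_victim, backoff_s: float = 0.001) -> bool:
    """Return True if this caller ran the OOM path, False if it backed off."""
    if not oom_lock.acquire(blocking=False):
        # Another task is already handling the OOM: throttle this path
        # briefly without holding the lock, then let the caller retry.
        time.sleep(backoff_s)
        return False
    try:
        out_of_memory_model(kill_victim)
        return True
    finally:
        oom_lock.release()
```

The point of the sketch is that the only remaining sleep happens on the path that failed the trylock, so the lock holder is never delayed by it.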
 
Marek Szyprowski [Fri, 17 Aug 2018 22:49:00 +0000 (15:49 -0700)]
 
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
The CMA memory allocator doesn't support standard gfp flags for memory
allocation, so there is no point in having a gfp_mask parameter for the
dma_alloc_from_contiguous() function.  Replace it by a boolean no_warn
argument, which covers everything the underlying cma_alloc() function
supports.
This will help to avoid giving the false impression that this function
supports standard gfp flags and that callers can pass __GFP_ZERO to get
a zeroed buffer, which has already been an issue: see commit
dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with
FORCE_CONTIGUOUS flag").
Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Marek Szyprowski [Fri, 17 Aug 2018 22:48:57 +0000 (15:48 -0700)]
 
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
convert the gfp_mask parameter to a boolean no_warn parameter.
This will help to avoid giving the false impression that this function
supports standard gfp flags and that callers can pass __GFP_ZERO to get
a zeroed buffer, which has already been an issue: see commit
dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with
FORCE_CONTIGUOUS flag").
Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
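The interface problem behind the two gfp_mask removals above can be shown with a toy model. Everything here is an assumption made for illustration: the flag values, `MAX_PAGES`, the warning text and the `cma_alloc_old`/`cma_alloc_new`/`alloc_zeroed` names are hypothetical and do not reproduce the kernel implementation.

```python
from typing import Optional

PAGE_SIZE = 4096
MAX_PAGES = 16          # size of the toy CMA area, in pages
GFP_ZERO = 0x1          # hypothetical flag values, not the kernel's
GFP_NOWARN = 0x2

def cma_alloc_new(count: int, no_warn: bool) -> Optional[bytearray]:
    """New-style model: the only knob is whether a failure is reported."""
    if count > MAX_PAGES:
        if not no_warn:
            print(f"toy cma: allocation of {count} pages failed")
        return None
    # Pages come back with stale contents; nothing here zeroes them.
    return bytearray(b"\xaa" * (count * PAGE_SIZE))

def cma_alloc_old(count: int, gfp_mask: int) -> Optional[bytearray]:
    """Old-style model: takes a full gfp_mask but honours only GFP_NOWARN."""
    return cma_alloc_new(count, bool(gfp_mask & GFP_NOWARN))

def alloc_zeroed(count: int) -> Optional[bytearray]:
    """With the boolean interface the caller zeroes the buffer explicitly."""
    buf = cma_alloc_new(count, no_warn=False)
    if buf is not None:
        buf[:] = bytes(len(buf))
    return buf
```

In this model, passing `GFP_ZERO` to the old-style function is silently ignored, which is the false impression the commits aim to remove; the boolean signature leaves no room for that mistake.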
 
Rik van Riel [Fri, 17 Aug 2018 22:48:53 +0000 (15:48 -0700)]
 
Revert "mm: always flush VMA ranges affected by zap_page_range"
There was a bug in Linux that could cause madvise (and mprotect?) system
calls to return to userspace without the TLB having been flushed for all
the pages involved.
This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.
This was noticed in the summer of 2017, at which time two solutions
were created:
  56236a59556c ("mm: refactor TLB gathering API")
  99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
  4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
We need only one of these solutions, and the former appears to be a
little more efficient than the latter, so revert the latter.
This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by
zap_page_range")
Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Baoquan He [Fri, 17 Aug 2018 22:48:49 +0000 (15:48 -0700)]
 
mm/sparse: optimize memmap allocation during sparse_init()
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  With the
help of these two arrays, a contiguous memory chunk is allocated for the
usemap and memmap of the memory sections on one node.  This avoids
excessive memory fragmentation.  In the diagram below, '1' indicates a
present memory section and '0' an absent one.  The number 'n' could be
much smaller than NR_MEM_SECTIONS on most systems.
  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3         4 5         i   i+1     n-1   n
If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will finally be cleared to indicate that it's not
present.  After use, these two arrays will be released at the end of
sparse_init().
In 4-level paging mode, each array costs 4M, which is negligible.
In 5-level paging mode, they cost 256M each, 512M altogether.  A kdump
kernel usually reserves very little memory, e.g. 256M.  So, even though
the arrays are only allocated temporarily, this is still not acceptable.
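The 4M and 256M figures can be reproduced with a short calculation. The constants are assumptions based on x86_64: 128 MiB sections (SECTION_SIZE_BITS = 27), 8-byte pointers, and MAX_PHYSMEM_BITS of 46 for 4-level and 52 for 5-level paging; the commit message itself does not state them.

```python
SECTION_SIZE_BITS = 27   # 128 MiB sections (assumed x86_64 value)
POINTER_SIZE = 8         # bytes per entry in usemap_map / map_map

def temp_array_bytes(max_physmem_bits: int) -> int:
    """Size of one NR_MEM_SECTIONS-sized pointer array, in bytes."""
    nr_mem_sections = 1 << (max_physmem_bits - SECTION_SIZE_BITS)
    return nr_mem_sections * POINTER_SIZE

MiB = 1 << 20
four_level = temp_array_bytes(46) // MiB   # assumed MAX_PHYSMEM_BITS, 4-level
five_level = temp_array_bytes(52) // MiB   # assumed MAX_PHYSMEM_BITS, 5-level

print(four_level, five_level, 2 * five_level)
```

Under these assumptions one array is 4 MiB with 4-level paging and 256 MiB with 5-level paging, so the two arrays together need 512 MiB.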
In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
to the end, the number of present memory sections stays the same
during sparse_init() until we finally clear out the memory section's
->section_mem_map if its usemap or memmap is not correctly handled.
Thus, whenever the for_each_present_section_nr() loop is taken in the
meantime, the i-th present memory section is always the same one.
Here only allocate usemap_map and map_map with the size of
'nr_present_sections'.  For the i-th present memory section, install its
usemap and memmap to usemap_map[i] and map_map[i] during allocation.
Then in the last for_each_present_section_nr() loop which clears the
failed memory section's ->section_mem_map, fetch usemap and memmap from
usemap_map[] and map_map[] array and set them into mem_section[]
accordingly.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Baoquan He [Fri, 17 Aug 2018 22:48:45 +0000 (15:48 -0700)]
 
mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap
It's used to pass the size of the map data unit into
alloc_usemap_and_memmap(), and is preparation for the next patch.
Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Baoquan He [Fri, 17 Aug 2018 22:48:42 +0000 (15:48 -0700)]
 
mm/sparsemem.c: defer the ms->section_mem_map clearing
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
will allocate one contiguous memory chunk for the mem maps on one node and
populate the relevant page tables to map the memory sections one by one.
If populating fails for a certain mem section, a warning is printed and its
->section_mem_map is cleared to cancel the marking of being
present.  As a result, the number of mem sections marked as present could
decrease during sparse_init() execution.
Here, just defer the ms->section_mem_map clearing for sections whose page
tables failed to populate until the last for_each_present_section_nr()
loop.  This is in preparation for later optimizing the mem map allocation.
[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Baoquan He [Fri, 17 Aug 2018 22:48:38 +0000 (15:48 -0700)]
 
mm/sparse.c: add a static variable nr_present_sections
Patch series "mm/sparse: Optimize memmap allocation during
sparse_init()", v6.
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  In
5-level paging mode, this will cost 512M of memory, though they will be
released at the end of sparse_init().  A system with little memory, like
a kdump kernel, which usually only has about 256M, will fail to boot
because of allocation failure if CONFIG_X86_5LEVEL=y.
In this patchset, optimize the memmap allocation code to only use
usemap_map and map_map with the size of nr_present_sections.  This makes
the kdump kernel boot up with a normal crashkernel='' setting when
CONFIG_X86_5LEVEL=y.
This patch (of 5):
nr_present_sections is used to record how many memory sections are
marked as present during system boot up, and will be used in a later
patch.
Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:34 +0000 (15:48 -0700)]
 
mm: use special value SHRINKER_REGISTERING instead of list_empty() check
The patch introduces a special value SHRINKER_REGISTERING to use instead
of list_empty() to distinguish a registering shrinker from an unregistered
shrinker.  Why do we need that at all?
Shrinker registration is split in two parts.  The first one is
prealloc_shrinker(), which allocates shrinker memory and reserves an ID in
shrinker_idr.  This function can fail.  The second is
register_shrinker_prepared(), which finalizes the registration.  This
function actually makes the shrinker available to be used from
shrink_slab(), and it can't fail.
One shrinker may be based on more than one LRU list.  So, we never
clear the bit in the memcg shrinker maps when (one of) the corresponding
LRU lists becomes empty, since other LRU lists may not be empty.  See
the superblock shrinker for example: it is based on two LRU lists,
s_inode_lru and s_dentry_lru.  We do not want to clear the shrinker bit
when there are no inodes in s_inode_lru, as s_dentry_lru may contain
dentries.
Instead, we use a special algorithm to detect shrinkers having no
elements in any of their LRU lists, and this is done in
shrink_slab_memcg().  See the comment in this function for the details.
Also, in shrink_slab_memcg() we clear the shrinker bit in the map when we
meet an unregistered shrinker (the bit is set, while there is no shrinker
in the IDR).  Otherwise, we would have to do that at the moment of
shrinker unregistration for all memcgs (and this looks worse, since
iterating over all memcgs may take a long time).  This would also impose
restrictions on the shrinker unregistration order for its users: they
would have to guarantee that there are no new elements after
unregister_shrinker() (otherwise, a newly added element would set a
bit).
So, if we meet a set bit in the map and no shrinker in the IDR when we're
iterating over the map in shrink_slab_memcg(), this means the
corresponding shrinker is unregistered, and we must clear the bit.
Another case is shrinker registration.  We want two things there:
1) do_shrink_slab() can be called only for completely registered
   shrinkers;
2) shrinker internal lists may be populated in any order relative to
   register_shrinker_prepared() (let's take sb as the example).  Both
   of:
  a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
    memcg_set_shrinker_bit();                               [cpu0]
    ...
    register_shrinker_prepared();                           [cpu1]
  and
  b)register_shrinker_prepared();                           [cpu0]
    ...
    list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
    memcg_set_shrinker_bit();                               [cpu1]
   are legitimate.  We don't want to impose a restriction here and
   force people to use only the (b) variant.  We don't want to force people
   to ensure that there are no elements in the LRU lists before the
   shrinker is completely registered.  Internal users of LRU lists and the
   shrinker code are two different subsystems, and they have to stay
   self-contained with respect to each other.
In case (a) we have the bit set before the shrinker is completely
registered.  We don't want do_shrink_slab() to be called at this moment, so
we have to detect such registering shrinkers.
Before this patch, a list_empty() (shrinker is not linked to the list)
check was used for that.  So, in (a) there could be a bit set, but we
don't call do_shrink_slab() unless the shrinker is linked to the list.
It's just an indicator; I simply overloaded linking to the list.
This was not the best solution, since it's better not to touch the
shrinker memory from shrink_slab_memcg() before it's completely
registered (this will also be useful in the future to make shrink_slab()
completely lockless).
So, this patch introduces a better way to detect a registering shrinker,
which avoids dereferencing shrinker memory.  It's just a ~0UL
value, which we insert into the IDR during ID allocation.  After the
shrinker is ready to be used, we insert the actual shrinker pointer into
the IDR, and it becomes available to shrink_slab_memcg().
We can't use NULL instead of this new value for this purpose, as
shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
and we don't want the function to see NULL and clear the bit; otherwise
(a) won't work.
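The three states a slot can be in may be sketched as follows.  This is a
userspace illustration, not the kernel code: only the ~0UL sentinel value
comes from the changelog, while the TOY_* names and both functions are
hypothetical.  The second function shows how "NULL or sentinel" can be
tested with one unsigned comparison, without touching the pointed-to
memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sentinel stored while registration is in progress; the changelog gives
 * its value as ~0UL.  All other names here are hypothetical.
 */
#define TOY_REGISTERING ((void *)UINTPTR_MAX)

enum toy_state { TOY_UNREGISTERED, TOY_IN_PROGRESS, TOY_READY };

/* Classify a slot value without dereferencing it. */
static enum toy_state toy_classify(const void *slot)
{
	if (slot == NULL)
		return TOY_UNREGISTERED;	/* caller may clear the bit */
	if (slot == TOY_REGISTERING)
		return TOY_IN_PROGRESS;		/* keep the bit, skip the call */
	return TOY_READY;			/* safe to call the shrinker */
}

/*
 * Single-comparison form of "NULL or sentinel": NULL - 1 wraps to the
 * maximum value and sentinel - 1 is maximum - 1, so both exceed
 * maximum - 2, while every other value minus one does not.
 */
static int toy_not_ready(const void *slot)
{
	return (uintptr_t)slot - 1 > UINTPTR_MAX - 2;
}
```

Keeping NULL and the sentinel as separate values is what lets the caller
clear the bit only for unregistered shrinkers while leaving it set for
case (a).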
This is the only thing the patch does: it provides a better way to detect
a registering shrinker.  It does nothing else.
Also, this gives better assembly, but that is a minor aspect of the patch:
Before:
  callq  <idr_find>
  mov    %rax,%r15
  test   %rax,%rax
  je     <shrink_slab_memcg+0x1d5>
  mov    0x20(%rax),%rax
  lea    0x20(%r15),%rdx
  cmp    %rax,%rdx
  je     <shrink_slab_memcg+0xbd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  <do_shrink_slab>
After:
  callq  <idr_find>
  mov    %rax,%r15
  lea    -0x1(%rax),%rax
  cmp    $0xfffffffffffffffd,%rax
  ja     <shrink_slab_memcg+0x1cd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  ffffffff810cefd0 <do_shrink_slab>
[ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:30 +0000 (15:48 -0700)]
 
mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab()
In the shrink_slab_memcg() case we do not zero nid when the shrinker is not
NUMA-aware.  This is not a real problem, since currently all memcg-aware
shrinkers are NUMA-aware too (we have two: the super_block shrinker and the
workingset shrinker), but something may change in the future.
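A minimal sketch of the idea in the subject line, assuming that moving the
check means normalising nid inside the common helper so that every caller
gets the same behaviour.  The types, the flag value and the function name
are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>

#define TOY_NUMA_AWARE 0x1	/* stand-in for SHRINKER_NUMA_AWARE */

struct toy_shrinker {
	unsigned int flags;
};

struct toy_control {
	int nid;
};

/*
 * Hypothetical model of the common helper: zero nid here for shrinkers
 * that are not NUMA-aware, instead of relying on each caller to do so.
 * Returns the nid that is actually used.
 */
static int toy_do_shrink(struct toy_control *sc, const struct toy_shrinker *s)
{
	assert(sc && s);
	if (!(s->flags & TOY_NUMA_AWARE))
		sc->nid = 0;
	return sc->nid;
}
```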
Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:25 +0000 (15:48 -0700)]
 
mm/vmscan.c: clear shrinker bit if there are no objects related to memcg
To avoid further unneeded calls of do_shrink_slab() for shrinkers which
no longer have any charged objects in a memcg, their bits have to
be cleared.
This patch introduces a lockless mechanism to do that without racing
with a parallel list lru add.  After do_shrink_slab() returns
SHRINK_EMPTY the first time, we clear the bit and call it once again.
Then we restore the bit if the new return value is different.
Note that a single smp_mb__after_atomic() in shrink_slab_memcg() covers
two situations:
1)list_lru_add()     shrink_slab_memcg()
    list_add_tail()    for_each_set_bit() <--- read bit
                         do_shrink_slab() <--- missed list update (no barrier)
    <MB>                 <MB>
    set_bit()            do_shrink_slab() <--- seen list update
This situation, when the first do_shrink_slab() sees the set bit but
doesn't see the list update (i.e., a race with the first element being
queued), is rare.  So we don't add <MB> before the first call of
do_shrink_slab(), in order not to slow down the generic case.  Also, the
second call is needed anyway, as seen below in (2).
2)list_lru_add()      shrink_slab_memcg()
    list_add_tail()     ...
    set_bit()           ...
  ...                   for_each_set_bit()
  do_shrink_slab()        do_shrink_slab()
    clear_bit()           ...
  ...                     ...
  list_lru_add()          ...
    list_add_tail()       clear_bit()
    <MB>                  <MB>
    set_bit()             do_shrink_slab()
The barriers guarantee that the second do_shrink_slab() in the right-hand
side task sees the list update if it really cleared the bit.  This case is
drawn in the code comment.
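The clear-then-recheck sequence described above can be modelled with C11
atomics.  This is a single-threaded sketch under stated assumptions: the
struct and function names are hypothetical, TOY_EMPTY stands in for
SHRINK_EMPTY, and a sequentially consistent fence stands in for
smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>

#define TOY_EMPTY (-1L)	/* stand-in for SHRINK_EMPTY */

struct toy_memcg_shrinker {
	atomic_ulong bit;	/* 1 = "may have objects" */
	atomic_long nr_items;	/* models the list_lru length */
};

/* Models do_shrink_slab(): reports TOY_EMPTY when nothing is queued. */
static long toy_scan(struct toy_memcg_shrinker *s)
{
	long n = atomic_load(&s->nr_items);

	return n ? n : TOY_EMPTY;
}

/* Models one step of the loop over set bits. */
static long toy_shrink_one(struct toy_memcg_shrinker *s)
{
	long ret = toy_scan(s);

	if (ret == TOY_EMPTY) {
		atomic_store(&s->bit, 0);
		/* Full barrier, standing in for smp_mb__after_atomic(). */
		atomic_thread_fence(memory_order_seq_cst);
		ret = toy_scan(s);
		if (ret == TOY_EMPTY)
			return 0;		/* really empty: bit stays clear */
		atomic_store(&s->bit, 1);	/* lost a race: restore the bit */
	}
	return ret;
}
```

A single-threaded run can only exercise the "really empty" and "not empty"
paths; the restore path is taken only when another thread queues an item
between the two scans.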
[Results/performance of the patchset]
After the whole patchset is applied, the test below shows a significant
increase in performance:
  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
  $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
       echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
       mkdir -p s/$i; mount -t tmpfs $i s/$i;
       touch s/$i/file; done
Then, 5 sequential calls of drop caches:
  $time echo 3 > /proc/sys/vm/drop_caches
1)Before:
  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU
2)After:
  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
The results show that performance increases by a factor of at least 548.
Shakeel Butt tested this patchset with a fork-bomb on his configuration:
 > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
 > file containing few KiBs on corresponding mount. Then in a separate
 > memcg of 200 MiB limit ran a fork-bomb.
 >
 > I ran the "perf record -ag -- sleep 60" and below are the results:
 >
 > Without the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
 > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
 > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
 > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
 > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
 > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 >
 > With the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
 > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
 > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
 > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
 > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
 > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
 > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:21 +0000 (15:48 -0700)]
 
mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when a shrinker has a very small
number of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
both of these cases, shrinker::count_objects() returns 0.
The patch introduces a new SHRINK_EMPTY return value, which will be used
for the "no objects at all" case.  It is mostly a refactoring, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in later patches.
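The difference between "rounds down to 0" and "no objects at all" can be
sketched like this.  The sentinel encoding, the percentage-style scaling
and both function names are assumptions made for illustration; the
changelog only says that a new SHRINK_EMPTY value exists and that callers
fold it back into 0 in this patch.

```c
#include <assert.h>

/* Assumed encoding for illustration; the changelog does not give it. */
#define TOY_SHRINK_EMPTY (~0UL - 1)

/*
 * Hypothetical count callback: scale the real count by a percentage, so
 * that a small non-zero count may legitimately round down to 0.
 */
static unsigned long toy_count_objects(unsigned long nr, unsigned long pct)
{
	assert(pct <= 100);
	if (nr == 0)
		return TOY_SHRINK_EMPTY;	/* no objects at all */
	return nr * pct / 100;			/* may round down to 0 */
}

/* Caller side as described for this patch: fold the sentinel into 0. */
static unsigned long toy_caller(unsigned long nr, unsigned long pct)
{
	unsigned long ret = toy_count_objects(nr, pct);

	return ret == TOY_SHRINK_EMPTY ? 0 : ret;
}
```

With the sentinel folded away the external behaviour is unchanged, which
is why the changelog calls this patch mostly a refactoring.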
Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Vladimir Davydov [Fri, 17 Aug 2018 22:48:17 +0000 (15:48 -0700)]
 
mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
The patch makes shrink_slab() be called for root_mem_cgroup in the same
way as it's called for the rest of cgroups.  This simplifies the logic
and improves the readability.
[ktkhai@virtuozzo.com: wrote changelog]
Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:14 +0000 (15:48 -0700)]
 
mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()
Using the preparations made in previous patches, in the case of a memcg
shrink we may skip shrinkers which are not set in the memcg's shrinkers
bitmap.  To do that, we separate the iterations over memcg-aware and
!memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
for_each_set_bit() from the bitmap.  On big nodes hosting many
isolated environments, this gives a significant performance gain.  See
the next patches for the details.
Note that the patch does not handle empty memcg shrinkers, since we
never clear the bitmap bits once they are set.  Such shrinkers will
be called again, with no objects shrunk as a result.  That functionality
is provided by the next patches.
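The bitmap walk can be illustrated in userspace.  This sketch does not use
the kernel's for_each_set_bit(); it is a hypothetical stand-in with a
plain loop, and the function name and callback type are made up.  It only
shows that shrinker ids whose bit is clear are never visited.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical stand-in for the per-memcg bitmap walk: call fn(id) only
 * for shrinker ids whose bit is set, and return how many ids were visited.
 */
static int toy_for_each_set_bit(const unsigned long *map, unsigned int nbits,
				void (*fn)(unsigned int id))
{
	unsigned int id;
	int calls = 0;

	assert(map);
	for (id = 0; id < nbits; id++) {
		if (!(map[id / TOY_BITS] & (1UL << (id % TOY_BITS))))
			continue;	/* no charged objects were ever seen */
		if (fn)
			fn(id);
		calls++;
	}
	return calls;
}
```

With thousands of memcgs and mounts, most bits are clear for any given
memcg, so the number of shrinker calls drops accordingly.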
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:10 +0000 (15:48 -0700)]
 
mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance
Introduce the set_shrinker_bit() function to set the shrinker-related bit
in the memcg shrinker bitmap, and set the bit after the first item is added
and when reparenting a destroyed memcg's items.
This will allow the next patch to call shrinkers only when
they have charged objects at the moment, and to improve shrink_slab()
performance.
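A minimal model of "set the bit after the first item is added", assuming
the bit is set on the 0 -> 1 transition of the item count.  The struct and
function are hypothetical, and a single unsigned long stands in for the
memcg shrinker bitmap.

```c
#include <assert.h>

struct toy_lru {
	long nr_items;
	int shrinker_id;
};

/*
 * Hypothetical model: add one item and flag the shrinker in the memcg
 * bitmap only when the list goes from empty to non-empty.  Returns 1 if
 * the bit was set by this call, 0 otherwise.
 */
static int toy_lru_add(struct toy_lru *lru, unsigned long *memcg_map)
{
	/* unsigned long is at least 32 bits wide, so ids 0..31 always fit. */
	assert(lru->shrinker_id >= 0 && lru->shrinker_id < 32);
	if (lru->nr_items++ == 0) {
		*memcg_map |= 1UL << lru->shrinker_id;
		return 1;
	}
	return 0;
}
```

Later additions leave the bitmap alone, so the common path stays cheap.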
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:06 +0000 (15:48 -0700)]
 
mm/memcontrol.c: export mem_cgroup_is_root()
This will be used in the next patch.
Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:48:01 +0000 (15:48 -0700)]
 
mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node()
This is just refactoring to allow the next patches to have the lru pointer
in memcg_drain_list_lru_node().
Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:58 +0000 (15:47 -0700)]
 
mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node()
This is just refactoring to allow the next patches to have dst_memcg
pointer in memcg_drain_list_lru_node().
Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:54 +0000 (15:47 -0700)]
 
mm/list_lru.c: add memcg argument to list_lru_from_kmem()
This is just refactoring to allow the next patches to have memcg pointer
in list_lru_from_kmem().
Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:50 +0000 (15:47 -0700)]
 
fs: propagate shrinker::id to list_lru
Add a list_lru::shrinker_id field and populate it with the registered
shrinker id.
This will be used by the lru code in the next patches to set the correct
bit in the memcg shrinkers map once the first memcg-related element
appears in list_lru.
Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:45 +0000 (15:47 -0700)]
 
fs/super.c: refactor alloc_super()
Do the two list_lru_init_memcg() calls after prealloc_super().
destroy_unused_super() in the failure path is OK with this.  The next
patch needs this order.
Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:41 +0000 (15:47 -0700)]
 
mm/workingset.c: refactor workingset_init()
Use prealloc_shrinker()/register_shrinker_prepared() instead of
register_shrinker().  This will be used in next patch.
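The two-step registration named above can be pictured with a minimal
sketch.  Everything here (struct toy_shrinker and the toy_* helpers) is
hypothetical and only mirrors the shape of the calls; it is not the
kernel implementation.  The assumed idea is that the first step does
the work that can fail, and the second step only publishes the
shrinker:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shrinker; not the kernel structure. */
struct toy_shrinker {
	long *nr_deferred;   /* allocated in the prealloc step */
	bool registered;     /* visible to reclaim only after step 2 */
};

/* Step 1: do everything that can fail (here, one allocation). */
static int toy_prealloc_shrinker(struct toy_shrinker *s)
{
	s->registered = false;
	s->nr_deferred = calloc(1, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -1;
}

/* Release what step 1 allocated. */
static void toy_free_prealloced_shrinker(struct toy_shrinker *s)
{
	free(s->nr_deferred);
	s->nr_deferred = NULL;
}

/* Step 2: cannot fail; it only publishes the prepared shrinker. */
static void toy_register_shrinker_prepared(struct toy_shrinker *s)
{
	assert(s->nr_deferred != NULL);
	s->registered = true;
}
```

Splitting the steps this way leaves room for other setup between
allocation and publication, which is presumably what the next patch
needs.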
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:37 +0000 (15:47 -0700)]
 
mm, memcg: assign memcg-aware shrinkers bitmap to memcg
Imagine a big node with many cpus, memory cgroups and containers.  Say
we have 200 containers, and every container has 10 mounts and 10 cgroups.
No container task touches the mounts of foreign containers.  If there is
intensive page writing and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab before it's able to get to
shrink_page_list().
Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total number of calls is 2000 * 2000 = 4000000.
So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
shrink_page_list().  I've observed a node spending almost 100% of its
time in the kernel, uselessly iterating over already shrunk slab.
This patch adds a bitmap of memcg-aware shrinkers to memcg.  The size of
the bitmap depends on bitmap_nr_ids, and during the memcg's lifetime it
is kept large enough to fit bitmap_nr_ids shrinkers.  Every bit in
the map corresponds to a shrinker id.
Later patches will keep a bit set only for a memcg that is really
charged.  This will allow shrink_slab() to improve its performance
significantly.  See the last patch for the numbers.
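As a rough illustration of the bitmap idea, here is a small userspace
sketch.  The names (struct toy_memcg, toy_set_shrinker_bit() and so on)
are invented for this example and do not come from the patch; the
sketch only shows one bit per shrinker id and a reclaim pass that
counts the shrinkers whose bit is set:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a memcg: one bit per memcg-aware shrinker id. */
struct toy_memcg {
	unsigned long *map;   /* shrinker bitmap */
	size_t nr_ids;        /* number of ids the map can hold */
};

/* Allocate a zeroed map big enough for nr_ids bits. */
static int toy_memcg_init(struct toy_memcg *m, size_t nr_ids)
{
	size_t words = (nr_ids + BITS_PER_WORD - 1) / BITS_PER_WORD;

	m->map = calloc(words ? words : 1, sizeof(unsigned long));
	m->nr_ids = nr_ids;
	return m->map ? 0 : -1;
}

static void toy_set_shrinker_bit(struct toy_memcg *m, size_t id)
{
	assert(id < m->nr_ids);
	m->map[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}

static int toy_test_bit(const struct toy_memcg *m, size_t id)
{
	return (m->map[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL;
}

/* Count the shrinker calls one reclaim pass over this memcg would make. */
static size_t toy_shrink_calls(const struct toy_memcg *m)
{
	size_t id, calls = 0;

	for (id = 0; id < m->nr_ids; id++)
		if (toy_test_bit(m, id))
			calls++;
	return calls;
}
```

With 2000 possible ids and only two bits set, the pass would make two
calls for that memcg instead of 2000.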
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:33 +0000 (15:47 -0700)]
 
mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines
The next patch requires these defines to be above their current
position, so move them up to the declarations.
Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:29 +0000 (15:47 -0700)]
 
mm: assign id to every memcg-aware shrinker
Introduce the shrinker::id number, which is used to enumerate
memcg-aware shrinkers.  The numbers start from 0, and the code tries to
keep them as small as possible.
The id will be used to represent a memcg-aware shrinker in the memcg
shrinkers map.
Since all memcg-aware shrinkers are based on list_lru, which is
per-memcg in case of CONFIG_MEMCG_KMEM only, the new functionality will
be under this config option.
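To illustrate "start from 0 and stay as small as possible", here is a
minimal sketch of a lowest-free-id allocator.  It is an assumption made
for illustration only: the fixed-size table and the toy_* names are
hypothetical, and the patch itself may use a different mechanism:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IDS 64   /* arbitrary limit chosen for this sketch */

static bool toy_id_used[TOY_MAX_IDS];

/* Return the lowest unused id, or -1 if the table is full. */
static int toy_alloc_shrinker_id(void)
{
	for (int id = 0; id < TOY_MAX_IDS; id++) {
		if (!toy_id_used[id]) {
			toy_id_used[id] = true;
			return id;
		}
	}
	return -1;
}

/* Give an id back so that a later allocation can reuse it. */
static void toy_free_shrinker_id(int id)
{
	assert(id >= 0 && id < TOY_MAX_IDS && toy_id_used[id]);
	toy_id_used[id] = false;
}
```

Reusing freed ids keeps the highest id in use low, which in turn keeps
any per-memcg map indexed by these ids small.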
[ktkhai@virtuozzo.com: v9]
Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:25 +0000 (15:47 -0700)]
 
mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB
Introduce a new config option, which replaces the repeated
CONFIG_MEMCG && !CONFIG_SLOB pattern.  The next patches add a little more
memcg+kmem related code, so let's keep the defines clearer.
Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 
Kirill Tkhai [Fri, 17 Aug 2018 22:47:21 +0000 (15:47 -0700)]
 
mm/list_lru.c: combine code under the same define
Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.
This patchset solves the problem of slow shrink_slab() occurring on
machines with many shrinkers and memory cgroups (i.e., with many
containers).  The problem is that the complexity of shrink_slab() is
O(n^2), so it grows too fast as the number of containers grows.
Say we have 200 containers, and every container has 10 mounts and 10
cgroups.  All container tasks are isolated, and they don't touch foreign
containers' mounts.
In case of global reclaim, a task has to iterate over all the memcgs and
call all the memcg-aware shrinkers for each of them.  This means the
task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
there are 2000 memcgs, the total number of do_shrink_slab() calls is
2000 * 2000 = 4000000.
4 million calls is not a number of operations that each take 1 cpu
cycle.  E.g., super_cache_count() accesses at least two lists and makes
arithmetical calculations.  Even if there are no charged objects, we do
these calculations and replace cpu cache contents with the memory read.
I observed nodes spending almost 100% of their time in the kernel in
case of intensive writing and global reclaim.  The writer consumes pages
fast, but shrink_slab() has to run before the reclaimer reaches the
shrink pages function (and frees SWAP_CLUSTER_MAX pages).  Even if there
is no writing, the iterations just waste time and slow reclaim down.
Let's see the small test below:
  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
  $for i in `seq 0 4000`;
          do mkdir /sys/fs/cgroup/memory/ct/$i;
          echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
          mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
  done
Then, let's see drop caches time (5 sequential calls):
  $time echo 3 > /proc/sys/vm/drop_caches
  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU
The last four calls don't actually shrink anything.  So, the iterations
over slab shrinkers alone take at least 5.48 seconds.  Not so good for
scalability.
The patchset solves the problem by making shrink_slab() of O(n)
complexity.  It takes the following functional actions:
1) Assign id to every registered memcg-aware shrinker.
2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
   shrinker-related bit after the first element is added to lru list
   (also, when removed child memcg elements are reparented).
3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
   shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
   functionality to clear the bit after the last element is shrunk).
This gives significant performance increase.  The result after patchset
is applied:
  $time echo 3 > /proc/sys/vm/drop_caches
  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
The results show the performance increases by at least 548 times.
So, the patchset reduces the complexity of shrink_slab() and improves
performance under the types of load I pointed out.  It will also help
in the !global reclaim case, since there will be fewer
do_shrink_slab() calls there too.
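The life cycle of a bit described in steps 2) and 3) above can be
sketched in a few lines of userspace C.  This is a simplified model
under stated assumptions, not the patchset's code: struct toy_cg, the
fixed number of shrinkers and the "free everything" shrink step are all
invented for the example:

```c
#include <assert.h>

#define TOY_NR_SHRINKERS 8   /* arbitrary size for this sketch */

/* Hypothetical per-memcg state. */
struct toy_cg {
	unsigned long nr_items[TOY_NR_SHRINKERS]; /* objects per shrinker id */
	unsigned int shrinker_map;                /* bit i set: shrinker i may have work */
	unsigned long calls;                      /* shrinker invocations made so far */
};

/* Step 2: set the bit when the first element lands on the lru. */
static void toy_lru_add(struct toy_cg *cg, int id)
{
	assert(id >= 0 && id < TOY_NR_SHRINKERS);
	if (cg->nr_items[id]++ == 0)
		cg->shrinker_map |= 1u << id;
}

/* Step 3: call only shrinkers whose bit is set; clear the bit once empty. */
static void toy_shrink(struct toy_cg *cg)
{
	for (int id = 0; id < TOY_NR_SHRINKERS; id++) {
		if (!(cg->shrinker_map & (1u << id)))
			continue;
		cg->calls++;
		cg->nr_items[id] = 0;            /* pretend everything was freed */
		cg->shrinker_map &= ~(1u << id);
	}
}
```

In this model a second reclaim pass over an already emptied memcg makes
no shrinker calls at all, which matches the near-zero times of the
repeated drop_caches runs.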
This patch (of 17):
These two pairs of blocks of code are under the same #ifdef #else
#endif.
Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Waiman Long <longman@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>