Chao Yu [Mon, 30 Oct 2017 09:49:54 +0000 (17:49 +0800)]
f2fs: check curseg space before foreground GC
When we are close to triggering foreground GC, if there are only a few
dirty metas, we can log these dirty metas in the space left in the open
segments instead of triggering foreground GC.
With this patch, the total count of foreground GCs triggered by
test/generic/* of the fstests suite drops from 254 to 184.
So let's do the check before foreground GC anyway.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
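The check described in "f2fs: check curseg space before foreground GC" can be pictured with a small user-space sketch. The struct, its fields, and the function name below are hypothetical and are not taken from the kernel; the sketch only shows the idea of comparing the number of dirty meta blocks against the free blocks left in each open segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an open (current) segment. */
struct open_seg {
	unsigned int blocks_per_seg;  /* capacity of the segment */
	unsigned int valid_blocks;    /* blocks already in use */
};

/*
 * Return true if every open segment still has room for `dirty_blocks`
 * blocks, i.e. the dirty metas could be logged without foreground GC.
 */
bool open_segs_have_space(const struct open_seg *segs, size_t n,
			  unsigned int dirty_blocks)
{
	for (size_t i = 0; i < n; i++) {
		assert(segs[i].valid_blocks <= segs[i].blocks_per_seg);
		unsigned int left = segs[i].blocks_per_seg - segs[i].valid_blocks;

		if (dirty_blocks > left)
			return false;
	}
	return true;
}
```

A caller would fall back to foreground GC only when this kind of check fails.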
Chao Yu [Mon, 30 Oct 2017 09:49:53 +0000 (17:49 +0800)]
f2fs: use rw_semaphore to protect SIT cache
There are some cases where users don't update the SIT cache under this
lock, so let's use rw_semaphore instead of mutex to improve concurrent
access.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 6 Oct 2017 16:14:28 +0000 (09:14 -0700)]
f2fs: support quota sys files
This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 6 Oct 2017 04:03:06 +0000 (21:03 -0700)]
f2fs: add quota_ino feature infra
This patch adds quota_ino feature infra to be used for quota files.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Mon, 30 Oct 2017 07:19:48 +0000 (15:19 +0800)]
f2fs: optimize __update_nat_bits
Make three modifications to __update_nat_bits:
1. Take the code that deals with the nat with nid 0 out of the loop.
Such a nat only needs to be dealt with once, at the beginning.
2. Use "nat_index == 0" instead of "start_nid == 0" to decide whether it's the first nat block.
It's better not to assume that @start_nid is the first nid of the nat block it's in.
3. Use "if (nat_blk->entries[i].block_addr != NULL_ADDR)" to explicitly confirm the value of block_addr.
Using the constant makes sure the code is right even if the value of NULL_ADDR changes.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlei He [Mon, 30 Oct 2017 06:18:55 +0000 (14:18 +0800)]
f2fs: modify for accurate fggc node io stat
modify for accurate fggc node io stat
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Mon, 30 Oct 2017 01:33:41 +0000 (09:33 +0800)]
Revert "f2fs: handle dirty segments inside refresh_sit_entry"
This reverts commit 5e443818fa0b2a2845561ee25bec181424fb2889.
The commit should be reverted because the call sequence of the two parts
of code below must be kept:
a. update sit information: it needs to be updated before segment
allocation, since a later allocation may trigger SSR, and SSR allocation
needs the latest valid block information of all segments.
b. update segment status: it needs to be updated after segment allocation,
since we can skip updating the status of the currently opened segment.
Fixes: 5e443818fa0b ("f2fs: handle dirty segments inside refresh_sit_entry")
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: remove refresh_sit_entry function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fan Li [Sat, 28 Oct 2017 11:03:37 +0000 (19:03 +0800)]
f2fs: add a function to move nid
This patch adds a new function to move a nid from one state to another.
The move operation is heavily used; by adding a new function for it
we can cut down some branches from several flows.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:33 +0000 (16:52 +0800)]
f2fs: export SSR allocation threshold
This patch exports min_ssr_segments threshold in sysfs to let user
control triggering SSR allocation flexibly.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:32 +0000 (16:52 +0800)]
f2fs: give correct trimmed blocks in fstrim
We already support issuing discards in a specified range during fstrim,
so it needs to return to the caller the successfully trimmed bytes in
that range instead of the bytes of invalid blocks which are scanned in
checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:31 +0000 (16:52 +0800)]
f2fs: support bio allocation error injection
This patch adds support for bio allocation error injection to simulate
an out-of-memory test scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:30 +0000 (16:52 +0800)]
f2fs: support get_page error injection
This patch adds support for get_page error injection to simulate
an out-of-memory test scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Sat, 28 Oct 2017 08:52:29 +0000 (16:52 +0800)]
f2fs: add missing sysfs description
Some sysfs entries lack descriptions in the documentation; add them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Fri, 27 Oct 2017 12:45:05 +0000 (20:45 +0800)]
f2fs: support soft block reservation
This patch extends the reserved_blocks sysfs interface to be a soft
threshold, which allows the user to configure it to exceed the currently
available user space. This patch also introduces a new sysfs interface
called current_reserved_blocks, which shows the blocks that have
already been reserved.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 16 Oct 2017 22:05:16 +0000 (15:05 -0700)]
f2fs: handle error case when adding xattr entry
This patch fixes recovering incomplete xattr entries remaining in inline xattr
and xattr block, caused by any kind of errors.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 6 Sep 2017 13:59:50 +0000 (21:59 +0800)]
f2fs: support flexible inline xattr size
Now, in products, more and more features based on file encryption are
being introduced, and their demand for xattr space is increasing. However,
inline xattr has a fixed size of 200 bytes; once the inline xattr space is
full, newly added xattr data occupies an additional xattr block, which may
bring more space usage and a performance regression during persisting.
In order to resolve the above issue, it's better to expand the inline
xattr size flexibly according to the user's requirement.
So this patch introduces a new filesystem feature, 'flexible inline xattr',
and a new mount option, 'inline_xattr_size=%u'. Once mkfs enables the
feature, we can use the option to make f2fs support a flexible inline
xattr size.
To support this feature, we add an extra attribute, i_inline_xattr_size, to
the inode layout, indicating how much space inline xattr borrows from the
block address mapping space in the inode layout. With this, we can easily
locate and store flexible-sized inline xattr data in the inode.
Inode disk layout:
  +----------------------+
  | .i_mode              |
  | ...                  |
  | .i_ext               |
  +----------------------+
  | .i_extra_isize       |
  | .i_inline_xattr_size |-----------+
  | ...                  |           |
  +----------------------+           |
  | .i_addr              |           |
  |  - block address or  |           |
  |  - inline data       |           |
  +----------------------+<---+      v
  |    inline xattr      |    +---inline xattr range
  +----------------------+<---+
  | .i_nid               |
  +----------------------+
  |   node_footer        |
  | (nid, ino, offset)   |
  +----------------------+
Note that we have to consider backward compatibility, which reserved
inline_data space, 200 bytes, all the time, as reported by Sheng Yong.
Previously, inline data or directory always reserved 200 bytes in the inode
layout, even if inline_xattr was disabled. In order to keep inline_dentry's
structure for backward compatibility, we get the space back only from
inline_data.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 26 Oct 2017 08:31:22 +0000 (10:31 +0200)]
f2fs: show current cp state
This patch shows whether checkpoint met any error case.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 23 Oct 2017 21:50:15 +0000 (23:50 +0200)]
f2fs: add missing quota_initialize
This patch adds calls to quota_initialize in f2fs_set_acl, f2fs_unlink,
and f2fs_rename.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Tue, 24 Oct 2017 07:46:54 +0000 (09:46 +0200)]
f2fs: show # of dirty segments via sysfs
This patch adds one sysfs entry to show # of dirty segments which can be
used for gc timing by user.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Mon, 23 Oct 2017 21:48:49 +0000 (23:48 +0200)]
f2fs: stop all the operations by cp_error flag
This patch uses the cp_error flag instead of RDONLY for quota off.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Colin Ian King [Thu, 19 Oct 2017 10:58:21 +0000 (12:58 +0200)]
f2fs: remove several redundant assignments
There are several assignments to variables that are redundant
as the values are never read when the variables are updated later
and so the redundant statements can be safely removed.
Cleans up clang warnings:
fs/f2fs/segment.c:923:19: warning: Value stored to 'p' during its initialization is never read
fs/f2fs/segment.c:2060:2: warning: Value stored to 'hint' is never read
fs/f2fs/segment.c:2353:2: warning: Value stored to 'start_block' is never read
fs/f2fs/segment.c:2354:2: warning: Value stored to 'end_block' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Arnd Bergmann [Thu, 19 Oct 2017 09:52:47 +0000 (11:52 +0200)]
f2fs: avoid using timespec
All uses of timespec are deprecated, and this one is not particularly
useful, as the documented method for converting seconds to jiffies
is to multiply by 'HZ'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
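The conversion mentioned in "f2fs: avoid using timespec" (seconds to jiffies by multiplying by HZ) is plain integer arithmetic. A minimal user-space sketch follows; EXAMPLE_HZ and secs_to_ticks are hypothetical names, and the tick rate of 250 is an assumed value chosen only for illustration, since the real HZ is a kernel build-time setting.

```c
#include <assert.h>
#include <limits.h>

/* Assumed tick rate for illustration; the kernel's HZ is set at build time. */
#define EXAMPLE_HZ 250UL

/* Convert whole seconds to timer ticks by multiplying by the tick rate. */
unsigned long secs_to_ticks(unsigned long secs)
{
	assert(secs <= ULONG_MAX / EXAMPLE_HZ); /* guard against overflow */
	return secs * EXAMPLE_HZ;
}
```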
Chao Yu [Wed, 18 Oct 2017 02:34:14 +0000 (10:34 +0800)]
f2fs: fix to correct no_fggc_candidate
There may be an extreme case as below:
With one segment per section, and a total of 100 segments with a 10%
over-provision ratio in the f2fs partition, fggc_threshold will be
rounded down to 460 instead of 460.8 by the calculation below:
	sbi->fggc_threshold = div_u64((u64)(main_count - ovp_count) *
			BLKS_PER_SEC(sbi), (main_count - resv_count));
If section usage is as follows:
	60 segments which contain 460 valid blocks
	40 segments which contain 462 valid blocks
As the valid block number in all sections is no smaller than
fggc_threshold, none of them will be chosen as a candidate due to the
incorrect fggc_threshold.
Let's just soften the condition for choosing foreground GC candidates.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
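The rounding in "f2fs: fix to correct no_fggc_candidate" can be reproduced in user space. The sketch below assumes 512 blocks per section, ovp_count = 10 and resv_count = 0; these inputs are not stated in the commit message and are chosen only because they yield the quoted 460.8. The two comparison helpers are likewise hypothetical and only illustrate how an inclusive test and a strict test treat a section that sits exactly on the truncated threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Integer form of the threshold formula quoted in the commit message. */
uint64_t fggc_threshold(uint64_t main_count, uint64_t ovp_count,
			uint64_t resv_count, uint64_t blks_per_sec)
{
	assert(main_count >= ovp_count && main_count > resv_count);
	return (main_count - ovp_count) * blks_per_sec /
	       (main_count - resv_count);
}

/* Inclusive test: a section exactly on the threshold is excluded too. */
bool excluded_inclusive(uint64_t valid_blocks, uint64_t threshold)
{
	return valid_blocks >= threshold;
}

/* Strict test: a section is excluded only when it is above the threshold. */
bool excluded_strict(uint64_t valid_blocks, uint64_t threshold)
{
	return valid_blocks > threshold;
}
```

With the assumed inputs, 90 * 512 / 100 truncates to 460, so under the inclusive test both the 460-block and the 462-block sections are excluded, while under the strict test the 460-block sections remain eligible.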
Jaegeuk Kim [Thu, 19 Oct 2017 19:07:11 +0000 (12:07 -0700)]
Revert "f2fs: return wrong error number on f2fs_quota_write"
This reverts commit 4f31d26b0c17f2aae6a6afeb823a87e20671ab4b.
It turns out that we need to report error number if nothing was written.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 18:48:57 +0000 (11:48 -0700)]
f2fs: remove obsolete pointer for truncate_xattr_node
This patch removes the obsolete parameter of truncate_xattr_node.
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 16:43:56 +0000 (09:43 -0700)]
f2fs: retry ENOMEM for quota_read|write
This gives another chance to read or write quota data.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Thu, 19 Oct 2017 02:05:57 +0000 (19:05 -0700)]
f2fs: limit # of inmemory pages
If some abnormal users try lots of atomic write operations, f2fs can
produce pinned pages in main memory, which affects system performance.
This patch limits them to 20% of total memory size, and if f2fs reaches
the limit, it will drop all the inmemory pages.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
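The limit in "f2fs: limit # of inmemory pages" amounts to a one-fifth comparison. A minimal sketch, assuming page counts as inputs; the function name is hypothetical, and whether the real check is strict or inclusive at exactly 20% is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true while the number of pinned in-memory pages stays below
 * 20% (one fifth) of the total number of memory pages.
 */
bool inmem_pages_within_limit(unsigned long inmem_pages,
			      unsigned long total_pages)
{
	assert(inmem_pages <= total_pages);
	return inmem_pages < total_pages / 5;
}
```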
Chao Yu [Fri, 13 Oct 2017 10:01:36 +0000 (18:01 +0800)]
f2fs: update ctx->pos correctly when hitting hole in directory
This patch fixes to update ctx->pos correctly when hitting hole in
directory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:35 +0000 (18:01 +0800)]
f2fs: relocate readahead codes in readdir()
Previously, for a large directory, we did readahead only once in
readdir(), so readdir()'s performance may drop when traversing later
blocks. In order to avoid this, relocate the readahead code to cover
the whole traversal flow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:34 +0000 (18:01 +0800)]
f2fs: allow readdir() to be interrupted
This patch follows ext4 to allow readdir() in a large empty directory to
be interrupted. Referenced commit of ext4: 1f60fbe72749 ("ext4: allow
readdir()'s of large empty directories to be interrupted").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 13 Oct 2017 10:01:33 +0000 (18:01 +0800)]
f2fs: trace f2fs_readdir
This patch adds trace for f2fs_readdir.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Tue, 17 Oct 2017 09:33:41 +0000 (17:33 +0800)]
f2fs: trace f2fs_lookup
This patch adds trace for f2fs_lookup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Weichao Guo [Sat, 14 Oct 2017 00:13:32 +0000 (08:13 +0800)]
f2fs: skip searching non-exist range in truncate_hole
Let's skip entire non-exist area to speed up truncate_hole by
using get_next_page_offset.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 13 Oct 2017 17:27:45 +0000 (10:27 -0700)]
f2fs: expose some sectors to user in inline data or dentry case
If there's some data written through inline data or dentry, we need to show
st_blocks. This fixes reporting zero blocks even though a small amount of
data has been written.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid link file for quotacheck]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Fri, 13 Oct 2017 02:12:53 +0000 (19:12 -0700)]
f2fs: avoid stale fi->gdirty_list pointer
When doing fault injection test, f2fs_evict_inode() didn't remove gdirty_list
which incurs a kernel panic due to wrong pointer access.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jaegeuk Kim [Sat, 7 Oct 2017 07:08:05 +0000 (00:08 -0700)]
f2fs/crypto: drop crypto key at evict_inode only
This patch avoids dropping crypto key in f2fs_drop_inode, so we can guarantee
it happens only at evict_inode.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 9 Oct 2017 09:55:19 +0000 (17:55 +0800)]
f2fs: fix to avoid race when accessing last_disk_size
last_disk_size could be wrong due to concurrent updates, so use the
i_sem semaphore to make last_disk_size updates exclusive to fix this
issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Thomas Meyer [Sat, 7 Oct 2017 14:02:21 +0000 (16:02 +0200)]
f2fs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:37 +0000 (09:08 +0800)]
f2fs: give up CP_TRIMMED_FLAG if it drops discards
In ->umount, once we drop the remaining discard entries, we should not
set CP_TRIMMED_FLAG with another checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:36 +0000 (09:08 +0800)]
f2fs: trace f2fs_remove_discard
This patch adds tracepoint to trace f2fs_remove_discard.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:35 +0000 (09:08 +0800)]
f2fs: reduce cmd_lock coverage in __issue_discard_cmd
__submit_discard_cmd may incur long latency due to exhaustion of I/O
request resources in the block layer, so issuing all discards under
cmd_lock may lead to a hung task. In order to avoid that, let's reduce
its coverage.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:34 +0000 (09:08 +0800)]
f2fs: split discard policy
There are many different scenarios, such as fstrim, umount, urgent or
background, in which we issue discards. They need different policies
in terms of I/O awareness, discard granularity, delay interval and so
on. But now they all share one common discard policy, so there will be
races when the policy is changed between these scenarios, and the
interference from changing the discard policy can be very serious.
This patch splits the discard policy for the different scenarios.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:33 +0000 (09:08 +0800)]
f2fs: wrap discard policy
This patch wraps scattered optional parameters into a discard policy as
below. Later, with it, we expect to adjust these parameters with a
proper strategy in different scenarios.
struct discard_policy {
	unsigned int min_interval;	/* used for candidates exist */
	unsigned int max_interval;	/* used for candidates not exist */
	unsigned int max_requests;	/* # of discards issued per round */
	unsigned int io_aware_gran;	/* minimum granularity discard not be aware of I/O */
	bool io_aware;			/* issue discard in idle time */
	bool sync;			/* submit discard with REQ_SYNC flag */
};
This patch doesn't change any logic of the code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Wed, 4 Oct 2017 01:08:32 +0000 (09:08 +0800)]
f2fs: support issuing/waiting discard in range
Fstrim intends to trim invalid blocks of the filesystem only within the
specified range and granularity, but actually it issues all previously
cached discard commands, which may be out of range and of unmatched
granularity; this is unneeded.
In order to fix the above issues, this patch introduces new helpers to
issue and wait for discards in a range, and adds a new fstrim_list for
tracking in-flight discards from ->fstrim.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:39 +0000 (13:59 +0800)]
f2fs: fix to flush multiple device in checkpoint
If f2fs manages multiple devices, in checkpoint, we need to issue flush
in those devices which contain dirty data/node in their cache before
we write checkpoint region, otherwise, filesystem metadata could be
corrupted if hitting SPO after checkpoint.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:38 +0000 (13:59 +0800)]
f2fs: enhance multiple device flush
When the multiple device feature is enabled, during ->fsync we issue a
flush to all devices to make sure the node/data of the file is persisted
to storage. But some of these flushes could be unneeded, as the file's
data may not have been written back to those devices. So this patch adds
and manages a per-inode bitmap in the global cache to indicate which
devices are dirty and need a flush during ->fsync; hence, we can improve
fsync performance in the multiple device scenario.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
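The per-inode bitmap described in "f2fs: enhance multiple device flush" can be sketched as follows. This is a user-space illustration only: the struct, the helper names, and the limit of 8 devices are assumptions made for the sketch and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEVICES 8u  /* assumed upper bound for this sketch */

/* Hypothetical per-inode record: bit i set means device i has unflushed data. */
struct inode_dirty_devs {
	uint8_t bitmap;
};

/* Record that a write for this inode reached device `dev`. */
void mark_dev_dirty(struct inode_dirty_devs *d, unsigned int dev)
{
	assert(dev < MAX_DEVICES);
	d->bitmap |= (uint8_t)(1u << dev);
}

/* At fsync time, only devices whose bit is set need a flush. */
bool dev_needs_flush(const struct inode_dirty_devs *d, unsigned int dev)
{
	assert(dev < MAX_DEVICES);
	return (d->bitmap >> dev) & 1u;
}

/* After a successful flush, forget the dirty state. */
void clear_dirty_devs(struct inode_dirty_devs *d)
{
	d->bitmap = 0;
}
```

An fsync path built this way would loop over the devices and skip those for which dev_needs_flush() returns false.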
Chao Yu [Fri, 29 Sep 2017 05:59:37 +0000 (13:59 +0800)]
f2fs: fix to show ino management cache size correctly
It needs to stat size of ino management cache with all type instead of
orphan ino type.
Fixes: 652be55162dc ("f2fs: show # of orphan inodes")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:36 +0000 (13:59 +0800)]
f2fs: drop FI_UPDATE_WRITE tag after f2fs_issue_flush
If we failed to issue flush in ->fsync, we need to keep FI_UPDATE_WRITE
flag to make sure triggering flush in next ->fsync.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Fri, 29 Sep 2017 05:59:35 +0000 (13:59 +0800)]
f2fs: obsolete ALLOC_NID_LIST list
As Fan Li reported, there is no user traversing nid_list[ALLOC_NID_LIST]
which is used for tracking preallocated nids. Let's drop it, and only
track preallocated nids in free_nid_root radix-tree.
Reported-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Weichao Guo [Fri, 29 Sep 2017 14:43:23 +0000 (22:43 +0800)]
f2fs: convert inline data for direct I/O & FI_NO_PREALLOC
In FI_NO_PREALLOC cases, direct I/O path may allocate blocks for an
inode but keep its inline data flag. This inconsistency may trigger
vfs clear_inode nrpages bug_on when evicting the inode. We should
convert inline data first in this case.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hsiang Kao [Sat, 23 Sep 2017 18:45:42 +0000 (02:45 +0800)]
f2fs: allow readpages with NULL file pointer
Keep in line with the other Linux file system implementations
since page_cache_sync_readahead supports NULL file pointer,
and thus we can readahead data by f2fs itself without file opening
(something like the btrfs behavior).
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Thu, 14 Sep 2017 02:18:01 +0000 (10:18 +0800)]
f2fs: show flush list status in sysfs
This patch shows flush list status in sysfs.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 4 Sep 2017 10:58:03 +0000 (18:58 +0800)]
f2fs: introduce read_xattr_block
Commit ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs, duplicated from read_all_xattrs, which leaves
lots of similar code between them, so introduce a new helper,
read_xattr_block, to clean up the redundant code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 4 Sep 2017 10:58:02 +0000 (18:58 +0800)]
f2fs: introduce read_inline_xattr
Commit ba38c27eb93e ("f2fs: enhance lookup xattr") introduces
lookup_all_xattrs, duplicated from read_all_xattrs, which leaves
lots of similar code between them, so introduce a new helper,
read_inline_xattr, to clean up the redundant code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Chao Yu [Mon, 25 Sep 2017 06:17:51 +0000 (14:17 +0800)]
Revert "f2fs: reuse nids more aggressively"
Commit 268344664603 ("f2fs: reuse nids more aggressively") tries to
reuse nids as much as possible, in order to mitigate producing obsolete
node pages in the page cache.
But actually, before we reuse the nids and the related node page cache,
we always invalidate that node page, so there will not be any
obsolete node pages in the cache.
Let's just revert the previous commit, so that nm_i::next_scan_nid
increases in ascending order, making __build_free_nids traverse all NAT
pages more easily; finally, the free nid bitmap cache can be enabled as
soon as possible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Yunlong Song [Sat, 23 Sep 2017 09:02:18 +0000 (17:02 +0800)]
Revert "f2fs: node segment is prior to data segment selected victim"
This reverts commit b9cd20619e359d199b755543474c3d853c8e3415.
That patch causes much fewer node segments (which can be used for SSR)
than before, and in the corner case (e.g. create and delete *.txt files in
one same directory, there will be very few node segments but many data
segments), if the reserved free segments are all used up during gc, then
the write_checkpoint can still flush dentry pages to data ssr segments,
but will probably fail to flush node pages to node ssr segments, since
there are not enough node ssr segments left (the left ones are all
full).
So revert this patch to give a fair chance to let node segments remain
for SSR, which provides more robustness for corner cases.
Conflicts:
fs/f2fs/gc.c
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Linus Torvalds [Tue, 10 Oct 2017 18:04:00 +0000 (11:04 -0700)]
Merge tag 'f2fs-for-4.14-rc5' of git://git./linux/kernel/git/jaegeuk/f2fs
Pull f2fs fix from Jaegeuk Kim:
"This contains one bug fix which causes a kernel panic during fstrim
introduced in 4.14-rc1"
* tag 'f2fs-for-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: fix potential panic during fstrim
Linus Torvalds [Tue, 10 Oct 2017 17:57:46 +0000 (10:57 -0700)]
Merge tag 'linux-kselftest-4.14-rc5-fixes' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- fix for x86: sysret_ss_attrs test build failure preventing the x86
tests from running
- fix mqueue: fix regression in silencing test run output
* tag 'linux-kselftest-4.14-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: mqueue: fix regression in silencing output from RUN_TESTS
selftests: x86: sysret_ss_attrs doesn't build on a PIE build
Linus Torvalds [Tue, 10 Oct 2017 02:08:32 +0000 (19:08 -0700)]
Merge branch 'ppc-bundle' (bundle from Michael Ellerman)
Merge powerpc transactional memory fixes from Michael Ellerman:
"I figured I'd still send you the commits using a bundle to make sure
it works in case I need to do it again in future"
This fixes transactional memory state restore for powerpc.
* bundle'd patches from Michael Ellerman:
powerpc/tm: Fix illegal TM state in signal handler
powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
Linus Torvalds [Mon, 9 Oct 2017 23:25:00 +0000 (16:25 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
2) Fix range checks in ipset address range addition operations, from
Jozsef Kadlecsik.
3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
4) Add missing netlink attribute policy for nl80211 packet pattern
attrs, from Peng Xu.
5) Fix PPP device destruction race, from Guillaume Nault.
6) Write marks get lost when BPF verifier processes R1=R2 register
assignments, causing incorrect liveness information and less state
pruning. Fix from Alexei Starovoitov.
7) Fix blackhole routes so that they are marked dead and therefore not
cached in sockets, otherwise IPSEC stops working. From Steffen
Klassert.
8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
net: thunderx: mark expected switch fall-throughs in nicvf_main()
udp: fix bcast packet reception
netlink: do not set cb_running if dump's start() errs
ipv4: Fix traffic triggered IPsec connections.
ipv6: Fix traffic triggered IPsec connections.
ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Revert commit 1a8b6d76dc5b ("net:add one common config...")
ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
ixgbe: Return error when getting PHY address if PHY access is not supported
netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
tipc: Unclone message at secondary destination lookup
tipc: correct initialization of skb list
gso: fix payload length when gso_size is zero
mlxsw: spectrum_router: Avoid expensive lookup during route removal
bpf: fix liveness marking
doc: Fix typo "8023.ad" in bonding documentation
ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
...
Aleksander Morgado [Mon, 9 Oct 2017 12:05:12 +0000 (14:05 +0200)]
cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
The u-blox TOBY-L2 is a LTE Cat 4 module with HSPA+ and 2G fallback.
This module allows switching to different USB profiles with the
'AT+UUSBCONF' command, and provides a ECM network interface when the
'AT+UUSBCONF=2' profile is selected.
The u-blox SARA-U2 is a HSPA module with 2G fallback. The default USB
configuration includes a ECM network interface.
Both these modules are controlled via AT commands through one of the
TTYs exposed. Connecting these modules may be done just by activating
the desired PDP context with 'AT+CGACT=1,<cid>' and then running DHCP
on the ECM interface.
Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 9 Oct 2017 17:55:37 +0000 (10:55 -0700)]
Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
stable fixes:
- nfs/filelayout: fix oops when freeing filelayout segment
- NFS: Fix uninitialized rpc_wait_queue
bugfixes:
- NFSv4/pnfs: Fix an infinite layoutget loop
- nfs: RPC_MAX_AUTH_SIZE is in bytes"
* tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4/pnfs: Fix an infinite layoutget loop
nfs/filelayout: fix oops when freeing filelayout segment
sunrpc: remove redundant initialization of sock
NFS: Fix uninitialized rpc_wait_queue
NFS: Cleanup error handling in nfs_idmap_request_key()
nfs: RPC_MAX_AUTH_SIZE is in bytes
Gustavo A. R. Silva [Mon, 9 Oct 2017 16:44:53 +0000 (11:44 -0500)]
net: thunderx: mark expected switch fall-throughs in nicvf_main()
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Robert Richter <rric@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: netdev@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
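For readers unfamiliar with the annotation used in "net: thunderx: mark expected switch fall-throughs in nicvf_main()", here is a generic sketch of a marked fall-through. The enum and function are hypothetical and unrelated to the driver; the point is only that an intentional fall-through carries a comment directly before the next case label, which GCC's -Wimplicit-fallthrough recognizes at its default level.

```c
#include <assert.h>

/* Hypothetical event codes, for illustration only. */
enum link_event { LINK_RESET, LINK_UP, LINK_DOWN };

/* Count how many setup steps an event triggers. */
int steps_for_event(enum link_event ev)
{
	int steps = 0;

	switch (ev) {
	case LINK_RESET:
		steps++;	/* reset-specific work */
		/* fall through */
	case LINK_UP:
		steps++;	/* work shared by reset and link-up */
		break;
	case LINK_DOWN:
		break;
	}
	return steps;
}

/* Usage: a reset performs both steps, a plain link-up only the shared one. */
void check_steps(void)
{
	assert(steps_for_event(LINK_RESET) == 2);
	assert(steps_for_event(LINK_UP) == 1);
	assert(steps_for_event(LINK_DOWN) == 0);
}
```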
David S. Miller [Mon, 9 Oct 2017 17:39:52 +0000 (10:39 -0700)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
Fedorenko.
2) Fix splat with mark restoration in xt_socket with non-full-sock,
patch from Subash Abhinov Kasiviswanathan.
3) ipset bogusly bails out when adding IPv4 range containing more than
2^31 addresses, from Jozsef Kadlecsik.
4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
5) Races between dump and swap in ipset results in BUG_ON splats, from
Ross Lagerwall.
6) Fix chain renames in nf_tables, from JingPiao Chen.
7) Fix race in pernet codepath with ebtables table registration, from
Artem Savkov.
8) Memory leak in error path in set name allocation in nf_tables, patch
from Arvind Yadav.
9) Don't dump chain counters if they are not available, this fixes a
crash when listing the ruleset.
10) Fix out of bound memory read in strlcpy() in x_tables compat code,
from Eric Dumazet.
11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
Lin Zhang.
12) Cannot load rules incrementally anymore after xt_bpf with pinned
objects, added in revision 1. From Shmulik Ladkani.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Oct 2017 17:36:25 +0000 (10:36 -0700)]
Merge branch '10GbE' of git://git./linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2017-10-09
This series contains updates to ixgbe and arch/Kconfig.
Mark fixes a case where PHY register access is not supported and we were
returning a PHY address, when we should have been returning -EOPNOTSUPP.
Sabrina Dubroca fixes the use of a logical "and" when it should have been
the bitwise "and" operator.
Ding Tianhong reverts the commit that added the Kconfig bool option
ARCH_WANT_RELAX_ORDER, since there is now a new flag
PCI_DEV_FLAGS_NO_RELAXED_ORDERING that has been added to indicate that
Relaxed Ordering Attributes should not be used for Transaction Layer
Packets. Then follows up with making the needed changes to ixgbe to
use the new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag.
John Fastabend fixes an issue in the ring accounting when the transmit
ring parameters are changed via ethtool when an XDP program is attached.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Mon, 9 Oct 2017 12:52:10 +0000 (14:52 +0200)]
udp: fix bcast packet reception
Commit bc044e8db796 ("udp: perform source validation for mcast early
demux") does not take into account that broadcast packets land in the
same code path and need different checks for the source address -
notably, a zero source address is valid for bcast and invalid for
mcast.
As a result, the 2nd and later broadcast packets with a 0 source address
landing on the same socket are dropped. This breaks dhcp servers.
Since we don't have stringent performance requirements for ingress
broadcast traffic, fix it by disabling UDP early demux for such traffic.
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
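The distinction drawn in the udp fix above can be modelled in a few lines. This is a simplified user-space Python sketch, not the kernel code: the function names and the "bcast"/"mcast" labels are hypothetical, and the real multicast source validation checks more than a zero address.

```python
import ipaddress

ZERO = ipaddress.IPv4Address("0.0.0.0")

def source_valid(pkt_type: str, saddr: str) -> bool:
    """A zero source address is valid for broadcast, invalid for multicast."""
    if pkt_type == "mcast":
        return ipaddress.IPv4Address(saddr) != ZERO
    return True

def use_early_demux(pkt_type: str) -> bool:
    """Model of the fix: broadcast traffic bypasses early demux entirely."""
    return pkt_type != "bcast"
```

With this split, a broadcast packet never reaches the multicast-oriented check at all, which is the effect the commit message describes.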
Jason A. Donenfeld [Mon, 9 Oct 2017 12:14:51 +0000 (14:14 +0200)]
netlink: do not set cb_running if dump's start() errs
It turns out that multiple places can call netlink_dump(), which means
it's still possible to dereference partially initialized values in
dump() that were the result of a faulty returned start().
This fixes the issue by calling start() _before_ setting cb_running to
true, so that there's no chance at all of hitting the dump() function
through any indirect paths.
It also moves the call to start() to be when the mutex is held. This has
the nice side effect of serializing invocations to start(), which is
likely desirable anyway. It also prevents any possible other races that
might come out of this logic.
In testing this with several different pieces of tricky code to trigger
these issues, this commit fixes all avenues that I'm aware of.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
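The ordering described in the netlink fix above can be sketched as follows. This is a toy Python model under stated assumptions: the class `DumpState` and its methods are hypothetical and only mirror the idea that start() runs under the lock and before cb_running is set.

```python
import errno
import threading

class DumpState:
    """Toy stand-in for a netlink dump control block (names hypothetical)."""

    def __init__(self, start, dump):
        self.lock = threading.Lock()
        self.start = start
        self.dump = dump
        self.cb_running = False

    def begin(self) -> int:
        # start() runs with the lock held and before cb_running is set,
        # so a failed start() never leaves a "running" dump behind.
        with self.lock:
            if self.cb_running:
                return -errno.EBUSY
            err = self.start()
            if err:
                return err
            self.cb_running = True
        return 0

    def run_dump(self):
        # Indirect callers only reach dump() if start() succeeded.
        with self.lock:
            if not self.cb_running:
                return None
            return self.dump()
```

Because the flag is only set after a successful start(), any other path that checks cb_running before calling dump() is protected without further changes.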
David S. Miller [Mon, 9 Oct 2017 16:52:55 +0000 (09:52 -0700)]
Merge tag 'mac80211-for-davem-2017-10-09' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
pull-request: mac80211 2017-10-09
The QCA folks found another netlink problem - we were missing validation
of some attributes. It's not super problematic since one can only read a
few bytes beyond the message (and that memory must exist), but here's the
fix for it.
I thought perhaps we can make nla_parse_nested() require a policy, but
given the two-stage validation/parsing in regular netlink that won't work.
Please pull and let me know if there's any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Oct 2017 16:43:34 +0000 (09:43 -0700)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2017-10-09
1) Fix some error paths of the IPsec offloading API.
2) Fix a NULL pointer dereference when IPsec is used
with vti. From Alexey Kodanev.
3) Don't call xfrm_policy_cache_flush under xfrm_state_lock,
it triggers several locking warnings. From Artem Savkov.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Mon, 9 Oct 2017 06:43:55 +0000 (08:43 +0200)]
ipv4: Fix traffic triggered IPsec connections.
A recent patch removed the dst_free() on the allocated
dst_entry in ipv4_blackhole_route(). The dst_free() marked the
dst_entry as dead and added it to the gc list. I.e. it was set up
for one-time use. As a result we may now have a blackhole
route cached at a socket in some IPsec scenarios. This makes the
connection unusable.
Fix this by marking the dst_entry directly at allocation time
as 'dead', so it is used only once.
Fixes: b838d5e1c5b6 ("ipv4: mark DST_NOGC and remove the operation of dst_free()")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Mon, 9 Oct 2017 06:39:43 +0000 (08:39 +0200)]
ipv6: Fix traffic triggered IPsec connections.
A recent patch removed the dst_free() on the allocated
dst_entry in ipv6_blackhole_route(). The dst_free() marked
the dst_entry as dead and added it to the gc list. I.e. it
was set up for one-time use. As a result we may now have
a blackhole route cached at a socket in some IPsec scenarios.
This makes the connection unusable.
Fix this by marking the dst_entry directly at allocation time
as 'dead', so it is used only once.
Fixes: 587fea741134 ("ipv6: mark DST_NOGC and remove the operation of dst_free()")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 7 Sep 2017 17:32:48 +0000 (10:32 -0700)]
ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
Changing the TX ring parameters with an XDP program attached may
cause the XDP queues to be cleared and the TX rings to be incorrectly
configured.
Fix by doing correct ring accounting in setup call.
Fixes: 33fdc82f0883 ("ixgbe: add support for XDP_TX action")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Ding Tianhong [Fri, 18 Aug 2017 06:21:05 +0000 (14:21 +0800)]
net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
The ixgbe driver uses a compile-time check to determine whether it can
send TLPs to the Root Port with the Relaxed Ordering Attribute set,
which is inconvenient. Now that the new flag
PCI_DEV_FLAGS_NO_RELAXED_ORDERING has been added to the kernel, we can
check bit 4 in the PCIe Device Control register to determine whether to
use the Relaxed Ordering Attributes, so use this new way in the ixgbe
driver.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Ding Tianhong [Fri, 18 Aug 2017 06:21:04 +0000 (14:21 +0800)]
Revert commit 1a8b6d76dc5b ("net:add one common config...")
The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING has been added
to indicate that Relaxed Ordering Attributes (RO) should not
be used for Transaction Layer Packets (TLP) targeted toward
the affected Root Ports. It clears bit 4 in the PCIe
Device Control register, so PCIe device drivers can
query PCIe configuration space to determine whether they can send
TLPs to the Root Port with the Relaxed Ordering Attributes set.
With this new flag we no longer need the config ARCH_WANT_RELAX_ORDER
to control the Relaxed Ordering Attributes for the ixgbe driver,
as commit 1a8b6d76dc5b ("net:add one common config...") did,
so revert that commit.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sabrina Dubroca [Mon, 3 Jul 2017 11:02:55 +0000 (13:02 +0200)]
ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
In ixgbe_clear_udp_tunnel_port(), we read the IXGBE_VXLANCTRL register
and then try to mask some bits out of the value, using the logical
instead of bitwise and operator.
Fixes: a21d0822ff69 ("ixgbe: add support for geneve Rx offload")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
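The difference between the two operators in the ixgbe masking fix above is easy to see with concrete numbers. The following Python sketch emulates C's `&&` and `&`; the register value and mask are made-up illustrations, not the real IXGBE_VXLANCTRL layout.

```python
def c_logical_and(a: int, b: int) -> int:
    """C's `a && b`: 1 if both operands are non-zero, otherwise 0."""
    return 1 if (a != 0 and b != 0) else 0

def mask_bits(reg: int, mask: int) -> int:
    """C's `reg & mask`: keeps only the bits selected by mask."""
    return reg & mask

# Hypothetical 32-bit register value and port mask.
REG = 0x12B512B5
PORT_MASK = 0x0000FFFF
CLEAR = ~PORT_MASK & 0xFFFFFFFF   # emulate a 32-bit `~PORT_MASK`

wrong = c_logical_and(REG, CLEAR)  # collapses the register to 0 or 1
right = mask_bits(REG, CLEAR)      # clears only the masked-out bits
```

The logical form throws away every bit of the register and leaves only a truth value, which is why the masked write-back in the driver ended up wrong.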
Mark D Rustad [Wed, 31 Aug 2016 17:34:28 +0000 (10:34 -0700)]
ixgbe: Return error when getting PHY address if PHY access is not supported
In cases where PHY register access is not supported, don't mislead
a caller into thinking that it is supported by returning a PHY
address. Instead, return -EOPNOTSUPP when PHY access is not
supported.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shmulik Ladkani [Mon, 9 Oct 2017 12:27:15 +0000 (15:27 +0300)]
netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when the 2nd
invocation occurs, userspace passes a bogus fd number, which causes
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform an
"entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd for every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
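Why a saved '.fd' becomes bogus in a later iptables invocation, and why a path lookup does not, can be illustrated with a toy model. Everything here is hypothetical (the `Process` class, the `PINNED` table, the rule dictionary); it only mimics per-process fd tables versus a global pinned-object namespace.

```python
# Hypothetical global table of pinned objects, keyed by path.
PINNED = {"/sys/fs/bpf/xxx": "prog-object"}

class Process:
    """Toy process with its own fd table (fd -> pinned object path)."""

    def __init__(self):
        self.fds = {}
        self.next_fd = 3

    def open_pinned(self, path: str) -> int:
        fd = self.next_fd
        self.next_fd += 1
        self.fds[fd] = path
        return fd

def check_by_fd(proc: Process, rule: dict):
    """Old behaviour: resolve the fd stored in the rule in the caller's table."""
    return PINNED.get(proc.fds.get(rule["fd"]))

def check_by_path(rule: dict):
    """Fixed behaviour: ignore the fd and look the object up by its path."""
    return PINNED.get(rule["path"])
```

A second process replaying the saved rule has no entry for the stored fd, so `check_by_fd` fails there while `check_by_path` still resolves the object.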
Lin Zhang [Thu, 5 Oct 2017 16:44:03 +0000 (00:44 +0800)]
netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
In function {ipv4,ipv6}_synproxy_hook we expect a normal tcp packet, but
the real server may reply with an icmp error packet related to the
existing tcp conntrack, in which case we would access the wrong tcp data.
Fix it by checking the protocol field and only processing tcp traffic.
Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
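The guard described in the SYNPROXY fix above amounts to an early return on the protocol field. A minimal Python sketch, assuming a hypothetical hook that returns a verdict string (the verdict names are illustrative, not the kernel's constants):

```python
import socket

def synproxy_hook(protocol: int) -> str:
    """Toy hook: only TCP packets are ever parsed as TCP."""
    if protocol != socket.IPPROTO_TCP:
        # e.g. an ICMP error related to the same conntrack entry
        return "ACCEPT"
    return "HANDLE_TCP"
```

Checking the protocol first means the TCP header is never read from a packet that does not carry one.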
Jon Maloy [Sat, 7 Oct 2017 13:07:20 +0000 (15:07 +0200)]
tipc: Unclone message at secondary destination lookup
When a bundling message is received, the function tipc_link_input()
calls function tipc_msg_extract() to unbundle all inner messages of
the bundling message before adding them to input queue.
The function tipc_msg_extract() just clones the inner skb for each
inner message from the bundling skb. This means that the skb
headroom of an inner message overlaps with the data part of the
preceding message in the bundle.
If the message in question is a name addressed message, it may be
subject to a secondary destination lookup, and eventually be sent out
on one of the interfaces again. But, since what is perceived as headroom
by the device driver in reality is the last bytes of the preceding
message in the bundle, the latter will be overwritten by the MAC
addresses of the L2 header. If the preceding message has not yet been
consumed by the user, it will eventually be delivered with corrupted
contents.
This commit fixes this by uncloning all messages passing through the
function tipc_msg_lookup_dest(), hence ensuring that the headroom
is always valid when the message is passed on.
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
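The headroom overlap described in the tipc fix above can be reproduced with a shared byte buffer. This is a Python sketch under stated assumptions: a `bytearray` stands in for the bundling skb, and the helper names and the 4-byte header are hypothetical.

```python
HDR = b"\xaa" * 4  # stand-in for the MAC addresses of an L2 header

def forward_clone(bundle: bytearray, start: int, hdr: bytes) -> None:
    """Clone case: the 'headroom' before start is the previous message."""
    view = memoryview(bundle)
    view[start - len(hdr):start] = hdr  # overwrites the preceding bytes

def forward_uncloned(bundle: bytearray, start: int, end: int, hdr: bytes) -> bytes:
    """Uncloned case: copy the message out, then prepend the header."""
    private = bytes(bundle[start:end])  # private copy with its own headroom
    return hdr + private
```

In the clone case the header lands on top of the first message in the bundle; after copying, the bundle is left untouched.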
Jon Maloy [Sat, 7 Oct 2017 12:32:49 +0000 (14:32 +0200)]
tipc: correct initialization of skb list
We change the initialization of the skb transmit buffer queues
in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
initialize their spinlocks. This is needed because we may, during
error conditions, need to call skb_queue_purge() on those queues
further down the stack.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 9 Oct 2017 03:53:29 +0000 (20:53 -0700)]
Linux 4.14-rc4
Alexey Kodanev [Fri, 6 Oct 2017 16:02:35 +0000 (19:02 +0300)]
gso: fix payload length when gso_size is zero
When gso_size is reset to zero for the tail segment in skb_segment(),
later in ipv6_gso_segment(), __skb_udp_tunnel_segment() and
gre_gso_segment() we will get incorrect results (payload length, pcsum)
for that segment. inet_gso_segment() already has a check for gso_size
before calculating payload.
The issue was found with LTP vxlan & gre tests over ixgbe NIC.
Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
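The kind of check the gso fix above refers to can be sketched as a branch on gso_size. This Python model is an assumption-laden illustration: the function and all of its parameters are hypothetical and do not reproduce the kernel's actual expressions.

```python
def payload_len(seg_len: int, hdr_len: int, gso_size: int, data_offset: int) -> int:
    """Pick the payload length for one segment (all parameters hypothetical)."""
    if gso_size:
        # Segment still carries GSO metadata: derive the payload from it.
        return gso_size + data_offset
    # Tail segment: gso_size was reset to zero, so use the real length.
    return seg_len - hdr_len
```

Without the branch, a zero gso_size would feed into the first formula and yield a payload length unrelated to the tail segment's actual size.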
Ido Schimmel [Sun, 8 Oct 2017 09:53:26 +0000 (11:53 +0200)]
mlxsw: spectrum_router: Avoid expensive lookup during route removal
In commit fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for
all virtual routers") I increased the scale of supported VRFs by having
all of them share the same LPM tree.
In order to avoid look-ups for prefix lengths that don't exist, each
route removal would trigger an aggregation across all the active virtual
routers to see which prefix lengths are in use and which aren't and
structure the tree accordingly.
With the way the data structures are currently laid out, this is a very
expensive operation. When performed repeatedly - due to the invocation
of the abort mechanism - and with enough VRFs, this can result in a hung
task.
For now, avoid this optimization until it can be properly re-added in
net-next.
Fixes: fc922bb0dd94 ("mlxsw: spectrum_router: Use one LPM tree for all virtual routers")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 5 Oct 2017 23:20:56 +0000 (16:20 -0700)]
bpf: fix liveness marking
While processing an Rx = Ry instruction the verifier does
regs[insn->dst_reg] = regs[insn->src_reg]
which often clears the write mark (when Ry doesn't have it)
that was just set by check_reg_arg(Rx) prior to the assignment.
That causes mark_reg_read() to keep marking Rx in this block as
REG_LIVE_READ (since the logic misses that it's
screened by the write) and in many of its parents (until a lucky
write into the same Rx or the beginning of the program).
That causes the is_state_visited() logic to miss many pruning opportunities.
Furthermore, the mark_reg_read() logic propagates the read mark
for BPF_REG_FP as well (though it's read-only), which causes
harmless but unnecessary work during is_state_visited().
Note that do_propagate_liveness() skips FP correctly,
so do the same in mark_reg_read() as well.
It saves 0.2 seconds for the test below
program               before   after
bpf_lb-DLB_L3.o         2604    2304
bpf_lb-DLB_L4.o        11159    3723
bpf_lb-DUNKNOWN.o       1116    1110
bpf_lxc-DDROP_ALL.o    34566   28004
bpf_lxc-DUNKNOWN.o     53267   39026
bpf_netdev.o           17843   16943
bpf_overlay.o           8672    7929
time                 ~11 sec  ~4 sec
Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
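How a whole-struct copy can wipe a freshly set write mark, as the bpf liveness fix above describes, is shown in this small Python model. The `Reg` class, the flag values, and both `mov_*` helpers are hypothetical; re-applying the mark after the copy is one plausible way to keep it, not necessarily the exact change made in the verifier.

```python
from dataclasses import dataclass, replace

LIVE_NONE, LIVE_READ, LIVE_WRITTEN = 0, 1, 2

@dataclass
class Reg:
    value: int = 0
    live: int = LIVE_NONE

def mov_buggy(regs, dst, src):
    regs[dst].live |= LIVE_WRITTEN   # mark set before the copy...
    regs[dst] = replace(regs[src])   # ...and clobbered by copying src's state

def mov_fixed(regs, dst, src):
    regs[dst] = replace(regs[src])
    regs[dst].live |= LIVE_WRITTEN   # mark applied after the copy survives
```

In the buggy order the destination inherits the source's liveness flags, so a later read of Rx looks unscreened and keeps propagating upward.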
Axel Beckert [Thu, 5 Oct 2017 20:00:33 +0000 (22:00 +0200)]
doc: Fix typo "8023.ad" in bonding documentation
Should be "802.3ad" like everywhere else in the document.
Signed-off-by: Axel Beckert <abe@deuxchevaux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Matteo Croce [Thu, 5 Oct 2017 17:03:05 +0000 (19:03 +0200)]
ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
Commit 35e015e1f577 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
was intended to affect accept_dad flag handling in such a way that
DAD operation and mode on a given interface would be selected
according to the maximum value of conf/{all,interface}/accept_dad.
However, addrconf_dad_begin() checks for particular cases in which we
need to skip DAD, and this check was modified in the wrong way.
Namely, it was modified so that, if the accept_dad flag is 0 for the
given interface *or* for all interfaces, DAD would be skipped.
Instead, we have to skip DAD if accept_dad is 0 for the given interface
*and* for all interfaces.
Fixes: 35e015e1f577 ("ipv6: fix net.ipv6.conf.all interface DAD handlers")
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reported-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
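The *or* versus *and* distinction in the accept_dad fix above can be checked exhaustively for small values. A minimal Python sketch with hypothetical function names:

```python
def skip_dad_buggy(all_accept_dad: int, if_accept_dad: int) -> bool:
    """Wrong: skips DAD when either setting is 0."""
    return all_accept_dad == 0 or if_accept_dad == 0

def skip_dad_fixed(all_accept_dad: int, if_accept_dad: int) -> bool:
    """Right: skips DAD only when both settings are 0."""
    return all_accept_dad == 0 and if_accept_dad == 0
```

For non-negative values the fixed predicate agrees with "the maximum of the two values is 0", which matches the intended selection by maximum described in the commit message.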
Linus Torvalds [Sat, 7 Oct 2017 19:34:16 +0000 (12:34 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
- a couple of serious fixes: use after free and blacklist for WRITE
SAME
- one error leg fix: write_pending failure
- one user experience problem: do not override max_sectors_kb
- one minor unused function removal
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ibmvscsis: Fix write_pending failure path
scsi: libiscsi: Remove iscsi_destroy_session
scsi: libiscsi: Fix use-after-free race during iscsi_session_teardown
scsi: sd: Do not override max_sectors_kb sysfs setting
scsi: sd: Implement blacklist option for WRITE SAME w/ UNMAP
Linus Torvalds [Sat, 7 Oct 2017 17:07:51 +0000 (10:07 -0700)]
Merge branch 'i2c/for-current-4.14' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C has three driver fixes for the newly introduced drivers and one ID
addition for the i801 driver"
* 'i2c/for-current-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: i2c-stm32f7: make structure stm32f7_setup static const
i2c: ensure termination of *_device_id tables
i2c: i801: Add support for Intel Cedar Fork
i2c: stm32f7: fix setup structure
Linus Torvalds [Sat, 7 Oct 2017 17:03:03 +0000 (10:03 -0700)]
Merge tag 'mmc-v4.14-rc3' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Fix driver strength selection when selecting hs400es
- Delete bounce buffer handling:
This change fixes a problem related to how bounce buffers are being
allocated. However, instead of trying to fix that, let's just
remove the mmc bounce buffer code altogether, as it has practically
no use.
MMC host:
- meson-gx: A couple of fixes related to clock/phase/tuning
- sdhci-xenon: Fix clock resource by adding an optional bus clock"
* tag 'mmc-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-xenon: Fix clock resource by adding an optional bus clock
mmc: meson-gx: include tx phase in the tuning process
mmc: meson-gx: fix rx phase reset
mmc: meson-gx: make sure the clock is rounded down
mmc: Delete bounce buffer handling
mmc: core: add driver strength selection when selecting hs400es
Linus Torvalds [Sat, 7 Oct 2017 00:59:32 +0000 (17:59 -0700)]
Merge tag 'hwmon-for-linus-v4.14-rc4' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
"Fix up error path in xgene driver"
* tag 'hwmon-for-linus-v4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (xgene) Fix up error handling path mixup in 'xgene_hwmon_probe()'
Linus Torvalds [Fri, 6 Oct 2017 23:25:08 +0000 (16:25 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
- build fix to export the clk_bulk_prepare() symbol
- suspend fix for Samsung Exynos SoCs where we need to keep clks on
across suspend
- two critical clk markings for clks that shouldn't ever turn off on
Rockchip SoCs
- a fix for a copy-paste mistake on Rockchip rk3128 causing some clks
to touch the same bit and trample over one another
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: samsung: exynos4: Enable VPLL and EPLL clocks for suspend/resume cycle
clk: Export clk_bulk_prepare()
clk: rockchip: add sclk_timer5 as critical clock on rk3128
clk: rockchip: fix up rk3128 pvtm and mipi_24m gate regs error
clk: rockchip: add pclk_pmu as critical clock on rk3128
Linus Torvalds [Fri, 6 Oct 2017 22:57:08 +0000 (15:57 -0700)]
Merge tag 'arc-4.14-rc4' of git://git./linux/kernel/git/vgupta/arc
Pull ARC updates from Vineet Gupta:
- updates for various platforms
- boot log updates for upcoming HS48 family of cores (dual issue)
* tag 'arc-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-hsdk]: Add reset controller node to manage ethernet reset
ARC: [plat-hsdk]: Temporary fix to set CPU frequency to 1GHz
ARC: fix allnoconfig build warning
ARCv2: boot log: identify HS48 cores (dual issue)
ARC: boot log: decontaminate ARCv2 ISA_CONFIG register
arc: remove redundant UTS_MACHINE define in arch/arc/Makefile
ARC: [plat-eznps] Update platform maintainer as Noam left
ARC: [plat-hsdk] use actual clk driver to manage cpu clk
ARC: [*defconfig] Reenable soft lock-up detector
ARC: [plat-axs10x] sdio: Temporary fix of sdio ciu frequency
ARC: [plat-hsdk] sdio: Temporary fix of sdio ciu frequency
ARC: [plat-axs103] Add temporary quirk to reset ethernet IP
Linus Torvalds [Fri, 6 Oct 2017 22:53:36 +0000 (15:53 -0700)]
Merge tag 'xfs-4.14-fixes-4' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
- fix a race between overlapping copy on write aio
- fix cow fork swapping when we defragment reflinked files
* tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: handle racy AIO in xfs_reflink_end_cow
xfs: always swap the cow forks when swapping extents
Linus Torvalds [Fri, 6 Oct 2017 19:13:50 +0000 (12:13 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This contains:
- NVMe pull request from Christoph, one uuid attribute fix, and one
fix for the controller memory buffer address for remapped BARs.
- use-after-free fix for bsg, from Benjamin Block.
- bcache race/use-after-free fix for a list traversal, fixing a
regression in this merge window. From Coly Li.
- null_blk: change the configfs dependency from a 'depends' to a
'select'. This is a change from this merge window as well. From me.
- nbd signal fix from Josef, fixing a regression introduced with the
status code changes.
- nbd MAINTAINERS mailing list entry update.
- blk-throttle stall fix from Joseph Qi.
- blk-mq-debugfs fix from Omar, fixing an issue where we don't
register the IO scheduler debugfs directory, if the driver is
loaded with it. Only shows up if you switch through the sysfs
interface"
* 'for-linus' of git://git.kernel.dk/linux-block:
bsg-lib: fix use-after-free under memory-pressure
nvme-pci: Use PCI bus address for data/queues in CMB
blk-mq-debugfs: fix device sched directory for default scheduler
null_blk: change configfs dependency to select
blk-throttle: fix possible io stall when upgrade to max
MAINTAINERS: update list for NBD
nbd: fix -ERESTARTSYS handling
nvme: fix visibility of "uuid" ns attribute
bcache: use llist_for_each_entry_safe() in __closure_wake_up()
Linus Torvalds [Fri, 6 Oct 2017 19:07:09 +0000 (12:07 -0700)]
Merge tag 'pci-v4.14-fixes-4' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"Fix legacy IDE probe issues exposed by recent PCI core IRQ mapping
changes (Bartlomiej Zolnierkiewicz, Lorenzo Pieralisi)"
* tag 'pci-v4.14-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
ide: fix IRQ assignment for PCI bus order probing
ide: pci: free PCI BARs on initialization failure
ide: free hwif->portdev on hwif_init() failure
Linus Torvalds [Fri, 6 Oct 2017 18:31:46 +0000 (11:31 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Bring initialisation of user space undefined instruction handling
early (core_initcall) since late_initcall() happens after modprobe in
initramfs is invoked. Similar fix for fpsimd initialisation
- Increase the kernel stack when KASAN is enabled
- Bring the PCI ACS enabling earlier via the
iort_init_platform_devices()
- Fix misleading data abort address printing (decimal vs hex)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Ensure fpsimd support is ready before userspace is active
arm64: Ensure the instruction emulation is ready for userspace
arm64: Use larger stacks when KASAN is selected
ACPI/IORT: Fix PCI ACS enablement
arm64: fix misleading data abort decoding
Linus Torvalds [Fri, 6 Oct 2017 18:28:34 +0000 (11:28 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
- fix PPC XIVE interrupt delivery
- fix x86 RCU breakage from asynchronous page faults when built without
PREEMPT_COUNT
- fix x86 build with -frecord-gcc-switches
- fix x86 build without X86_LOCAL_APIC
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: add X86_LOCAL_APIC dependency
x86/kvm: Move kvm_fastop_exception to .fixup section
kvm/x86: Avoid async PF preempting the kernel incorrectly
KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()
Linus Torvalds [Fri, 6 Oct 2017 18:25:55 +0000 (11:25 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"This is a pretty small pull request. Only 6 patches in total. There
are no outstanding -rc patches on the mailing list after this pull
request, so only if some new issues are discovered in the remainder of
the rc cycles will you hear from me again.
Summary:
- a fix for iwpm netlink usage
- a fix for error unwinding in mlx5
- two fixes to vlan handling in qedr
- a couple small i40iw fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
i40iw: Fix port number for query QP
i40iw: Add missing memory barriers
RDMA/qedr: Parse vlan priority as sl
RDMA/qedr: Parse VLAN ID correctly and ignore the value of zero
IB/mlx5: Fix label order in error path handling
RDMA/iwpm: Properly mark end of NL messages
Guillaume Nault [Fri, 6 Oct 2017 15:05:49 +0000 (17:05 +0200)]
ppp: fix race in ppp device destruction
ppp_release() tries to ensure that netdevices are unregistered before
decrementing the unit refcount and running ppp_destroy_interface().
This is all fine as long as the device is unregistered by
ppp_release(): the unregister_netdevice() call, followed by
rtnl_unlock(), guarantees that the unregistration process completes
before rtnl_unlock() returns.
However, the device may be unregistered by other means (like
ppp_nl_dellink()). If this happens right before ppp_release() calls
rtnl_lock(), then ppp_release() has to wait for the concurrent
unregistration code to release the lock.
But rtnl_unlock() releases the lock before completing the device
unregistration process. This allows ppp_release() to proceed and
eventually call ppp_destroy_interface() before the unregistration
process completes. Calling free_netdev() on this partially unregistered
device will BUG():
------------[ cut here ]------------
kernel BUG at net/core/dev.c:8141!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1557 Comm: pppd Not tainted 4.14.0-rc2+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
Call Trace:
ppp_destroy_interface+0xd8/0xe0 [ppp_generic]
ppp_disconnect_channel+0xda/0x110 [ppp_generic]
ppp_unregister_channel+0x5e/0x110 [ppp_generic]
pppox_unbind_sock+0x23/0x30 [pppox]
pppoe_connect+0x130/0x440 [pppoe]
SYSC_connect+0x98/0x110
? do_fcntl+0x2c0/0x5d0
SyS_connect+0xe/0x10
entry_SYSCALL_64_fastpath+0x1a/0xa5
RIP: free_netdev+0x107/0x110 RSP: ffffc28a40573d88
---[ end trace ed294ff0cc40eeff ]---
We could set the ->needs_free_netdev flag on PPP devices and move the
ppp_destroy_interface() logic into the ->priv_destructor() callback. But
that'd be quite intrusive as we'd first need to unlink from the other
channels and units that depend on the device (the ones that used the
PPPIOCCONNECT and PPPIOCATTACH ioctls).
Instead, we can just let the netdevice hold a reference on its
ppp_file. This reference is dropped in ->priv_destructor(), at the very
end of the unregistration process, so that neither ppp_release() nor
ppp_disconnect_channel() can call ppp_destroy_interface() in the interim.
Reported-by: Beniamino Galvani <bgalvani@redhat.com>
Fixes: 8cb775bc0a34 ("ppp: fix device unregistration upon netns deletion")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
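The reference-holding scheme chosen in the ppp fix above can be modelled with a plain counter. This Python sketch is illustrative only: the classes `PppFile` and `NetDevice` and their methods are hypothetical stand-ins, with a flag in place of the real ppp_destroy_interface().

```python
class PppFile:
    def __init__(self):
        self.refcnt = 1           # reference held by the open file
        self.destroyed = False

    def get(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.destroyed = True  # stands in for ppp_destroy_interface()

class NetDevice:
    def __init__(self, pf: PppFile):
        self.pf = pf
        pf.get()                   # the netdevice pins its ppp_file
        self.unregistered = False

    def priv_destructor(self):
        self.unregistered = True   # unregistration has fully completed
        self.pf.put()              # only now drop the device's reference
```

Even if the release path drops its reference while unregistration is still in flight, destruction is deferred until the device's own reference goes away at the very end.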
Linus Torvalds [Fri, 6 Oct 2017 16:03:08 +0000 (09:03 -0700)]
Merge branch 'for-4.14-rc4' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit