This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().
changes from v2:
o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
o fix one bug when using truncate_inode_pages_range which is pointed out by
Jaegeuk Kim.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.
changes from v2:
o update description of memory usage gain for 64-bit machine suggested by
Changman Lee.
changes from v1:
o introduce inline copy_node_info() to copy valid data from node info suggested
by Jaegeuk Kim, it can avoid bug.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.
Changes from v1
o remove inappropriate 'unlikely' in npages_for_summary_flush.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.
In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.
So, for small discards, we need to add candidates only just being obsoleted.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
o f2fs_ioc_abort_volatile_write
- If transaction was failed, all the grabbed pages and data should be written.
o f2fs_ioc_release_volatile_write
- This is to enhance the performance of PERSIST mode in sqlite.
In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to force to write dirty_exceeded for f2fs_balance_fs_bg.
This flag was only meaningful to write bypassing conditions.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The
clashing O_PATH value was added in commit 5229645bdc ("vfs: add
nonconflicting values for O_PATH") but this can't be changed as it is
user-visible.
FMODE_NONOTIFY is only used internally in the kernel, but it is in the
same numbering space as the other O_* flags, as indicated by the comment
at the top of include/uapi/asm-generic/fcntl.h (and its use in
fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash.
All of this has happened before (commit 12ed2e36c9: "fanotify:
FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will
happen again -- so update the uniqueness check in fcntl_init() to
include __FMODE_NONOTIFY.
Signed-off-by: David Drysdale <drysdale@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_link(), the parent directory inode passed to function
ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of
new_dentry not old_dentry. We should get old_dir from old_dentry and
lookup old_dentry in old_dir in case another node remove the old dentry.
With this change, hard linking works again, when paths are relative with
at least one subdirectory. This is how the problem was reproducable:
# mkdir a
# mkdir b
# touch a/test
# ln a/test b/test
ln: failed to create hard link `b/test' => `a/test': No such file or directory
However when creating links in the same dir, it worked well.
Now the link gets created.
Fixes: 0e048316ff ("ocfs2: check existence of old dentry in ocfs2_link()")
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Szabo Aron - UBIT <aron@ubit.hu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Tested-by: Aron Szabo <aron@ubit.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_process_recovery_data, only when dlm_new_lock failed the ret will
be set to -ENOMEM. And in this case, newlock is definitely NULL. So
test newlock is meaningless, remove it.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4 to handle ext3 file systems, plus two minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUqxTQAAoJENNvdpvBGATwUC8QAIYfP02XYyGrBQIoAoCiRJji
TmhAe2Amy9xHgBq2I8Yz3bi8j1wGuiqc57fYgwwVScf10tzmO/HuwnAZZDAmg9hK
ZWp9WyPoyf+U/nbkIfC5mRh3Qz0dt1pt6R3uQDUlcUuAamdMBrdJnhkQC6WMbpU2
fAqsJT3/yGrLnMF29eVqJzcxb5KORJ8hEcD7kwkvJwe4sGm3C7iDsjS0i63YWDz4
QNclW6zF4THhmuVNxwRupOgMQNSq8sHg8U23nP4DZLvLE7GlgtwfDvehU7uBfw5n
WO5UfsEYLoeODNmujUJCtjXNLpzDXmrtByyWbbTK7EX3MmV94ym4uu5lHLfyMiTc
o2ppxcsKBVcOsPWnFwuhJ5p/Wyy0Uld9Q3P6b5ymhyzDhkuwcTURpeRxBRXHEgcm
nY5GE1bBdO7OigDz/+DFL/Zgr8EO7hW72hrBaLDWMEbrrl0asZw/ReC/bnreMmm4
sP87DB+MqRXzRs8aOPWmCofJwGSgCYmOq2nqNCAaxgk/ofvrDURrnZYfLrbzspGa
hqE1W0X5hvQydcifi4qq2Na76+Js3atSY38EOH/HNknSqlQjysnkW4ajTDWk/GFy
M/fKUCfIl1tmCMN2myZzl89E7uMSyod75ycd0BQy36iHPE14JVvk/u7GfcKHLs53
1rAPpW90a72GX2Z9+xxA
=uPYz
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Revert a potential seek_data/hole regression which shows up when using
ext4 to handle ext3 file systems, plus two minor bug fixes"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: remove spurious KERN_INFO from ext4_warning call
Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
ext4: prevent online resize with backup superblock
This reverts commit 14516bb7bb.
This was causing regression test failures with generic/285 with an ext3
filesystem using CONFIG_EXT4_USE_FOR_EXT23.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull CIFS fixes from Steve French:
"A set of three minor cifs fixes"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: make new inode cache when file type is different
Fix signed/unsigned pointer warning
Convert MessageID in smb2_hdr to LE
Pull UDF & isofs fixes from Jan Kara:
"A couple of UDF fixes of handling of corrupted media and one iso9660
fix of the same"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Reduce repeated dereferences
udf: Check component length before reading it
udf: Check path length when reading symlink
udf: Verify symlink size before loading it
udf: Verify i_size when loading inode
isofs: Fix unchecked printing of ER records
Prevent BUG or corrupted file systems after the following:
mkfs.ext4 /dev/vdc 100M
mount -t ext4 -o sb=40961 /dev/vdc /vdc
resize2fs /dev/vdc
We previously prevented online resizing using the old resize ioctl.
Move the code to ext4_resize_begin(), so the check applies for all of
the resize ioctl's.
Reported-by: Maxim Malkov <malkov@ispras.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In spite of different file type,
if file is same name and same inode number, old inode cache is used.
This causes that you can not cd directory, can not cat SymbolicLink.
So this patch is that if file type is different, return error.
Reproducible sample :
1. create file 'a' at cifs client.
2. repeat rm and mkdir 'a' 4 times at server, then direcotry 'a' having same inode number is created.
(Repeat 4 times, then same inode number is recycled.)
(When server is under RHEL 6.6, 1 time is O.K. Always same inode number is recycled.)
3. ls -li at client, then you can not cd directory, can not remove directory.
SymbolicLink has same problem.
Bug link:
https://bugzilla.kernel.org/show_bug.cgi?id=90011
Signed-off-by: Nakajima Akira <nakajima.akira@nttcom.co.jp>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Check that length specified in a component of a symlink fits in the
input buffer we are reading. Also properly ignore component length for
component types that do not use it. Otherwise we read memory after end
of buffer for corrupted udf image.
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Pull vfs pile #3 from Al Viro:
"Assorted fixes and patches from the last cycle"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[regression] chunk lost from bd9b51
vfs: make mounts and mountstats honor root dir like mountinfo does
vfs: cleanup show_mountinfo
init: fix read-write root mount
unfuck binfmt_misc.c (broken by commit e6084d4)
vm_area_operations: kill ->migrate()
new helper: iter_is_iovec()
move_extent_per_page(): get rid of unused w_flags
lustre: get rid of playing with ->fs
btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
- The filename decryption routines were, at times, writing a zero byte one
character past the end of the filename buffer
- The encrypted view feature attempted, and failed, to roll its own form of
enforcing a read-only mount instead of letting the VFS enforce it
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJUlGlWAAoJENaSAD2qAscKgdwQAJLCgt5xgoij8uH65mYVy6PE
++E8KggCHchhUcFh19EuuG++ENzwZedqKqbbRhmZt+svla08uPDO0jY+sYaOZzqB
HMzZyyL4Fw/DiVsK6LUMGfN2CS7tua9ReK0dj4tUc3jwBplnQnGrQlUPbgdPRJJp
XJDKHVtHGQPQC1dJAeProCJTg23jv5wXacly2I5VxZuh5DnDWjuo2KkzqGMKfadU
9bxOf5DjVwQ/X6apgkNxG/8Q6J8a+80K7SGs/aRELIuRL0+mIj5AX9ht+p5Z0XI+
E8jU4HM2biuqj8PtxK/WbyF1bHiREVyOLdneW+n3pcqGTLMZ3TLpDURkNd0cwZcd
AGoZtUZZuQ/xvzE9IrEj+KYeINnZzPQpyJu9jXXp1CsQJ8u/wUG9e0kcJrUI1dZ/
q26zhPaVhJ9x0go/BNOxzTUpOtuMWXttAVZGICthp74I5l+DgfSZ4J4CeQmFMMRs
kbiTg5h4qsrD4jnEU/BCscpCLoz4qlF6+q/+lspwkg7OwvhHc0qSMJn3UjZ/eYzb
ncDp6gc4Iju3RhWnVEZphA4ttbZJX7ahER4y0NdIG65Fa37AUEUxy0OmGfoyFOmC
KVtlHkl2hKoCB5+ujLzL7WimH33ogtVhtOJTZlXzS0lsdSfSxUWU4/OZ22jTy0hL
Bnx6uoNIPWsr6b5OJiA0
=jTyq
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-3.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
"Fixes for filename decryption and encrypted view plus a cleanup
- The filename decryption routines were, at times, writing a zero
byte one character past the end of the filename buffer
- The encrypted view feature attempted, and failed, to roll its own
form of enforcing a read-only mount instead of letting the VFS
enforce it"
* tag 'ecryptfs-3.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: Remove buggy and unnecessary write in file name decode routine
eCryptfs: Remove unnecessary casts when parsing packet lengths
eCryptfs: Force RO mount when encrypted view is enabled
Pull more btrfs updates from Chris Mason:
"This is part two of our merge window patches.
These are all from Filipe, and fix some really hard to find races that
can cause corruptions. Most of them involved block group removal
(balance) or discard"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove non-sense btrfs_error_discard_extent() function
Btrfs: fix fs corruption on transaction abort if device supports discard
Btrfs: always clear a block group node when removing it from the tree
Btrfs: ensure deletion from pinned_chunks list is protected
Pull irq core fix from Thomas Gleixner:
"A single fix plugging a long standing race between proc/stat and
proc/interrupts access and freeing of interrupt descriptors"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Prevent proc race against freeing of irq descriptors
Symlink reading code does not check whether the resulting path fits into
the page provided by the generic code. This isn't as easy as just
checking the symlink size because of various encoding conversions we
perform on path. So we have to check whether there is still enough space
in the buffer on the fly.
CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
UDF specification allows arbitrarily large symlinks. However we support
only symlinks at most one block large. Check the length of the symlink
so that we don't access memory beyond end of the symlink block.
CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Verify that inode size is sane when loading inode with data stored in
ICB. Otherwise we may get confused later when working with the inode and
inode size is too big.
CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
We didn't check length of rock ridge ER records before printing them.
Thus corrupted isofs image can cause us to access and print some memory
behind the buffer with obvious consequences.
Reported-and-tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Merge misc patches from Andrew Morton:
"A few stragglers"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
tools/testing/selftests/Makefile: alphasort the TARGETS list
mm/zsmalloc: adjust order of functions
ocfs2: fix journal commit deadlock
ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker
ocfs2: reflink: fix slow unlink for refcounted file
mm/memory.c:do_shared_fault(): add comment
.mailmap: Santosh Shilimkar has moved
.mailmap: update akpm@osdl.org
lib/show_mem.c: add cma reserved information
fs/proc/meminfo.c: include cma info in proc/meminfo
mm: cma: split cma-reserved in dmesg log
hfsplus: fix longname handling
mm/mempolicy.c: remove unnecessary is_valid_nodemask()
For buffer write, page lock will be got in write_begin and released in
write_end, in ocfs2_write_end_nolock(), before it unlock the page in
ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask
for the read lock of journal->j_trans_barrier. Holding page lock and
ask for journal->j_trans_barrier breaks the locking order.
This will cause a deadlock with journal commit threads, ocfs2cmt will
get write lock of journal->j_trans_barrier first, then it wakes up
kjournald2 to do the commit work, at last it waits until done. To
commit journal, kjournald2 needs flushing data first, it needs get the
cache page lock.
Since some ocfs2 cluster locks are holding by write process, this
deadlock may hung the whole cluster.
unlock pages before ocfs2_run_deallocs() can fix the locking order, also
put unlock before ocfs2_commit_trans() to make page lock is unlocked
before j_trans_barrier to preserve unlocking order.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ac4fef4d23 ("ocfs2/dlm: do not purge lockres that is queued for
assert master") may have the following possible race case:
dlm_dispatch_assert_master dlm_wq
========================================================================
queue_work(dlm->quedlm_worker,
&dlm->dispatched_work);
dispatch work,
dlm_lockres_drop_inflight_worker
*BUG_ON(res->inflight_assert_workers == 0)*
dlm_lockres_grab_inflight_worker
inflight_assert_workers++
So ensure inflight_assert_workers to be increased first.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Xue jiufei <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running ocfs2 test suite multiple nodes reflink stress test, for a
4 nodes cluster, every unlink() for refcounted file needs about 700s.
The slow unlink is caused by the contention of refcount tree lock since
all nodes are unlink files using the same refcount tree. When the
unlinking file have many extents(over 1600 in our test), most of the
extents has refcounted flag set. In ocfs2_commit_truncate(), it will
execute the following call trace for every extents. This means it needs
get and released refcount tree lock about 1600 times. And when several
nodes are do this at the same time, the performance will be very low.
ocfs2_remove_btree_range()
-- ocfs2_lock_refcount_tree()
---- ocfs2_refcount_lock()
------ __ocfs2_cluster_lock()
ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to
do lock/unlock once can improve a lot performance.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Wengang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo.
Currently, in a CMA enabled system, if somebody wants to know the total
CMA size declared, there is no way to tell, other than the dmesg or
/var/log/messages logs.
With this patch we are showing the CMA info as part of meminfo, so that it
can be determined at any point of time. This will be populated only when
CMA is enabled.
Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB.
MemTotal: 471172 kB
MemFree: 111712 kB
MemAvailable: 271172 kB
.
.
.
CmaTotal: 16384 kB
CmaFree: 6144 kB
This patch also fix below checkpatch errors that were found during these changes.
ERROR: space required after that ',' (ctx:ExV)
199: FILE: fs/proc/meminfo.c:199:
+ ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
^
ERROR: space required after that ',' (ctx:ExV)
202: FILE: fs/proc/meminfo.c:202:
+ ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
^
ERROR: space required after that ',' (ctx:ExV)
206: FILE: fs/proc/meminfo.c:206:
+ ,K(totalcma_pages)
^
total: 3 errors, 0 warnings, 2 checks, 236 lines checked
Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Longname is not correctly handled by hfsplus driver. If an attempt to
create a longname(>255) file/directory is made, it succeeds by creating a
file/directory with HFSPLUS_MAX_STRLEN and incorrect catalog key. Thus
leaving the volume in an inconsistent state. This patch fixes this issue.
Although lookup is always called first to create a negative entry, so just
doing a check in lookup would probably fix this issue. I choose to
propagate error to other iops as well.
Please NOTE: I have factored out hfsplus_cat_build_key_with_cnid from
hfsplus_cat_build_key, to avoid unncessary branching.
Thanks a lot.
TEST:
------
dir="TEST_DIR"
cdir=`pwd`
name255="_123456789_123456789_123456789_123456789_123456789_123456789\
_123456789_123456789_123456789_123456789_123456789_123456789_123456789\
_123456789_123456789_123456789_123456789_123456789_123456789_123456789\
_123456789_123456789_123456789_123456789_123456789_1234"
name256="${name255}5"
mkdir $dir
cd $dir
touch $name255
rm -f $name255
touch $name256
ls -la
cd $cdir
rm -rf $dir
RESULT:
-------
[sougata@ultrabook tmp]$ cdir=`pwd`
[sougata@ultrabook tmp]$
name255="_123456789_123456789_123456789_123456789_123456789_123456789\
> _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
> _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
> _123456789_123456789_123456789_123456789_123456789_1234"
[sougata@ultrabook tmp]$ name256="${name255}5"
[sougata@ultrabook tmp]$
[sougata@ultrabook tmp]$ mkdir $dir
[sougata@ultrabook tmp]$ cd $dir
[sougata@ultrabook TEST_DIR]$ touch $name255
[sougata@ultrabook TEST_DIR]$ rm -f $name255
[sougata@ultrabook TEST_DIR]$ touch $name256
[sougata@ultrabook TEST_DIR]$ ls -la
ls: cannot access
_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234:
No such file or directory
total 0
drwxrwxr-x 1 sougata sougata 3 Feb 20 19:56 .
drwxrwxrwx 1 root root 6 Feb 20 19:56 ..
-????????? ? ? ? ? ?
_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234
[sougata@ultrabook TEST_DIR]$ cd $cdir
[sougata@ultrabook tmp]$ rm -rf $dir
rm: cannot remove `TEST_DIR': Directory not empty
-ENAMETOOLONG returned from hfsplus_asc2uni was not propaged to iops.
This allowed hfsplus to create files/directories with HFSPLUS_MAX_STRLEN
and incorrect keys, leaving the FS in an inconsistent state. This patch
fixes this issue.
Signed-off-by: Sougata Santra <sougata@tuxera.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While reviewing the code of umount_tree I realized that when we append
to a preexisting unmounted list we do not change pprev of the former
first item in the list.
Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
the former first item of the list will stomp unmounted.first leaving
it set to some random mount point which we are likely to free soon.
This isn't likely to hit, but if it does I don't know how anyone could
track it down.
[ This happened because we don't have all the same operations for
hlist's as we do for normal doubly-linked lists. In particular,
list_splice() is easy on our standard doubly-linked lists, while
hlist_splice() doesn't exist and needs both start/end entries of the
hlist. And commit 38129a13e6 incorrectly open-coded that missing
hlist_splice().
We should think about making these kinds of "mindless" conversions
easier to get right by adding the missing hlist helpers - Linus ]
Fixes: 38129a13e6 switch mnt_hash to hlist
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Neither Sage nor I noticed that Zheng Yan had mistakenly committed
fs/ceph/super.h.rej as part of commit 31c542a199 ("ceph: add inline
data to pagecache").
Remove it.
Requested-by: Yan, Zheng <ukernel@gmail.com>
Cc: Sage Weil <sweil@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ceph updates from Sage Weil:
"The big item here is support for inline data for CephFS and for
message signatures from Zheng. There are also several bug fixes,
including interrupted flock request handling, 0-length xattrs, mksnap,
cached readdir results, and a message version compat field. Finally
there are several cleanups from Ilya, Dan, and Markus.
Note that there is another series coming soon that fixes some bugs in
the RBD 'lingering' requests, but it isn't quite ready yet"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
ceph: fix setting empty extended attribute
ceph: fix mksnap crash
ceph: do_sync is never initialized
libceph: fixup includes in pagelist.h
ceph: support inline data feature
ceph: flush inline version
ceph: convert inline data to normal data before data write
ceph: sync read inline data
ceph: fetch inline data when getting Fcr cap refs
ceph: use getattr request to fetch inline data
ceph: add inline data to pagecache
ceph: parse inline data in MClientReply and MClientCaps
libceph: specify position of extent operation
libceph: add CREATE osd operation support
libceph: add SETXATTR/CMPXATTR osd operations support
rbd: don't treat CEPH_OSD_OP_DELETE as extent op
ceph: remove unused stringification macros
libceph: require cephx message signature by default
ceph: introduce global empty snap context
ceph: message versioning fixes
...
Pull user namespace related fixes from Eric Biederman:
"As these are bug fixes almost all of thes changes are marked for
backporting to stable.
The first change (implicitly adding MNT_NODEV on remount) addresses a
regression that was created when security issues with unprivileged
remount were closed. I go on to update the remount test to make it
easy to detect if this issue reoccurs.
Then there are a handful of mount and umount related fixes.
Then half of the changes deal with the a recently discovered design
bug in the permission checks of gid_map. Unix since the beginning has
allowed setting group permissions on files to less than the user and
other permissions (aka ---rwx---rwx). As the unix permission checks
stop as soon as a group matches, and setgroups allows setting groups
that can not later be dropped, results in a situtation where it is
possible to legitimately use a group to assign fewer privileges to a
process. Which means dropping a group can increase a processes
privileges.
The fix I have adopted is that gid_map is now no longer writable
without privilege unless the new file /proc/self/setgroups has been
set to permanently disable setgroups.
The bulk of user namespace using applications even the applications
using applications using user namespaces without privilege remain
unaffected by this change. Unfortunately this ix breaks a couple user
space applications, that were relying on the problematic behavior (one
of which was tools/selftests/mount/unprivileged-remount-test.c).
To hopefully prevent needing a regression fix on top of my security
fix I rounded folks who work with the container implementations mostly
like to be affected and encouraged them to test the changes.
> So far nothing broke on my libvirt-lxc test bed. :-)
> Tested with openSUSE 13.2 and libvirt 1.2.9.
> Tested-by: Richard Weinberger <richard@nod.at>
> Tested on Fedora20 with libvirt 1.2.11, works fine.
> Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
> Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
> Just to be sure I was testing the right thing I also tested using
> my unprivileged nsexec testcases, and they failed on setgroup/setgid
> as now expected, and succeeded there without your patches.
> Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
> I tested this with Sandstorm. It breaks as is and it works if I add
> the setgroups thing.
> Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :("
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Unbreak the unprivileged remount tests
userns; Correct the comment in map_write
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
mnt: Clear mnt_expire during pivot_root
mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
umount: Do not allow unmounting rootfs.
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
* Add device tree support for DoC3
* SPI NOR:
Refactoring, for better layering between spi-nor.c and its driver users
(e.g., m25p80.c)
New flash device support
Support 6-byte ID strings
* NAND
New NAND driver for Allwinner SoC's (sunxi)
GPMI NAND: add support for raw (no ECC) access, for testing purposes
Add ATO manufacturer ID
A few odd driver fixes
* MTD tests:
Allow testers to compensate for OOB bitflips in oobtest
Fix a torturetest regression
* nandsim: Support longer ID byte strings
And more.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUj6PbAAoJEFySrpd9RFgtahcP/RGvknk9lnitaZI7+aZPP8Zs
AopfiuisLNv3v87EEBAWGYjRYm6vuuhO1z3K55iOIlemBVzoMIP0jf68XGy9uXnL
Ox6AHqxm55wwmc+CHry5/GssaqE6GzdPm8TBP+AGGNhHrhc+raJL55R0QJaoYVwX
pUxkhWaa4lZ6CrOIMQ3n+MEnduilHZoFIcXSc1UtI0y9IXsf1m0Qs8M5jGN8BQ16
HOVNJN9wOXF89swoi7bKsyAn+QFUYgdksAisncb6b9r5Ks5KHmcOuS1LM5X9YoUr
KfeogHfDM68fcaHsSvMU1xmxjXGtE+HFJE52eYNPB1fNbT3JAC13FFj92GeSsZtE
ekpCQh4OPLazT/2wCUHTQwC7T1dCilwyW7VJB9MSl7fSBo9P7jIiUHxVUdM43kez
Di02/XWi4IULTwrgzZiTT8yplFrVdvkmKHAAFEIoaVWiF/l4DeSodLGUw7owmNYn
rJPBPQulpPHRwKZY7gThJuOUXpgbT715GSbvmPYXimHBqmViiPkrhqQ/b/v4PRRs
Nlfhwbswr7WBq6vmPkd6eOyfdFANmWcZQMp/++BCdI/7mhfaik72GWMTBSuJ7hN5
PB+z95soHaKBWlbiDGGGPvuqJmPkOVq1R8itQdIYBWEh7eNSHecwVxyUJJ+V3oPv
QkD7mEP2ZozZe3Ys2EJQ
=gDW8
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
"Summary:
- Add device tree support for DoC3
- SPI NOR:
Refactoring, for better layering between spi-nor.c and its
driver users (e.g., m25p80.c)
New flash device support
Support 6-byte ID strings
- NAND:
New NAND driver for Allwinner SoC's (sunxi)
GPMI NAND: add support for raw (no ECC) access, for testing
purposes
Add ATO manufacturer ID
A few odd driver fixes
- MTD tests:
Allow testers to compensate for OOB bitflips in oobtest
Fix a torturetest regression
- nandsim: Support longer ID byte strings
And more"
* tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd: (63 commits)
mtd: tests: abort torturetest on erase errors
mtd: physmap_of: fix potential NULL dereference
mtd: spi-nor: allow NULL as chip name and try to auto detect it
mtd: nand: gpmi: add raw oob access functions
mtd: nand: gpmi: add proper raw access support
mtd: nand: gpmi: add gpmi_copy_bits function
mtd: spi-nor: factor out write_enable() for erase commands
mtd: spi-nor: add support for s25fl128s
mtd: spi-nor: remove the jedec_id/ext_id
mtd: spi-nor: add id/id_len for flash_info{}
mtd: nand: correct the comment of function nand_block_isreserved()
jffs2: Drop bogus if in comment
mtd: atmel_nand: replace memcpy32_toio/memcpy32_fromio with memcpy
mtd: cafe_nand: drop duplicate .write_page implementation
mtd: m25p80: Add support for serial flash Spansion S25FL132K
MTD: m25p80: fix inconsistency in m25p_ids compared to spi_nor_ids
mtd: spi-nor: improve wait-till-ready timeout loop
mtd: delete unnecessary checks before two function calls
mtd: nand: omap: Fix NAND enumeration on 3430 LDP
mtd: nand: add ATO manufacturer info
...
Pull fuse update from Miklos Szeredi:
"The first part makes sure we don't hold up umount with pending async
requests. In addition to being a cleanup, this is a small behavioral
change (for the better) and unlikely to break anything.
The second part prepares for a cleanup of the fuse device I/O code by
adding a helper for simple request submission, with some savings in
line numbers already realized"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use file_inode() in fuse_file_fallocate()
fuse: introduce fuse_simple_request() helper
fuse: reduce max out args
fuse: hold inode instead of path after release
fuse: flush requests on umount
fuse: don't wake up reserved req in fuse_conn_kill()
make sure 'value' is not null. otherwise __ceph_setxattr will remove
the extended attribute.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
mksnap reply only contain 'target', does not contain 'dentry'. So
it's wrong to use req->r_reply_info.head->is_dentry to detect traceless
reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Probably this code was syncing a lot more often then intended because
the do_sync variable wasn't set to zero.
Cc: stable@vger.kernel.org # v3.11+
Fixes: c62988ec09 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
After converting inline data to normal data, client need to flush
the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes
cap messages (sent to MDS) contain inline_version and inline_data.
Client always converts inline data to normal data before data write,
so the inline data length part is always zero.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Before any data write, convert inline data to normal data and set
i_inline_version to CEPH_INLINE_NONE. The OSD request that saves
inline data to object contains 3 operations (CMPXATTR, WRITE and
SETXATTR). It compares a xattr named 'inline_version' to prevent
old data overwrites newer data.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
we can't use getattr to fetch inline data while holding Fr cap,
because it can cause deadlock. If we need to sync read inline data,
drop cap refs first, then use getattr to fetch inline data.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
we can't use getattr to fetch inline data after getting Fcr caps,
because it can cause deadlock. The solution is try bringing inline
data to page cache when not holding any cap, and hope the inline
data page is still there after getting the Fcr caps. If the page
is still there, pin it in page cache for later IO.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data
in getattr reply will be copied to the page.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Request reply and cap message can contain inline data. add inline data
to the page cache if there is Fc cap.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
allow specifying position of extent operation in multi-operations
osd request. This is required for cephfs to convert inline data to
normal data (compare xattr, then write object).
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Current snaphost code does not properly handle moving inode from one
empty snap realm to another empty snap realm. After changing inode's
snap realm, some dirty pages' snap context can be not equal to inode's
i_head_snap. This can trigger BUG() in ceph_put_wrbuffer_cap_refs()
The fix is introduce a global empty snap context for all empty snap
realm. This avoids triggering the BUG() for filesystem with no snapshot.
Fixes: http://tracker.ceph.com/issues/9928
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
There were two places we were assigning version in host byte order
instead of network byte order.
Also in MSG_CLIENT_SESSION we weren't setting compat_version in the
header to reflect continued compatability with older MDSs.
Fixes: http://tracker.ceph.com/issues/9945
Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
After creating/deleting/renaming file, offsets of sibling dentries may
change. So we can not use cached dentries to satisfy readdir. But we can
still use the cached dentries to conclude -ENOENT for lookup.
This patch introduces a new inode flag indicating if child dentries are
ordered. The flag is set at the same time marking a directory complete.
After creating/deleting/renaming file, we clear the flag on directory
inode. This prevents ceph_readdir() from using cached dentries to satisfy
readdir syscall.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When a lock operation is interrupted, current code sends a unlock request to
MDS to undo the lock operation. This method does not work as expected because
the unlock request can drop locks that have already been acquired.
The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR
requests to interrupt blocked file lock request. These requests do not drop
locks that have alread been acquired, they only interrupt blocked file lock
request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
As we already show mountpoints relative to the root directory, thanks
to the change made back in 2000, change show_vfsmnt() and show_vfsstat()
to skip out-of-root mountpoints the same way as show_mountinfo() does.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Starting with commit v3.2-rc4-1-g02125a8, seq_path_root() no longer
changes the value of its "struct path *root" argument.
Starting with commit v3.2-rc7-104-g8c9379e, the "struct path *root"
argument of seq_path_root() is const.
As result, the temporary variable "root" in show_mountinfo() that
holds a copy of struct path root is no longer needed.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
scanarg(s, del) never returns s; the empty field results in s + 1.
Restore the correct checks, and move NUL-termination into scanarg(),
while we are at it.
Incidentally, mixing "coding style cleanups" (for small values of cleanup)
with functional changes is a Bad Idea(tm)...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
the only instance this method has ever grown was one in kernfs -
one that call ->migrate() of another vm_ops if it exists.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs pile #2 from Al Viro:
"Next pile (and there'll be one or two more).
The large piece in this one is getting rid of /proc/*/ns/* weirdness;
among other things, it allows to (finally) make nameidata completely
opaque outside of fs/namei.c, making for easier further cleanups in
there"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_venus_readdir(): use file_inode()
fs/namei.c: fold link_path_walk() call into path_init()
path_init(): don't bother with LOOKUP_PARENT in argument
fs/namei.c: new helper (path_cleanup())
path_init(): store the "base" pointer to file in nameidata itself
make default ->i_fop have ->open() fail with ENXIO
make nameidata completely opaque outside of fs/namei.c
kill proc_ns completely
take the targets of /proc/*/ns/* symlinks to separate fs
bury struct proc_ns in fs/proc
copy address of proc_ns_ops into ns_common
new helpers: ns_alloc_inum/ns_free_inum
make proc_ns_operations work with struct ns_common * instead of void *
switch the rest of proc_ns_operations to working with &...->ns
netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
common object embedded into various struct ....ns
Pull isofs and reiserfs fixes from Jan Kara:
"A reiserfs and an isofs fix. They arrived after I sent you my first
pull request and I don't want to delay them unnecessarily till rc2"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
isofs: Fix infinite looping over CE entries
reiserfs: destroy allocated commit workqueue
Pull nfsd updates from Bruce Fields:
"A comparatively quieter cycle for nfsd this time, but still with two
larger changes:
- RPC server scalability improvements from Jeff Layton (using RCU
instead of a spinlock to find idle threads).
- server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
Schumaker, enabling fallocate on new clients"
* 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd4: fix xdr4 count of server in fs_location4
nfsd4: fix xdr4 inclusion of escaped char
sunrpc/cache: convert to use string_escape_str()
sunrpc: only call test_bit once in svc_xprt_received
fs: nfsd: Fix signedness bug in compare_blob
sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
sunrpc: convert to lockless lookup of queued server threads
sunrpc: fix potential races in pool_stats collection
sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
sunrpc: require svc_create callers to pass in meaningful shutdown routine
sunrpc: have svc_wake_up only deal with pool 0
sunrpc: convert sp_task_pending flag to use atomic bitops
sunrpc: move rq_cachetype field to better optimize space
sunrpc: move rq_splice_ok flag into rq_flags
sunrpc: move rq_dropme flag into rq_flags
sunrpc: move rq_usedeferral flag to rq_flags
sunrpc: move rq_local field to rq_flags
sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
nfsd: minor off by one checks in __write_versions()
sunrpc: release svc_pool_map reference when serv allocation fails
...
Rock Ridge extensions define so called Continuation Entries (CE) which
define where is further space with Rock Ridge data. Corrupted isofs
image can contain arbitrarily long chain of these, including a one
containing loop and thus causing kernel to end in an infinite loop when
traversing these entries.
Limit the traversal to 32 entries which should be more than enough space
to store all the Rock Ridge data.
Reported-by: P J P <ppandit@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Pull security layer updates from James Morris:
"In terms of changes, there's general maintenance to the Smack,
SELinux, and integrity code.
The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT,
which allows IMA appraisal to require signatures. Support for reading
keys from rootfs before init is call is also added"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
selinux: Remove security_ops extern
security: smack: fix out-of-bounds access in smk_parse_smack()
VFS: refactor vfs_read()
ima: require signature based appraisal
integrity: provide a hook to load keys when rootfs is ready
ima: load x509 certificate from the kernel
integrity: provide a function to load x509 certificate from the kernel
integrity: define a new function integrity_read_file()
Security: smack: replace kzalloc with kmem_cache for inode_smack
Smack: Lock mode for the floor and hat labels
ima: added support for new kernel cmdline parameter ima_template_fmt
ima: allocate field pointers array on demand in template_desc_init_fields()
ima: don't allocate a copy of template_fmt in template_desc_init_fields()
ima: display template format in meas. list if template name length is zero
ima: added error messages to template-related functions
ima: use atomic bit operations to protect policy update interface
ima: ignore empty and with whitespaces policy lines
ima: no need to allocate entry for comment
ima: report policy load status
ima: use path names cache
...
Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes, just
removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There are
some ath9k patches coming in through this tree that have been acked by
the wireless maintainers as they relied on the debugfs changes.
Everything has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
53kAoLeteByQ3iVwWurwwseRPiWa8+MI
=OVRS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core update from Greg KH:
"Here's the set of driver core patches for 3.19-rc1.
They are dominated by the removal of the .owner field in platform
drivers. They touch a lot of files, but they are "simple" changes,
just removing a line in a structure.
Other than that, a few minor driver core and debugfs changes. There
are some ath9k patches coming in through this tree that have been
acked by the wireless maintainers as they relied on the debugfs
changes.
Everything has been in linux-next for a while"
* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
fs: debugfs: add forward declaration for struct device type
firmware class: Deletion of an unnecessary check before the function call "vunmap"
firmware loader: fix hung task warning dump
devcoredump: provide a one-way disable function
device: Add dev_<level>_once variants
ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
ath: use seq_file api for ath9k debugfs files
debugfs: add helper function to create device related seq_file
drivers/base: cacheinfo: remove noisy error boot message
Revert "core: platform: add warning if driver has no owner"
drivers: base: support cpu cache information interface to userspace via sysfs
drivers: base: add cpu_device_create to support per-cpu devices
topology: replace custom attribute macros with standard DEVICE_ATTR*
cpumask: factor out show_cpumap into separate helper function
driver core: Fix unbalanced device reference in drivers_probe
driver core: fix race with userland in device_add()
sysfs/kernfs: make read requests on pre-alloc files use the buffer.
sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
fs: sysfs: return EGBIG on write if offset is larger than file size
...
LZ4 is a lightweight compression algorithm which can be used
on embedded systems to reduce CPU and memory overhead (in comparison
to the standard zlib compression).
These patches add the wrapper code to allow Squashfs to use
the existing LZ4 decompression code, and the necessary configuration
option.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUjgIJAAoJEJAch/D1fbHUArwP/iDiDSpqxdfTQHwUKxB57skO
0iBzg6bXPlqwmsnllegg0SV0vOvFjpWiXWpOOtAVCJlfbov8DgUndsigpvG3UhD/
qJpNXAJ+xSdOzBqj3bS7SOu2DwPY8Gz4rxGQcNN3PsuOVR/EUgAnNlv22ZHY10A5
XQVyPbkwZ73TrZ2uKA8leWArFtCbM4oYGpxP+ramEox8nVFEOtixn5IcX5WkbGEL
Yt0NRw8K8vDIIETWVariugUFE4C1olFk+YmqqAw7cmDGJ70cEg5jh9ocNkwDIZPj
I9BNtkggBRMaCPwGsH6IvahMFUyWLQUgGayfY/fgbRiB9ZuYIQ1lyPDhzbgWczoE
o34eXAIDdmfPrmYlDEBDkYnXXtwuYqdOVYOtcEnyFEYqpHfaeS2h2s9nTiM+rz21
v0UEaDRmPtlkK/ZdLKUsrOf+8y9ejkT0R67swFaguHshL6EHey7X5ghmOuwCoL9x
fzGWtPFR+Nbqga5T3dwf+apvyUVrPaw6gZu36NNim2779ZgpnIPzW6MEYUMhtXCn
ef2+NvS9AeGyo7kiqlNQrihQWZSN0W/AiVsEeulzk5h+adzSNQ5eipzAO9DAAp16
8muY4nq51bOGVaWzqJz/KacCmt7i0qUdmS1p4l2uqPp9gH/s/S91yrYn/iszf3AV
CpwU2i9g3nQu9ecDc1Os
=mr7J
-----END PGP SIGNATURE-----
Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
Pull squashfs update from Phillip Lougher:
"These patches optionally add LZ4 compression support to Squashfs.
LZ4 is a lightweight compression algorithm which can be used on
embedded systems to reduce CPU and memory overhead (in comparison to
the standard zlib compression).
These patches add the wrapper code to allow Squashfs to use the
existing LZ4 decompression code, and the necessary configuration
option"
* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: Add LZ4 compression configuration option
Squashfs: add LZ4 compression support
Pull aio updates from Benjamin LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: Skip timer for io_getevents if timeout=0
aio: Make it possible to remap aio ring
Commit 2ae83bf938 ("[CIFS] Fix setting time before epoch (negative
time values)") changed "u64 t" to "s64 t", which makes do_div() complain
about a pointer signedness mismatch:
CC fs/cifs/netmisc.o
In file included from ./arch/mips/include/asm/div64.h:12:0,
from include/linux/kernel.h:124,
from include/linux/list.h:8,
from include/linux/wait.h:6,
from include/linux/net.h:23,
from fs/cifs/netmisc.c:25:
fs/cifs/netmisc.c: In function ‘cifs_NTtimeToUnix’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
fs/cifs/netmisc.c:941:22: note: in expansion of macro ‘do_div’
ts.tv_nsec = (long)do_div(t, 10000000) * 100;
Introduce a temporary "u64 abs_t" variable to fix this.
Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
We have encountered failures when When testing smb2 mounts on ppc64
machines when using both Samba as well as Windows 2012.
On poking around, the problem was determined to be caused by the
high endian MessageID passed in the header for smb2. On checking the
corresponding MID for smb1 is converted to LE before being sent on the
wire.
We have tested this patch successfully on a ppc64 machine.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
In this case, it is basically a polling. Let's not involve timer at all
because that would hurt performance for application event loops.
In an arbitrary test I've done, io_getevents syscall elapsed time
reduces from 50000+ nanoseconds to a few hundereds.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.
So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).
To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.
So, to make restore work we need to make sure that
a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value
Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.
Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.
What do you think?
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
destroy_list is used to track marks which still need waiting for srcu
period end before they can be freed. However by the time mark is added to
destroy_list it isn't in group's list of marks anymore and thus we can
reuse fsnotify_mark->g_list for queueing into destroy_list. This saves
two pointers for each fsnotify_mark.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a lot of common code in inode and mount marks handling. Factor it
out to a common helper function.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fanotify and the inotify API can be used to monitor changes of the
file system. System call fallocate() modifies files. Hence it should
trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
events. The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
this value allows to create arbitrary file content from random data.
This patch adds the missing call to fsnotify_modify().
The FAN_MODIFY and IN_MODIFY event will be created when fallocate()
succeeds. It will even be created if the file length remains unchanged,
e.g. when calling fanotify with flag FALLOC_FL_KEEP_SIZE.
This logic was primarily chosen to keep the coding simple.
It resembles the logic of the write() system call.
When we call write() we always create a FAN_MODIFY event, even in the case
of overwriting with identical data.
Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
actually changed.
Furthermore even if if the filesize remains unchanged, fallocate() may
influence whether a subsequent write() will succeed and hence the
fallocate() call may be considered a modification.
The fallocate(2) man page teaches: After a successful call, subsequent
writes into the range specified by offset and len are guaranteed not to
fail because of lack of disk space.
So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
different outcomes of a subsequent write depending on the values of offset
and len.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on ext2_direct_IO
Tested with O_DIRECT file open and sysbench/mariadb with 1% written
queries improvement (update_non_index test) on a volume created with
mkaffs.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-Remove ErrorBuffer and use %pV
-Add __printf to enable argument mistmatch warnings
Original patch by Joe Perches.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts). The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.
Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.
Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns. The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch (of 4):
Add a new execveat(2) system call. execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.
In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).
The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).
Based on patches by Meredydd Luff.
Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running FSX with direct I/O mode, fsx resulted in DATA past EOF issues.
fsx ./file2 -Z -r 4096 -w 4096
...
..
truncating to largest ever: 0x907c
fallocating to largest ever: 0x11137
truncating to largest ever: 0x2c6fe
truncating to largest ever: 0x2cfdf
fallocating to largest ever: 0x40000
Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
...
..
The reason being, it is doing a truncate down, but the zeroing does not
happen on the last block boundary when offset is not aligned. Even though
it calls truncate_setsize()->truncate_inode_pages()->
truncate_inode_pages_range() and considers the partial zeroout but it
retrieves the page using find_lock_page() - which only looks the page in
the cache. So, zeroing out does not happen in case of direct IO.
Make a truncate page based around block_truncate_page for FAT filesystem
and invoke that helper to zerout in case the offset is not aligned with
the blocksize.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coverity id: 1042674
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 058504edd0 ("fs/seq_file: fallback to vmalloc allocation"),
seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
memory fails. This was done to address order-4 slab allocations for
reading /proc/stat on large machines and noticed because
PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
allocator when allocating new slab for such high-order allocations.
Contiguous memory isn't necessary for caller of seq_buf_alloc(), however.
Other GFP_KERNEL high-order allocations that are <=
PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and
oom kill processes as a result.
We don't want to kill processes so that we can allocate contiguous memory
in situations when contiguous memory isn't necessary.
This patch does the kmalloc() allocation with __GFP_NORETRY for high-order
allocations. This still utilizes memory compaction and direct reclaim in
the allocation path, the only difference is that it will fail immediately
instead of oom kill processes when out of memory.
[akpm@linux-foundation.org: add comment]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask. This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages. The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.
Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.
Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does. Accumulate the number over all
visited lruvecs to get the per-zone value.
Some nodes have multiple zones due to memory addressing restrictions. To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.
For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim. It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.
This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.
Zone reclaim behavior also changes. It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs. Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.
[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory. To
this end, this lock can also be a rwsem. In addition, there are some
important opportunities to share the lock when there are no tree
modifications.
This conversion is straightforward. For now, all users take the write
lock.
[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
When resirefs is trying to mount a partition, it creates a commit
workqueue (sbi->commit_wq). But when mount fails later, the workqueue
is not freed.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: auxsvr@gmail.com
Reported-by: Benoît Monin <benoit.monin@gmx.fr>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # >= 3.16
Cc: reiserfs-devel@vger.kernel.org
Fixes: 797d9016ce
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUi0B6AAoJEKurIx+X31iByH8P/jfMgzyUO+KpJMA1DbgCAG7x
WPJgbMUyPwB63DH09RyMEmiwf61Rl1klXTPVNY0Dnj7qRJOmpB9U3vGIfO4HpD84
5IZMBlc+Jl+kJCxSAJYbTJTZLsIMjFGOfuVTvlY+HnMBitQVBumKptmC0DoBBqgz
yYy5MHRMaVoHcogyMyBiknmxdxu6/ruUKY+6yyvdUESt0SCcJG8V6Qik7TMmnx47
NvIIPzfibvvLLnd8IOEj2fwh8XMtJdfcCxPpAEvEaNq0jZEDF9K22jttTQvl9r92
NQf7JKQQrNfzloRZ3flKax5ZMGi9RkcirTLLdJ4I2xMGVHOA4XUAjsSCYR6INuuJ
Ox00FnuiIrADNw37m52Y+ujPTF1C2PQUNK69gwsLd84MSjy+95F2dlC5cC3Yt4N5
rpstXxWELZTqjMGD8GTPOpv6zlg799IbFexr4H6KTc+47EX0MNayJiI6L597gYnq
gIiPmDnnz6WlWp4HHgBIwjNAH3Tbf/uU3MlgzqS3Ftd7YkYmLnxvClhrwgErviFn
Nfnz2LtGuMxMHSt0uSWxODVEaR4reKRVJBvhRSGWL1PufylEyt0YWayiqpohuKD9
6X/RufWK5qdCBHytoGyMUZ57oqxth9QSVG4RBkGPmaZgMq/5DdyOhBfW0yInjMuo
AuDMmqrU5yFTitLMGcsG
=kcmD
-----END PGP SIGNATURE-----
Merge tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore update #2 from Tony Luck:
"Couple of pstore-ram enhancements to allow use of different memory
attributes"
* tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore-ram: Allow optional mapping with pgprot_noncached
pstore-ram: Fix hangs by using write-combine mappings
Pull btrfs update from Chris Mason:
"From a feature point of view, most of the code here comes from Miao
Xie and others at Fujitsu to implement scrubbing and replacing devices
on raid56. This has been in development for a while, and it's a big
improvement.
Filipe and Josef have a great assortment of fixes, many of which solve
problems corruptions either after a crash or in error conditions. I
still have a round two from Filipe for next week that solves
corruptions with discard and block group removal"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
Btrfs: make get_caching_control unconditionally return the ctl
Btrfs: fix unprotected deletion from pending_chunks list
Btrfs: fix fs mapping extent map leak
Btrfs: fix memory leak after block remove + trimming
Btrfs: make btrfs_abort_transaction consider existence of new block groups
Btrfs: fix race between writing free space cache and trimming
Btrfs: fix race between fs trimming and block group remove/allocation
Btrfs, replace: enable dev-replace for raid56
Btrfs: fix freeing used extents after removing empty block group
Btrfs: fix crash caused by block group removal
Btrfs: fix invalid block group rbtree access after bg is removed
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
Btrfs, replace: write raid56 parity into the replace target device
Btrfs, replace: write dirty pages into the replace target device
Btrfs, raid56: support parity scrub on raid56
Btrfs, raid56: use a variant to record the operation type
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
Btrfs, raid56: don't change bbio and raid_map
Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
...