Welcome! Log In Create A New Profile Recent Messages

Advanced

[PATCH 00/17] writeback fixes and cleanups for 2.6.40 (v2)

Posted by Wu Fengguang 
Andrew,

This is the combination of all the recent writeback patches that get
reasonably reviewed and tested.

The first 10 patches are already in -mm tree, with updates:

- remove "writeback: pass writeback_control down to move_expired_inodes()", and
resolve the resulting merge conflicts in other patches.
- move ahead the sync livelock prevention patches (01, 02) so that (04) won't livelock sync
- merge the three -mm fixes to (08)
- fixed changelog of (01)

[PATCH 01/17] writeback: introduce .tagged_sync for the WB_SYNC_NONE sync stage
[PATCH 02/17] writeback: update dirtied_when for synced inode to prevent livelock
[PATCH 03/17] writeback: introduce writeback_control.inodes_cleaned
[PATCH 04/17] writeback: try more writeback as long as something was written
[PATCH 05/17] writeback: the kupdate expire timestamp should be a moving target
[PATCH 06/17] writeback: sync expired inodes first in background writeback
[PATCH 07/17] writeback: refill b_io iff empty
[PATCH 08/17] writeback: split inode_wb_list_lock into bdi_writeback.list_lock
[PATCH 09/17] writeback: elevate queue_io() into wb_writeback()
[PATCH 10/17] writeback: avoid extra sync work at enqueue time

The following 7 patches were posted and reviewed these days:

[PATCH 11/17] writeback: add bdi_dirty_limit() kernel-doc
[PATCH 12/17] writeback: skip balance_dirty_pages() for in-memory fs
[PATCH 13/17] writeback: remove writeback_control.more_io
[PATCH 14/17] writeback: make writeback_control.nr_to_write straight
[PATCH 15/17] writeback: remove .nonblocking and .encountered_congestion
[PATCH 16/17] writeback: trace event writeback_single_inode
[PATCH 17/17] writeback: trace event writeback_queue_io

Thanks to Jan and Dave for the careful reviews!

The patches are git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
Clarify the bdi_dirty_limit() comment.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c 2011-05-05 23:30:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-05-05 23:30:31.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
*pdirty = dirty;
}

-/*
+/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * And the "limit" in the name is not seriously taken as hard limit in
+ * balance_dirty_pages().
*
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
before calling __writeback_inodes_wb()

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)

--- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:41:36.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 23:51:59.000000000 +0800
@@ -580,17 +580,13 @@ static int writeback_sb_inodes(struct su
return 1;
}

-void writeback_inodes_wb(struct bdi_writeback *wb,
- struct writeback_control *wbc)
+static void __writeback_inodes_wb(struct bdi_writeback *wb,
+ struct writeback_control *wbc)
{
int ret = 0;

if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
- spin_lock(&wb->list_lock);
-
- if (list_empty(&wb->b_io))
- queue_io(wb, wbc->older_than_this);

while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -606,19 +602,16 @@ void writeback_inodes_wb(struct bdi_writ
if (ret)
break;
}
- spin_unlock(&wb->list_lock);
/* Leave any unwritten inodes on b_io */
}

-static void __writeback_inodes_sb(struct super_block *sb,
- struct bdi_writeback *wb, struct writeback_control *wbc)
+void writeback_inodes_wb(struct bdi_writeback *wb,
+ struct writeback_control *wbc)
{
- WARN_ON(!rwsem_is_locked(&sb->s_umount));
-
spin_lock(&wb->list_lock);
if (list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
- writeback_sb_inodes(sb, wb, wbc, true);
+ __writeback_inodes_wb(wb, wbc);
spin_unlock(&wb->list_lock);
}

@@ -685,7 +678,7 @@ static long wb_writeback(struct bdi_writ
* The intended call sequence for WB_SYNC_ALL writeback is:
*
* wb_writeback()
- * __writeback_inodes_sb() <== called only once
+ * writeback_sb_inodes() <== called only once
* write_cache_pages() <== called once for each inode
* (quickly) tag currently dirty pages
* (maybe slowly) sync all tagged pages
@@ -731,10 +724,14 @@ static long wb_writeback(struct bdi_writ

retry:
trace_wbc_writeback_start(&wbc, wb->bdi);
+ spin_lock(&wb->list_lock);
+ if (list_empty(&wb->b_io))
+ queue_io(wb, wbc.older_than_this);
if (work->sb)
- __writeback_inodes_sb(work->sb, wb, &wbc);
+ writeback_sb_inodes(work->sb, wb, &wbc, true);
else
- writeback_inodes_wb(wb, &wbc);
+ __writeback_inodes_wb(wb, &wbc);
+ spin_unlock(&wb->list_lock);
trace_wbc_writeback_written(&wbc, wb->bdi);

work->nr_pages -= write_chunk - wbc.nr_to_write;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
Wu Fengguang
[PATCH 07/17] writeback: refill b_io iff empty
May 13, 2011 06:09AM
There is no point to carry different refill policies between for_kupdate
and other type of works. Use a consistent "refill b_io iff empty" policy
which can guarantee fairness in an easy to understand way.

A b_io refill will setup a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.

This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once at each round. So all inodes
will be free from starvations.

This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning some pages and/or inodes. Without that ability,
the old logic on background works relies on aggressively queuing all
eligible inodes into b_io at every time. But that's not a guarantee.

The below test script completes a slightly faster now:

2.6.39-rc3 2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed 256.043 252.367
stddev 24.381 12.530

tar elapsed 30.097 28.808
dd elapsed 13.214 11.782

#!/bin/zsh

cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

umount /dev/sda7
mkfs.xfs -f /dev/sda7
mount /dev/sda7 /fs

echo 3 > /proc/sys/vm/drop_caches

tic=$(cat /proc/uptime|cut -d' ' -f2)

cd /fs
time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

wait
sync
tac=$(cat /proc/uptime|cut -d' ' -f2)
echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analyzes from Dave Chinner in great details:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)

writeback large inode 1024 pages -> b_more_io

b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)

writeback 8 small inodes 800 pages
1 large inode 224 pages -> b_more_io

b_more_io (large inode) -> b_io (8s, 1l)
8 newly expired inodes -> b_io (8s, 1l, 8s)
.....

Your new code:

b_more_io (large inode) -> b_io (1l)
8 newly expired inodes -> b_io (1l, 8s)

writeback large inode 1024 pages -> b_more_io
(b_io == 8s)
writeback 8 small inodes 800 pages

b_io empty: (1800 pages written)
b_more_io (large inode) -> b_io (1l)
14 newly expired inodes -> b_io (1l, 14s)

writeback large inode 1024 pages -> b_more_io
(b_io == 14s)
writeback 10 small inodes 1000 pages
1 small inode 24 pages -> b_more_io (1l, 1s(24))
writeback 5 small inodes 500 pages
b_io empty: (2548 pages written)
b_more_io (large inode) -> b_io (1l, 1s(24))
20 newly expired inodes -> b_io (1l, 1s(24), 20s)
......

Rough progression of pages written at b_io refill:

Old code:

total large file % of writeback
1024 224 21.9% (fixed)

New code:
total large file % of writeback
1800 1024 ~55%
2550 1024 ~40%
3050 1024 ~33%
3500 1024 ~29%
3950 1024 ~26%
4250 1024 ~24%
4500 1024 ~22.7%
4700 1024 ~21.7%
4800 1024 ~21.3%
4800 1024 ~21.3%
(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that is doesn't change the shared of
writeback to the large should be in the commit message winking smiley

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

after + dyn-expire + ioless:
&(&wb->list_lock)->rlock: 2291 2304 0.15 204.09 3125.12 35315 970712 0.10 223.84 1113437.05
------------------------
&(&wb->list_lock)->rlock 9 [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
&(&wb->list_lock)->rlock 1614 [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 459 [<ffffffff8115d351>] writeback_sb_inodes+0x108/0x154
&(&wb->list_lock)->rlock 137 [<ffffffff8115cdcf>] writeback_single_inode+0x1b4/0x296
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110c367>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115dc5d>] inode_wb_list_del+0x5f/0x85
&(&wb->list_lock)->rlock 1160 [<ffffffff8115da6a>] __mark_inode_dirty+0x173/0x1cf
&(&wb->list_lock)->rlock 435 [<ffffffff8115dcb6>] writeback_inodes_wb+0x33/0x12b

after + dyn-expire:
&(&wb->list_lock)->rlock: 226820 229719 0.10 194.28 809275.91 327372 1033513685 0.08 476.96 3590929811.61
------------------------
&(&wb->list_lock)->rlock 11 [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
&(&wb->list_lock)->rlock 30559 [<ffffffff8115bb1f>] wb_writeback+0x2fb/0x3c3
&(&wb->list_lock)->rlock 37339 [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
&(&wb->list_lock)->rlock 54880 [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227
------------------------
&(&wb->list_lock)->rlock 3 [<ffffffff8110b606>] bdi_lock_two+0x46/0x4b
&(&wb->list_lock)->rlock 6 [<ffffffff8115b6d3>] inode_wb_list_del+0x5f/0x85
&(&wb->list_lock)->rlock 55347 [<ffffffff8115b72c>] writeback_inodes_wb+0x33/0x12b
&(&wb->list_lock)->rlock 55338 [<ffffffff8115a87f>] writeback_single_inode+0x17f/0x227

--- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:30:26.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 23:30:27.000000000 +0800
@@ -589,7 +589,8 @@ void writeback_inodes_wb(struct bdi_writ
if (!wbc->wb_start)
wbc->wb_start = jiffies; /* livelock avoidance */
spin_lock(&inode_wb_list_lock);
- if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+ if (list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);

while (!list_empty(&wb->b_io)) {
@@ -616,7 +617,7 @@ static void __writeback_inodes_sb(struct
WARN_ON(!rwsem_is_locked(&sb->s_umount));

spin_lock(&inode_wb_list_lock);
- if (!wbc->for_kupdate || list_empty(&wb->b_io))
+ if (list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
writeback_sb_inodes(sb, wb, wbc, true);
spin_unlock(&inode_wb_list_lock);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
A background flush work may run for ever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

At each queue_io() time, first try enqueuing only newly expired inodes.
If there are zero expired inodes to work with, then relax the rule and
enqueue all dirty inodes.

It at least makes sense from the data integrity point of view.

This may also reduce the number of dirty pages encountered by page
reclaim, eg. the pageout() calls. Normally older inodes contain older
dirty pages, which are more close to the end of the LRU lists. So
syncing older inodes first helps reducing the dirty pages reached by the
page reclaim code.

More background: as Mel put it, "it makes sense to write old pages first
to reduce the chances page reclaim is initiating IO."

Rik also presented the situation with a graph:

LRU head [*] dirty page
[ * * * * * * * * * * *]

Ideally, most dirty pages should lie close to the LRU tail instead of
LRU head. That requires the flusher thread to sync old/expired inodes
first (as there are obvious correlations between inode age and page
age), and to give fair opportunities to newly expired inodes rather
than sticking with some large eldest inodes (as larger inodes have
weaker correlations in the inode<=>page ages).

This patch helps the flusher to meet both the above requirements.

Side effects: it might reduce the batch size and hence reduce
inode_wb_list_lock hold time, but in turn make the cluster-by-partition
logic in the same function less effective on reducing disk seeks.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel<riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:30:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 23:30:26.000000000 +0800
@@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
if (work->for_background && !over_bground_thresh())
break;

- if (work->for_kupdate) {
+ if (work->for_kupdate || work->for_background) {
oldest_jif = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
wbc.older_than_this = &oldest_jif;
@@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
wbc.pages_skipped = 0;
wbc.inodes_cleaned = 0;

+retry:
trace_wbc_writeback_start(&wbc, wb->bdi);
if (work->sb)
__writeback_inodes_sb(work->sb, wb, &wbc);
@@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
if (wbc.inodes_cleaned)
continue;
/*
+ * background writeback will start with expired inodes, and
+ * if none is found, fallback to all inodes. This order helps
+ * reduce the number of dirty pages reaching the end of LRU
+ * lists and cause trouble to the page reclaim.
+ */
+ if (work->for_background &&
+ wbc.older_than_this &&
+ list_empty(&wb->b_io) &&
+ list_empty(&wb->b_more_io)) {
+ wbc.older_than_this = NULL;
+ goto retry;
+ }
+ /*
* No more inodes for IO, bail
*/
if (!wbc.more_io)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
livelock prevention for it, too.

Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.

Impact: It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.

Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/ext4/inode.c | 4 ++--
fs/fs-writeback.c | 9 +++++----
include/linux/writeback.h | 1 +
mm/page-writeback.c | 4 ++--
4 files changed, 10 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:29:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-05 23:30:22.000000000 +0800
@@ -36,6 +36,7 @@ struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
enum writeback_sync_modes sync_mode;
+ unsigned int tagged_sync:1;
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
unsigned int for_background:1;
@@ -650,6 +651,7 @@ static long wb_writeback(struct bdi_writ
{
struct writeback_control wbc = {
.sync_mode = work->sync_mode,
+ .tagged_sync = work->tagged_sync,
.older_than_this = NULL,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
@@ -657,7 +659,7 @@ static long wb_writeback(struct bdi_writ
};
unsigned long oldest_jif;
long wrote = 0;
- long write_chunk;
+ long write_chunk = MAX_WRITEBACK_PAGES;
struct inode *inode;

if (wbc.for_kupdate) {
@@ -683,9 +685,7 @@ static long wb_writeback(struct bdi_writ
* (quickly) tag currently dirty pages
* (maybe slowly) sync all tagged pages
*/
- if (wbc.sync_mode == WB_SYNC_NONE)
- write_chunk = MAX_WRITEBACK_PAGES;
- else
+ if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_sync)
write_chunk = LONG_MAX;

wbc.wb_start = jiffies; /* livelock avoidance */
@@ -1193,6 +1193,7 @@ void writeback_inodes_sb_nr(struct super
struct wb_writeback_work work = {
.sb = sb,
.sync_mode = WB_SYNC_NONE,
+ .tagged_sync = 1,
.done = &done,
.nr_pages = nr,
};
--- linux-next.orig/include/linux/writeback.h 2011-05-05 23:29:51.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-05-05 23:30:22.000000000 +0800
@@ -47,6 +47,7 @@ struct writeback_control {
unsigned encountered_congestion:1; /* An output: a queue is full */
unsigned for_kupdate:1; /* A kupdate writeback */
unsigned for_background:1; /* A background writeback */
+ unsigned tagged_sync:1; /* do livelock prevention for sync */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
--- linux-next.orig/mm/page-writeback.c 2011-05-05 23:29:51.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-05-05 23:30:22.000000000 +0800
@@ -892,12 +892,12 @@ int write_cache_pages(struct address_spa
range_whole = 1;
cycled = 1; /* ignore range_cyclic tests */
}
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
retry:
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && (index <= end)) {
--- linux-next.orig/fs/ext4/inode.c 2011-05-05 23:29:51.000000000 +0800
+++ linux-next/fs/ext4/inode.c 2011-05-05 23:30:22.000000000 +0800
@@ -2741,7 +2741,7 @@ static int write_cache_pages_da(struct a
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;

- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
@@ -2975,7 +2975,7 @@ static int ext4_da_writepages(struct add
}

retry:
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_sync)
tag_pages_for_writeback(mapping, index, end);

while (!ret && wbc->nr_to_write > 0) {


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Thu, May 12, 2011 at 09:57:07PM +0800, Wu Fengguang wrote:
> sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> livelock prevention for it, too.
>
> Note that writeback_inodes_sb() is called by not only sync(), they are
> treated the same because the other callers also need livelock prevention.
>
> Impact: It changes the order in which pages/inodes are synced to disk.
> Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> until finished with the current inode.

What about all the filesystems that implement their own
.writepages()/write_cache_pages() functions or have
have special code that checks WB_SYNC_ALL in .writepages (e.g. gfs2,
ext4, btrfs and perhaps others). Don't they all need to be aware of
this tagged_sync field?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Thu, May 12, 2011 at 09:57:12PM +0800, Wu Fengguang wrote:
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
>
> At each queue_io() time, first try enqueuing only newly expired inodes.
> If there are zero expired inodes to work with, then relax the rule and
> enqueue all dirty inodes.
>
> It at least makes sense from the data integrity point of view.
>
> This may also reduce the number of dirty pages encountered by page
> reclaim, eg. the pageout() calls. Normally older inodes contain older
> dirty pages, which are more close to the end of the LRU lists. So
> syncing older inodes first helps reducing the dirty pages reached by the
> page reclaim code.
>
> More background: as Mel put it, "it makes sense to write old pages first
> to reduce the chances page reclaim is initiating IO."
>
> Rik also presented the situation with a graph:
>
> LRU head [*] dirty page
> [ * * * * * * * * * * *]
>
> Ideally, most dirty pages should lie close to the LRU tail instead of
> LRU head. That requires the flusher thread to sync old/expired inodes
> first (as there are obvious correlations between inode age and page
> age), and to give fair opportunities to newly expired inodes rather
> than sticking with some large eldest inodes (as larger inodes have
> weaker correlations in the inode<=>page ages).
>
> This patch helps the flusher to meet both the above requirements.
>
> Side effects: it might reduce the batch size and hence reduce
> inode_wb_list_lock hold time, but in turn make the cluster-by-partition
> logic in the same function less effective on reducing disk seeks.
>
> v2: keep policy changes inside wb_writeback() and keep the
> wbc.older_than_this visibility as suggested by Dave.
>
> CC: Dave Chinner <david@fromorbit.com>
> Acked-by: Jan Kara <jack@suse.cz>
> Acked-by: Rik van Riel<riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:30:25.000000000 +0800
> +++ linux-next/fs/fs-writeback.c 2011-05-05 23:30:26.000000000 +0800
> @@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
> if (work->for_background && !over_bground_thresh())
> break;
>
> - if (work->for_kupdate) {
> + if (work->for_kupdate || work->for_background) {
> oldest_jif = jiffies -
> msecs_to_jiffies(dirty_expire_interval * 10);
> wbc.older_than_this = &oldest_jif;
> @@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
> wbc.pages_skipped = 0;
> wbc.inodes_cleaned = 0;
>
> +retry:
> trace_wbc_writeback_start(&wbc, wb->bdi);
> if (work->sb)
> __writeback_inodes_sb(work->sb, wb, &wbc);
> @@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
> if (wbc.inodes_cleaned)
> continue;
> /*
> + * background writeback will start with expired inodes, and
> + * if none is found, fallback to all inodes. This order helps
> + * reduce the number of dirty pages reaching the end of LRU
> + * lists and cause trouble to the page reclaim.
> + */
> + if (work->for_background &&
> + wbc.older_than_this &&
> + list_empty(&wb->b_io) &&
> + list_empty(&wb->b_more_io)) {
> + wbc.older_than_this = NULL;
> + goto retry;
> + }
> + /*
> * No more inodes for IO, bail
> */
> if (!wbc.more_io)

I have to say that I dislike this implicit nested looping structure
using a goto. It would seem better to me to make it explicit that we
can do multiple writeback calls by using a do/while loop here and
moving the logic of setting/resetting wbc.older_than_this to one
place inside the nested loop...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Fri, May 13, 2011 at 06:40:13AM +0800, Dave Chinner wrote:
> On Thu, May 12, 2011 at 09:57:07PM +0800, Wu Fengguang wrote:
> > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > livelock prevention for it, too.
> >
> > Note that writeback_inodes_sb() is called by not only sync(), they are
> > treated the same because the other callers also need livelock prevention.
> >
> > Impact: It changes the order in which pages/inodes are synced to disk.
> > Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> > until finished with the current inode.
>
> What about all the filesystems that implement their own
> .writepages()/write_cache_pages() functions or have
> have special code that checks WB_SYNC_ALL in .writepages (e.g. gfs2,
> ext4, btrfs and perhaps others). Don't they all need to be aware of
> this tagged_sync field?

Right, good point. Currently only ext4 is updated. The other
filesystems --- afs, btrfs, cifs, gfs2 --- do not even use
PAGECACHE_TAG_TOWRITE for livelock prevention. My plan was to add
PAGECACHE_TAG_TOWRITE and tagged_sync code to them as the next step,
when tagged_sync is accepted and proved to work fine.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Fri, May 13, 2011 at 10:56:08AM +0800, Wu Fengguang wrote:
> > What about all the filesystems that implement their own
> > .writepages()/write_cache_pages() functions or have
> > have special code that checks WB_SYNC_ALL in .writepages (e.g. gfs2,
> > ext4, btrfs and perhaps others). Don't they all need to be aware of
> > this tagged_sync field?
>
> Right, good point. Currently only ext4 is updated. The other
> filesystems --- afs, btrfs, cifs, gfs2 --- do not even use
> PAGECACHE_TAG_TOWRITE for livelock prevention. My plan was to add
> PAGECACHE_TAG_TOWRITE and tagged_sync code to them as the next step,
> when tagged_sync is accepted and proved to work fine.

I think it would be better to try to figure out why these filesystems
need to duplicate that functionality and figure out if there's any
way to make them use the generic code. But given that we need to get
some writeback updates ready for .40 it might be worth to postponed
that and go down the copy & paste route for now.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Fri, May 13, 2011 at 10:56:08AM +0800, Wu Fengguang wrote:
> On Fri, May 13, 2011 at 06:40:13AM +0800, Dave Chinner wrote:
> > On Thu, May 12, 2011 at 09:57:07PM +0800, Wu Fengguang wrote:
> > > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > > livelock prevention for it, too.
> > >
> > > Note that writeback_inodes_sb() is called by not only sync(), they are
> > > treated the same because the other callers also need livelock prevention.
> > >
> > > Impact: It changes the order in which pages/inodes are synced to disk.
> > > Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> > > until finished with the current inode.
> >
> > What about all the filesystems that implement their own
> > .writepages()/write_cache_pages() functions or have
> > have special code that checks WB_SYNC_ALL in .writepages (e.g. gfs2,
> > ext4, btrfs and perhaps others). Don't they all need to be aware of
> > this tagged_sync field?
>
> Right, good point. Currently only ext4 is updated. The other
> filesystems --- afs, btrfs, cifs, gfs2 --- do not even use
> PAGECACHE_TAG_TOWRITE for livelock prevention. My plan was to add
> PAGECACHE_TAG_TOWRITE and tagged_sync code to them as the next step,
> when tagged_sync is accepted and proved to work fine.

Where "proved to work fine" can mean "caused regressions for certain
filesystems"? I mean, for btrfs it means that the bio is submitted
with WRITE rather than WRITE_SYNC, which causes subtle changes of
behaviour in the elevator. that could cause strange regressions that
are very hard to isolate.

Hence regardless of whether filesystems use PAGECACHE_TAG_TOWRITE
or not, filesystems are checking for synchronous writeback for
a reason. If we now have two different ways of signalling sync
writeback they need to know about them.

Which just raised the question in my mind - why did you add a new
field rather than a new sync_mode definition? After all, this is a
new sync control, and it seems clumsy to me to have two separate
control fields for defining sync behaviour...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Mon, May 16, 2011 at 07:43:06AM +0800, Dave Chinner wrote:
> On Fri, May 13, 2011 at 10:56:08AM +0800, Wu Fengguang wrote:
> > On Fri, May 13, 2011 at 06:40:13AM +0800, Dave Chinner wrote:
> > > On Thu, May 12, 2011 at 09:57:07PM +0800, Wu Fengguang wrote:
> > > > sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
> > > > WB_SYNC_ALL sync. Tag the first stage with wbc.tagged_sync and do
> > > > livelock prevention for it, too.
> > > >
> > > > Note that writeback_inodes_sb() is called by not only sync(), they are
> > > > treated the same because the other callers also need livelock prevention.
> > > >
> > > > Impact: It changes the order in which pages/inodes are synced to disk.
> > > > Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
> > > > until finished with the current inode.
> > >
> > > What about all the filesystems that implement their own
> > > .writepages()/write_cache_pages() functions or have
> > > have special code that checks WB_SYNC_ALL in .writepages (e.g. gfs2,
> > > ext4, btrfs and perhaps others). Don't they all need to be aware of
> > > this tagged_sync field?
> >
> > Right, good point. Currently only ext4 is updated. The other
> > filesystems --- afs, btrfs, cifs, gfs2 --- do not even use
> > PAGECACHE_TAG_TOWRITE for livelock prevention. My plan was to add
> > PAGECACHE_TAG_TOWRITE and tagged_sync code to them as the next step,
> > when tagged_sync is accepted and proved to work fine.
>
> Where "proved to work fine" can mean "caused regressions for certain
> filesystems"? I mean, for btrfs it means that the bio is submitted
> with WRITE rather than WRITE_SYNC, which causes subtle changes of
> behaviour in the elevator. that could cause strange regressions that
> are very hard to isolate.

Hmm, where is the relevant btrfs code? It seems that you assumed
WB_SYNC_ALL semantics in .tagged_sync, however the latter merely means
"tag all dirty pages with PAGECACHE_TAG_TOWRITE and write them out".

> Hence regardless of whether filesystems use PAGECACHE_TAG_TOWRITE
> or not, filesystems are checking for synchronous writeback for
> a reason. If we now have two different ways of signalling sync
> writeback they need to know about them.

See above, shall we rename .tagged_sync to .tagged_write?

> Which just raised the question in my mind - why did you add a new
> field rather than a new sync_mode definition? After all, this is a
> new sync control, and it seems clumsy to me to have two separate
> control fields for defining sync behaviour...

Yeah I considered that too. The main problem is, it somehow overloads
the sync mode enum and some filesystems already assumed two modes only
by using (sync_mode == WB_SYNC_ALL) and (sync_mode != WB_SYNC_NONE)
interchangeably.

For example, if adding another mode

WB_SYNC_NONE // WRITE, don't wait
+ WB_SYNC_NONE_TAGGED // WRITE, don't wait, use PAGECACHE_TAG_TOWRITE
WB_SYNC_ALL // WRITE_SYNC, wait, use PAGECACHE_TAG_TOWRITE

The btrfs code will unnecessarily wait on WB_SYNC_NONE_TAGGED:

if (wbc->sync_mode != WB_SYNC_NONE) {
if (PageWriteback(page))
flush_fn(data);
wait_on_page_writeback(page);
}

We can fix btrfs trivially, however if there are out of tree
filesystems, they'll break silently..

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
On Fri, May 13, 2011 at 06:55:25AM +0800, Dave Chinner wrote:
> On Thu, May 12, 2011 at 09:57:12PM +0800, Wu Fengguang wrote:
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> >
> > At each queue_io() time, first try enqueuing only newly expired inodes.
> > If there are zero expired inodes to work with, then relax the rule and
> > enqueue all dirty inodes.
> >
> > It at least makes sense from the data integrity point of view.
> >
> > This may also reduce the number of dirty pages encountered by page
> > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > dirty pages, which are more close to the end of the LRU lists. So
> > syncing older inodes first helps reducing the dirty pages reached by the
> > page reclaim code.
> >
> > More background: as Mel put it, "it makes sense to write old pages first
> > to reduce the chances page reclaim is initiating IO."
> >
> > Rik also presented the situation with a graph:
> >
> > LRU head [*] dirty page
> > [ * * * * * * * * * * *]
> >
> > Ideally, most dirty pages should lie close to the LRU tail instead of
> > LRU head. That requires the flusher thread to sync old/expired inodes
> > first (as there are obvious correlations between inode age and page
> > age), and to give fair opportunities to newly expired inodes rather
> > than sticking with some large eldest inodes (as larger inodes have
> > weaker correlations in the inode<=>page ages).
> >
> > This patch helps the flusher to meet both the above requirements.
> >
> > Side effects: it might reduce the batch size and hence reduce
> > inode_wb_list_lock hold time, but in turn make the cluster-by-partition
> > logic in the same function less effective on reducing disk seeks.
> >
> > v2: keep policy changes inside wb_writeback() and keep the
> > wbc.older_than_this visibility as suggested by Dave.
> >
> > CC: Dave Chinner <david@fromorbit.com>
> > Acked-by: Jan Kara <jack@suse.cz>
> > Acked-by: Rik van Riel<riel@redhat.com>
> > Acked-by: Mel Gorman <mel@csn.ul.ie>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > fs/fs-writeback.c | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > --- linux-next.orig/fs/fs-writeback.c 2011-05-05 23:30:25.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c 2011-05-05 23:30:26.000000000 +0800
> > @@ -718,7 +718,7 @@ static long wb_writeback(struct bdi_writ
> > if (work->for_background && !over_bground_thresh())
> > break;
> >
> > - if (work->for_kupdate) {
> > + if (work->for_kupdate || work->for_background) {
> > oldest_jif = jiffies -
> > msecs_to_jiffies(dirty_expire_interval * 10);
> > wbc.older_than_this = &oldest_jif;
> > @@ -729,6 +729,7 @@ static long wb_writeback(struct bdi_writ
> > wbc.pages_skipped = 0;
> > wbc.inodes_cleaned = 0;
> >
> > +retry:
> > trace_wbc_writeback_start(&wbc, wb->bdi);
> > if (work->sb)
> > __writeback_inodes_sb(work->sb, wb, &wbc);
> > @@ -752,6 +753,19 @@ static long wb_writeback(struct bdi_writ
> > if (wbc.inodes_cleaned)
> > continue;
> > /*
> > + * background writeback will start with expired inodes, and
> > + * if none is found, fallback to all inodes. This order helps
> > + * reduce the number of dirty pages reaching the end of LRU
> > + * lists and cause trouble to the page reclaim.
> > + */
> > + if (work->for_background &&
> > + wbc.older_than_this &&
> > + list_empty(&wb->b_io) &&
> > + list_empty(&wb->b_more_io)) {
> > + wbc.older_than_this = NULL;
> > + goto retry;
> > + }
> > + /*
> > * No more inodes for IO, bail
> > */
> > if (!wbc.more_io)
>
> I have to say that I dislike this implicit nested looping structure
> using a goto. It would seem better to me to make it explicit that we
> can do multiple writeback calls by using a do/while loop here and
> moving the logic of setting/resetting wbc.older_than_this to one
> place inside the nested loop...

This is the best I can do. Does it look better? Now the
.older_than_this modifying lines are close to each others smiling smiley

Thanks,
Fengguang
---

fs/fs-writeback.c | 58 +++++++++++++++++++++++---------------------
1 file changed, 31 insertions(+), 27 deletions(-)

--- linux-next.orig/fs/fs-writeback.c 2011-05-16 20:29:53.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-05-16 20:53:31.000000000 +0800
@@ -713,42 +713,22 @@ static long wb_writeback(struct bdi_writ
unsigned long oldest_jif;
struct inode *inode;
long progress;
+ bool bg_retry_all = false;

oldest_jif = jiffies;
work->older_than_this = &oldest_jif;

spin_lock(&wb->list_lock);
for (;winking smiley {
- /*
- * Stop writeback when nr_pages has been consumed
- */
- if (work->nr_pages <= 0)
- break;
-
- /*
- * Background writeout and kupdate-style writeback may
- * run forever. Stop them if there is other work to do
- * so that e.g. sync can proceed. They'll be restarted
- * after the other works are all done.
- */
- if ((work->for_background || work->for_kupdate) &&
- !list_empty(&wb->bdi->work_list))
- break;
-
- /*
- * For background writeout, stop when we are below the
- * background dirty threshold
- */
- if (work->for_background && !over_bground_thresh())
- break;
-
- if (work->for_kupdate || work->for_background) {
+ if (bg_retry_all) {
+ bg_retry_all = false;
+ work->older_than_this = NULL;
+ } else if (work->for_kupdate || work->for_background) {
oldest_jif = jiffies -
msecs_to_jiffies(dirty_expire_interval * 10);
work->older_than_this = &oldest_jif;
}

-retry:
trace_writeback_start(wb->bdi, work);
if (list_empty(&wb->b_io))
queue_io(wb, work->older_than_this);
@@ -778,14 +758,38 @@ retry:
work->older_than_this &&
list_empty(&wb->b_io) &&
list_empty(&wb->b_more_io)) {
- work->older_than_this = NULL;
- goto retry;
+ bg_retry_all = true;
+ continue;
}
/*
* No more inodes for IO, bail
*/
if (list_empty(&wb->b_more_io))
break;
+
+ /*
+ * Stop writeback when nr_pages has been consumed
+ */
+ if (work->nr_pages <= 0)
+ break;
+
+ /*
+ * Background writeout and kupdate-style writeback may
+ * run forever. Stop them if there is other work to do
+ * so that e.g. sync can proceed. They'll be restarted
+ * after the other works are all done.
+ */
+ if ((work->for_background || work->for_kupdate) &&
+ !list_empty(&wb->bdi->work_list))
+ break;
+
+ /*
+ * For background writeout, stop when we are below the
+ * background dirty threshold
+ */
+ if (work->for_background && !over_bground_thresh())
+ break;
+
/*
* Nothing written. Wait for some inode to
* become available for writeback. Otherwise
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
> shall we rename .tagged_sync to .tagged_write?

FYI, I finally renamed the flag to .tagged_writepages which most
precisely describes what it's doing.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at [vger.kernel.org]
Please read the FAQ at [www.tux.org]
Sorry, you do not have permission to post/reply in this forum.