Linux Kernel Page Reclaim -- ARM64 v5.0

All quotations in this article come from the reference articles listed at the end.

Questions

  1. What are the roles of PG_active and PG_referenced, and how do they differ?
  2. How are LRU plus the second-chance scheme implemented?

LRU Lists

The LRU lists are the core data structure of memory reclaim.

The pages organized on the LRU lists include: pages that can be written out to the swap partition, pages that map files, and process pages that are locked in memory and forbidden from being swapped out. Every page falling into these categories must be placed on an LRU list, without exception; the pages that are not on any LRU list are essentially just the page frames used by the kernel itself.

For convenience, the LRU_INACTIVE_ANON and LRU_ACTIVE_ANON lists are collectively called the anonymous-page LRU lists, while LRU_INACTIVE_FILE and LRU_ACTIVE_FILE are collectively called the file-page LRU lists. When a running process locks some pages in memory by calling mlock(), those pages are added to the LRU_UNEVICTABLE list of the zone they belong to; pages on the LRU_UNEVICTABLE list can be either file pages or anonymous pages.

Whenever an LRU list needs to be modified, the zone's lru_lock must be held. On multi-core hardware, when several CPUs want to modify the LRU lists at the same time, contention on this lock becomes very frequent, so the kernel provides an LRU cache mechanism to reduce the contention. The mechanism is actually very simple: an LRU cache gathers pages that need the same treatment and processes them in one batch once a certain number has accumulated, so that the demand for the lock is concentrated at that single batch-processing point.

/* LRU cache
 * PAGEVEC_SIZE defaults to 14
 */
struct pagevec {
	/* number of pages currently in the cache */
	unsigned long nr;
	unsigned long cold;
	/* array of pointers, each entry pointing to a page descriptor; default size is 14 */
	struct page *pages[PAGEVEC_SIZE];
};

Note that the kernel provides these four kinds of LRU caches per CPU, not per zone, and not four caches per LRU list. In other words, every new page that should go onto some LRU list is first added to the current CPU's lru_add_pvec cache. For example, if there are two new pages, one destined for zone0's active anonymous LRU list and the other for zone1's inactive file LRU list, both are first added to this CPU's lru_add_pvec cache.

/* This LRU cache holds newly added pages that do not yet belong to any LRU list */
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
/* Pages in lru_rotate_pvecs are inactive pages already on an inactive LRU list;
 * they will be moved to the tail of that inactive LRU list */
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
/* Pages in this LRU cache originally belonged to an active LRU list; PG_active and
 * PG_referenced are forcibly cleared and the pages are added to the head of the
 * inactive LRU list. These pages are normally taken from the tail of the active LRU list.
 */
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
#ifdef CONFIG_SMP
/* Pages in this LRU cache are moved to the head of the active LRU list;
 * they originally belonged to an inactive LRU list */
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
#endif

The other kinds of LRU caches are handled in the same way as the lru_add_pvec cache shown above. The rule to remember is: whatever operation you want performed on a page, you put the page into the corresponding per-CPU LRU cache; when that cache fills up, or when its pages must be flushed immediately, the pages are moved onto the LRU lists they belong to. The LRU caches themselves are only responsible for placing pages onto the right LRU lists, so a page's relevant flags must already be set before it is added to an LRU cache.
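
As a concrete illustration of the batching, this is roughly how a new page enters the per-CPU lru_add_pvec in the v5.0 sources (abridged sketch of __lru_cache_add() in mm/swap.c): pagevec_add() returns the number of free slots left, so the cache is drained to the real LRU lists via __pagevec_lru_add() once it becomes full (or when the page is a compound page).

static void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	get_page(page);
	/* drain the per-CPU cache to the LRU lists when it fills up */
	if (!pagevec_add(pvec, page) || PageCompound(page))
		__pagevec_lru_add(pvec);
	put_cpu_var(lru_add_pvec);
}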

As the comments above say, which LRU list a page should be added to is decided mainly by three flags:
PG_active: when set, the page should be added to, or already is on, the active LRU list of its zone. For a page that is already on an LRU list, this flag tells the system whether it is on the active or the inactive list.
PG_swapbacked: when set, the page can be written back to the swap partition, so it should be added to, or already is on, the anonymous-page LRU list of its zone.
PG_unevictable: when set, the page is locked in memory and forbidden from being swapped out, so it should be added to, or already is on, the LRU_UNEVICTABLE list of its zone.
For the file-page LRU lists there is in practice one more flag, PG_referenced; the mapping from flags to lists is sketched in the code below.
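
The mapping from these flags to an LRU list is captured by page_lru() in include/linux/mm_inline.h, shown here roughly as in v5.0:

static inline enum lru_list page_lru_base_type(struct page *page)
{
	/* !PageSwapBacked(page) means the page is a file page */
	if (page_is_file_cache(page))
		return LRU_INACTIVE_FILE;
	return LRU_INACTIVE_ANON;
}

static __always_inline enum lru_list page_lru(struct page *page)
{
	enum lru_list lru;

	if (PageUnevictable(page))
		lru = LRU_UNEVICTABLE;
	else {
		lru = page_lru_base_type(page);
		if (PageActive(page))
			lru += LRU_ACTIVE;
	}
	return lru;
}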

Memory Reclaim Flow

Different memory-allocation paths trigger different kinds of memory reclaim. Reclaim has two kinds of targets: a zone, or a single memcg; here we only discuss zone-targeted reclaim. I divide zone-targeted reclaim into three kinds: fast reclaim, direct reclaim, and kswapd reclaim.

Fast reclaim: happens inside get_page_from_freelist(). While walking the zonelist, each zone is checked before allocation: if the zone's free page count after the prospective allocation would be < watermark + reserved page frames, fast reclaim is performed on that zone, and it is performed even when the zone's free pages are already below the watermark before the allocation. Note that the watermark can be any of min/low/high, because get_page_from_freelist() is called from the fast allocation path, the slow allocation path and the OOM path (whenever enough page frames have been reclaimed), so fast reclaim happens not only during fast allocation but also during slow allocation. Reclaim targets: the zone's clean file pages, slab, and possibly anonymous pages written back to swap.
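
For reference, the fast-reclaim trigger inside get_page_from_freelist() looks roughly like this in the v5.0 sources (heavily abridged sketch; details such as ALLOC_NO_WATERMARKS handling are omitted):

	/* inside the zonelist walk of get_page_from_freelist() */
	mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	if (!zone_watermark_fast(zone, order, mark,
				 ac_classzone_idx(ac), alloc_flags)) {
		int ret;

		if (node_reclaim_mode == 0 ||
		    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
			continue;

		/* fast reclaim: try to make this zone usable before moving on */
		ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
		switch (ret) {
		case NODE_RECLAIM_NOSCAN:	/* did not scan */
		case NODE_RECLAIM_FULL:		/* scanned but unreclaimable */
			continue;
		default:
			/* did we reclaim enough? */
			if (zone_watermark_ok(zone, order, mark,
					      ac_classzone_idx(ac), alloc_flags))
				goto try_this_zone;
			continue;
		}
	}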

Direct reclaim: happens in the slow allocation path, and is used in only one situation: when the slow path cannot allocate page frames from any zone in the zonelist at the min watermark, and still cannot allocate even after asynchronous compaction, a round of direct reclaim is performed on all zones in the zonelist. Note that direct reclaim targets all zones in the zonelist; unlike fast reclaim and kswapd reclaim, it is not restricted to zones whose free page counts fall short. During direct reclaim the flusher kernel threads may also be woken. Reclaim targets: memory of processes exceeding their memcg's soft_limit_in_bytes, the zone's clean file pages, slab, and anonymous pages swapped out.

kswapd reclaim: happens in the kswapd kernel thread. Each node has one kswapd kernel thread, so this reclaim only targets the thread's own node, and only zones whose free page count after allocating order pages would be < the zone's high watermark + reserved page frames are reclaimed; it does not reclaim from every zone of the node.

Although these three kinds of reclaim are triggered in different situations, when memory is tight, kswapd reclaim and direct reclaim are very likely running concurrently. In fact, however different they look, the code that actually performs the reclaim is the same; only the preprocessing and checks done before reclaim differ. Reclaim targets: memory of processes exceeding their memcg's soft_limit_in_bytes, the zone's clean file pages, the zone's dirty file pages, slab, and anonymous pages swapped out.
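
Putting the three paths together, they all converge on shrink_node() (call chains as discussed in this article; intermediate helpers abbreviated):

fast reclaim:   get_page_from_freelist() -> node_reclaim() -> __node_reclaim() -> shrink_node()
direct reclaim: __alloc_pages_slowpath() -> ... -> try_to_free_pages() -> do_try_to_free_pages() -> shrink_zones() -> shrink_node()
kswapd reclaim: kswapd() -> balance_pgdat() -> kswapd_shrink_node() -> shrink_node()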

shrink_node

shrink_node is the point where the three reclaim paths converge; it goes on to call shrink_node_memcg.

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)

shrink_node_memcg first calls get_scan_count to determine how many pages need to be scanned on each LRU list, then calls shrink_list on each LRU list.
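
The overall shape of shrink_node_memcg() is roughly the following in v5.0 (heavily abridged sketch; the proportional re-balancing of the nr[] targets between file and anon lists is omitted):

static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
			      struct scan_control *sc, unsigned long *lru_pages)
{
	unsigned long nr[NR_LRU_LISTS];
	unsigned long nr_to_scan;
	unsigned long nr_reclaimed = 0;
	enum lru_list lru;
	...
	/* decide how many pages to scan on each LRU list */
	get_scan_count(lruvec, memcg, sc, nr, lru_pages);
	...
	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(lru) {
			if (nr[lru]) {
				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
				nr[lru] -= nr_to_scan;

				nr_reclaimed += shrink_list(lru, nr_to_scan,
							    lruvec, memcg, sc);
			}
		}
		...
	}
	...
	/* keep the anon active/inactive ratio balanced even if no anon was scanned */
	if (inactive_list_is_low(lruvec, false, memcg, sc, true))
		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
				   sc, LRU_ACTIVE_ANON);
}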

shrink_list then dispatches on the LRU list type. For an active list, shrink_active_list is called only when inactive_list_is_low reports that the corresponding inactive list is too short.

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
				 struct lruvec *lruvec, struct mem_cgroup *memcg,
				 struct scan_control *sc)
{
	if (is_active_lru(lru)) {
		if (inactive_list_is_low(lruvec, is_file_lru(lru),
					 memcg, sc, true))
			shrink_active_list(nr_to_scan, lruvec, sc, lru);
		return 0;
	}

	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}

Before pages can be moved between the ACTIVE and INACTIVE lists, the pages to be processed must first be isolated; this is done with isolate_lru_pages.
The number of pages isolate_lru_pages actually isolates is less than or equal to the number scanned; this depends on isolate_mode and other factors (a sketch of __isolate_lru_page follows the listing below).


/*
 * zone_lru_lock is heavily contended.  Some of the functions that
 * shrink the lists perform better by taking out a batch of pages
 * and working on them outside the LRU lock.
 *
 * For pagecache intensive workloads, this function is the hottest
 * spot in the kernel (apart from copy_*_user functions).
 *
 * Appropriate locks must be held before calling this function.
 *
 * @nr_to_scan:	The number of eligible pages to look through on the list.
 * @lruvec:	The LRU vector to pull pages from.
 * @dst:	The temp list to put pages on to.
 * @nr_scanned:	The number of pages that were scanned.
 * @sc:		The scan_control struct for this reclaim session
 * @mode:	One of the LRU isolation modes
 * @lru:	LRU list id for isolating
 *
 * returns how many pages were moved onto *@dst.
 */
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
		struct lruvec *lruvec, struct list_head *dst,
		unsigned long *nr_scanned, struct scan_control *sc,
		isolate_mode_t mode, enum lru_list lru)
{
	struct list_head *src = &lruvec->lists[lru];
	unsigned long nr_taken = 0;
	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
	unsigned long skipped = 0;
	unsigned long scan, total_scan, nr_pages;
	LIST_HEAD(pages_skipped);

	scan = 0;
	for (total_scan = 0;
	     scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src);
	     total_scan++) {
		struct page *page;

		page = lru_to_page(src);
		prefetchw_prev_lru_page(page, src, flags);

		VM_BUG_ON_PAGE(!PageLRU(page), page);

		if (page_zonenum(page) > sc->reclaim_idx) {
			list_move(&page->lru, &pages_skipped);
			nr_skipped[page_zonenum(page)]++;
			continue;
		}

		/*
		 * Do not count skipped pages because that makes the function
		 * return with no isolated pages if the LRU mostly contains
		 * ineligible pages. This causes the VM to not reclaim any
		 * pages, triggering a premature OOM.
		 */
		scan++;
		switch (__isolate_lru_page(page, mode)) {
		case 0:
			nr_pages = hpage_nr_pages(page);
			nr_taken += nr_pages;
			nr_zone_taken[page_zonenum(page)] += nr_pages;
			list_move(&page->lru, dst);
			break;

		case -EBUSY:
			/* else it is being freed elsewhere */
			list_move(&page->lru, src);
			continue;

		default:
			BUG();
		}
	}

	/*
	 * Splice any skipped pages to the start of the LRU list. Note that
	 * this disrupts the LRU order when reclaiming for lower zones but
	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
	 * scanning would soon rescan the same pages to skip and put the
	 * system at risk of premature OOM.
	 */
	if (!list_empty(&pages_skipped)) {
		int zid;

		list_splice(&pages_skipped, src);
		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
			if (!nr_skipped[zid])
				continue;

			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
			skipped += nr_skipped[zid];
		}
	}
	*nr_scanned = total_scan;
	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
				    total_scan, skipped, nr_taken, mode, lru);
	update_lru_sizes(lruvec, lru, nr_zone_taken);
	return nr_taken;
}
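
For reference, __isolate_lru_page() decides per page whether it may be isolated under the given isolate_mode; roughly as in v5.0, abridged:

int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
	int ret = -EINVAL;

	/* Only take pages on the LRU. */
	if (!PageLRU(page))
		return ret;

	/* Compaction should not handle unevictable pages but CMA can do so */
	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
		return ret;

	ret = -EBUSY;

	if (mode & ISOLATE_ASYNC_MIGRATE) {
		/* All the caller can do on PageWriteback is block */
		if (PageWriteback(page))
			return ret;

		if (PageDirty(page)) {
			/* ... only dirty pages that can migrate without blocking pass ... */
		}
	}

	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
		return ret;

	if (likely(get_page_unless_zero(page))) {
		/*
		 * Be careful not to clear PageLRU until after we're
		 * sure the page is not being freed elsewhere -- the
		 * page release code relies on it.
		 */
		ClearPageLRU(page);
		ret = 0;
	}

	return ret;
}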

First, an overview figure:



shrink_active_list first isolates the pages to be processed from the lruvec and puts them on l_hold:

static void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)
{
	unsigned long nr_taken;
	unsigned long nr_scanned;
	unsigned long vm_flags;
	LIST_HEAD(l_hold);	/* The pages which were snipped off */
	LIST_HEAD(l_active);
	LIST_HEAD(l_inactive);
	struct page *page;
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
	unsigned nr_deactivate, nr_activate;
	unsigned nr_rotated = 0;
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;

	spin_lock_irq(&pgdat->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
				     &nr_scanned, sc, isolate_mode, lru);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	reclaim_stat->recent_scanned[file] += nr_taken;

	__count_vm_events(PGREFILL, nr_scanned);
	count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

	spin_unlock_irq(&pgdat->lru_lock);

Then the isolated pages are processed. There are three paths:

  1. If the page is unevictable, putback_lru_page puts it back on the correct LRU list.
  2. If the page has not been referenced recently, it is added to the head of l_inactive.
  3. If the page has been referenced recently and is a file page marked VM_EXEC (special treatment for executable pages), it is added back to the head of l_active; otherwise it is added to the head of l_inactive.

Paths 2 and 3 can be summed up in one sentence: apart from recently referenced executable file pages, which go back to the head of l_active, all other isolated pages are added to the head of l_inactive.

A note on what "referenced" means here: whether a page has been accessed is detected by page_referenced_one(), which tests and clears the young bit in the page-table entries mapping the page (on ARM64 this is actually the access flag PTE_AF), a bit set by hardware when the page is accessed. This is a different flag from the PG_referenced page flag discussed further below.
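
The young/access bit handling is architecture specific: arm64 provides its own atomic ptep_test_and_clear_young(), and pte_young() on arm64 tests PTE_AF, but the generic fallback in include/asm-generic/pgtable.h shows the idea:

#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte = *ptep;
	int r = 1;
	if (!pte_young(pte))
		r = 0;
	else
		/* clear the young/access bit so a later scan can tell
		 * whether the page has been touched again */
		set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
	return r;
}
#endif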

	while (!list_empty(&l_hold)) {
		cond_resched();
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (unlikely(!page_evictable(page))) {
			putback_lru_page(page);
			continue;
		}

		if (unlikely(buffer_heads_over_limit)) {
			if (page_has_private(page) && trylock_page(page)) {
				if (page_has_private(page))
					try_to_release_page(page, 0);
				unlock_page(page);
			}
		}

		if (page_referenced(page, 0, sc->target_mem_cgroup,
				    &vm_flags)) {
			nr_rotated += hpage_nr_pages(page);
			/*
			 * Identify referenced, file-backed active pages and
			 * give them one more trip around the active list. So
			 * that executable code get better chances to stay in
			 * memory under moderate memory pressure.  Anon pages
			 * are not likely to be evicted by use-once streaming
			 * IO, plus JVM can create lots of anon VM_EXEC pages,
			 * so we ignore them here.
			 */
			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
				list_add(&page->lru, &l_active);
				continue;
			}
		}

		ClearPageActive(page);	/* we are de-activating */
		SetPageWorkingset(page);
		list_add(&page->lru, &l_inactive);
	}

Finally, the pages are formally added to their LRU lists, and pages that are no longer referenced are freed directly.

	/*
	 * Move pages back to the lru list.
	 */
	spin_lock_irq(&pgdat->lru_lock);
	/*
	 * Count referenced pages from currently used mappings as rotated,
	 * even though only some of them are actually re-activated.  This
	 * helps balance scan pressure between file and anonymous pages in
	 * get_scan_count.
	 */
	reclaim_stat->recent_rotated[file] += nr_rotated;

	nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
	nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
	spin_unlock_irq(&pgdat->lru_lock);

	mem_cgroup_uncharge_list(&l_hold);
	free_unref_page_list(&l_hold);
	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
			nr_deactivate, nr_rotated, sc->priority, file);

The implementation of shrink_inactive_list is similar: it first scans and isolates the pages to be processed onto page_list, then calls shrink_page_list to process them.

/*
 * shrink_inactive_list() is a helper for shrink_node().  It returns the number
 * of reclaimed pages
 */
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
		     struct scan_control *sc, enum lru_list lru)
{
	LIST_HEAD(page_list);
	unsigned long nr_scanned;
	unsigned long nr_reclaimed = 0;
	unsigned long nr_taken;
	struct reclaim_stat stat = {};
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
	bool stalled = false;

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		msleep(100);
		stalled = true;

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;

	spin_lock_irq(&pgdat->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
				     &nr_scanned, sc, isolate_mode, lru);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	reclaim_stat->recent_scanned[file] += nr_taken;

	if (current_is_kswapd()) {
		if (global_reclaim(sc))
			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
		count_memcg_events(lruvec_memcg(lruvec), PGSCAN_KSWAPD,
				   nr_scanned);
	} else {
		if (global_reclaim(sc))
			__count_vm_events(PGSCAN_DIRECT, nr_scanned);
		count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT,
				   nr_scanned);
	}
	spin_unlock_irq(&pgdat->lru_lock);

	if (nr_taken == 0)
		return 0;

	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
					&stat, false);

	spin_lock_irq(&pgdat->lru_lock);

	if (current_is_kswapd()) {
		if (global_reclaim(sc))
			__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
		count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_KSWAPD,
				   nr_reclaimed);
	} else {
		if (global_reclaim(sc))
			__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
		count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_DIRECT,
				   nr_reclaimed);
	}

	putback_inactive_pages(lruvec, &page_list);

	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

	spin_unlock_irq(&pgdat->lru_lock);

	mem_cgroup_uncharge_list(&page_list);
	free_unref_page_list(&page_list);

	/*
	 * If dirty pages are scanned that are not queued for IO, it
	 * implies that flushers are not doing their job. This can
	 * happen when memory pressure pushes dirty pages to the end of
	 * the LRU before the dirty limits are breached and the dirty
	 * data has expired. It can also happen when the proportion of
	 * dirty pages grows not through writes but through memory
	 * pressure reclaiming all the clean cache. And in some cases,
	 * the flushers simply cannot keep up with the allocation
	 * rate. Nudge the flusher threads in case they are asleep.
	 */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);

	sc->nr.dirty += stat.nr_dirty;
	sc->nr.congested += stat.nr_congested;
	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
	sc->nr.writeback += stat.nr_writeback;
	sc->nr.immediate += stat.nr_immediate;
	sc->nr.taken += nr_taken;
	if (file)
		sc->nr.file_taken += nr_taken;

	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
	return nr_reclaimed;
}

shrink_page_list handles many different cases; the annotated summary below is taken directly from the reference article.


/* 1. Is the page currently under writeback? (Two cases lead to writeback: a previous
 *    reclaim pass triggered writeback of this page, or the page was dirty and the system
 *    wrote it back on its own.) Synchronous and asynchronous reclaim handle this differently.
 * 2. If this reclaim pass is not forced, first check whether the page may be reclaimed:
 *    - Anonymous page: if it was recently accessed by a process, move it to the head of the
 *      active LRU list; otherwise reclaim it.
 *    - File page mapping an executable: if it was recently accessed, move it to the active
 *      LRU list; otherwise reclaim it.
 *    - Other file pages: if recently accessed by several processes, move to the active LRU
 *      list; if accessed by only one process but PG_referenced is set, also move to the
 *      active LRU list; otherwise reclaim.
 * 3. If the page is anonymous but not yet in the swap cache, try to add it to the swap cache
 *    and mark it dirty; the swap cache acts as a buffer for swap and is an address_space.
 * 4. Unmap the page from the page tables of every process that maps it.
 * 5. If the page is dirty, write it back, either synchronously (return only after the
 *    writeback completes) or asynchronously (queue it in the block layer's write queue, set
 *    PG_writeback to mark it as under writeback and return); such a page is put back at the
 *    head of the inactive LRU list.
 * 6. Check PG_writeback. If it is clear, the writeback has finished (two cases: synchronous
 *    reclaim, or an earlier asynchronous writeback has completed), so remove the page from
 *    the radix tree of its address_space and add it to the free_pages list. If PG_writeback
 *    is set, put the page back on page_list; those pages are later returned to the tail of
 *    the inactive LRU list and will be freed on a later reclaim pass once writeback is done.
 * 7. Free the pages on the free_pages list.
 * shrink_page_list() returns the number of reclaimed pages
 */
static unsigned long shrink_page_list(struct list_head *page_list,
				      struct pglist_data *pgdat,
				      struct scan_control *sc,
				      enum ttu_flags ttu_flags,
				      struct reclaim_stat *stat,
				      bool force_reclaim)

This involves the notion of a second chance. A file page picks up one reference when it is first instantiated, so to keep large numbers of file pages from piling up on the ACTIVE list, the PG_referenced flag is used as a counter that lets the page take one more lap around the INACTIVE list before it can be activated.

The same flag is also set for ACTIVE file pages, so that a file page that has just been dropped from the ACTIVE list but is still in use can be promoted back quickly.

static enum page_references page_check_references(struct page *page,
						  struct scan_control *sc)
{
	int referenced_ptes, referenced_page;
	unsigned long vm_flags;

	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
					  &vm_flags);
	referenced_page = TestClearPageReferenced(page);

	/*
	 * Mlock lost the isolation race with us.  Let try_to_unmap()
	 * move the page to the unevictable list.
	 */
	if (vm_flags & VM_LOCKED)
		return PAGEREF_RECLAIM;

	if (referenced_ptes) {
		if (PageSwapBacked(page))
			return PAGEREF_ACTIVATE;
		/*
		 * All mapped pages start out with page table
		 * references from the instantiating fault, so we need
		 * to look twice if a mapped file page is used more
		 * than once.
		 *
		 * Mark it and spare it for another trip around the
		 * inactive list.  Another page table reference will
		 * lead to its activation.
		 *
		 * Note: the mark is set for activated pages as well
		 * so that recently deactivated but used pages are
		 * quickly recovered.
		 */
		SetPageReferenced(page);

		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;

		/*
		 * Activate file-backed executable pages after first usage.
		 */
		if (vm_flags & VM_EXEC)
			return PAGEREF_ACTIVATE;

		return PAGEREF_KEEP;
	}

	/* Reclaim if clean, defer dirty pages to writeback */
	if (referenced_page && !PageSwapBacked(page))
		return PAGEREF_RECLAIM_CLEAN;

	return PAGEREF_RECLAIM;
}
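
To make the second chance concrete, here is how a regular (non-executable, non-mlocked) file page moves through page_check_references() above:

Scan 1: faulted in once          -> referenced_ptes == 1, PG_referenced clear
                                    => SetPageReferenced(), PAGEREF_KEEP (one more lap on the inactive list)
Scan 2: not touched since scan 1 -> referenced_ptes == 0, PG_referenced was set
                                    => PAGEREF_RECLAIM_CLEAN (reclaim if clean)
Scan 2: touched again            -> referenced_ptes >= 1, PG_referenced was set
                                    => PAGEREF_ACTIVATE (promote to the active list)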

Fast Memory Reclaim

The entry point for fast reclaim is node_reclaim; after a few checks it hands off to __node_reclaim.

int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
	int ret;

	/*
	 * Node reclaim reclaims unmapped file backed pages and
	 * slab pages if we are over the defined limits.
	 *
	 * A small portion of unmapped file backed pages is needed for
	 * file I/O otherwise pages read by file I/O will be immediately
	 * thrown out if the node is overallocated. So we do not reclaim
	 * if less than a specified percentage of the node is used by
	 * unmapped file backed pages.
	 */
	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
		return NODE_RECLAIM_FULL;

	/*
	 * Do not scan if the allocation should not be delayed.
	 */
	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
		return NODE_RECLAIM_NOSCAN;

	/*
	 * Only run node reclaim on the local node or on nodes that do not
	 * have associated processors. This will favor the local processor
	 * over remote processors and spread off node memory allocations
	 * as wide as possible.
	 */
	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
		return NODE_RECLAIM_NOSCAN;

	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
		return NODE_RECLAIM_NOSCAN;

	ret = __node_reclaim(pgdat, gfp_mask, order);
	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);

	if (!ret)
		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);

	return ret;
}

node_pagecache_reclaimable computes how many page-cache pages of the node can be reclaimed under the current node_reclaim_mode:

  • If RECLAIM_UNMAP is set
    • reclaimable pages += total number of file pages
  • Otherwise
    • reclaimable pages += number of unmapped file pages
  • If RECLAIM_WRITE is not set (RECLAIM_WRITE means dirty pages may be written back to storage)
    • reclaimable pages -= number of dirty pages

/* Work out how many page cache pages we can reclaim in this reclaim_mode */
static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
{
	unsigned long nr_pagecache_reclaimable;
	unsigned long delta = 0;

	/*
	 * If RECLAIM_UNMAP is set, then all file pages are considered
	 * potentially reclaimable. Otherwise, we have to worry about
	 * pages like swapcache and node_unmapped_file_pages() provides
	 * a better estimate
	 */
	if (node_reclaim_mode & RECLAIM_UNMAP)
		nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
	else
		nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);

	/* If we can't clean pages, remove dirty pages from consideration */
	if (!(node_reclaim_mode & RECLAIM_WRITE))
		delta += node_page_state(pgdat, NR_FILE_DIRTY);

	/* Watch for any possible underflows due to delta */
	if (unlikely(delta > nr_pagecache_reclaimable))
		delta = nr_pagecache_reclaimable;

	return nr_pagecache_reclaimable - delta;
}

__node_reclaim sets up a scan_control and then calls shrink_node in a loop, trying to reclaim at least nr_pages pages.

/*
 * Try to free up some pages from this node through reclaim.
 */
static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
{
	/* Minimum pages needed in order to stay on node */
	const unsigned long nr_pages = 1 << order;
	struct task_struct *p = current;
	struct reclaim_state reclaim_state;
	unsigned int noreclaim_flag;
	struct scan_control sc = {
		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
		.gfp_mask = current_gfp_context(gfp_mask),
		.order = order,
		.priority = NODE_RECLAIM_PRIORITY,
		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
		.may_swap = 1,
		.reclaim_idx = gfp_zone(gfp_mask),
	};

	cond_resched();
	fs_reclaim_acquire(sc.gfp_mask);
	/*
	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
	 * and we also need to be able to write out pages for RECLAIM_WRITE
	 * and RECLAIM_UNMAP.
	 */
	noreclaim_flag = memalloc_noreclaim_save();
	p->flags |= PF_SWAPWRITE;
	reclaim_state.reclaimed_slab = 0;
	p->reclaim_state = &reclaim_state;

	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
		/*
		 * Free memory by calling shrink node with increasing
		 * priorities until we have enough memory freed.
		 */
		do {
			shrink_node(pgdat, &sc);
		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
	}

	p->reclaim_state = NULL;
	current->flags &= ~PF_SWAPWRITE;
	memalloc_noreclaim_restore(noreclaim_flag);
	fs_reclaim_release(sc.gfp_mask);
	return sc.nr_reclaimed >= nr_pages;
}
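
All three reclaim paths describe what they want through struct scan_control. The fields used above look roughly as follows, abridged from mm/vmscan.c (field order and comments approximate):

struct scan_control {
	/* How many pages shrink_list() should reclaim */
	unsigned long nr_to_reclaim;

	/* This context's GFP mask */
	gfp_t gfp_mask;

	/* Allocation order */
	int order;

	/* Nodemask of nodes allowed by the caller; NULL means all nodes */
	nodemask_t *nodemask;

	/* The memcg that is the primary target of this reclaim, if any */
	struct mem_cgroup *target_mem_cgroup;

	/* Scan (total_size >> priority) pages at once */
	int priority;

	/* The highest zone to isolate pages for reclaim from */
	enum zone_type reclaim_idx;

	unsigned int may_writepage:1;	/* may write dirty pages out */
	unsigned int may_unmap:1;	/* may unmap mapped pages */
	unsigned int may_swap:1;	/* may swap anonymous pages */

	/* Incremented as reclaim progresses */
	unsigned long nr_scanned;
	unsigned long nr_reclaimed;
	...
};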

Direct Memory Reclaim

The entry point is do_try_to_free_pages, which ends up calling shrink_node on the zone->zone_pgdat of each zone in the zonelist.

/*
* This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
* are "out of memory" and something needs to be killed.
*
* If the caller is !__GFP_FS then the probability of a failure is reasonably
* high - the zone may be full of dirty or under-writeback pages, which this
* caller can't do much about. We kick the writeback threads and take explicit
* naps in the hope that some of these pages can be written. But if the
* allocating task holds filesystem locks which prevent writeout this might not
* work, and the allocation attempt will fail.
*
* returns: 0, if no pages reclaimed
* else, the number of pages reclaimed
*/
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
					  struct scan_control *sc)

do_try_to_free_pages loops with decreasing sc->priority, calling shrink_zones on each iteration; shrink_zones walks the zonelist and invokes shrink_node on each node:
/*
 * This is the direct reclaim path, for page-allocating processes.  We only
 * try to reclaim pages from zones which will satisfy the caller's allocation
 * request.
 *
 * If a zone is deemed to be full of pinned pages then just give it a light
 * scan then give up on it.
 */
static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
	struct zoneref *z;
	struct zone *zone;
	unsigned long nr_soft_reclaimed;
	unsigned long nr_soft_scanned;
	gfp_t orig_mask;
	pg_data_t *last_pgdat = NULL;

	/*
	 * If the number of buffer_heads in the machine exceeds the maximum
	 * allowed level, force direct reclaim to scan the highmem zone as
	 * highmem pages could be pinning lowmem pages storing buffer_heads
	 */
	orig_mask = sc->gfp_mask;
	if (buffer_heads_over_limit) {
		sc->gfp_mask |= __GFP_HIGHMEM;
		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
	}

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					sc->reclaim_idx, sc->nodemask) {
		/*
		 * Take care memory controller reclaiming has small influence
		 * to global LRU.
		 */
		if (global_reclaim(sc)) {
			if (!cpuset_zone_allowed(zone,
						 GFP_KERNEL | __GFP_HARDWALL))
				continue;

			/*
			 * If we already have plenty of memory free for
			 * compaction in this zone, don't free any more.
			 * Even though compaction is invoked for any
			 * non-zero order, only frequent costly order
			 * reclamation is disruptive enough to become a
			 * noticeable problem, like transparent huge
			 * page allocations.
			 */
			if (IS_ENABLED(CONFIG_COMPACTION) &&
			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
			    compaction_ready(zone, sc)) {
				sc->compaction_ready = true;
				continue;
			}

			/*
			 * Shrink each node in the zonelist once. If the
			 * zonelist is ordered by zone (not the default) then a
			 * node may be shrunk multiple times but in that case
			 * the user prefers lower zones being preserved.
			 */
			if (zone->zone_pgdat == last_pgdat)
				continue;

			/*
			 * This steals pages from memory cgroups over softlimit
			 * and returns the number of reclaimed pages and
			 * scanned pages. This works for global memory pressure
			 * and balancing, not for a memcg's limit.
			 */
			nr_soft_scanned = 0;
			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
						sc->order, sc->gfp_mask,
						&nr_soft_scanned);
			sc->nr_reclaimed += nr_soft_reclaimed;
			sc->nr_scanned += nr_soft_scanned;
			/* need some check for avoid more shrink_zone() */
		}

		/* See comment about same check for global reclaim above */
		if (zone->zone_pgdat == last_pgdat)
			continue;
		last_pgdat = zone->zone_pgdat;
		shrink_node(zone->zone_pgdat, sc);
	}

	/*
	 * Restore to original mask to avoid the impact on the caller if we
	 * promoted it to __GFP_HIGHMEM.
	 */
	sc->gfp_mask = orig_mask;
}

Asynchronous Memory Reclaim (kswapd)

Each node has a memory-reclaim kernel thread, kswapd, started from the init process.
It is woken up when the relevant conditions are met and performs memory reclaim asynchronously.
It reaches shrink_node via the path balance_pgdat -> kswapd_shrink_node -> shrink_node.

/*
* The background pageout daemon, started as a kernel thread
* from the init process.
*
* This basically trickles out pages so that we have _some_
* free memory available even if there is no other activity
* that frees anything up. This is needed for things like routing
* etc, where we otherwise might have all activity going on in
* asynchronous contexts that cannot page things out.
*
* If there are applications that are active memory-allocators
* (most normal use), this basically shouldn't matter.
*/
static int kswapd(void *p)
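
kswapd_shrink_node() is a thin wrapper that sets nr_to_reclaim from the high watermarks of the zones up to reclaim_idx and then calls shrink_node(); roughly as in v5.0:

static bool kswapd_shrink_node(pg_data_t *pgdat,
			       struct scan_control *sc)
{
	struct zone *zone;
	int z;

	/* Reclaim a number of pages proportional to the number of zones */
	sc->nr_to_reclaim = 0;
	for (z = 0; z <= sc->reclaim_idx; z++) {
		zone = pgdat->node_zones + z;
		if (!managed_zone(zone))
			continue;

		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
	}

	/* pressure is applied on the whole node LRU, not per zone */
	shrink_node(pgdat, sc);

	/*
	 * If enough was reclaimed for a costly order, fall back to checking
	 * watermarks at order-0 to avoid excessive reclaim.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

	return sc->nr_scanned >= sc->nr_to_reclaim;
}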

References

https://www.cnblogs.com/tolimit/p/5447448.html
https://www.cnblogs.com/tolimit/p/5435068.html
