The Linux Kernel KSM Mechanism -- ARM64 v5.0

(The most painful episode of source reading yet)

Question

  • What is the overall flow of the KSM mechanism?
  • How should the data structures used by KSM be understood?

KSM Overview

KSM stands for Kernel SamePage Merging, a mechanism for merging pages with identical content. KSM was introduced to reduce the redundant pages produced by virtualization: in practice a single host often runs many instances of the same operating system and applications, so many memory pages end up with identical content. Such pages can be merged, freeing memory for other applications.

KSM can merge anonymous pages with identical content within one process or across processes, transparently to the applications. The identical pages are merged into a single read-only page, releasing the redundant physical pages; when an application later writes to such a page, copy-on-write (COW) kicks in.

The KSM flow itself is not complicated, but the heavily reused data structures, the undocumented terminology, and the functions that mix several responsibilities make the source hard to analyze. For the source-analysis part, I recommend reading the overall flow first, and whenever a helper function is called, jumping ahead to see what that helper does.

Data Structures

The KSM subsystem maintains one stable tree and one unstable tree.

The node element of the unstable tree is struct rmap_item. It reverse-maps the page behind a regular tree node to a virtual address.

/**
 * struct rmap_item - reverse mapping item for virtual addresses
 * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list
 * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree
 * @nid: NUMA node id of unstable tree in which linked (may not match page)
 * @mm: the memory structure this rmap_item is pointing into
 * @address: the virtual address this rmap_item tracks (+ flags in low bits)
 * @oldchecksum: previous checksum of the page at that virtual address
 * @node: rb node of this rmap_item in the unstable tree
 * @head: pointer to stable_node heading this list in the stable tree
 * @hlist: link into hlist of rmap_items hanging off that stable_node
 */
struct rmap_item {
    struct rmap_item *rmap_list;
    union {
        struct anon_vma *anon_vma;  /* when stable */
#ifdef CONFIG_NUMA
        int nid;                    /* when node of unstable tree */
#endif
    };
    struct mm_struct *mm;
    unsigned long address;          /* + low bits used for flags below */
    unsigned int oldchecksum;       /* when unstable */
    union {
        struct rb_node node;        /* when node of unstable tree */
        struct {                    /* when listed from stable tree */
            struct stable_node *head;
            struct hlist_node hlist;
        };
    };
};
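
The flag bits packed into the low bits of address are defined just above this struct in mm/ksm.c (quoted here for reference):

#define SEQNR_MASK      0x0ff   /* low bits of unstable tree seqnr */
#define UNSTABLE_FLAG   0x100   /* is a node of the unstable tree */
#define STABLE_FLAG     0x200   /* is listed from the stable tree */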

The node element of the stable tree is struct stable_node, but it takes three forms: a regular node, the head of a node chain, and a list element within a node chain.
Although there are three forms, only the first two are actually linked into the stable tree; the third is in essence a regular node that is merely attached in a different place.
The variable name stable_node denotes a node linked into the stable tree, which may be of form one or two.
The variable name stable_node_dup denotes a node that stores rmap_items, which may be of form one or three.

As a regular node or as the head of a chain, the layout is as shown below, but the fields carry different meanings:

  • Regular node: hlist is the list of rmap_items whose pages use this node's KSM page, and rmap_hlist_len is the length of that list, capped at ksm_max_page_sharing (256 by default, though the cap can apparently be exceeded).
  • Chain head: hlist is the head of the list of nodes managed by this chain, and rmap_hlist_len marks the node as a chain head by holding the value STABLE_NODE_CHAIN.
struct stable_node {
    struct rb_node node;

    struct hlist_head hlist;
    union {
        unsigned long kpfn;
        unsigned long chain_prune_time;
    };

    int rmap_hlist_len;
#ifdef CONFIG_NUMA
    int nid;
#endif
};

As a list element of a chain: hlist_dup links the node into the chain's node list, hlist is the list of rmap_items whose pages use this node's KSM page, and rmap_hlist_len is the length of that rmap_item list.

struct stable_node {

    struct {
        struct list_head *head;
        struct {
            struct hlist_node hlist_dup;
            struct list_head list;
        };
    };

    struct hlist_head hlist;
    union {
        unsigned long kpfn;
        unsigned long chain_prune_time;
    };

#define STABLE_NODE_CHAIN -1024
    int rmap_hlist_len;
#ifdef CONFIG_NUMA
    int nid;
#endif
};

Since a regular node on the stable tree (including forms one and three above) corresponds to one kpage, I will use phrasings like "merging a page into some node" below.
Likewise, the phrase "forming a chain node" covers two cases: creating a brand-new chain node, or inserting one of the nodes into the list of an existing chain node.
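
For reference, mm/ksm.c tells these forms apart with a sentinel and two small helpers; as best I can tell from the v5.0 source they look like this (a dup parked on a chain stores the STABLE_NODE_DUP_HEAD sentinel in head):

#define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)

static __always_inline bool is_stable_node_chain(struct stable_node *chain)
{
    return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
}

static __always_inline bool is_stable_node_dup(struct stable_node *dup)
{
    return dup->head == STABLE_NODE_DUP_HEAD;
}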

Flow Overview

The ksmd kernel thread scans the pages that have been added to the KSM system ("added to the KSM system" is admittedly a rather abstract way to put it) and calls cmp_and_merge_page to merge identical pages.

  • First call stable_tree_search to look for a node in the stable tree that page can merge with, along with its kpage.
    • If there is one, call try_to_merge_with_ksm_page to merge. Once merged, call stable_tree_append to add page's rmap_item to the node behind kpage.
    • If there is none, compute the page's checksum. If it differs from the value last saved in rmap_item->oldchecksum, the page's content is probably changing frequently and there is no point in merging it; return immediately.
    • If the page's checksum == zero_checksum, call try_to_merge_one_page to merge it with the system zero page and return.
    • If the stable tree has nothing, call unstable_tree_search_insert to search the unstable tree for a tree_page to merge with.
      • If one is found, call try_to_merge_two_pages to merge; the resulting kpage is then inserted into the stable tree with stable_tree_insert. Afterwards, stable_tree_append adds the pre-merge tree_rmap_item and rmap_item to the node just created on the stable tree.
      • If none is found, insert the rmap_item into the unstable tree.
/*
 * cmp_and_merge_page - first see if page can be merged into the stable tree;
 * if not, compare checksum to previous and if it's the same, see if page can
 * be inserted into the unstable tree, or merged with a page already there and
 * both transferred to the stable tree.
 *
 * @page: the page that we are searching identical page to.
 * @rmap_item: the reverse mapping into the virtual address of this page
 */
static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
{
    struct mm_struct *mm = rmap_item->mm;
    struct rmap_item *tree_rmap_item;
    struct page *tree_page = NULL;
    struct stable_node *stable_node;
    struct page *kpage;
    unsigned int checksum;
    int err;
    bool max_page_sharing_bypass = false;

    stable_node = page_stable_node(page);
    if (stable_node) {
        if (stable_node->head != &migrate_nodes &&
            get_kpfn_nid(READ_ONCE(stable_node->kpfn)) !=
            NUMA(stable_node->nid)) {
            stable_node_dup_del(stable_node);
            stable_node->head = &migrate_nodes;
            list_add(&stable_node->list, stable_node->head);
        }
        if (stable_node->head != &migrate_nodes &&
            rmap_item->head == stable_node)
            return;
        /*
         * If it's a KSM fork, allow it to go over the sharing limit
         * without warnings.
         */
        if (!is_page_sharing_candidate(stable_node))
            max_page_sharing_bypass = true;
    }

    /* We first start with searching the page inside the stable tree */
    kpage = stable_tree_search(page);
    if (kpage == page && rmap_item->head == stable_node) {
        put_page(kpage);
        return;
    }

    remove_rmap_item_from_tree(rmap_item);

    if (kpage) {
        err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
        if (!err) {
            /*
             * The page was successfully merged:
             * add its rmap_item to the stable tree.
             */
            lock_page(kpage);
            stable_tree_append(rmap_item, page_stable_node(kpage),
                               max_page_sharing_bypass);
            unlock_page(kpage);
        }
        put_page(kpage);
        return;
    }

    /*
     * If the hash value of the page has changed from the last time
     * we calculated it, this page is changing frequently: therefore we
     * don't want to insert it in the unstable tree, and we don't want
     * to waste our time searching for something identical to it there.
     */
    checksum = calc_checksum(page);
    if (rmap_item->oldchecksum != checksum) {
        rmap_item->oldchecksum = checksum;
        return;
    }

    /*
     * Same checksum as an empty page. We attempt to merge it with the
     * appropriate zero page if the user enabled this via sysfs.
     */
    if (ksm_use_zero_pages && (checksum == zero_checksum)) {
        struct vm_area_struct *vma;

        down_read(&mm->mmap_sem);
        vma = find_mergeable_vma(mm, rmap_item->address);
        err = try_to_merge_one_page(vma, page,
                                    ZERO_PAGE(rmap_item->address));
        up_read(&mm->mmap_sem);
        /*
         * In case of failure, the page was not really empty, so we
         * need to continue. Otherwise we're done.
         */
        if (!err)
            return;
    }
    tree_rmap_item =
        unstable_tree_search_insert(rmap_item, page, &tree_page);
    if (tree_rmap_item) {
        bool split;

        kpage = try_to_merge_two_pages(rmap_item, page,
                                       tree_rmap_item, tree_page);
        /*
         * If both pages we tried to merge belong to the same compound
         * page, then we actually ended up increasing the reference
         * count of the same compound page twice, and split_huge_page
         * failed.
         * Here we set a flag if that happened, and we use it later to
         * try split_huge_page again. Since we call put_page right
         * afterwards, the reference count will be correct and
         * split_huge_page should succeed.
         */
        split = PageTransCompound(page)
            && compound_head(page) == compound_head(tree_page);
        put_page(tree_page);
        if (kpage) {
            /*
             * The pages were successfully merged: insert new
             * node in the stable tree and add both rmap_items.
             */
            lock_page(kpage);
            stable_node = stable_tree_insert(kpage);
            if (stable_node) {
                stable_tree_append(tree_rmap_item, stable_node,
                                   false);
                stable_tree_append(rmap_item, stable_node,
                                   false);
            }
            unlock_page(kpage);

            /*
             * If we fail to insert the page into the stable tree,
             * we will have 2 virtual addresses that are pointing
             * to a ksm page left outside the stable tree,
             * in which case we need to break_cow on both.
             */
            if (!stable_node) {
                break_cow(tree_rmap_item);
                break_cow(rmap_item);
            }
        } else if (split) {
            /*
             * We are here if we tried to merge two pages and
             * failed because they both belonged to the same
             * compound page. We will split the page now, but no
             * merging will take place.
             * We do not want to add the cost of a full lock; if
             * the page is locked, it is better to skip it and
             * perhaps try again later.
             */
            if (!trylock_page(page))
                return;
            split_huge_page(page);
            unlock_page(page);
        }
    }
}

Source Analysis

KSM system initialization

ksm_init performs the initialization:

  • Compute the checksum of the system zero page.
  • Create the kmem_caches for rmap_item, stable_node, and mm_slot.
  • Create the ksmd kernel thread.
  • Create the sysfs attribute group.
static int __init ksm_init(void)
{
    struct task_struct *ksm_thread;
    int err;

    /* The correct value depends on page size and endianness */
    zero_checksum = calc_checksum(ZERO_PAGE(0));
    /* Default to false for backwards compatibility */
    ksm_use_zero_pages = false;

    err = ksm_slab_init();
    if (err)
        goto out;

    ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
    if (IS_ERR(ksm_thread)) {
        pr_err("ksm: creating kthread failed\n");
        err = PTR_ERR(ksm_thread);
        goto out_free;
    }

#ifdef CONFIG_SYSFS
    err = sysfs_create_group(mm_kobj, &ksm_attr_group);
    if (err) {
        pr_err("ksm: register sysfs failed\n");
        kthread_stop(ksm_thread);
        goto out_free;
    }
#else
    ksm_run = KSM_RUN_MERGE;    /* no way for user to start it */

#endif /* CONFIG_SYSFS */

#ifdef CONFIG_MEMORY_HOTREMOVE
    /* There is no significance to this priority 100 */
    hotplug_memory_notifier(ksm_memory_callback, 100);
#endif
    return 0;

out_free:
    ksm_slab_free();
out:
    return err;
}
subsys_initcall(ksm_init);
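
The attribute group registered above is what exposes KSM's runtime controls under /sys/kernel/mm/ksm/: run (0 = KSM_RUN_STOP, 1 = KSM_RUN_MERGE, 2 = KSM_RUN_UNMERGE), plus tunables such as pages_to_scan, sleep_millisecs, max_page_sharing and use_zero_pages that back the variables seen throughout the code below.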

Adding an mm and vma to the KSM system with madvise

Memory regions are added to or removed from the KSM system by passing MADV_MERGEABLE or MADV_UNMERGEABLE to the madvise system call.
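
As a quick illustration (my own minimal userspace sketch, not from the original post), this is how a process opts an anonymous mapping into KSM:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 4096;

    /* Anonymous private memory: the kind of page KSM can merge. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0x5a, len);     /* identical contents: merge candidates */

    /* Hand the range to KSM; this is the path into ksm_madvise(). */
    if (madvise(buf, len, MADV_MERGEABLE))
        perror("madvise");

    /* Pages merge later, once ksmd scans them (needs run=1 in sysfs). */
    return 0;
}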

  • For MADV_MERGEABLE:

    • If the vma is DAX (direct access), or has VM_SAO (Strong Access Ordering, powerpc) or VM_SPARC_ADI (uses ADI tags for access control), return immediately.
    • If this mm_struct has not yet been added to the KSM system, call __ksm_enter.
    • Mark the vma VM_MERGEABLE.
  • For MADV_UNMERGEABLE:

    • If the vma does not carry VM_MERGEABLE, return immediately.
    • If vma->anon_vma is set, call unmerge_ksm_pages to unmerge.
    • Clear the vma's VM_MERGEABLE flag.
int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
                unsigned long end, int advice, unsigned long *vm_flags)
{
    struct mm_struct *mm = vma->vm_mm;
    int err;

    switch (advice) {
    case MADV_MERGEABLE:
        /*
         * Be somewhat over-protective for now!
         */
        if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
                         VM_PFNMAP | VM_IO | VM_DONTEXPAND |
                         VM_HUGETLB | VM_MIXEDMAP))
            return 0;       /* just ignore the advice */

        if (vma_is_dax(vma))
            return 0;

#ifdef VM_SAO
        if (*vm_flags & VM_SAO)
            return 0;
#endif
#ifdef VM_SPARC_ADI
        if (*vm_flags & VM_SPARC_ADI)
            return 0;
#endif

        if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
            err = __ksm_enter(mm);
            if (err)
                return err;
        }

        *vm_flags |= VM_MERGEABLE;
        break;

    case MADV_UNMERGEABLE:
        if (!(*vm_flags & VM_MERGEABLE))
            return 0;       /* just ignore the advice */

        if (vma->anon_vma) {
            err = unmerge_ksm_pages(vma, start, end);
            if (err)
                return err;
        }

        *vm_flags &= ~VM_MERGEABLE;
        break;
    }

    return 0;
}
  • Allocate an mm_slot, set mm_slot->mm, and insert it into the mm_slots_hash hash table.

  • Depending on the KSM thread's state, insert the mm_slot at different positions in the list:

    • If the current state is KSM_RUN_MERGE or KSM_RUN_STOP, insert it just behind the scanning cursor; this keeps ksmd from wasting time on a process that execs immediately after fork.
    • If the current state is KSM_RUN_UNMERGE, it must go at the tail of the list, so that KSM is guaranteed to reach the newly inserted mm_slot.
  • Mark the mm with MMF_VM_MERGEABLE to record that it has joined the KSM system.

  • If the mm_slots list was empty before this insertion, wake up the ksmd thread.
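
For reference, the mm_slot bookkeeping structure itself (from mm/ksm.c; the post never quotes it):

/* One mm_slot per mm registered with KSM. */
struct mm_slot {
    struct hlist_node link;         /* link into mm_slots_hash */
    struct list_head mm_list;       /* link into the scan list */
    struct rmap_item *rmap_list;    /* this mm's rmap_items */
    struct mm_struct *mm;
};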

int __ksm_enter(struct mm_struct *mm)
{
    struct mm_slot *mm_slot;
    int needs_wakeup;

    mm_slot = alloc_mm_slot();
    if (!mm_slot)
        return -ENOMEM;

    /* Check ksm_run too? Would need tighter locking */
    needs_wakeup = list_empty(&ksm_mm_head.mm_list);

    spin_lock(&ksm_mmlist_lock);
    insert_to_mm_slots_hash(mm, mm_slot);
    /*
     * When KSM_RUN_MERGE (or KSM_RUN_STOP),
     * insert just behind the scanning cursor, to let the area settle
     * down a little; when fork is followed by immediate exec, we don't
     * want ksmd to waste time setting up and tearing down an rmap_list.
     *
     * But when KSM_RUN_UNMERGE, it's important to insert ahead of its
     * scanning cursor, otherwise KSM pages in newly forked mms will be
     * missed: then we might as well insert at the end of the list.
     */
    if (ksm_run & KSM_RUN_UNMERGE)
        list_add_tail(&mm_slot->mm_list, &ksm_mm_head.mm_list);
    else
        list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
    spin_unlock(&ksm_mmlist_lock);

    set_bit(MMF_VM_MERGEABLE, &mm->flags);
    mmgrab(mm);

    if (needs_wakeup)
        wake_up_interruptible(&ksm_thread_wait);

    return 0;
}

Analysis of the KSM helper functions

try_to_merge_xxxxxx

try_to_merge_two_pages and try_to_merge_with_ksm_page are quite similar, except that try_to_merge_with_ksm_page requires its second page (kpage) to already be a KSM page.

try_to_merge_two_pages first calls try_to_merge_with_ksm_page with a NULL kpage to try to promote page into a ksm_page; it then uses page as the kpage and calls try_to_merge_with_ksm_page again to merge tree_page into it.

/*
 * try_to_merge_two_pages - take two identical pages and prepare them
 * to be merged into one page.
 *
 * This function returns the kpage if we successfully merged two identical
 * pages into one ksm page, NULL otherwise.
 *
 * Note that this function upgrades page to ksm page: if one of the pages
 * is already a ksm page, try_to_merge_with_ksm_page should be used.
 */
static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
                                           struct page *page,
                                           struct rmap_item *tree_rmap_item,
                                           struct page *tree_page)
{
    int err;

    err = try_to_merge_with_ksm_page(rmap_item, page, NULL);
    if (!err) {
        err = try_to_merge_with_ksm_page(tree_rmap_item,
                                         tree_page, page);
        /*
         * If that fails, we have a ksm page with only one pte
         * pointing to it: so break it.
         */
        if (err)
            break_cow(rmap_item);
    }
    return err ? NULL : page;
}

try_to_merge_with_ksm_page calls try_to_merge_one_page to do the actual merge, then calls remove_rmap_item_from_tree to remove the rmap_item from its tree (provided the rmap_item was on a tree to begin with).

/*
 * try_to_merge_with_ksm_page - like try_to_merge_two_pages,
 * but no new kernel page is allocated: kpage must already be a ksm page.
 *
 * This function returns 0 if the pages were merged, -EFAULT otherwise.
 */
static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
                                      struct page *page, struct page *kpage)
{
    struct mm_struct *mm = rmap_item->mm;
    struct vm_area_struct *vma;
    int err = -EFAULT;

    down_read(&mm->mmap_sem);
    vma = find_mergeable_vma(mm, rmap_item->address);
    if (!vma)
        goto out;

    err = try_to_merge_one_page(vma, page, kpage);
    if (err)
        goto out;

    /* Unstable nid is in union with stable anon_vma: remove first */
    remove_rmap_item_from_tree(rmap_item);

    /* Must get reference to anon_vma while still holding mmap_sem */
    rmap_item->anon_vma = vma->anon_vma;
    get_anon_vma(vma->anon_vma);
out:
    up_read(&mm->mmap_sem);
    return err;
}

try_to_merge_one_page tries to merge page into kpage, or to turn page itself into a KSM page.

  • If page == kpage, the scanned page is already the kpage in KSM, which is caused by fork. Return immediately.
  • If page is not an anonymous page, return immediately.
  • Call write_protect_page to make the PTE pointing at page read-only (this also checks that the page count has not been raised behind our back); the merge only proceeds if this succeeds.
  • Check whether kpage is NULL:
    • If it is NULL, this call is meant to turn page into a KSM page.
      • Mark page as a KSM page (page->mapping |= PAGE_MAPPING_KSM), but leave the stable_node pointer NULL for now.
      • Set the dirty bit, so that page reclaim swaps the page out instead of just freeing it.
    • If it is not NULL:
      • Map page and kpage temporarily and memcmp their contents (see the memcmp_pages helper quoted after the listing below).
        • If they are identical, call replace_page to merge (i.e. replace).
  • If the original vma has VM_LOCKED set, munlock page and mlock the kpage instead.
/*
 * try_to_merge_one_page - take two pages and merge them into one
 * @vma: the vma that holds the pte pointing to page
 * @page: the PageAnon page that we want to replace with kpage
 * @kpage: the PageKsm page that we want to map instead of page,
 *         or NULL the first time when we want to use page as kpage.
 *
 * This function returns 0 if the pages were merged, -EFAULT otherwise.
 */
static int try_to_merge_one_page(struct vm_area_struct *vma,
                                 struct page *page, struct page *kpage)
{
    pte_t orig_pte = __pte(0);
    int err = -EFAULT;

    if (page == kpage)          /* ksm page forked */
        return 0;

    if (!PageAnon(page))
        goto out;

    /*
     * We need the page lock to read a stable PageSwapCache in
     * write_protect_page(). We use trylock_page() instead of
     * lock_page() because we don't want to wait here - we
     * prefer to continue scanning and merging different pages,
     * then come back to this page when it is unlocked.
     */
    if (!trylock_page(page))
        goto out;

    if (PageTransCompound(page)) {
        if (split_huge_page(page))
            goto out_unlock;
    }

    /*
     * If this anonymous page is mapped only here, its pte may need
     * to be write-protected. If it's mapped elsewhere, all of its
     * ptes are necessarily already write-protected. But in either
     * case, we need to lock and check page_count is not raised.
     */
    if (write_protect_page(vma, page, &orig_pte) == 0) {
        if (!kpage) {
            /*
             * While we hold page lock, upgrade page from
             * PageAnon+anon_vma to PageKsm+NULL stable_node:
             * stable_tree_insert() will update stable_node.
             */
            set_page_stable_node(page, NULL);
            mark_page_accessed(page);
            /*
             * Page reclaim just frees a clean page with no dirty
             * ptes: make sure that the ksm page would be swapped.
             */
            if (!PageDirty(page))
                SetPageDirty(page);
            err = 0;
        } else if (pages_identical(page, kpage))
            err = replace_page(vma, page, kpage, orig_pte);
    }

    if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
        munlock_vma_page(page);
        if (!PageMlocked(kpage)) {
            unlock_page(page);
            lock_page(kpage);
            mlock_vma_page(kpage);
            page = kpage;       /* for final unlock */
        }
    }

out_unlock:
    unlock_page(page);
out:
    return err;
}
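
For reference, the content comparison used above just maps both pages and memcmp's PAGE_SIZE bytes; the helpers in mm/ksm.c look like this:

static int memcmp_pages(struct page *page1, struct page *page2)
{
    char *addr1, *addr2;
    int ret;

    addr1 = kmap_atomic(page1);
    addr2 = kmap_atomic(page2);
    ret = memcmp(addr1, addr2, PAGE_SIZE);
    kunmap_atomic(addr2);
    kunmap_atomic(addr1);
    return ret;
}

static inline int pages_identical(struct page *page1, struct page *page2)
{
    return !memcmp_pages(page1, page2);
}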

The replace_page function

  • Build a new PTE so that the PTE that used to point to page now points to kpage.
  • Call page_add_anon_rmap on kpage. (As far as I can tell, here this only bumps _mapcount and NR_ANON_MAPPED.)
  • Call page_remove_rmap on page. (Likewise, this seems to only drop _mapcount and NR_ANON_MAPPED.)
  • If the page is no longer mapped anywhere, try to free its swap space.
  • Drop one reference to the page.
/**
 * replace_page - replace page in vma by new ksm page
 * @vma:      vma that holds the pte pointing to page
 * @page:     the page we are replacing by kpage
 * @kpage:    the ksm page we replace page by
 * @orig_pte: the original value of the pte
 *
 * Returns 0 on success, -EFAULT on failure.
 */
static int replace_page(struct vm_area_struct *vma, struct page *page,
                        struct page *kpage, pte_t orig_pte)
{
    struct mm_struct *mm = vma->vm_mm;
    pmd_t *pmd;
    pte_t *ptep;
    pte_t newpte;
    spinlock_t *ptl;
    unsigned long addr;
    int err = -EFAULT;
    struct mmu_notifier_range range;

    addr = page_address_in_vma(page, vma);
    if (addr == -EFAULT)
        goto out;

    pmd = mm_find_pmd(mm, addr);
    if (!pmd)
        goto out;

    mmu_notifier_range_init(&range, mm, addr, addr + PAGE_SIZE);
    mmu_notifier_invalidate_range_start(&range);

    ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    if (!pte_same(*ptep, orig_pte)) {
        pte_unmap_unlock(ptep, ptl);
        goto out_mn;
    }

    /*
     * No need to check ksm_use_zero_pages here: we can only have a
     * zero_page here if ksm_use_zero_pages was enabled already.
     */
    if (!is_zero_pfn(page_to_pfn(kpage))) {
        get_page(kpage);
        page_add_anon_rmap(kpage, vma, addr, false);
        newpte = mk_pte(kpage, vma->vm_page_prot);
    } else {
        newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
                                       vma->vm_page_prot));
        /*
         * We're replacing an anonymous page with a zero page, which is
         * not anonymous. We need to do proper accounting otherwise we
         * will get wrong values in /proc, and a BUG message in dmesg
         * when tearing down the mm.
         */
        dec_mm_counter(mm, MM_ANONPAGES);
    }

    flush_cache_page(vma, addr, pte_pfn(*ptep));
    /*
     * No need to notify as we are replacing a read only page with another
     * read only page with the same content.
     *
     * See Documentation/vm/mmu_notifier.rst
     */
    ptep_clear_flush(vma, addr, ptep);
    set_pte_at_notify(mm, addr, ptep, newpte);

    page_remove_rmap(page, false);
    if (!page_mapped(page))
        try_to_free_swap(page);
    put_page(page);

    pte_unmap_unlock(ptep, ptl);
    err = 0;
out_mn:
    mmu_notifier_invalidate_range_end(&range);
out:
    return err;
}

get_ksm_page

get_ksm_page tries to fetch the KSM page behind a stable node (a KSM page simply being one whose page->mapping carries PAGE_MAPPING_KSM).
This is necessary because a stable_node holds no reference on its KSM page, which may have been freed or replaced; get_ksm_page guarantees that accessing the page is still safe and correct. Along the way, the function also removes stale nodes, which can rebalance the tree; that removal is the pruning (prune) mentioned elsewhere.

/*
 * get_ksm_page: checks if the page indicated by the stable node
 * is still its ksm page, despite having held no reference to it.
 * In which case we can trust the content of the page, and it
 * returns the gotten page; but if the page has now been zapped,
 * remove the stale node from the stable tree and return NULL.
 * But beware, the stable node's page might be being migrated.
 *
 * You would expect the stable_node to hold a reference to the ksm page.
 * But if it increments the page's count, swapping out has to wait for
 * ksmd to come around again before it can free the page, which may take
 * seconds or even minutes: much too unresponsive. So instead we use a
 * "keyhole reference": access to the ksm page from the stable node peeps
 * out through its keyhole to see if that page still holds the right key,
 * pointing back to this stable node. This relies on freeing a PageAnon
 * page to reset its page->mapping to NULL, and relies on no other use of
 * a page to put something that might look like our key in page->mapping.
 * is on its way to being freed; but it is an anomaly to bear in mind.
 */
static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
{
    struct page *page;
    void *expected_mapping;
    unsigned long kpfn;

    expected_mapping = (void *)((unsigned long)stable_node |
                                PAGE_MAPPING_KSM);
again:
    kpfn = READ_ONCE(stable_node->kpfn); /* Address dependency. */
    page = pfn_to_page(kpfn);
    if (READ_ONCE(page->mapping) != expected_mapping)
        goto stale;

    /*
     * We cannot do anything with the page while its refcount is 0.
     * Usually 0 means free, or tail of a higher-order page: in which
     * case this node is no longer referenced, and should be freed;
     * however, it might mean that the page is under page_ref_freeze().
     * The __remove_mapping() case is easy, again the node is now stale;
     * but if page is swapcache in migrate_page_move_mapping(), it might
     * still be our page, in which case it's essential to keep the node.
     */
    while (!get_page_unless_zero(page)) {
        /*
         * Another check for page->mapping != expected_mapping would
         * work here too. We have chosen the !PageSwapCache test to
         * optimize the common case, when the page is or is about to
         * be freed: PageSwapCache is cleared (under spin_lock_irq)
         * in the ref_freeze section of __remove_mapping(); but Anon
         * page->mapping reset to NULL later, in free_pages_prepare().
         */
        if (!PageSwapCache(page))
            goto stale;
        cpu_relax();
    }

    if (READ_ONCE(page->mapping) != expected_mapping) {
        put_page(page);
        goto stale;
    }

    if (lock_it) {
        lock_page(page);
        if (READ_ONCE(page->mapping) != expected_mapping) {
            unlock_page(page);
            put_page(page);
            goto stale;
        }
    }
    return page;

stale:
    /*
     * We come here from above when page->mapping or !PageSwapCache
     * suggests that the node is stale; but it might be under migration.
     * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
     * before checking whether node->kpfn has been changed.
     */
    smp_rmb();
    if (READ_ONCE(stable_node->kpfn) != kpfn)
        goto again;
    remove_node_from_stable_tree(stable_node);
    return NULL;
}

The ksmd flow

Like any other kernel thread, ksmd loops around its main worker function, ksm_do_scan.

static int ksm_scan_thread(void *nothing)
{
    unsigned int sleep_ms;

    set_freezable();
    set_user_nice(current, 5);

    while (!kthread_should_stop()) {
        mutex_lock(&ksm_thread_mutex);
        wait_while_offlining();
        if (ksmd_should_run())
            ksm_do_scan(ksm_thread_pages_to_scan);
        mutex_unlock(&ksm_thread_mutex);

        try_to_freeze();

        if (ksmd_should_run()) {
            sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs);
            wait_event_interruptible_timeout(ksm_iter_wait,
                sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs),
                msecs_to_jiffies(sleep_ms));
        } else {
            wait_event_freezable(ksm_thread_wait,
                ksmd_should_run() || kthread_should_stop());
        }
    }
    return 0;
}

On each wakeup, ksmd scans ksm_thread_pages_to_scan pages.
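
(If I read ksm.c correctly, the defaults in this version are ksm_thread_pages_to_scan = 100 and ksm_thread_sleep_millisecs = 20, both adjustable through the sysfs files mentioned earlier.)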

/**
 * ksm_do_scan  - the ksm scanner main worker function.
 * @scan_npages: number of pages we want to scan before we return.
 */
static void ksm_do_scan(unsigned int scan_npages)
{
    struct rmap_item *rmap_item;
    struct page *uninitialized_var(page);

    while (scan_npages-- && likely(!freezing(current))) {
        cond_resched();
        rmap_item = scan_get_next_rmap_item(&page);
        if (!rmap_item)
            return;
        cmp_and_merge_page(page, rmap_item);
        put_page(page);
    }
}
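
scan_get_next_rmap_item (not quoted here) drives the iteration: it walks the mm_slot list and each mm's mergeable VMAs, building up the rmap_list as it goes; when a full pass over all mms completes, it bumps ksm_scan.seqnr and resets the unstable tree roots to RB_ROOT, so the unstable tree is rebuilt from scratch on every full scan. The global scan cursor (from mm/ksm.c):

/* Cursor recording where the next batch of scanning resumes. */
struct ksm_scan {
    struct mm_slot *mm_slot;        /* the mm currently being scanned */
    unsigned long address;          /* next address inside that mm */
    struct rmap_item **rmap_list;   /* link to the next rmap_item */
    unsigned long seqnr;            /* count of completed full scans */
};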

unstable_tree_search_insert searches the unstable tree for a tree_page with content identical to the given page, together with its tree_rmap_item. If there is none, it inserts the page's rmap_item into the unstable tree.

  • First select the root of the unstable tree matching the page's nid.
  • Walk down from the root (tree_rmap_item serves as the cursor):
    • Call get_mergeable_page to fetch the cursor's mergeable anonymous page tree_page; this walks the page tables via follow_page with rmap_item->address.
    • If nothing comes back, return immediately. (At first I did not understand why this returns rather than moving on to the next node; presumably, without the node's page content there is no way to decide whether to descend left or right, so the walk cannot continue meaningfully, and the stale node gets flushed anyway when the unstable tree is rebuilt on the next full scan.)
    • If page == tree_page (caused by fork), return immediately.
    • memcmp the contents of page and tree_page; the return value decides whether to keep walking and, if so, whether to take the left or right child.
      • If the contents are identical, the search is complete.
      • Otherwise keep walking.
  • If the walk completes without finding an identical tree_page, insert the rmap_item into the unstable tree.
/*
 * unstable_tree_search_insert - search for identical page,
 * else insert rmap_item into the unstable tree.
 *
 * This function searches for a page in the unstable tree identical to the
 * page currently being scanned; and if no identical page is found in the
 * tree, we insert rmap_item as a new object into the unstable tree.
 *
 * This function returns pointer to rmap_item found to be identical
 * to the currently scanned page, NULL otherwise.
 *
 * This function does both searching and inserting, because they share
 * the same walking algorithm in an rbtree.
 */
static
struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
                                              struct page *page,
                                              struct page **tree_pagep)
{
    struct rb_node **new;
    struct rb_root *root;
    struct rb_node *parent = NULL;
    int nid;

    nid = get_kpfn_nid(page_to_pfn(page));
    root = root_unstable_tree + nid;
    new = &root->rb_node;

    while (*new) {
        struct rmap_item *tree_rmap_item;
        struct page *tree_page;
        int ret;

        cond_resched();
        tree_rmap_item = rb_entry(*new, struct rmap_item, node);
        tree_page = get_mergeable_page(tree_rmap_item);
        if (!tree_page)
            return NULL;

        /*
         * Don't substitute a ksm page for a forked page.
         */
        if (page == tree_page) {
            put_page(tree_page);
            return NULL;
        }

        ret = memcmp_pages(page, tree_page);

        parent = *new;
        if (ret < 0) {
            put_page(tree_page);
            new = &parent->rb_left;
        } else if (ret > 0) {
            put_page(tree_page);
            new = &parent->rb_right;
        } else if (!ksm_merge_across_nodes &&
                   page_to_nid(tree_page) != nid) {
            /*
             * If tree_page has been migrated to another NUMA node,
             * it will be flushed out and put in the right unstable
             * tree next time: only merge with it when across_nodes.
             */
            put_page(tree_page);
            return NULL;
        } else {
            *tree_pagep = tree_page;
            return tree_rmap_item;
        }
    }

    rmap_item->address |= UNSTABLE_FLAG;
    rmap_item->address |= (ksm_scan.seqnr & SEQNR_MASK);
    DO_NUMA(rmap_item->nid = nid);
    rb_link_node(&rmap_item->node, parent, new);
    rb_insert_color(&rmap_item->node, root);

    ksm_pages_unshared++;
    return NULL;
}
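
Note the two flags set on the address at insertion time above: UNSTABLE_FLAG marks the rmap_item as sitting in the unstable tree, and the low SEQNR_MASK bits record the current scan generation. Since the unstable tree roots are simply reset on every full scan, remove_rmap_item_from_tree compares the stored generation with ksm_scan.seqnr and only performs the rb_erase when the item was inserted during the current generation.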

Let me introduce the concept of a page-sharing candidate here; it comes up constantly below.
The criterion is 0 < stable_node->rmap_hlist_len < ksm_max_page_sharing; in plain words, the node's rmap_hlist has not yet reached the limit.

static __always_inline
bool __is_page_sharing_candidate(struct stable_node *stable_node, int offset)
{
    VM_BUG_ON(stable_node->rmap_hlist_len < 0);
    /*
     * Check that at least one mapping still exists, otherwise
     * there's no much point to merge and share with this
     * stable_node, as the underlying tree_page of the other
     * sharer is going to be freed soon.
     */
    return stable_node->rmap_hlist_len &&
        stable_node->rmap_hlist_len + offset < ksm_max_page_sharing;
}
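
The unparameterized form used everywhere else is just the zero-offset case:

static __always_inline
bool is_page_sharing_candidate(struct stable_node *stable_node)
{
    return __is_page_sharing_candidate(stable_node, 0);
}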

stable_tree_search works in a similar way.

  • Try to fetch the page's stable-tree node, page_node.
    • If it exists and page_node->head != &migrate_nodes, this page was scanned again because of fork; return immediately.
  • Walk down from the root of the stable tree:
    • Call chain_prune to obtain the node this merge could target (stable_node_dup) and its tree_page, pruning the stable tree along the way (see below).
      • If nothing comes back, either the node itself cannot be a candidate for this merge (regular node), or none of the nodes in it can be a candidate any more (chain node; see the candidate definition above). Call stable_node_dup_any to fetch a node with no restrictions: possibly the node itself (regular) or the first node in its chain (chain).
        • If that succeeds, set tree_page to the new node's ksm_page and carry on.
        • If it still fails, this is an empty chain node; stable_node_dup_any will have rb_erase'd it, rebalancing the tree, so restart the walk from the root.
    • If tree_page is NULL, get_ksm_page was just called on a stale node, which removes that node from the stable tree and may rebalance it; restart the walk from the root.
    • At this point we hold a tree_page that may be mergeable; memcmp the contents and pick the walk direction from the result.
      • If the contents are identical:
        • If we obtained page_node earlier (i.e. the page was already in the stable tree but is under migration) and page->_mapcount > 1, jump to chain_append and chain page_node together with the stable_node we found. Return page if page_node is a candidate, NULL otherwise.
        • If there is no node we can merge into, return NULL.
        • Try to get stable_node_dup's ksm page; if that fails, restart from the root, otherwise return tree_page.
  • When the walk ends without a match: if there is no page_node, return NULL; otherwise link page_node (which was on the migrate_nodes list) into the tree at the position found, then return page if page_node is a candidate, NULL otherwise.
/*
 * stable_tree_search - search for page inside the stable tree
 *
 * This function checks if there is a page inside the stable tree
 * with identical content to the page that we are scanning right now.
 *
 * This function returns the stable tree node of identical content if found,
 * NULL otherwise.
 */
static struct page *stable_tree_search(struct page *page)
{
    int nid;
    struct rb_root *root;
    struct rb_node **new;
    struct rb_node *parent;
    struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
    struct stable_node *page_node;

    page_node = page_stable_node(page);
    if (page_node && page_node->head != &migrate_nodes) {
        /* ksm page forked */
        get_page(page);
        return page;
    }

    nid = get_kpfn_nid(page_to_pfn(page));
    root = root_stable_tree + nid;
again:
    new = &root->rb_node;
    parent = NULL;

    while (*new) {
        struct page *tree_page;
        int ret;

        cond_resched();
        stable_node = rb_entry(*new, struct stable_node, node);
        stable_node_any = NULL;
        tree_page = chain_prune(&stable_node_dup, &stable_node, root);
        /*
         * NOTE: stable_node may have been freed by
         * chain_prune() if the returned stable_node_dup is
         * not NULL. stable_node_dup may have been inserted in
         * the rbtree instead as a regular stable_node (in
         * order to collapse the stable_node chain if a single
         * stable_node dup was found in it). In such case the
         * stable_node is overwritten by the callee to point
         * to the stable_node_dup that was collapsed in the
         * stable rbtree and stable_node will be equal to
         * stable_node_dup like if the chain never existed.
         */
        if (!stable_node_dup) {
            /*
             * Either all stable_node dups were full in
             * this stable_node chain, or this chain was
             * empty and should be rb_erased.
             */
            stable_node_any = stable_node_dup_any(stable_node,
                                                  root);
            if (!stable_node_any) {
                /* rb_erase just run */
                goto again;
            }
            /*
             * Take any of the stable_node dups page of
             * this stable_node chain to let the tree walk
             * continue. All KSM pages belonging to the
             * stable_node dups in a stable_node chain
             * have the same content and they're
             * wrprotected at all times. Any will work
             * fine to continue the walk.
             */
            tree_page = get_ksm_page(stable_node_any, false);
        }
        VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
        if (!tree_page) {
            /*
             * If we walked over a stale stable_node,
             * get_ksm_page() will call rb_erase() and it
             * may rebalance the tree from under us. So
             * restart the search from scratch. Returning
             * NULL would be safe too, but we'd generate
             * false negative insertions just because some
             * stable_node was stale.
             */
            goto again;
        }

        ret = memcmp_pages(page, tree_page);
        put_page(tree_page);

        parent = *new;
        if (ret < 0)
            new = &parent->rb_left;
        else if (ret > 0)
            new = &parent->rb_right;
        else {
            if (page_node) {
                VM_BUG_ON(page_node->head != &migrate_nodes);
                /*
                 * Test if the migrated page should be merged
                 * into a stable node dup. If the mapcount is
                 * 1 we can migrate it with another KSM page
                 * without adding it to the chain.
                 */
                if (page_mapcount(page) > 1)
                    goto chain_append;
            }

            if (!stable_node_dup) {
                /*
                 * If the stable_node is a chain and
                 * we got a payload match in memcmp
                 * but we cannot merge the scanned
                 * page in any of the existing
                 * stable_node dups because they're
                 * all full, we need to wait the
                 * scanned page to find itself a match
                 * in the unstable tree to create a
                 * brand new KSM page to add later to
                 * the dups of this stable_node.
                 */
                return NULL;
            }

            /*
             * Lock and unlock the stable_node's page (which
             * might already have been migrated) so that page
             * migration is sure to notice its raised count.
             * It would be more elegant to return stable_node
             * than kpage, but that involves more changes.
             */
            tree_page = get_ksm_page(stable_node_dup, true);
            if (unlikely(!tree_page))
                /*
                 * The tree may have been rebalanced,
                 * so re-evaluate parent and new.
                 */
                goto again;
            unlock_page(tree_page);

            if (get_kpfn_nid(stable_node_dup->kpfn) !=
                NUMA(stable_node_dup->nid)) {
                put_page(tree_page);
                goto replace;
            }
            return tree_page;
        }
    }

    if (!page_node)
        return NULL;

    list_del(&page_node->list);
    DO_NUMA(page_node->nid = nid);
    rb_link_node(&page_node->node, parent, new);
    rb_insert_color(&page_node->node, root);
out:
    if (is_page_sharing_candidate(page_node)) {
        get_page(page);
        return page;
    } else
        return NULL;

replace:
    /*
     * If stable_node was a chain and chain_prune collapsed it,
     * stable_node has been updated to be the new regular
     * stable_node. A collapse of the chain is indistinguishable
     * from the case there was no chain in the stable
     * rbtree. Otherwise stable_node is the chain and
     * stable_node_dup is the dup to replace.
     */
    if (stable_node_dup == stable_node) {
        VM_BUG_ON(is_stable_node_chain(stable_node_dup));
        VM_BUG_ON(is_stable_node_dup(stable_node_dup));
        /* there is no chain */
        if (page_node) {
            VM_BUG_ON(page_node->head != &migrate_nodes);
            list_del(&page_node->list);
            DO_NUMA(page_node->nid = nid);
            rb_replace_node(&stable_node_dup->node,
                            &page_node->node,
                            root);
            if (is_page_sharing_candidate(page_node))
                get_page(page);
            else
                page = NULL;
        } else {
            rb_erase(&stable_node_dup->node, root);
            page = NULL;
        }
    } else {
        VM_BUG_ON(!is_stable_node_chain(stable_node));
        __stable_node_dup_del(stable_node_dup);
        if (page_node) {
            VM_BUG_ON(page_node->head != &migrate_nodes);
            list_del(&page_node->list);
            DO_NUMA(page_node->nid = nid);
            stable_node_chain_add_dup(page_node, stable_node);
            if (is_page_sharing_candidate(page_node))
                get_page(page);
            else
                page = NULL;
        } else {
            page = NULL;
        }
    }
    stable_node_dup->head = &migrate_nodes;
    list_add(&stable_node_dup->list, stable_node_dup->head);
    return page;

chain_append:
    /* stable_node_dup could be null if it reached the limit */
    if (!stable_node_dup)
        stable_node_dup = stable_node_any;
    /*
     * If stable_node was a chain and chain_prune collapsed it,
     * stable_node has been updated to be the new regular
     * stable_node. A collapse of the chain is indistinguishable
     * from the case there was no chain in the stable
     * rbtree. Otherwise stable_node is the chain and
     * stable_node_dup is the dup to replace.
     */
    if (stable_node_dup == stable_node) {
        VM_BUG_ON(is_stable_node_chain(stable_node_dup));
        VM_BUG_ON(is_stable_node_dup(stable_node_dup));
        /* chain is missing so create it */
        stable_node = alloc_stable_node_chain(stable_node_dup,
                                              root);
        if (!stable_node)
            return NULL;
    }
    /*
     * Add this stable_node dup that was
     * migrated to the stable_node chain
     * of the current nid for this page
     * content.
     */
    VM_BUG_ON(!is_stable_node_chain(stable_node));
    VM_BUG_ON(!is_stable_node_dup(stable_node_dup));
    VM_BUG_ON(page_node->head != &migrate_nodes);
    list_del(&page_node->list);
    DO_NUMA(page_node->nid = nid);
    stable_node_chain_add_dup(page_node, stable_node);
    goto out;
}

chain_prune and chain serve the same purpose: fetch the candidate inside node s_n (which may be the regular node itself, or one node from the chain's list) into s_n_d. In addition, chain_prune may collapse a chain of length one back into a regular node, so the value of s_n may be updated.

static __always_inline struct page *chain(struct stable_node **s_n_d,
                                          struct stable_node *s_n,
                                          struct rb_root *root)
{
    struct stable_node *old_stable_node = s_n;
    struct page *tree_page;

    tree_page = __stable_node_chain(s_n_d, &s_n, root, false);
    /* not pruning dups so s_n cannot have changed */
    VM_BUG_ON(s_n != old_stable_node);
    return tree_page;
}

static __always_inline struct page *chain_prune(struct stable_node **s_n_d,
                                                struct stable_node **s_n,
                                                struct rb_root *root)
{
    return __stable_node_chain(s_n_d, s_n, root, true);
}

The __stable_node_chain function

  • Check whether the current node is a chain node or a regular node (a chain node has rmap_hlist_len == STABLE_NODE_CHAIN).
    • If it is a regular node, check whether it is a page-sharing candidate.
      • If it is, return the corresponding ksm_page and set the *_stable_node_dup output parameter to the current stable_node.
      • Otherwise, set *_stable_node_dup to NULL and return NULL.
    • If it is a chain node, hand off to stable_node_dup.
/*
 * Like for get_ksm_page, this function can free the *_stable_node and
 * *_stable_node_dup if the returned tree_page is NULL.
 *
 * It can also free and overwrite *_stable_node with the found
 * stable_node_dup if the chain is collapsed (in which case
 * *_stable_node will be equal to *_stable_node_dup like if the chain
 * never existed). It's up to the caller to verify tree_page is not
 * NULL before dereferencing *_stable_node or *_stable_node_dup.
 *
 * *_stable_node_dup is really a second output parameter of this
 * function and will be overwritten in all cases, the caller doesn't
 * need to initialize it.
 */
static struct page *__stable_node_chain(struct stable_node **_stable_node_dup,
                                        struct stable_node **_stable_node,
                                        struct rb_root *root,
                                        bool prune_stale_stable_nodes)
{
    struct stable_node *stable_node = *_stable_node;

    if (!is_stable_node_chain(stable_node)) {
        if (is_page_sharing_candidate(stable_node)) {
            *_stable_node_dup = stable_node;
            return get_ksm_page(stable_node, false);
        }
        /*
         * _stable_node_dup set to NULL means the stable_node
         * reached the ksm_max_page_sharing limit.
         */
        *_stable_node_dup = NULL;
        return NULL;
    }
    return stable_node_dup(_stable_node_dup, _stable_node, root,
                           prune_stale_stable_nodes);
}

The stable_node_dup function

  • Walk the chain node's list:
    • Call get_ksm_page to try to fetch the node's KSM page.
    • If that fails, skip this node. (We are not calling get_ksm_page on a node actually linked into the tree, so there is no need to restart from the root.)
    • If the node is a page-sharing candidate:
      • If no suitable node was found earlier in the walk, or this node's rmap_hlist_len is longer than the previously found one:
        • Record this node as found, its page as tree_page, and update found_rmap_hlist_len.
        • Drop the reference held on the previously found page.
        • If pruning of stale nodes was not requested, end the walk.
    • Drop the reference taken on this iteration's page.
  • Then handle the found node:
    • If pruning was requested and only one node in the whole walk yielded a KSM page, the chain should collapse back into a regular node: replace it in the tree, free the old chain node, and update the statistics. Set *_stable_node to the found node.
    • Otherwise:
      • If the found node is not the first node of the chain and can accept at least one more merge, move it to the head of the chain, to speed up the next lookup in the no-pruning case.
  • Set *_stable_node_dup to found, the candidate with the longest rmap_hlist_len.
static struct page *stable_node_dup(struct stable_node **_stable_node_dup,
                                    struct stable_node **_stable_node,
                                    struct rb_root *root,
                                    bool prune_stale_stable_nodes)
{
    struct stable_node *dup, *found = NULL, *stable_node = *_stable_node;
    struct hlist_node *hlist_safe;
    struct page *_tree_page, *tree_page = NULL;
    int nr = 0;
    int found_rmap_hlist_len;

    if (!prune_stale_stable_nodes ||
        time_before(jiffies, stable_node->chain_prune_time +
                    msecs_to_jiffies(
                        ksm_stable_node_chains_prune_millisecs)))
        prune_stale_stable_nodes = false;
    else
        stable_node->chain_prune_time = jiffies;

    hlist_for_each_entry_safe(dup, hlist_safe,
                              &stable_node->hlist, hlist_dup) {
        cond_resched();
        /*
         * We must walk all stable_node_dup to prune the stale
         * stable nodes during lookup.
         *
         * get_ksm_page can drop the nodes from the
         * stable_node->hlist if they point to freed pages
         * (that's why we do a _safe walk). The "dup"
         * stable_node parameter itself will be freed from
         * under us if it returns NULL.
         */
        _tree_page = get_ksm_page(dup, false);
        if (!_tree_page)
            continue;
        nr += 1;
        if (is_page_sharing_candidate(dup)) {
            if (!found ||
                dup->rmap_hlist_len > found_rmap_hlist_len) {
                if (found)
                    put_page(tree_page);
                found = dup;
                found_rmap_hlist_len = found->rmap_hlist_len;
                tree_page = _tree_page;

                /* skip put_page for found dup */
                if (!prune_stale_stable_nodes)
                    break;
                continue;
            }
        }
        put_page(_tree_page);
    }

    if (found) {
        /*
         * nr is counting all dups in the chain only if
         * prune_stale_stable_nodes is true, otherwise we may
         * break the loop at nr == 1 even if there are
         * multiple entries.
         */
        if (prune_stale_stable_nodes && nr == 1) {
            /*
             * If there's not just one entry it would
             * corrupt memory, better BUG_ON. In KSM
             * context with no lock held it's not even
             * fatal.
             */
            BUG_ON(stable_node->hlist.first->next);

            /*
             * There's just one entry and it is below the
             * deduplication limit so drop the chain.
             */
            rb_replace_node(&stable_node->node, &found->node,
                            root);
            free_stable_node(stable_node);
            ksm_stable_node_chains--;
            ksm_stable_node_dups--;
            /*
             * NOTE: the caller depends on the stable_node
             * to be equal to stable_node_dup if the chain
             * was collapsed.
             */
            *_stable_node = found;
            /*
             * Just for robustness, as stable_node is
             * otherwise left as a stable pointer, the
             * compiler shall optimize it away at build
             * time.
             */
            stable_node = NULL;
        } else if (stable_node->hlist.first != &found->hlist_dup &&
                   __is_page_sharing_candidate(found, 1)) {
            /*
             * If the found stable_node dup can accept one
             * more future merge (in addition to the one
             * that is underway) and is not at the head of
             * the chain, put it there so next search will
             * be quicker in the !prune_stale_stable_nodes
             * case.
             *
             * NOTE: it would be inaccurate to use nr > 1
             * instead of checking the hlist.first pointer
             * directly, because in the
             * prune_stale_stable_nodes case "nr" isn't
             * the position of the found dup in the chain,
             * but the total number of dups in the chain.
             */
            hlist_del(&found->hlist_dup);
            hlist_add_head(&found->hlist_dup,
                           &stable_node->hlist);
        }
    }

    *_stable_node_dup = found;
    return tree_page;
}

stable_tree_insert works in a similar way.

  • Get the root of the stable tree for kpage.
  • Walk down from the root:
    • Call chain to obtain the node this merge could target (stable_node_dup) and its tree_page.
      • If nothing comes back, either the node itself cannot be a candidate for this merge (regular node), or none of the nodes in it can be a candidate any more (chain node). Call stable_node_dup_any to fetch a node with no restrictions: possibly the node itself (regular) or the first node in its chain (chain).
        • If that succeeds, set tree_page to the new node's ksm_page and carry on.
        • If it still fails, restart the walk from the root.
    • If tree_page is NULL, get_ksm_page was just called on a stale node, which removes that node from the stable tree and may rebalance it; restart the walk from the root.
    • At this point we hold a tree_page that may be mergeable; as before, memcmp the contents and pick the walk direction from the result. If the contents are identical, break out of the walk.
  • Allocate a new node and link it with kpage. The walk can have ended in two ways. If it ran off the bottom of the tree, no node on the stable tree can be merged with kpage, so the new node is linked into the stable tree as a regular node. If the walk ended because a mergeable node was found, either pair the new node with the found node to form a chain, or add the new node directly to the found chain's list.
/*
 * stable_tree_insert - insert stable tree node pointing to new ksm page
 * into the stable tree.
 *
 * This function returns the stable tree node just allocated on success,
 * NULL otherwise.
 */
static struct stable_node *stable_tree_insert(struct page *kpage)
{
    int nid;
    unsigned long kpfn;
    struct rb_root *root;
    struct rb_node **new;
    struct rb_node *parent;
    struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
    bool need_chain = false;

    kpfn = page_to_pfn(kpage);
    nid = get_kpfn_nid(kpfn);
    root = root_stable_tree + nid;
again:
    parent = NULL;
    new = &root->rb_node;

    while (*new) {
        struct page *tree_page;
        int ret;

        cond_resched();
        stable_node = rb_entry(*new, struct stable_node, node);
        stable_node_any = NULL;
        tree_page = chain(&stable_node_dup, stable_node, root);
        if (!stable_node_dup) {
            /*
             * Either all stable_node dups were full in
             * this stable_node chain, or this chain was
             * empty and should be rb_erased.
             */
            stable_node_any = stable_node_dup_any(stable_node,
                                                  root);
            if (!stable_node_any) {
                /* rb_erase just run */
                goto again;
            }
            /*
             * Take any of the stable_node dups page of
             * this stable_node chain to let the tree walk
             * continue. All KSM pages belonging to the
             * stable_node dups in a stable_node chain
             * have the same content and they're
             * wrprotected at all times. Any will work
             * fine to continue the walk.
             */
            tree_page = get_ksm_page(stable_node_any, false);
        }
        VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
        if (!tree_page) {
            /*
             * If we walked over a stale stable_node,
             * get_ksm_page() will call rb_erase() and it
             * may rebalance the tree from under us. So
             * restart the search from scratch. Returning
             * NULL would be safe too, but we'd generate
             * false negative insertions just because some
             * stable_node was stale.
             */
            goto again;
        }

        ret = memcmp_pages(kpage, tree_page);
        put_page(tree_page);

        parent = *new;
        if (ret < 0)
            new = &parent->rb_left;
        else if (ret > 0)
            new = &parent->rb_right;
        else {
            need_chain = true;
            break;
        }
    }

    stable_node_dup = alloc_stable_node();
    if (!stable_node_dup)
        return NULL;

    INIT_HLIST_HEAD(&stable_node_dup->hlist);
    stable_node_dup->kpfn = kpfn;
    set_page_stable_node(kpage, stable_node_dup);
    stable_node_dup->rmap_hlist_len = 0;
    DO_NUMA(stable_node_dup->nid = nid);
    if (!need_chain) {
        rb_link_node(&stable_node_dup->node, parent, new);
        rb_insert_color(&stable_node_dup->node, root);
    } else {
        if (!is_stable_node_chain(stable_node)) {
            struct stable_node *orig = stable_node;
            /* chain is missing so create it */
            stable_node = alloc_stable_node_chain(orig, root);
            if (!stable_node) {
                free_stable_node(stable_node_dup);
                return NULL;
            }
        }
        stable_node_chain_add_dup(stable_node_dup, stable_node);
    }

    return stable_node_dup;
}