 05158082f6
			Backport a preliminary version of Yu Zhao's multi-generational LRU, for improved memory management. Refresh the patches while at it. Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com>
162 lines · 6.3 KiB · Diff
From f59c618ed70a1e48accc4cad91a200966f2569c9 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Tue, 2 Feb 2021 01:27:45 -0700
Subject: [PATCH 10/10] mm: multigenerational lru: documentation

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Change-Id: I1902178bcbb5adfa0a748c4d284a6456059bdd7e
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 132 ++++++++++++++++++++++++++++++
 2 files changed, 133 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory man
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+ default.
+
+Runtime Configurations
+----------------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+ feature was not turned on by default.
+
+:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
+ protect the working set of ``N`` milliseconds. The OOM killer is
+ invoked if this working set cannot be kept in memory.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
+ is turned on. This file has the following output:
+
+::
+
+  memcg  memcg_id  memcg_path
+    node  node_id
+      min_gen  birth_time  anon_size  file_size
+      ...
+      max_gen  birth_time  anon_size  file_size
+
+``min_gen`` is the oldest generation number and ``max_gen`` is the
+youngest generation number. ``birth_time`` is in milliseconds.
+``anon_size`` and ``file_size`` are in pages.
+
+Phones/Laptops/Workstations
+---------------------------
+No additional configuration is required.
+
+Servers/Data Centers
+--------------------
+:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
+ larger number.
+
+:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
+ number.
+
+:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+ [swappiness] [use_bloom_filter]`` to ``/sys/kernel/debug/lru_gen`` to
+ invoke the aging, which scans PTEs for accessed pages and then
+ creates the next generation ``max_gen+1``. A swap file and a non-zero
+ ``swappiness``, which overrides ``vm.swappiness``, are required to
+ scan PTEs mapping anon pages. Set ``use_bloom_filter`` to 0 to
+ override the default behavior, which only scans PTE tables found
+ populated.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
+ [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+ eviction, which evicts generations less than or equal to ``min_gen``.
+ ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+ ``max_gen-1`` are not fully aged and therefore cannot be evicted.
+ Use ``nr_to_reclaim`` to limit the number of pages to evict. Multiple
+ command lines are supported, as is concatenation with delimiters
+ ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap and writeback
+constraints. These three variables are monotonically increasing.
+Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists ``lrugen->lists``.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses from file descriptors only.
+Pages accessed ``N`` times via file descriptors belong to tier
+``order_base_2(N)``. Each generation contains at most
+``CONFIG_TIERS_PER_GEN`` tiers, and they require an additional
+``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``. In contrast to
+moving between generations which requires list operations, moving
+between tiers only involves operations on ``page->flags`` and
+therefore has a negligible cost. A feedback loop modeled after the PID
+controller monitors refaulted % across all tiers and decides when to
+protect pages from which tiers.
+
+The framework comprises two conceptually independent components: the
+aging and the eviction, which can be invoked separately from user
+space for the purpose of working set estimation and proactive reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
+to scan PTEs for accessed pages (a ``mm_struct`` list is maintained
+for each ``memcg``). Upon finding one, the aging updates its
+generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
+After each round of traversal, the aging increments ``max_seq``. The
+aging is due when ``min_seq[]`` reaches ``max_seq-1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans pages on the per-zone lists indexed by anon and file
+``min_seq[]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
+select a type based on the values of ``min_seq[]``. If they are
+equal, it selects the type that has a lower refaulted %. The eviction
+sorts a page according to its updated generation number if the aging
+has found this page accessed. It also moves a page to the next
+generation if this page is from an upper tier that has a higher
+refaulted %. than the base tier. The eviction increments ``min_seq[]``
+of a selected type when it finds all the per-zone lists indexed by
+``min_seq[]`` of this selected type are empty.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
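
Reviewer note, outside the patch hunks: the debugfs command formats described under "Working set estimation" and "Proactive reclaim" can be composed programmatically. The following is a minimal Python sketch, not part of the patch. The helper names (`aging_command`, `eviction_command`, `submit`) are hypothetical, and the `max_gen` argument of `eviction_command` is an assumption: the caller is expected to have read it from `/sys/kernel/debug/lru_gen` beforehand, so that the `min_gen < max_gen-1` rule from the documentation can be checked before anything is written.

```python
# Sketch only: composes the command strings documented in multigen_lru.rst.
# Helper names are hypothetical; field order follows the documentation text.

LRU_GEN_DEBUGFS = "/sys/kernel/debug/lru_gen"  # path named in the documentation


def _build(op, required, optional):
    """Join fields with spaces; optional fields are positional."""
    fields = [op] + [str(v) for v in required]
    seen_gap = False
    for value in optional:
        if value is None:
            seen_gap = True
        elif seen_gap:
            # e.g. use_bloom_filter cannot be given without swappiness
            raise ValueError("optional fields are positional; set earlier ones first")
        else:
            fields.append(str(value))
    return " ".join(fields)


def aging_command(memcg_id, node_id, max_gen, swappiness=None, use_bloom_filter=None):
    """'+ memcg_id node_id max_gen [swappiness] [use_bloom_filter]'"""
    return _build("+", (memcg_id, node_id, max_gen), (swappiness, use_bloom_filter))


def eviction_command(memcg_id, node_id, min_gen, max_gen,
                     swappiness=None, nr_to_reclaim=None):
    """'- memcg_id node_id min_gen [swappiness] [nr_to_reclaim]'

    max_gen is not part of the command; it is only used to enforce the
    documented rule that min_gen should be less than max_gen-1.
    """
    if min_gen >= max_gen - 1:
        raise ValueError("min_gen should be less than max_gen-1")
    return _build("-", (memcg_id, node_id, min_gen), (swappiness, nr_to_reclaim))


def submit(commands, path=LRU_GEN_DEBUGFS):
    """Write several commands at once, joined with ';' as the text allows.

    Assumes debugfs is mounted and the caller has sufficient privileges.
    """
    with open(path, "w") as f:
        f.write(";".join(commands))


if __name__ == "__main__":
    print(aging_command(1, 0, 7, swappiness=100, use_bloom_filter=0))
    print(eviction_command(1, 0, 3, max_gen=7, swappiness=0, nr_to_reclaim=1024))
```

Keeping command construction separate from `submit` lets the strings be inspected or logged without touching debugfs. The memcg and node IDs used here are placeholders; real values come from the `memcg` and `node` lines of the debugfs output.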