Notes---Operating Systems: Three Easy Piece---Virtualization（UW-Madison CS 537）

UW-Madison 的 system 其实非常强，CS 537 是讲 operation system 的一节课，“Operating Systems: Three Easy Piece” 是这门课的教材，这篇 Blog 就主要记载我阅读这本 textbook 的 reading notes。

Chapter 2: Introduction to Operating Systems

The main goal of an operating system is just to make the system easy to use.

Abstraction 在计算机领域一直是一个非常重要的概念，正是一层一层的 abstraction，我们才能制造出当今如此强大的计算机。

在 OS 里，我们当然也用了这样的一个强大的概念，virtualization 就是一个非常重要的具体体现。我们会把物理层面上的各种 resources 进行 virtualization，来制造一个更通用的 interface 来给用户去使用，同时也方便 OS 在用户看不见的地方进行各种 “wild optimization”。上面提到的，OS virtualizes 出来的 interface，其实就是我们常说的 standard library，里面的 function 就是 system calls。

The model of physical memory presented by modern machines is very simple. Memory is just an array of bytes; to read memory, one must specify an address to be able to access the data stored there; to write (or update) memory, one must also specify the data to be written to the given address.

PID: process identifier

OS will virtualize the memory for each running process.

Each process accesses its own private virtual address space (sometimes just called its address space), which the OS somehow maps onto the physical memory of the machine.

在关于 memory 的 virtualization 的部分，你会看到两个不同的 process 在同时 allocate 了同一块 memory，非常的神奇。那么为什么我们能看到这样的现象呢？

Virtual Address Space: 每个进程（Process）都运行在自己独立的、私有的虚拟地址空间 (Virtual Address Space) 中。进程看到的地址（如 C 语言中的指针值）是虚拟地址，不是真实的物理内存地址。
Process Isolation: 进程 A 的虚拟地址 0x200000 和进程 B 的虚拟地址 0x200000 是完全隔离的。它们只是各自“地图”上的同一个“门牌号”。
Address Translation: 操作系统通过页表 (Page Tables) 将这些相同的虚拟地址映射 (map) 到不同的物理内存页帧 (Physical Page Frames) 上。
- Process A [Virtual 0x200000] -> Physical RAM [Address 0x1A000]
- Process B [Virtual 0x200000] -> Physical RAM [Address 0x2B000]
OSTEP’s Goal: OSTEP 书中的例子在一个确定性 (deterministic) 的环境中运行，目的是为了清晰地展示虚拟内存和进程隔离这一核心概念。

但是如果你尝试运行 mem.c 这个程序，你大概率得不到和书中一样的答案，这是因为我们现在的电脑都使用了地址空间布局随机化 (Address Space Layout Randomization, ASLR) 这项技术。

ASLR (Address Space Layout Randomization): 现代操作系统（Linux, macOS, Windows）默认开启的一项安全功能。
Purpose: 为了防止内存攻击（如缓冲区溢出）。如果攻击者不知道内存的确切地址，就很难成功实施攻击。
How it Works: ASLR 在每次程序运行时，都会随机化其关键内存区域的基地址，主要包括：
- 堆 (Heap): malloc 分配内存的地方。
- 栈 (Stack): 局部变量存储的地方。
- 动态库 (Libraries): 如 libc。
Result: 因为 ASLR 的存在，你每次运行 ./mem，它的堆都会被放置在一个新的、随机的虚拟地址上。因此，malloc 会从这个随机的基地址开始分配，导致你每次看到的指针地址都不同。

还有一个比较重要的东西是 concurrency。书里用了一个非常神奇的 example 来说明 concurrency 有多么 tricky。

(base) (532) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 1
Initial value : 0
Final value   : 2
(base) (533) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 10
Initial value : 0
Final value   : 20
(base) (534) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 100
Initial value : 0
Final value   : 200
(base) (535) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 1000
Initial value : 0
Final value   : 2000
(base) (536) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 10000
Initial value : 0
Final value   : 20000
(base) (537) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 100000
Initial value : 0
Final value   : 176306
(base) (538) zhengs@vm-instunix-11:~/cs537/ostep-code/intro$ ./threads 1000000
Initial value : 0
Final value   : 1445732

上面是我跑的结果。

除了 virtualization 和 concurrency，最后一个主要的主题就是 persistence。按我的理解，这个 topic 就是关于系统的容错性或者稳定性吧。涉及数据是怎么存储，我们怎么做 IO，file system 之类的问题。

SSDs: solid-state drives

The software in the operating system that usually manages the disk is called the file system

As anyone who has written a device driver knows, getting a device to do something on your behalf is an intricate and detailed process.

这句话实在是太好玩了，让我想起来了“经常干 xxx 的朋友知道 xxx” 这个句式。

So now you have some idea of what an OS actually does: it takes physical resources, such as a CPU, memory, or disk, and virtualizes them. It handles tough and tricky issues related to concurrency. And it stores files persistently, thus making them safe over the long-term.

One goal in designing and implementing an operating system is to provide high performance; another way to say this is our goal is to minimize the overheads of the OS.

Chapter 4: The Abstraction: The Process

In this chapter, we discuss one of the most fundamental abstractions that the OS provides to users: the process.

Process: it is a running program.

Time sharing of the CPU: Running one process, then stopping it and running another, and so forth.

To implement virtualization of the CPU, and to implement it well, the OS will need both some low-level machinery (mechanisms) and some high-level intelligence (policies).

The states of a process.

Being moved from ready to running means the process has been scheduled; being moved from running to ready means the process has been descheduled.

Chapter 5: Interlude: Process API

这个 chapter 主要就是讲了几个在 OS 里比较重要的函数：fork()，wait() and exec()。

我在 go through p2.c 这个 code 的时候产生了一点小疑问，这个 wait() function 究竟是怎么起效果的，下面就是对这个疑问的解答。

1. Core Question: How does `wait()` know when to stop waiting?

A common misconception is that wait() knows the total number of child processes. This is incorrect.

The wait() system call’s behavior is simple: It waits for *any single* child process to terminate.

The operating system kernel manages this entire process using its internal data structures.

2. Kernel’s Role: The Process Table

The kernel maintains a Process Table where every process has an entry.
Each entry contains crucial information, including:
- PID (Process ID)
- PPID (Parent Process ID)
- A list of its children’s PIDs. This “family tree” information is the key.

3. The `fork()` and `wait()` Workflow

The process unfolds in a sequence of system calls and state changes managed by the kernel:

fork() is Called:
- A new process (child) is created.
- The kernel updates the process table:
  - Sets the child’s PPID to the parent’s PID.
  - Adds the new child’s PID to the parent’s list of children.
Parent Calls wait():
- The parent process enters the kernel.
- The kernel checks the parent’s list of children:
  - Case A: A child has already terminated (is a “zombie”). wait() immediately cleans up the zombie process and returns its PID. The parent continues without blocking.
  - Case B: No children have terminated. The kernel changes the parent’s state to BLOCKED (or WAITING). The parent process is now suspended and consumes no CPU.
Child Process Terminates:
- When the child process finishes its execution, it calls exit().
- The kernel releases most of the child’s resources but keeps its process table entry, changing its state to ZOMBIE.
- Crucial Step: The kernel sends a signal (specifically SIGCHLD) to the parent process.
Parent is Awakened:
- The signal from the kernel wakes up the BLOCKED parent process, changing its state back to READY.
- When the scheduler runs the parent again, its wait() call can now complete.
- wait() “reaps” the zombie child (cleans up its final process table entry) and returns the child’s PID. The parent’s code execution continues from the line after wait().

4. The Purpose of `sleep()` in Example Code

The sleep(seconds) function in child process code is a teaching tool.
Without sleep(): The child might finish so quickly that the parent’s wait() call finds a zombie immediately and returns without ever visibly blocking. This makes it hard to observe the waiting mechanism.
With sleep(): It artificially extends the child’s lifetime, forcing the parent’s wait() call to enter a BLOCKED state for a noticeable duration. This makes the concept of “waiting for a child to complete” tangible and easy to observe.

我们之所以需要一个 fork() 去创建一个 child process，有需要一个 exec() 去执行一个不同的 command，是因为我们想要在这两个事情中间做一些事情。比如重定向 input 和 output。

不过我们可能还是对 exec() 函数的细节又一些疑惑，比如它跟 fork() 之间的关系之类的问题，以下 notes 做了比较详细的解释。

Operating Systems Notes: The `fork()` and `exec()` Mechanism

1. 核心区别：`fork()` vs. `exec()`

这是 Unix/Linux 系统中创建和运行新程序的基础。

fork() - 克隆进程 (影分身)
- 作用: 创建一个与当前进程几乎完全相同的副本（子进程）。
- 结果:
  - 产生一个新的进程，拥有唯一的 PID。
  - 父子进程拥有各自独立的内存空间（通过写时复制技术优化）。
  - fork() 之后，两个进程都从 fork() 调用之后的地方继续执行相同的代码。
  - 通过 fork() 的返回值来区分父子进程（父进程中返回子进程PID，子进程中返回0）。
exec() - 替换程序 (灵魂附体/换房客)
- 作用: 用一个全新的程序替换当前进程的内存映像。
- 结果:
  - 不创建新进程，进程的 PID 保持不变。
  - 当前程序的代码、数据、栈和堆被完全丢弃，加载新程序的相应部分。
  - 如果调用成功，exec() 永远不会返回，因为调用它的代码已经不复存在了。

2. 黄金组合：`fork() + exec()` 模式

这是 Shell 等程序用来执行外部命令的标准方法。

执行 ls > output.txt 的流程:

fork(): Shell 调用 fork() 创建一个子进程。
wait(): 父进程（Shell 本身）调用 wait() 等待子进程结束。
重定向 (Redirection): 在子进程中，在调用 exec 之前，通过 close() 和 open() / dup2() 等系统调用修改自己的文件描述符表，将标准输出（文件描述符 1）指向 output.txt 文件。
exec(): 子进程调用 exec() 来执行 ls 程序。ls 的代码替换了子进程的代码，但继承了那个已经被修改过的文件描述符表。
结果: ls 程序正常向标准输出打印，但数据被内核根据文件描述符表重定向到了 output.txt 文件。

3. `exec()` 的关键特性

A. exec() 替换什么？(换房客)

exec 替换的是程序层面的东西，即进程内存中的内容：

代码段 (Code Segment)
数据段 (Data Segment)
堆 (Heap)
栈 (Stack)

B. exec() 保留什么？(公寓设施)

exec 保留的是进程层面的状态，即在内核中的“身份”和“配置”：

进程ID (PID) 和父进程ID (PPID)
文件描述符表 (因此 I/O 重定向有效)
环境变量 (默认情况下完全继承)
用户和组ID (UID, GID)
当前工作目录

C. 为什么 exec() 成功后不返回？

根本原因: 调用 exec() 的那段代码以及记录该调用的栈帧 (Stack Frame)，在 exec 成功时被彻底抹除了。
类比: 就像你脚下的地板突然消失了，你无法再“走回去”。
与进程退出码的区别: exec() 函数本身不返回，但被它加载的新程序在执行结束后，其 main 函数的返回值会成为整个进程的最终退出状态码 (exit status)。从这个角度看，新程序的返回值“顶替”了旧程序的。

4. 总结速查表

特性	`fork()`	`exec()`
核心作用	创建当前进程的副本	用新程序替换当前进程
是否创建新进程？	是	否
进程ID (PID)	子进程获得一个新的PID	PID 保持不变
执行的代码	父子进程继续执行同样的代码	切换到执行一个全新的程序
返回行为	成功时返回两次	成功时永不返回

Chapter 6: Mechanism: Limited Direct Execution

我们只有一个 CPU，但是我们可能想要运行很多个程序，这就需要我们有一些技巧让我们能 virtualize 出来很多个 CPU 来用。

第一个想法就是，啥都不想，就直接跑这个 process，也就是 Direct Execution。

这个当然是有问题的，比如说怎么样 switch，我们怎么预防这个程序做一些我们不想让他做的事情之类的。

解决办法就是通过 user mode 和 kernel mode。正常程序是在 user mode，而当程序想要执行一些 privileged operations 的时候，这个程序就会用 system call，切换成 kernel mode 来执行。这样用户只能通过 kernel 开放出来的这些接口（API）来实现自己想要的功能，而不是随便使用。

To execute a system call, a program must execute a special trap instruction

这样的 user mode 与 kernel mode 的逻辑其实就是 Limited Direct Execution。

To specify the exact system call, a system-call number is usually as signed to each system call.

trap table 是一个非常重要的东西，它记载了当系统接收到一个 system call 的时候，对应的函数在哪里。

如果用户可以把 trap table（中断陷阱表）换成自己的，那么整个系统的安全防线将彻底崩溃，攻击者可以完全、彻底地接管这台机器。

这绝对是操作系统能想象到的最可怕的安全漏洞之一。下面我们来具体分析会发生什么“可怕的事情”。

为什么 Trap Table 如此关键？

首先要理解 Trap Table 的角色。它就像是用户世界（User Mode）通往操作系统内核世界（Kernel Mode）的唯一、受控的桥梁。无论是系统调用、硬件中断（如键盘输入、磁盘读写完成），还是程序错误（如除以零、缺页中断），所有需要内核介入的事件都必须通过查询这个表，来找到并跳转到正确的、受信任的内核代码去处理。

这个表本质上是一个地址-指针数组，指向了所有特权操作的函数入口。控制了这张表，就等于控制了通往内核世界的所有道路。

具体的“可怕”攻击手段

如果一个攻击者可以安装自己的 Trap Table，他就能成为这个“桥梁”的收费员，可以为所欲为。

1. 劫持任意系统调用 (Hijacking System Calls)

这是最直接的攻击。攻击者可以把 Trap Table 中任何一个系统调用的地址换成他自己的恶意代码地址。

窃取信息: 比如，把 read() 或 write() 系统调用的地址换掉。那么当系统上任何一个程序（比如你的浏览器或密码管理器）调用 read() 读取文件或网络数据时，都会先跳转到攻击者的代码。攻击者的代码可以**复制一份数据（比如你的密码）**到别处，然后再去调用真正的内核 read() 函数来完成正常操作。整个过程对原始程序来说是透明的，但所有信息都已被窃取。
篡改行为: 比如，把 open() 系统调用的地址换掉。当一个程序尝试打开 “my_document.txt” 时，攻击者的代码可以偷偷将其改成打开并修改系统文件，如 /etc/passwd，然后再返回一个看起来正常的文件描述符。
创建后门: 比如，把 exec() 系统调用的地址换掉。当管理员执行 ls 时，攻击者的代码可以先执行一个恶意程序来开启网络后门，然后再执行真正的 ls。

2. 绕过所有权限检查，直接提权为 Root

这是最致命的攻击。在正常的系统中，一个普通用户想变成超级用户（root），需要通过 sudo 等程序，最终调用 setuid(0) 这样的系统调用。内核中的 setuid() 函数会进行严格的权限检查。

但如果攻击者控制了 Trap Table：

他可以把 setuid() 系统调用的地址指向他自己的一段代码。
这段恶意代码完全跳过任何权限检查。
它直接修改当前进程的用户 ID 为 0 (root)，然后返回成功。

结果就是，系统上任何一个普通程序，只要调用一下 setuid(0)，就能立刻、无条件地变成 root 权限，从而完全控制整个系统。系统的整个权限模型瞬间失效。

3. 制造硬件层面的木马

Trap Table 不仅处理系统调用，还处理硬件中断。

键盘记录器: 攻击者可以替换掉键盘中断的处理程序。他自己的代码会记录下每一次按键（包括你输入的所有密码），然后再调用原始的内核处理程序，让你感觉不到任何异常。
网络嗅探器: 攻击者可以替换掉网卡中断的处理程序，从而在数据包被操作系统处理之前就捕获、分析或篡改它们。

一个比喻：皇宫的卫兵

可以把操作系统内核想象成一座戒备森严的皇宫，里面住着皇帝（CPU最核心的功能）。

Trap Table 就是皇宫大门口的一队皇家卫兵。
平民（用户程序）要向皇帝请求办事（系统调用），都必须先通过门口的卫兵。请求“食物”，卫兵会带你去御膳房；请求“武器”，卫兵会带你去兵器库。每个卫兵都绝对忠诚，并严格遵守流程。

如果攻击者可以用自己的冒牌货换掉所有的皇家卫兵，那么：

当平民请求“食物”时，冒牌卫兵可以在食物里下毒，再交给御膳房。
当平民请求见“兵器库主管”时，冒牌卫兵可以直接把主管的钥匙和官服给你，让你当场“提权”成主管。

皇宫本身的设计再安全也没用了，因为入口完全失守了。

总而言之，允许用户修改 Trap Table，相当于把整个操作系统的“地基”和“大门钥匙”全部交给了用户，系统安全将不复存在。这就是为什么设置 Trap Table 的指令必须是最高级别的特权指令，只能由操作系统内核自己来执行。

现在我们要讨论的问题是如何在不同的 process 之间 switch。

Scheduler will make the decision that whether to continue running the currently-running process, or switch to a different one.

不管我用什么 policy 去决定当前 run 哪个 process，我都有一个非常重要的问题要解决，那就是怎么样 return control to the OS。因为我们现在假设的是我们只有一个 CPU，所以当一个 user program 正在运行的时候，OS 是不在运行的，那我们怎么样才能让 OS 运行呢。我们一般来讲有两种方法：1. Cooperative Approach；2. Non-Cooperative Approach。

所谓的 cooperative approach 就是说我们假设每一个 process 都很 friendly。他们会自己定期通过 system call 把 control 交还给 OS。

但是这确实是非常不保险，而且有的时候不是这个 programmer 故意的，而是出现了一些 bugs，比如 infinite loop，这个时候 OS 就永远没办法运行了。

所以我们有了 non-cooperative approach，在这个方法里我们用一个 timer interrupt，固定时间让 OS 运行。下图就是一个例子。

Chapter 7: Scheduling: Introduction

Scheduling metric

turnaround time：

$T_{turnaround} = T_{completion} - T_{arrival}$

response time:

We define response time as the time from when the job arrives in a system to the first time it is scheduled. 其实也有其他的定义，我们这里其实简化了，假设了 job 一被 schedule 就会产生 response。

$T_{response} = T_{firstrun} − T_{arrival}$

Example:

If we had the schedule from Figure 7.5 (with A arriving at time 0, and B and C at time 10), the response time of each job is as follows: 0 for job A, 0 for B, and 10 for C (average: 3.33).

Policy

First In, First Out (FIFO)

如果每一个 job 用的时间差不多，这个 policy 是挺好的，我们的 average turnaround time 基本上就是 optimal，但是如果有的 job 用的时间非常久，有的 job 用的时间很短，那用时短的 job 就可能要等，我们的 average turnaround time 就可能会很久。这个问题其实就是 convoy effect , where a number of relatively-short potential consumers of a resource get queued behind a heavyweight resource consumer.

Shortest Job First

既然 FIFO 可能让短的等太久，我们很自然就会想到让短的先运行，就有了这个 SJF policy。

Given the assumptions about jobs all arriving at the same time, we could prove that SJF is indeed an optimal scheduling algorithm.

但是 SJF 还是有很大的问题，问题就出在我们的 assumption 上，毕竟所有的 job 同时来这种事情在现实中不太可能发生，那么如果一个 heavy job 来的早了，那么也没有其他更短的 job 可以选，这个运行时间很长 job 就会占据 CPU，就会使得我们的 average turnaround time 变得很大。

Shortest Time-to-Completion First (STCF)

SJF 是一个 non-preemptive scheduler，也就是说一旦开始一个任务，这个任务就会运行完，所以也受制于这个特性。不过我们完全可以做到不完成一个 job，在某个时刻暂停这个 job，然后去做另外的 job。这也就是说我们可以加一个 feature，使得他可以 preempt 当前正在运行的 job，然后在决定之后运行什么 job。

We can add preemption to SJF, known as the Shortest Time-to-Completion First (STCF) or

Preemptive Shortest Job First (PSJF) scheduler.

在我们的新 assumption 下（job 不同时来），以 average turnaround time 作为 metric，STCF 被证明是一个 optimal schedule。Thus, if we knew job lengths, and that jobs only used the CPU, and our only metric was turnaround time, STCF would be a great policy.

Round Robin (RR)

我们现在考虑一个新的 metric：response time。我们就需要考虑一个新的问题：how can we build a scheduler that is sensitive to response time？

我们引入 RR 来解决这个问题。

The basic idea is simple: instead of running jobs to completion, RR runs a job for a time slice (sometimes called a scheduling quantum) and then switches to the next job in the run queue.

RR is sometimes called time-slicing

Note that the length of a time slice must be a multiple of the timer-interrupt period; thus if the timer interrupts every 10 milliseconds, the time slice could be 10, 20, or any other multiple of 10 ms.

Assume three jobs A, B, and C arrive at the same time in the system, and that they each wish to run for 5 seconds. RR with a time-slice of 1 second will schedule these jobs as Figure 7.7.

RR 的 response time 非常低，但是 RR 的 average turnaround time 会非常高。

More generally, any policy (such as RR) that is fair, i.e., that evenly divides the CPU among active processes on a small time scale, will perform poorly on metrics such as turnaround time. Indeed, this is an inherent trade-off: if you are willing to be unfair, you can run shorter jobs to completion, but at the cost of response time; if you instead value fairness,response time is lowered, but at the cost of turnaround time.

You can’t have your cake and eat it too. 有得必有失。

RR 的核心是 FIFO，他的 running queue 是按照 FIFO 排列的。

Chapter 8: Scheduling: The Multi-Level Feedback Queue

这一章我们介绍 Multi-Level Feedback Queue（MLFQ）。

In our treatment, the MLFQ has a number of distinct queues, each assigned a different priority level. At any given time, a job that is ready to run is on a single queue. MLFQ uses priorities to decide which job should run at a given time: a job with higher priority (i.e., a job on a higher queue) is chosen to run.

We have at the first two basic rules for MLFQ:

Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).
Rule 2: If Priority(A) = Priority(B), A & B run in RR.

如何设定每一个 job priority 就很重要了，在 MLFQ 里 priority 是根据各个 job 的历史行为动态设定的。

总的来说，如果一个任务不经常使用 CPU，比如经常进行一些 I/O 操作，那么这个任务会被 assign 一个比较高的 priority；反之，如果一个任务长时间使用 CPU，这个任务就会被 assign 一个低的 priority。

咋听起来这个 policy 不是很直观，但其实很有道理。我们想的是想要 CPU 一直跑，不要空闲，但是像这种 I/O 操作多的任务就会经常性的把 CPU 空出来，这个时候我们就需要一些任务能够把这些空给填上，而这些填空的任务最好是一直用 CPU 的那种任务。

一个 MLFQ 的 snapshot 可能会像 FIgure 8.1 一样。

allotment: The allotment is the amount of time a job can spend at a given priority level before the scheduler reduces its priority.

Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
Rule 4a: If a job uses up its allotment while running, its priority is reduced (i.e., it moves down one queue).
Rule 4b: If a job gives up the CPU (for example, by performing an I/O operation) before the allotment is up, it stays at the same priority level (i.e., its allotment is reset).

Figure 8.2 就是一个 long-running job 在 MLFQ 里的表现。

Figure 8.3 给了一个更复杂一点的例子，两个 job 同时在 queue 里。右边的图告诉我们 rule 4b 是怎么体现的。

关于把 I/O intensive 的 job 放在 high priority 的位置，除了我刚才的解释，还有另外一种解释，就是说 I/O intensive 的 job 可能是 interactive job，我们希望 interactive job 的 response time 越低越好，毕竟不想用户去等。

当前的 scheduler 还是有些问题，比如如果 interactive job 太多的话，long-running job 就永远不会被运行了，这种情况叫 starvation；或者如果有一些比较离谱的程序员，想要 game the scheduler，想要利用我们的 rules 去尽可能多的占用更多的 CPU 时间。

所以我们需要一个新的 rule，主要做的事情就是 boost。

Rule 5: After some time period S, move all the jobs in the system to the topmost queue.

当然也有一些别的方法，比如我限制一个 job 保持不掉 priority 的机会只有 5 次，在它用过 5 次 time slice 之后不管怎么样都降他的 priority。

当然我们也可以计算时间，而不是每次只要没用完，下次就重新算，根据这个，我们可以把我们的 rule 4 重写一下。

Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).

最后，我们的 MLFQ 会有一下的 rule：

• Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).

• Rule 2: If Priority(A) = Priority(B), A & B run in round-robin fashion using the time slice (quantum length) of the given queue.

• Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).

• Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).

• Rule 5: After some time period S, move all the jobs in the system to the topmost queue.

这章节主要讲的就是 proportional-share scheduler 或者说是 fair-share scheduler。他的目的不是为了最小化 turnaround time 或者 response time，而是为了让每一个 job 都有一定的运行 cpu 的时间。

一个比较核心的概念是 tickets，tickets 代表了你占据多少的 share。比如说如果总共有 1000 个 ticket，你这个 process 有 200 个，那么你就应该占据 20% 的 cpu 时间。

关于这个 ticket，也有很多围绕它的 mechanisms。比如 ticket currency，ticket transfer 和 ticket inflation。

Ticket currency 主要是为了操作简单，大概就是 OS 会给每一个用户分配 global tickets，但是每一个用户又可以给它自己的 process 分配自己的 local ticket，然后最后 OS 会把这些 local tickets 变成 global tickets，然后再做比较。

Ticket transfer 则是说一个 process 可以暂时把自己的 ticket 给另外一个 process。

Ticket inflation 则是说在一些情况下，一个 process 可以暂时提高或者降低自己的 tickets。

在本章我们还提出一个新的指标：fairness metric。大致概念就是衡量我们的 scheduler 够不够公平。它具体的计算方法就是在两个 process 同时到达且 runtime 相同的前提下，计算他们两个完成时间的比值。其实我个人感觉应该是 turnaround time 的比值，当然可能这个比较也默认了就是他们两个到达时间都是 0。在这个 setting 下，我们的 fairness metric F 就是：

$F = \frac{\min(T_A, T_B)}{\max(T_A, T_B)},$

其中 $T_A$ 是 process A 的完成时间， $T_B$ 是 process B 的完成时间。

Lottery Scheduling

这个就是说每次都跟抽奖一样，如果哪个 process 有这个抽奖号码，这个 process 就运行。比如说现在一共十个 ticket：0，1，2，3，4，5，6，7，8，9。两个 Process：A，B。假如 A 有 0，1，2，3，4，5，6；B 有 7，8，9。然后我们现在有一个抽奖机（random number generator），抽奖范围是 [0, 9]，然后抽出来的是 5，那么就 A 去运行，以此类推。

Stride Scheduling

不同于 lottery scheduling，利用了 randomness，stride scheduling 是一个 deterministic fair-share scheduler。

在 stride scheduling 里，我们有两个 state，一个是 pass value，一个是 stride。我们每次会选择 pass value，最小的那个，等运行完之后，在把 stride 加到当前的这个 pass value 上。stride 的这个值，是它 ticket 占比的反比，这么说有点 confusing，用数学公式表达就是

$stride_i = \frac{N}{ticket_i},$

其中 N 是总 ticket 数， $ticket_i$ 是 process i 的 ticket 数。

Completely Fair Scheduler (CFS)

在 CFS 里，有一个新的，很重要的概念，virtual runtime (vruntime)。CFS 每次 schedule 的时候，会把 cpu assign 给 virtual runtime 最低的 process。

在 CFS 里，time slice 也是变化的。CFS 的 time slice 跟 sched latency 和当前正在 running 的 process 数量 n 直接相关。一般而言，我们有如下公式：

$\text{time}\_\text{slice}=\frac{\text{sched}\_\text{latency}}{n},$

不过我们的 time slice 是有下限的，这个下限是 min_granularity。所以实际上的公式是

$\text{time}\_\text{slice}=\min(\frac{\text{sched}\_\text{latency}}{n},\text{min}\_\text{granularity}).$

在 CFS 里，我们能够让一些任务获得更高的 priority，能够获得更多的 cpu time，于是 CFS 引入了 nice level 这个概念。本质的意思就是，每次我们 update virtual runtime 的时候，不是拿真正的 runtime 往上加，而是把 runtime 乘上一个系数，然后再往上加。我们更新 virtual runtime 的公式是：

$\text{vruntime}_i = \text{vruntime}_i + \frac{\text{weight}_0}{\text{weight}_i}\cdot \text{runtime}_i$

nice level 一共有 [-20, 19]，也就是 40 个级别。越 nice，weight 越小，越不容易被分配到 cpu 时间。

我们还有另一个概念 effective time slice，这个概念的意思是我们在分 time slice 的时候不是简单的直接除以 n，而是也按照 weight 去划分：

$\text{time}\_\text{slice}_k = \frac{\text{weight}_k}{\sum_{i = 0}^{n-1}\text{weight}_i}\cdot \text{sched}\_\text{latency}$

这个概念看起来没什么问题，但是我感觉好像跟 virtual runtime 加 weighted real runtime 的作用有点重复了，所以我也不太理解为什么。