Notes---Operating Systems: Three Easy Piece---Concurrency（UW-Madison CS 537）

这个 notes 主要用来记录有关 “Operating Systems: Three Easy Piece” 这本书第二大块的内容。就像书名写的，作者把 Operating system 的内容分为三大块，现在我们已经进行到了第二块：Concurrency。这一块的思想其实某种意义上是最重要的，因为 concurrency 的思想在非常多的地方都有用到，但是这一部分却是内容最少的，我还记得老师当时说的时候还挺惋惜的。

Chapter 26: Concurrency: An Introduction

关于 concurrency，第一个要介绍的概念就是 thread。之前我们说到的 program 其实都是 single-threaded program，但是现在我们就要转向 multi-threaded program。如果用一个非常浅显但是不准确的语言去描述 thread，其实就是同一个程序“同时”干好几个事情。从这个实现的功能上来讲，似乎跟多进程差不多，但有一些很底层上的区别，同时带来了很多根本上的不一样的 feature。

我们在切换 thread 的时候，也有 context switch，但是这个是 context1switch between threads。我们之前管理很多的 process 的时候，我们有 process control block (PCB)，现在到了 multi-treaded 这里，我们有 thread control blocks (TCBs)。

关于 thread，最根本的不同我觉得已经在上面这个图里体现了，thread 是一个 process 的一部分，我们之前讲过，我们的 OS 会给每一个 process 虚拟化出来一个这个 process 自己的 VM，那么既然 threads 是在一个 process 里面的，那么他也就会用这个 process 的 VM。

那么我们是如何实现我们好像可以同时干不同的事情的呢，实现方法就是每个 thread 会有自己的 stack，但是他们会共享除了 stack 之外的其他部分，比如说这个 code section，heap section。

Thus, any stack-allocated variables, parameters, return values, and other things that we put on the stack will be placed in what is sometimes called thread-local storage, i.e., the stack of the relevant thread.

但是我们为什么需要 threads 呢？

第一点书中说的是 parallelization。

The task of transforming your standard single-threaded program into a program that does this sort of work on multiple CPUs is called parallelization, and using a thread per CPU to do this work is a natural and typical way to make programs run faster on modern hardware.

第二点是防止一直在等 I/O。

Threading enables overlap of I/O with other activities within a single program, much like multiprogramming did for processes across programs

我看到这里的时候，有一个疑问，就是这些 multiprocess 不都能干，为什么我们还需要 threads，书里也有解释：

Of course, in either of the cases mentioned above, you could use multiple processes instead of threads. However, threads share an address space and thus make it easy to share data, and hence are a natural choice when constructing these types of programs. Processes are a more sound choice for logically separate tasks where little sharing of data structures in memory is needed.

在我们实际写 multi-threaded program 的时候就会发现，一个 thread 其实有点像一个 function。它和 function 的区别就在于，我们 call 一个 function 之后，我们会直接执行这个 function 里的内容，然后我们在返回到 main 函数里，但是 thread 不会，我们创建完一个 thread，这个 thread 和我们原来的 main thread 就是独立的了，谁先运行不知道。

我们前面有提到 multi-threaded 的一个优点，他们在一个 process 里，在底层设计上就有共享的 data，但是这就带来了另一个问题，当多个 thread 同时 access 一个共享的数据，会发生什么。

书里用一个 counter 的例子在说明了这个问题。概括而言就是有两个 thread，同时向一个共享的 counter 里加 10000 次，最后我们的 counter 应该得到 20000，但实际上这不一定。

prompt> ./main
main: begin (counter = 0)
A: begin
B: begin
A: done
B: done
main: done with both (counter = 19221041)

想要分析这个问题我们需要借助更底层的代码，也就是 assembly code。像这个一个非常简单的向内存某个地方加一的操作需要三行 assembly code：

mov 0x8049a1c, %eax
add $0x1, %eax
mov %eax, 0x8049a1c

上面说的那个问题之所以会发生，就是因为 thread 1 在已经加完了（第二行代码），但是还没有移回去的时候（第三行代码），我们可能会发生 context switch，我们直接就执行 thread 2 了，这样我们就有可能少加一次。

这个现象叫做 race condition，或者更确切的说是 data race：the results depend on the timing of the code’s execution.

A race condition (or data race) arises if multiple threads of execution enter the critical section at roughly the same time; both attempt to update the shared data structure, leading to a surprising (and perhaps undesirable) outcome.

A critical section is a piece of code that accesses a shared resource, usually a variable or data structure.

An indeterminate program consists of one or more race conditions; the output of the program varies from run to run, depending on which threads ran when. The outcome is thus not deterministic, something we usually expect from computer systems.

To avoid these problems, threads should use some kind of mutual exclusion primitives; doing so guarantees that only a single thread ever enters a critical section, thus avoiding races, and resulting in deterministic program outputs.

解决上面那个问题的一个方法就是把这些在 critical section 的代码变成 atomic。

The idea behind making a series of actions atomic is simply expressed with the phrase “all or nothing”; it should either appear as if all of the actions you wish to group together occurred, or that none of them occurred, with no in-between state visible. Sometimes, the grouping of many actions into a single atomic action is called a transaction, an idea developed in great detail in the world of databases and transaction processing.

但是想要实现这个 atomic，我们需要 synchronization primitives from the hardware。

在这个 concurrency 这个 piece 里，我们就会学 how to build support for synchronization primitives to support atomicity。除此之外，我们还会学 mechanisms to support sleeping/waking interaction that is common in multi-threaded programs.

Chapter 27: Interlude: Thread API

这一章节主要讲的就是具体 C 怎么用 Thread 了，比如说传什么参数进去这种。

首先就是怎么创建一个 thread：

#include <pthread.h>
int
pthread_create(pthread_t *thread,
							 const pthread_attr_t *attr,
							 void *(*start_routine)(void*),
							 void *arg);

然后讲了 thread completion，用 pthread_join() 可以等对应的 thread 结束之后，在运行后面的代码。

int pthread_join(pthread_t thread, void **value_ptr);

接下来是另一个比较重要的事情，locks，locks 可以给我们提供 mutual exclusion。

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

如果你有一块代码是 critical section，你就可以把这段代码放到 lock 里面保护，比如如下这个例子：

pthread_mutex_t lock;
pthread_mutex_lock(&lock);
x = x + 1; // or whatever your critical section is
pthread_mutex_unlock(&lock);

上面这段代码是有错误的，只是为了给你一些 intuition。

Conditional variable 也是一个非常重要的组成部分。如果 threads 之间也有先后次序，那我们就可以用 conditional variable 来使得某一个 thread 等待另外一个 thread 完成一些必要步骤在运行：

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_signal(pthread_cond_t *cond);

Chapter 28: Locks

A lock is just a variable, and thus to use one, you must declare a lock variable of some kind (such as mutex).

It is either available (or unlocked or free) and thus no thread holds the lock, or acquired (or locked or held), and thus exactly one thread holds the lock and presumably is in a critical section.

一个程序里面可能有很多的 lock variable，这是因为只用一个锁效率太低了，我们叫做 coarse-grained locking strategy，如果我们分的很细致，用很多的不同的锁，就是一个更 fine-grained 的策略。

我们实现的 lock 首先需要能提供 mutual exclusion，也就是最基本的功能要没问题；其次就是要 fair，不能让有的 thread starve；最后就是 performance 要尽可能好，加 lock 所造成的 overhead 要小。

Controlling Interrupts

One of the earliest solutions used to provide mutual exclusion was to disable interrupts for critical sections; this solution was invented for single-processor systems.

最直观的尝试就是在进入 critical section 的时候 disable interrupts，之后再重新启用。这个的好处就一个，简单，但是坏处多多的。没有 interrupt 我们就得相信这个 process 会把 cpu 自己交回到 OS，这一听就很危险；这个方法在 multiprocessor 那里还不还用；还有可能制造出严重的系统问题。总之就是弊大于利。

A Failed Attempt: Just Using Loads/Stores

这个就是把 lock variable 简单的想成是一个 flag，然后 1 的话就锁上，0 的话就没有锁。这个的问题是 lock 本身的实现也不是 atomic，所以就有可能 thread 1检查完不是 1，刚要给他 set 成 1 的时候，context switch到 thread 2，thread 2 一看，也不是 1，然后把锁给占了，这时候再回thread 1，因为对于 thread 1 来讲，他已经查过了不是 1，所以他也继续执行了，这时候我们就有两个 thread 同时进到 critical area 了。

Spin Locks

所以 lock 这个东西纯靠软件是实现不了的，因为 lock 的实现本身就需要 atomic 这个特性，所以后来硬件就提供了这样的特性，目前来讲，所有的 CPU 提供了这个特性。

test-and-set

The simplest bit of hardware support to understand is known as a test-and-set (or atomic exchange ) instruction.

用 C 语言写这个 test-and-set 的功能如下：

int TestAndSet(int *old_ptr, int new) {
  int old = *old_ptr; // fetch old value at old_ptr
  *old_ptr = new; // store ’new’ into old_ptr
  return old; // return the old value
}

其实就是把新值赋给旧变量，然后把旧变量原来的值返回。

最重要的就是这个 function will perform atomically。

只用这个就可以制造一个简单的 spin lock。

It is the simplest type of lock to build, and simply spins, using CPU cycles, until the lock becomes available. To work correctly on a single processor, it requires a preemptive scheduler (i.e., one that will interrupt a thread via a timer, in order to run a different thread, from time to time).

compare-and-exchange

除了 test-and-set ，硬件还提供了另一个 primitive，叫做 compare-and-swap 或者 compare-and-exchange。

用 C 表示它的功能就是：

int CompareAndSwap(int *ptr, int expected, int new) {
  int original = *ptr;
  if (original == expected)
  *ptr = new;
  return original;
}

大致功能就是先检查一下和 expected value 相不相等，相等赋值，不相等啥也不干，最后永远返回原来的值。

用 compare-and-swap 也可以实现我们之前实现过的 spin lock。

Compare-and-swap is a more powerful instruction than test-and-set.

load-linked and store-conditional

第三个可以实现 spin lock 的硬件 primitive 是一对 instruction：load-linked and store-conditional

关于这两个 instruction，用 C 语言表示他们的功能可以写成：

int LoadLinked(int *ptr) {
 return *ptr;
}

int StoreConditional(int *ptr, int value) {
  if (no update to *ptr since LL to this addr) {
    *ptr = value;
    return 1; // success!
  } else {
  	return 0; // failed to update
  }
}

用这个也可以实现 spin lock。

像我们之前说过的一样，关于一个 lock 最重要的就是三个 property：correctness，fairness and performance。

The most important aspect of a lock is correctness: does it provide mutual exclusion? The next axis is fairness. The final axis is performance.

针对这个 spin lock，他有 correctness，但不保证 fairness，performance 在 single CPU 的时候非常差，但是当 CPU 多的时候可能会变好。

Ticket Lock

Fetch-And-Add

最后一个这个硬件提供的 primitive 是这个 fetch-and-add。作用很简单，就是返回原来的值，然后把原来的变量加一，C 语言的 pseudocode 如下：

int FetchAndAdd(int *ptr) {
 int old = *ptr;
 *ptr = old + 1;
 return old;
}

这个 primitive 可以用来创建 ticket lock，和 spin lock 略有不同。主要的实现方法就是每个 thread 都会得到一个 ticket number，这个 number 是一个一个加上去的。然后 lock 会有一个 turn，只有当这个 turn 和 thread 的 tick number 吻合的时候才会继续进行。每结束一个 thread，这个 turn 就会加一，去执行下一个 thread。

代码如下：

这个方法好在每一个 thread 都会得到机会执行。

Chapter 29: Lock-based Concurrent Data Structures

这一章节主要介绍如何把 data structure 也变成 thread safe 的，其实就是再包装一层，higher level of abstraction。

Concurrent Counter

首先介绍的数据结构是 counter，非常简单，然后也介绍了一个非常简单的方法可以实现 thread safe，那就是再所有访问内部元素的代码块外面包上 lock。但是这种见的方法，性能非常差，没办法 scale up。如果我们能有一个可以让不同的 thread 完美并行的方案，这个方法就是 perfect scaling，也就是说不管我们有多少个 thread，这个 process 完成的时间是一样的。

我们可以用 approximate counter 来实现 scale up这个就是我们有若干个 local counter，有一个 global counter，然后我们有一个 threshold S，如果 local counter 的数到了 S，我们 update 这个到 global 里，然后把 local counter 清零。

这个有个问题就是 S 设的低一些，就更准，但是就会慢，如果 S 设的高，就没那么准，但是非常快。

Concurrent Linked Lists

最简单的方法还是我们之前提到的，就是直接在所有的地方套上 lock，但是我们稍微改一下 lock 的位置其实就可以达成更好的效果，因为很多地方其实并不需要 lock。

另外一种方法是使用的 hand-over-hand locking (a.k.a. lock coupling)，这个就是说每个 node 我都有一个 lock，但是这样的话其实非常慢，就算我们 scale up 也很慢，因为我们 acquire lock/release lock 操作的开销很大。不过虽然每一个 node 都有 lock 这样的设计不太靠谱，我们还可以使用中和一下的设计，就是固定数量的 node 有一个 lock。

Concurrent Queues

这个就是把 enqueue 和 dequeue 给分离了，用两个锁来管。