Notes---Operating Systems: Three Easy Piece---Persistence（UW-Madison CS 537）

这里就是最后一部分啦～，persistence，讲了一些像什么 I/O devices，然后什么 file system，distributed system 这种东西。

Chapter 36: I/O Devices

这一部分首先讲的是 I/O devices，which is a very important part of computers。

三种 bus：memory bus，general I/O bus 和 peripheral I/O bus。

不过上面这个并不是真的现代计算机系统的架构，一个真的现在计算机的系统架构大概会像 Figure 36.2。

现在我们讲一讲 canonical device。意思就是所有 devices 的抽象，所有 devices 都要有以下的 feature。

所有的 device 都需要提够接口，也就是 interface，这样我们的系统，或者其他组件才能够“用”这个硬件。

其次所有的 device 都有其内部结构，这个就大家各不相同了，只要你提供了好的 abstraction，好的 interface，我们系统就可以忽略你到底是怎么实现具体的功能的。

In the picture above, the (simplified) device interface is comprised of three registers: a status register, which can be read to see the current status of the device; a command register, to tell the device to perform a certain task; and a data register to pass data to the device, or get data from the device. By reading and writing these registers, the operating system can control device behavior.

轮询设备 (Polling the Device)：CPU 通过不断地、循环地读取设备的状态寄存器，来检查设备是否准备好执行操作或是否已完成操作的过程。

但是 polling 这个操作比较浪费 CPU 资源，相当于你的 CPU 一直在空转，另一个比较 efficient 的操作是 interrupt，其实我们在之前学过。

需要注意的就是不是所有时候，interrupt 都更好，因为 context switch 也有 overhead，如果这个硬件本身的操作很快，可能就是用 polling 更好。

我们也可以结合这两种方法，有 hybrid 的方法，比如先 polling，polling 一段时间之后变成 interrupt，这种 two-phased 的方法可能更现实一点。

interrupt 这个方法还有可能引发 livelock。除了 interrupt 我们也有别的优化方法，比如说 coalescing，我的理解 coalescing 更像是 batching，就是遇到一个 interrupt 的时候等一等，说不定能等到别的 interrupt，这样就能把不同的 interrupt batch 到一起，然后 CPU 只需要处理一个 interrupt。

我们有各种各样的 devices，我们时常会遇到 devices 和 memory 之间的数据传输问题，比如说我现在要复制 memory 里的一些数据到某个 device，如果我们用 PIO (Programmed I/O) 模式，CPU 干这个活，如果遇到一些任务量很大的 I/O，cpu 直接很轻松就被占满了；所以我们要把这项任务 offload 到“专人”来干，这个“专人”就叫 Direct Memory Access (DMA)。

我们到现在也没有说 CPU 和各个其他的 devices 是怎么 communicate 的，一般我们有两种方法：1. 用 explicit I/O instructions。就直接用一些专用的指令，比如就指明哪个 device 上哪个 register，我们要往里存什么数据。2. memory mapped I/O。我们没有专用的指令，而是把一些 memory 的地址设置成 mapping 地址，往这里存就相当于往某个 device 的某个 register 存。

但是毕竟每一个 devices 都不一样，那我们怎么样能非常自然的处理和这些不一样的 device 的 communication 呢？其实 OS 里，有一个小的软件，知道这些 devices 的 details，这个软件我们叫它 device driver。

Figure 36.4 是一个 simplified 关于我们究竟怎么 communicate 的 workflow。

Chapter 37: Hard Disk Drives

这一章节就专门讲 hard disk drive，hard disk 是一个非常重要的 device，所以它的 drive 也就很重要。

We can view the disk as an array of sectors; 0 to n − 1 is thus the address space of the drive.

Platter: a circular hard surface on which data is stored persistently by inducing magnetic changes to it.

A disk may have one or more platters; each platter has 2 sides, each of which is called a surface.

The platters are all bound together around the spindle, which is connected to a motor that spins the platters around

The rate of rotation is often measured in rotations per minute (RPM)

Data is encoded on each surface in concentric circles of sectors; we call one such concentric circle a track. 图 37.1 就展示了一个 single track 的 surface。

The process of reading and writing is accomplished by the disk head; there is one such head per surface of the drive. The disk head is attached to a single disk arm, which moves across the surface to position the head over the desired track.

等 sector 转过来需要时间，这个时间叫做 rotational delay。

如果我们有 multiple tracks，那我们把 head 挪到正确的 track 上也需要时间，这个时间叫做 seek time。Seek 这个操作可以分为四步： first an acceleration phase as the disk arm gets moving; then coasting as the arm is moving at full speed, then deceleration as the arm slows down; finally settling as the head is carefully positioned over the correct track. The settling time is often quite significant as the drive must be certain to find the right track.

最后我们找到了对应的 sector，我们就需要做最后一步，the actual data transfer。

We have a complete picture of I/O time: first a seek, then waiting for the rotational delay, and finally the transfer.

track skew: Make sure that sequential reads can be properly serviced even when crossing track boundaries.

这个错位是因为从一个 track 挪到另一个 track 的时候是需要时间的，如果不这样错位一下，在换 track 的时候这个盘就不能 rotated 了。

后面讲了一些具体的计算，有一个还比较有意思，就是 average seek time 等于的是 1/3 的 full seek time，就是平均计算 seek time 是相当于横跨 1/3 宽度需要的时间。

因为 I/O 的开销其实很高，所以我们还有一个 disk scheduler 专门来管当前应该做哪个 I/O。

我们 CPU 的 scheduler 其实是没办法预测一个任务需要多久完成的，但是 disk scheduler 在某种程度是可以预测的，所以我们可以直接 greedy，每次选最短的任务：principle of SJF (shortest job first)。

shortest-seek-time-first (SSTF) (also called shortest-seek-first or SSF)

SSTF 是有问题的，首先就是我们其实不知道 disk 的 geometry structure，OS 看到的是一个 block array，但是 nearest-block-first (NBF) 解决了这个问题；其次就是 starvation，如果一直进来的 I/O 都是快的，那么慢的就会一直 starve。

为了解决 starvation，我们有了 Elevator (a.k.a. SCAN or C-SCAN)。

SCAN，simply moves back and forth across the disk servicing requests in order across the tracks。

Let’s call a single pass across the disk (from outer to inner tracks, or inner to outer) a sweep.

就是按顺序的来回扫，这样在我们刚扫完的地方，如果有了新的 I/O request，我们不会立刻去处理，而是等我下一个 sweep。

SCAN has a number of variants, like F-SCAN, which freezes the queue to be serviced when it is doing a sweep

C-SCAN is another common variant, short for Circular SCAN. 只从一个 direction 扫。

不过即便我们有了 SCAN 这一系列的 scheduler，其实还是不够，因为它并没有考虑 rotation time。

SPTF: Shortest Positioning Time First (sometimes also called shortest access time first or SATF)，就可以解决上面这个 rotation time 的问题。

Chapter 39: Interlude: Files and Directories

一个新的 abstraction：persistent storage。

Two key abstractions have developed over time in the virtualization of storage. The first is the file. The second is directory.

For historical reasons, the low-level name of a file is often referred to as its inode number (i-number).

一个 directory 存的就是一个 (user-readable name, low-level name) pairs 的 list。

The directory hierarchy starts at a root directory and uses some kind of separator to name subsequent sub-directories until the desired file or directory is named.

Create files

用 open() 这个指令可以创建一个新的 file，他会给你返回一个 file descriptor。只要你有这个 file descriptor，你就可以读写这个文件，它可以起到一个钥匙的作用。

File descriptors are managed by the operating system on a per-process basis

Reading And Writing Files

正常你第一个打开的文件，给你返回的 file descriptor 基本上一定会是 3，这是因为每一个 process 已经默认打开了三个文件：一个是 standard input，一个 standard output，一个 standard error。这三个会占据 file descriptor 的 0，1，2。

把 file structures 合到一起，有一个叫做 open file table 的东西存这些 file structures。