02.4 Operating systems — the parts that matter

There is a particular flavor of confusion that hits programmers who have never opened the kernel curtain. They write a program that gets killed by the OOM killer and don’t know why. They see “load average: 12.5” and don’t know what it counts. They write multithreaded code and ship a deadlock. The fix for all of that is one focused pass through operating systems — not a semester of it, but the parts that show up in every job.

This module covers what you actually need: processes, threads, memory, files, and the syscalls that move data across the kernel boundary. You will build a tiny shell that exercises most of it. You will not implement a scheduler from scratch — you will read enough about scheduling to understand why your laptop sometimes feels slow.

The opinionated take: you don’t need to read OSTEP cover to cover. OSTEP is fantastic and it’s also 800 pages. Pick the right chapters, do the projects, move on.

Set up

You need a real Unix-like environment. On Windows, install WSL2 (Ubuntu) before you continue. On macOS, you have most of what you need; install the GNU coreutils for parity with Linux command names.

# Linux
sudo apt install build-essential strace gdb valgrind linux-tools-common

# macOS
xcode-select --install
brew install gnu-coreutils

Verify your environment:

gcc --version && strace -V 2>/dev/null || dtruss -h | head -1

Make a project:

mkdir tinyos && cd tinyos
mkdir shell traces notes

You will write the shell in C (preferred for OS work) or Rust if you went through 01.5 in Rust. Either is fine.

Read these first

In this order, and only the listed chapters:

OSTEP — Remzi & Andrea Arpaci-Dusseau. free PDF · 8–10 hours · chapters 4–6 (processes, syscalls, limited direct execution), 12–14 (memory virtualization), 26–32 (concurrency). Skip the persistence section unless you want to. This is the single best free OS textbook.
Robert Love — Linux Kernel Development. book · skim only · for a real-kernel grounding when OSTEP feels abstract.
The Linux man pages, sections 2 and 3. online · man 2 fork, man 2 execve, man 2 wait are required reading. The man pages are the actual API.
Brendan Gregg — Systems Performance. book · only after the basics. The best applied-OS book in print.

After OSTEP’s process and concurrency chapters you have the conceptual grounding. Stop reading and start writing.

Processes vs threads

A process is the kernel’s unit of isolation: its own virtual address space, its own file descriptor table, its own credentials. A thread is a unit of scheduling within a process: shared memory, shared file descriptors.

The four syscalls every Unix programmer must know:

fork() — duplicate the current process. Returns 0 in the child, the child’s PID in the parent. Surprising on first encounter; obvious by the third use.
execve() — replace the current process image with a new program. Almost always called right after fork in the child.
wait() / waitpid() — block until a child exits, reap its status. Without this, you leak zombies.
exit() — terminate the current process and return a status to the parent.

Drill: write a 30-line C program that forks five children, each prints its PID, parent waits for all five. If you understand this program — the order of the prints, the interleaving, why the parent’s wait loop matters — you understand the basic process model.

// shell/fork_demo.c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    for (int i = 0; i < 5; i++) {
        pid_t p = fork();
        if (p == 0) { printf("child %d pid %d\n", i, getpid()); return 0; }
    }
    while (wait(NULL) > 0) {}
    return 0;
}

Run it ten times. Notice the prints come out in different orders. That is the scheduler at work.

Virtual memory, the diagram-level view

Every process sees a flat virtual address space. The MMU (memory management unit) translates virtual addresses to physical addresses through a per-process page table. The TLB (translation lookaside buffer) caches recent translations.

What you must internalize:

Pages are 4 KiB on most systems. Memory is allocated and tracked in page-sized units.
Page tables are big. A naive page table for a 64-bit address space is unimaginably huge, so they are multi-level (4 levels on x86-64).
A page fault is an interrupt. When you access a virtual page that’s not mapped to a physical page (or has been swapped out), the CPU traps to the kernel, which fixes it up.
The TLB matters. TLB misses are expensive — hundreds of cycles. Cache-friendly code is also TLB-friendly.

You don’t need to implement page tables. You do need to know what top’s VIRT, RES, and SHR mean (virtual size, resident-in-RAM, shared with other processes). When VIRT is huge but RES is small, that’s normal — VIRT counts everything mapped, including memory you’ve never touched.

Syscalls and the kernel boundary

A syscall is a controlled transition from user mode to kernel mode. On x86-64 Linux, the syscall instruction triggers it. Userspace cannot directly access hardware; it must ask the kernel.

The conventional wisdom: syscalls are slow because they cross the boundary, flush some CPU state, and run kernel code. Roughly 100–1000 ns per syscall, depending on what it does. This is why batched I/O (writev, io_uring) and zero-copy paths (sendfile, mmap) exist.

Tracing a real program’s syscalls is one of the most enlightening 10 minutes you can spend:

# Linux
strace -c ls /tmp

# macOS
sudo dtruss -c ls /tmp

The output is the syscall histogram: which calls, how many, time spent. You will be surprised how much of ls is getdents64 and stat.

File descriptors and the unified-IO model

In Unix, almost everything is a file. Sockets, pipes, regular files, terminals, even some device drivers — all accessed through small integers called file descriptors. fd 0 is stdin, 1 is stdout, 2 is stderr.

The four core operations:

read(fd, buf, n) — pull up to n bytes.
write(fd, buf, n) — push up to n bytes.
close(fd) — release the descriptor.
dup2(old, new) — make new refer to whatever old refers to.

The shell pipe is dup2’s killer app. To wire ls | wc, the shell creates a pipe (returns two fds), forks twice, in the ls child it dup2s the write end onto fd 1, in the wc child it dup2s the read end onto fd 0. That is the entire trick. Once you’ve written it, pipes stop being magic.

The build: a tiny shell

Build a shell that supports:

Running a single command with arguments.
Piping commands with |.
Backgrounding with &.
Built-in cd, exit.

The skeleton is roughly 200 lines of C. Build it in shell/tinysh.c over an evening or two.

// shell/tinysh.c — sketch only; you write the rest
int main(void) {
    char line[1024];
    while (printf("$ "), fflush(stdout), fgets(line, sizeof line, stdin)) {
        // 1. tokenize on whitespace
        // 2. detect & for backgrounding
        // 3. detect | and split into pipeline stages
        // 4. handle built-ins (cd, exit) without forking
        // 5. fork/exec each stage, dup2 pipe ends, wait or don't
    }
}

When yours runs cat README.md | grep todo | wc -l correctly and supports &, you have built a complete (if tiny) shell. That is roughly the same architecture as bash, dash, and zsh. Real shells add aliases, history, glob expansion, job control — all of which are layers on top of this core.

Concurrency primitives

The three you must know cold:

Primitive	What it does	Use when
Mutex	Mutual exclusion; one holder at a time	Protect a shared data structure
Condvar	Wait until a condition becomes true	Coordinate producers and consumers
Semaphore	Counted resource	Limit concurrent access to N

Anti-patterns to avoid:

Naked spinlocks in user space. Use a mutex. The kernel knows how to put a thread to sleep.
Sleeps as synchronization. sleep(0.1) is not a substitute for a condvar.
Holding a lock across I/O. The latency tail will eat you.

Drill: write a bounded-buffer producer/consumer in C with pthreads. One mutex, two condvars (not_full, not_empty). Run it with three producers and two consumers. If the buffer ever overflows or underflows, your synchronization is wrong.

Scheduling, briefly

Linux uses CFS (Completely Fair Scheduler), which approximates “give every runnable thread a fair share of CPU.” The relevant facts for application programmers:

Runnable threads compete for CPU. Threads in D state (uninterruptible sleep) often mean blocked I/O — a common cause of “my system is slow.”
nice adjusts priority weight. Most apps shouldn’t touch it.
Per-CPU run queues mean cache locality matters; the scheduler tries to keep a thread on the same CPU it last ran on.

You don’t need to implement a scheduler. You do need to read top and htop and explain the load average. (Hint: load average counts runnable + uninterruptible processes, averaged over 1/5/15 minutes. A load of 4 on a 4-core box is “fully busy”; on a 2-core box it’s “queueing up.”)

Going deeper

When you have specific questions, in this order:

OSTEP — full text. free PDF · the persistence chapters (filesystems, journaling, log-structured) are excellent when you need them.
Brendan Gregg — Systems Performance. book · the applied volume. Read it after OSTEP, not before.
The Linux Programming Interface (Kerrisk). book · 1500 pages of the actual Linux API. Reference, not cover-to-cover.
xv6. book + code · MIT’s teaching kernel. Read the source over a weekend; it’s tiny and complete.

Skip “What is an operating system” 5-minute YouTube videos. They will not help.

Checkpoints

If any one wobbles, redo the corresponding section.

Walk through what fork() returns to the parent vs the child. Why is the return value different, and how does that one quirk power the entire process model?
You see top showing one process at VIRT 8 GB, RES 200 MB. Explain what that means and whether you should be worried.
A program writes 10 MB to disk. Roughly how many syscalls, and what’s the cost difference between calling write() once with 10 MB vs ten million times with one byte each?
Sketch the pipe + dup2 dance for cmd1 | cmd2. Specifically: who creates the pipe, who closes which end, and what does the child fd table look like after dup2?
You have a producer thread and a consumer thread sharing a queue. Name the primitives you need and why a single mutex without a condvar is the wrong choice.

If you can answer all five from memory, you’ve earned 02.4. Move on to 02.5 (Computer networks, end to end) — where the file descriptor model extends across machines.