summaryrefslogtreecommitdiff
path: root/doc/klacansky.com_notes_useless-operating-system.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/klacansky.com_notes_useless-operating-system.txt')
-rw-r--r--doc/klacansky.com_notes_useless-operating-system.txt978
1 files changed, 978 insertions, 0 deletions
diff --git a/doc/klacansky.com_notes_useless-operating-system.txt b/doc/klacansky.com_notes_useless-operating-system.txt
new file mode 100644
index 0000000..cc15e24
--- /dev/null
+++ b/doc/klacansky.com_notes_useless-operating-system.txt
@@ -0,0 +1,978 @@
+ Useless Operating System
+
+ [1]Code
+
+ In my final project for the advanced operating system class taught by
+ John Regehr, I implemented parts of an operating system inspired by
+ exo/micro kernel design.
+
+ The overall goal is to expose hardware as much as possible while
+ providing protection.
+
+ I decided to stick with amd64 architecture as porting to aarch64 seems
+ unfeasible given the time frame and I primarily use amd64 CPUs so I
+ want to know more about them. UOS does not do identity mapping of
+ physical memory outside boot thus kernel will be using virtual memory
+ in almost same way as user mode programs.
+
+ After implementing small OS I recommend it to all low-level or high
+ performance oriented programmers. More importantly, OS development is
+ joy, especially the design of interfaces and low-level tinkering.
+ Unfortunately, developing a usable operating system is no small task
+ (all those drivers) and user mode applications severely constrain the
+ kernel design. I therefore moved to compilers and programming
+ languages, which are tractable for a single person (in the days of
+ LLVM), and can significantly improve ones life by making programming
+ fun againk
+
+ Also, the development made me finally appreciate debugger, where
+ simpler printf can't be used as registers need to be inspected before
+ and after instruction execution, e.g., before and after sysret.
+
+Important
+
+ * disable red zone with compiler flag -mno-red-zone (can get
+ corrupted in nested interrupts)
+ * make sure pointers are in canonical form
+
+Motivation
+
+ In the class we used xv6, a clean Unix clone with approachable code.
+ However, adding VESA BIOS Extensions framebuffer required new system
+ calls and kernel changes which was clunky. I especially disliked that
+ any experiment required tinkering with the kernel, and provided
+ abstractions were quite high level.
+
+ Moreover, the xv6 operating system runs on 32-bit architecture, but
+ most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is
+ much simpler as it removes tons of old cruft such as segmentation and
+ has simple system calls.
+
+ The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a
+ bit L4 (microkernel). For example, almost all exceptions are handled by
+ user and not kernel, including page fault. The timer interrupt is
+ forwarded to user, but in case of buggy/malicious container, kernel
+ will wait few slots and then kill the misbehaving container. Since the
+ user knows more about its use of registers, it can save only the
+ necessary ones, e.g., may not save floating point registers. On context
+ switch, kernel stores only instruction and stack pointers.
+
+ I decided to go with capability-based permission system, where each
+ resource is associated with a "lock" and the key to the lock consists
+ of hashed permissions and the lock itself. Kernel/user can easily
+ verify the key valid by hashing the lock and permissions of the key and
+ checking if the key's hash matches. This approaches makes kernel
+ simpler, as memory that can't be exposed to user (e.g., < 1 MB) is
+ simply associated with a lock that only kernel knows. Therefore, no
+ special range checks are required in the kernel when user maps memory.
+ Shared memory is also simple to implement, as a container can easily
+ create keys for the memory it owns and give it to a newly created
+ container. Also, kernel uses same interface as user, i.e., it presents
+ key if it needs to map memory. TODO: how to safely store key if
+ Meltdown is present?
+
+ UOS in constrast to Linux does not have a text mode. All assembly is in
+ separate files as I am convinced inline assembly is not good solution.
+ Moreover, kernel consists of only single .c file, which simplifies the
+ build system to a single compile command. It targets desktops and uses
+ 2 MiB pages.
+
+Amd64 Architecture
+
+ Most desktops or laptops contain amd64 processor either from Intel or
+ Amd. Since UOS targets personal computers it supports only Amd64
+ architecture. This choice simplifies the kernel as 32bit architectures
+ have much more legacy cruft. For example, Amd64 has no segmentation,
+ good syscall interface, simpler instruction set and only one necessary
+ task segment. The arguments for C ABI are passed in registers. UOS does
+ not allow kernel to read any user memory without explicit mapping.
+ Moreover, since there is more registers, all syscall arguments must use
+ them.
+
+ Any virtual address must be in canonical form, i.e., if bit 47
+ (counting from 0) is set, then all most significant bits must be also
+ set. This quirk divides the address space into 2 pieces with huge hole
+ in between them. Pretty annoying as UOS is not higher half kernel.
+
+ The System V ABI requires first 6 integer or pointer arguments to be
+ passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The
+ remaining arguments (7 and more) are passed on stack.
+
+Overview of Startup
+
+ At boot, Basic Input/Output System (BIOS) loads specific sector of 512
+ bytes into memory at fixed address and starts execution. Usually, this
+ program is a bootloader that loads the kernel, or in our case second
+ stage of the bootloader as 64 bit assembly takes more space and I could
+ not fit the ELF loader into a single sector. The second stage parses
+ ELF file (kernel) on disk and loads it into memory. The kernel
+ bootstraps itself by setting up physical and virtual memory managers
+ and core struct for the boot processor. It then switches to new page
+ table and stack and continues by starting other processors and loading
+ user init program into a container. We will now discuss individual
+ stages in more detail.
+
+Bootloader
+
+ We focus on using BIOS instead of UEFI due to simplicity of the setup
+ as there is no need for additional tools to build BIOS bootable binary.
+ The disadvantage of BIOS is all the legacy cruft, such as starting in
+ Real Mode and having only 512 bytes for bootloader.
+
+ The purpose of bootloader is to do initial setup (such as video mode,
+ virtual memory) and load kernel from a disk into a memory. Since we
+ could not squeeze bootloader with ELF loader into 512 bytes like xv6,
+ we split the bootloader into two stages, where the first stage's
+ purpose is to only load the second stage. For now, we use 4 KiB of
+ space for both stages and kernel starts at sector 8.
+
+ After the video mode and long mode setup, the bootloader reads kernel
+ from ELF binary into predefined location and jumps into the kernel
+ entry point. IDE driver is used to perform reading from a disk. UOS
+ does not support multiboot standard as xv6, so it will work only with
+ our bootloader.
+
+ The first stage must be 512 bytes and the assembly needs to be padded
+ to fill 512 bytes (and adding MBR bytes at 510 offset).
+
+ A20 Line
+
+ Used to ignore 21st bit of address to emulate wraparound from older
+ architectures. We have to enable this 21st bit so we can address all
+ memory.
+
+ xv6 bootloader uses keyboard controller to enable this bit. We opt for
+ BIOS and its functions.
+movw $0x2401, %ax
+int $0x15
+
+ http://www.win.tue.nl/~aeb/linux/kbd/A20.html
+
+ Global Descriptor Table
+
+ entry point of bootblock.o must be at the address 0x7C00 which can be
+ double checked by doing objdump -d bootblock.o
+
+ The simplest "debugging" can be done by writing to screen; segmentation
+ must be used as CGA address of 0xB8000 is outside 16-bit range. This
+ addressing can be done by setting %ds to 0xB800 (segment is multiplied
+ by 16 on access).
+
+ We directly transition from real mode to long mode without going
+ through protected mode. Since the boot sector is small (512 bytes) we
+ have to manually set up page tables instead of loading them from an
+ array.
+
+ We can't use .data segment in bootloader C code (for example to store
+ static string) as only .text is copied with objcopy (-j flag; TODO).
+
+ Bootstrap Page Table
+
+ We map same physical address [0, 1 GiB) to the virtual addresses [0x0,
+ 1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as
+ we linked kernel to reside at address 0x8000100000 (as we want it to be
+ mapped into each process space) and when we jump to it it must be
+ mapped properly. The first 1 GiB in virtual address space are mapped
+ because it is the space in which bootloader lives and from where we
+ will perform the jump.
+
+ The reason for choosing address where all 39 bits are zero is so we can
+ have only top level page directory (PML4) which shares mapping with the
+ [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry
+ to the same entry in PDP (next level page table) as for the first
+ entry.
+
+ direct switch to long mode
+
+ Real Hardware
+
+ UOS does not boot on my desktop as it would need SATA driver. If
+ booting from USB, tt turns out if the disk lacks partition table the
+ BIOS will set DL register to 0x0, i.e., floppy, and floppy does not
+ support linear base addressing (LBA) which we use in IDE driver.
+
+ELF
+
+ Fixed-length ELF header and variable-length program header listing
+ * .text program instructions
+ * .rodata read-only data (we should make it only read-only memory)
+ * .data program's data (e.g., global variables)
+ * .bss is a reserved section for unitialized global variables (we
+ need to zero them)
+ * VMA is link address LMA is load address of .text section (as the
+ offsets in code are absolute due to not using Position Independent
+ Code)
+
+ For some stupid reason .rodata is together with .text in same segment,
+ making .rodata executable. (http://thiébaud.fr/rodata.html)
+
+ Symbol Table
+
+ Symbol table associates names with linear addresses and thus can be
+ used by the loader to find addresses of exception and interrupt
+ handlers (those must have some agreed names).
+
+Drivers
+
+ UART
+
+ Serial port for debugging. Uses port-mapped I/O which ignores virtual
+ memory (thus even if we do not map first 2 MiB it works).
+
+ IDE driver
+
+ [2]Reference
+
+ We need to read kernel from the primary ATA disk in bootloader to
+ initialize the core OS.
+
+ We wrote the function for reading sectors in assembly to avoid inline
+ assembly. Also, we could wrap individual instructions in assembly
+ functions but it was much simpler this way.
+
+Kernel Bootstrap
+
+ Mostly disabling legacy stuff and enabling modern features as the
+ processor assumes it woke up in last century.
+
+Memory Management
+
+ * frame allocator (2 MiB)
+ * avoid mapping physical memory (higher half)
+ * tried recursive paging (pain to modify other processes, TLB
+ pressure)
+ * ended up with linear/physical dual mapped page tables
+ * kernel pages marked as global
+ * kernel heap (used for most allocations)
+
+ There are roughly three levels of memory management:
+ * physical memory
+ * virtual memory
+ * heap
+
+ We use recursive page table mapping to avoid the need to remap page
+ tables each time we need to access them or to use contiguous memory
+ storage for the VA to PA and PA to VA conversions which are necessary
+ for updating the page table. The recursive mapping is the simplest as
+ we need these conversions only for page table manipulation. In other
+ cases such as memory mapped IO we would use a simple mapping.
+
+ Hmm, recursive mapping allows us to access entries of page table only
+ for currently loaded page table which is really annoying.
+
+ There are several approaches:
+ * remap physical memory to kernel and use simple +-KERNEL_BASE
+ * use recursive page table (may use it for user space)
+ * keep small space available and map the page table before modifying
+ it
+ * store virtual and physical addresses for page table (have to keep
+ them in sync, allows for custom format which gets rid of canonical
+ address issues NX bit, problems with dirty and accessed flags)
+
+ Recursive Page Table
+
+ By mapping pml4 entry to pml4 we can access all paging structures for
+ current process. The main advantage is the simplicity and constant
+ virtual address space used in kernel (since only pml4 is mapped in
+ kernel space). If we need to modify another process's page table we can
+ point it to a pml4 slot and use the recursive trick on pdp (which is
+ pml4 of the other process).
+
+ Unfortunately, there are many disadvantages:
+ * since we use 2 MiB pages we have to be careful as the recursive
+ mapping behaves as if we used 4 KiB pages
+ * large TLB footprint (as we use 2 MiB pages)
+ * clunky to modify other process page table (the pdp becomes pml4,
+ ...) - rare, but needed for container creation
+ * weird bootstrap during boot
+ * limited to x86 (not big deal for now)
+
+ Duplicated Page Table
+
+ For kernel we have a separate page table data structure that mimics
+ hardware page table.
+
+ TODO: use completely different data layout so we can easily traverse
+ the pointers from pml4 -> pdp -> pd.
+
+ For user space we may use recursive mapping to allow easy view of the
+ current virtual address space. (TODO: what about TLB pollution? We
+ can't easily map page tables as they live in kernel heap)
+
+ Backing pages cannot be moved without knowledge of the heap to avoid
+ having a separate structure for page tables.
+
+ Initialization
+
+ Since at boot we do not have valid kernel page table we have to build
+ one. This build process is quite different from modifying current page
+ table (one in CR3) as we can't use recursive mapping to access the
+ pml4, pdp, and pd tables. Furthermore, since the page table is not used
+ we do not need to invalidate TLB entries.
+
+ We assume the heap is physically contiguous at the boot time and use a
+ simple offset to setup a initial kernel page table. Alternative
+ solution would be to walk the boot page directory and find linear
+ address for a given physical address (the other way is handled by
+ recursive mapping).
+
+ Manager
+
+ Still go back and forth between the options but leaning towards thin
+ layer in kernel which will provide frame management and ability to map
+ virtual to physical addresses. The kernel would enforce permissions.
+
+ The other options are:
+ * provide only sbrk to grow heap (pain in ass to map any specific
+ frames)
+ * allocate fixed number of frames for kernel and give rest of frames
+ to user space memory manager (two context switches, painful)
+ * I can't think of anything else
+
+ Security
+
+ If I go with 2 MiB pages than we need at least 4 pages for a program (8
+ MiB):
+ * .text, read-only
+ * .rodata, read-only, no-execute
+ * .data, write, no-execute
+ * stack, write, no-execute
+
+ Heap
+
+ I started with a simple heap that allocates multiple of 4096 bytes, but
+ after watching Switch hacking talk (TODO reference) I plan to have a
+ separate allocator (slab) for each struct type, thus avoiding the
+ potential vulnerability of overwriting different struct that can happen
+ on heap (as two different types of struct cannot possibly overlap, and
+ also there are no pointer casting issues).
+
+Security
+
+ * NX, WP, SMAP, SMEP
+ * guard pages
+ * proper permissions for elf segments
+ * capability-based permission (kernel/user are "same")
+
+ The UOS is a exo/micro kernel design and thus it runs all drivers in
+ user mode (ring 3).
+
+ Kernel is prohibited to write to any user mapped memory directly and
+ must map the memory to separate address if it wants to write there (for
+ example for the info structure). Kernel never reads any user memory and
+ all syscall arguments must be passed in registers.
+
+ The recent side-channel attacks are concerning and I will probably go
+ with the separate page table for kernel and user (with the addition of
+ PCID). The Spectre attack seem much harder to mitigate. The attack
+ allows user to read all memory, thus it could read kernel key and
+ corresponding capabilities created from the key. One solution would be
+ to store the keys and capabilities in a special area not mapped when
+ user space is in flight (probably hard to not make a mistake); could we
+ create kernel where everything is non-sensitive?.
+
+ he UOS is a exo/micro kernel design and thus it runs all drivers in
+ user mode (ring 3).
+
+ Kernel is prohibited to write to any user mapped memory directly and
+ must map the memory to separate address if it wants to write there (for
+ example for the info structure). Kernel never reads any user memory and
+ all syscall arguments must be passed in registers.
+
+ The recent side-channel attacks are concerning and I will probably go
+ with the separate page table for kernel and user (with the addition of
+ PCID). The Spectre attack seem much harder to mitigate. The attack
+ allows user to read all memory, thus it could read kernel key and
+ corresponding capabilities created from the key. One solution would be
+ to store the keys and capabilities in a special area not mapped when
+ user space is in flight (probably hard to not make a mistake); could we
+ create kernel where everything is non-sensitive?.
+
+ Capability-based Mechanisms
+
+ Instead of user/group based system have fine grained badges that give
+ owner permissions. Access control lists (ACL) store list of things that
+ have access to given resource. Capabilities store list of resources a
+ thing can access.
+
+ Strive for simplicity
+
+ c-list (capability list), trapdoor functions?
+
+ random numbers (rdrand, rdseed)
+
+ How to do these?
+ * capability per frame
+ * capability per core
+
+ Anton Burtsev, http://www.ics.uci.edu/~aburtsev/
+
+ References:
+ * https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html
+ * https://software.intel.com/en-us/articles/intel-digital-random-numb
+ er-generator-drng-software-implementation-guide
+
+ Design
+
+ Initial approach was to have each frame associated with capability,
+ thus when use allocated a frame he would get a key to it. This method
+ was clunky as it would be tricky to allocate multiple frames and it
+ placed large burden onto user. Furthermore, we were associating the
+ object number with the key to identify it, which caused same problems.
+
+ Instead, we use user provided key (can be random number) when a frame
+ is acquired, thus the same key can be used with any number of frames.
+ User can create capability by hashing the key and permissions. Then
+ when any access such as mapping the frame is performed, OS performs
+ same hashing and check if the hash in capability matches the computed
+ one.
+
+ TODO: test if we can use AES instead of sha256
+
+Multicore
+
+ * ACPI table to get list of cores
+ * x2APIC (MSR)
+ * GS register
+ * one kernel stack per core
+ * create container in user space (aka fork)
+
+ There are at least two ways to detect Application Processors (AP) after
+ Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor
+ Specification tables or ACPI (specifically MADT).
+
+ xv6 uses MPS tables which are obsolete, so we focus on the ACPI
+ approach. The ACPI table pointer structure lives at some memory range
+ and can be checksumed to verify if the struct is really an ACPI table.
+ Since Qemu does not support x2APIC we have to memory map the LAPIC
+ registers to query the APIC id (other alternative is to use CPUID). The
+ ACPI table tells us this address and moreover it has a flag if core is
+ working (the table is filled during boot by each core).
+
+ After parsing the tables we loop through all core entries and if the
+ LAPIC id is different from the id of BSP we attempt to start the core
+ with interprocess interrupt. Then we wait until the core updates the
+ shared core array with running flag. We could do exponential startup
+ for large core counts, but currently it is not an issue. Note, we use
+ x2APIC which uses MSR as an interface instead of memory mapping.
+
+ Each core has its own kernel stack and pml4 with shared entries for the
+ kernel data structures (including heap) and for the entry which holds
+ all core kernel stacks (TODO: should cores see other cores stacks?).
+
+User Mode
+
+ * params passed only in registers
+ * fast syscall/sysret
+ * low level (map memory, create container)
+
+Graphics
+
+ The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with
+ 256 colors with linear addressing) with BIOS interrupt 0x10 in
+ protected mode. This mode switch is done in bootloader before protected
+ mode is entered as only real mode can sanely interact with BIOS to
+ manipulate VGA graphics.
+# two moves can be combined into movw $0x0013, %ax
+movb $0x0, %ah # call function 0x0 to set video mode
+movb $0x13, %al # use video mode 0x13
+# or movw $0x13, %ax
+int $0x10
+
+ After changing video mode, the pixel can be accessed at physical
+ address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000.
+
+ Unfortunately, the small resolution (and only 256 colors) provided by
+ VGA is restrictive and unsuitable for any graphics based system.
+
+ The simplest solution is to use VESA BIOS Extensions (VBE) which can
+ provide more than 1kx1k resolutions with up to 16M colors. Even though
+ modes defined by VESA are considered deprecated, they are likely to
+ work and allow to avoid finding suitable mode. This predefined modes
+ come handy as real mode needs to be used and xv6's boot code can't be
+ more than 512 bytes. Other solution would be to drop down to real mode
+ before switching to long mode (TODO).
+
+ For example, setting 1280x1024 with 16M colors and loading its
+ information.
+movw $0x4F02, %ax # set VBE mode function
+movw $0x411B, %bx # 1280x1024 24bit mode (with linear framebuffer)
+int $0x10
+
+movw $0x4F01, %ax # get mode information function
+movw $0x411B, %bx # mode
+movw $0x0, %di # store the mode information struct (256 bytes) at physi
+cal address 0x0
+int $0x10
+
+ Than we can qeury mode parameters by loading the ModeInfoBlock
+ structure. For simplicity, we load it at fixed physical address 0x0 as
+ kernel code starts on address 0x100000. Since we care only about the
+ PhysBasePtr field (tells us location of linear framebuffer), we could
+ pass it as function parameter to the kernel bootstrap function.
+
+ Now, the tricky bit; we need to create virtual memory mapping to the
+ physical address in PhysBasePtr. On my machine in Qemu, the address is
+ 0xFD000000 and we need to map 4 MiB (depends on resolution and bit
+ depth) up to the address 0xFE000000 (defined in memlayout.h as
+ DEVSPACE, i.e., memory mapped I/O).
+
+ I used mappages (vm.c) in function main (main.c) to map some fixed
+ virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr
+ before starting other processors.
+struct vbemodeinfo const *const vbe = (void *)KERNBASE;
+int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
+mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr,
+PTE_W);
+
+ Now, pixels can be written at the address FRAMEBUFFER anywhere in
+ kernel. I was surprised I could not find any exported function to
+ perform the memory mapping explicitly, another reason for exokernel.
+
+Floating Points
+
+ By default disabled on amd64
+
+ We can test in bootloader first if AVX2 is supported, and if not hang
+ the OS. Some OS will trap if user used floating point unit and enable
+ it. This mechanism allows OS to avoid storing/restoring floating point
+ registers during context switch if process does not use those
+ registers. Since UOS kernel does not save context at all (except few
+ registers on timer interrupt), it enables AVX2 at the kernel startup
+ and let's user to do register store/restore if necessary. In fact, this
+ approach is simpler as user knows what registers are being used and can
+ do some tricks. Moreover, since kernel does not use floating point
+ registers, user can pass values in registers to the next scheduled
+ container.
+
+Kernel
+
+ We map kernel at high virtual address as user space programs assume
+ loading at address 0.
+
+ Global Descriptor Table
+
+ The GDT from bootloader has only kernel code segment and no user code
+ segment, nor Task State Segment (TSS). Initially, we had one GDT shared
+ by all cores, but each core needs its own TSS, thus we could either
+ create an entry for each core, or better, duplicate whole GDT to obtain
+ full flexibility at mere cost of extra 32 bytes per core. Keeping the
+ GDT separate is the right step into the direction of sharing as little
+ as possible between cores to scale well on many core architectures (>
+ 32 cores).
+
+ Memory Layout
+
+ UOS's kernel does not remap physical memory contiguously (higher half)
+ as xv6 does. Instead, the kernel uses the same interfaces as user
+ programs, allocating frame and mapping it. In linear address space, the
+ kernel lives at entry 511 (last) in PML4 page table and thus takes 512
+ GB. We chose to take whole entry for simplicity as it can be easily
+ shared by all user programs. Moreover, we mark the entry as global to
+ avoid flushing TLB on context switch.
+
+ Each core's stack lives in entry 510 and is again marked global. Stacks
+ are one page large (2 MiB), and have guard pages interspersed between
+ the stacks. We do not have kernel stack per container as there are no
+ slow system calls in the kernel (such as IO) that would need to be
+ interrupted. This push of expensive operations to user space makes
+ kernel simpler and easier to reason about.
+
+ Linking
+
+Containers
+
+ Instead of notion of a traditional UNIX process, UOS uses containers as
+ its execution unit abstraction. A container is akin to a process, but
+ it also has access to exceptions, low-level memory management, and
+ scheduling. For example, user can easily implement preemtive threads
+ inside a container.
+
+ In short, container has following entries:
+ * unique identifier
+ * PML4 pointer to a page table
+ * frame for storing rsp and rip during system call
+ * status (running, ...)
+ * penalty (increased on each tick, decreased on yield)
+
+ In contrast to xv6, we do not have any scheduler thread, so the kernel
+ thread becomes a container after bootstrap and if container is killed,
+ next container is scheduled (or in lack of container kernel panics).
+
+User Space Bootstrap
+
+ The initial user mode binary is embedded into the kernel as there is no
+ disk driver in kernel. All binaries use custom link script so there are
+ only .text, .rodata, and .data, and they have proper permissions so
+ they can be loaded into separate pages.
+
+ The ELF file is parsed for timer interrupt handler which needs to be
+ passed to function allocating quantum (so it will be scheduled to run).
+
+System Calls
+
+ xv6 uses interrupt 0x10 for system calls. We use the native
+ syscall/sysret instructions present on amd64 architectures specifically
+ designed for system calls.
+
+ Since the syscall instruction uses rcx register to store the returning
+ instruction pointer we copy the value in the register to caller saved
+ register and restore it back in the syscall to preserve C calling
+ convention.
+
+ We may need to move away from C calling convention due the need of
+ having more than 6 function arguments for the syscall (7-th is on the
+ stack and uOS is prohibited from touching any user memory).
+
+ UOS relies on a C calling convention for system calls, but it must
+ clear caller saved registers before returning to the user as it could
+ leak information from the kernel (e.g., private key).
+
+Entering User Mode
+
+ Since amd64 has no "direct" instruction to switch to user mode we have
+ to pretend that the container is returning either from interrupt
+ (iretq) or system call (sysretq). We used the system call intruction as
+ it does not touch user's stack, thus we can avoid writing to user's
+ memory (which we avoid for security reasons).
+
+Traps and Interrupts
+
+ 256 slots total.
+
+ Must be acknowledged
+
+ Priorities, higher is handled first
+
+ TODO: do software excetions need to be acknowledged?
+
+ Needs one Task State Segment (TSS)
+
+ User space exception list:
+ * #DE - divide by zero (vector 0)
+ * #PF - page fault (vector 14) - TODO: user can't read CR2; should
+ kernel push CR2?
+ * #XF - SIMD floating point (vector 19)
+
+ User space rare exceptions (which we let double fault):
+ * #SS - stack exception (vector 12) - TODO: detects canonical address
+ issue
+ * #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions
+ and very rare
+ * #GP - general protection (vector 13) - TODO: mostly for unaligned
+ SSE
+ * #AC - alignment check (vector 17) - can happen only if CPL = 3
+
+ Kernel space exception list:
+ * #DF - double fault (vector 8)
+ * #MC - machine check (vector 18)
+
+ Legacy exceptions:
+ * Non-maskable interrupt (vector 2)
+ * #OF - overflow (vector 4)
+ * #BR - bound range (vector 5)
+ * #NM - device not available (vector 7)
+ * #TS - invalid TSS (vector 10)
+ * #NP - segment not present (vector 11)
+ * #MF - x87 floating point (vector 16)
+
+ Story
+
+ Thanks to #osdev for help.
+
+ After implementing divide zero exception the kernel start rebooting.
+ Changing the .text segment to writable stopped the restarting and
+ seemingly fixed the problem. However, exception should not modify the
+ .text segment so after verifying the IDT is correct I had to check what
+ changed in .text segment.
+
+ I copied all .text data and run comparison after the fault happened.
+ The mismatch was found:
+mismatch at FFFFFF80000000C0: 20990000000000 20980000000000
+
+ Which refered to GDT code segment entry and specifically the accessed
+ bit got flipped. The reason is that during fault the code segment is
+ loaded, but for syscall/sysret it is not, so it was confusing.
+
+ For now, the solution was to move non-bootstrap GDT to .data segment.
+ NOTE: bootstrap GDT is in .text segment and thus the page for starting
+ APs must be writable.
+
+ Division by zero
+
+ When asked on IRC to about advancing RIP after division by zero I got
+ this response :).
+
+ It's *possible*, but it's also likely to break dozens of known good
+ assumptions about computer science and invoke a metric shitload of
+ undefined behaviour in the process (Kazinsal #osdev)
+
+ Timer Interrupt
+
+ So far, all non-critical exceptions were handled by currently running
+ user mode container. Unfortunately, I am not sure how to make the LAPIC
+ timer in user mode as it requires writes to a MSR to reset it (TSC
+ deadline), and more importantly kernel has to relinquish control in
+ case of adversary/buggy container that is reluctant to yield on the
+ interrupt.
+
+ Initially, I thought setting up two timers, one at higher frequency
+ (user mode) and the other at lower frequency (kernel mode) would be
+ neat, but LAPIC has only one timer interrupt vector. Thus, the current
+ plan is to point the interrupt vector to kernel which will bump up
+ penalty integer (if next quantum is a different container) and returns
+ to present container. Since the registers on context switch are saved
+ by containers, kernel must (I think) use either user mode stack (can be
+ read-only) or registers to at minimum save current RIP of the user
+ container before jumping to the handler. If sysretq is used, RCX and
+ R11 must be saved on user stack in addition, thus to better mimick
+ traditional exception, a user read-only stack seems as a better
+ solution (there is the GS/FS register too, but I rather have user
+ complete freedom and not reserve any registers - probably makes it more
+ portable too). At the end, I decided to create read-only container
+ environment (basically, shared memory with kernel) that can be used to
+ share information such as, the current penalty count, last RIP (on
+ interrupt), next quantum container id, and other different kernel
+ counters for active container. This area requires one frame (2 MiB) and
+ is mapped as write into kernel, and as read into user. The policy of
+ kernel never reading any user mode memory (only registers are allowed)
+ is still enforced.
+
+ After kernel returns to the user mode interrupt handler, the container
+ is expected to save its context if it is necessary and yield. There are
+ exceptions, for example if the next quantum is the same container,
+ there is no need to save anything nor yield, the container can restore
+ the stack pointer and jump back to the saved or different RIP. This
+ mechanism can be used to implement user space threads by simply
+ allocating few consecutive quanta (such as 10, each 1 ms long) and
+ yield only on the last one, so the kernel can switch to different
+ container.
+
+ Normally, I use null segment for SS, but when returning from CPL < 3
+ into CPL = 3 via iretq, the SS will get reloaded and will cause general
+ protection fault. Thus a not null selector must be created in GDT and
+ SS must be loaded with that selector index. We use a simple trick where
+ SS gets loaded with proper selector after executing sysretq instruction
+ (starts user mode container). It is possible to use sysretq, but then
+ two extra registers needs to be stored in info (rcx and r11); with
+ iretq we need to store only rip. For user space exceptions we do not
+ follow the manual and avoid iretq as it would set the accesses bit in
+ GDT entry and page fault. (TODO: manual explicitly requires to use
+ iretq)
+
+ TODO: handle what happens if container's interrupt handler gets
+ interrupted (I am thinking about unique ID written by kernel into info
+ as RIP can't be used reliably to identify interrupts). The container
+ should be penalized (and eventually killed), but more importantly,
+ there must be a plan for handling the user space stack in sane way (and
+ not losing some space on it, due to RSP being bump in handler, but not
+ decremented).
+
+ Protocol:
+ 1. interrupt jumps into kernel handler
+ 2. if current container penalty is 3, kill it and schedule next
+ container
+ 3. increase penalty
+ 4. update RIP in the info
+ 5. set next LAPIC deadline
+ 6. jump into user handler
+ 7. user is expected to save registers and yield
+ 8. decrease penalty (and if negative set to 0)
+ 9. schedule next container
+
+ Notice, the penalty indicates if there is any nesting in timer
+ interrupt. A well behaving program should always have penalty equal 1
+ when entering timer interrupt handler.
+
+ Since the user mode bootstrap container must handle timer interrupt, we
+ parse the ELF symbol table and search for timer_interrupt_asm assembly
+ function. This function is then called by kernel timer interrupt
+ handler.
+
+ The kernel interrupt handler is fully written in assembly to avoid any
+ surprises in terms of register modifications. The current
+ implementation needs to push and pop only 3 registers, RAX, RCX, and
+ RDX.
+
+Compile Flags
+
+ -mno-red-zone (disable red zone as interrupts do not respect convention
+ of leaving extra stack space for leaf functions)
+
+Tasks
+
+ * tests
+ + [ ] frame allocator
+ + [ ] sha256
+ * style
+ + [X] header only
+ + [ ] Intel-style assembly
+ * bootloader
+ + [X] load rest of bootloader with BIOS (sort of 2 stage)
+ + [X] switch from real mode to long mode
+ + [ ] query video modes and save this information at fixed
+ address
+ + [ ] switch to most suitable video mode
+ + [X] load ELF binary containing kernel (uses IDE)
+ + [ ] switch to SATA?
+ + [ ] boot on real hardware (needs SATA)
+ * kernel
+ + [X] make work on KVM (needed to disable SMAP on Haswell)
+ + [?] add performance counters to evaluate impact of recursive
+ paging on TLB misses
+ o abandoned recursive paging due to clunky mapping of other
+ page tables
+ + [X] start application processors (AP)
+ + [X] create kernel process for each application processor (AP)
+ + [X] add spinlock (test it by cores racing on it)
+ + [X] use RDGSBASE to get pointer to core structure instead of
+ the current offset stuff
+ + [X] check required feature set with CPUID
+ + [ ] SKINIT instruction (read Security section in manual)
+ + [ ] restructure so the end is at 16 MiB boundary
+ + [ ] avoid global variables
+ + [ ] use ILP64 mode (sizeof (int) == 8) - seems hard as there
+ is no compiler support
+ + [X] timer (local APIC; TSC-Deadline mode)
+ + [X] kernel double fault handler
+ + [X] enable SCE (system call extensions) in EFER
+ + [X] add system call support (syscall)
+ + [ ] syscall user rip security (validate it is canonical?)
+ + [X] add containers (similar to process abstraction)
+ + [X] load user space bootstrap ELF embedded inside kernel
+ + [X] page permissions for ELF segments
+ + [X] parse ELF symbol table to find timer interrupt handler
+ + [X] switch to user space (via sysretq)
+ + [X] use x2APIC (through MSR) instead of old xAPIC
+ + [X] wait 10 ms on AP startup
+ + [ ] IPC (async to avoid deadlocks...)
+ + [ ] AVX2
+ + [X] one TSS
+ + [ ] sleep idle cores (hlt instructions)
+ + [ ] expose some data structures (and update during task
+ switch?)
+ + [ ] Continuation style system (TODO: read papers on it)
+ + [ ] init system (so servers can reserve resources)
+ + [ ] struct type for physical addresses (what about linear
+ addresses? always a pointer?)
+ + [X] add HPET support (to compute TSC rate)
+ + [ ] compute TSC rate for each core independently (good idea?)
+ + [ ] destroy container on double fault or other error
+ * memory management (only 2 MiB pages, thus only PML4, PDP, and PD is
+ used)
+ + [X] boot page table
+ + [X] kernel page table
+ + [X] recursive PML4 (no longer used for kernel space)
+ + [X] frame allocator
+ + [X] page allocator
+ + [X] heap allocator (only alloc, no free)
+ + [X] duplicated page table (allows easy traversal of any page
+ table - even inactive one)
+ + [X] page mapping interface (global marked pages for kernel to
+ avoid TLB invalidation; write protect bit CR0)
+ + [ ] alignment checking in user space?
+ + [X] Supervisor-Mode Execution Prevention (CR4.SMEP)
+ + [X] non-executable pages (NXE in EFER)
+ + [ ] TLB (tagged)
+ + [X] shared memory (uses capabilities)
+ + [ ] replace heap with slab allocator
+ * scheduling
+ + [ ] penalty protocol for wonky containers
+ + [ ] round-robin scheduler
+ + [ ] container migration to other core
+ * syscalls
+ + [X] log to UART
+ + [ ] frame alloc/free
+ + [ ] map/unmap
+ + [X] alloc/free container
+ + [X] quanta alloc, yield
+ + [X] userspace register store
+ * debugging
+ + [X] setup .gdbinit for automatic GDB startup
+ * security
+ + [ ] separate page table for user and kernel (KAISER)
+ + [X] use guard page for each core kernel stack
+ + [X] proper page flags for .text, .rodata, .data
+ + [X] capability-based resource management
+ + [ ] separate memory pools for each struct type (attacker can
+ only modify same type of struct)
+ + [ ] MTTR (prevent DMA to write into kernel memory)
+ + [ ] kernel ASLR (both linear and physical address; PIC needed)
+ + [ ] random location for user stack
+ + [ ] finer granularity permissions for capabilities
+ + [ ] check if AES can be used instead of sha256
+ + [ ] namespaces?
+ + [ ] string type with length?
+ + [ ] stack protection
+ + [ ] do not map kernel into user space? (recent patches in NT
+ and Linux do it)
+ * libs
+ + [X] sprintf
+ + [X] sha256 routine
+ + [X] random numbers (via rdrand instruction)
+ + [X] strcmp
+ * user space
+ + [X] simple game of life render test
+ + [X] exceptions (divide by zero, page fault, ...)
+ + [X] timer interrupt
+ + [ ] compositor
+ * ACPI
+ + [X] parse MADT for detecting CPUs
+ + [X] build parsed struct with LAPICs
+ + [ ] parse x2APIC (Qemu unfortunately has ACPI 1.0 only)
+ * drivers
+ + [ ] SATA (high priority)
+ + [X] basic serial port
+ + [X] IDE
+ + [ ] keyboard (needs USB stack)
+ + [ ] mouse (needs USB stack)
+ + [ ] Intel GPU driver (blitting support)
+ * graphics
+ + [ ] GUI library
+ + [ ] tiling manager
+ * file system
+ + [ ] sector-based addressing (similar to page frames)
+ + [ ] memory mapped filesystem?
+ + [ ] some filesystem format
+ * networking (not likely before Christmas)
+ + [ ] ethernet driver
+ + [ ] packet filter (downloadable?)
+ * JIT (use wasm or some other typesafe IR and thus run everything at
+ ring 0)
+ + [X] interpreter of a subset of WASM
+
+References
+
+ * Exokernel: An Operating System Architecture for Application-Level
+ Resource Management
+ * Singularity: Rethinking the Software Stack
+ * Amoeba: a distributed operating system for the 1990s
+ * Scheduler Activations: Effective Kernel Support for the User-Level
+ Management of Parallelism
+ * A Comparison of Scheduling Algorithms for Multiprocessors
+ * L4 Microkernels: The Lessons from 20 Years of Research and
+ Deployment
+ * xv6, JOS, seL4
+ * osdev wiki
+ * Operating Systems Design and Implementation
+ * AMD64 Volume 1, 2, 3
+ * ATA Interface Reference Manual
+ * Advanced Configuration and Power Interface (ACPI) Specification
+ * ELF-64 Object File Format
+ * VESA BIOS EXTENSION (VBE) Core Functions Standard
+ * [3]The little book about OS development
+ * [4]Reading privileged memory with a side-channel
+ * [5]Kaiser
+ * [6]Hyperkernel: Push-Button Verification of an OS Kernel
+ * [7]Dune: Safe User-level Access to Privileged CPU Features
+ * Modern Operating Systems - 3rd edition, A. Tanenbaum
+
+References
+
+ 1. https://klacansky.com/notes/data/useless-os.zip
+ 2. http://wiki.osdev.org/ATA_PIO_Mode
+ 3. https://littleosbook.github.io/
+ 4. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
+ 5. https://klacansky.com/notes/reference/uos/kaiser.pdf
+ 6. https://klacansky.com/notes/reference/uos/hyperkernel.pdf
+ 7. https://klacansky.com/notes/reference/uos/dune.pdf