1 files changed, 978 insertions, 0 deletions
diff --git a/doc/klacansky.com_notes_useless-operating-system.txt b/doc/klacansky.com_notes_useless-operating-system.txt
new file mode 100644
index 0000000..cc15e24
--- /dev/null
+++ b/doc/klacansky.com_notes_useless-operating-system.txt
@@ -0,0 +1,978 @@
+                            Useless Operating System
+
+   [1]Code
+
+   In my final project for the advanced operating system class taught by
+   John Regehr, I implemented parts of an operating system inspired by
+   exo/micro kernel design.
+
+   The overall goal is to expose hardware as much as possible while
+   providing protection.
+
+   I decided to stick with amd64 architecture as porting to aarch64 seems
+   unfeasible given the time frame and I primarily use amd64 CPUs so I
+   want to know more about them. UOS does not do identity mapping of
+   physical memory outside boot thus kernel will be using virtual memory
+   in almost same way as user mode programs.
+
+   After implementing small OS I recommend it to all low-level or high
+   performance oriented programmers. More importantly, OS development is
+   joy, especially the design of interfaces and low-level tinkering.
+   Unfortunately, developing a usable operating system is no small task
+   (all those drivers) and user mode applications severely constrain the
+   kernel design. I therefore moved to compilers and programming
+   languages, which are tractable for a single person (in the days of
+   LLVM), and can significantly improve ones life by making programming
+   fun againk
+
+   Also, the development made me finally appreciate debugger, where
+   simpler printf can't be used as registers need to be inspected before
+   and after instruction execution, e.g., before and after sysret.
+
+Important
+
+     * disable red zone with compiler flag -mno-red-zone (can get
+       corrupted in nested interrupts)
+     * make sure pointers are in canonical form
+
+Motivation
+
+   In the class we used xv6, a clean Unix clone with approachable code.
+   However, adding VESA BIOS Extensions framebuffer required new system
+   calls and kernel changes which was clunky. I especially disliked that
+   any experiment required tinkering with the kernel, and provided
+   abstractions were quite high level.
+
+   Moreover, the xv6 operating system runs on 32-bit architecture, but
+   most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is
+   much simpler as it removes tons of old cruft such as segmentation and
+   has simple system calls.
+
+   The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a
+   bit L4 (microkernel). For example, almost all exceptions are handled by
+   user and not kernel, including page fault. The timer interrupt is
+   forwarded to user, but in case of buggy/malicious container, kernel
+   will wait few slots and then kill the misbehaving container. Since the
+   user knows more about its use of registers, it can save only the
+   necessary ones, e.g., may not save floating point registers. On context
+   switch, kernel stores only instruction and stack pointers.
+
+   I decided to go with capability-based permission system, where each
+   resource is associated with a "lock" and the key to the lock consists
+   of hashed permissions and the lock itself. Kernel/user can easily
+   verify the key valid by hashing the lock and permissions of the key and
+   checking if the key's hash matches. This approaches makes kernel
+   simpler, as memory that can't be exposed to user (e.g., < 1 MB) is
+   simply associated with a lock that only kernel knows. Therefore, no
+   special range checks are required in the kernel when user maps memory.
+   Shared memory is also simple to implement, as a container can easily
+   create keys for the memory it owns and give it to a newly created
+   container. Also, kernel uses same interface as user, i.e., it presents
+   key if it needs to map memory. TODO: how to safely store key if
+   Meltdown is present?
+
+   UOS in constrast to Linux does not have a text mode. All assembly is in
+   separate files as I am convinced inline assembly is not good solution.
+   Moreover, kernel consists of only single .c file, which simplifies the
+   build system to a single compile command. It targets desktops and uses
+   2 MiB pages.
+
+Amd64 Architecture
+
+   Most desktops or laptops contain amd64 processor either from Intel or
+   Amd. Since UOS targets personal computers it supports only Amd64
+   architecture. This choice simplifies the kernel as 32bit architectures
+   have much more legacy cruft. For example, Amd64 has no segmentation,
+   good syscall interface, simpler instruction set and only one necessary
+   task segment. The arguments for C ABI are passed in registers. UOS does
+   not allow kernel to read any user memory without explicit mapping.
+   Moreover, since there is more registers, all syscall arguments must use
+   them.
+
+   Any virtual address must be in canonical form, i.e., if bit 47
+   (counting from 0) is set, then all most significant bits must be also
+   set. This quirk divides the address space into 2 pieces with huge hole
+   in between them. Pretty annoying as UOS is not higher half kernel.
+
+   The System V ABI requires first 6 integer or pointer arguments to be
+   passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The
+   remaining arguments (7 and more) are passed on stack.
+
+Overview of Startup
+
+   At boot, Basic Input/Output System (BIOS) loads specific sector of 512
+   bytes into memory at fixed address and starts execution. Usually, this
+   program is a bootloader that loads the kernel, or in our case second
+   stage of the bootloader as 64 bit assembly takes more space and I could
+   not fit the ELF loader into a single sector. The second stage parses
+   ELF file (kernel) on disk and loads it into memory. The kernel
+   bootstraps itself by setting up physical and virtual memory managers
+   and core struct for the boot processor. It then switches to new page
+   table and stack and continues by starting other processors and loading
+   user init program into a container. We will now discuss individual
+   stages in more detail.
+
+Bootloader
+
+   We focus on using BIOS instead of UEFI due to simplicity of the setup
+   as there is no need for additional tools to build BIOS bootable binary.
+   The disadvantage of BIOS is all the legacy cruft, such as starting in
+   Real Mode and having only 512 bytes for bootloader.
+
+   The purpose of bootloader is to do initial setup (such as video mode,
+   virtual memory) and load kernel from a disk into a memory. Since we
+   could not squeeze bootloader with ELF loader into 512 bytes like xv6,
+   we split the bootloader into two stages, where the first stage's
+   purpose is to only load the second stage. For now, we use 4 KiB of
+   space for both stages and kernel starts at sector 8.
+
+   After the video mode and long mode setup, the bootloader reads kernel
+   from ELF binary into predefined location and jumps into the kernel
+   entry point. IDE driver is used to perform reading from a disk. UOS
+   does not support multiboot standard as xv6, so it will work only with
+   our bootloader.
+
+   The first stage must be 512 bytes and the assembly needs to be padded
+   to fill 512 bytes (and adding MBR bytes at 510 offset).
+
+  A20 Line
+
+   Used to ignore 21st bit of address to emulate wraparound from older
+   architectures. We have to enable this 21st bit so we can address all
+   memory.
+
+   xv6 bootloader uses keyboard controller to enable this bit. We opt for
+   BIOS and its functions.
+movw $0x2401, %ax
+int $0x15
+
+   http://www.win.tue.nl/~aeb/linux/kbd/A20.html
+
+  Global Descriptor Table
+
+   entry point of bootblock.o must be at the address 0x7C00 which can be
+   double checked by doing objdump -d bootblock.o
+
+   The simplest "debugging" can be done by writing to screen; segmentation
+   must be used as CGA address of 0xB8000 is outside 16-bit range. This
+   addressing can be done by setting %ds to 0xB800 (segment is multiplied
+   by 16 on access).
+
+   We directly transition from real mode to long mode without going
+   through protected mode. Since the boot sector is small (512 bytes) we
+   have to manually set up page tables instead of loading them from an
+   array.
+
+   We can't use .data segment in bootloader C code (for example to store
+   static string) as only .text is copied with objcopy (-j flag; TODO).
+
+  Bootstrap Page Table
+
+   We map same physical address [0, 1 GiB) to the virtual addresses [0x0,
+   1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as
+   we linked kernel to reside at address 0x8000100000 (as we want it to be
+   mapped into each process space) and when we jump to it it must be
+   mapped properly. The first 1 GiB in virtual address space are mapped
+   because it is the space in which bootloader lives and from where we
+   will perform the jump.
+
+   The reason for choosing address where all 39 bits are zero is so we can
+   have only top level page directory (PML4) which shares mapping with the
+   [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry
+   to the same entry in PDP (next level page table) as for the first
+   entry.
+
+   direct switch to long mode
+
+  Real Hardware
+
+   UOS does not boot on my desktop as it would need SATA driver. If
+   booting from USB, tt turns out if the disk lacks partition table the
+   BIOS will set DL register to 0x0, i.e., floppy, and floppy does not
+   support linear base addressing (LBA) which we use in IDE driver.
+
+ELF
+
+   Fixed-length ELF header and variable-length program header listing
+     * .text program instructions
+     * .rodata read-only data (we should make it only read-only memory)
+     * .data program's data (e.g., global variables)
+     * .bss is a reserved section for unitialized global variables (we
+       need to zero them)
+     * VMA is link address LMA is load address of .text section (as the
+       offsets in code are absolute due to not using Position Independent
+       Code)
+
+   For some stupid reason .rodata is together with .text in same segment,
+   making .rodata executable. (http://thi�baud.fr/rodata.html)
+
+  Symbol Table
+
+   Symbol table associates names with linear addresses and thus can be
+   used by the loader to find addresses of exception and interrupt
+   handlers (those must have some agreed names).
+
+Drivers
+
+  UART
+
+   Serial port for debugging. Uses port-mapped I/O which ignores virtual
+   memory (thus even if we do not map first 2 MiB it works).
+
+  IDE driver
+
+   [2]Reference
+
+   We need to read kernel from the primary ATA disk in bootloader to
+   initialize the core OS.
+
+   We wrote the function for reading sectors in assembly to avoid inline
+   assembly. Also, we could wrap individual instructions in assembly
+   functions but it was much simpler this way.
+
+Kernel Bootstrap
+
+   Mostly disabling legacy stuff and enabling modern features as the
+   processor assumes it woke up in last century.
+
+Memory Management
+
+     * frame allocator (2 MiB)
+     * avoid mapping physical memory (higher half)
+     * tried recursive paging (pain to modify other processes, TLB
+       pressure)
+     * ended up with linear/physical dual mapped page tables
+     * kernel pages marked as global
+     * kernel heap (used for most allocations)
+
+   There are roughly three levels of memory management:
+     * physical memory
+     * virtual memory
+     * heap
+
+   We use recursive page table mapping to avoid the need to remap page
+   tables each time we need to access them or to use contiguous memory
+   storage for the VA to PA and PA to VA conversions which are necessary
+   for updating the page table. The recursive mapping is the simplest as
+   we need these conversions only for page table manipulation. In other
+   cases such as memory mapped IO we would use a simple mapping.
+
+   Hmm, recursive mapping allows us to access entries of page table only
+   for currently loaded page table which is really annoying.
+
+   There are several approaches:
+     * remap physical memory to kernel and use simple +-KERNEL_BASE
+     * use recursive page table (may use it for user space)
+     * keep small space available and map the page table before modifying
+       it
+     * store virtual and physical addresses for page table (have to keep
+       them in sync, allows for custom format which gets rid of canonical
+       address issues NX bit, problems with dirty and accessed flags)
+
+  Recursive Page Table
+
+   By mapping pml4 entry to pml4 we can access all paging structures for
+   current process. The main advantage is the simplicity and constant
+   virtual address space used in kernel (since only pml4 is mapped in
+   kernel space). If we need to modify another process's page table we can
+   point it to a pml4 slot and use the recursive trick on pdp (which is
+   pml4 of the other process).
+
+   Unfortunately, there are many disadvantages:
+     * since we use 2 MiB pages we have to be careful as the recursive
+       mapping behaves as if we used 4 KiB pages
+     * large TLB footprint (as we use 2 MiB pages)
+     * clunky to modify other process page table (the pdp becomes pml4,
+       ...) - rare, but needed for container creation
+     * weird bootstrap during boot
+     * limited to x86 (not big deal for now)
+
+  Duplicated Page Table
+
+   For kernel we have a separate page table data structure that mimics
+   hardware page table.
+
+   TODO: use completely different data layout so we can easily traverse
+   the pointers from pml4 -> pdp -> pd.
+
+   For user space we may use recursive mapping to allow easy view of the
+   current virtual address space. (TODO: what about TLB pollution? We
+   can't easily map page tables as they live in kernel heap)
+
+   Backing pages cannot be moved without knowledge of the heap to avoid
+   having a separate structure for page tables.
+
+  Initialization
+
+   Since at boot we do not have valid kernel page table we have to build
+   one. This build process is quite different from modifying current page
+   table (one in CR3) as we can't use recursive mapping to access the
+   pml4, pdp, and pd tables. Furthermore, since the page table is not used
+   we do not need to invalidate TLB entries.
+
+   We assume the heap is physically contiguous at the boot time and use a
+   simple offset to setup a initial kernel page table. Alternative
+   solution would be to walk the boot page directory and find linear
+   address for a given physical address (the other way is handled by
+   recursive mapping).
+
+  Manager
+
+   Still go back and forth between the options but leaning towards thin
+   layer in kernel which will provide frame management and ability to map
+   virtual to physical addresses. The kernel would enforce permissions.
+
+   The other options are:
+     * provide only sbrk to grow heap (pain in ass to map any specific
+       frames)
+     * allocate fixed number of frames for kernel and give rest of frames
+       to user space memory manager (two context switches, painful)
+     * I can't think of anything else
+
+  Security
+
+   If I go with 2 MiB pages than we need at least 4 pages for a program (8
+   MiB):
+     * .text, read-only
+     * .rodata, read-only, no-execute
+     * .data, write, no-execute
+     * stack, write, no-execute
+
+  Heap
+
+   I started with a simple heap that allocates multiple of 4096 bytes, but
+   after watching Switch hacking talk (TODO reference) I plan to have a
+   separate allocator (slab) for each struct type, thus avoiding the
+   potential vulnerability of overwriting different struct that can happen
+   on heap (as two different types of struct cannot possibly overlap, and
+   also there are no pointer casting issues).
+
+Security
+
+     * NX, WP, SMAP, SMEP
+     * guard pages
+     * proper permissions for elf segments
+     * capability-based permission (kernel/user are "same")
+
+   The UOS is a exo/micro kernel design and thus it runs all drivers in
+   user mode (ring 3).
+
+   Kernel is prohibited to write to any user mapped memory directly and
+   must map the memory to separate address if it wants to write there (for
+   example for the info structure). Kernel never reads any user memory and
+   all syscall arguments must be passed in registers.
+
+   The recent side-channel attacks are concerning and I will probably go
+   with the separate page table for kernel and user (with the addition of
+   PCID). The Spectre attack seem much harder to mitigate. The attack
+   allows user to read all memory, thus it could read kernel key and
+   corresponding capabilities created from the key. One solution would be
+   to store the keys and capabilities in a special area not mapped when
+   user space is in flight (probably hard to not make a mistake); could we
+   create kernel where everything is non-sensitive?.
+
+   he UOS is a exo/micro kernel design and thus it runs all drivers in
+   user mode (ring 3).
+
+   Kernel is prohibited to write to any user mapped memory directly and
+   must map the memory to separate address if it wants to write there (for
+   example for the info structure). Kernel never reads any user memory and
+   all syscall arguments must be passed in registers.
+
+   The recent side-channel attacks are concerning and I will probably go
+   with the separate page table for kernel and user (with the addition of
+   PCID). The Spectre attack seem much harder to mitigate. The attack
+   allows user to read all memory, thus it could read kernel key and
+   corresponding capabilities created from the key. One solution would be
+   to store the keys and capabilities in a special area not mapped when
+   user space is in flight (probably hard to not make a mistake); could we
+   create kernel where everything is non-sensitive?.
+
+  Capability-based Mechanisms
+
+   Instead of user/group based system have fine grained badges that give
+   owner permissions. Access control lists (ACL) store list of things that
+   have access to given resource. Capabilities store list of resources a
+   thing can access.
+
+   Strive for simplicity
+
+   c-list (capability list), trapdoor functions?
+
+   random numbers (rdrand, rdseed)
+
+   How to do these?
+     * capability per frame
+     * capability per core
+
+   Anton Burtsev, http://www.ics.uci.edu/~aburtsev/
+
+   References:
+     * https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html
+     * https://software.intel.com/en-us/articles/intel-digital-random-numb
+       er-generator-drng-software-implementation-guide
+
+  Design
+
+   Initial approach was to have each frame associated with capability,
+   thus when use allocated a frame he would get a key to it. This method
+   was clunky as it would be tricky to allocate multiple frames and it
+   placed large burden onto user. Furthermore, we were associating the
+   object number with the key to identify it, which caused same problems.
+
+   Instead, we use user provided key (can be random number) when a frame
+   is acquired, thus the same key can be used with any number of frames.
+   User can create capability by hashing the key and permissions. Then
+   when any access such as mapping the frame is performed, OS performs
+   same hashing and check if the hash in capability matches the computed
+   one.
+
+   TODO: test if we can use AES instead of sha256
+
+Multicore
+
+     * ACPI table to get list of cores
+     * x2APIC (MSR)
+     * GS register
+     * one kernel stack per core
+     * create container in user space (aka fork)
+
+   There are at least two ways to detect Application Processors (AP) after
+   Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor
+   Specification tables or ACPI (specifically MADT).
+
+   xv6 uses MPS tables which are obsolete, so we focus on the ACPI
+   approach. The ACPI table pointer structure lives at some memory range
+   and can be checksumed to verify if the struct is really an ACPI table.
+   Since Qemu does not support x2APIC we have to memory map the LAPIC
+   registers to query the APIC id (other alternative is to use CPUID). The
+   ACPI table tells us this address and moreover it has a flag if core is
+   working (the table is filled during boot by each core).
+
+   After parsing the tables we loop through all core entries and if the
+   LAPIC id is different from the id of BSP we attempt to start the core
+   with interprocess interrupt. Then we wait until the core updates the
+   shared core array with running flag. We could do exponential startup
+   for large core counts, but currently it is not an issue. Note, we use
+   x2APIC which uses MSR as an interface instead of memory mapping.
+
+   Each core has its own kernel stack and pml4 with shared entries for the
+   kernel data structures (including heap) and for the entry which holds
+   all core kernel stacks (TODO: should cores see other cores stacks?).
+
+User Mode
+
+     * params passed only in registers
+     * fast syscall/sysret
+     * low level (map memory, create container)
+
+Graphics
+
+   The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with
+   256 colors with linear addressing) with BIOS interrupt 0x10 in
+   protected mode. This mode switch is done in bootloader before protected
+   mode is entered as only real mode can sanely interact with BIOS to
+   manipulate VGA graphics.
+# two moves can be combined into movw $0x0013, %ax
+movb    $0x0, %ah  # call function 0x0 to set video mode
+movb    $0x13, %al # use video mode 0x13
+# or movw $0x13, %ax
+int     $0x10
+
+   After changing video mode, the pixel can be accessed at physical
+   address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000.
+
+   Unfortunately, the small resolution (and only 256 colors) provided by
+   VGA is restrictive and unsuitable for any graphics based system.
+
+   The simplest solution is to use VESA BIOS Extensions (VBE) which can
+   provide more than 1kx1k resolutions with up to 16M colors. Even though
+   modes defined by VESA are considered deprecated, they are likely to
+   work and allow to avoid finding suitable mode. This predefined modes
+   come handy as real mode needs to be used and xv6's boot code can't be
+   more than 512 bytes. Other solution would be to drop down to real mode
+   before switching to long mode (TODO).
+
+   For example, setting 1280x1024 with 16M colors and loading its
+   information.
+movw    $0x4F02, %ax    # set VBE mode function
+movw    $0x411B, %bx    # 1280x1024 24bit mode (with linear framebuffer)
+int     $0x10
+
+movw    $0x4F01, %ax    # get mode information function
+movw    $0x411B, %bx    # mode
+movw    $0x0, %di       # store the mode information struct (256 bytes) at physi
+cal address 0x0
+int     $0x10
+
+   Than we can qeury mode parameters by loading the ModeInfoBlock
+   structure. For simplicity, we load it at fixed physical address 0x0 as
+   kernel code starts on address 0x100000. Since we care only about the
+   PhysBasePtr field (tells us location of linear framebuffer), we could
+   pass it as function parameter to the kernel bootstrap function.
+
+   Now, the tricky bit; we need to create virtual memory mapping to the
+   physical address in PhysBasePtr. On my machine in Qemu, the address is
+   0xFD000000 and we need to map 4 MiB (depends on resolution and bit
+   depth) up to the address 0xFE000000 (defined in memlayout.h as
+   DEVSPACE, i.e., memory mapped I/O).
+
+   I used mappages (vm.c) in function main (main.c) to map some fixed
+   virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr
+   before starting other processors.
+struct vbemodeinfo const *const vbe = (void *)KERNBASE;
+int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
+mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr,
+PTE_W);
+
+   Now, pixels can be written at the address FRAMEBUFFER anywhere in
+   kernel. I was surprised I could not find any exported function to
+   perform the memory mapping explicitly, another reason for exokernel.
+
+Floating Points
+
+   By default disabled on amd64
+
+   We can test in bootloader first if AVX2 is supported, and if not hang
+   the OS. Some OS will trap if user used floating point unit and enable
+   it. This mechanism allows OS to avoid storing/restoring floating point
+   registers during context switch if process does not use those
+   registers. Since UOS kernel does not save context at all (except few
+   registers on timer interrupt), it enables AVX2 at the kernel startup
+   and let's user to do register store/restore if necessary. In fact, this
+   approach is simpler as user knows what registers are being used and can
+   do some tricks. Moreover, since kernel does not use floating point
+   registers, user can pass values in registers to the next scheduled
+   container.
+
+Kernel
+
+   We map kernel at high virtual address as user space programs assume
+   loading at address 0.
+
+  Global Descriptor Table
+
+   The GDT from bootloader has only kernel code segment and no user code
+   segment, nor Task State Segment (TSS). Initially, we had one GDT shared
+   by all cores, but each core needs its own TSS, thus we could either
+   create an entry for each core, or better, duplicate whole GDT to obtain
+   full flexibility at mere cost of extra 32 bytes per core. Keeping the
+   GDT separate is the right step into the direction of sharing as little
+   as possible between cores to scale well on many core architectures (>
+   32 cores).
+
+  Memory Layout
+
+   UOS's kernel does not remap physical memory contiguously (higher half)
+   as xv6 does. Instead, the kernel uses the same interfaces as user
+   programs, allocating frame and mapping it. In linear address space, the
+   kernel lives at entry 511 (last) in PML4 page table and thus takes 512
+   GB. We chose to take whole entry for simplicity as it can be easily
+   shared by all user programs. Moreover, we mark the entry as global to
+   avoid flushing TLB on context switch.
+
+   Each core's stack lives in entry 510 and is again marked global. Stacks
+   are one page large (2 MiB), and have guard pages interspersed between
+   the stacks. We do not have kernel stack per container as there are no
+   slow system calls in the kernel (such as IO) that would need to be
+   interrupted. This push of expensive operations to user space makes
+   kernel simpler and easier to reason about.
+
+  Linking
+
+Containers
+
+   Instead of notion of a traditional UNIX process, UOS uses containers as
+   its execution unit abstraction. A container is akin to a process, but
+   it also has access to exceptions, low-level memory management, and
+   scheduling. For example, user can easily implement preemtive threads
+   inside a container.
+
+   In short, container has following entries:
+     * unique identifier
+     * PML4 pointer to a page table
+     * frame for storing rsp and rip during system call
+     * status (running, ...)
+     * penalty (increased on each tick, decreased on yield)
+
+   In contrast to xv6, we do not have any scheduler thread, so the kernel
+   thread becomes a container after bootstrap and if container is killed,
+   next container is scheduled (or in lack of container kernel panics).
+
+User Space Bootstrap
+
+   The initial user mode binary is embedded into the kernel as there is no
+   disk driver in kernel. All binaries use custom link script so there are
+   only .text, .rodata, and .data, and they have proper permissions so
+   they can be loaded into separate pages.
+
+   The ELF file is parsed for timer interrupt handler which needs to be
+   passed to function allocating quantum (so it will be scheduled to run).
+
+System Calls
+
+   xv6 uses interrupt 0x10 for system calls. We use the native
+   syscall/sysret instructions present on amd64 architectures specifically
+   designed for system calls.
+
+   Since the syscall instruction uses rcx register to store the returning
+   instruction pointer we copy the value in the register to caller saved
+   register and restore it back in the syscall to preserve C calling
+   convention.
+
+   We may need to move away from C calling convention due the need of
+   having more than 6 function arguments for the syscall (7-th is on the
+   stack and uOS is prohibited from touching any user memory).
+
+   UOS relies on a C calling convention for system calls, but it must
+   clear caller saved registers before returning to the user as it could
+   leak information from the kernel (e.g., private key).
+
+Entering User Mode
+
+   Since amd64 has no "direct" instruction to switch to user mode we have
+   to pretend that the container is returning either from interrupt
+   (iretq) or system call (sysretq). We used the system call intruction as
+   it does not touch user's stack, thus we can avoid writing to user's
+   memory (which we avoid for security reasons).
+
+Traps and Interrupts
+
+   256 slots total.
+
+   Must be acknowledged
+
+   Priorities, higher is handled first
+
+   TODO: do software excetions need to be acknowledged?
+
+   Needs one Task State Segment (TSS)
+
+   User space exception list:
+     * #DE - divide by zero (vector 0)
+     * #PF - page fault (vector 14) - TODO: user can't read CR2; should
+       kernel push CR2?
+     * #XF - SIMD floating point (vector 19)
+
+   User space rare exceptions (which we let double fault):
+     * #SS - stack exception (vector 12) - TODO: detects canonical address
+       issue
+     * #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions
+       and very rare
+     * #GP - general protection (vector 13) - TODO: mostly for unaligned
+       SSE
+     * #AC - alignment check (vector 17) - can happen only if CPL = 3
+
+   Kernel space exception list:
+     * #DF - double fault (vector 8)
+     * #MC - machine check (vector 18)
+
+   Legacy exceptions:
+     * Non-maskable interrupt (vector 2)
+     * #OF - overflow (vector 4)
+     * #BR - bound range (vector 5)
+     * #NM - device not available (vector 7)
+     * #TS - invalid TSS (vector 10)
+     * #NP - segment not present (vector 11)
+     * #MF - x87 floating point (vector 16)
+
+  Story
+
+   Thanks to #osdev for help.
+
+   After implementing divide zero exception the kernel start rebooting.
+   Changing the .text segment to writable stopped the restarting and
+   seemingly fixed the problem. However, exception should not modify the
+   .text segment so after verifying the IDT is correct I had to check what
+   changed in .text segment.
+
+   I copied all .text data and run comparison after the fault happened.
+   The mismatch was found:
+mismatch at FFFFFF80000000C0: 20990000000000 20980000000000
+
+   Which refered to GDT code segment entry and specifically the accessed
+   bit got flipped. The reason is that during fault the code segment is
+   loaded, but for syscall/sysret it is not, so it was confusing.
+
+   For now, the solution was to move non-bootstrap GDT to .data segment.
+   NOTE: bootstrap GDT is in .text segment and thus the page for starting
+   APs must be writable.
+
+  Division by zero
+
+   When asked on IRC to about advancing RIP after division by zero I got
+   this response :).
+
+   It's *possible*, but it's also likely to break dozens of known good
+   assumptions about computer science and invoke a metric shitload of
+   undefined behaviour in the process (Kazinsal #osdev)
+
+  Timer Interrupt
+
+   So far, all non-critical exceptions were handled by currently running
+   user mode container. Unfortunately, I am not sure how to make the LAPIC
+   timer in user mode as it requires writes to a MSR to reset it (TSC
+   deadline), and more importantly kernel has to relinquish control in
+   case of adversary/buggy container that is reluctant to yield on the
+   interrupt.
+
+   Initially, I thought setting up two timers, one at higher frequency
+   (user mode) and the other at lower frequency (kernel mode) would be
+   neat, but LAPIC has only one timer interrupt vector. Thus, the current
+   plan is to point the interrupt vector to kernel which will bump up
+   penalty integer (if next quantum is a different container) and returns
+   to present container. Since the registers on context switch are saved
+   by containers, kernel must (I think) use either user mode stack (can be
+   read-only) or registers to at minimum save current RIP of the user
+   container before jumping to the handler. If sysretq is used, RCX and
+   R11 must be saved on user stack in addition, thus to better mimick
+   traditional exception, a user read-only stack seems as a better
+   solution (there is the GS/FS register too, but I rather have user
+   complete freedom and not reserve any registers - probably makes it more
+   portable too). At the end, I decided to create read-only container
+   environment (basically, shared memory with kernel) that can be used to
+   share information such as, the current penalty count, last RIP (on
+   interrupt), next quantum container id, and other different kernel
+   counters for active container. This area requires one frame (2 MiB) and
+   is mapped as write into kernel, and as read into user. The policy of
+   kernel never reading any user mode memory (only registers are allowed)
+   is still enforced.
+
+   After kernel returns to the user mode interrupt handler, the container
+   is expected to save its context if it is necessary and yield. There are
+   exceptions, for example if the next quantum is the same container,
+   there is no need to save anything nor yield, the container can restore
+   the stack pointer and jump back to the saved or different RIP. This
+   mechanism can be used to implement user space threads by simply
+   allocating few consecutive quanta (such as 10, each 1 ms long) and
+   yield only on the last one, so the kernel can switch to different
+   container.
+
+   Normally, I use null segment for SS, but when returning from CPL < 3
+   into CPL = 3 via iretq, the SS will get reloaded and will cause general
+   protection fault. Thus a not null selector must be created in GDT and
+   SS must be loaded with that selector index. We use a simple trick where
+   SS gets loaded with proper selector after executing sysretq instruction
+   (starts user mode container). It is possible to use sysretq, but then
+   two extra registers needs to be stored in info (rcx and r11); with
+   iretq we need to store only rip. For user space exceptions we do not
+   follow the manual and avoid iretq as it would set the accesses bit in
+   GDT entry and page fault. (TODO: manual explicitly requires to use
+   iretq)
+
+   TODO: handle what happens if container's interrupt handler gets
+   interrupted (I am thinking about unique ID written by kernel into info
+   as RIP can't be used reliably to identify interrupts). The container
+   should be penalized (and eventually killed), but more importantly,
+   there must be a plan for handling the user space stack in sane way (and
+   not losing some space on it, due to RSP being bump in handler, but not
+   decremented).
+
+   Protocol:
+    1. interrupt jumps into kernel handler
+    2. if current container penalty is 3, kill it and schedule next
+       container
+    3. increase penalty
+    4. update RIP in the info
+    5. set next LAPIC deadline
+    6. jump into user handler
+    7. user is expected to save registers and yield
+    8. decrease penalty (and if negative set to 0)
+    9. schedule next container
+
+   Notice, the penalty indicates if there is any nesting in timer
+   interrupt. A well behaving program should always have penalty equal 1
+   when entering timer interrupt handler.
+
+   Since the user mode bootstrap container must handle timer interrupt, we
+   parse the ELF symbol table and search for timer_interrupt_asm assembly
+   function. This function is then called by kernel timer interrupt
+   handler.
+
+   The kernel interrupt handler is fully written in assembly to avoid any
+   surprises in terms of register modifications. The current
+   implementation needs to push and pop only 3 registers, RAX, RCX, and
+   RDX.
+
+Compile Flags
+
+   -mno-red-zone (disable red zone as interrupts do not respect convention
+   of leaving extra stack space for leaf functions)
+
+Tasks
+
+     * tests
+          + [ ] frame allocator
+          + [ ] sha256
+     * style
+          + [X] header only
+          + [ ] Intel-style assembly
+     * bootloader
+          + [X] load rest of bootloader with BIOS (sort of 2 stage)
+          + [X] switch from real mode to long mode
+          + [ ] query video modes and save this information at fixed
+            address
+          + [ ] switch to most suitable video mode
+          + [X] load ELF binary containing kernel (uses IDE)
+          + [ ] switch to SATA?
+          + [ ] boot on real hardware (needs SATA)
+     * kernel
+          + [X] make work on KVM (needed to disable SMAP on Haswell)
+          + [?] add performance counters to evaluate impact of recursive
+            paging on TLB misses
+               o abandoned recursive paging due to clunky mapping of other
+                 page tables
+          + [X] start application processors (AP)
+          + [X] create kernel process for each application processor (AP)
+          + [X] add spinlock (test it by cores racing on it)
+          + [X] use RDGSBASE to get pointer to core structure instead of
+            the current offset stuff
+          + [X] check required feature set with CPUID
+          + [ ] SKINIT instruction (read Security section in manual)
+          + [ ] restructure so the end is at 16 MiB boundary
+          + [ ] avoid global variables
+          + [ ] use ILP64 mode (sizeof (int) == 8) - seems hard as there
+            is no compiler support
+          + [X] timer (local APIC; TSC-Deadline mode)
+          + [X] kernel double fault handler
+          + [X] enable SCE (system call extensions) in EFER
+          + [X] add system call support (syscall)
+          + [ ] syscall user rip security (validate it is canonical?)
+          + [X] add containers (similar to process abstraction)
+          + [X] load user space bootstrap ELF embedded inside kernel
+          + [X] page permissions for ELF segments
+          + [X] parse ELF symbol table to find timer interrupt handler
+          + [X] switch to user space (via sysretq)
+          + [X] use x2APIC (through MSR) instead of old xAPIC
+          + [X] wait 10 ms on AP startup
+          + [ ] IPC (async to avoid deadlocks...)
+          + [ ] AVX2
+          + [X] one TSS
+          + [ ] sleep idle cores (hlt instructions)
+          + [ ] expose some data structures (and update during task
+            switch?)
+          + [ ] Continuation style system (TODO: read papers on it)
+          + [ ] init system (so servers can reserve resources)
+          + [ ] struct type for physical addresses (what about linear
+            addresses? always a pointer?)
+          + [X] add HPET support (to compute TSC rate)
+          + [ ] compute TSC rate for each core independently (good idea?)
+          + [ ] destroy container on double fault or other error
+     * memory management (only 2 MiB pages, thus only PML4, PDP, and PD is
+       used)
+          + [X] boot page table
+          + [X] kernel page table
+          + [X] recursive PML4 (no longer used for kernel space)
+          + [X] frame allocator
+          + [X] page allocator
+          + [X] heap allocator (only alloc, no free)
+          + [X] duplicated page table (allows easy traversal of any page
+            table - even inactive one)
+          + [X] page mapping interface (global marked pages for kernel to
+            avoid TLB invalidation; write protect bit CR0)
+          + [ ] alignment checking in user space?
+          + [X] Supervisor-Mode Execution Prevention (CR4.SMEP)
+          + [X] non-executable pages (NXE in EFER)
+          + [ ] TLB (tagged)
+          + [X] shared memory (uses capabilities)
+          + [ ] replace heap with slab allocator
+     * scheduling
+          + [ ] penalty protocol for wonky containers
+          + [ ] round-robin scheduler
+          + [ ] container migration to other core
+     * syscalls
+          + [X] log to UART
+          + [ ] frame alloc/free
+          + [ ] map/unmap
+          + [X] alloc/free container
+          + [X] quanta alloc, yield
+          + [X] userspace register store
+     * debugging
+          + [X] setup .gdbinit for automatic GDB startup
+     * security
+          + [ ] separate page table for user and kernel (KAISER)
+          + [X] use guard page for each core kernel stack
+          + [X] proper page flags for .text, .rodata, .data
+          + [X] capability-based resource management
+          + [ ] separate memory pools for each struct type (attacker can
+            only modify same type of struct)
+          + [ ] MTTR (prevent DMA to write into kernel memory)
+          + [ ] kernel ASLR (both linear and physical address; PIC needed)
+          + [ ] random location for user stack
+          + [ ] finer granularity permissions for capabilities
+          + [ ] check if AES can be used instead of sha256
+          + [ ] namespaces?
+          + [ ] string type with length?
+          + [ ] stack protection
+          + [ ] do not map kernel into user space? (recent patches in NT
+            and Linux do it)
+     * libs
+          + [X] sprintf
+          + [X] sha256 routine
+          + [X] random numbers (via rdrand instruction)
+          + [X] strcmp
+     * user space
+          + [X] simple game of life render test
+          + [X] exceptions (divide by zero, page fault, ...)
+          + [X] timer interrupt
+          + [ ] compositor
+     * ACPI
+          + [X] parse MADT for detecting CPUs
+          + [X] build parsed struct with LAPICs
+          + [ ] parse x2APIC (Qemu unfortunately has ACPI 1.0 only)
+     * drivers
+          + [ ] SATA (high priority)
+          + [X] basic serial port
+          + [X] IDE
+          + [ ] keyboard (needs USB stack)
+          + [ ] mouse (needs USB stack)
+          + [ ] Intel GPU driver (blitting support)
+     * graphics
+          + [ ] GUI library
+          + [ ] tiling manager
+     * file system
+          + [ ] sector-based addressing (similar to page frames)
+          + [ ] memory mapped filesystem?
+          + [ ] some filesystem format
+     * networking (not likely before Christmas)
+          + [ ] ethernet driver
+          + [ ] packet filter (downloadable?)
+     * JIT (use wasm or some other typesafe IR and thus run everything at
+       ring 0)
+          + [X] interpreter of a subset of WASM
+
+References
+
+     * Exokernel: An Operating System Architecture for Application-Level
+       Resource Management
+     * Singularity: Rethinking the Software Stack
+     * Amoeba: a distributed operating system for the 1990s
+     * Scheduler Activations: Effective Kernel Support for the User-Level
+       Management of Parallelism
+     * A Comparison of Scheduling Algorithms for Multiprocessors
+     * L4 Microkernels: The Lessons from 20 Years of Research and
+       Deployment
+     * xv6, JOS, seL4
+     * osdev wiki
+     * Operating Systems Design and Implementation
+     * AMD64 Volume 1, 2, 3
+     * ATA Interface Reference Manual
+     * Advanced Configuration and Power Interface (ACPI) Specification
+     * ELF-64 Object File Format
+     * VESA BIOS EXTENSION (VBE) Core Functions Standard
+     * [3]The little book about OS development
+     * [4]Reading privileged memory with a side-channel
+     * [5]Kaiser
+     * [6]Hyperkernel: Push-Button Verification of an OS Kernel
+     * [7]Dune: Safe User-level Access to Privileged CPU Features
+     * Modern Operating Systems - 3rd edition, A. Tanenbaum
+
+References
+
+   1. https://klacansky.com/notes/data/useless-os.zip
+   2. http://wiki.osdev.org/ATA_PIO_Mode
+   3. https://littleosbook.github.io/
+   4. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
+   5. https://klacansky.com/notes/reference/uos/kaiser.pdf
+   6. https://klacansky.com/notes/reference/uos/hyperkernel.pdf
+   7. https://klacansky.com/notes/reference/uos/dune.pdf