From aab647a0cc135a0510ded3b65b812dfd110321a5 Mon Sep 17 00:00:00 2001 From: Andreas Baumann Date: Sun, 14 Jan 2024 19:40:25 +0100 Subject: updated links and documentation --- ...lacansky.com_notes_useless-operating-system.txt | 978 +++++++++++++++++++++ 1 file changed, 978 insertions(+) create mode 100644 doc/klacansky.com_notes_useless-operating-system.txt (limited to 'doc/klacansky.com_notes_useless-operating-system.txt') diff --git a/doc/klacansky.com_notes_useless-operating-system.txt b/doc/klacansky.com_notes_useless-operating-system.txt new file mode 100644 index 0000000..cc15e24 --- /dev/null +++ b/doc/klacansky.com_notes_useless-operating-system.txt @@ -0,0 +1,978 @@ + Useless Operating System + + [1]Code + + In my final project for the advanced operating system class taught by + John Regehr, I implemented parts of an operating system inspired by + exo/micro kernel design. + + The overall goal is to expose hardware as much as possible while + providing protection. + + I decided to stick with amd64 architecture as porting to aarch64 seems + unfeasible given the time frame and I primarily use amd64 CPUs so I + want to know more about them. UOS does not do identity mapping of + physical memory outside boot thus kernel will be using virtual memory + in almost same way as user mode programs. + + After implementing small OS I recommend it to all low-level or high + performance oriented programmers. More importantly, OS development is + joy, especially the design of interfaces and low-level tinkering. + Unfortunately, developing a usable operating system is no small task + (all those drivers) and user mode applications severely constrain the + kernel design. I therefore moved to compilers and programming + languages, which are tractable for a single person (in the days of + LLVM), and can significantly improve ones life by making programming + fun againk + + Also, the development made me finally appreciate debugger, where + simpler printf can't be used as registers need to be inspected before + and after instruction execution, e.g., before and after sysret. + +Important + + * disable red zone with compiler flag -mno-red-zone (can get + corrupted in nested interrupts) + * make sure pointers are in canonical form + +Motivation + + In the class we used xv6, a clean Unix clone with approachable code. + However, adding VESA BIOS Extensions framebuffer required new system + calls and kernel changes which was clunky. I especially disliked that + any experiment required tinkering with the kernel, and provided + abstractions were quite high level. + + Moreover, the xv6 operating system runs on 32-bit architecture, but + most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is + much simpler as it removes tons of old cruft such as segmentation and + has simple system calls. + + The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a + bit L4 (microkernel). For example, almost all exceptions are handled by + user and not kernel, including page fault. The timer interrupt is + forwarded to user, but in case of buggy/malicious container, kernel + will wait few slots and then kill the misbehaving container. Since the + user knows more about its use of registers, it can save only the + necessary ones, e.g., may not save floating point registers. On context + switch, kernel stores only instruction and stack pointers. + + I decided to go with capability-based permission system, where each + resource is associated with a "lock" and the key to the lock consists + of hashed permissions and the lock itself. Kernel/user can easily + verify the key valid by hashing the lock and permissions of the key and + checking if the key's hash matches. This approaches makes kernel + simpler, as memory that can't be exposed to user (e.g., < 1 MB) is + simply associated with a lock that only kernel knows. Therefore, no + special range checks are required in the kernel when user maps memory. + Shared memory is also simple to implement, as a container can easily + create keys for the memory it owns and give it to a newly created + container. Also, kernel uses same interface as user, i.e., it presents + key if it needs to map memory. TODO: how to safely store key if + Meltdown is present? + + UOS in constrast to Linux does not have a text mode. All assembly is in + separate files as I am convinced inline assembly is not good solution. + Moreover, kernel consists of only single .c file, which simplifies the + build system to a single compile command. It targets desktops and uses + 2 MiB pages. + +Amd64 Architecture + + Most desktops or laptops contain amd64 processor either from Intel or + Amd. Since UOS targets personal computers it supports only Amd64 + architecture. This choice simplifies the kernel as 32bit architectures + have much more legacy cruft. For example, Amd64 has no segmentation, + good syscall interface, simpler instruction set and only one necessary + task segment. The arguments for C ABI are passed in registers. UOS does + not allow kernel to read any user memory without explicit mapping. + Moreover, since there is more registers, all syscall arguments must use + them. + + Any virtual address must be in canonical form, i.e., if bit 47 + (counting from 0) is set, then all most significant bits must be also + set. This quirk divides the address space into 2 pieces with huge hole + in between them. Pretty annoying as UOS is not higher half kernel. + + The System V ABI requires first 6 integer or pointer arguments to be + passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The + remaining arguments (7 and more) are passed on stack. + +Overview of Startup + + At boot, Basic Input/Output System (BIOS) loads specific sector of 512 + bytes into memory at fixed address and starts execution. Usually, this + program is a bootloader that loads the kernel, or in our case second + stage of the bootloader as 64 bit assembly takes more space and I could + not fit the ELF loader into a single sector. The second stage parses + ELF file (kernel) on disk and loads it into memory. The kernel + bootstraps itself by setting up physical and virtual memory managers + and core struct for the boot processor. It then switches to new page + table and stack and continues by starting other processors and loading + user init program into a container. We will now discuss individual + stages in more detail. + +Bootloader + + We focus on using BIOS instead of UEFI due to simplicity of the setup + as there is no need for additional tools to build BIOS bootable binary. + The disadvantage of BIOS is all the legacy cruft, such as starting in + Real Mode and having only 512 bytes for bootloader. + + The purpose of bootloader is to do initial setup (such as video mode, + virtual memory) and load kernel from a disk into a memory. Since we + could not squeeze bootloader with ELF loader into 512 bytes like xv6, + we split the bootloader into two stages, where the first stage's + purpose is to only load the second stage. For now, we use 4 KiB of + space for both stages and kernel starts at sector 8. + + After the video mode and long mode setup, the bootloader reads kernel + from ELF binary into predefined location and jumps into the kernel + entry point. IDE driver is used to perform reading from a disk. UOS + does not support multiboot standard as xv6, so it will work only with + our bootloader. + + The first stage must be 512 bytes and the assembly needs to be padded + to fill 512 bytes (and adding MBR bytes at 510 offset). + + A20 Line + + Used to ignore 21st bit of address to emulate wraparound from older + architectures. We have to enable this 21st bit so we can address all + memory. + + xv6 bootloader uses keyboard controller to enable this bit. We opt for + BIOS and its functions. +movw $0x2401, %ax +int $0x15 + + http://www.win.tue.nl/~aeb/linux/kbd/A20.html + + Global Descriptor Table + + entry point of bootblock.o must be at the address 0x7C00 which can be + double checked by doing objdump -d bootblock.o + + The simplest "debugging" can be done by writing to screen; segmentation + must be used as CGA address of 0xB8000 is outside 16-bit range. This + addressing can be done by setting %ds to 0xB800 (segment is multiplied + by 16 on access). + + We directly transition from real mode to long mode without going + through protected mode. Since the boot sector is small (512 bytes) we + have to manually set up page tables instead of loading them from an + array. + + We can't use .data segment in bootloader C code (for example to store + static string) as only .text is copied with objcopy (-j flag; TODO). + + Bootstrap Page Table + + We map same physical address [0, 1 GiB) to the virtual addresses [0x0, + 1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as + we linked kernel to reside at address 0x8000100000 (as we want it to be + mapped into each process space) and when we jump to it it must be + mapped properly. The first 1 GiB in virtual address space are mapped + because it is the space in which bootloader lives and from where we + will perform the jump. + + The reason for choosing address where all 39 bits are zero is so we can + have only top level page directory (PML4) which shares mapping with the + [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry + to the same entry in PDP (next level page table) as for the first + entry. + + direct switch to long mode + + Real Hardware + + UOS does not boot on my desktop as it would need SATA driver. If + booting from USB, tt turns out if the disk lacks partition table the + BIOS will set DL register to 0x0, i.e., floppy, and floppy does not + support linear base addressing (LBA) which we use in IDE driver. + +ELF + + Fixed-length ELF header and variable-length program header listing + * .text program instructions + * .rodata read-only data (we should make it only read-only memory) + * .data program's data (e.g., global variables) + * .bss is a reserved section for unitialized global variables (we + need to zero them) + * VMA is link address LMA is load address of .text section (as the + offsets in code are absolute due to not using Position Independent + Code) + + For some stupid reason .rodata is together with .text in same segment, + making .rodata executable. (http://thiébaud.fr/rodata.html) + + Symbol Table + + Symbol table associates names with linear addresses and thus can be + used by the loader to find addresses of exception and interrupt + handlers (those must have some agreed names). + +Drivers + + UART + + Serial port for debugging. Uses port-mapped I/O which ignores virtual + memory (thus even if we do not map first 2 MiB it works). + + IDE driver + + [2]Reference + + We need to read kernel from the primary ATA disk in bootloader to + initialize the core OS. + + We wrote the function for reading sectors in assembly to avoid inline + assembly. Also, we could wrap individual instructions in assembly + functions but it was much simpler this way. + +Kernel Bootstrap + + Mostly disabling legacy stuff and enabling modern features as the + processor assumes it woke up in last century. + +Memory Management + + * frame allocator (2 MiB) + * avoid mapping physical memory (higher half) + * tried recursive paging (pain to modify other processes, TLB + pressure) + * ended up with linear/physical dual mapped page tables + * kernel pages marked as global + * kernel heap (used for most allocations) + + There are roughly three levels of memory management: + * physical memory + * virtual memory + * heap + + We use recursive page table mapping to avoid the need to remap page + tables each time we need to access them or to use contiguous memory + storage for the VA to PA and PA to VA conversions which are necessary + for updating the page table. The recursive mapping is the simplest as + we need these conversions only for page table manipulation. In other + cases such as memory mapped IO we would use a simple mapping. + + Hmm, recursive mapping allows us to access entries of page table only + for currently loaded page table which is really annoying. + + There are several approaches: + * remap physical memory to kernel and use simple +-KERNEL_BASE + * use recursive page table (may use it for user space) + * keep small space available and map the page table before modifying + it + * store virtual and physical addresses for page table (have to keep + them in sync, allows for custom format which gets rid of canonical + address issues NX bit, problems with dirty and accessed flags) + + Recursive Page Table + + By mapping pml4 entry to pml4 we can access all paging structures for + current process. The main advantage is the simplicity and constant + virtual address space used in kernel (since only pml4 is mapped in + kernel space). If we need to modify another process's page table we can + point it to a pml4 slot and use the recursive trick on pdp (which is + pml4 of the other process). + + Unfortunately, there are many disadvantages: + * since we use 2 MiB pages we have to be careful as the recursive + mapping behaves as if we used 4 KiB pages + * large TLB footprint (as we use 2 MiB pages) + * clunky to modify other process page table (the pdp becomes pml4, + ...) - rare, but needed for container creation + * weird bootstrap during boot + * limited to x86 (not big deal for now) + + Duplicated Page Table + + For kernel we have a separate page table data structure that mimics + hardware page table. + + TODO: use completely different data layout so we can easily traverse + the pointers from pml4 -> pdp -> pd. + + For user space we may use recursive mapping to allow easy view of the + current virtual address space. (TODO: what about TLB pollution? We + can't easily map page tables as they live in kernel heap) + + Backing pages cannot be moved without knowledge of the heap to avoid + having a separate structure for page tables. + + Initialization + + Since at boot we do not have valid kernel page table we have to build + one. This build process is quite different from modifying current page + table (one in CR3) as we can't use recursive mapping to access the + pml4, pdp, and pd tables. Furthermore, since the page table is not used + we do not need to invalidate TLB entries. + + We assume the heap is physically contiguous at the boot time and use a + simple offset to setup a initial kernel page table. Alternative + solution would be to walk the boot page directory and find linear + address for a given physical address (the other way is handled by + recursive mapping). + + Manager + + Still go back and forth between the options but leaning towards thin + layer in kernel which will provide frame management and ability to map + virtual to physical addresses. The kernel would enforce permissions. + + The other options are: + * provide only sbrk to grow heap (pain in ass to map any specific + frames) + * allocate fixed number of frames for kernel and give rest of frames + to user space memory manager (two context switches, painful) + * I can't think of anything else + + Security + + If I go with 2 MiB pages than we need at least 4 pages for a program (8 + MiB): + * .text, read-only + * .rodata, read-only, no-execute + * .data, write, no-execute + * stack, write, no-execute + + Heap + + I started with a simple heap that allocates multiple of 4096 bytes, but + after watching Switch hacking talk (TODO reference) I plan to have a + separate allocator (slab) for each struct type, thus avoiding the + potential vulnerability of overwriting different struct that can happen + on heap (as two different types of struct cannot possibly overlap, and + also there are no pointer casting issues). + +Security + + * NX, WP, SMAP, SMEP + * guard pages + * proper permissions for elf segments + * capability-based permission (kernel/user are "same") + + The UOS is a exo/micro kernel design and thus it runs all drivers in + user mode (ring 3). + + Kernel is prohibited to write to any user mapped memory directly and + must map the memory to separate address if it wants to write there (for + example for the info structure). Kernel never reads any user memory and + all syscall arguments must be passed in registers. + + The recent side-channel attacks are concerning and I will probably go + with the separate page table for kernel and user (with the addition of + PCID). The Spectre attack seem much harder to mitigate. The attack + allows user to read all memory, thus it could read kernel key and + corresponding capabilities created from the key. One solution would be + to store the keys and capabilities in a special area not mapped when + user space is in flight (probably hard to not make a mistake); could we + create kernel where everything is non-sensitive?. + + he UOS is a exo/micro kernel design and thus it runs all drivers in + user mode (ring 3). + + Kernel is prohibited to write to any user mapped memory directly and + must map the memory to separate address if it wants to write there (for + example for the info structure). Kernel never reads any user memory and + all syscall arguments must be passed in registers. + + The recent side-channel attacks are concerning and I will probably go + with the separate page table for kernel and user (with the addition of + PCID). The Spectre attack seem much harder to mitigate. The attack + allows user to read all memory, thus it could read kernel key and + corresponding capabilities created from the key. One solution would be + to store the keys and capabilities in a special area not mapped when + user space is in flight (probably hard to not make a mistake); could we + create kernel where everything is non-sensitive?. + + Capability-based Mechanisms + + Instead of user/group based system have fine grained badges that give + owner permissions. Access control lists (ACL) store list of things that + have access to given resource. Capabilities store list of resources a + thing can access. + + Strive for simplicity + + c-list (capability list), trapdoor functions? + + random numbers (rdrand, rdseed) + + How to do these? + * capability per frame + * capability per core + + Anton Burtsev, http://www.ics.uci.edu/~aburtsev/ + + References: + * https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html + * https://software.intel.com/en-us/articles/intel-digital-random-numb + er-generator-drng-software-implementation-guide + + Design + + Initial approach was to have each frame associated with capability, + thus when use allocated a frame he would get a key to it. This method + was clunky as it would be tricky to allocate multiple frames and it + placed large burden onto user. Furthermore, we were associating the + object number with the key to identify it, which caused same problems. + + Instead, we use user provided key (can be random number) when a frame + is acquired, thus the same key can be used with any number of frames. + User can create capability by hashing the key and permissions. Then + when any access such as mapping the frame is performed, OS performs + same hashing and check if the hash in capability matches the computed + one. + + TODO: test if we can use AES instead of sha256 + +Multicore + + * ACPI table to get list of cores + * x2APIC (MSR) + * GS register + * one kernel stack per core + * create container in user space (aka fork) + + There are at least two ways to detect Application Processors (AP) after + Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor + Specification tables or ACPI (specifically MADT). + + xv6 uses MPS tables which are obsolete, so we focus on the ACPI + approach. The ACPI table pointer structure lives at some memory range + and can be checksumed to verify if the struct is really an ACPI table. + Since Qemu does not support x2APIC we have to memory map the LAPIC + registers to query the APIC id (other alternative is to use CPUID). The + ACPI table tells us this address and moreover it has a flag if core is + working (the table is filled during boot by each core). + + After parsing the tables we loop through all core entries and if the + LAPIC id is different from the id of BSP we attempt to start the core + with interprocess interrupt. Then we wait until the core updates the + shared core array with running flag. We could do exponential startup + for large core counts, but currently it is not an issue. Note, we use + x2APIC which uses MSR as an interface instead of memory mapping. + + Each core has its own kernel stack and pml4 with shared entries for the + kernel data structures (including heap) and for the entry which holds + all core kernel stacks (TODO: should cores see other cores stacks?). + +User Mode + + * params passed only in registers + * fast syscall/sysret + * low level (map memory, create container) + +Graphics + + The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with + 256 colors with linear addressing) with BIOS interrupt 0x10 in + protected mode. This mode switch is done in bootloader before protected + mode is entered as only real mode can sanely interact with BIOS to + manipulate VGA graphics. +# two moves can be combined into movw $0x0013, %ax +movb $0x0, %ah # call function 0x0 to set video mode +movb $0x13, %al # use video mode 0x13 +# or movw $0x13, %ax +int $0x10 + + After changing video mode, the pixel can be accessed at physical + address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000. + + Unfortunately, the small resolution (and only 256 colors) provided by + VGA is restrictive and unsuitable for any graphics based system. + + The simplest solution is to use VESA BIOS Extensions (VBE) which can + provide more than 1kx1k resolutions with up to 16M colors. Even though + modes defined by VESA are considered deprecated, they are likely to + work and allow to avoid finding suitable mode. This predefined modes + come handy as real mode needs to be used and xv6's boot code can't be + more than 512 bytes. Other solution would be to drop down to real mode + before switching to long mode (TODO). + + For example, setting 1280x1024 with 16M colors and loading its + information. +movw $0x4F02, %ax # set VBE mode function +movw $0x411B, %bx # 1280x1024 24bit mode (with linear framebuffer) +int $0x10 + +movw $0x4F01, %ax # get mode information function +movw $0x411B, %bx # mode +movw $0x0, %di # store the mode information struct (256 bytes) at physi +cal address 0x0 +int $0x10 + + Than we can qeury mode parameters by loading the ModeInfoBlock + structure. For simplicity, we load it at fixed physical address 0x0 as + kernel code starts on address 0x100000. Since we care only about the + PhysBasePtr field (tells us location of linear framebuffer), we could + pass it as function parameter to the kernel bootstrap function. + + Now, the tricky bit; we need to create virtual memory mapping to the + physical address in PhysBasePtr. On my machine in Qemu, the address is + 0xFD000000 and we need to map 4 MiB (depends on resolution and bit + depth) up to the address 0xFE000000 (defined in memlayout.h as + DEVSPACE, i.e., memory mapped I/O). + + I used mappages (vm.c) in function main (main.c) to map some fixed + virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr + before starting other processors. +struct vbemodeinfo const *const vbe = (void *)KERNBASE; +int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm); +mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr, +PTE_W); + + Now, pixels can be written at the address FRAMEBUFFER anywhere in + kernel. I was surprised I could not find any exported function to + perform the memory mapping explicitly, another reason for exokernel. + +Floating Points + + By default disabled on amd64 + + We can test in bootloader first if AVX2 is supported, and if not hang + the OS. Some OS will trap if user used floating point unit and enable + it. This mechanism allows OS to avoid storing/restoring floating point + registers during context switch if process does not use those + registers. Since UOS kernel does not save context at all (except few + registers on timer interrupt), it enables AVX2 at the kernel startup + and let's user to do register store/restore if necessary. In fact, this + approach is simpler as user knows what registers are being used and can + do some tricks. Moreover, since kernel does not use floating point + registers, user can pass values in registers to the next scheduled + container. + +Kernel + + We map kernel at high virtual address as user space programs assume + loading at address 0. + + Global Descriptor Table + + The GDT from bootloader has only kernel code segment and no user code + segment, nor Task State Segment (TSS). Initially, we had one GDT shared + by all cores, but each core needs its own TSS, thus we could either + create an entry for each core, or better, duplicate whole GDT to obtain + full flexibility at mere cost of extra 32 bytes per core. Keeping the + GDT separate is the right step into the direction of sharing as little + as possible between cores to scale well on many core architectures (> + 32 cores). + + Memory Layout + + UOS's kernel does not remap physical memory contiguously (higher half) + as xv6 does. Instead, the kernel uses the same interfaces as user + programs, allocating frame and mapping it. In linear address space, the + kernel lives at entry 511 (last) in PML4 page table and thus takes 512 + GB. We chose to take whole entry for simplicity as it can be easily + shared by all user programs. Moreover, we mark the entry as global to + avoid flushing TLB on context switch. + + Each core's stack lives in entry 510 and is again marked global. Stacks + are one page large (2 MiB), and have guard pages interspersed between + the stacks. We do not have kernel stack per container as there are no + slow system calls in the kernel (such as IO) that would need to be + interrupted. This push of expensive operations to user space makes + kernel simpler and easier to reason about. + + Linking + +Containers + + Instead of notion of a traditional UNIX process, UOS uses containers as + its execution unit abstraction. A container is akin to a process, but + it also has access to exceptions, low-level memory management, and + scheduling. For example, user can easily implement preemtive threads + inside a container. + + In short, container has following entries: + * unique identifier + * PML4 pointer to a page table + * frame for storing rsp and rip during system call + * status (running, ...) + * penalty (increased on each tick, decreased on yield) + + In contrast to xv6, we do not have any scheduler thread, so the kernel + thread becomes a container after bootstrap and if container is killed, + next container is scheduled (or in lack of container kernel panics). + +User Space Bootstrap + + The initial user mode binary is embedded into the kernel as there is no + disk driver in kernel. All binaries use custom link script so there are + only .text, .rodata, and .data, and they have proper permissions so + they can be loaded into separate pages. + + The ELF file is parsed for timer interrupt handler which needs to be + passed to function allocating quantum (so it will be scheduled to run). + +System Calls + + xv6 uses interrupt 0x10 for system calls. We use the native + syscall/sysret instructions present on amd64 architectures specifically + designed for system calls. + + Since the syscall instruction uses rcx register to store the returning + instruction pointer we copy the value in the register to caller saved + register and restore it back in the syscall to preserve C calling + convention. + + We may need to move away from C calling convention due the need of + having more than 6 function arguments for the syscall (7-th is on the + stack and uOS is prohibited from touching any user memory). + + UOS relies on a C calling convention for system calls, but it must + clear caller saved registers before returning to the user as it could + leak information from the kernel (e.g., private key). + +Entering User Mode + + Since amd64 has no "direct" instruction to switch to user mode we have + to pretend that the container is returning either from interrupt + (iretq) or system call (sysretq). We used the system call intruction as + it does not touch user's stack, thus we can avoid writing to user's + memory (which we avoid for security reasons). + +Traps and Interrupts + + 256 slots total. + + Must be acknowledged + + Priorities, higher is handled first + + TODO: do software excetions need to be acknowledged? + + Needs one Task State Segment (TSS) + + User space exception list: + * #DE - divide by zero (vector 0) + * #PF - page fault (vector 14) - TODO: user can't read CR2; should + kernel push CR2? + * #XF - SIMD floating point (vector 19) + + User space rare exceptions (which we let double fault): + * #SS - stack exception (vector 12) - TODO: detects canonical address + issue + * #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions + and very rare + * #GP - general protection (vector 13) - TODO: mostly for unaligned + SSE + * #AC - alignment check (vector 17) - can happen only if CPL = 3 + + Kernel space exception list: + * #DF - double fault (vector 8) + * #MC - machine check (vector 18) + + Legacy exceptions: + * Non-maskable interrupt (vector 2) + * #OF - overflow (vector 4) + * #BR - bound range (vector 5) + * #NM - device not available (vector 7) + * #TS - invalid TSS (vector 10) + * #NP - segment not present (vector 11) + * #MF - x87 floating point (vector 16) + + Story + + Thanks to #osdev for help. + + After implementing divide zero exception the kernel start rebooting. + Changing the .text segment to writable stopped the restarting and + seemingly fixed the problem. However, exception should not modify the + .text segment so after verifying the IDT is correct I had to check what + changed in .text segment. + + I copied all .text data and run comparison after the fault happened. + The mismatch was found: +mismatch at FFFFFF80000000C0: 20990000000000 20980000000000 + + Which refered to GDT code segment entry and specifically the accessed + bit got flipped. The reason is that during fault the code segment is + loaded, but for syscall/sysret it is not, so it was confusing. + + For now, the solution was to move non-bootstrap GDT to .data segment. + NOTE: bootstrap GDT is in .text segment and thus the page for starting + APs must be writable. + + Division by zero + + When asked on IRC to about advancing RIP after division by zero I got + this response :). + + It's *possible*, but it's also likely to break dozens of known good + assumptions about computer science and invoke a metric shitload of + undefined behaviour in the process (Kazinsal #osdev) + + Timer Interrupt + + So far, all non-critical exceptions were handled by currently running + user mode container. Unfortunately, I am not sure how to make the LAPIC + timer in user mode as it requires writes to a MSR to reset it (TSC + deadline), and more importantly kernel has to relinquish control in + case of adversary/buggy container that is reluctant to yield on the + interrupt. + + Initially, I thought setting up two timers, one at higher frequency + (user mode) and the other at lower frequency (kernel mode) would be + neat, but LAPIC has only one timer interrupt vector. Thus, the current + plan is to point the interrupt vector to kernel which will bump up + penalty integer (if next quantum is a different container) and returns + to present container. Since the registers on context switch are saved + by containers, kernel must (I think) use either user mode stack (can be + read-only) or registers to at minimum save current RIP of the user + container before jumping to the handler. If sysretq is used, RCX and + R11 must be saved on user stack in addition, thus to better mimick + traditional exception, a user read-only stack seems as a better + solution (there is the GS/FS register too, but I rather have user + complete freedom and not reserve any registers - probably makes it more + portable too). At the end, I decided to create read-only container + environment (basically, shared memory with kernel) that can be used to + share information such as, the current penalty count, last RIP (on + interrupt), next quantum container id, and other different kernel + counters for active container. This area requires one frame (2 MiB) and + is mapped as write into kernel, and as read into user. The policy of + kernel never reading any user mode memory (only registers are allowed) + is still enforced. + + After kernel returns to the user mode interrupt handler, the container + is expected to save its context if it is necessary and yield. There are + exceptions, for example if the next quantum is the same container, + there is no need to save anything nor yield, the container can restore + the stack pointer and jump back to the saved or different RIP. This + mechanism can be used to implement user space threads by simply + allocating few consecutive quanta (such as 10, each 1 ms long) and + yield only on the last one, so the kernel can switch to different + container. + + Normally, I use null segment for SS, but when returning from CPL < 3 + into CPL = 3 via iretq, the SS will get reloaded and will cause general + protection fault. Thus a not null selector must be created in GDT and + SS must be loaded with that selector index. We use a simple trick where + SS gets loaded with proper selector after executing sysretq instruction + (starts user mode container). It is possible to use sysretq, but then + two extra registers needs to be stored in info (rcx and r11); with + iretq we need to store only rip. For user space exceptions we do not + follow the manual and avoid iretq as it would set the accesses bit in + GDT entry and page fault. (TODO: manual explicitly requires to use + iretq) + + TODO: handle what happens if container's interrupt handler gets + interrupted (I am thinking about unique ID written by kernel into info + as RIP can't be used reliably to identify interrupts). The container + should be penalized (and eventually killed), but more importantly, + there must be a plan for handling the user space stack in sane way (and + not losing some space on it, due to RSP being bump in handler, but not + decremented). + + Protocol: + 1. interrupt jumps into kernel handler + 2. if current container penalty is 3, kill it and schedule next + container + 3. increase penalty + 4. update RIP in the info + 5. set next LAPIC deadline + 6. jump into user handler + 7. user is expected to save registers and yield + 8. decrease penalty (and if negative set to 0) + 9. schedule next container + + Notice, the penalty indicates if there is any nesting in timer + interrupt. A well behaving program should always have penalty equal 1 + when entering timer interrupt handler. + + Since the user mode bootstrap container must handle timer interrupt, we + parse the ELF symbol table and search for timer_interrupt_asm assembly + function. This function is then called by kernel timer interrupt + handler. + + The kernel interrupt handler is fully written in assembly to avoid any + surprises in terms of register modifications. The current + implementation needs to push and pop only 3 registers, RAX, RCX, and + RDX. + +Compile Flags + + -mno-red-zone (disable red zone as interrupts do not respect convention + of leaving extra stack space for leaf functions) + +Tasks + + * tests + + [ ] frame allocator + + [ ] sha256 + * style + + [X] header only + + [ ] Intel-style assembly + * bootloader + + [X] load rest of bootloader with BIOS (sort of 2 stage) + + [X] switch from real mode to long mode + + [ ] query video modes and save this information at fixed + address + + [ ] switch to most suitable video mode + + [X] load ELF binary containing kernel (uses IDE) + + [ ] switch to SATA? + + [ ] boot on real hardware (needs SATA) + * kernel + + [X] make work on KVM (needed to disable SMAP on Haswell) + + [?] add performance counters to evaluate impact of recursive + paging on TLB misses + o abandoned recursive paging due to clunky mapping of other + page tables + + [X] start application processors (AP) + + [X] create kernel process for each application processor (AP) + + [X] add spinlock (test it by cores racing on it) + + [X] use RDGSBASE to get pointer to core structure instead of + the current offset stuff + + [X] check required feature set with CPUID + + [ ] SKINIT instruction (read Security section in manual) + + [ ] restructure so the end is at 16 MiB boundary + + [ ] avoid global variables + + [ ] use ILP64 mode (sizeof (int) == 8) - seems hard as there + is no compiler support + + [X] timer (local APIC; TSC-Deadline mode) + + [X] kernel double fault handler + + [X] enable SCE (system call extensions) in EFER + + [X] add system call support (syscall) + + [ ] syscall user rip security (validate it is canonical?) + + [X] add containers (similar to process abstraction) + + [X] load user space bootstrap ELF embedded inside kernel + + [X] page permissions for ELF segments + + [X] parse ELF symbol table to find timer interrupt handler + + [X] switch to user space (via sysretq) + + [X] use x2APIC (through MSR) instead of old xAPIC + + [X] wait 10 ms on AP startup + + [ ] IPC (async to avoid deadlocks...) + + [ ] AVX2 + + [X] one TSS + + [ ] sleep idle cores (hlt instructions) + + [ ] expose some data structures (and update during task + switch?) + + [ ] Continuation style system (TODO: read papers on it) + + [ ] init system (so servers can reserve resources) + + [ ] struct type for physical addresses (what about linear + addresses? always a pointer?) + + [X] add HPET support (to compute TSC rate) + + [ ] compute TSC rate for each core independently (good idea?) + + [ ] destroy container on double fault or other error + * memory management (only 2 MiB pages, thus only PML4, PDP, and PD is + used) + + [X] boot page table + + [X] kernel page table + + [X] recursive PML4 (no longer used for kernel space) + + [X] frame allocator + + [X] page allocator + + [X] heap allocator (only alloc, no free) + + [X] duplicated page table (allows easy traversal of any page + table - even inactive one) + + [X] page mapping interface (global marked pages for kernel to + avoid TLB invalidation; write protect bit CR0) + + [ ] alignment checking in user space? + + [X] Supervisor-Mode Execution Prevention (CR4.SMEP) + + [X] non-executable pages (NXE in EFER) + + [ ] TLB (tagged) + + [X] shared memory (uses capabilities) + + [ ] replace heap with slab allocator + * scheduling + + [ ] penalty protocol for wonky containers + + [ ] round-robin scheduler + + [ ] container migration to other core + * syscalls + + [X] log to UART + + [ ] frame alloc/free + + [ ] map/unmap + + [X] alloc/free container + + [X] quanta alloc, yield + + [X] userspace register store + * debugging + + [X] setup .gdbinit for automatic GDB startup + * security + + [ ] separate page table for user and kernel (KAISER) + + [X] use guard page for each core kernel stack + + [X] proper page flags for .text, .rodata, .data + + [X] capability-based resource management + + [ ] separate memory pools for each struct type (attacker can + only modify same type of struct) + + [ ] MTTR (prevent DMA to write into kernel memory) + + [ ] kernel ASLR (both linear and physical address; PIC needed) + + [ ] random location for user stack + + [ ] finer granularity permissions for capabilities + + [ ] check if AES can be used instead of sha256 + + [ ] namespaces? + + [ ] string type with length? + + [ ] stack protection + + [ ] do not map kernel into user space? (recent patches in NT + and Linux do it) + * libs + + [X] sprintf + + [X] sha256 routine + + [X] random numbers (via rdrand instruction) + + [X] strcmp + * user space + + [X] simple game of life render test + + [X] exceptions (divide by zero, page fault, ...) + + [X] timer interrupt + + [ ] compositor + * ACPI + + [X] parse MADT for detecting CPUs + + [X] build parsed struct with LAPICs + + [ ] parse x2APIC (Qemu unfortunately has ACPI 1.0 only) + * drivers + + [ ] SATA (high priority) + + [X] basic serial port + + [X] IDE + + [ ] keyboard (needs USB stack) + + [ ] mouse (needs USB stack) + + [ ] Intel GPU driver (blitting support) + * graphics + + [ ] GUI library + + [ ] tiling manager + * file system + + [ ] sector-based addressing (similar to page frames) + + [ ] memory mapped filesystem? + + [ ] some filesystem format + * networking (not likely before Christmas) + + [ ] ethernet driver + + [ ] packet filter (downloadable?) + * JIT (use wasm or some other typesafe IR and thus run everything at + ring 0) + + [X] interpreter of a subset of WASM + +References + + * Exokernel: An Operating System Architecture for Application-Level + Resource Management + * Singularity: Rethinking the Software Stack + * Amoeba: a distributed operating system for the 1990s + * Scheduler Activations: Effective Kernel Support for the User-Level + Management of Parallelism + * A Comparison of Scheduling Algorithms for Multiprocessors + * L4 Microkernels: The Lessons from 20 Years of Research and + Deployment + * xv6, JOS, seL4 + * osdev wiki + * Operating Systems Design and Implementation + * AMD64 Volume 1, 2, 3 + * ATA Interface Reference Manual + * Advanced Configuration and Power Interface (ACPI) Specification + * ELF-64 Object File Format + * VESA BIOS EXTENSION (VBE) Core Functions Standard + * [3]The little book about OS development + * [4]Reading privileged memory with a side-channel + * [5]Kaiser + * [6]Hyperkernel: Push-Button Verification of an OS Kernel + * [7]Dune: Safe User-level Access to Privileged CPU Features + * Modern Operating Systems - 3rd edition, A. Tanenbaum + +References + + 1. https://klacansky.com/notes/data/useless-os.zip + 2. http://wiki.osdev.org/ATA_PIO_Mode + 3. https://littleosbook.github.io/ + 4. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html + 5. https://klacansky.com/notes/reference/uos/kaiser.pdf + 6. https://klacansky.com/notes/reference/uos/hyperkernel.pdf + 7. https://klacansky.com/notes/reference/uos/dune.pdf -- cgit v1.2.3-54-g00ecf