summaryrefslogtreecommitdiff
path: root/doc/klacansky.com_notes_useless-operating-system.txt
blob: cc15e24b037fab1bb91d0d8fbf180741f0d1105a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
                            Useless Operating System

   [1]Code

   In my final project for the advanced operating system class taught by
   John Regehr, I implemented parts of an operating system inspired by
   exo/micro kernel design.

   The overall goal is to expose hardware as much as possible while
   providing protection.

   I decided to stick with amd64 architecture as porting to aarch64 seems
   unfeasible given the time frame and I primarily use amd64 CPUs so I
   want to know more about them. UOS does not do identity mapping of
   physical memory outside boot thus kernel will be using virtual memory
   in almost same way as user mode programs.

   After implementing small OS I recommend it to all low-level or high
   performance oriented programmers. More importantly, OS development is
   joy, especially the design of interfaces and low-level tinkering.
   Unfortunately, developing a usable operating system is no small task
   (all those drivers) and user mode applications severely constrain the
   kernel design. I therefore moved to compilers and programming
   languages, which are tractable for a single person (in the days of
   LLVM), and can significantly improve ones life by making programming
   fun againk

   Also, the development made me finally appreciate debugger, where
   simpler printf can't be used as registers need to be inspected before
   and after instruction execution, e.g., before and after sysret.

Important

     * disable red zone with compiler flag -mno-red-zone (can get
       corrupted in nested interrupts)
     * make sure pointers are in canonical form

Motivation

   In the class we used xv6, a clean Unix clone with approachable code.
   However, adding VESA BIOS Extensions framebuffer required new system
   calls and kernel changes which was clunky. I especially disliked that
   any experiment required tinkering with the kernel, and provided
   abstractions were quite high level.

   Moreover, the xv6 operating system runs on 32-bit architecture, but
   most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is
   much simpler as it removes tons of old cruft such as segmentation and
   has simple system calls.

   The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a
   bit L4 (microkernel). For example, almost all exceptions are handled by
   user and not kernel, including page fault. The timer interrupt is
   forwarded to user, but in case of buggy/malicious container, kernel
   will wait few slots and then kill the misbehaving container. Since the
   user knows more about its use of registers, it can save only the
   necessary ones, e.g., may not save floating point registers. On context
   switch, kernel stores only instruction and stack pointers.

   I decided to go with capability-based permission system, where each
   resource is associated with a "lock" and the key to the lock consists
   of hashed permissions and the lock itself. Kernel/user can easily
   verify the key valid by hashing the lock and permissions of the key and
   checking if the key's hash matches. This approaches makes kernel
   simpler, as memory that can't be exposed to user (e.g., < 1 MB) is
   simply associated with a lock that only kernel knows. Therefore, no
   special range checks are required in the kernel when user maps memory.
   Shared memory is also simple to implement, as a container can easily
   create keys for the memory it owns and give it to a newly created
   container. Also, kernel uses same interface as user, i.e., it presents
   key if it needs to map memory. TODO: how to safely store key if
   Meltdown is present?

   UOS in constrast to Linux does not have a text mode. All assembly is in
   separate files as I am convinced inline assembly is not good solution.
   Moreover, kernel consists of only single .c file, which simplifies the
   build system to a single compile command. It targets desktops and uses
   2 MiB pages.

Amd64 Architecture

   Most desktops or laptops contain amd64 processor either from Intel or
   Amd. Since UOS targets personal computers it supports only Amd64
   architecture. This choice simplifies the kernel as 32bit architectures
   have much more legacy cruft. For example, Amd64 has no segmentation,
   good syscall interface, simpler instruction set and only one necessary
   task segment. The arguments for C ABI are passed in registers. UOS does
   not allow kernel to read any user memory without explicit mapping.
   Moreover, since there is more registers, all syscall arguments must use
   them.

   Any virtual address must be in canonical form, i.e., if bit 47
   (counting from 0) is set, then all most significant bits must be also
   set. This quirk divides the address space into 2 pieces with huge hole
   in between them. Pretty annoying as UOS is not higher half kernel.

   The System V ABI requires first 6 integer or pointer arguments to be
   passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The
   remaining arguments (7 and more) are passed on stack.

Overview of Startup

   At boot, Basic Input/Output System (BIOS) loads specific sector of 512
   bytes into memory at fixed address and starts execution. Usually, this
   program is a bootloader that loads the kernel, or in our case second
   stage of the bootloader as 64 bit assembly takes more space and I could
   not fit the ELF loader into a single sector. The second stage parses
   ELF file (kernel) on disk and loads it into memory. The kernel
   bootstraps itself by setting up physical and virtual memory managers
   and core struct for the boot processor. It then switches to new page
   table and stack and continues by starting other processors and loading
   user init program into a container. We will now discuss individual
   stages in more detail.

Bootloader

   We focus on using BIOS instead of UEFI due to simplicity of the setup
   as there is no need for additional tools to build BIOS bootable binary.
   The disadvantage of BIOS is all the legacy cruft, such as starting in
   Real Mode and having only 512 bytes for bootloader.

   The purpose of bootloader is to do initial setup (such as video mode,
   virtual memory) and load kernel from a disk into a memory. Since we
   could not squeeze bootloader with ELF loader into 512 bytes like xv6,
   we split the bootloader into two stages, where the first stage's
   purpose is to only load the second stage. For now, we use 4 KiB of
   space for both stages and kernel starts at sector 8.

   After the video mode and long mode setup, the bootloader reads kernel
   from ELF binary into predefined location and jumps into the kernel
   entry point. IDE driver is used to perform reading from a disk. UOS
   does not support multiboot standard as xv6, so it will work only with
   our bootloader.

   The first stage must be 512 bytes and the assembly needs to be padded
   to fill 512 bytes (and adding MBR bytes at 510 offset).

  A20 Line

   Used to ignore 21st bit of address to emulate wraparound from older
   architectures. We have to enable this 21st bit so we can address all
   memory.

   xv6 bootloader uses keyboard controller to enable this bit. We opt for
   BIOS and its functions.
movw $0x2401, %ax
int $0x15

   http://www.win.tue.nl/~aeb/linux/kbd/A20.html

  Global Descriptor Table

   entry point of bootblock.o must be at the address 0x7C00 which can be
   double checked by doing objdump -d bootblock.o

   The simplest "debugging" can be done by writing to screen; segmentation
   must be used as CGA address of 0xB8000 is outside 16-bit range. This
   addressing can be done by setting %ds to 0xB800 (segment is multiplied
   by 16 on access).

   We directly transition from real mode to long mode without going
   through protected mode. Since the boot sector is small (512 bytes) we
   have to manually set up page tables instead of loading them from an
   array.

   We can't use .data segment in bootloader C code (for example to store
   static string) as only .text is copied with objcopy (-j flag; TODO).

  Bootstrap Page Table

   We map same physical address [0, 1 GiB) to the virtual addresses [0x0,
   1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as
   we linked kernel to reside at address 0x8000100000 (as we want it to be
   mapped into each process space) and when we jump to it it must be
   mapped properly. The first 1 GiB in virtual address space are mapped
   because it is the space in which bootloader lives and from where we
   will perform the jump.

   The reason for choosing address where all 39 bits are zero is so we can
   have only top level page directory (PML4) which shares mapping with the
   [0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry
   to the same entry in PDP (next level page table) as for the first
   entry.

   direct switch to long mode

  Real Hardware

   UOS does not boot on my desktop as it would need SATA driver. If
   booting from USB, tt turns out if the disk lacks partition table the
   BIOS will set DL register to 0x0, i.e., floppy, and floppy does not
   support linear base addressing (LBA) which we use in IDE driver.

ELF

   Fixed-length ELF header and variable-length program header listing
     * .text program instructions
     * .rodata read-only data (we should make it only read-only memory)
     * .data program's data (e.g., global variables)
     * .bss is a reserved section for unitialized global variables (we
       need to zero them)
     * VMA is link address LMA is load address of .text section (as the
       offsets in code are absolute due to not using Position Independent
       Code)

   For some stupid reason .rodata is together with .text in same segment,
   making .rodata executable. (http://thiébaud.fr/rodata.html)

  Symbol Table

   Symbol table associates names with linear addresses and thus can be
   used by the loader to find addresses of exception and interrupt
   handlers (those must have some agreed names).

Drivers

  UART

   Serial port for debugging. Uses port-mapped I/O which ignores virtual
   memory (thus even if we do not map first 2 MiB it works).

  IDE driver

   [2]Reference

   We need to read kernel from the primary ATA disk in bootloader to
   initialize the core OS.

   We wrote the function for reading sectors in assembly to avoid inline
   assembly. Also, we could wrap individual instructions in assembly
   functions but it was much simpler this way.

Kernel Bootstrap

   Mostly disabling legacy stuff and enabling modern features as the
   processor assumes it woke up in last century.

Memory Management

     * frame allocator (2 MiB)
     * avoid mapping physical memory (higher half)
     * tried recursive paging (pain to modify other processes, TLB
       pressure)
     * ended up with linear/physical dual mapped page tables
     * kernel pages marked as global
     * kernel heap (used for most allocations)

   There are roughly three levels of memory management:
     * physical memory
     * virtual memory
     * heap

   We use recursive page table mapping to avoid the need to remap page
   tables each time we need to access them or to use contiguous memory
   storage for the VA to PA and PA to VA conversions which are necessary
   for updating the page table. The recursive mapping is the simplest as
   we need these conversions only for page table manipulation. In other
   cases such as memory mapped IO we would use a simple mapping.

   Hmm, recursive mapping allows us to access entries of page table only
   for currently loaded page table which is really annoying.

   There are several approaches:
     * remap physical memory to kernel and use simple +-KERNEL_BASE
     * use recursive page table (may use it for user space)
     * keep small space available and map the page table before modifying
       it
     * store virtual and physical addresses for page table (have to keep
       them in sync, allows for custom format which gets rid of canonical
       address issues NX bit, problems with dirty and accessed flags)

  Recursive Page Table

   By mapping pml4 entry to pml4 we can access all paging structures for
   current process. The main advantage is the simplicity and constant
   virtual address space used in kernel (since only pml4 is mapped in
   kernel space). If we need to modify another process's page table we can
   point it to a pml4 slot and use the recursive trick on pdp (which is
   pml4 of the other process).

   Unfortunately, there are many disadvantages:
     * since we use 2 MiB pages we have to be careful as the recursive
       mapping behaves as if we used 4 KiB pages
     * large TLB footprint (as we use 2 MiB pages)
     * clunky to modify other process page table (the pdp becomes pml4,
       ...) - rare, but needed for container creation
     * weird bootstrap during boot
     * limited to x86 (not big deal for now)

  Duplicated Page Table

   For kernel we have a separate page table data structure that mimics
   hardware page table.

   TODO: use completely different data layout so we can easily traverse
   the pointers from pml4 -> pdp -> pd.

   For user space we may use recursive mapping to allow easy view of the
   current virtual address space. (TODO: what about TLB pollution? We
   can't easily map page tables as they live in kernel heap)

   Backing pages cannot be moved without knowledge of the heap to avoid
   having a separate structure for page tables.

  Initialization

   Since at boot we do not have valid kernel page table we have to build
   one. This build process is quite different from modifying current page
   table (one in CR3) as we can't use recursive mapping to access the
   pml4, pdp, and pd tables. Furthermore, since the page table is not used
   we do not need to invalidate TLB entries.

   We assume the heap is physically contiguous at the boot time and use a
   simple offset to setup a initial kernel page table. Alternative
   solution would be to walk the boot page directory and find linear
   address for a given physical address (the other way is handled by
   recursive mapping).

  Manager

   Still go back and forth between the options but leaning towards thin
   layer in kernel which will provide frame management and ability to map
   virtual to physical addresses. The kernel would enforce permissions.

   The other options are:
     * provide only sbrk to grow heap (pain in ass to map any specific
       frames)
     * allocate fixed number of frames for kernel and give rest of frames
       to user space memory manager (two context switches, painful)
     * I can't think of anything else

  Security

   If I go with 2 MiB pages than we need at least 4 pages for a program (8
   MiB):
     * .text, read-only
     * .rodata, read-only, no-execute
     * .data, write, no-execute
     * stack, write, no-execute

  Heap

   I started with a simple heap that allocates multiple of 4096 bytes, but
   after watching Switch hacking talk (TODO reference) I plan to have a
   separate allocator (slab) for each struct type, thus avoiding the
   potential vulnerability of overwriting different struct that can happen
   on heap (as two different types of struct cannot possibly overlap, and
   also there are no pointer casting issues).

Security

     * NX, WP, SMAP, SMEP
     * guard pages
     * proper permissions for elf segments
     * capability-based permission (kernel/user are "same")

   The UOS is a exo/micro kernel design and thus it runs all drivers in
   user mode (ring 3).

   Kernel is prohibited to write to any user mapped memory directly and
   must map the memory to separate address if it wants to write there (for
   example for the info structure). Kernel never reads any user memory and
   all syscall arguments must be passed in registers.

   The recent side-channel attacks are concerning and I will probably go
   with the separate page table for kernel and user (with the addition of
   PCID). The Spectre attack seem much harder to mitigate. The attack
   allows user to read all memory, thus it could read kernel key and
   corresponding capabilities created from the key. One solution would be
   to store the keys and capabilities in a special area not mapped when
   user space is in flight (probably hard to not make a mistake); could we
   create kernel where everything is non-sensitive?.

   he UOS is a exo/micro kernel design and thus it runs all drivers in
   user mode (ring 3).

   Kernel is prohibited to write to any user mapped memory directly and
   must map the memory to separate address if it wants to write there (for
   example for the info structure). Kernel never reads any user memory and
   all syscall arguments must be passed in registers.

   The recent side-channel attacks are concerning and I will probably go
   with the separate page table for kernel and user (with the addition of
   PCID). The Spectre attack seem much harder to mitigate. The attack
   allows user to read all memory, thus it could read kernel key and
   corresponding capabilities created from the key. One solution would be
   to store the keys and capabilities in a special area not mapped when
   user space is in flight (probably hard to not make a mistake); could we
   create kernel where everything is non-sensitive?.

  Capability-based Mechanisms

   Instead of user/group based system have fine grained badges that give
   owner permissions. Access control lists (ACL) store list of things that
   have access to given resource. Capabilities store list of resources a
   thing can access.

   Strive for simplicity

   c-list (capability list), trapdoor functions?

   random numbers (rdrand, rdseed)

   How to do these?
     * capability per frame
     * capability per core

   Anton Burtsev, http://www.ics.uci.edu/~aburtsev/

   References:
     * https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html
     * https://software.intel.com/en-us/articles/intel-digital-random-numb
       er-generator-drng-software-implementation-guide

  Design

   Initial approach was to have each frame associated with capability,
   thus when use allocated a frame he would get a key to it. This method
   was clunky as it would be tricky to allocate multiple frames and it
   placed large burden onto user. Furthermore, we were associating the
   object number with the key to identify it, which caused same problems.

   Instead, we use user provided key (can be random number) when a frame
   is acquired, thus the same key can be used with any number of frames.
   User can create capability by hashing the key and permissions. Then
   when any access such as mapping the frame is performed, OS performs
   same hashing and check if the hash in capability matches the computed
   one.

   TODO: test if we can use AES instead of sha256

Multicore

     * ACPI table to get list of cores
     * x2APIC (MSR)
     * GS register
     * one kernel stack per core
     * create container in user space (aka fork)

   There are at least two ways to detect Application Processors (AP) after
   Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor
   Specification tables or ACPI (specifically MADT).

   xv6 uses MPS tables which are obsolete, so we focus on the ACPI
   approach. The ACPI table pointer structure lives at some memory range
   and can be checksumed to verify if the struct is really an ACPI table.
   Since Qemu does not support x2APIC we have to memory map the LAPIC
   registers to query the APIC id (other alternative is to use CPUID). The
   ACPI table tells us this address and moreover it has a flag if core is
   working (the table is filled during boot by each core).

   After parsing the tables we loop through all core entries and if the
   LAPIC id is different from the id of BSP we attempt to start the core
   with interprocess interrupt. Then we wait until the core updates the
   shared core array with running flag. We could do exponential startup
   for large core counts, but currently it is not an issue. Note, we use
   x2APIC which uses MSR as an interface instead of memory mapping.

   Each core has its own kernel stack and pml4 with shared entries for the
   kernel data structures (including heap) and for the entry which holds
   all core kernel stacks (TODO: should cores see other cores stacks?).

User Mode

     * params passed only in registers
     * fast syscall/sysret
     * low level (map memory, create container)

Graphics

   The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with
   256 colors with linear addressing) with BIOS interrupt 0x10 in
   protected mode. This mode switch is done in bootloader before protected
   mode is entered as only real mode can sanely interact with BIOS to
   manipulate VGA graphics.
# two moves can be combined into movw $0x0013, %ax
movb    $0x0, %ah  # call function 0x0 to set video mode
movb    $0x13, %al # use video mode 0x13
# or movw $0x13, %ax
int     $0x10

   After changing video mode, the pixel can be accessed at physical
   address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000.

   Unfortunately, the small resolution (and only 256 colors) provided by
   VGA is restrictive and unsuitable for any graphics based system.

   The simplest solution is to use VESA BIOS Extensions (VBE) which can
   provide more than 1kx1k resolutions with up to 16M colors. Even though
   modes defined by VESA are considered deprecated, they are likely to
   work and allow to avoid finding suitable mode. This predefined modes
   come handy as real mode needs to be used and xv6's boot code can't be
   more than 512 bytes. Other solution would be to drop down to real mode
   before switching to long mode (TODO).

   For example, setting 1280x1024 with 16M colors and loading its
   information.
movw    $0x4F02, %ax    # set VBE mode function
movw    $0x411B, %bx    # 1280x1024 24bit mode (with linear framebuffer)
int     $0x10

movw    $0x4F01, %ax    # get mode information function
movw    $0x411B, %bx    # mode
movw    $0x0, %di       # store the mode information struct (256 bytes) at physi
cal address 0x0
int     $0x10

   Than we can qeury mode parameters by loading the ModeInfoBlock
   structure. For simplicity, we load it at fixed physical address 0x0 as
   kernel code starts on address 0x100000. Since we care only about the
   PhysBasePtr field (tells us location of linear framebuffer), we could
   pass it as function parameter to the kernel bootstrap function.

   Now, the tricky bit; we need to create virtual memory mapping to the
   physical address in PhysBasePtr. On my machine in Qemu, the address is
   0xFD000000 and we need to map 4 MiB (depends on resolution and bit
   depth) up to the address 0xFE000000 (defined in memlayout.h as
   DEVSPACE, i.e., memory mapped I/O).

   I used mappages (vm.c) in function main (main.c) to map some fixed
   virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr
   before starting other processors.
struct vbemodeinfo const *const vbe = (void *)KERNBASE;
int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr,
PTE_W);

   Now, pixels can be written at the address FRAMEBUFFER anywhere in
   kernel. I was surprised I could not find any exported function to
   perform the memory mapping explicitly, another reason for exokernel.

Floating Points

   By default disabled on amd64

   We can test in bootloader first if AVX2 is supported, and if not hang
   the OS. Some OS will trap if user used floating point unit and enable
   it. This mechanism allows OS to avoid storing/restoring floating point
   registers during context switch if process does not use those
   registers. Since UOS kernel does not save context at all (except few
   registers on timer interrupt), it enables AVX2 at the kernel startup
   and let's user to do register store/restore if necessary. In fact, this
   approach is simpler as user knows what registers are being used and can
   do some tricks. Moreover, since kernel does not use floating point
   registers, user can pass values in registers to the next scheduled
   container.

Kernel

   We map kernel at high virtual address as user space programs assume
   loading at address 0.

  Global Descriptor Table

   The GDT from bootloader has only kernel code segment and no user code
   segment, nor Task State Segment (TSS). Initially, we had one GDT shared
   by all cores, but each core needs its own TSS, thus we could either
   create an entry for each core, or better, duplicate whole GDT to obtain
   full flexibility at mere cost of extra 32 bytes per core. Keeping the
   GDT separate is the right step into the direction of sharing as little
   as possible between cores to scale well on many core architectures (>
   32 cores).

  Memory Layout

   UOS's kernel does not remap physical memory contiguously (higher half)
   as xv6 does. Instead, the kernel uses the same interfaces as user
   programs, allocating frame and mapping it. In linear address space, the
   kernel lives at entry 511 (last) in PML4 page table and thus takes 512
   GB. We chose to take whole entry for simplicity as it can be easily
   shared by all user programs. Moreover, we mark the entry as global to
   avoid flushing TLB on context switch.

   Each core's stack lives in entry 510 and is again marked global. Stacks
   are one page large (2 MiB), and have guard pages interspersed between
   the stacks. We do not have kernel stack per container as there are no
   slow system calls in the kernel (such as IO) that would need to be
   interrupted. This push of expensive operations to user space makes
   kernel simpler and easier to reason about.

  Linking

Containers

   Instead of notion of a traditional UNIX process, UOS uses containers as
   its execution unit abstraction. A container is akin to a process, but
   it also has access to exceptions, low-level memory management, and
   scheduling. For example, user can easily implement preemtive threads
   inside a container.

   In short, container has following entries:
     * unique identifier
     * PML4 pointer to a page table
     * frame for storing rsp and rip during system call
     * status (running, ...)
     * penalty (increased on each tick, decreased on yield)

   In contrast to xv6, we do not have any scheduler thread, so the kernel
   thread becomes a container after bootstrap and if container is killed,
   next container is scheduled (or in lack of container kernel panics).

User Space Bootstrap

   The initial user mode binary is embedded into the kernel as there is no
   disk driver in kernel. All binaries use custom link script so there are
   only .text, .rodata, and .data, and they have proper permissions so
   they can be loaded into separate pages.

   The ELF file is parsed for timer interrupt handler which needs to be
   passed to function allocating quantum (so it will be scheduled to run).

System Calls

   xv6 uses interrupt 0x10 for system calls. We use the native
   syscall/sysret instructions present on amd64 architectures specifically
   designed for system calls.

   Since the syscall instruction uses rcx register to store the returning
   instruction pointer we copy the value in the register to caller saved
   register and restore it back in the syscall to preserve C calling
   convention.

   We may need to move away from C calling convention due the need of
   having more than 6 function arguments for the syscall (7-th is on the
   stack and uOS is prohibited from touching any user memory).

   UOS relies on a C calling convention for system calls, but it must
   clear caller saved registers before returning to the user as it could
   leak information from the kernel (e.g., private key).

Entering User Mode

   Since amd64 has no "direct" instruction to switch to user mode we have
   to pretend that the container is returning either from interrupt
   (iretq) or system call (sysretq). We used the system call intruction as
   it does not touch user's stack, thus we can avoid writing to user's
   memory (which we avoid for security reasons).

Traps and Interrupts

   256 slots total.

   Must be acknowledged

   Priorities, higher is handled first

   TODO: do software excetions need to be acknowledged?

   Needs one Task State Segment (TSS)

   User space exception list:
     * #DE - divide by zero (vector 0)
     * #PF - page fault (vector 14) - TODO: user can't read CR2; should
       kernel push CR2?
     * #XF - SIMD floating point (vector 19)

   User space rare exceptions (which we let double fault):
     * #SS - stack exception (vector 12) - TODO: detects canonical address
       issue
     * #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions
       and very rare
     * #GP - general protection (vector 13) - TODO: mostly for unaligned
       SSE
     * #AC - alignment check (vector 17) - can happen only if CPL = 3

   Kernel space exception list:
     * #DF - double fault (vector 8)
     * #MC - machine check (vector 18)

   Legacy exceptions:
     * Non-maskable interrupt (vector 2)
     * #OF - overflow (vector 4)
     * #BR - bound range (vector 5)
     * #NM - device not available (vector 7)
     * #TS - invalid TSS (vector 10)
     * #NP - segment not present (vector 11)
     * #MF - x87 floating point (vector 16)

  Story

   Thanks to #osdev for help.

   After implementing divide zero exception the kernel start rebooting.
   Changing the .text segment to writable stopped the restarting and
   seemingly fixed the problem. However, exception should not modify the
   .text segment so after verifying the IDT is correct I had to check what
   changed in .text segment.

   I copied all .text data and run comparison after the fault happened.
   The mismatch was found:
mismatch at FFFFFF80000000C0: 20990000000000 20980000000000

   Which refered to GDT code segment entry and specifically the accessed
   bit got flipped. The reason is that during fault the code segment is
   loaded, but for syscall/sysret it is not, so it was confusing.

   For now, the solution was to move non-bootstrap GDT to .data segment.
   NOTE: bootstrap GDT is in .text segment and thus the page for starting
   APs must be writable.

  Division by zero

   When asked on IRC to about advancing RIP after division by zero I got
   this response :).

   It's *possible*, but it's also likely to break dozens of known good
   assumptions about computer science and invoke a metric shitload of
   undefined behaviour in the process (Kazinsal #osdev)

  Timer Interrupt

   So far, all non-critical exceptions were handled by currently running
   user mode container. Unfortunately, I am not sure how to make the LAPIC
   timer in user mode as it requires writes to a MSR to reset it (TSC
   deadline), and more importantly kernel has to relinquish control in
   case of adversary/buggy container that is reluctant to yield on the
   interrupt.

   Initially, I thought setting up two timers, one at higher frequency
   (user mode) and the other at lower frequency (kernel mode) would be
   neat, but LAPIC has only one timer interrupt vector. Thus, the current
   plan is to point the interrupt vector to kernel which will bump up
   penalty integer (if next quantum is a different container) and returns
   to present container. Since the registers on context switch are saved
   by containers, kernel must (I think) use either user mode stack (can be
   read-only) or registers to at minimum save current RIP of the user
   container before jumping to the handler. If sysretq is used, RCX and
   R11 must be saved on user stack in addition, thus to better mimick
   traditional exception, a user read-only stack seems as a better
   solution (there is the GS/FS register too, but I rather have user
   complete freedom and not reserve any registers - probably makes it more
   portable too). At the end, I decided to create read-only container
   environment (basically, shared memory with kernel) that can be used to
   share information such as, the current penalty count, last RIP (on
   interrupt), next quantum container id, and other different kernel
   counters for active container. This area requires one frame (2 MiB) and
   is mapped as write into kernel, and as read into user. The policy of
   kernel never reading any user mode memory (only registers are allowed)
   is still enforced.

   After kernel returns to the user mode interrupt handler, the container
   is expected to save its context if it is necessary and yield. There are
   exceptions, for example if the next quantum is the same container,
   there is no need to save anything nor yield, the container can restore
   the stack pointer and jump back to the saved or different RIP. This
   mechanism can be used to implement user space threads by simply
   allocating few consecutive quanta (such as 10, each 1 ms long) and
   yield only on the last one, so the kernel can switch to different
   container.

   Normally, I use null segment for SS, but when returning from CPL < 3
   into CPL = 3 via iretq, the SS will get reloaded and will cause general
   protection fault. Thus a not null selector must be created in GDT and
   SS must be loaded with that selector index. We use a simple trick where
   SS gets loaded with proper selector after executing sysretq instruction
   (starts user mode container). It is possible to use sysretq, but then
   two extra registers needs to be stored in info (rcx and r11); with
   iretq we need to store only rip. For user space exceptions we do not
   follow the manual and avoid iretq as it would set the accesses bit in
   GDT entry and page fault. (TODO: manual explicitly requires to use
   iretq)

   TODO: handle what happens if container's interrupt handler gets
   interrupted (I am thinking about unique ID written by kernel into info
   as RIP can't be used reliably to identify interrupts). The container
   should be penalized (and eventually killed), but more importantly,
   there must be a plan for handling the user space stack in sane way (and
   not losing some space on it, due to RSP being bump in handler, but not
   decremented).

   Protocol:
    1. interrupt jumps into kernel handler
    2. if current container penalty is 3, kill it and schedule next
       container
    3. increase penalty
    4. update RIP in the info
    5. set next LAPIC deadline
    6. jump into user handler
    7. user is expected to save registers and yield
    8. decrease penalty (and if negative set to 0)
    9. schedule next container

   Notice, the penalty indicates if there is any nesting in timer
   interrupt. A well behaving program should always have penalty equal 1
   when entering timer interrupt handler.

   Since the user mode bootstrap container must handle timer interrupt, we
   parse the ELF symbol table and search for timer_interrupt_asm assembly
   function. This function is then called by kernel timer interrupt
   handler.

   The kernel interrupt handler is fully written in assembly to avoid any
   surprises in terms of register modifications. The current
   implementation needs to push and pop only 3 registers, RAX, RCX, and
   RDX.

Compile Flags

   -mno-red-zone (disable red zone as interrupts do not respect convention
   of leaving extra stack space for leaf functions)

Tasks

     * tests
          + [ ] frame allocator
          + [ ] sha256
     * style
          + [X] header only
          + [ ] Intel-style assembly
     * bootloader
          + [X] load rest of bootloader with BIOS (sort of 2 stage)
          + [X] switch from real mode to long mode
          + [ ] query video modes and save this information at fixed
            address
          + [ ] switch to most suitable video mode
          + [X] load ELF binary containing kernel (uses IDE)
          + [ ] switch to SATA?
          + [ ] boot on real hardware (needs SATA)
     * kernel
          + [X] make work on KVM (needed to disable SMAP on Haswell)
          + [?] add performance counters to evaluate impact of recursive
            paging on TLB misses
               o abandoned recursive paging due to clunky mapping of other
                 page tables
          + [X] start application processors (AP)
          + [X] create kernel process for each application processor (AP)
          + [X] add spinlock (test it by cores racing on it)
          + [X] use RDGSBASE to get pointer to core structure instead of
            the current offset stuff
          + [X] check required feature set with CPUID
          + [ ] SKINIT instruction (read Security section in manual)
          + [ ] restructure so the end is at 16 MiB boundary
          + [ ] avoid global variables
          + [ ] use ILP64 mode (sizeof (int) == 8) - seems hard as there
            is no compiler support
          + [X] timer (local APIC; TSC-Deadline mode)
          + [X] kernel double fault handler
          + [X] enable SCE (system call extensions) in EFER
          + [X] add system call support (syscall)
          + [ ] syscall user rip security (validate it is canonical?)
          + [X] add containers (similar to process abstraction)
          + [X] load user space bootstrap ELF embedded inside kernel
          + [X] page permissions for ELF segments
          + [X] parse ELF symbol table to find timer interrupt handler
          + [X] switch to user space (via sysretq)
          + [X] use x2APIC (through MSR) instead of old xAPIC
          + [X] wait 10 ms on AP startup
          + [ ] IPC (async to avoid deadlocks...)
          + [ ] AVX2
          + [X] one TSS
          + [ ] sleep idle cores (hlt instructions)
          + [ ] expose some data structures (and update during task
            switch?)
          + [ ] Continuation style system (TODO: read papers on it)
          + [ ] init system (so servers can reserve resources)
          + [ ] struct type for physical addresses (what about linear
            addresses? always a pointer?)
          + [X] add HPET support (to compute TSC rate)
          + [ ] compute TSC rate for each core independently (good idea?)
          + [ ] destroy container on double fault or other error
     * memory management (only 2 MiB pages, thus only PML4, PDP, and PD is
       used)
          + [X] boot page table
          + [X] kernel page table
          + [X] recursive PML4 (no longer used for kernel space)
          + [X] frame allocator
          + [X] page allocator
          + [X] heap allocator (only alloc, no free)
          + [X] duplicated page table (allows easy traversal of any page
            table - even inactive one)
          + [X] page mapping interface (global marked pages for kernel to
            avoid TLB invalidation; write protect bit CR0)
          + [ ] alignment checking in user space?
          + [X] Supervisor-Mode Execution Prevention (CR4.SMEP)
          + [X] non-executable pages (NXE in EFER)
          + [ ] TLB (tagged)
          + [X] shared memory (uses capabilities)
          + [ ] replace heap with slab allocator
     * scheduling
          + [ ] penalty protocol for wonky containers
          + [ ] round-robin scheduler
          + [ ] container migration to other core
     * syscalls
          + [X] log to UART
          + [ ] frame alloc/free
          + [ ] map/unmap
          + [X] alloc/free container
          + [X] quanta alloc, yield
          + [X] userspace register store
     * debugging
          + [X] setup .gdbinit for automatic GDB startup
     * security
          + [ ] separate page table for user and kernel (KAISER)
          + [X] use guard page for each core kernel stack
          + [X] proper page flags for .text, .rodata, .data
          + [X] capability-based resource management
          + [ ] separate memory pools for each struct type (attacker can
            only modify same type of struct)
          + [ ] MTTR (prevent DMA to write into kernel memory)
          + [ ] kernel ASLR (both linear and physical address; PIC needed)
          + [ ] random location for user stack
          + [ ] finer granularity permissions for capabilities
          + [ ] check if AES can be used instead of sha256
          + [ ] namespaces?
          + [ ] string type with length?
          + [ ] stack protection
          + [ ] do not map kernel into user space? (recent patches in NT
            and Linux do it)
     * libs
          + [X] sprintf
          + [X] sha256 routine
          + [X] random numbers (via rdrand instruction)
          + [X] strcmp
     * user space
          + [X] simple game of life render test
          + [X] exceptions (divide by zero, page fault, ...)
          + [X] timer interrupt
          + [ ] compositor
     * ACPI
          + [X] parse MADT for detecting CPUs
          + [X] build parsed struct with LAPICs
          + [ ] parse x2APIC (Qemu unfortunately has ACPI 1.0 only)
     * drivers
          + [ ] SATA (high priority)
          + [X] basic serial port
          + [X] IDE
          + [ ] keyboard (needs USB stack)
          + [ ] mouse (needs USB stack)
          + [ ] Intel GPU driver (blitting support)
     * graphics
          + [ ] GUI library
          + [ ] tiling manager
     * file system
          + [ ] sector-based addressing (similar to page frames)
          + [ ] memory mapped filesystem?
          + [ ] some filesystem format
     * networking (not likely before Christmas)
          + [ ] ethernet driver
          + [ ] packet filter (downloadable?)
     * JIT (use wasm or some other typesafe IR and thus run everything at
       ring 0)
          + [X] interpreter of a subset of WASM

References

     * Exokernel: An Operating System Architecture for Application-Level
       Resource Management
     * Singularity: Rethinking the Software Stack
     * Amoeba: a distributed operating system for the 1990s
     * Scheduler Activations: Effective Kernel Support for the User-Level
       Management of Parallelism
     * A Comparison of Scheduling Algorithms for Multiprocessors
     * L4 Microkernels: The Lessons from 20 Years of Research and
       Deployment
     * xv6, JOS, seL4
     * osdev wiki
     * Operating Systems Design and Implementation
     * AMD64 Volume 1, 2, 3
     * ATA Interface Reference Manual
     * Advanced Configuration and Power Interface (ACPI) Specification
     * ELF-64 Object File Format
     * VESA BIOS EXTENSION (VBE) Core Functions Standard
     * [3]The little book about OS development
     * [4]Reading privileged memory with a side-channel
     * [5]Kaiser
     * [6]Hyperkernel: Push-Button Verification of an OS Kernel
     * [7]Dune: Safe User-level Access to Privileged CPU Features
     * Modern Operating Systems - 3rd edition, A. Tanenbaum

References

   1. https://klacansky.com/notes/data/useless-os.zip
   2. http://wiki.osdev.org/ATA_PIO_Mode
   3. https://littleosbook.github.io/
   4. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
   5. https://klacansky.com/notes/reference/uos/kaiser.pdf
   6. https://klacansky.com/notes/reference/uos/hyperkernel.pdf
   7. https://klacansky.com/notes/reference/uos/dune.pdf