1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
|
Useless Operating System
[1]Code
In my final project for the advanced operating system class taught by
John Regehr, I implemented parts of an operating system inspired by
exo/micro kernel design.
The overall goal is to expose hardware as much as possible while
providing protection.
I decided to stick with amd64 architecture as porting to aarch64 seems
unfeasible given the time frame and I primarily use amd64 CPUs so I
want to know more about them. UOS does not do identity mapping of
physical memory outside boot thus kernel will be using virtual memory
in almost same way as user mode programs.
After implementing small OS I recommend it to all low-level or high
performance oriented programmers. More importantly, OS development is
joy, especially the design of interfaces and low-level tinkering.
Unfortunately, developing a usable operating system is no small task
(all those drivers) and user mode applications severely constrain the
kernel design. I therefore moved to compilers and programming
languages, which are tractable for a single person (in the days of
LLVM), and can significantly improve ones life by making programming
fun againk
Also, the development made me finally appreciate debugger, where
simpler printf can't be used as registers need to be inspected before
and after instruction execution, e.g., before and after sysret.
Important
* disable red zone with compiler flag -mno-red-zone (can get
corrupted in nested interrupts)
* make sure pointers are in canonical form
Motivation
In the class we used xv6, a clean Unix clone with approachable code.
However, adding VESA BIOS Extensions framebuffer required new system
calls and kernel changes which was clunky. I especially disliked that
any experiment required tinkering with the kernel, and provided
abstractions were quite high level.
Moreover, the xv6 operating system runs on 32-bit architecture, but
most modern CPUs are all 64-bit. Futhermore, the Amd64 architecture is
much simpler as it removes tons of old cruft such as segmentation and
has simple system calls.
The primary inspiration for UOS are MIT's JOS (micro/exo kernel) and a
bit L4 (microkernel). For example, almost all exceptions are handled by
user and not kernel, including page fault. The timer interrupt is
forwarded to user, but in case of buggy/malicious container, kernel
will wait few slots and then kill the misbehaving container. Since the
user knows more about its use of registers, it can save only the
necessary ones, e.g., may not save floating point registers. On context
switch, kernel stores only instruction and stack pointers.
I decided to go with capability-based permission system, where each
resource is associated with a "lock" and the key to the lock consists
of hashed permissions and the lock itself. Kernel/user can easily
verify the key valid by hashing the lock and permissions of the key and
checking if the key's hash matches. This approaches makes kernel
simpler, as memory that can't be exposed to user (e.g., < 1 MB) is
simply associated with a lock that only kernel knows. Therefore, no
special range checks are required in the kernel when user maps memory.
Shared memory is also simple to implement, as a container can easily
create keys for the memory it owns and give it to a newly created
container. Also, kernel uses same interface as user, i.e., it presents
key if it needs to map memory. TODO: how to safely store key if
Meltdown is present?
UOS in constrast to Linux does not have a text mode. All assembly is in
separate files as I am convinced inline assembly is not good solution.
Moreover, kernel consists of only single .c file, which simplifies the
build system to a single compile command. It targets desktops and uses
2 MiB pages.
Amd64 Architecture
Most desktops or laptops contain amd64 processor either from Intel or
Amd. Since UOS targets personal computers it supports only Amd64
architecture. This choice simplifies the kernel as 32bit architectures
have much more legacy cruft. For example, Amd64 has no segmentation,
good syscall interface, simpler instruction set and only one necessary
task segment. The arguments for C ABI are passed in registers. UOS does
not allow kernel to read any user memory without explicit mapping.
Moreover, since there is more registers, all syscall arguments must use
them.
Any virtual address must be in canonical form, i.e., if bit 47
(counting from 0) is set, then all most significant bits must be also
set. This quirk divides the address space into 2 pieces with huge hole
in between them. Pretty annoying as UOS is not higher half kernel.
The System V ABI requires first 6 integer or pointer arguments to be
passed in registers in order rdi, rsi, rdx, rcx, r8, and r9. The
remaining arguments (7 and more) are passed on stack.
Overview of Startup
At boot, Basic Input/Output System (BIOS) loads specific sector of 512
bytes into memory at fixed address and starts execution. Usually, this
program is a bootloader that loads the kernel, or in our case second
stage of the bootloader as 64 bit assembly takes more space and I could
not fit the ELF loader into a single sector. The second stage parses
ELF file (kernel) on disk and loads it into memory. The kernel
bootstraps itself by setting up physical and virtual memory managers
and core struct for the boot processor. It then switches to new page
table and stack and continues by starting other processors and loading
user init program into a container. We will now discuss individual
stages in more detail.
Bootloader
We focus on using BIOS instead of UEFI due to simplicity of the setup
as there is no need for additional tools to build BIOS bootable binary.
The disadvantage of BIOS is all the legacy cruft, such as starting in
Real Mode and having only 512 bytes for bootloader.
The purpose of bootloader is to do initial setup (such as video mode,
virtual memory) and load kernel from a disk into a memory. Since we
could not squeeze bootloader with ELF loader into 512 bytes like xv6,
we split the bootloader into two stages, where the first stage's
purpose is to only load the second stage. For now, we use 4 KiB of
space for both stages and kernel starts at sector 8.
After the video mode and long mode setup, the bootloader reads kernel
from ELF binary into predefined location and jumps into the kernel
entry point. IDE driver is used to perform reading from a disk. UOS
does not support multiboot standard as xv6, so it will work only with
our bootloader.
The first stage must be 512 bytes and the assembly needs to be padded
to fill 512 bytes (and adding MBR bytes at 510 offset).
A20 Line
Used to ignore 21st bit of address to emulate wraparound from older
architectures. We have to enable this 21st bit so we can address all
memory.
xv6 bootloader uses keyboard controller to enable this bit. We opt for
BIOS and its functions.
movw $0x2401, %ax
int $0x15
http://www.win.tue.nl/~aeb/linux/kbd/A20.html
Global Descriptor Table
entry point of bootblock.o must be at the address 0x7C00 which can be
double checked by doing objdump -d bootblock.o
The simplest "debugging" can be done by writing to screen; segmentation
must be used as CGA address of 0xB8000 is outside 16-bit range. This
addressing can be done by setting %ds to 0xB800 (segment is multiplied
by 16 on access).
We directly transition from real mode to long mode without going
through protected mode. Since the boot sector is small (512 bytes) we
have to manually set up page tables instead of loading them from an
array.
We can't use .data segment in bootloader C code (for example to store
static string) as only .text is copied with objcopy (-j flag; TODO).
Bootstrap Page Table
We map same physical address [0, 1 GiB) to the virtual addresses [0x0,
1 GiB), and [0x8000000000, 0x8000000000 + 1 GiB). This is necessary as
we linked kernel to reside at address 0x8000100000 (as we want it to be
mapped into each process space) and when we jump to it it must be
mapped properly. The first 1 GiB in virtual address space are mapped
because it is the space in which bootloader lives and from where we
will perform the jump.
The reason for choosing address where all 39 bits are zero is so we can
have only top level page directory (PML4) which shares mapping with the
[0, 1 GiB) entries. This sharing is simply done by pointing PML4 entry
to the same entry in PDP (next level page table) as for the first
entry.
direct switch to long mode
Real Hardware
UOS does not boot on my desktop as it would need SATA driver. If
booting from USB, tt turns out if the disk lacks partition table the
BIOS will set DL register to 0x0, i.e., floppy, and floppy does not
support linear base addressing (LBA) which we use in IDE driver.
ELF
Fixed-length ELF header and variable-length program header listing
* .text program instructions
* .rodata read-only data (we should make it only read-only memory)
* .data program's data (e.g., global variables)
* .bss is a reserved section for unitialized global variables (we
need to zero them)
* VMA is link address LMA is load address of .text section (as the
offsets in code are absolute due to not using Position Independent
Code)
For some stupid reason .rodata is together with .text in same segment,
making .rodata executable. (http://thiébaud.fr/rodata.html)
Symbol Table
Symbol table associates names with linear addresses and thus can be
used by the loader to find addresses of exception and interrupt
handlers (those must have some agreed names).
Drivers
UART
Serial port for debugging. Uses port-mapped I/O which ignores virtual
memory (thus even if we do not map first 2 MiB it works).
IDE driver
[2]Reference
We need to read kernel from the primary ATA disk in bootloader to
initialize the core OS.
We wrote the function for reading sectors in assembly to avoid inline
assembly. Also, we could wrap individual instructions in assembly
functions but it was much simpler this way.
Kernel Bootstrap
Mostly disabling legacy stuff and enabling modern features as the
processor assumes it woke up in last century.
Memory Management
* frame allocator (2 MiB)
* avoid mapping physical memory (higher half)
* tried recursive paging (pain to modify other processes, TLB
pressure)
* ended up with linear/physical dual mapped page tables
* kernel pages marked as global
* kernel heap (used for most allocations)
There are roughly three levels of memory management:
* physical memory
* virtual memory
* heap
We use recursive page table mapping to avoid the need to remap page
tables each time we need to access them or to use contiguous memory
storage for the VA to PA and PA to VA conversions which are necessary
for updating the page table. The recursive mapping is the simplest as
we need these conversions only for page table manipulation. In other
cases such as memory mapped IO we would use a simple mapping.
Hmm, recursive mapping allows us to access entries of page table only
for currently loaded page table which is really annoying.
There are several approaches:
* remap physical memory to kernel and use simple +-KERNEL_BASE
* use recursive page table (may use it for user space)
* keep small space available and map the page table before modifying
it
* store virtual and physical addresses for page table (have to keep
them in sync, allows for custom format which gets rid of canonical
address issues NX bit, problems with dirty and accessed flags)
Recursive Page Table
By mapping pml4 entry to pml4 we can access all paging structures for
current process. The main advantage is the simplicity and constant
virtual address space used in kernel (since only pml4 is mapped in
kernel space). If we need to modify another process's page table we can
point it to a pml4 slot and use the recursive trick on pdp (which is
pml4 of the other process).
Unfortunately, there are many disadvantages:
* since we use 2 MiB pages we have to be careful as the recursive
mapping behaves as if we used 4 KiB pages
* large TLB footprint (as we use 2 MiB pages)
* clunky to modify other process page table (the pdp becomes pml4,
...) - rare, but needed for container creation
* weird bootstrap during boot
* limited to x86 (not big deal for now)
Duplicated Page Table
For kernel we have a separate page table data structure that mimics
hardware page table.
TODO: use completely different data layout so we can easily traverse
the pointers from pml4 -> pdp -> pd.
For user space we may use recursive mapping to allow easy view of the
current virtual address space. (TODO: what about TLB pollution? We
can't easily map page tables as they live in kernel heap)
Backing pages cannot be moved without knowledge of the heap to avoid
having a separate structure for page tables.
Initialization
Since at boot we do not have valid kernel page table we have to build
one. This build process is quite different from modifying current page
table (one in CR3) as we can't use recursive mapping to access the
pml4, pdp, and pd tables. Furthermore, since the page table is not used
we do not need to invalidate TLB entries.
We assume the heap is physically contiguous at the boot time and use a
simple offset to setup a initial kernel page table. Alternative
solution would be to walk the boot page directory and find linear
address for a given physical address (the other way is handled by
recursive mapping).
Manager
Still go back and forth between the options but leaning towards thin
layer in kernel which will provide frame management and ability to map
virtual to physical addresses. The kernel would enforce permissions.
The other options are:
* provide only sbrk to grow heap (pain in ass to map any specific
frames)
* allocate fixed number of frames for kernel and give rest of frames
to user space memory manager (two context switches, painful)
* I can't think of anything else
Security
If I go with 2 MiB pages than we need at least 4 pages for a program (8
MiB):
* .text, read-only
* .rodata, read-only, no-execute
* .data, write, no-execute
* stack, write, no-execute
Heap
I started with a simple heap that allocates multiple of 4096 bytes, but
after watching Switch hacking talk (TODO reference) I plan to have a
separate allocator (slab) for each struct type, thus avoiding the
potential vulnerability of overwriting different struct that can happen
on heap (as two different types of struct cannot possibly overlap, and
also there are no pointer casting issues).
Security
* NX, WP, SMAP, SMEP
* guard pages
* proper permissions for elf segments
* capability-based permission (kernel/user are "same")
The UOS is a exo/micro kernel design and thus it runs all drivers in
user mode (ring 3).
Kernel is prohibited to write to any user mapped memory directly and
must map the memory to separate address if it wants to write there (for
example for the info structure). Kernel never reads any user memory and
all syscall arguments must be passed in registers.
The recent side-channel attacks are concerning and I will probably go
with the separate page table for kernel and user (with the addition of
PCID). The Spectre attack seem much harder to mitigate. The attack
allows user to read all memory, thus it could read kernel key and
corresponding capabilities created from the key. One solution would be
to store the keys and capabilities in a special area not mapped when
user space is in flight (probably hard to not make a mistake); could we
create kernel where everything is non-sensitive?.
he UOS is a exo/micro kernel design and thus it runs all drivers in
user mode (ring 3).
Kernel is prohibited to write to any user mapped memory directly and
must map the memory to separate address if it wants to write there (for
example for the info structure). Kernel never reads any user memory and
all syscall arguments must be passed in registers.
The recent side-channel attacks are concerning and I will probably go
with the separate page table for kernel and user (with the addition of
PCID). The Spectre attack seem much harder to mitigate. The attack
allows user to read all memory, thus it could read kernel key and
corresponding capabilities created from the key. One solution would be
to store the keys and capabilities in a special area not mapped when
user space is in flight (probably hard to not make a mistake); could we
create kernel where everything is non-sensitive?.
Capability-based Mechanisms
Instead of user/group based system have fine grained badges that give
owner permissions. Access control lists (ACL) store list of things that
have access to given resource. Capabilities store list of resources a
thing can access.
Strive for simplicity
c-list (capability list), trapdoor functions?
random numbers (rdrand, rdseed)
How to do these?
* capability per frame
* capability per core
Anton Burtsev, http://www.ics.uci.edu/~aburtsev/
References:
* https://www.cs.cornell.edu/courses/cs513/2007fa/L08.html
* https://software.intel.com/en-us/articles/intel-digital-random-numb
er-generator-drng-software-implementation-guide
Design
Initial approach was to have each frame associated with capability,
thus when use allocated a frame he would get a key to it. This method
was clunky as it would be tricky to allocate multiple frames and it
placed large burden onto user. Furthermore, we were associating the
object number with the key to identify it, which caused same problems.
Instead, we use user provided key (can be random number) when a frame
is acquired, thus the same key can be used with any number of frames.
User can create capability by hashing the key and permissions. Then
when any access such as mapping the frame is performed, OS performs
same hashing and check if the hash in capability matches the computed
one.
TODO: test if we can use AES instead of sha256
Multicore
* ACPI table to get list of cores
* x2APIC (MSR)
* GS register
* one kernel stack per core
* create container in user space (aka fork)
There are at least two ways to detect Application Processors (AP) after
Bootstrap Processor (BSP) starts running the bootloader: Multiprocessor
Specification tables or ACPI (specifically MADT).
xv6 uses MPS tables which are obsolete, so we focus on the ACPI
approach. The ACPI table pointer structure lives at some memory range
and can be checksumed to verify if the struct is really an ACPI table.
Since Qemu does not support x2APIC we have to memory map the LAPIC
registers to query the APIC id (other alternative is to use CPUID). The
ACPI table tells us this address and moreover it has a flag if core is
working (the table is filled during boot by each core).
After parsing the tables we loop through all core entries and if the
LAPIC id is different from the id of BSP we attempt to start the core
with interprocess interrupt. Then we wait until the core updates the
shared core array with running flag. We could do exponential startup
for large core counts, but currently it is not an issue. Note, we use
x2APIC which uses MSR as an interface instead of memory mapping.
Each core has its own kernel stack and pml4 with shared entries for the
kernel data structures (including heap) and for the entry which holds
all core kernel stacks (TODO: should cores see other cores stacks?).
User Mode
* params passed only in registers
* fast syscall/sysret
* low level (map memory, create container)
Graphics
The simplest way to draw pixels is uisng VGA mode 0x13 (320x200 with
256 colors with linear addressing) with BIOS interrupt 0x10 in
protected mode. This mode switch is done in bootloader before protected
mode is entered as only real mode can sanely interact with BIOS to
manipulate VGA graphics.
# two moves can be combined into movw $0x0013, %ax
movb $0x0, %ah # call function 0x0 to set video mode
movb $0x13, %al # use video mode 0x13
# or movw $0x13, %ax
int $0x10
After changing video mode, the pixel can be accessed at physical
address 0xA0000, which in xv6 is mapped at KERNBASE + 0xA0000.
Unfortunately, the small resolution (and only 256 colors) provided by
VGA is restrictive and unsuitable for any graphics based system.
The simplest solution is to use VESA BIOS Extensions (VBE) which can
provide more than 1kx1k resolutions with up to 16M colors. Even though
modes defined by VESA are considered deprecated, they are likely to
work and allow to avoid finding suitable mode. This predefined modes
come handy as real mode needs to be used and xv6's boot code can't be
more than 512 bytes. Other solution would be to drop down to real mode
before switching to long mode (TODO).
For example, setting 1280x1024 with 16M colors and loading its
information.
movw $0x4F02, %ax # set VBE mode function
movw $0x411B, %bx # 1280x1024 24bit mode (with linear framebuffer)
int $0x10
movw $0x4F01, %ax # get mode information function
movw $0x411B, %bx # mode
movw $0x0, %di # store the mode information struct (256 bytes) at physi
cal address 0x0
int $0x10
Than we can qeury mode parameters by loading the ModeInfoBlock
structure. For simplicity, we load it at fixed physical address 0x0 as
kernel code starts on address 0x100000. Since we care only about the
PhysBasePtr field (tells us location of linear framebuffer), we could
pass it as function parameter to the kernel bootstrap function.
Now, the tricky bit; we need to create virtual memory mapping to the
physical address in PhysBasePtr. On my machine in Qemu, the address is
0xFD000000 and we need to map 4 MiB (depends on resolution and bit
depth) up to the address 0xFE000000 (defined in memlayout.h as
DEVSPACE, i.e., memory mapped I/O).
I used mappages (vm.c) in function main (main.c) to map some fixed
virtual address in FRAMEBUFFER to physical address at vbe->physbaseptr
before starting other processors.
struct vbemodeinfo const *const vbe = (void *)KERNBASE;
int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm);
mappages(kpgdir, (void *)FRAMEBUFFER, DEVSPACE - FRAMEBUFFER, vbe->physbaseptr,
PTE_W);
Now, pixels can be written at the address FRAMEBUFFER anywhere in
kernel. I was surprised I could not find any exported function to
perform the memory mapping explicitly, another reason for exokernel.
Floating Points
By default disabled on amd64
We can test in bootloader first if AVX2 is supported, and if not hang
the OS. Some OS will trap if user used floating point unit and enable
it. This mechanism allows OS to avoid storing/restoring floating point
registers during context switch if process does not use those
registers. Since UOS kernel does not save context at all (except few
registers on timer interrupt), it enables AVX2 at the kernel startup
and let's user to do register store/restore if necessary. In fact, this
approach is simpler as user knows what registers are being used and can
do some tricks. Moreover, since kernel does not use floating point
registers, user can pass values in registers to the next scheduled
container.
Kernel
We map kernel at high virtual address as user space programs assume
loading at address 0.
Global Descriptor Table
The GDT from bootloader has only kernel code segment and no user code
segment, nor Task State Segment (TSS). Initially, we had one GDT shared
by all cores, but each core needs its own TSS, thus we could either
create an entry for each core, or better, duplicate whole GDT to obtain
full flexibility at mere cost of extra 32 bytes per core. Keeping the
GDT separate is the right step into the direction of sharing as little
as possible between cores to scale well on many core architectures (>
32 cores).
Memory Layout
UOS's kernel does not remap physical memory contiguously (higher half)
as xv6 does. Instead, the kernel uses the same interfaces as user
programs, allocating frame and mapping it. In linear address space, the
kernel lives at entry 511 (last) in PML4 page table and thus takes 512
GB. We chose to take whole entry for simplicity as it can be easily
shared by all user programs. Moreover, we mark the entry as global to
avoid flushing TLB on context switch.
Each core's stack lives in entry 510 and is again marked global. Stacks
are one page large (2 MiB), and have guard pages interspersed between
the stacks. We do not have kernel stack per container as there are no
slow system calls in the kernel (such as IO) that would need to be
interrupted. This push of expensive operations to user space makes
kernel simpler and easier to reason about.
Linking
Containers
Instead of notion of a traditional UNIX process, UOS uses containers as
its execution unit abstraction. A container is akin to a process, but
it also has access to exceptions, low-level memory management, and
scheduling. For example, user can easily implement preemtive threads
inside a container.
In short, container has following entries:
* unique identifier
* PML4 pointer to a page table
* frame for storing rsp and rip during system call
* status (running, ...)
* penalty (increased on each tick, decreased on yield)
In contrast to xv6, we do not have any scheduler thread, so the kernel
thread becomes a container after bootstrap and if container is killed,
next container is scheduled (or in lack of container kernel panics).
User Space Bootstrap
The initial user mode binary is embedded into the kernel as there is no
disk driver in kernel. All binaries use custom link script so there are
only .text, .rodata, and .data, and they have proper permissions so
they can be loaded into separate pages.
The ELF file is parsed for timer interrupt handler which needs to be
passed to function allocating quantum (so it will be scheduled to run).
System Calls
xv6 uses interrupt 0x10 for system calls. We use the native
syscall/sysret instructions present on amd64 architectures specifically
designed for system calls.
Since the syscall instruction uses rcx register to store the returning
instruction pointer we copy the value in the register to caller saved
register and restore it back in the syscall to preserve C calling
convention.
We may need to move away from C calling convention due the need of
having more than 6 function arguments for the syscall (7-th is on the
stack and uOS is prohibited from touching any user memory).
UOS relies on a C calling convention for system calls, but it must
clear caller saved registers before returning to the user as it could
leak information from the kernel (e.g., private key).
Entering User Mode
Since amd64 has no "direct" instruction to switch to user mode we have
to pretend that the container is returning either from interrupt
(iretq) or system call (sysretq). We used the system call intruction as
it does not touch user's stack, thus we can avoid writing to user's
memory (which we avoid for security reasons).
Traps and Interrupts
256 slots total.
Must be acknowledged
Priorities, higher is handled first
TODO: do software excetions need to be acknowledged?
Needs one Task State Segment (TSS)
User space exception list:
* #DE - divide by zero (vector 0)
* #PF - page fault (vector 14) - TODO: user can't read CR2; should
kernel push CR2?
* #XF - SIMD floating point (vector 19)
User space rare exceptions (which we let double fault):
* #SS - stack exception (vector 12) - TODO: detects canonical address
issue
* #UD - invalid opcode (vector 6) - TODO: mostly legacy instructions
and very rare
* #GP - general protection (vector 13) - TODO: mostly for unaligned
SSE
* #AC - alignment check (vector 17) - can happen only if CPL = 3
Kernel space exception list:
* #DF - double fault (vector 8)
* #MC - machine check (vector 18)
Legacy exceptions:
* Non-maskable interrupt (vector 2)
* #OF - overflow (vector 4)
* #BR - bound range (vector 5)
* #NM - device not available (vector 7)
* #TS - invalid TSS (vector 10)
* #NP - segment not present (vector 11)
* #MF - x87 floating point (vector 16)
Story
Thanks to #osdev for help.
After implementing divide zero exception the kernel start rebooting.
Changing the .text segment to writable stopped the restarting and
seemingly fixed the problem. However, exception should not modify the
.text segment so after verifying the IDT is correct I had to check what
changed in .text segment.
I copied all .text data and run comparison after the fault happened.
The mismatch was found:
mismatch at FFFFFF80000000C0: 20990000000000 20980000000000
Which refered to GDT code segment entry and specifically the accessed
bit got flipped. The reason is that during fault the code segment is
loaded, but for syscall/sysret it is not, so it was confusing.
For now, the solution was to move non-bootstrap GDT to .data segment.
NOTE: bootstrap GDT is in .text segment and thus the page for starting
APs must be writable.
Division by zero
When asked on IRC to about advancing RIP after division by zero I got
this response :).
It's *possible*, but it's also likely to break dozens of known good
assumptions about computer science and invoke a metric shitload of
undefined behaviour in the process (Kazinsal #osdev)
Timer Interrupt
So far, all non-critical exceptions were handled by currently running
user mode container. Unfortunately, I am not sure how to make the LAPIC
timer in user mode as it requires writes to a MSR to reset it (TSC
deadline), and more importantly kernel has to relinquish control in
case of adversary/buggy container that is reluctant to yield on the
interrupt.
Initially, I thought setting up two timers, one at higher frequency
(user mode) and the other at lower frequency (kernel mode) would be
neat, but LAPIC has only one timer interrupt vector. Thus, the current
plan is to point the interrupt vector to kernel which will bump up
penalty integer (if next quantum is a different container) and returns
to present container. Since the registers on context switch are saved
by containers, kernel must (I think) use either user mode stack (can be
read-only) or registers to at minimum save current RIP of the user
container before jumping to the handler. If sysretq is used, RCX and
R11 must be saved on user stack in addition, thus to better mimick
traditional exception, a user read-only stack seems as a better
solution (there is the GS/FS register too, but I rather have user
complete freedom and not reserve any registers - probably makes it more
portable too). At the end, I decided to create read-only container
environment (basically, shared memory with kernel) that can be used to
share information such as, the current penalty count, last RIP (on
interrupt), next quantum container id, and other different kernel
counters for active container. This area requires one frame (2 MiB) and
is mapped as write into kernel, and as read into user. The policy of
kernel never reading any user mode memory (only registers are allowed)
is still enforced.
After kernel returns to the user mode interrupt handler, the container
is expected to save its context if it is necessary and yield. There are
exceptions, for example if the next quantum is the same container,
there is no need to save anything nor yield, the container can restore
the stack pointer and jump back to the saved or different RIP. This
mechanism can be used to implement user space threads by simply
allocating few consecutive quanta (such as 10, each 1 ms long) and
yield only on the last one, so the kernel can switch to different
container.
Normally, I use null segment for SS, but when returning from CPL < 3
into CPL = 3 via iretq, the SS will get reloaded and will cause general
protection fault. Thus a not null selector must be created in GDT and
SS must be loaded with that selector index. We use a simple trick where
SS gets loaded with proper selector after executing sysretq instruction
(starts user mode container). It is possible to use sysretq, but then
two extra registers needs to be stored in info (rcx and r11); with
iretq we need to store only rip. For user space exceptions we do not
follow the manual and avoid iretq as it would set the accesses bit in
GDT entry and page fault. (TODO: manual explicitly requires to use
iretq)
TODO: handle what happens if container's interrupt handler gets
interrupted (I am thinking about unique ID written by kernel into info
as RIP can't be used reliably to identify interrupts). The container
should be penalized (and eventually killed), but more importantly,
there must be a plan for handling the user space stack in sane way (and
not losing some space on it, due to RSP being bump in handler, but not
decremented).
Protocol:
1. interrupt jumps into kernel handler
2. if current container penalty is 3, kill it and schedule next
container
3. increase penalty
4. update RIP in the info
5. set next LAPIC deadline
6. jump into user handler
7. user is expected to save registers and yield
8. decrease penalty (and if negative set to 0)
9. schedule next container
Notice, the penalty indicates if there is any nesting in timer
interrupt. A well behaving program should always have penalty equal 1
when entering timer interrupt handler.
Since the user mode bootstrap container must handle timer interrupt, we
parse the ELF symbol table and search for timer_interrupt_asm assembly
function. This function is then called by kernel timer interrupt
handler.
The kernel interrupt handler is fully written in assembly to avoid any
surprises in terms of register modifications. The current
implementation needs to push and pop only 3 registers, RAX, RCX, and
RDX.
Compile Flags
-mno-red-zone (disable red zone as interrupts do not respect convention
of leaving extra stack space for leaf functions)
Tasks
* tests
+ [ ] frame allocator
+ [ ] sha256
* style
+ [X] header only
+ [ ] Intel-style assembly
* bootloader
+ [X] load rest of bootloader with BIOS (sort of 2 stage)
+ [X] switch from real mode to long mode
+ [ ] query video modes and save this information at fixed
address
+ [ ] switch to most suitable video mode
+ [X] load ELF binary containing kernel (uses IDE)
+ [ ] switch to SATA?
+ [ ] boot on real hardware (needs SATA)
* kernel
+ [X] make work on KVM (needed to disable SMAP on Haswell)
+ [?] add performance counters to evaluate impact of recursive
paging on TLB misses
o abandoned recursive paging due to clunky mapping of other
page tables
+ [X] start application processors (AP)
+ [X] create kernel process for each application processor (AP)
+ [X] add spinlock (test it by cores racing on it)
+ [X] use RDGSBASE to get pointer to core structure instead of
the current offset stuff
+ [X] check required feature set with CPUID
+ [ ] SKINIT instruction (read Security section in manual)
+ [ ] restructure so the end is at 16 MiB boundary
+ [ ] avoid global variables
+ [ ] use ILP64 mode (sizeof (int) == 8) - seems hard as there
is no compiler support
+ [X] timer (local APIC; TSC-Deadline mode)
+ [X] kernel double fault handler
+ [X] enable SCE (system call extensions) in EFER
+ [X] add system call support (syscall)
+ [ ] syscall user rip security (validate it is canonical?)
+ [X] add containers (similar to process abstraction)
+ [X] load user space bootstrap ELF embedded inside kernel
+ [X] page permissions for ELF segments
+ [X] parse ELF symbol table to find timer interrupt handler
+ [X] switch to user space (via sysretq)
+ [X] use x2APIC (through MSR) instead of old xAPIC
+ [X] wait 10 ms on AP startup
+ [ ] IPC (async to avoid deadlocks...)
+ [ ] AVX2
+ [X] one TSS
+ [ ] sleep idle cores (hlt instructions)
+ [ ] expose some data structures (and update during task
switch?)
+ [ ] Continuation style system (TODO: read papers on it)
+ [ ] init system (so servers can reserve resources)
+ [ ] struct type for physical addresses (what about linear
addresses? always a pointer?)
+ [X] add HPET support (to compute TSC rate)
+ [ ] compute TSC rate for each core independently (good idea?)
+ [ ] destroy container on double fault or other error
* memory management (only 2 MiB pages, thus only PML4, PDP, and PD is
used)
+ [X] boot page table
+ [X] kernel page table
+ [X] recursive PML4 (no longer used for kernel space)
+ [X] frame allocator
+ [X] page allocator
+ [X] heap allocator (only alloc, no free)
+ [X] duplicated page table (allows easy traversal of any page
table - even inactive one)
+ [X] page mapping interface (global marked pages for kernel to
avoid TLB invalidation; write protect bit CR0)
+ [ ] alignment checking in user space?
+ [X] Supervisor-Mode Execution Prevention (CR4.SMEP)
+ [X] non-executable pages (NXE in EFER)
+ [ ] TLB (tagged)
+ [X] shared memory (uses capabilities)
+ [ ] replace heap with slab allocator
* scheduling
+ [ ] penalty protocol for wonky containers
+ [ ] round-robin scheduler
+ [ ] container migration to other core
* syscalls
+ [X] log to UART
+ [ ] frame alloc/free
+ [ ] map/unmap
+ [X] alloc/free container
+ [X] quanta alloc, yield
+ [X] userspace register store
* debugging
+ [X] setup .gdbinit for automatic GDB startup
* security
+ [ ] separate page table for user and kernel (KAISER)
+ [X] use guard page for each core kernel stack
+ [X] proper page flags for .text, .rodata, .data
+ [X] capability-based resource management
+ [ ] separate memory pools for each struct type (attacker can
only modify same type of struct)
+ [ ] MTTR (prevent DMA to write into kernel memory)
+ [ ] kernel ASLR (both linear and physical address; PIC needed)
+ [ ] random location for user stack
+ [ ] finer granularity permissions for capabilities
+ [ ] check if AES can be used instead of sha256
+ [ ] namespaces?
+ [ ] string type with length?
+ [ ] stack protection
+ [ ] do not map kernel into user space? (recent patches in NT
and Linux do it)
* libs
+ [X] sprintf
+ [X] sha256 routine
+ [X] random numbers (via rdrand instruction)
+ [X] strcmp
* user space
+ [X] simple game of life render test
+ [X] exceptions (divide by zero, page fault, ...)
+ [X] timer interrupt
+ [ ] compositor
* ACPI
+ [X] parse MADT for detecting CPUs
+ [X] build parsed struct with LAPICs
+ [ ] parse x2APIC (Qemu unfortunately has ACPI 1.0 only)
* drivers
+ [ ] SATA (high priority)
+ [X] basic serial port
+ [X] IDE
+ [ ] keyboard (needs USB stack)
+ [ ] mouse (needs USB stack)
+ [ ] Intel GPU driver (blitting support)
* graphics
+ [ ] GUI library
+ [ ] tiling manager
* file system
+ [ ] sector-based addressing (similar to page frames)
+ [ ] memory mapped filesystem?
+ [ ] some filesystem format
* networking (not likely before Christmas)
+ [ ] ethernet driver
+ [ ] packet filter (downloadable?)
* JIT (use wasm or some other typesafe IR and thus run everything at
ring 0)
+ [X] interpreter of a subset of WASM
References
* Exokernel: An Operating System Architecture for Application-Level
Resource Management
* Singularity: Rethinking the Software Stack
* Amoeba: a distributed operating system for the 1990s
* Scheduler Activations: Effective Kernel Support for the User-Level
Management of Parallelism
* A Comparison of Scheduling Algorithms for Multiprocessors
* L4 Microkernels: The Lessons from 20 Years of Research and
Deployment
* xv6, JOS, seL4
* osdev wiki
* Operating Systems Design and Implementation
* AMD64 Volume 1, 2, 3
* ATA Interface Reference Manual
* Advanced Configuration and Power Interface (ACPI) Specification
* ELF-64 Object File Format
* VESA BIOS EXTENSION (VBE) Core Functions Standard
* [3]The little book about OS development
* [4]Reading privileged memory with a side-channel
* [5]Kaiser
* [6]Hyperkernel: Push-Button Verification of an OS Kernel
* [7]Dune: Safe User-level Access to Privileged CPU Features
* Modern Operating Systems - 3rd edition, A. Tanenbaum
References
1. https://klacansky.com/notes/data/useless-os.zip
2. http://wiki.osdev.org/ATA_PIO_Mode
3. https://littleosbook.github.io/
4. https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
5. https://klacansky.com/notes/reference/uos/kaiser.pdf
6. https://klacansky.com/notes/reference/uos/hyperkernel.pdf
7. https://klacansky.com/notes/reference/uos/dune.pdf
|