The Three Generations of Microkernels

From Mach to seL4

Writing down my understanding of the history of microkernel development and research.


Let's start with a brief definition of a microkernel. A microkernel is an OS kernel that implements significantly less functionality than conventional ("monolithic") kernels such as Linux or the BSD kernels — in particular, no file systems and no network protocol stacks. Instead, a microkernel provides the functionality (primarily scheduling, virtual memory and an IPC mechanism) required to implement these missing parts, and more, in user space.

Among oft-cited advantages of microkernel-based systems — where a lot of the code is run as userspace programs rather than being in the kernel — are increased security due to more isolation and better resilience to crashes and other kinds of errors. Personally, I'm more intrigued by their flexibility and extensibility.

There are a number of problems with microkernels too. I'll describe some of them in more detail later; the most often mentioned ones are poor IPC performance and concerns around scalability.

These problems must be the reason that, despite microkernels being widely deployed and used by many people without them even knowing it, no popular user-facing operating system is built on top of a true microkernel (Darwin and NT do not count). This could change with Fuchsia, a new operating system from Google based on its own microkernel, Zircon; it is expected that Fuchsia will replace Android as Google's primary operating system for consumer devices.

The first generation of microkernels

The idea of a microkernel is nothing new. As Wikipedia helpfully tells me, the first attempt at building a microkernel, RC 4000, was made back in 1969, about the same time Unix was being created.

In 1975-1976 the Aleph kernel (part of a system called RIG) was created at the University of Rochester. In 1979-1985, Carnegie Mellon University developed Accent based on the ideas from Aleph. Starting in 1985, they worked on Mach, which began as an attempt to make it possible to run/reimplement BSD Unix on top of Accent's design concepts, and ended up being one of the two most famous microkernels in existence, used by more than a billion users today (as part of the iOS kernel).

Capability-based security

I may write a longer post on capability-based security later; but here are the basics.

In a capability-based system, in order to access a resource (such as a file, a process or an on-screen window), a program (process, task) has to be explicitly granted the right to access this resource by some other program that already holds that right (for example, the creator of the resource), as opposed to merely having ambient authority over this kind of resource (such as being root on Unix). A capability, then, is this unforgeable access token: it both names/designates/references the resource and grants its holder the right to access/manipulate the resource. Programs grant each other access to resources by exchanging capabilities, and there's no way to name a resource without holding a capability to it (so, no global PIDs and window IDs). To access/manipulate any resource, a program names, using a capability, the resource it wants to act on; the program that implements the resource (be it the kernel or a userspace task) then doesn't even have to ask itself whether the calling program should be allowed to perform the operation: it has the capability, so, yes.

Capability-based security is much more flexible and dynamic than authority-based security; moreover, it solves some issues that the latter has: notably, it's not prone to race conditions or the confused deputy problem. It's also a "mechanism, not policy", and in fact it's simpler and easier to implement on the kernel side than even UID checks (essentially, the kernel leaves the question of access control up to userspace), which makes it a natural fit for microkernels.
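
To make the contrast concrete, here's a hypothetical sketch of the two models side by side; cap_t and all of these functions are illustrative, not any real kernel's API:

```c
/* Hypothetical sketch; cap_t and these functions are illustrative,
 * not any real kernel's API. A capability behaves much like a file
 * descriptor: an index into a kernel-held table of access rights. */
typedef unsigned long cap_t;

/* Ambient authority: the implementor must decide, on every call,
 * whether the caller is allowed to touch the file named by `path`. */
int write_by_name(const char *path, const void *buf, unsigned long len);

/* Capability-based: possession of `file` already answers that
 * question; the implementor just performs the operation. */
int cap_write(cap_t file, const void *buf, unsigned long len);

/* Delegation is simply handing the token to another program. */
int cap_transfer(cap_t dest_task, cap_t granted);
```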

In Mach, capabilities are tightly integrated with IPC. Mach ports are essentially kernel-maintained message queues; userspace accesses them via port rights, which are capabilities — a lot like Unix programs access open files via file descriptors. While each port is just a message queue, it typically represents some resource maintained by whoever holds the receive right for the port (aka a server); to manipulate the resource, other programs (clients) send specially crafted messages to the port, which the server interprets as commands to perform actions on the resource. There are tools — primarily MIG, the Mach Interface Generator — to implement this model using convenient remote procedure calls (RPC), which make it look like the client directly invokes functions in the server.
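
Here's a minimal sketch of the raw port API (these are real Mach calls from <mach/mach.h>; error handling is trimmed, and in practice MIG generates this kind of boilerplate for you):

```c
#include <mach/mach.h>
#include <stdio.h>

typedef struct {
    mach_msg_header_t header;
    int               payload;      /* small inline message body */
} simple_msg_t;

typedef struct {
    mach_msg_header_t  header;
    int                payload;
    mach_msg_trailer_t trailer;     /* appended by the kernel on receive */
} simple_msg_rcv_t;

int main(void) {
    mach_port_t port;

    /* Allocate a port; the calling task gets the receive right. */
    if (mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE,
                           &port) != KERN_SUCCESS)
        return 1;

    /* Also give ourselves a send right to the same port. */
    if (mach_port_insert_right(mach_task_self(), port, port,
                               MACH_MSG_TYPE_MAKE_SEND) != KERN_SUCCESS)
        return 1;

    simple_msg_t msg = {
        .header = {
            .msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0),
            .msgh_size        = sizeof(simple_msg_t),
            .msgh_remote_port = port,            /* destination */
            .msgh_local_port  = MACH_PORT_NULL,  /* no reply port */
            .msgh_id          = 42,              /* "command" for the server */
        },
        .payload = 1234,
    };

    /* Send: the kernel copies the message into the port's queue. */
    if (mach_msg(&msg.header, MACH_SEND_MSG, sizeof(simple_msg_t), 0,
                 MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE,
                 MACH_PORT_NULL) != KERN_SUCCESS)
        return 1;

    /* Receive it back from the queue. */
    simple_msg_rcv_t rcv;
    if (mach_msg(&rcv.header, MACH_RCV_MSG, 0, sizeof(rcv), port,
                 MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL) != KERN_SUCCESS)
        return 1;

    printf("message id %d, payload %d\n", rcv.header.msgh_id, rcv.payload);
    return 0;
}
```

Note how much work the kernel does on every exchange: parsing the header, copying the message into a kernel buffer and out again, and scheduling the receiver. That cost is exactly what the next section is about.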

Issues

The giant, glaring problem with these first-generation microkernels is their low IPC performance. That is not to say that traditional monolithic kernels have much faster IPC (Linux does, but that's thanks to the massive effort that went into optimizing it); rather, slow IPC is not much of a problem for monolithic kernels because there it's used only occasionally, when there is, indeed, a need for inter-process communication (for example, for sending logs to a syslog daemon). In contrast, microkernel-based systems use IPC pervasively to implement all kinds of operations, so IPC performance starts to matter a lot.

Furthermore, on a monolithic kernel IPC would typically be thought of in terms of (asynchronous) message passing, while on a microkernel it's thought of in terms of (synchronous) remote procedure calls. Even if RPC is internally implemented on top of message passing, RPC brings with it an expectation that the calls will be reasonably fast. Maybe not as fast as direct function calls, but you still wouldn't want each of these calls to involve the kernel parsing, copying and buffering the message, then waiting for the target process to get scheduled, then copying the message into the target process's memory, then the other process parsing it, and then doing it all over again to send the response.

In short, microkernels need IPC to be fast because they rely on it way more than monolithic kernels do.

L4

Central to the second generation of microkernels is the work of Jochen Liedtke. In his 1993 paper titled "Improving IPC by Kernel Design" he describes an experimental microkernel named L3 which was able to pass messages up to 22 times faster than Mach. That's right, not 20% faster, not 2 times faster, but jaw-dropping, mind-boggling twenty-two times faster. Can you even begin to imagine that?

The way to bring about such an exciting performance improvement, it turns out, was to make the kernel even more micro-, to make it handle even less. L3 is designed top to bottom for high IPC performance, not for features. It does away with capabilities and capability passing; this way the kernel does not have to parse the messages. Instead of sending a message to a port, you explicitly specify which thread you're sending the message to. This additionally removes one level of indirection and frees the kernel from maintaining bookkeeping information about ports and port rights. Another notable and familiar feature of Mach ports that L3 does not implement is message buffering in the kernel: instead of copying the message to a kernel buffer and subsequently to the destination buffer, as is done in Mach, L3 directly switches control to the receiving thread without going through re-scheduling and buffering.

L3 also employs a number of tricks to make IPC and context switching faster. One worth mentioning: for short messages, L3 passes the message body to the other thread entirely in CPU registers, further avoiding copying and RAM access; it was found that between 50% and 80% of messages contain 8 bytes or less, so this optimization is worthwhile. It also simplifies the work the userspace has to do, because there's no need to serialize and deserialize short messages into and from buffers. Instead, the RPC routine can be inlined and the values placed into the registers automatically by the compiler; in the same way, in the server program, the register values can be used by the compiler directly to perform computation on them.
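
As code, the resulting RPC stubs might look something like this; a hypothetical sketch, with l4_ipc_call and the fixed two-word payload as illustrative stand-ins rather than the API of any real L4 kernel:

```c
/* Hypothetical L4-flavored IPC: the payload travels in CPU registers,
 * and the kernel switches directly to the receiving thread without
 * buffering the message. All names here are illustrative. */
typedef unsigned long l4_word_t;

/* Send two words to `dest_tid` and block for a two-word reply. */
extern long l4_ipc_call(l4_word_t dest_tid,
                        l4_word_t in0, l4_word_t in1,
                        l4_word_t *out0, l4_word_t *out1);

/* An RPC stub the compiler can inline: no serialization buffers at
 * all, the arguments go straight from variables into registers. */
static inline long add_remote(l4_word_t server_tid,
                              long a, long b, long *sum)
{
    l4_word_t r0, r1;
    long err = l4_ipc_call(server_tid, (l4_word_t)a, (l4_word_t)b,
                           &r0, &r1);
    *sum = (long)r0;
    (void)r1;
    return err;
}
```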

As a result of all those simplifications, L3 is so much smaller than the first-generation microkernels that it becomes feasible to count, and then try and minimize, the number of processor instructions executed on the fast path of message sending.

Kernels this small are also called nanokernels, to emphasize their small size even in comparison to "traditional" microkernels. Nanokernels generally follow the principle of minimality, as formulated by Liedtke: "A concept is tolerated inside the microkernel only if moving it outside the kernel, i.e., permitting competing implementations, would prevent the implementation of the system's required functionality." Essentially, a nanokernel contains the code necessary to manage virtual memory and perform context switches and not much else; this is why nanokernels are sometimes described as being "CPU drivers".

Of course, in the absence of capabilities and other niceties provided by the kernel, it becomes much more difficult and troublesome to build a complete operating system on top of a nanokernel. A traditional monolithic operating system may consist of the kernel and "some utilities" like a shell and a text editor on top of it (though modern operating systems have a much, much larger userspace with many daemons/services, graphical desktop environments and so on). In an operating system based on a nanokernel, basically all of the operating system resides in userspace, and the kernel itself is more of a utility. A common approach to get a microkernel-based system to do something useful is to run a version of Unix or Linux virtualized (or paravirtualized, i.e. aware it's not running directly on the hardware); this lets one achieve some working results fast, but it is "cheating", because the resulting system is still monolithic even though the guest kernel does not run in the hardware kernel mode, which means that all of the supposed advantages — security, crash resilience, flexibility and extensibility — are lost.

A reworked version of the L3 kernel described in "Improving IPC by Kernel Design" was named L4, and it spawned a whole family of L4 microkernels that differ in implementation and some aspects of design, but share the same basic idea and principles — most importantly, the principle of minimality. Microkernels of the L4 family (OKL4 in particular) are widely deployed, reaching billions of users inside embedded devices such as the secure coprocessors of modern smartphones.

So this is what the second generation of microkernels is about. An order of magnitude better performance and a complete lack of features. The problems with the second-generation microkernels fall into two categories: access control and resource accounting.

Issues

Without capabilities, the system is in need of another way to manage access control. One approach is implementing capabilities in userspace, either in a central server, or by having each task track the capabilities it has given out to other tasks. The former approach has a large performance overhead and somewhat nullifies the IPC performance improvements made possible by removing capabilities from the kernel space. The latter approach works, but has other complications around passing capabilities between tasks and accepting capabilities from unknown tasks.

There are also capability-less approaches, and L4 implements one of them, known as clans and chiefs, in the kernel. Basically, like Unix and unlike Mach, L4 tracks parent-child relationships between tasks. A group of tasks created by a given task forms a clan, and the creating task is called its chief. Within a clan, messages can be sent freely; but messages to other clans have to go through the chiefs, which can filter, manipulate or redirect them. This model is quite simple and does not impose a noticeable performance overhead on L4 IPC, which is why L4 adopted it. Still, it's not very beautiful, flexible or useful. In particular, either each message has to go through multiple redirections, or all the threads on the system need to know each other's thread IDs, which is bad for encapsulation and stability; and even then messages have to go through the chiefs.
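
Here's a simplified, hypothetical sketch of the redirection rule (real L4 walked the whole clan hierarchy; the names and data layout are illustrative):

```c
/* Simplified clans-and-chiefs delivery. In this sketch a task's
 * `chief` is the creator of its clan; tasks with the same chief are
 * clan-mates. Illustrative, not real L4 code. */
typedef unsigned long tid_t;

struct task {
    tid_t        tid;
    struct task *chief;   /* chief of the clan this task belongs to */
};

/* Where does a message from `src` to `dst` actually get delivered? */
static tid_t deliver_to(const struct task *src, const struct task *dst)
{
    if (src->chief == dst->chief)
        return dst->tid;      /* same clan: delivered directly */
    return src->chief->tid;   /* crossing the clan border: handed to
                                 the chief, which may filter, rewrite
                                 or forward it */
}
```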

There is more to access control than IPC messaging, though. For example, there needs to be a way to let tasks share memory, and yet there should be some access control over when a task is allowed to access the memory of another task. L4 solves this using an interesting mechanism called recursive memory mapping, which is simple — basically, a thread can grant another thread access to a part of its own memory, and these grants form an inheritance tree — but has many issues of its own.
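
As an interface, this might look roughly as follows; a hypothetical sketch with illustrative names, not the API of any real L4 kernel (in real L4, mappings are transferred inside IPC messages as "flexpages"):

```c
/* Hypothetical L4-flavored mapping primitives; names illustrative. */
typedef unsigned long tid_t;

typedef struct {
    unsigned long base;   /* page-aligned start in the sender's space */
    unsigned long size;   /* power-of-two size in bytes */
} fpage_t;

/* Share pages with `dest`: the sender keeps its own mapping, and the
 * new mapping is recorded as a child of it in the mapping tree. */
extern long l4_map(tid_t dest, fpage_t page);

/* Transfer pages to `dest`: the sender loses its mapping. */
extern long l4_grant(tid_t dest, fpage_t page);

/* Revoke a mapping and, recursively, every mapping derived from it;
 * this is what the inheritance tree is for. */
extern long l4_unmap(fpage_t page);
```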

The other large problem is resource accounting. It's not actually specific to second-generation microkernels (first-generation microkernels and monolithic kernels suffer from it as well), but, as is the case with IPC performance, resource accounting does not pose as much of a problem for monolithic kernels.

The system has only a limited amount of resources available. This primarily means the physical RAM and the CPU time, but it also includes derived resources, such as threads, whose data needs to be allocated in that RAM and which need to be scheduled to take up a part of that CPU time. In a traditional monolithic system, the kernel can just allocate as much memory as it needs to service userspace requests; to prevent a single (possibly malicious) process from consuming all of the system's resources, there usually are various limits and quotas in place.

In the same way, in a microkernel-based system, the kernel will try to allocate as much memory as it needs, and so will the userspace servers. Often, these resources will be needed not for the server itself, but for it to be able to service a request made by another task. The problem is, it is much harder to track these resource requirements in such a distributed system, where many servers all perform some kind of service for each other, so it makes little sense to put quotas on the resource usage of each particular server.

Third-generation microkernels

The third-generation microkernels are characterized by:

  • The return of capabilities as a way to solve both access control and resource accounting.
  • Focus on making it possible to formally verify the correctness of the implementation.
  • Achieving the above without compromising on IPC performance.
  • Being able to act as a hypervisor.

It turned out that capabilities bring more solutions than problems, after all. It is possible to have extremely fast IPC even with the level of indirection that endpoints introduce (endpoints are message queues, like Mach ports, except they don't actually queue messages). In return, capabilities free the kernel and the userspace from trying to build an access control system on top of an inflexible mechanism such as clans and chiefs; and, it turns out, they can also be used to solve resource accounting in the same elegant way they solve access control.
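
For a taste of what capability-based IPC looks like on seL4, here's a small sketch using real libsel4 calls; the endpoint capability ep is assumed to already sit in the caller's capability space:

```c
#include <sel4/sel4.h>

/* Send one word to a server through an endpoint capability and block
 * for the reply; seL4_Call does both in a single kernel entry. */
seL4_Word echo_via_endpoint(seL4_CPtr ep, seL4_Word value)
{
    /* label 0, no capabilities transferred, 1 message register used */
    seL4_MessageInfo_t info = seL4_MessageInfo_new(0, 0, 0, 1);
    seL4_SetMR(0, value);   /* short payloads travel in registers */

    /* The only access-control question the kernel asks is whether
     * `ep` carries a send right; holding the capability is the answer. */
    info = seL4_Call(ep, info);
    (void)info;

    return seL4_GetMR(0);   /* the server's reply */
}
```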

The idea is simple: the kernel does not allocate dynamic memory at all. Whenever it needs some memory to service a userspace request — such as to create a new thread — the userspace has to allocate the memory itself and pass a kind of pointer to this memory to the kernel, via capabilities. The way this works is that, on startup, the kernel enumerates the available physical memory, then creates a bunch of capabilities that represent this memory and gives them to the initial thread. The userspace is then free to pass these capabilities between threads as usual and implement any allocation strategies it wants. Creating a kernel object, such as a new thread, requires passing in a free-memory ("untyped", in seL4 parlance) capability of sufficient size, and may in fact be implemented as a method call on the free-memory capability.
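
On seL4 this is the seL4_Untyped_Retype invocation. Here's a sketch; the untyped capability and the empty destination slot are assumed to have been set up beforehand (e.g. by the root task):

```c
#include <sel4/sel4.h>

/* Create a thread control block (TCB) out of untyped memory. The
 * kernel allocates nothing itself: the new object is carved out of
 * the memory that `untyped` denotes. */
seL4_Error make_tcb(seL4_Untyped untyped, seL4_Word dest_slot)
{
    return seL4_Untyped_Retype(
        untyped,                  /* cap to free ("untyped") memory */
        seL4_TCBObject,           /* kind of object to create */
        seL4_TCBBits,             /* object size (fixed for TCBs) */
        seL4_CapInitThreadCNode,  /* CNode to receive the new cap */
        0, 0,                     /* index/depth: address the root CNode */
        dest_slot,                /* empty slot for the new TCB cap */
        1);                       /* number of objects to create */
}
```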

This mechanism, on the one hand, helps reduce the kernel even further by freeing it from dynamic memory management concerns; indeed, in accordance with the minimality principle, it makes it possible for the userspace to use several competing strategies for memory allocation, i.e. several allocator implementations. On the other hand, it provides a generic and flexible framework for resource accounting. The resource pools, in the form of capability sets, can be passed between tasks/threads; it is expected that each application-level task would be provided with such a pool upon startup (or on request), and that each server would require the caller to provide, along with each request, a resource pool to service that request. The server would use the pool to allocate the resources it needs for handling the request, and would transitively pass parts of it on to other servers when making requests to them. This way, the total amount of resources allocated towards fulfilling one top-level request is limited by the resource pool explicitly passed when making that request, and the total resource pool given to a task serves as a quota for how much of the system's resources the task can use.

The most widely known, and the most successful, third-generation microkernel is seL4 from the L4 family, which was developed in the late 2000s and publicly released as free software in 2014. IPC on seL4 is among the fastest, if not the fastest, of all microkernels, including L3 and the original L4. In addition to that, seL4 is the first and still the only microkernel (or kernel of any kind) whose correctness has been successfully verified using formal methods: the implementation is mathematically proven to match its specification, which rules out entire classes of bugs and vulnerabilities. That is a rare quality for software of any kind, and it enables seL4 to be widely used in high-assurance systems such as military helicopters.

The only problem preventing further seL4 adoption seems to be that there doesn't exist a general-purpose, user-friendly operating system built on seL4. There are some prototypes of such a system though. There is RefOS, which is more of a demo of what's possible than an actual OS, and there exists a port of the Genode operating system framework to run on top of seL4.

A conclusion, sort of

The microkernels of the third generation, and seL4 in particular, solve the problems and the shortcomings of the previous generations, and provide a solid foundation for building secure, high-assurance systems on top of them.

I haven't mentioned all the microkernels; I'm not even familiar with all of them. Here are some names for you to do your own searching:

  • QNX
  • Chorus
  • KeyKOS
  • EROS
  • Coyotos
  • Viengoos