Operatingsystemlessness
I have found it difficult to keep track of all the new container-related terminology: container orchestrators, serverless, cloud native, and so forth. I think a term that hasn’t gotten much usage yet, operatingsystemless (mentioned here), is a really apt way of understanding where compute is going, and I hope it gains wider adoption. It captures the observation that the trend over the last two decades, for production systems, has been to steadily reduce explicit use of most operating system features. Ultimately, production systems may evolve toward some form of Unikernel-like format(s), although it is hard to say exactly what that would look like.
For the purposes of this blog entry I’m going to wildly oversimplify and classify compute workloads into two camps:
- Interactive applications; workloads intended to be run on a single machine, with a human logged in and, if not actively interacting with the software, at least able to check in from time to time and fix problems as they arise
- Production systems; workloads intended to be run on many machines, without many humans close by; humans can be paged in (ideally by automated systems) to fix problems, but only at significant cost
When my present company first deployed a major website in the late 1990s, Unix was the natural choice for the production systems. Although Unix was designed for interactive workloads (multi-user workstations), the engineers working on our systems were familiar with it, and the company needed to move fast. Unix was (and still is) an excellent operating system for software development. Although there is no law of physics stating that one must use the same operating system in both production and development environments, there was no other practical alternative at the time.
Our website had, roughly, the following architecture (circa 1999):
```
webserver (x8)          database (cluster)
--------------          ------------------
httpd (master)             /--floating-ip
 \_ httpd (child)|        /
 \_ httpd (child)|_______/
 \_ httpd (child)|
 \_ httpd (child)|
```
I won’t focus too much on the database side of things, just the webservers. On each of the webserver machines, Unix had a number of missing features, and even anti-features, when it came to production services, which I’ll outline in the next few sections.
Process Management
You’ll notice in the diagram above that there was nothing to make sure the master httpd process actually stayed up. In fact, we had a Perl script that would sweep across all the webservers and make sure those processes were running. This was because Unix was not really designed for this kind of lights-out automated scenario. Unix has been geared towards humans sitting at workstations. There are myriad ways that processes can get stuck (or become zombies). In production those states are difficult to handle correctly; one would prefer they not exist. In the interactive workstation case, they’re acceptable annoyances.
Process management is also primitive. For a production system one wants supervision tree semantics to ensure fail-fast, crash-only systems that are naturally self-healing (a rough sketch of approximating these semantics follows the list):
- One can be certain that specific processes are always present, yet also do not restart so quickly (under error conditions) that they swamp the CPU
- If a parent process dies, all child processes automatically die as well
- If a child process dies with an error code, the parent process must explicitly inform the operating system that it handled the error, otherwise it, too, dies
- Zombie processes are not possible
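The kernel can only approximate these semantics today. Here is a minimal sketch in Go, assuming Linux: `PR_SET_PDEATHSIG` (exposed as `Pdeathsig` in Go’s `syscall` package) covers only the direct parent-death case, and the binary path is a placeholder.

```go
// supervise.go: a minimal sketch of approximating supervision-tree
// semantics on Linux. An illustration, not a robust supervisor.
package main

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	for {
		// "/usr/sbin/httpd" is a placeholder for whatever service binary
		// is being supervised.
		cmd := exec.Command("/usr/sbin/httpd", "-DFOREGROUND")
		cmd.SysProcAttr = &syscall.SysProcAttr{
			// Deliver SIGKILL to the child if this supervisor dies,
			// approximating "parent dies, children die". Note: this
			// covers only direct children, not grandchildren.
			Pdeathsig: syscall.SIGKILL,
		}
		if err := cmd.Start(); err != nil {
			log.Printf("start failed: %v", err)
		} else if err := cmd.Wait(); err != nil {
			log.Printf("child exited with error: %v", err)
		}
		// Back off before restarting so a crash loop cannot swamp the CPU.
		time.Sleep(time.Second)
	}
}
```

Even this much requires platform-specific flags, and it still leaves grandchildren and zombie edge cases unhandled, which is the point: the primitives do not compose into real supervision trees.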
As another example of a missing feature, the Perl script mentioned above had code that would try to ensure that all the child processes of a webserver were actually killed, even if they themselves spawned children. Unix does not provide this sort of behavior, and we were never able to fix all the potential race conditions.
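The closest thing Unix does provide is the process group: start the service in its own group, then signal the whole group with a negative PID. A minimal sketch in Go, assuming Linux; the race noted in the comments is exactly the kind we could never fully close:

```go
// killtree.go: sketch of the classic process-group approach to killing
// a service and all its children. A child that calls setpgid() to leave
// the group, or that forks between checks, escapes; Unix offers no
// airtight guarantee here (cgroups come closest).
package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("/bin/sh", "-c", "sleep 60 & sleep 60")
	// Start the command in a fresh process group so every process it
	// spawns can be signaled in one shot.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(time.Second)
	// A negative PID signals the entire process group.
	syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	cmd.Wait()
}
```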
As a result, the industry has produced numerous “process managers” over the years, outside of the operating system proper, none of which is wholly satisfactory: SysV init, Upstart, systemd, runit, monit, daemontools, etc. All of these systems have their pros and cons, but I think fully robust, production-oriented process management cannot be solved outside of the kernel. Without kernel-level guarantees, there will always be edge cases where processes spawn endlessly, fail to restart, and so on.
The industry-wide trend in response has been to stop launching multiple processes on the same (logical) machine at all, isolating each workload with kernel features like chroot, namespaces, and cgroups.
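As a rough illustration of what that looks like at the syscall level, here is a sketch (Go on Linux, requires root or CAP_SYS_ADMIN) that launches a command into a fresh PID namespace, where it sees itself as PID 1:

```go
// pidns.go: sketch of launching a command in a new PID namespace,
// one of the kernel features underlying containers. Linux-only;
// requires elevated privileges.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Inside the new namespace, the shell sees itself as PID 1.
	cmd := exec.Command("/bin/sh", "-c", "echo my pid: $$")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID,
	}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Container runtimes combine this with mount, network, and user namespaces, plus cgroup resource limits.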
Terminal Handling
Unix has a complex set of terminal handling capabilities, necessary when multiple human users share the same machine. This capability is useless for a service with thousands of machines running non-interactive workloads and no terminals in sight. Although terminal handling is useful for those times when an engineer is trying to diagnose a problem, even that practice is increasingly considered an anti-pattern. Partly to accommodate terminal handling, Unix has an unnecessarily complex process model centered around interactive control of processes.
Job Control
Closely intertwined with Unix support for terminals is Unix job control functionality, which was not designed for production systems. The most obvious example is the terminal stop feature, which allows a human to suspend a running process by typing a key sequence such as Control-Z. In a headless service, processes should never “stop” in this way. The mere existence of a “stopped” process state means either one more aspect of Unix one must ignore, or one more failure mode that monitoring must ensure never happens.
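To make that cost concrete, here is a sketch of the kind of watchdog this forces on you: a Go program (assuming Linux) that uses `wait4` with `WUNTRACED` to notice a child entering the stopped state and immediately resume it:

```go
// unstop.go: sketch of a watchdog that notices a child entering the
// "stopped" state (e.g. someone sent it SIGTSTP) and resumes it; one
// more failure mode production monitoring has to cover.
package main

import (
	"log"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "3600")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	for {
		var ws syscall.WaitStatus
		// WUNTRACED reports stopped children as well as exited ones.
		if _, err := syscall.Wait4(cmd.Process.Pid, &ws, syscall.WUNTRACED, nil); err != nil {
			log.Fatal(err)
		}
		if ws.Stopped() {
			log.Printf("child stopped by %v; resuming", ws.StopSignal())
			syscall.Kill(cmd.Process.Pid, syscall.SIGCONT)
			continue
		}
		log.Printf("child exited with status %d", ws.ExitStatus())
		return
	}
}
```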
Sessions
Also closely related to terminal handling is the Unix idea of sessions. Sessions are groups of processes that can be, as a group, associated (or disassociated) from an interactive terminal. Although the name “session” sounds like it might be useful, the implementation is irrelevant for production systems.
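About the only production-relevant use is the inverse one: spawning a child into a fresh session so it has no controlling terminal at all, effectively opting out of the feature. A minimal sketch in Go (the service path is a placeholder):

```go
// nosession.go: sketch of spawning a process in its own session so it
// has no controlling terminal and cannot receive terminal-driven
// signals such as SIGHUP from a vanishing terminal.
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	// Placeholder path for illustration only.
	cmd := exec.Command("/usr/local/bin/myservice")
	// Setsid gives the child a new session with no controlling terminal.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// The child now runs detached; this parent can exit.
}
```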
Scheduling
Scheduling on a single machine becomes a lot simpler when there is only one process (or at most a very few) to schedule. At the same time, in production systems the more important kind of scheduling occurs across a cluster of machines. For example, scheduling two processes near or far from each other (for performance or blast radius purposes) requires cluster-wide knowledge of which workloads are related to each other and where they are already running. Scheduling more processes of a certain type because a queue or load balancer has become backed up requires knowing about the load balancer, knowing if and where there is free capacity, and knowing what type of process to spawn in order to fix that particular type of backlog. For reliability, this knowledge must be distributed across the cluster, not just tracked on a single machine.
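None of this maps onto a single machine’s scheduler. As a purely hypothetical sketch of the shape of the problem, in Go, where `queueDepth`, `freeCapacity`, and `spawnWorker` are all invented stand-ins for cluster-wide state and actions:

```go
// autoscale.go: hypothetical sketch of a cluster-level scheduling
// decision driven by a backed-up queue. Every function below is a
// stand-in for real cluster state; nothing here is machine-local.
package main

import (
	"fmt"
	"time"
)

// queueDepth would come from the load balancer or queue, not the kernel.
func queueDepth() int { return 120 }

// freeCapacity would come from a cluster-wide inventory, not one machine.
func freeCapacity() []string { return []string{"node-17", "node-42"} }

// spawnWorker would ask the cluster scheduler to place a new worker.
func spawnWorker(node string) { fmt.Printf("spawning worker on %s\n", node) }

func main() {
	const depthPerWorker = 50
	for {
		if backlog := queueDepth(); backlog > depthPerWorker {
			if nodes := freeCapacity(); len(nodes) > 0 {
				// Placement uses cluster-wide knowledge: which workloads
				// are related, and where capacity is actually free.
				spawnWorker(nodes[0])
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```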
Users and Groups
One of the first pain points when using Unix for production was answering the question, “what users and groups should my processes run as?” While we naturally did not want anything to run as root, picking arbitrary users and groups at all was an annoyance. User and group membership on an individual machine does not have much meaning in a production environment. At best, it helped keep two services on the same machine (which one needed to do from time to time, for cost reasons, in the pre-VM era) somewhat separate from each other. Even this, though, required a complex management system to push the right user and group configuration to every machine.
Worse, the things one would want to control (for example, which database servers were accessible to the process) also require knowledge and ownership that exist outside of the production machines. Over the network, everybody is nobody, even root users.
Overall, for production systems, machine-local users and groups just add a lot of hassle without much benefit.
Towards Operatingsystemlessness
To get a sense of how little of Unix is necessary for distributed systems, it’s worth taking a look at gVisor’s list of supported system calls. This serves as a reasonable, albeit conservative, proxy for what a modern “container” needs from an operating system. Note that under 50% of all Linux syscalls are fully supported, and some of those (`getgid`, etc.) clearly aren’t necessary in the long run.
| Count | Status | Percentage |
| --- | --- | --- |
| 165 | Supported | 47.5% |
| 89 | Partially Supported | 25.6% |
| 48 | Error With Event | 13.8% |
| 25 | Cap Error | 7.2% |
| 20 | Error | 5.7% |
| 347 | TOTAL | 100% |
Overall, the trend is to move away from using traditional workstation operating system features at all:
- Either a single process per “machine”, using multiple threads, if necessary, or else using multiple processes but in a strict supervision tree as outlined earlier
- All IPC is over the network abstraction, even if physically local (see the sketch after this list)
- No users or groups
- Scheduling decisions made based on external factors (e.g., work queue depth)
- Little to no use of signals and no terminal handling features
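As a sketch of the IPC bullet above, here are two components talking TCP over 127.0.0.1 in Go; moving one of them to another machine changes only the address, which is exactly the property one wants:

```go
// loopback.go: sketch of "all IPC over the network abstraction":
// client and server speak TCP on the loopback interface, and the same
// code works unchanged across machines by swapping the address.
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	// Port 0 asks the kernel for any free port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		fmt.Fprintln(conn, "hello from the 'server' process")
		conn.Close()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	line, _ := bufio.NewReader(conn).ReadString('\n')
	fmt.Print(line)
}
```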