Service Mesh: Back to the Future

I recently started learning about service mesh and am excited by the concepts, as well as all the features existing service mesh products already provide. I’ve been trying to understand the area in more detail. This blog entry covers what I’ve learned so far. It’s been helpful for me to think about how service mesh relates to past experiences. I wasn’t always working on “services” so the first seven or so years of my career don’t apply, but starting in 1999, when I joined Amazon, I began working on “services”, although they weren’t called that at the time, and have been doing so ever since.

With hindsight, there were both good and bad things about service development circa 1999. I’m starting to think that service mesh is a way of “having one’s cake and eating it too”. It has the potential to enable the simpler development model of 1999, but without the bad parts that we either suffered with or else were unaware of.

1999: The Good Parts

Developing “service” software was easier 20+ years ago, in some ways. At my company there were at least an order of magnitude fewer software developers and fewer teams to coordinate amongst. The front-end that implemented our website, which I’ll focus on in this section, was a 2-tier architecture composed of a single outer tier and a single inner tier (database). New front-end developers could understand its architecture in a 30-minute whiteboard session; it was as simple as:

1. Front-end application (running on about 8 hosts)
2. Databases

Moreover, there was a laundry list — a long one — of things most developers never worried about.

For one, developers didn’t have to worry about “discovering” services. The only “service”, from most developers’ perspective, was a database, and we’d only recently split up the database tier into multiple logical databases. Before that, there had been only one, so you could write code like:

// Note: this is not real code, just a conceptually equivalent
//       example.
DbConn databaseConn = DatabaseDriver.connect(databaseIp);


Even after the split into multiple databases, discovering one was as straightforward as using DNS, which had the advantage of being a standardized protocol already built into the operating system. No special container orchestrator, dependency injection framework, other software libraries, or anything of that nature was needed. For example, when catalog data got split out into a separate database, code needing catalog information needed only minor modification to look up a different database name:

// Note: this is not real code, just a conceptually equivalent
//       example.
DbConn catalogDatabaseConn = DatabaseDriver.connect(catalogDatabaseIp);


Deployments were much easier; developers never did them. Feature teams did not manage their own service environments, Kubernetes clusters, or anything like that. Those technologies didn’t even exist. A single centralized team did all front-end deployments. Once a feature team (say a team working on personalization features) got their code checked in, the rest was taken care of by the centralized team.

Developers also didn’t worry about SSL certificates (TLS wasn’t widely used yet). Nobody worried about certificates expiring, or pushing out renewed ones safely, etc. The very idea that service-to-service communication inside of a data center needed to be encrypted would have been a novel concept at the time. Computers were slower, too, so there was reluctance to spend CPU cycles on encrypting internal traffic. Of course, our public websites had SSL certificates, but these were all managed by a single team (in fact I once had to renew the SSL certificates by hand, an experience that has scarred me to this day).

Teams also didn’t worry much about capacity management or capacity planning. A centralized infrastructure team handled most of it. They didn’t worry about patching; it wasn’t as big of a concern in the industry at the time, and a central team handled it in any case. Teams didn’t worry about availability zones and blast radius (we didn’t have multiple data centers yet). They didn’t worry about cross-service call tracing (X-Ray, Dapper, Splunk, etc) because there weren’t enough different services to have to trace through. Load balancing was a lot easier. There was only one thing to load-balance, the website, and a networking team managed all the load balancers.

There are other examples as well. Overall:

• Many aspects of developing services were easier because centralized teams of people took care of entire, specific problem areas
• In other cases, the industry as a whole wasn’t as mature or as ambitious as today (e.g., services, patching and mutual auth)

If everything was so great back then, why did anything change? In fact, there were a number of significant pain points that led companies like mine on the drive towards services.

For one thing, there was just one front-end application, and a 3-digit number of developers were trying to work on it simultaneously. Feature velocity was supposed to be high: we were rapidly expanding and new features were supposed to be released multiple times per week. Unfortunately, there was no great way for hundreds of developers across many, sometimes unrelated, teams all to commit code changes to the same code base and still release every week.

So we were starting to slow down; sometimes the front-end application wouldn’t be released for weeks on end as conflicts between the latest set of code changes were ironed out. The problem was exacerbated by our build system of the day (a monorepo with many circular dependencies), but the larger problem was that we needed teams to be as independent from each other as possible. Independence was, and still is, the primary motivation behind our move to services.

Another motivation was to improve scalability and availability. The database layer was most often the scaling bottleneck at any given time. SQL statements generated by many, diverse teams’ code would hit the same database server. Even if it scaled well for some queries, there was no way to scale well for all of them. The blast radius was high, as well. With relatively few databases, if one went down, so did a bunch of important functionality across the company. The outer tier also had a wide blast radius. All teams’ code was running on the same machines, in the same processes, so a performance or correctness bug from one team could affect all machines and all functionality.

Retries, timeouts and circuit-breaking were pain points and another leading cause of outages. The Greatest Hit of the era revolved around database brownouts (dealing with outright failure was easier). Each front-end process handled one request at a time and also connected directly to every database. If any one database slowed down (without failing completely), an entire process would get stuck waiting for it. Eventually most front-end processes across all webservers would be stuck in this way, leading to a large-scale outage.

Determining cause and effect during outages could be difficult, leading to longer MTTR. All teams’ code and data were co-located in a single process. It was hard to tell whether an outage was caused by a bug in someone’s code, a schema change, hardware failure, or simply due to unanticipated customer-driven scaling.

Testing and debugging were challenging. There was no easy and reliable way to replicate the production stack on local developer machines with sufficient fidelity. It was difficult even to set up new dedicated test environments, and the existing ones kept breaking. Our test environments were maintained by a combination of manual efforts (by the centralized team that did deployments) and brittle scripts (also maintained by the same centralized team). Teams that needed to do acceptance tests on new features of the website needed to get the centralized team to set up new environments for them on shared test hosts. These were often running out of CPU and disk.

There was also a policy at the time of using only one programming language in production: C (technically, we were using a C++ compiler, but only for its name-mangling that gave us “type-safe linkage”). Perl was allowed, but only for non-customer-facing uses (systems administration, monitoring, etc). Similarly, we only had one operating system (Digital’s Tru64 Unix).

Getting new hardware, once teams did need more, was hard. As nice as it was to have a dedicated team of people worrying about capacity management, dealing with those same humans to procure more hardware (either for your new feature or to scale out an existing one) was time-consuming and error-prone.

Monitoring and logging were likewise ad hoc. There was crude metrics gathering and log aggregation. Alarming was ad hoc and teams often resorted to having the code email them when something appeared to be going wrong. There was no equivalent of CloudWatch, Splunk, DataDog or other tools.

There are many other examples like the above, but hopefully by now you can see why we drove to microservices:

• Teams were stepping on each other’s toes and we were slowing down as a business
• For similar reasons, we had blast radius, availability and scalability problems
• DevOps was more primitive (e.g., monitoring, alarming, change management, etc)

One thing you may have noticed in the above explanations is the idea of centralized teams providing services to the rest of the company. In order to scale we needed to automate many of their tasks so that as we added hundreds (and eventually thousands) of services we didn’t need to scale those central teams correspondingly. This led to a strategy of self-service federated tools.

Self-Service Federated Tools

If I had to pick one concept we’ve relied on the most to address 1999-era problems, I’d say it is that of self-service federated tools. For most of the problems that we either didn’t solve at all back then, or which were solved by having a centralized team (deployments, capacity management), we have created self-service tools allowing feature teams to take ownership of the problem space and stop depending on other teams. There are a number of areas where companies working on distributed systems have created such tools:

• Builds, Deployments, Integration Testing
• Performance Testing
• SSL/TLS/X.509 Certificate Management
• Access Control Management
• Logs, Monitoring, Alarming, Aggregation
• Capacity Management
• Patching
• Etc.

The appeal of self-service tools is that they allow teams to solve problems without depending on another (usually centralized) team. The downside is that they tend not to make problems go away entirely.

Self-service isn’t the only way to solve these problems; we have also created new abstractions that eliminate entire classes of problems (DHH calls this conceptual compression). Some AWS examples are:

Problem Space                        | Example New Abstraction
Running Out of Disk                  | S3
Storing Enormous Objects Somewhere   | S3
Data Loss                            | S3
Complex Relational Database Failures | DynamoDB
Running Out of Resources             | Lambda, Fargate, AutoScaling
Quantum-safe Network Encryption      | Lever Link Encryption

My observation is that, industry-wide so far, we have more self-service tools than we have abstractions, and it’s for a good reason: creating useful abstractions is more difficult and takes longer than creating good self-service tools. Note that I don’t mean self-service tools are easy to create, just that they’re easier to create than abstractions. An abstraction is often a new thing that nobody has seen before. It’s harder to think of those things. A bad abstraction is often worse than no abstraction at all (think of the backlash against programming “frameworks” and “platforms” that prove too restrictive over time).

Self-service tools mirror processes that were already hammered out in practice. Abstractions are riskier — if you get them wrong they provide little value, or even make a problem worse. For example, let’s say someone launched an object storage service (similar to S3) that provided the abstraction of infinite storage, but then just couldn’t figure out how to scale it out or eliminate performance bottlenecks. That would be a significant problem for people who had already adopted it.

When a team I was on developed a major internal deployment system around 2002, by contrast, we built a self-service tool that closely modeled a deployment process (performed by the centralized team) that we had been doing for several years. There was less risk that we might be heading down a dead-end path.

The downside of self-service tools, such as the aforementioned deployment system, is that they inherently impose an overhead. It is pretty much baked into the name. Self-service implies that whoever the self is (developers on a service/feature team, in most cases), has to do some continual amount of work. In any one case, this isn’t a big deal. A 7-person team can devote 0.3 people per year to doing deployments without losing too much momentum. But can they devote 0.2 additional developers per year to dealing with certificates using a certificate manager? You can see where this is going: as we add more and more self-service tools, entire teams can become consumed with operating them.

Furthermore, self-service tools usually expand the number of concepts developers need to understand. The deployment system I worked on, on balance, required a typical team to understand many new concepts: environments, deployments, hosts, host classes, packages, package versions, and more.

Thus, while the automation was useful and broke the dependency on a centralized team, it came with these two costs.

One solution to self-service overhead is to create meta self-service tools which manage the other self-service tools. This reduces, but does not eliminate, the overhead. Those tools can themselves be complicated, and since they are a level of indirection around other less-meta self-service tools (like monitoring, load balancing, etc), teams still need to understand all of the concepts and failure modes of those self-service tools. Meta self-service tools “buy time” by reducing the difficulty of setting up those other tools, but given that we in the software industry are constantly adding to the list of concepts developers need to worry about in the first place, that’s all they do. Only abstractions have the potential to make a problem go away entirely.

So, bringing everything full circle, ideally what would make the world an even better place for developers?

1. We want to get back to the simple world of 1999-era service development
2. Without the bad parts of 1999-era service development
3. Without most of the self-service overheads we’ve added over the years

To me, this is the problem that service mesh is ultimately addressing.

The Promise of Service Mesh

This section covers how, specifically, service mesh helps achieve the above goals, breaking down by feature area.

Service Discovery

As a thought experiment, let’s go back to the 1999-era style of service discovery. Imagine our service (OurService) is talking to another service (TheirService), maintained by a different team:

IpAddr otherServiceIp = DNS.lookup("their-service.internal.example.com");


This was easy to write. It uses a well-documented Internet standard (DNS) that was invented in 1983. It works in any programming language (with only minor syntactic differences), on any version of any operating system, without dependencies on any third-party libraries. It should be fast and reliable, too, unless you’ve messed up how you do DNS. So what’s actually wrong with it?

1. I can’t test OurService on my laptop (because their-service.internal.example.com would either not resolve at all, or else resolve to the production copy of the service)
2. My team can’t have alpha, beta, gamma versions of OurService for integration testing (because they would all talk to the same production version of TheirService)
3. This code doesn’t work if OurService is supposed to be deployed into “shards” of some kind (regions, cells, blast radius zones, etc). All the shards would talk to the same copy of TheirService.

A common solution to this problem is to add a layer of indirection in code. A team might use a library that provides service discovery as a feature:

// Assume I've added a dependency on the Something v3.5 framework
// to whatever build system I'm using
import org.something.discovery;

String domainName = discovery.Discover("TheirService");


This works but has downsides:

1. Unlike DNS, the Something framework has not been around since 1983 and does not have ~100 books and 1,000s of blog entries and tutorials written about it
2. It’s language-specific and also doesn’t necessarily work on any operating system
3. I have to keep upgrading the version of Something that I’m using, to keep up with security patches, bug fixes, and benefit from new features; this is a continual tax and sometimes breaks OurService (sometimes in subtle ways, too)
4. Frameworks like this are often opinionated about many things (config, concurrency, asynchrony, annotations, etc) and don’t always interoperate well with other frameworks or libraries, constraining how I can write my code
5. OurService uses many libraries, not just the Something v3.5 library. Some of them depend on different versions of Something. Now I’m in dependency Hell
6. Because my team is using Something directly in my code, it prevents other teams or services from automatically taking care of service discovery for me

Sounds pretty bad, right? So let’s back up a bit: what if the DNS lookup of their-service.internal.example.com magically did the right thing, depending on context? On my laptop, their-service.internal.example.com would resolve to a fake version of TheirService (which also might be running on my laptop). If running in OurService’s test environment, it would resolve to the IP address of a test version of TheirService. The key point is the easy, existing abstraction (DNS) is still being used. This works because:

• DNS isn’t programming-language- or OS-specific
• DNS is a well-defined protocol
• DNS can be transparently intercepted by something that understands the present context, such as a service mesh
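
To make the idea of “the same name, resolved differently depending on context” concrete, here is a minimal sketch. Everything in it is hypothetical: the ContextAwareResolver class, the Context enum and the addresses are invented for illustration. A real mesh would do this outside the application, by intercepting DNS or connections, so that the application code stays exactly as it was in 1999.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class name, contexts and addresses are assumptions
// for illustration, not the API of any real service mesh.
class ContextAwareResolver {
    enum Context { LAPTOP, TEST, PRODUCTION }

    // One lookup table per context; the same name can map to different addresses.
    private final Map<Context, Map<String, String>> tables = new EnumMap<>(Context.class);

    // Register the address a name should resolve to in one context.
    void register(Context context, String name, String address) {
        tables.computeIfAbsent(context, c -> new HashMap<>()).put(name, address);
    }

    // Resolve a name for the caller's context; returns null if the name is
    // unknown there (the "would not resolve at all" case).
    String resolve(Context context, String name) {
        Map<String, String> table = tables.get(context);
        return table == null ? null : table.get(name);
    }
}
```

The caller still asks for their-service.internal.example.com; only the table consulted on its behalf changes.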

Mutual Auth

Back in 1999, developers didn’t have to manage TLS (SSL) certificates; they were managed by a central team. It wasn’t a common practice yet to use certificates for internal communication, and there weren’t many services anyway. If services had been more prevalent, developers would have written something like the following (in pseudo-code), where network communication is unauthenticated and unencrypted:

Socket listener = Socket.listen(port);
while (true) {
    Socket connection = listener.accept();
    handleConnection(connection);
}


Now let’s pretend that there were microservices back then, and that a developer had had some massive insight about the future and decided to do mutual TLS (mTLS), or something like it. The change they’d need to make to the above code to handle TLS certificates looks pretty similar:

// Don't be fooled by how easy this looks
TLSCertificate cert = Somehow.getCertificateFor(siteName);
Socket listener = Socket.listen(port);
while (true) {
    Socket connection = listener.accept();
    handleConnection(cert, connection);
}


They just had to add one line of code; easy, right? No! While it doesn’t add a lot of code, it adds a disproportionate amount of devops load:

1. The server side TLS private key is secret material - how does it safely get onto the server without being tampered with or snooped on?
2. The certificate will expire; how do I ensure that I always replace it in time?
3. Deploying a new certificate is inherently risky, a change that could have a ripple effect and cause an outage (especially if, somehow, the wrong certificate is being deployed).
4. I cannot test on my laptop. I can’t get production certificates on my laptop (for obvious reasons). Even if I could, they wouldn’t match my laptop’s hostname. Either way, every time I test, I’ll get scary security warnings from my browser that I have to ignore. Ignoring warnings is a bad idea.

What I want is to be able to write the first version of the code, just like I would have back in 1999, completely skip the four pain points listed above, yet still have all the benefits of mutual auth.

Service mesh seems to be addressing this need. A common practice is to have a sidecar proxy such as Envoy handle all the TLS stuff. Developers write code that ignores TLS and, transparently, in production, Envoy handles the heavy lifting.
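
One small piece of that heavy lifting is deciding when a certificate is due for replacement, so that nobody has to remember. The following is only a sketch of such a check; the CertificateRenewalPolicy class and the idea of a fixed renewal window are my own assumptions for illustration, not how Envoy or any particular mesh implements rotation.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: the class and the fixed renewal window are assumptions,
// not taken from Envoy or any specific service mesh.
class CertificateRenewalPolicy {
    // How long before expiry a replacement should be requested.
    private final Duration renewBefore;

    CertificateRenewalPolicy(Duration renewBefore) {
        this.renewBefore = renewBefore;
    }

    // True once "now" has reached the start of the renewal window,
    // i.e. notAfter minus renewBefore, or any time after that.
    boolean shouldRenew(Instant notAfter, Instant now) {
        Instant windowStart = notAfter.minus(renewBefore);
        return !now.isBefore(windowStart);
    }
}
```

With a 7-day window, a certificate expiring on January 31 becomes due for renewal from January 24 onward, without any developer having to put a reminder in their calendar.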

Secret Distribution and PKI

Sidecars like Envoy can only solve so much of the mutual auth problem on their own, though. Somehow the secret material must make it to them securely.

Service mesh could provide this transparently. The core feature needed is Public Key Infrastructure (PKI) for servers. Each server can prove that it has a well-known identity, that it is truly acting on behalf of a particular microservice, and that it cannot be tampered with by anyone outside of the team that maintains the service.

Cross-Service Call Tracing

Nowadays, there are many systems to help trace calls across multiple hops across multiple microservices: AWS X-Ray, Google Dapper, etc. Back in 1999 I didn’t have any of that stuff. Calling a microservice, had I had one to call, would have looked like:

Result result = client.call(args);
// Error handling elided


I wouldn’t have written:

String requestId = UnitOfWork.getInboundRequestId();
Result result = client.call(requestId, args);
// Error handling elided


Nor would I have bothered to handle inbound requestId metadata on the service side, either. With service mesh I would expect to get this for free.
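
For a sense of what “for free” would have to cover, here is a rough sketch of the propagation logic that would otherwise end up in every service. The RequestIdPropagator class is hypothetical and the header name is only illustrative; the point is that this is boilerplate a proxy can apply uniformly, in any language.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: the class and the header name are assumptions chosen
// for illustration.
class RequestIdPropagator {
    static final String HEADER = "x-request-id";

    // Build the headers for an outbound call: reuse the inbound request ID if
    // there is one, otherwise mint a fresh ID (this hop is the edge of the trace).
    static Map<String, String> outboundHeaders(Map<String, String> inboundHeaders) {
        String requestId = inboundHeaders.get(HEADER);
        if (requestId == null || requestId.isEmpty()) {
            requestId = UUID.randomUUID().toString();
        }
        Map<String, String> outbound = new HashMap<>();
        outbound.put(HEADER, requestId);
        return outbound;
    }
}
```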

Handling Timeouts and Backoff

Trying to handle backoffs and retries was something developers did have to write code for 20 years ago — and found it hard to get right. As an example, one of our large customer data sets had been split into a separate pool of about 7 databases. Having 7 databases solved some scaling problems we’d been having, but created new blast radius issues.

The databases themselves were “sharded” (distinct customers on each one), but the rest of the system was not sharded. Thus every web server host eventually connected to every database host. A failure of any one of the 7 database hosts had the potential to bring the entire website down, since eventually every webserver would touch it and potentially tie up a process. If our application code didn’t handle the database query just right, it would hang the unlucky process for too long, eventually causing widespread, cascading failures. In other words, we had 7 potential points of failure, instead of just one.

This is a complicated topic and one that our industry is still working on 20 years later. Service mesh may not be able to solve the problem completely, but it should help a lot. All of the code to handle retry, back-off, circuit-breaking, flow-control and so forth can be moved into the mesh. There, strategies can be applied consistently across groups of services rather than enduring haphazardly coded strategies from individual teams. Concretely, anything like the following code should be removed from business logic:

MumbleClient client = new MumbleClient(MUMBLE_CLIENT_TIMEOUT);
callWithRetry(ExponentialBackoffStrategy.class, () -> {
    try {
        Result result = client.call(args);
    } catch (Exception e) {
        if (Retryable.isRetryable(e)) {
            // Signal callWithRetry to back off and try again
            throw CallWithRetry.doRetry();
        } else {
            // Not retryable: propagate the original error
            throw e;
        }
    }
});


What’s wrong with this code?

• The MUMBLE_CLIENT_TIMEOUT is almost certainly wrong — it’s either too long, so the whole thread/process ends up hanging when it shouldn’t, or it’s too short, so customers see errors needlessly, or it used to be correct but isn’t any longer because the Mumble service is faster, or slower, or more prone to brown-outs, etc, than it used to be.
• Changing the timeout requires pushing out code to the running service. All code changes carry a certain amount of risk; this particular kind of change is especially risky.
• Exponential backoff strategies are difficult to get right and often don’t work as intended.
• Reconfiguring or removing the backoff strategy requires a code change.
• All of the above was specific to a particular programming language.

Service mesh allows us to remove as much of that code as possible, ideally boiling the above down to:

MumbleClient client = new MumbleClient();
try {
    Result result = client.call(args);
} catch (Exception e) {
    // handle error
}


The service mesh would be smart enough to understand that this is (say), OurService calling the MumbleService, and have specific policies about what that communication should look like (e.g., is it allowed in the first place?, how critical is it?, is MumbleService overloaded right now?, etc). The mesh can also learn what the time-outs should be rather than expect them to be hard-coded, or it could be controlled by developers but without having to push all of the business logic out in order to make a change.
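
One way to picture this is as policy held as data, keyed by which service is calling which, rather than constants compiled into business logic. The sketch below is purely illustrative: CallPolicy and PolicyTable are hypothetical names, and a real mesh would hold this in its own configuration and apply jitter, circuit-breaking and so on.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class names and fields are assumptions, not the
// configuration model of any real service mesh.
class CallPolicy {
    final long timeoutMillis;
    final int maxRetries;
    final long baseBackoffMillis;
    final long maxBackoffMillis;

    CallPolicy(long timeoutMillis, int maxRetries, long baseBackoffMillis, long maxBackoffMillis) {
        this.timeoutMillis = timeoutMillis;
        this.maxRetries = maxRetries;
        this.baseBackoffMillis = baseBackoffMillis;
        this.maxBackoffMillis = maxBackoffMillis;
    }

    // Capped exponential backoff: base * 2^attempt, never above the cap.
    // Attempts are numbered from 0. Jitter is omitted to keep the sketch small.
    long backoffMillis(int attempt) {
        long delay = baseBackoffMillis;
        for (int i = 0; i < attempt && delay < maxBackoffMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxBackoffMillis);
    }
}

class PolicyTable {
    private final Map<String, CallPolicy> policies = new HashMap<>();
    private final CallPolicy defaultPolicy;

    PolicyTable(CallPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    // Changing a policy is a data change; no business logic is redeployed.
    void set(String caller, String callee, CallPolicy policy) {
        policies.put(caller + "->" + callee, policy);
    }

    // Fall back to the default when no specific policy exists for the pair.
    CallPolicy lookup(String caller, String callee) {
        CallPolicy policy = policies.get(caller + "->" + callee);
        return policy != null ? policy : defaultPolicy;
    }
}
```

With a base of 100 ms and a cap of 1000 ms, successive retries would wait 100, 200, 400, 800 and then 1000 ms.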

Mechanistic Service Availability Invariants

In 1999 I was working on a 2-tier application. The outer tier was a website and the inner tier was a third-party database. Verifying that the correct tier depended on the correct other tier took about 5 seconds, since there weren’t enough tiers. There was simply no way for the database to sprout the ability to make network connections back to the website. Similarly we weren’t worried about blast radius, since we didn’t have multiple failure domains to take advantage of.

Nowadays, both of these are important problems and are difficult to solve without expensive and error-prone manual audits. Adding a bad/recursive dependency between two services (say having Mumble depend on Blorch while at the same time Blorch depends on Mumble) is as easy as a developer, at any point in time, innocently adding a call to that service for some reasonable reason. Other invariants around maintaining blast radius (cell architecture, etc) have the same problem. Years of developer effort can be thwarted by a single new line of code that calls a service in the wrong fault zone.
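
The recursive dependency case can at least be detected mechanically, provided something knows who calls whom. As a sketch under that assumption (the DependencyGraph class is hypothetical, and a mesh would learn the edges from observed or declared traffic rather than from addCall), a depth-first search is enough:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a call graph with cycle detection via depth-first search.
class DependencyGraph {
    private final Map<String, List<String>> calls = new HashMap<>();

    // Record that "caller" makes calls to "callee".
    void addCall(String caller, String callee) {
        calls.computeIfAbsent(caller, k -> new ArrayList<>()).add(callee);
    }

    // True if any service can reach itself by following call edges.
    boolean hasCycle() {
        Set<String> inProgress = new HashSet<>();
        Set<String> done = new HashSet<>();
        for (String service : calls.keySet()) {
            if (visit(service, inProgress, done)) {
                return true;
            }
        }
        return false;
    }

    private boolean visit(String service, Set<String> inProgress, Set<String> done) {
        if (done.contains(service)) {
            return false; // already fully explored, no cycle through here
        }
        if (!inProgress.add(service)) {
            return true; // reached a service that is still on the current path
        }
        for (String callee : calls.getOrDefault(service, Collections.<String>emptyList())) {
            if (visit(callee, inProgress, done)) {
                return true;
            }
        }
        inProgress.remove(service);
        done.add(service);
        return false;
    }
}
```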

This is one of those problems that is “difficult in the large”. If you’re inspecting a small enough portion of the problem (say you’re auditing a single Java file), it’s fairly easy to spot a mistake. Across 100s of teams and services, and across years of changes, it’s a hard problem. It seems like service mesh is also addressing this in a way that wouldn’t require any new code or frameworks beyond what we would have written 20 years ago:

// This is in OurVeryHighlyAvailableService.java, and the following,
// unbeknownst to the developer who wrote this code, isn't so highly
// available.
Connection conn = innocentClient.connect(innocentSeemingIp);


Service mesh, as far as I understand it, solves this problem without requiring developers to write more code. The mesh knows which services are allowed to call which other services, and can fail either the DNS lookup, or the ability to establish the connection, or both. This enforcement works across all programming languages and operating systems.
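
As an example of the kind of rule that could be enforced this way, consider “a service classified as highly available may not call one that isn’t”. The check itself is tiny; the hard part is applying it everywhere, which is what the mesh is for. The AvailabilityPolicy class below is a hypothetical sketch, not any real mesh’s authorization API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an availability-tier rule evaluated per call.
class AvailabilityPolicy {
    private final Set<String> highlyAvailable = new HashSet<>();

    // Classify a service as highly available.
    void markHighlyAvailable(String service) {
        highlyAvailable.add(service);
    }

    // A highly available caller may only call highly available callees.
    // Callers without that classification may call anything.
    boolean isCallAllowed(String caller, String callee) {
        return !highlyAvailable.contains(caller) || highlyAvailable.contains(callee);
    }
}
```

A mesh evaluating this at connection time would refuse the innocent-looking call in the example above, regardless of what language OurVeryHighlyAvailableService is written in.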

Polyglot Programming

As mentioned above, I was using only one programming language for production systems in 1999: C. Nowadays, most companies use multiple programming languages in production: Java, Python, Ruby, Go, Rust, JavaScript, Clojure, Scala, C#, C++, and more. While an entire book could be written about the pros and cons of this state of affairs, given how software is developed it seems like a net win. There simply isn’t a monopoly on good ideas. If one wants to take advantage of literal armies of people working on various useful things, you have to use the programming language they’re using, not try to port all their code to your language.

One clear challenge with the polyglot style is that sometimes teams need to provide “fat clients” to their service. The fat client might help with compute offload or some other problem that is hard to solve any other way. They usually can’t afford to write more than one fat client though, so that means picking a particular programming language. This wasn’t a problem for me in 1999 because I was only using one programming language. Nowadays, service mesh can help extend fat clients to other languages by running them in a proxy sidecar.

Another example is that of the highly available service mentioned above. Ideally one would like to ensure, mechanistically, that all highly available services actually are highly available, via some kind of automatic, continual audit. One way to do this would be to ask every service team to use a software library to wrap all of their calls to every other service. The library would enforce policies such as “services classified as highly available are unable to call services not so classified over the network” (but the opposite flow would be ok). Similar arguments could be made for things like enforcing proper blast radius architecture. However, this runs into the same question as the fat client: which programming languages do you support? Service mesh can help with this concern, and no doubt already is.

Load Balancing

It is now clear that, even if we had great self-service tools, managing load balancers is too much undifferentiated heavy lifting (tax) to impose on hundreds or thousands of service teams. My own personal experience with load balancers is that it takes a considerable amount of time to manage them properly. Instead, this kind of functionality seems to be more and more part of service mesh technology. The mesh knows about everything necessary to fully automate load-balancing: hosts, host health, service health metrics, ongoing deployments (including blue/green), canary tests, one-box tests, and so forth. This would result in better availability (fewer outages due to human error) and also remove a significant tax on teams.
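
As a toy illustration of load balancing driven by host health, here is a round-robin picker that skips hosts marked unhealthy. It is a sketch only: the HealthAwareBalancer class is hypothetical, it is not thread-safe, and a real mesh would feed it from health checks and deployment state rather than from manual calls.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: round-robin selection that skips unhealthy hosts.
// Not thread-safe; intended only to illustrate the idea.
class HealthAwareBalancer {
    private final List<String> hosts = new ArrayList<>();
    private final Set<String> unhealthy = new HashSet<>();
    private int next = 0;

    void addHost(String host) {
        hosts.add(host);
    }

    void markUnhealthy(String host) {
        unhealthy.add(host);
    }

    void markHealthy(String host) {
        unhealthy.remove(host);
    }

    // Try each host at most once, starting from the round-robin position.
    // Returns null when no healthy host is available.
    String pick() {
        for (int i = 0; i < hosts.size(); i++) {
            String candidate = hosts.get(next);
            next = (next + 1) % hosts.size();
            if (!unhealthy.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```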

Monitoring and Alarming

In 1999 there weren’t tools/features like CloudWatch, CloudWatch Logs, DataDog, Splunk, Grafana, Prometheus, etc. The only option was to log to a log file.

“Fred, hold my beer!”

In conclusion a few things stand out to me, in retrospect:

• Developing services was not less complex in 1999, but industry standards were more lax (and more naive)
• Computers were slower back then (duh); compromises were made, out of necessity, that we are no longer willing to make
• Full automation and self-service are not the same thing; developing services today isn’t always more complicated, sometimes it’s just more self-service so we notice the complexity

In a recent blog entry, Fred Hebert posited (correctly, in my opinion), that “complexity has to live somewhere” and also referred to Fred Brooks’ distinction (from No Silver Bullet) between accidental and essential complexity. Accidental complexity is stuff you have to do like builds and infrastructure and configuration and so forth, whereas essential complexity is things like the code your customers want you to deliver. Brooks made the now bold-sounding claim, back in 1986, that a typical developer only spent about 10% of their time dealing with accidental complexity. My guess is he was probably about right, at that point in time. However, round about 1999, at the dawn of the microservices era, I think our industry had a collective, “Fred, hold my beer” moment.

As tool after tool and concept after concept got added, the amount of accidental complexity a typical developer is faced with has indeed grown. Some developers today are no doubt spending 90% (or more!) of their time on accidental complexity and 10% (or less!) on the rest. This complexity, as Hebert points out, was always there, it was just being handled elsewhere (centralized teams of humans), which had its own downsides.

So for this reason I think that what service mesh will ideally bring about is creating abstractions that cut down on the net accidental complexity we’ve added (for good reasons) over the last 20 years. In many cases those abstractions already exist in the form of IETF RFCs.