Paradigm for building clouds with Circuit and Escher

View a short slide deck with the key points in this article.

This article is a design document that describes a framework for building and maintaining cloud applications composed of large numbers of interconnected services in a manner that is intuitive and understandable to users.

We propose a syntactic abstraction, called Escher circuits, for representing the state of the cloud. The abstraction enables modular composition of large circuits from smaller components, facilitating manual descriptions of cloud topologies. Further, it supports a circuit “difference” calculation to facilitate the representation of incremental system changes.

The result is a system that provides a 3-step workflow for the Operations Engineer, which is captured in the following command-line operations:

cloud sense > current_state.circuit
cloud diff desired_state.circuit current_state.circuit > diff.circuit
cloud materialize diff.circuit

All .circuit files involved in the control of the cloud are simple text files (that use Escher syntax) and as such all changes to the cloud integrate cleanly with versioning systems like Git, resulting in full historical accountability of cloud state changes.

Framework

Every well-defined system requires a clear specification of the objects at play, their possible interrelations at any moment in time, as well as the allowable operations that can be performed on its components.

The systems of interest here, which model cloud applications in the datacenter, have three types of components: hosts, services and links. We will treat these objects cleanly in a formal manner, but it should be clear that they will end up corresponding to well-known real technologies utilized in specific manners.

Our hosts will correspond to physical machines (or virtual machines, as the case might be). Our services will correspond to Docker containers, whose images are configured in a standard manner to expect a number of named incoming or outgoing TCP connections. And each of our links will correspond to a pair of lightweight DNS servers, one at each endpoint host, configured to point the respective Docker TCP connections at each other.

The exact details of the correspondence between hosts, services and links, and machines, Docker containers and DNS servers, respectively, will be fleshed out in a later section. For now, suffice it to say that this correspondence will be made simple and natural through the use of the gocircuit.org tool for dynamic cloud orchestration (although with an appropriate driver a similar result can be accomplished with any cloud provider like Google Compute Engine or Amazon EC2, for instance).

Getting back to the abstract system framework, the allowed relationships between hosts, services and links are described in a few simple postulates:

Every host in the system is identified by a unique string identifier.
Every service “resides” on one host and every such service has a string identifier, unique only across the services residing on the same host.
Every service has a “type” denoted by a string (which will correspond to the Docker image name of its container).
Every service can have zero or more named “valves” (where a valve will correspond to a TCP connection, client or server) under the requirement that valve names are unique within one service.
Every link “connects” one service-valve pair to another, so that no such pair is connected more than once.
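
As a rough illustration, the postulates above could be captured in a few Go types. This is only a sketch: the names System, Service, Endpoint, Validate and Example are hypothetical and not part of Circuit or Escher. Map keys take care of the uniqueness requirements, and Validate checks the link postulate on a small three-service system.

```go
package main

import "fmt"

// Endpoint names a service-valve pair together with the host it lives on.
type Endpoint struct{ Host, Service, Valve string }

// Service carries a type (the Docker image name) and its named valves.
type Service struct {
	Type   string
	Valves map[string]bool
}

// System holds hosts keyed by unique name; each host maps service names,
// unique per host, to services. Map keys enforce the uniqueness postulates.
type System struct {
	Hosts map[string]map[string]Service
	Links [][2]Endpoint
}

// Validate checks the link postulate: every endpoint must exist and no
// service-valve pair may be connected more than once.
func (s System) Validate() error {
	used := map[Endpoint]bool{}
	for _, link := range s.Links {
		for _, e := range link {
			svc, ok := s.Hosts[e.Host][e.Service]
			if !ok || !svc.Valves[e.Valve] {
				return fmt.Errorf("link refers to unknown endpoint %v", e)
			}
			if used[e] {
				return fmt.Errorf("endpoint %v is connected more than once", e)
			}
			used[e] = true
		}
	}
	return nil
}

// Example builds a small system with three services and two links.
func Example() System {
	return System{
		Hosts: map[string]map[string]Service{
			"host1": {
				"cache":  {Type: "MemCache", Valves: map[string]bool{"y": true, "z": true}},
				"server": {Type: "Http", Valves: map[string]bool{"x": true}},
			},
			"host2": {
				"database": {Type: "Redis", Valves: map[string]bool{"w": true}},
			},
		},
		Links: [][2]Endpoint{
			{{"host1", "server", "x"}, {"host1", "cache", "y"}},
			{{"host1", "cache", "z"}, {"host2", "database", "w"}},
		},
	}
}

func main() {
	// A nil error means all postulates hold for the example system.
	fmt.Println(Example().Validate())
}
```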

Relationships between the components of a system can be represented visually using the same symbolism employed by Escher for representing nested circuits:

In the illustration above there are two hosts named host1 and host2. Two services, named cache and server, reside on host1. One service, named database, resides on host2. Service cache is of type MemCache, service server is of type Http and service database is of type Redis. There are two links in the system: one connecting the service-valve pair (server, x) to (cache, y), and one connecting (cache, z) to (database, w). (Disregard the labels p and q for now.)

Thus far we have addressed the properties describing the state of a system in a singular moment in time. System state can change over time, or “dynamically”, according to the following additional postulates:

Hosts, services and links can “emerge” and “disappear” asynchronously from the system.
When a host disappears, all services residing on it disappear as well.
When a service disappears, all links incident to it disappear as well.

In particular, hosts, services and links can appear independently of each other.
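
The two cascading rules can be sketched as follows, assuming a flat in-memory snapshot of the system. The State type and its methods are hypothetical names chosen for illustration; services are keyed by "host.service" and link endpoints are written "host.service:valve".

```go
package main

import (
	"fmt"
	"strings"
)

// State is a flat snapshot of the system at one moment in time.
type State struct {
	Hosts    map[string]bool    // host names
	Services map[string]string  // "host.service" -> type
	Links    map[[2]string]bool // pairs of "host.service:valve" endpoints
}

// serviceOf strips the valve name from an endpoint.
func serviceOf(endpoint string) string {
	name, _, _ := strings.Cut(endpoint, ":")
	return name
}

// RemoveService makes a service disappear along with all incident links.
func (s *State) RemoveService(name string) {
	delete(s.Services, name)
	for link := range s.Links {
		if serviceOf(link[0]) == name || serviceOf(link[1]) == name {
			delete(s.Links, link)
		}
	}
}

// RemoveHost makes a host disappear along with all services residing on it.
func (s *State) RemoveHost(host string) {
	delete(s.Hosts, host)
	for name := range s.Services {
		if strings.HasPrefix(name, host+".") {
			s.RemoveService(name)
		}
	}
}

// Example returns the illustrated system with two hosts and three services.
func Example() *State {
	return &State{
		Hosts: map[string]bool{"host1": true, "host2": true},
		Services: map[string]string{
			"host1.cache":    "MemCache",
			"host1.server":   "Http",
			"host2.database": "Redis",
		},
		Links: map[[2]string]bool{
			{"host1.server:x", "host1.cache:y"}:   true,
			{"host1.cache:z", "host2.database:w"}: true,
		},
	}
}

func main() {
	s := Example()
	s.RemoveHost("host2") // database and the cache-database link go with it
	fmt.Println(len(s.Services), len(s.Links)) // prints "2 1"
}
```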

Some of these dynamic events (of emergence or disappearance) will be caused by external factors (for instance a host might die due to bad hardware) and others will be caused by operations that we perform with the system (for instance, we might start a service). No matter what the cause for an event is, the important thing is that these are the only changes of state that can happen to the system.

The resulting UI to the engineer

We view the cloud itself as an independent device, which computes and changes over time. Some of the changes to the cloud will be caused by external factors, for instance physical failures in the hosting hardware. Other changes will be caused by commands initiated by the user.

Since user-initiated changes and external changes are mutually asynchronous, we propose the following simple workflow for the user's point-of-view or point-of-control, as the case might be:

Connect to the “cloud” and retrieve a consistent representation of the “current” cloud state.
Compute the difference between a representation of the “desired” state of the cloud and the retrieved “current” state.
Send a minimal stream of “commands” to the cloud, aimed at modifying its state from “current” to “desired”.

In the remainder of this document, we describe the design of a command-line tool cloud which embodies the above three operations as—

cloud sense > current.circuit
cloud diff desired.circuit current.circuit > diff.circuit
cloud materialize diff.circuit

The descriptions below assume the Circuit as the underlying cloud management backend. It is however entirely possible that other backends, such as Amazon EC2 or Google Compute Engine, be used.

Representation

The symbolic visual representation of system state, exemplified above, can very well be used as a formal representation, much like architectural blueprints are used as formal representations of building design. However, this visual representation, while natural for people, is not easy for machines to use.

As we explain in the section on Escher syntax, this visual representation has an equivalent syntactic (i.e. written) form, which is well-suited for machine manipulations. In particular, the syntactic representation of the diagram above would be as follows:

{
	host1 {
		cache MemCache
		server Http
		server:x = cache:y
		cache:z = :p
	}
	host2 {
		database Redis
		database:w = :q
	}
	host1:p = host2:q
}

In other words, every system state can be represented in the form of an Escher circuit. This gives us a two-fold benefit.

On the one hand, Escher circuits can be manipulated programmatically (both from Go and from Escher) simply as data structures. This allows flexible programmatic investigation of system state through familiar technologies.

On the other hand, Escher's programming and materialization mechanism allows for such circuits to be built out modularly from smaller component circuits. In other words, large datacenter topologies can be composed out of smaller standard components, whereby even the component circuits can span multiple machines and themselves be non-trivial subsystems.

For instance, our example system state could be generated out of smaller components in the following manner. Let the following circuit be an index (i.e. a library), consisting of two circuit designs:

Index {
	HttpHost {
		cache MemCache
		server Http
		server:x = cache:y
		cache:z = :p
	}
	DbHost {
		database Redis
		database:w = :q
	}
}

Then, if we materialize the program

{
	host1 HttpHost
	host2 DbHost
	host1:p = host2:q
}

relative to Index, the resulting residue will be precisely the system state circuit that we started with, i.e. the one illustrated in the drawing above.

Dual representation

We call the circuit representation of system state, described thus far, a “primal” representation or simply a primal. Every primal has an equivalent “dual” representation. Transitioning from primal to dual and vice-versa is a matter of a simple transformation, as we shall see.

The dual representation of system state is useful to us, as it is more convenient to carry out certain manipulations within this representation. In particular, it will be easier to compute the difference between two states in the dual. It will also be easier to “materialize” a dual system state description into an actual running datacenter topology.

The dual representation of a system state primal consists of two lists: a list of services and a list of links.

The list of services simply enumerates all services found in the primal, each specified by its “full path” in the primal, accompanied by its type. For our running example, the list of services would be

(host1.cache, MemCache)
(host1.server, Http)
(host2.database, Redis)

The list of links enumerates all service-to-service links present in the primal representation as pairs of endpoints, wherein each endpoint (a service-valve pair) is also specified by its “full path” in the primal. In our example, that list would be:

(host1.server:x, host1.cache:y)
(host1.cache:z, host2.database:w)

It is not hard to see how the primal can be derived from the dual by reversing this process.
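
The primal-to-dual direction might look as follows in Go. This is a minimal sketch under stated assumptions: the Primal, Host and Dual types are hypothetical stand-ins rather than Escher's own data structures, and a host-level valve such as p is modelled as an "export" that names the service valve wired to it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Host is one host-level circuit of the primal.
type Host struct {
	Services map[string]string // service name -> type
	Links    [][2]string       // links inside the host, as "service:valve" pairs
	Exports  map[string]string // host valve (e.g. "p") -> "service:valve" wired to it
}

// Primal is the top-level circuit: hosts plus links between host valves.
type Primal struct {
	Hosts map[string]Host
	Links [][2]string // "host:valve" pairs, e.g. {"host1:p", "host2:q"}
}

// Dual is the flat form: (full path, type) pairs and endpoint pairs.
type Dual struct {
	Services [][2]string
	Links    [][2]string
}

// sortedKeys returns map keys in lexicographic order for stable output.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// resolve turns "host:valve" into the full path of the service valve behind it.
func resolve(p Primal, hostValve string) string {
	host, valve, _ := strings.Cut(hostValve, ":")
	return host + "." + p.Hosts[host].Exports[valve]
}

// ToDual flattens a primal into its list of services and list of links.
func ToDual(p Primal) Dual {
	var d Dual
	for _, h := range sortedKeys(p.Hosts) {
		host := p.Hosts[h]
		for _, s := range sortedKeys(host.Services) {
			d.Services = append(d.Services, [2]string{h + "." + s, host.Services[s]})
		}
		for _, l := range host.Links {
			d.Links = append(d.Links, [2]string{h + "." + l[0], h + "." + l[1]})
		}
	}
	for _, l := range p.Links {
		d.Links = append(d.Links, [2]string{resolve(p, l[0]), resolve(p, l[1])})
	}
	return d
}

// Example encodes the running example circuit.
func Example() Primal {
	return Primal{
		Hosts: map[string]Host{
			"host1": {
				Services: map[string]string{"cache": "MemCache", "server": "Http"},
				Links:    [][2]string{{"server:x", "cache:y"}},
				Exports:  map[string]string{"p": "cache:z"},
			},
			"host2": {
				Services: map[string]string{"database": "Redis"},
				Exports:  map[string]string{"q": "database:w"},
			},
		},
		Links: [][2]string{{"host1:p", "host2:q"}},
	}
}

func main() {
	d := ToDual(Example())
	fmt.Println(d.Services)
	fmt.Println(d.Links)
}
```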

Furthermore, it is self-evident that one can compute the “difference” between two systems, when this makes sense, by simply computing the difference of their corresponding dual representations elementwise.
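
One possible reading of the elementwise difference is sketched below. It assumes the result is split into elements to add (in desired but not in current) and elements to remove (in current but not in desired), and that a link is the same regardless of the order of its endpoints. The Dual, Link and Diff names are hypothetical.

```go
package main

import "fmt"

// Link is an unordered pair of endpoints, stored in sorted order so that
// (a, b) and (b, a) compare equal.
type Link [2]string

// NewLink normalizes the order of the two endpoints.
func NewLink(a, b string) Link {
	if b < a {
		a, b = b, a
	}
	return Link{a, b}
}

// Dual is a set-based form of the dual representation.
type Dual struct {
	Services map[string]string // full path -> type
	Links    map[Link]bool
}

// minus returns the elements of x that are missing from y. A service whose
// type differs between x and y counts as missing.
func minus(x, y Dual) Dual {
	out := Dual{Services: map[string]string{}, Links: map[Link]bool{}}
	for name, typ := range x.Services {
		if t, ok := y.Services[name]; !ok || t != typ {
			out.Services[name] = typ
		}
	}
	for l := range x.Links {
		if !y.Links[l] {
			out.Links[l] = true
		}
	}
	return out
}

// Diff computes what must be added to and removed from current to reach desired.
func Diff(desired, current Dual) (add, remove Dual) {
	return minus(desired, current), minus(current, desired)
}

func main() {
	current := Dual{
		Services: map[string]string{"host1.cache": "MemCache", "host2.database": "Redis"},
		Links:    map[Link]bool{NewLink("host1.cache:z", "host2.database:w"): true},
	}
	desired := Dual{
		Services: map[string]string{"host1.cache": "MemCache", "host1.server": "Http"},
		Links:    map[Link]bool{NewLink("host1.server:x", "host1.cache:y"): true},
	}
	add, remove := Diff(desired, current)
	fmt.Println("add:", add.Services, add.Links)
	fmt.Println("remove:", remove.Services, remove.Links)
}
```

With this convention a service whose type changes shows up on both sides, which amounts to stopping the old container and starting a new one.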

Sensing and materializing

Sensing and materializing are the two operations that convert between the abstract circuit representation of a cloud topology and the actual physical topology that executes on the cloud.

Sensing is the operation of “reading” the current state of the cloud and representing it in the primal form for the engineer to work with.

Materializing is the operation of “writing” (or “executing”) a cloud topology in primal form into an actual physical network of services running in the cloud.

In the following sections we describe how sensing and materializing to and from dual form work. The subsequent conversion from dual to primal, a mere data structure transformation, was explained in the previous section.

The specific API for manipulating the cloud can be any: Google Compute Engine, Amazon EC2, Circuit, and so forth. Our following explanations will be based on the Circuit as its simple API provides exactly the minimum necessary for such manipulations.

Preparing Docker service containers

We have chosen to use executable Docker containers as embodiment for services.

Each service communicates with the outside—with other services—through a set of zero or more named valves. A valve corresponds to a TCP client connection, a TCP server connection or both.

Service container images must be prepared in a standardized manner so that after the execution of a container, our framework can (i) obtain the TCP server address corresponding to each valve (if there is one), as well as (ii) supply the remote TCP server address if the valve also corresponds to a TCP client connection.

There are various ways to prepare Docker containers to accomplish this and we do not mandate a specific one. Here, we merely suggest one way of doing it without going into unnecessary technical detail.

To accomplish (i), one can utilize the Docker port mapping mechanism. In particular, the software inside the container can be hardwired to listen on specific port numbers which, in lexicographic order, correspond to the valve names of the service. Once the container is executed, the effective TCP server addresses—those visible to other containers in the cloud network—can be automatically obtained using the docker port command. They will be utilized by our system to “link” containers (i.e. service valves) in a manner described later.
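
As a sketch of how suggestion (i) might be implemented, the function below pairs valve names with published addresses. It rests on several assumptions that the design does not fix: the input is the text printed by docker port for one container (lines such as "7001/tcp -> 0.0.0.0:49153"), container ports are ordered numerically, valve names are ordered lexicographically, and every listed valve has a server port. ValveAddrs is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ValveAddrs maps each valve name to a published address by pairing the
// i-th valve name (sorted) with the i-th container port (sorted).
func ValveAddrs(serverValves []string, dockerPortOutput string) (map[string]string, error) {
	published := map[int]string{} // container port -> published address
	for _, line := range strings.Split(strings.TrimSpace(dockerPortOutput), "\n") {
		left, addr, ok := strings.Cut(line, " -> ")
		if !ok {
			continue // skip blank or unrelated lines
		}
		portText, _, _ := strings.Cut(strings.TrimSpace(left), "/")
		port, err := strconv.Atoi(portText)
		if err != nil {
			return nil, fmt.Errorf("unexpected line %q: %v", line, err)
		}
		// Keep the first address listed for a port (assumption).
		if _, seen := published[port]; !seen {
			published[port] = strings.TrimSpace(addr)
		}
	}
	ports := make([]int, 0, len(published))
	for p := range published {
		ports = append(ports, p)
	}
	sort.Ints(ports)
	names := append([]string(nil), serverValves...)
	sort.Strings(names)
	if len(names) != len(ports) {
		return nil, fmt.Errorf("%d valves but %d published ports", len(names), len(ports))
	}
	out := make(map[string]string, len(names))
	for i, name := range names {
		out[name] = published[ports[i]]
	}
	return out, nil
}

func main() {
	output := "7001/tcp -> 0.0.0.0:49153\n7002/tcp -> 0.0.0.0:49154\n"
	addrs, err := ValveAddrs([]string{"z", "y"}, output)
	fmt.Println(addrs, err)
}
```

A real deployment would also have to replace a wildcard address such as 0.0.0.0 with an address of the host that other containers can reach.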

To accomplish (ii), we propose configuring each Docker service container to use a DNS server whose address is passed to it upon execution, using any one of the various mechanisms that Docker provides for passing arguments to containers. Subsequently, the software executing inside the Docker container should be hardwired to obtain the IP address for any given valve name by looking up that valve name (perhaps prefixed by a standard domain name) through the DNS system. Our framework, described later, which executes the Docker containers will arrange for a light-weight dedicated DNS server for each container, whose sole job would be to resolve these queries appropriately.

Materializing a dual form to the cloud

Let us consider the task of materializing the system from our running example which, as we showed above, has the following dual form. The list of services is:

(host1.cache, MemCache)
(host1.server, Http)
(host2.database, Redis)

And the list of links is:

(host1.server:x, host1.cache:y)
(host1.cache:z, host2.database:w)

Materialization proceeds like so:

Obtain a list of available and unused hosts in the cloud.
The Circuit API presents all of its resources uniformly as a file system, where root level directories correspond to available hosts. Unused hosts are precisely those root level directories that have no children (i.e. no services or other Circuit elements are running on them). Such a list can be obtained through the API or through the command line using circuit ls /.... Let us assume, for instance, that the list of available and unused hosts is
```
/X65cc3c8e31817756
/Xe4abe0c286b0e0bc
/X9738a5e29e51338e
```
Group the elements of the list of services (from the dual) by host and assign a unique (available and unused) Circuit host to each of the hosts from the dual. For instance:
```
(/X65cc3c8e31817756, host1)
(/Xe4abe0c286b0e0bc, host2)
```
Execute every service in the dual, as follows. Take, for instance, the service
```
(host1.cache, MemCache)
```
- Create a dedicated light-weight DNS server for this service, on the Circuit host assigned to this service in the previous step. Using the Circuit, we spawn a DNS element and choose its name to follow this convention:
```
/X65cc3c8e31817756/host1/cache/dns
```
  This is accomplished using the Circuit's circuit mkdns command. The details of this are omitted for brevity. Initially the DNS server will have no resource records, i.e. it will not resolve any lookups. Appropriate records will be added to it later, when we materialize the list of links from the dual form.
- Execute the service's Docker container on that same host using a similar naming convention:
```
/X65cc3c8e31817756/host1/cache/service
```
  This is accomplished using the Circuit's circuit mkdkr command. Recall that the service type, MemCache in this case, is used as the name of the Docker image. Furthermore, the IP address of the DNS server created in the previous step is passed to the Docker container on execution.
For each link in the list of links, add DNS resource records to the appropriate DNS servers. Take for instance the link:
```
(host1.cache:z, host2.database:w)
```
- First, we inquire into the TCP server address for host1.cache:z, if one is available. To do so, we access the Docker container
```
/X65cc3c8e31817756/host1/cache/service
```
  and we query the TCP server address for the valve named z, using the Docker port exporting provisions set in place as described earlier.
- Next, we access the Circuit DNS element
```
/Xe4abe0c286b0e0bc/host2/database/dns
```
  and set the resource record for the domain name w to the TCP server address obtained in the previous step. In addition to setting a DNS A record for the name w, we also set a DNS TXT record for the same name with the value host1.cache:z. This TXT record will later facilitate recovering the dual form for this link directly from the DNS server itself.
- Finally, we repeat the same process with the roles of host1.cache:z and host2.database:w reversed.
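
The bookkeeping in the steps above (host assignment, element naming and the DNS updates implied by a link) can be sketched in Go as a pure planning step. The sketch deliberately makes no Circuit API calls, and the names AssignHosts, ElementPath, LinkRecords and Record are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is one DNS update implied by a link.
type Record struct {
	DNSPath string // Circuit path of the DNS element to update
	Name    string // valve name, used as the domain name
	Peer    string // opposite endpoint: source of the A record, value of the TXT record
}

// AssignHosts gives each host named in the dual its own unused Circuit host.
func AssignHosts(services, unused []string) (map[string]string, error) {
	assigned := map[string]string{}
	for _, s := range services {
		host, _, _ := strings.Cut(s, ".")
		if _, done := assigned[host]; done {
			continue
		}
		if len(assigned) == len(unused) {
			return nil, fmt.Errorf("no unused Circuit host left for %q", host)
		}
		next := unused[len(assigned)]
		assigned[host] = next
	}
	return assigned, nil
}

// ElementPath builds "<circuit host>/<host>/<service>/<leaf>", where leaf
// is "dns" or "service".
func ElementPath(assigned map[string]string, service, leaf string) string {
	host, _, _ := strings.Cut(service, ".")
	return assigned[host] + "/" + strings.ReplaceAll(service, ".", "/") + "/" + leaf
}

// LinkRecords lists the two DNS updates for a link, one per endpoint.
func LinkRecords(assigned map[string]string, link [2]string) []Record {
	var recs []Record
	for i, local := range link {
		service, valve, _ := strings.Cut(local, ":")
		recs = append(recs, Record{ElementPath(assigned, service, "dns"), valve, link[1-i]})
	}
	return recs
}

func main() {
	services := []string{"host1.cache", "host1.server", "host2.database"}
	unused := []string{"/X65cc3c8e31817756", "/Xe4abe0c286b0e0bc", "/X9738a5e29e51338e"}
	assigned, err := AssignHosts(services, unused)
	if err != nil {
		panic(err)
	}
	fmt.Println(ElementPath(assigned, "host1.cache", "service"))
	for _, r := range LinkRecords(assigned, [2]string{"host1.cache:z", "host2.database:w"}) {
		fmt.Printf("%s: name %s, A and TXT from %s\n", r.DNSPath, r.Name, r.Peer)
	}
}
```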

Sensing the cloud state into a dual form

Reading the current state of the cloud is fairly straightforward. After listing the contents of the Circuit, using circuit ls /..., there will only be paths ending in /service and paths ending in /dns. We are going to read the list of services from the former, and then the list of links from the latter.

To read the list of services, we consider each path ending in /service. For instance, the path

/X65cc3c8e31817756/host1/cache/service

will correspond to a service named host1.cache (simply drop the first and last path elements and replace slashes with dots). Then we query the configuration of the underlying Docker container, using the circuit peek command. This gives us the Docker image name of the container—which is the service type—and thus the service entry has been recovered.
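
The path-to-name rule can be written as a small Go function. ServiceName is a hypothetical helper; querying the container for its image name is left out because it depends on the Circuit client in use.

```go
package main

import (
	"fmt"
	"strings"
)

// ServiceName converts a Circuit path ending in /service into the service's
// full path in the primal: drop the first and last path elements and join
// the rest with dots. It reports false for paths of any other shape.
func ServiceName(circuitPath string) (string, bool) {
	parts := strings.Split(strings.Trim(circuitPath, "/"), "/")
	if len(parts) < 3 || parts[len(parts)-1] != "service" {
		return "", false
	}
	return strings.Join(parts[1:len(parts)-1], "."), true
}

func main() {
	name, ok := ServiceName("/X65cc3c8e31817756/host1/cache/service")
	fmt.Println(name, ok) // prints "host1.cache true"
}
```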

To read the list of links, we consider in turn each path ending in /dns unless it has already been considered. For instance—

/X65cc3c8e31817756/host1/cache/dns

This path will be a link endpoint with a prefix host1.cache:, as follows from the manner in which we materialized links in the previous section.

We then list the DNS resource records at this path, using circuit peek, and in the case of this example we will see resource records for the domain names y and z. In other words, the names correspond to valve names of the service. And so each name gives us one endpoint in a link. In this case—

(host1.cache:y, …)
(host1.cache:z, …)

To recover the other endpoint in each of the links, it suffices to look at the DNS TXT record accompanying each of the names, y and z. These TXT records will contain, as per the materialization process, the other endpoint of the respective link, thus allowing us to recover the whole links:

(host1.cache:y, host1.server:x)
(host1.cache:z, host2.database:w)

Before we add these links to the list of links, we also verify that the opposing service is still alive. Otherwise by convention we treat the link as not present. For instance, if we want to verify that the endpoint host2.database is alive, we simply consider the Circuit path list, obtained with circuit ls /..., and look for the glob pattern /*/host2/database/service.
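
Link recovery and the liveness check might be sketched as below, assuming the TXT records of one DNS element have already been read into a map from valve name to the opposite endpoint. SenseLinks and Alive are hypothetical names; the glob test uses path.Match from the Go standard library.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// Alive reports whether the service that owns endpoint still has a
// ".../service" element among the listed Circuit paths.
func Alive(listing []string, endpoint string) bool {
	service, _, _ := strings.Cut(endpoint, ":")
	pattern := "/*/" + strings.ReplaceAll(service, ".", "/") + "/service"
	for _, p := range listing {
		if ok, _ := path.Match(pattern, p); ok {
			return true
		}
	}
	return false
}

// SenseLinks rebuilds the links seen from one DNS element. txt maps each
// valve name to the TXT record stored for it, i.e. the opposite endpoint.
// Links whose opposite service is no longer listed are dropped.
func SenseLinks(dnsPath string, txt map[string]string, listing []string) [][2]string {
	parts := strings.Split(strings.Trim(dnsPath, "/"), "/")
	if len(parts) < 3 {
		return nil
	}
	prefix := strings.Join(parts[1:len(parts)-1], ".") + ":"
	valves := make([]string, 0, len(txt))
	for v := range txt {
		valves = append(valves, v)
	}
	sort.Strings(valves) // stable order for reproducible output
	var links [][2]string
	for _, v := range valves {
		if peer := txt[v]; Alive(listing, peer) {
			links = append(links, [2]string{prefix + v, peer})
		}
	}
	return links
}

func main() {
	listing := []string{
		"/X65cc3c8e31817756/host1/cache/service",
		"/X65cc3c8e31817756/host1/cache/dns",
		"/X65cc3c8e31817756/host1/server/service",
	}
	txt := map[string]string{"y": "host1.server:x", "z": "host2.database:w"}
	// host2.database is absent from the listing, so only the y link survives.
	fmt.Println(SenseLinks("/X65cc3c8e31817756/host1/cache/dns", txt, listing))
}
```

Each link is recorded at both of its endpoints, so a caller that walks every /dns path would still need to skip the mirrored copy.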