View a short slide deck with the key points in this article.
This article is a design document that describes a framework for building and maintaining cloud applications comprised of large numbers of interconnected services in a manner that is intuitive and understandable to users.
We propose a syntactic abstraction, called Escher circuits, for representing the state of the cloud. The abstraction enables the modular composition of large circuits from smaller components, which facilitates the manual description of cloud topologies. Further, it supports a circuit "difference" calculation that makes it easy to represent incremental system changes.
The result is a system that provides a 3-step workflow for the Operations Engineer, which is captured in the following command-line operations:
cloud sense > current_state.circuit
cloud diff desired_state.circuit current_state.circuit > diff.circuit
cloud materialize diff.circuit
All .circuit files involved in the control of the cloud are simple text files (that use Escher syntax), and as such all changes to the cloud integrate cleanly with versioning systems like Git, resulting in full historical accountability of cloud state changes.
Every well-defined system requires a clear specification of the objects at play, their possible interrelations at any moment in time, as well as the allowable operations that can be performed on its components.
The systems of interest here, which model cloud applications in the datacenter, have three types of components: hosts, services and links. We will treat these objects cleanly in a formal manner, but it should be clear that they will end up corresponding to well-known real technologies utilized in specific manners.
Our hosts will correspond to physical machines (or virtual machines, as the case might be). Our services will correspond to Docker containers, whose images are configured in a standard manner to expect a number of named incoming or outgoing TCP connections. And each of our links will correspond to a pair of lightweight DNS servers, one at each endpoint host, configured to point the respective Docker TCP connections at each other.
The exact details of the correspondence between hosts, services and links, and machines, Docker containers and DNS servers, respectively, will be fleshed out in a later section. For now, suffice it to say that this correspondence will be made simple and natural through the use of the gocircuit.org tool for dynamic cloud orchestration (although with an appropriate driver a similar result can be accomplished with any cloud provider like Google Compute Engine or Amazon EC2, for instance).
Getting back to the abstract system framework, the allowed relationships between hosts, services and links are described by a few simple postulates: every service resides on exactly one host, and every link connects a pair of service valves, which may reside on the same host or on different hosts.
Relationships between the components of a system can be represented visually using the same symbolism employed by Escher for representing nested circuits:
In the illustration above there are two hosts, named host1 and host2. Two services, named cache and server, reside on host1. One service, named database, resides on host2. Service cache is of type MemCache, service server is of type Http, and service database is of type Redis. There are two links in the system: one connecting the service-valve pair (server, x) to (cache, y), and one connecting (cache, z) to (database, w). (Disregard the labels p and q for now.)
Thus far we have addressed the properties describing the state of a system at a single moment in time. System state can also change over time, or "dynamically", according to a few additional postulates: hosts, services and links can appear and disappear over time, and they can do so independently of each other.
Some of these dynamic events (of emergence or disappearance) will be caused by external factors (for instance a host might die due to bad hardware) and others will be caused by operations that we perform with the system (for instance, we might start a service). No matter what the cause for an event is, the important thing is that these are the only changes of state that can happen to the system.
We view the cloud itself as an independent device, which computes and changes over time. Some of the changes to the cloud will be caused by external factors, for instance physical failures in the hosting hardware. Other changes will be caused by commands initiated by the user.
Since user-initiated changes and external changes are mutually asynchronous, we propose the following simple workflow from the user's point of view, or point of control, as the case might be: first, sense the current state of the cloud; then, compute the difference between the desired state and the sensed current state; finally, materialize that difference onto the cloud.
In the remainder of this document, we describe the design of a command-line tool cloud, which embodies the above three operations as:
cloud sense > current.circuit
cloud diff desired.circuit current.circuit > diff.circuit
cloud materialize diff.circuit
The descriptions below assume the Circuit as the underlying cloud management backend. It is, however, entirely possible to use other backends, such as Amazon EC2 or Google Compute Engine.
The symbolic visual representation of system state, exemplified above, can very well be used as a formal representation, much like architectural blueprints are used as formal representations of building design. However, while this visual representation is natural for people, it is not easy for machines to use.
As we explain in the section on Escher syntax, this visual representation has an equivalent syntactic (i.e. written) form, which is well-suited for machine manipulations. In particular, the syntactic representation of the diagram above would be as follows:
{
	host1 {
		cache MemCache
		server Http
		server:x = cache:y
		cache:z = :p
	}
	host2 {
		database Redis
		database:w = :q
	}
	host1:p = host2:q
}
In other words, every system state can be represented in the form of an Escher circuit. This gives us a two-fold benefit.
On the one hand, Escher circuits can be manipulated programmatically (both from Go and from Escher) simply as data structures. This allows flexible programmatic investigation of system state through familiar technologies.
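To make the first point concrete, a primal circuit could be modeled in memory along the following lines. This is a minimal, hypothetical Go sketch for illustration only; the type names are ours and do not reflect the escher package's actual API.

package cloud

// Circuit is a hypothetical in-memory form of a primal circuit:
// a set of named gates plus the links drawn between their valves.
type Circuit struct {
	Gates map[string]Gate // named occupants, e.g. "host1" or "cache"
	Links []Link          // matchings between valves, e.g. server:x = cache:y
}

// Gate is either a leaf of a given type (e.g. "MemCache") or a nested
// circuit (e.g. the contents of host1).
type Gate struct {
	Type  string   // set for leaves, e.g. "Redis"
	Inner *Circuit // set for nested circuits
}

// Valve names one endpoint of a link relative to the enclosing circuit.
// An empty Gate denotes a valve of the enclosing circuit itself, such as :p.
type Valve struct {
	Gate, Name string
}

// Link matches two valves, e.g. cache:z = :p.
type Link struct {
	From, To Valve
}

Walking such a structure, serializing it to Escher syntax, or rewriting parts of it are then ordinary data-structure manipulations.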
On the other hand, Escher's programming and materialization mechanism allows such circuits to be built up modularly from smaller component circuits. In other words, large datacenter topologies can be composed out of smaller standard components, whereby even the component circuits can span multiple machines and themselves be non-trivial subsystems.
For instance, our example system state could be generated out of smaller components in the following manner. Let the following circuit be an index (i.e. a library) consisting of two circuit designs:
Index {
	HttpHost {
		cache MemCache
		server Http
		server:x = cache:y
		cache:z = :p
	}
	DbHost {
		database Redis
		database:w = :q
	}
}
Then, if we materialize the program

{
	host1 HttpHost
	host2 DbHost
	host1:p = host2:q
}

relative to Index, the resulting residue will be precisely the system state circuit that we started with, i.e. the one illustrated in the drawing above.
We call the circuit representation of system state, described thus far, a “primal” representation or simply a primal. Every primal has an equivalent “dual” representation. Transitioning from primal to dual and vice-versa is a matter of a simple transformation, as we shall see.
The dual representation of system state is useful to us, as it is more convenient to carry out certain manipulations within this representation. In particular, it will be easier to compute the difference between two states in the dual, and it will also be easier to "materialize" a dual system state description into an actual running datacenter topology.
The dual representation of a system state primal consists of two lists: a list of services and a list of links.
The list of services simply enumerates all services found in the primal, each specified by its “full path” in the primal, accompanied by its type. For our running example, the list of services would be
(host1.cache, MemCache)
(host1.server, Http)
(host2.database, Redis)
The list of links enumerates all service-to-service links present in the primal representation as pairs of endpoints, wherein each endpoint (a service-valve pair) is also specified by its “full path” in the primal. In our example, that list would be:
(host1.server:x, host1.cache:y)
(host1.cache:z, host2.database:w)
It is not hard to see how the primal can be derived from the dual by reversing this process.
Furthermore, it is self-evident that one can compute the “difference” between two systems, when this makes sense, by simply computing the difference of their corresponding dual representations elementwise.
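To illustrate, the dual form and its elementwise difference can be captured in a couple of small Go structures, sketched below. The types are hypothetical and simply mirror the two lists above; a link is stored as a pair of endpoints in a fixed canonical order, and the difference is read as "what to add" and "what to remove".

package cloud

// Dual is the dual form of a system state: a list of services and a list of links.
type Dual struct {
	Services map[string]string  // full path -> type, e.g. "host1.cache" -> "MemCache"
	Links    map[[2]string]bool // endpoint pairs, e.g. {"host1.cache:z", "host2.database:w"}
}

// Diff returns what must be added to, and removed from, the current state
// in order to reach the desired state, computed elementwise on the dual form.
func Diff(desired, current Dual) (add, remove Dual) {
	add = Dual{Services: map[string]string{}, Links: map[[2]string]bool{}}
	remove = Dual{Services: map[string]string{}, Links: map[[2]string]bool{}}
	for path, typ := range desired.Services {
		if current.Services[path] != typ {
			add.Services[path] = typ
		}
	}
	for path, typ := range current.Services {
		if desired.Services[path] != typ {
			remove.Services[path] = typ
		}
	}
	for link := range desired.Links {
		if !current.Links[link] {
			add.Links[link] = true
		}
	}
	for link := range current.Links {
		if !desired.Links[link] {
			remove.Links[link] = true
		}
	}
	return add, remove
}

In this reading, cloud diff reduces to computing Diff between the desired dual and the sensed current dual.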
Sensing and materializing are the two operations that convert between the abstract circuit representation of a cloud topology and the actual physical topology that executes on the cloud.
Sensing is the operation of “reading” the current state of the cloud and representing it in the primal form for the engineer to work with.
Materializing is the operation of “writing” (or “executing”) a cloud topology in primal form into an actual physical network of services running in the cloud.
In the following sections we describe how sensing and materializing to and from the dual form work. The conversion between dual and primal, a mere data-structure transformation, was explained in the previous section.
The specific API for manipulating the cloud can be any of Google Compute Engine, Amazon EC2, the Circuit, and so forth. The explanations that follow are based on the Circuit, as its simple API provides exactly the minimum necessary for such manipulations.
We have chosen to use executable Docker containers as the embodiment of services.
Each service communicates with the outside—with other services—through a set of zero or more named valves. A valve corresponds to a TCP client connection, a TCP server connection or both.
Service container images must be prepared in a standardized manner so that after the execution of a container, our framework can (i) obtain the TCP server address corresponding to each valve (if there is one), as well as (ii) supply the remote TCP server address if the valve also corresponds to a TCP client connection.
There are various ways to prepare Docker containers to accomplish this and we do not mandate a specific one. Here, we merely suggest one way of doing it without going into unnecessary technical detail.
To accomplish (i), one can utilize the Docker port mapping mechanism. In particular, the software inside the container can be hardwired to listen on specific port numbers which, in lexicographic order, correspond to the valve names of the service. Once the container is executed, the effective TCP server addresses—those visible to other containers in the cloud network—can be automatically obtained using the docker port command. They will be utilized by our system to "link" containers (i.e. service valves) in a manner described later.
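For illustration, the framework-side query in (i) could shell out to the docker port command as in the sketch below. The internal port number is a placeholder for whichever hardwired port corresponds to the valve in question, per the lexicographic-order convention; it is an assumption of the example, not something the framework mandates.

package cloud

import (
	"fmt"
	"os/exec"
	"strings"
)

// valveServerAddr returns the host-visible TCP address that Docker has mapped
// to the container's internal port for a given valve. The caller is expected
// to know which internal port the image hardwires for that valve.
func valveServerAddr(containerID string, internalPort int) (string, error) {
	out, err := exec.Command("docker", "port", containerID, fmt.Sprint(internalPort)).Output()
	if err != nil {
		return "", err
	}
	// docker port prints one mapping per line, e.g. "0.0.0.0:49153"; take the first.
	lines := strings.Split(strings.TrimSpace(string(out)), "\n")
	return lines[0], nil
}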
To accomplish (ii), we propose configuring each Docker service container to use a DNS server whose address is passed to it upon execution, using any one of the various mechanisms that Docker itself provides for passing arguments to containers. Subsequently, the software executing inside the Docker container should simply be hardwired to obtain the IP address for any given valve name by looking up that valve name (perhaps prefixed by a standard domain name) through the DNS system. Our framework, described later, which executes the Docker containers, will arrange for a light-weight dedicated DNS server for each container, whose sole job is to resolve these queries appropriately.
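On the container side, (ii) reduces to an ordinary DNS lookup of the valve name against the container's configured resolver. The sketch below assumes, purely for illustration, a valve named z and a peer listening on a fixed port 8000; how the peer's port is actually conveyed (a fixed convention, an SRV record, and so on) is left open by this design.

package main

import (
	"fmt"
	"net"
)

// dialValve connects the client side of a valve to whatever address the
// framework has published for that valve name in the container's dedicated
// DNS server.
func dialValve(valve, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(valve) // resolved by the per-container DNS server
	if err != nil || len(addrs) == 0 {
		return nil, fmt.Errorf("cannot resolve valve %q: %v", valve, err)
	}
	return net.Dial("tcp", net.JoinHostPort(addrs[0], port))
}

func main() {
	conn, err := dialValve("z", "8000") // placeholder valve name and port
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}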
Let us consider the task of materializing the system from our running example which, as we showed above, has the following dual form. The list of services is:

(host1.cache, MemCache)
(host1.server, Http)
(host2.database, Redis)

And the list of links is:

(host1.server:x, host1.cache:y)
(host1.cache:z, host2.database:w)
Materialization proceeds like so:
The Circuit API presents all of its resources uniformly as a file system, where root-level directories correspond to available hosts. Unused hosts are precisely those root-level directories that have no children (i.e. no services or other Circuit elements are running on them). Such a list can be obtained through the API or through the command line using circuit ls /... . Let us assume, for instance, that the list of available and unused hosts is

/X65cc3c8e31817756
/Xe4abe0c286b0e0bc
/X9738a5e29e51338e

First, we pick one unused host for each host appearing in the dual form and fix the correspondence, say:

(/X65cc3c8e31817756, host1)
(/Xe4abe0c286b0e0bc, host2)
Next, for each service in the list of services, for instance

(host1.cache, MemCache)

we create a light-weight dedicated DNS server element at the Circuit path

/X65cc3c8e31817756/host1/cache/dns

This is accomplished using the Circuit's circuit mkdns command. The details of this are omitted for brevity. Initially the DNS server will have no resource records, i.e. it will not resolve any lookups. Appropriate records will be added to it later, when we materialize the list of links from the dual form.
We then start the service's Docker container as a Circuit element at the path

/X65cc3c8e31817756/host1/cache/service

This is accomplished using the Circuit's circuit mkdkr command; recall that the service type, MemCache in this case, is used as the name of the Docker image. Furthermore, the IP address of the DNS server created in the previous step is passed to the Docker container on execution.
Finally, for each link in the list of links, for instance

(host1.cache:z, host2.database:w)

we first obtain the TCP server address behind the endpoint host1.cache:z, if one is available. To do so, we access the Docker container

/X65cc3c8e31817756/host1/cache/service

and we query the TCP server address for the valve named z, using the Docker port-exporting provisions set in place as described earlier. We then access the DNS server element of the opposing endpoint,

/Xe4abe0c286b0e0bc/host2/database/dns

and set the resource record for the domain name w to the TCP server address obtained in the previous step. In addition to setting a DNS A record for the name w, we also set a DNS TXT record for the same name with the value host1.cache:z. This TXT record will later facilitate recovering the dual form for this link directly from the DNS server itself. We then repeat the same procedure with the roles of host1.cache:z and host2.database:w reversed.
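Putting these steps together, one possible shape of the materialization pass over a dual form is sketched below in Go. It reuses the Dual type from the earlier sketch, and every helper it calls (chooseUnusedHost, mkdns, mkdkr, serverAddrOfValve, setLinkRecords) is a hypothetical stand-in for the Circuit and Docker operations just described, declared only so that the sketch is self-contained.

package cloud

import "strings"

// Hypothetical hooks onto the underlying Circuit and Docker operations.
var (
	chooseUnusedHost  func() (root string, err error)                 // pick an unused root, e.g. "/X65cc3c8e31817756"
	mkdns             func(anchor string) (dnsAddr string, err error) // circuit mkdns
	mkdkr             func(anchor, image, dnsAddr string) error       // circuit mkdkr, image = service type
	serverAddrOfValve func(endpoint string) (addr string, err error)  // docker port on the endpoint's container
	setLinkRecords    func(endpoint, addr, txt string) error          // A and TXT records on the endpoint's DNS element
)

// Materialize creates the services and links described by the dual form d.
func Materialize(d Dual) error {
	// 1. Fix a correspondence between logical hosts and unused Circuit hosts.
	roots := map[string]string{} // e.g. "host1" -> "/X65cc3c8e31817756"
	for path := range d.Services {
		host := strings.SplitN(path, ".", 2)[0]
		if _, ok := roots[host]; !ok {
			root, err := chooseUnusedHost()
			if err != nil {
				return err
			}
			roots[host] = root
		}
	}
	// 2. For every service, create its DNS element and start its Docker container.
	for path, typ := range d.Services {
		parts := strings.SplitN(path, ".", 2) // "host1.cache" -> ["host1", "cache"]
		anchor := roots[parts[0]] + "/" + parts[0] + "/" + parts[1]
		dnsAddr, err := mkdns(anchor + "/dns")
		if err != nil {
			return err
		}
		if err := mkdkr(anchor+"/service", typ, dnsAddr); err != nil {
			return err
		}
	}
	// 3. For every link, point each endpoint's DNS records at the other endpoint.
	for link := range d.Links {
		for i := 0; i < 2; i++ {
			src, dst := link[i], link[1-i]
			addr, err := serverAddrOfValve(src) // TCP server address behind src, if any
			if err != nil {
				return err
			}
			// A record: dst's valve name resolves to addr.
			// TXT record: remembers src, so the link can be sensed back later.
			if err := setLinkRecords(dst, addr, src); err != nil {
				return err
			}
		}
	}
	return nil
}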
Reading the current state of the cloud is fairly straightforward.
After listing the contents of the Circuit, using circuit ls /..., there will only be paths ending in /service and paths ending in /dns. We are going to read the list of services from the former and the list of links from the latter.
To read the list of services, we consider each path ending in /service. For instance, the path

/X65cc3c8e31817756/host1/cache/service

will correspond to a service named host1.cache (simply drop the first and last path elements and replace slashes with dots). Then we query the configuration of the underlying Docker container, using the circuit peek command. This gives us the Docker image name of the container, which is the service type, and thus the service entry has been recovered.
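The path-to-name conversion itself is plain string manipulation, as the following small Go sketch illustrates:

package cloud

import (
	"fmt"
	"strings"
)

// serviceName converts a Circuit path ending in /service into the dotted
// service name used in the dual form, e.g.
// "/X65cc3c8e31817756/host1/cache/service" -> "host1.cache".
func serviceName(path string) (string, error) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) < 3 || parts[len(parts)-1] != "service" {
		return "", fmt.Errorf("not a service path: %q", path)
	}
	// Drop the first (host root) and last ("service") elements; join the rest with dots.
	return strings.Join(parts[1:len(parts)-1], "."), nil
}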
To read the list of links, we consider in turn each path ending in /dns, unless it has already been considered. For instance, the path

/X65cc3c8e31817756/host1/cache/dns

will be a link endpoint with the prefix host1.cache:, as follows from the manner in which we materialized links in the previous section. We then list the DNS resource records at this path, using circuit peek, and in the case of this example we will see resource records for the domain names y and z. In other words, the names correspond to valve names of the service, and so each name gives us one endpoint of a link. In this case:

(host1.cache:y, …)
(host1.cache:z, …)
To recover the other endpoint in each of the links, it suffices to look at the DNS TXT record accompanying each of the names y and z. These TXT records contain, as per the materialization process, the other endpoint of the respective link, thus allowing us to recover the whole links:

(host1.cache:y, host1.server:x)
(host1.cache:z, host2.database:w)
Before we add these links to the list of links, we also verify that the opposing service is still alive; otherwise, by convention, we treat the link as not present. For instance, if we want to verify that the endpoint host2.database is alive, we simply consider the Circuit path list, obtained with circuit ls /..., and look for the glob pattern /*/host2/database/service.
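That check is a simple matter of matching a glob pattern against the sensed path list; a Go sketch, assuming the paths returned by circuit ls /... are already collected into a slice:

package cloud

import (
	"path"
	"strings"
)

// endpointAlive reports whether the service behind an endpoint such as
// "host2.database:w" is still present in the Circuit path listing, i.e.
// whether some path matches the glob /*/host2/database/service.
func endpointAlive(endpoint string, circuitPaths []string) bool {
	svc := strings.SplitN(endpoint, ":", 2)[0] // "host2.database"
	pattern := "/*/" + strings.ReplaceAll(svc, ".", "/") + "/service"
	for _, p := range circuitPaths {
		if ok, _ := path.Match(pattern, p); ok {
			return true
		}
	}
	return false
}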