Fundamental Components in a Distributed System

In the last several weeks I have had a surprising number of conversations about the fundamental building blocks of a large web-based system. I thought I’d write up the main bits of a good way to do it. This is far from the only way, but most reasonably large systems will wind up with most of this stuff. We’ll start at the base and work our way up.

Operational Platform

At the very base of the system you need networking gear, servers, and the means to put operating systems onto those servers, bring them up to a baseline configuration, and monitor their operational status (disk, memory, CPU, etc.). There are lots of good tools here. Getting the initial bits onto disk will usually be determined by the operating system you are using, but after that Chef or Puppet should become your friend. You'll use these to know what is out there and to bring servers up to a baseline. I personally believe that Chef or Puppet should handle accounts, DNS, and the other stable things common to a small number of classes of server (app server, database server, etc.).
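
As a very rough illustration of what I mean by a baseline (this is not Chef or Puppet, just a hypothetical Python check of the kind of stable, per-class facts those tools converge a server toward):

    # Hypothetical baseline definitions for a couple of server classes.
    import pwd
    import shutil

    BASELINES = {
        "app-server":      {"accounts": ["deploy"],   "binaries": ["java"]},
        "database-server": {"accounts": ["postgres"], "binaries": ["postgres"]},
    }

    def missing_from_baseline(server_class):
        """Report which parts of the baseline this host is missing."""
        baseline = BASELINES[server_class]
        missing = []
        for account in baseline["accounts"]:
            try:
                pwd.getpwnam(account)          # does the account exist locally?
            except KeyError:
                missing.append("account:" + account)
        for binary in baseline["binaries"]:
            if shutil.which(binary) is None:   # is the expected binary on PATH?
                missing.append("binary:" + binary)
        return missing

    if __name__ == "__main__":
        print(missing_from_baseline("app-server"))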

The operational platform provides the raw material on which the system will run, and the tools here are chosen to manage that raw material. This is different from application management.

Deployment

The first part of application management is a means of getting application components onto servers and controlling them. I generally prefer deploying complete, singular packages which bundle up all their volatile dependencies. Tools like Galaxy and Cast do this nicely. Think hard about how development, debugging, and general work with these things will go, as being pleasant to work with during dev, test, and downtime will trump idealism in production.
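
To make the "complete, singular package" idea concrete, here is a minimal sketch of the pattern with hypothetical paths: unpack a tarball that bundles the app and its volatile dependencies into a versioned directory, then flip a "current" symlink. Galaxy and Cast do a far more complete job of this, including controlling the running processes.

    import os
    import tarfile

    DEPLOY_ROOT = "/opt/deploys"   # hypothetical install root

    def deploy(package_path, version):
        target = os.path.join(DEPLOY_ROOT, version)
        with tarfile.open(package_path) as tar:
            tar.extractall(target)        # app code plus its bundled dependencies
        current = os.path.join(DEPLOY_ROOT, "current")
        tmp_link = current + ".tmp"
        os.symlink(target, tmp_link)
        os.replace(tmp_link, current)     # atomically repoint "current" at the new version
        return current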

Configuration

Your configuration system is going to be intimately tied to your deployment system, so think about them together. Aside from separating the types of configuration you want, there are a lot of tradeoffs. In general, I like immutable configuration obtained at startup or deployment time: a new set of configs means a restart. In this case, you can either have the deployment system provide it to the application, or have the application fetch it itself. Some folks really like dynamic configuration; in that case Zookeeper is going to be your friend. Most things don’t reload config well without a restart though, and I like having a local copy of the config, so… YMMV.
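
A minimal sketch of the immutable-at-startup style, assuming the deployment system has dropped a config file next to the application (the file name and keys here are made up):

    import json
    from dataclasses import dataclass

    @dataclass(frozen=True)           # frozen: nothing mutates config after startup
    class Config:
        db_url: str
        listen_port: int

    def load_config(path="config.json"):
        with open(path) as f:
            raw = json.load(f)
        return Config(db_url=raw["db_url"], listen_port=raw["listen_port"])

    # Read once at startup; changing the file does nothing until the next restart.
    # CONFIG = load_config()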

Application Monitoring

Application-level monitoring and operational-level monitoring are very similar, and can frequently be combined in one tool, but they are conceptually quite different. For one thing, operational monitoring is usually available out of the box, from good old Nagios to newer tools like 'noit. At the application level you will generally want to track the same kinds of things, but how you get them, and what they mean, will vary by application. Monitoring is a huge topic, go google it :-)
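
To make the distinction a little more concrete, here is a sketch of the application side: the app keeps its own counters and exposes them over HTTP for whatever monitoring system you run to collect. The metric names and port are made up.

    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    METRICS = {"requests_handled": 0, "errors": 0}

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Serve the current counters as JSON for an external collector to scrape.
            body = json.dumps(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    def start_metrics_server(port=8125):
        server = HTTPServer(("", port), MetricsHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server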

Discovery

Assuming you have somewhere to deploy, and the ability to deploy, your services need to be able to find each other. I prefer dynamic, logical service discovery, where services register their availability and connection information (host and port, base URL, etc.) and then everything finds everything else via the discovery service. A lot of folks use Zookeeper for this nowadays, and most everyone I know who has used it loves it. One of the best architecty type guys I know would probably have its baby if he could, based on how effusive he is about it. That said, you can do lots of different things.
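
A sketch of what registration and lookup can look like against Zookeeper, here using the kazoo Python client; the /services layout, the payload fields, and the hostnames are assumptions, not anything standard:

    import json
    from kazoo.client import KazooClient

    def register(zk, service_name, host, port):
        path = "/services/%s/%s:%d" % (service_name, host, port)
        payload = json.dumps({"host": host, "port": port}).encode()
        # Ephemeral: the entry disappears on its own if the service's session dies.
        zk.create(path, payload, ephemeral=True, makepath=True)

    def discover(zk, service_name):
        base = "/services/" + service_name
        instances = []
        for child in zk.get_children(base):
            data, _stat = zk.get(base + "/" + child)
            instances.append(json.loads(data.decode()))
        return instances

    zk = KazooClient(hosts="zookeeper.example.com:2181")
    zk.start()
    register(zk, "user-service", "10.0.0.5", 8080)
    print(discover(zk, "user-service"))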

I have heard discussion about using the reporting capabilities of a tool like Galaxy, or the CMDB capabilities of Chef, to accomplish this, but I think these are ill-suited. Firstly, they operate either on concrete deployment units or on specific low-level roles, rather than on logical services. Secondly, they are quite outside the lifecycle control of the service itself.

In an ideal world the location of your discovery system is the only well-known address in the whole system. Some things don’t participate well in discovery out of the box, namely fully formed components such as databases, caches, and so on. How you integrate these will vary, but two good techniques I have seen are companion processes which interact with the discovery service on the component’s behalf, and static entries in the discovery service. In the case of a companion process, the companion generally does a very basic health check (is the server running?) and provides a local view of whatever is needed from the service. In the case of static entries, the entry may be placed and removed by the startup script, or via some alternate channel (by hand, etc.).
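
A sketch of the companion-process technique for something like a database, again against Zookeeper via kazoo; the health check, paths, and timing are assumptions:

    import socket
    import time
    from kazoo.client import KazooClient

    DB_HOST, DB_PORT = "127.0.0.1", 5432
    ENTRY = "/services/postgres/%s:%d" % (DB_HOST, DB_PORT)

    def db_is_up():
        """Very basic health check: is the server accepting connections?"""
        try:
            socket.create_connection((DB_HOST, DB_PORT), timeout=2).close()
            return True
        except OSError:
            return False

    zk = KazooClient(hosts="zookeeper.example.com:2181")
    zk.start()
    while True:
        # Keep an ephemeral discovery entry present only while the database answers.
        if db_is_up():
            if not zk.exists(ENTRY):
                zk.create(ENTRY, b"", ephemeral=True, makepath=True)
        elif zk.exists(ENTRY):
            zk.delete(ENTRY)
        time.sleep(10)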