Rethinking Some Galaxy Core Assumptions

Galaxy has been very successful at multiple companies, but I think it can actually be simplified and made more powerful. I have two changes in mind: the first touches on the heart of Galaxy, the second on the often-argued configuration aspect.

RC Scripts

I described the heart of Galaxy as a tarball with an rc script. I think it likely that the rc script should give way to a simple run script. The difference is that the run script doesn’t daemonize – it simply keeps running. An implementation will probably need to have a controller process which does daemonize (or defer to upstart or its ilk for process management).

While writing an rc script is far from rocket surgery, it turns out that the nuances are annoying to implement again, and again, and again. The main nuance is daemonizing things correctly. I’d prefer to provide that for software rather than force applications to get it right. Many app servers handle daemonization well, but all of them (that I know of, anyway) also provide a mechanism for running in the foreground.

Unfortunately, a run script model makes a no-install implementation much trickier. The lore on daemonizing from bash is tricky, and even assuming bash is available is tricky. Using something like daemonize is nice, but then it requires an installation. Grrr. This is an implementation problem though, and requiring some kind of installation on the appserver side may be worth it for simplifying the model.
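To make the controller idea concrete, here is a minimal sketch, assuming a hypothetical bundle layout with the run script at bin/run. It keeps the child process in the foreground and restarts it when it exits; daemonizing the controller itself is left to upstart or its ilk, and real concerns like stop, status, and restart backoff are elided:

import java.io.File;

public class RunController {
    public static void main(String[] args) throws Exception {
        File deployDir = new File(args[0]);
        File runScript = new File(deployDir, "bin/run");
        while (true) {
            // start the run script in the foreground, sharing our stdio
            Process child = new ProcessBuilder(runScript.getAbsolutePath())
                    .directory(deployDir)
                    .inheritIO()
                    .start();
            int status = child.waitFor();
            System.err.println("run script exited with " + status + ", restarting");
        }
    }
}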

Configuration

In a moment of blinding DUH, I came back to environment variables for environmental information. I mean, it works for everyone on Heroku or Cloud Foundry.

There has been a trend in Galaxy implementations (and elsewhere) toward purely file based configuration. This is great for application configuration, but is meh for environmental configuration. It has led to most Galaxy implementations supporting some model of URL-to-path mapping for placing configuration files into deployed applications. These mechanisms are a great way to provide escape hatch/override configuration, but they play against the goal of making deployments self contained, which I like. This still punts on going all the way and putting environment information into the deploy bundle, which Dan likes to advocate, but I am not sold on that myself :-)

Regardless, a general purpose implementation probably needs to support both env var and file based configuration, but you can certainly recommend one way of making use of it.
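As a sketch of what supporting both might look like from the application side, here environment variables carry the environmental defaults and a local properties file acts as the override escape hatch. The variable name, property key, and file path are all made up for illustration:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class Config {
    // env var is the default source of environmental info,
    // a file dropped next to the app is the override escape hatch
    public static String databaseUrl() throws IOException {
        Properties overrides = new Properties();
        File file = new File("etc/overrides.properties");
        if (file.exists()) {
            try (FileInputStream in = new FileInputStream(file)) {
                overrides.load(in);
            }
        }
        return overrides.getProperty("db.url", System.getenv("DB_URL"));
    }
}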


What is Galaxy?

At Ning, in 2006, Martin and I wrote a deployment tool called Galaxy. Since that time I know of at least three complete reimplementations, two major forks, and half a dozen more partial reimplementations. In a bizarre twist of fate, I learned yesterday from Adrian that my friend James also has a clean room implementation, using Fabric, called Process Manager. Holy shit.

Beyond reimplementations and forks from ex-Ninglets who are using a Galaxy derivative, I frequently hear from ex-Ninglets who are not and wish they could. We clearly got something right, it seems. Fascinatingly, folks all seem to focus on different aspects of Galaxy in terms of what they love about it. They also tend to have a common set of complaints about how Ning’s version worked, and have adapted theirs to accommodate them.

To me, the heart of Galaxy is the concept of the galaxy bundle, a tarball with the application and its dependencies coupled with an RC script at a known location inside the bundle. Given such a bundle, a Galaxy implementation is then the tooling for deploying and managing those bundles across a set of servers. From personal and secondhand experience, this simple setup can keep things happy well into the thousands of servers.
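For concreteness, a hypothetical bundle might look something like this (the names are made up, and where the RC script lived varied by implementation; what matters is that the location is agreed upon):

my-service.tgz
    bin/rc      the RC script, at the known location
    lib/        the application and all of its dependencies
    ...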

To many others, the heart of Galaxy seems to be the tooling itself, and the fairly nice way of managing applications separately from servers. At least one major user even ignores the idea of putting the applications and their dependencies in the bundle, and uses Galaxy to install RPMs! (I personally think this approach is not so great, but the person doing it is one of the best engineers I know, so I am happy to believe I may be wrong.)

Different folks have also drawn the line of what the Galaxy implementation should manage in quite different places. In the original implementation, Galaxy included bundle and configuration repositories, along with how those repos were structured, an agent to control the application on the server, a console to keep track of it all, and a command line tool to query and take actions on the system. On the other hand, the Proofpoint/Airlift implementation weakens the contracts on configuration (in a good way), requires a Maven repository for bundles, supports an arbitrary number of applications per host, and has Galaxy handle server provisioning as well as application deployment. The Ness (and, I believe, Metamarkets) implementations change the configuration contract significantly, also support several applications per host, and include much more local server state in what Galaxy itself manages.

The other (generally minor) implementations and experiments have taken it in quite a few different directions, ranging from Pierre’s reimplementation using Erlang and CouchDB, to my reimplementation with no agents or console.

There seems to be an awful lot of experimentation around the concepts in Galaxy, which is awesome! Unfortunately, only the original implementation is very well documented at this point, so it is tough to use Galaxy unless you have used it before (hence my shock at James even knowing about it). I guess it’s time to start documenting and try to save other folks some work!


Learning to Code

Rewinding my career a ways, I want to weigh in on the Learn to Code debate as a non-programmer who coded. I am a professional programmer now, but my previous career was teaching English in high school (you are not allowed to take that as license to mock my grammar).

Programming and the Profession of Programming are quite different things. Programming is being able to efficiently tell a computer exactly what to do in a repeatable manner. The profession of programming is being able to efficiently convert business requirements into bugs.

As an English teacher I programmed regularly in order to make my life easier. I generated vocabulary quizzes (and grading sheets for them); I created interactive epic poetry (I shit you not) with my classes (those students really grokked epic poetry thereafter); I wrote hundreds of small scripts to calculate various things (many of which could have been done in Excel, but I knew Perl, not Excel); I turned at least one student onto cypherpunks during a study hall; and I built various one-off web applications for teachers, classes, groups, etc. I calculated lots of statistics on student performance, tests, and so on so I could better understand and calibrate things (teachers may not always grade on a curve explicitly, but new teachers always do, at least hand-wavily, as they don’t yet have their tests and teaching well calibrated).

Programming is a tool that let me be more efficient, automate boring things, and sometimes open up options which would otherwise have been unavailable. I later left teaching, went into technical writing, and then (back) into the profession of programming full time. As Zed Shaw put it well, “Programming as a profession is only moderately interesting. … You’re much better off using code as your secret weapon in another profession.” I happen to love programming for itself, so programming as a profession works well for me. Code is ephemeral though, and most folks don’t like to “see your works become replaced by superior ones in a year. unable to run at all in a few more” as _why described. Programming is an exceptionally powerful tool for accomplishing other things.


Java URL Handlers

There are two ways to register your own URL scheme handler in Java, and they both kind of suck. The first is to set a system property to a list of packages and then name subpackages and classes therein just right; the second is to register a handler factory. The handler factory approach seems great, except that you can register one, once, ever – and, oh yeah, Tomcat registers one. Given the complete brokenness of the factory registration, the sane way is to align packages and class names perfectly. This is annoying to do, so like all good programmers, I wrote a library to put a nice facade on the process: URL Scheme Registry.
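For reference, the package/class naming mechanism the library wraps looks like this when done by hand: the JVM looks for a class named <package>.<scheme>.Handler under each package listed in the java.protocol.handler.pkgs system property (a |-separated list of packages). The package name here is made up:

// compile as com/example/protocols/dinner/Handler.java and run with
// -Djava.protocol.handler.pkgs=com.example.protocols
package com.example.protocols.dinner;

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// must be named exactly Handler, in a subpackage named after the scheme
public class Handler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL url) throws IOException {
        throw new IOException("nothing to serve yet");
    }
}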

Using it is about as simple as I could figure out – because URL handlers need to be global (given how URL works) there is a single static method. To register the dinner scheme handler you just do:

UrlSchemeRegistry.register("dinner", DinnerHandler.class);

Et voilà, you can now use dinner://okonomiyaki and an instance of DinnerHandler, which must extend URLStreamHandler, will be used when you retrieve the resource.

Note that you can only register a given scheme once, and your handler must have a no-arg constructor.
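Here is a sketch of what such a handler might look like; the okonomiyaki response is, obviously, made up:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class DinnerHandler extends URLStreamHandler {
    // the implicit no-arg constructor satisfies the registry's requirement

    @Override
    protected URLConnection openConnection(final URL url) throws IOException {
        return new URLConnection(url) {
            @Override
            public void connect() {
                // nothing to set up
            }

            @Override
            public InputStream getInputStream() {
                // for dinner://okonomiyaki the host part is "okonomiyaki"
                String dish = url.getHost();
                return new ByteArrayInputStream(("one " + dish + ", coming right up\n").getBytes());
            }
        };
    }
}

Once registered, new URL("dinner://okonomiyaki").openStream() hands back that stream.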

Under the covers the library adds a particular package to the needed system property for the package/class lookup method and creates a subclass of the class you pass it, using the super convenient CGLib. This way it can leave the handler factory setting alone (so you can use it in Tomcat or others that register the URL handler factory), and you don’t need to manually set system properties.

You can fetch it via Maven (search for the current version) in two forms. The first has a normal dependency on cglib:

<dependency>
    <groupId>org.skife.url</groupId>
    <artifactId>url-scheme-registry</artifactId>
    <version>0.0.1</version>
</dependency>

The second puts cglib into a new namespace and bundles it, in case there are still colliding cglib versions in the same namespace out there:

<dependency>
    <groupId>org.skife.url</groupId>
    <artifactId>url-scheme-registry</artifactId>
    <version>0.0.1</version>
    <classifier>nodep</classifier>
</dependency>

Have fun!


Hello Pummel

I’ve been doing some capacity modeling at $work recently and found myself needing to do the “find the concurrency limit at which the 99th percentile stays below 100 milliseconds” type of thing. So I wrote a tool, pummel, to do it for me.

$ pummel limit --labels ./urls.txt 
clients	tp99.0	mean	reqs/sec
1	2.00	1.03	967.59
2	2.00	1.04	1932.37
4	3.00	1.43	2799.16
8	16.00	3.64	2199.13
16	130.00	7.81	2049.02
8	17.00	3.54	2262.44
12	73.00	5.62	2135.61
16	129.00	7.58	2110.93
12	71.00	5.57	2155.99
14	117.97	6.58	2127.79
12	71.00	5.57	2155.99
$ 

By default it looks for that “tp99 < 100ms” threshold, which in this case it found at 12 requests in flight at the same time.
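Judging from that output, the limit search doubles the client count until the threshold is crossed, then bisects between the last passing and first failing counts. A sketch of that strategy – not pummel’s actual code, and tp99() here stands in for actually driving the load:

public class LimitSearch {
    // stand-in for running the load at a given client count
    interface Bench {
        double tp99(int clients);
    }

    static int findLimit(Bench bench, double thresholdMillis) {
        if (bench.tp99(1) > thresholdMillis) {
            return 0; // even a single client blows the threshold
        }
        int good = 1;  // highest count known to pass
        int bad = -1;  // lowest count known to fail, once found
        // exponential probe: 2, 4, 8, 16, ... until the threshold is crossed
        for (int n = 2; bad < 0; n *= 2) {
            if (bench.tp99(n) <= thresholdMillis) {
                good = n;
            } else {
                bad = n;
            }
        }
        // bisect between the last passing and first failing counts
        while (bad - good > 1) {
            int mid = (good + bad) / 2;
            if (bench.tp99(mid) <= thresholdMillis) {
                good = mid;
            } else {
                bad = mid;
            }
        }
        return good;
    }
}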

Also really useful is the step command, which just increases concurrent load and reports on times (same columns as above: clients, tp99, mean, reqs/sec):

$ pummel step --limit 20 --max 5000 ./urls.txt 
1	2.00	1.05	956.21
2	3.00	1.08	1854.26
3	4.00	1.24	2415.46
4	3.00	1.41	2836.48
5	6.00	2.05	2444.27
6	9.00	2.54	2358.31
7	11.00	2.92	2398.74
8	16.00	3.38	2364.49
9	23.99	3.99	2257.22
10	35.99	4.43	2258.05
11	54.99	5.10	2157.37
12	72.98	5.94	2020.61
13	87.97	5.94	2187.89
14	125.99	6.40	2187.64
15	125.00	6.85	2188.50
16	130.00	7.39	2163.68
17	134.00	7.86	2163.13
18	143.98	8.35	2156.93
19	156.00	8.92	2129.52
$ 

Assuming you put this output in data.csv, it also plots very nicely with gnuplot (the columns are tab separated, which gnuplot handles by default):

set terminal png size 640,480
set xlabel 'concurrency'
set ylabel 'millis'

set output 'tp99.png'
plot 'data.csv' using 1:2 with lines title 'tp99 response time'

set output 'mean.png'
plot 'data.csv' using 1:3 with lines title 'mean response time'

set output 'requests_per_second.png'
set ylabel 'requests/second'
plot 'data.csv' using 1:4 with lines title 'requests/second'

Nothing super fancy, but it is kind of fun :-)