My Favorite Interview Question

As Ning is ramping up recruiting again, I need to dust off my interview questions. Sadly, one of my favorites is no longer so useful as originally designed, due to technical advances in hard drives. I figured I’d share it and discuss how I use it. Hopefully folks can give me some feedback on how to better find out what I am looking for.

The question goes like this:

Given a hard drive with one terabyte of data, arranged in 2^32 key/value pairs, where the keys and values each have lengths of 128 bytes, you need to design, build, and deploy (by yourself) a system that lets you look up the value for a given key, over the internet, at a peak rate of 5000 lookups per second. The data never changes. Let’s design that system.
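The numbers in the question are chosen to line up: 2^32 pairs of 128-byte keys and 128-byte values come to exactly 2^40 bytes. A quick back-of-envelope check, reading "one terabyte" as the binary 2^40 bytes (that reading is an assumption, and the variable names are just for illustration):

```python
# Back-of-envelope sizing for the data set described in the question.
PAIRS = 2**32        # number of key/value pairs
KEY_BYTES = 128      # length of each key
VALUE_BYTES = 128    # length of each value

total_bytes = PAIRS * (KEY_BYTES + VALUE_BYTES)
keys_only_bytes = PAIRS * KEY_BYTES

print(total_bytes == 2**40)                   # True: exactly 1 TiB of raw data
print(keys_only_bytes // 2**30, "GiB of keys")  # 512 GiB of keys
```

Half of the raw data is keys, so any index that stores full keys is itself far larger than the RAM of a typical single machine from the era the question was written for.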

The question was designed (4 or 5 years ago now) to just barely require building a distributed system. With the widespread understanding and availability of solid state drives, it is fairly trivial to do in a single box now.
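To see why the question used to require several machines, here is a rough sketch of the random-read arithmetic. The per-drive figures are assumptions for illustration only (roughly 150 random reads per second for a spinning disk, tens of thousands for a solid state drive), as is the simplification that each lookup costs a fixed number of random reads:

```python
import math

def drives_needed(peak_lookups_per_s, reads_per_lookup, iops_per_drive):
    """Minimum number of drives needed to absorb the random-read load."""
    return math.ceil(peak_lookups_per_s * reads_per_lookup / iops_per_drive)

# Assumed figures, not measurements.
SPINNING_DISK_IOPS = 150
SSD_IOPS = 20_000

print(drives_needed(5000, 1, SPINNING_DISK_IOPS))  # 34
print(drives_needed(5000, 1, SSD_IOPS))            # 1
```

Under these assumptions, spinning disks need a few dozen spindles just to keep up with reads, which pushes the design toward multiple boxes, while a single solid state drive has headroom to spare.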

There is additional information available if the candidate asks for it, things like requiring responses in 100 millis at the 90th percentile, that the budget is, “well, we don’t know how much it is going to be worth until we see it in use for a while, so try to do it cheap, if it is too much we’ll just not bother building it,” and that we have a datacenter and a switched network we can put it on, but no pre-specified servers. We want 99.9% availability, measured on a monthly basis, but are not offering an SLA to consumers. The keys are not distributed evenly within the keyspace. Requests are distributed evenly (and randomly) across the keys (I do this to make the problem easier). Etc.

For most good candidates, designing such a system is very straightforward. It is an interesting design exercise for junior folks, or folks coming from different areas of programming (desktop apps, embedded, etc), but for most candidates, it should not be difficult.

I’m looking for quite a few things as we go through the question. The first is their opinion of “a terabyte of data” and “5000 lookups per second.” Do they consider this to be a lot of data, or a fairly boring amount? Same with the lookups per second. Leaning either way isn’t a failure; it is just information gathering for me, which I reference against how they represented themselves in their resume, cover letter, and phone screen.

I’m looking to see what additional information the candidate wants. Again, this is mostly to try to understand the candidate, and I don’t expect any barrage of questions out of the gate – they usually dribble in at forks in the design discussion.

I expect the candidate to design something that will work. Gimmick answers (put it in S3 and put up a web server that 301’s over to S3, etc) are valid, and you get some points for reasonable ones, but you still have to design an in-house version. I expect the solution to be within reasonable bounds for hardware, etc. Most folks do some kind of hashing scheme on the key in a front end server, and fan out to some database (or database-like) servers behind that. This is a fine answer.
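The common "hash the key in a front end, fan out to backends" answer can be sketched in a few lines. This is a minimal illustration, not a recommended design: the backend host names are hypothetical, and SHA-256 with a simple modulo is just one reasonable choice of hashing scheme.

```python
import hashlib

# Hypothetical backend servers, each holding one slice of the data.
BACKENDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]

def backend_for(key: bytes, backends=BACKENDS) -> str:
    """Pick the backend responsible for a key by hashing the key."""
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[bucket]

print(backend_for(b"some-128-byte-key"))
```

Hashing the key, rather than splitting the keyspace into ranges, matters here because the keys are not distributed evenly within the keyspace. Since the data never changes, the simple modulo scheme avoids the usual resharding worries.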

One interesting side note: it is fascinating to see the folks who decide they need to implement a database. Usually they don’t phrase it that way, they just start talking about on-disk hashing schemes, mmap, and so on. While I am positive that, given reasonable time, the problem lends itself to a hyper-efficient custom solution, I consider this to be a bad path compared to “slap it in $(database)” – frankly, you are not going to significantly outperform cdb or tc. I generally ask these folks if they think the problem of looking up a value by key has been solved before, especially given the requirement to be live in production in two weeks.

Once they have a design, I ask about details: what technologies, languages, etc. I generally remind folks that they have to have it in production two weeks from right now, and they need to implement it themselves, so they should probably stick to things they are very comfortable with, or can run a very quick experiment on to validate. I’ll then poke and prod for their level of understanding of the tools they choose. I am looking for them to have a strong understanding of the strengths, limitations, and nuances of their go-to toolset.

Now the fun part begins. The followup question is, “how is it going to fail?”

Candidates usually either start listing everything they can think of, or immediately start describing how to add data-level redundancy. If so, I’ll rephrase it as “what are the most likely things that will make it fail?” This is actually my favorite part. From experienced folks, I expect a pretty accurate rundown of why things fail, from junior folks I expect at least good first-principles reasoning.

Depending on time, and how interesting it has been going, I’ll then either dive into what we need to do to target 99.9% uptime, 99.99%, and so on, or move on to a different question.
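It helps to turn those availability targets into a concrete downtime budget. A small sketch, assuming a 30-day month for the monthly measurement window (the function name is made up for illustration):

```python
def downtime_budget_minutes(availability_pct, days_in_month=30):
    """Minutes of downtime allowed per month at a given availability target."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (100 - availability_pct) / 100

for target in (99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per month")
```

At 99.9% the budget is roughly 43 minutes a month, which one person can plausibly cover with manual recovery. At 99.99% it shrinks to a little over 4 minutes, which is where the conversation has to turn to redundancy and automatic failover.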

I love this question as my role in interviews is generally to determine how they think about and design systems. Is this someone who you can hand a problem description to, or just a task list? Do they bring a very different way of thinking about systems to the team (which can be either good or bad, depending on their approach)?

I get to feel out a lot of quality attributes, such as their familiarity with running a system as compared to handing it to ops, how much they have invested in mastering their tools, their balance between buzzword compliance and bedrock technologies, etc. You can certainly fail this question via a number of means, but it is set up to help me understand candidates more than grade them.

Interestingly, the question has been picked up by a number of other folks at Ning, mostly when we got larger and I stopped being involved in every engineering hire. Other folks generally used it in other ways, looking at other things. Most commonly for diving into algorithms and implementation details (what hash algorithm, if you are doing front end in PHP, let’s whiteboard code it out, etc). Sometimes I’d interview a candidate who had already been asked the question, and they usually got upset when I pressed forward on it anyway (usually after asking who asked the question, so I could make sure they hadn’t used it in the same way). “Then it should be easy, what was your answer?” got us past the first part and into the fun stuff, anyway!

Sadly, solid state drives killed the question, time for something new :-)

As an experiment, please leave any feedback or comments over on Hacker News.