Vivek Haldar

Operations should be in the computer science curriculum

CS 2013 is a proposal to update the computer science curriculum. Anyone interested in learning or teaching computer science should read the whole thing. I think it does an admirable job not just of providing coverage, but also of outlining the principles for selecting topics, and of impressing upon the reader the complexity and nuance of this task.

That said, I want to point out one glaring omission. I realized its importance only after working in the real world for a few years, and learning the hard way how poorly my CS education had prepared me for it.

That topic is operations. It encompasses all the activities required to keep a computing service deployed, performant and available. Chances are, if you work in the industry today, even if your title is something like “programmer” or “software engineer”, operations will be some component of your job. The fraction of time you spend on it will fluctuate, sometimes close to 0% and sometimes close to 100%, but it will never go away for good.

In broad strokes, operations includes the following:

  • Production design: okay, your shiny new system is code complete. How exactly will it be deployed? How many machines? What kinds of machines? What resources does it need? What are the failure modes of your system? What happens if x% of your machines go down? What happens if all your machines go down? Under what failure circumstances can the service continue to limp along, as opposed to being completely unavailable? The answers to all these questions make up the production design of your system.
  • Deployment: how would you automate the process of pushing new code? Of taking the entire system up and down? Taking subcomponents of the system up and down? How will you do a rolling restart of all your machines?
  • Monitoring: how will you get visibility into the performance and health of your system? This includes inserting hooks to measure key characteristics (QPS, latency, etc.) as well as collecting and visualizing them automatically, in a way that will make sense to a bleary-eyed person who has been woken up by a page at 3 AM.
  • Alerting: okay, you’ve done a great job of monitoring the key health and performance metrics of your system. Now how will you notify yourself when something goes wrong? This includes figuring out the exact thresholds beyond which the system is deemed unhealthy. This really needs to be foolproof. You don’t want your system to be down without your even getting paged about it.
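
To make the monitoring and alerting items concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `ServiceMonitor`, `Thresholds` and `check_alerts`, the sliding-window design, and the threshold values are all hypothetical, and a real service would send alerts to a paging system rather than return strings.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical limits; real values come from measuring the service.
    max_p99_latency_ms: float = 500.0
    min_qps: float = 1.0


class ServiceMonitor:
    """Keeps a sliding window of (timestamp, latency_ms) samples."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.samples = deque()

    def record(self, timestamp: float, latency_ms: float) -> None:
        # The "hook": call this once per request served.
        self.samples.append((timestamp, latency_ms))

    def _trim(self, now: float) -> None:
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def qps(self, now: float) -> float:
        self._trim(now)
        return len(self.samples) / self.window_seconds

    def p99_latency_ms(self, now: float) -> float:
        self._trim(now)
        if not self.samples:
            return 0.0
        latencies = sorted(latency for _, latency in self.samples)
        # Nearest-rank 99th percentile.
        index = max(0, math.ceil(0.99 * len(latencies)) - 1)
        return latencies[index]


def check_alerts(monitor: ServiceMonitor, thresholds: Thresholds, now: float) -> list:
    """Returns a list of human-readable alert messages (empty if healthy)."""
    alerts = []
    qps = monitor.qps(now)
    p99 = monitor.p99_latency_ms(now)
    # Alerting on low traffic covers the case where the service is
    # down and therefore records no latency samples at all.
    if qps < thresholds.min_qps:
        alerts.append(f"qps {qps:.2f} is below {thresholds.min_qps:.2f}")
    if p99 > thresholds.max_p99_latency_ms:
        alerts.append(
            f"p99 latency {p99:.0f} ms is above {thresholds.max_p99_latency_ms:.0f} ms"
        )
    return alerts
```

The low-QPS check is the part worth dwelling on: a dead service produces no slow requests, so a latency threshold alone would stay silent during a total outage.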

In short, being responsible for the uptime of your own code gives you a whole new perspective when you write it. You start thinking about what could happen in there that might page you at 3 AM. You also realize how writing code is but one step in the never-ending cycle of deploying, tweaking, and writing again.

Where might this fit into a computer science curriculum? One option would be to make it an additional component of existing programming exercises, by having students deploy their programs on test systems, simulate failures, and then see how well their monitoring and alerting hold up. An interesting twist might be to have students be responsible for the upkeep of not their own but other students’ programs.
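
As a rough idea of what the failure-simulation part of such an exercise could look like, here is a small Python sketch. It is not taken from any real curriculum or tool: the function `simulate_outage`, its parameters, and the three state labels are hypothetical, and it assumes identical machines that each serve a fixed QPS. It picks a fraction of machines to take down and reports whether the remaining capacity still covers demand.

```python
import random


def simulate_outage(num_machines, qps_per_machine, demand_qps, fraction_down, seed=0):
    """Chooses machines to fail and classifies the resulting service state.

    Returns (state, remaining_capacity_qps, victims), where state is
    "healthy", "degraded" (limping along) or "unavailable".
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    num_down = round(num_machines * fraction_down)
    # Machine ids a test harness would actually shut down.
    victims = sorted(rng.sample(range(num_machines), num_down))
    survivors = num_machines - num_down
    capacity = survivors * qps_per_machine
    if survivors == 0:
        state = "unavailable"
    elif capacity >= demand_qps:
        state = "healthy"
    else:
        state = "degraded"
    return state, capacity, victims


# Example: 10 machines at 100 QPS each, facing 600 QPS of demand.
for fraction in (0.2, 0.5, 1.0):
    state, capacity, victims = simulate_outage(10, 100, 600, fraction)
    print(f"{fraction:.0%} down -> {state}, capacity {capacity} QPS, victims {victims}")
```

In a class setting, the list of victims would drive real shutdowns on the test systems, and the grade would depend on whether the students' alerts fired when the state left "healthy".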

Also see: the CS assignment I wish I’d had.