Operations Driven Development – Heuristics & Best Practices


Apply ODD Best Practices – Part 1

Most enterprise software is intended to run reliably in production with high availability at scale.  Achieving high levels of reliability and availability in real-world operations involves as much art (“best practices”) as science and requires a disciplined up-front approach to architecting and building for “operability” – something we like to call “Operations Driven Development” (ODD).  This series of articles will present some of the heuristics and best practices we use as part of ODD.

Let’s start by looking at a typical operations environment.  Application support personnel are often required to monitor a large, diverse portfolio of applications and don’t have intimate knowledge of any one of them.  They are often fairly technical but not developers, so they can’t be expected to understand or deduce what’s happening “under the covers”.  They often operate in fast-paced, stressful environments, especially during periods of high activity.  When an application instance runs into problems, they need fast access to clear diagnostics and effective repair tools.  (Sadly, what we find is that the only management capability is often a full restart and the only support fixture is logging.)

To begin addressing the needs of operations we anchor ODD on the following heuristics and best practices:

  • “5 minute rule”:  Outages happen; applications fail.  We of course strongly advocate fault tolerant architectures and rigorous quality control mechanisms to minimize the probability and customer impact of outages, but we also recognize that they will happen.  When they do, we apply a “5 minute rule”: the time from the start of the outage or degradation to its resolution should be no more than 5 minutes.  Within those 5 minutes the system or its monitoring framework must recognize the outage and preferably auto-repair.  In cases where auto-repair doesn’t work, a technician must be notified and must have adequate diagnostic and repair tools available to diagnose and recover the system.

  • Operations as Actors:  Most systems are built with an intense focus on the end-user features and use cases; little attention is paid to the internal operations and tech support users.  Focusing on these Operator Actor use cases early and often ensures that they are scoped and prioritized effectively.

  • Developers as Operators:  Developers and testers spend a non-trivial amount of time debugging and diagnosing systems under development in pre-production and can benefit from many of the same diagnostics that operations personnel could use in production.  In other words, the investment in ODD can also significantly accelerate the development cycle.  We don’t hesitate to invest in capabilities that overlap developer and operator needs, and we strongly encourage our developers to think “beyond logs.”
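
To make the “5 minute rule” concrete, here is a minimal Python sketch of the detect, auto-repair, escalate sequence it describes.  This is an illustration under assumptions, not a prescribed implementation: the Watchdog class, its callback names (is_healthy, auto_repair, notify_technician), and the limit of two repair attempts are all hypothetical choices made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# The recovery window from the "5 minute rule", in seconds.
RECOVERY_BUDGET_SECONDS = 5 * 60


@dataclass
class Watchdog:
    """Hypothetical monitor: detect an outage, try auto-repair, then escalate."""

    is_healthy: Callable[[], bool]             # assumed health probe
    auto_repair: Callable[[], None]            # assumed repair action
    notify_technician: Callable[[str], None]   # assumed paging hook
    clock: Callable[[], float] = time.monotonic
    repair_attempts: int = 2                   # illustrative limit, not from the article
    budget_seconds: float = RECOVERY_BUDGET_SECONDS
    events: List[str] = field(default_factory=list)

    def check(self) -> str:
        """Run one monitoring pass and return the resulting state."""
        if self.is_healthy():
            return "healthy"

        detected_at = self.clock()
        self.events.append("outage detected")

        for attempt in range(1, self.repair_attempts + 1):
            self.auto_repair()
            self.events.append(f"auto-repair attempt {attempt}")
            if self.is_healthy():
                return "auto-repaired"
            if self.clock() - detected_at >= self.budget_seconds:
                break  # budget exhausted, stop retrying

        # Auto-repair did not work: hand over to a person, and tell them
        # how much of the recovery window is left.
        remaining = max(0.0, self.budget_seconds - (self.clock() - detected_at))
        self.notify_technician(
            f"auto-repair failed; {remaining:.0f}s left in recovery budget"
        )
        self.events.append("technician notified")
        return "escalated"
```

In practice check() would be called on a schedule by whatever monitoring framework is in place, and the technician notification would carry links to the diagnostic and repair tools mentioned above.  The events list is a simple stand-in for an audit trail that operators can inspect afterwards.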

Check back for part two of this article, which will present specific tools and techniques for improving operability.

Written by: Dan Cripe, CTO