PagerDuty uses Chef for some of its configuration management needs. While most Chef cookbooks we develop internally are not useful outside of PagerDuty’s infrastructure and workflow, sometimes we do come across a problem that seems general enough to make open-sourcing the solution meaningful.

The pd-feature cookbook solves the problem of gradually rolling out a new feature across a uniform fleet of machines while avoiding manual actions (we want all infrastructure to be controlled by and visible in source code). The cookbook allows fine-grained control of the process, supports a number of common scenarios, and makes features easy to discover. But, to justify the cookbook’s existence, let me explain why Chef does not solve this problem by itself.

Chef attributes are the usual method for controlling optional features. A common pattern defines a boolean attribute for the feature and takes different recipe paths based on that attribute’s value. In this example, the code would install Failure Friday-related tooling only in production:

in cookbooks/pd-base/attributes/default.rb:

default['pd-base']['failurefriday_enabled'] = false

in cookbooks/pd-base/recipes/failurefriday.rb:

if node['pd-base']['failurefriday_enabled']
  cookbook_file '/opt/failurefriday/reboot.sh' do
    source 'failurefriday/reboot.sh'
    owner 'root'
    group 'root'
    mode '0744'
  end
end

in environments/production.rb:

default_attributes(
  'pd-base' => {
    'failurefriday_enabled' => true
  }
)

Chef can set attributes on individual environments and roles, so if a feature maps exactly onto an environment or a role, attributes are enough. However, things get tricky when a feature in a shared cookbook should be enabled only for a particular role in a given environment. One none-too-pleasant way of accomplishing this is to set the attribute to true in the environment and to false in every other role.

The situation is even more complex for a uniform fleet (for example, twenty identical machines with a web-app role in the same environment with a feature that should be enabled on two of them). Attributes at the role or environment level do not help, since these machines share the same environment and role and therefore the same attribute values. A different role can be assigned to a subset of machines, but that’s a fair bit of work. And assigning customized roles, like other manual approaches such as editing the node state of selected machines directly or assigning Chef tags to a few nodes, is not visible anywhere in our Chef source code (thus violating our infrastructure-as-code principle). Because that configuration is not in the code, it does not get replicated when manually modified nodes are replaced. And replaced they will be, because constant, gradual churn of the fleet is a fact of life in large environments such as PagerDuty’s, usually due to hardware failure over time. The churn is guaranteed to eventually obliterate any node configuration change made by hand.

This is where pd-feature comes in. I will not repeat the extensive documentation here, but in short: the solution is still attribute-based, except that the attribute’s value specifies the rules for applying the feature instead of being a boolean on/off switch. For example, a count:2 value covers the scenario from the previous paragraph (enable the feature on two machines of the fleet), and if one of the selected machines gets replaced, the cookbook will automatically select another one on the next run. The rules are expressed in code and can be tweaked with one-line changes to adjust the feature’s reach.
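To make the idea of a count-style rule more concrete, here is a minimal sketch in plain Ruby. It is not pd-feature’s actual implementation or API. The helper names, the hash-based ordering, and the assumption that every node can obtain the list of fleet node names (for example, from a Chef search) are all hypothetical and chosen only for illustration.

```ruby
require 'digest'

# Hypothetical helper, NOT the pd-feature API.
# Orders the fleet by a hash of "feature:node name" and takes the first
# `count` entries. Every node that sees the same fleet list computes the
# same answer, so no manual per-node state is needed.
def selected_nodes(feature, node_names, count)
  node_names
    .sort_by { |name| Digest::SHA256.hexdigest("#{feature}:#{name}") }
    .first(count)
end

# Hypothetical helper: should the feature be active on this node?
def feature_enabled?(feature, node_name, node_names, count)
  selected_nodes(feature, node_names, count).include?(node_name)
end

# Usage: twenty identical web-app machines, feature wanted on two of them.
fleet = (1..20).map { |i| format('web%02d', i) }
chosen = selected_nodes('failurefriday', fleet, 2)
puts "enabled on: #{chosen.join(', ')}"

# If one chosen machine is replaced, the next evaluation picks two again,
# and the surviving machine keeps the feature.
fleet_after_churn = fleet - [chosen.first] + ['web21']
puts "after churn: #{selected_nodes('failurefriday', fleet_after_churn, 2).join(', ')}"
```

Because the selection is recomputed from the current fleet on each run, nothing needs to be recorded on the nodes themselves, which is the property that hand-edited node state lacks.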

A side benefit of using a unified approach to feature flags is consistency. In our Chef codebase, I can find boolean flags ending with “enable”, “enabled”, “disable”, and “disabled” with values being mostly booleans (true and false) but sometimes strings ('true' and 'false'), depending on the author, age, and inspiration of the cookbook. Mistakes were made, including by yours truly, because of this variety. Using a helper for feature flags enforces a standard behavior and, by naming convention, clearly separates feature flags from other boolean attributes.
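The string-versus-boolean mix-up is easy to reproduce: in Ruby, the string 'false' is truthy, so a plain `if` on such an attribute takes the "enabled" branch. The sketch below shows one way a helper could normalize flag values. The `flag_enabled?` name and its accepted spellings are assumptions for illustration, not part of pd-feature.

```ruby
# Hypothetical helper, not part of pd-feature: accept only known spellings
# of a flag value and fail loudly on anything else.
def flag_enabled?(value)
  case value
  when true, 'true' then true
  when false, 'false', nil then false # nil: attribute was never set
  else
    raise ArgumentError, "unsupported feature flag value: #{value.inspect}"
  end
end

# A plain truthiness check gets the string form wrong:
naive = 'false' ? true : false
puts "naive check on 'false': #{naive}"
puts "helper on 'false': #{flag_enabled?('false')}"
```

Routing every flag through a single helper like this gives one place to decide what counts as "on", instead of relying on each cookbook author's convention.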

I hope you will find this cookbook useful. This is just one example of general infrastructure problems PagerDuty engineers are solving in addition to developing the PagerDuty platform. If these kinds of challenges interest you, we are hiring.