Few things in recent years have changed the way tech organizations operate as much as the infrastructure as code movement. With infrastructure itself largely having moved into the cloud, automating the provisioning, upgrading and management of that infrastructure was a natural next step.
While most of the cloud providers offer their own proprietary technologies for describing deployed infrastructure in a declarative manner—as JSON, YAML or XML—Terraform has emerged as the leading open source solution for infrastructure as code. Although treating infrastructure like servers as cattle and not pets felt like—and was!—a huge leap for many, handling infrastructure components as disposable resources obviously doesn’t diminish the actual value of what that infrastructure provides.
Codifying not only the infrastructure, but also the requirements surrounding it, allows us to quickly deal with changes to our infra resources without fear of breaking with requirements around security, reliability or costs.
In this article, we’ll take a closer look at how Open Policy Agent (OPA) can be leveraged to secure infrastructure deployments by building policy-based guardrails around them, allowing us to codify rules like:
Allow only approved machine images on server instances.
Allow deployments only in approved regions.
Prohibit creation of storage resources unless they are encrypted.
Allow changes to infrastructure only if costs and “blast radius” are within reasonable bounds.
Ensure the presence of organization tags and labels on all deployed resources.
OPA is a truly general purpose policy engine, so using it for infrastructure policy means an organization can leverage its skills, experience and tooling in the policy domain not just for infrastructure but across the whole organization, with use cases as diverse as Kubernetes admission control, microservices authorization, data filtering and CI/CD pipelines.
Note: We’ll use AWS resources for some of the examples in this blog, but the principles described here are equally applicable regardless of your choice of cloud provider.
Why Infrastructure needs policy
While it’s common to put policy and OPA in the “security” category, infrastructure policy is not just about security in the sense of trying to protect infrastructure deployments from malicious actors. Although that’s certainly an important aspect, the guardrails you build around your infrastructure are there just as much to protect you from yourself.
Mistakes are easy to make, and potentially very expensive. While mistakes in application code can be dangerous enough, applications most often run in isolated environments where the impact of mistakes, bugs or even intrusions is more easily contained. Mistakes in basic infrastructure like networking could however impact an entire organization—either by letting the wrong people in, or by keeping legitimate users out.
Due to this, there’s already a long history of policy around infrastructure deployments. However, up until now, this policy has often been written down and kept in internal wikis, Word documents and PDF files, and enforcement often done by means of manual processes. With more and more organizations codifying their infrastructure, policy needs to follow suit!
Evaluating Terraform Plans
So, where do we begin? While it’s certainly doable to audit the current state of your cloud infrastructure, most of the time your primary interest will be to let policy determine whether planned state changes should be allowed or not. This includes not only new infrastructure, but just as much changes to existing resources. Extracting the planned changes in JSON format with the terraform command line tool is fairly easy:
terraform plan -out tfplan.binary
terraform show -json tfplan.binary > tfplan.json
We may now use the JSON data in the tfplan.json file as input to OPA in order to evaluate planned changes against policy. Since you’ll be running this on all changes, you might want to create a little script to help simplify the process. With the tfplan.json file at our disposal, we are now ready to write our first infrastructure policies.
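Such a wrapper script (the file names are simply the ones from the commands above) could be as minimal as:

```shell
#!/bin/sh
# plan.sh -- produce a JSON representation of the Terraform plan,
# suitable as input to OPA. Exits on the first failing command.
set -e

terraform plan -out tfplan.binary
terraform show -json tfplan.binary > tfplan.json
```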
While the data on resources such as storage or compute will differ depending on which underlying cloud provider is used, the base structure of the JSON plan representation format is always the same. Make sure to review it so you know where to find data like variables. The structure you’ll be working with the most is the one named resource_changes. As the name suggests, this attribute contains a list of all resources that would be added, deleted or modified by the Terraform plan, if applied.
Each resource has (among other attributes) a name, a type, and a change. An example policy to enforce all AWS S3 buckets have a private ACL might look something like this:
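(The package name, helper name and message wording below are illustrative assumptions.)

```rego
package policy

# Deny creation or update of any S3 bucket whose ACL is not private.
deny[msg] {
    changeset := input.resource_changes[_]
    is_create_or_update(changeset.change.actions)
    changeset.type == "aws_s3_bucket"
    changeset.change.after.acl != "private"
    msg := sprintf("S3 bucket %q does not have a private ACL", [changeset.name])
}

# Helper: true if the list of planned actions includes a create or an update.
is_create_or_update(actions) { actions[_] == "create" }
is_create_or_update(actions) { actions[_] == "update" }
```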
Let’s take a closer look at what’s going on here. Since many resources could violate our policy, the deny rule used here is one that generates sets of messages explaining the possible violations encountered.
Inside the body of the rule we start an iteration over all resource changes reported in the plan. Next we call a helper function to determine if the type of change is a create or update operation, as we aren’t too concerned about non-private buckets being deleted. Next, we check to see that the resource is an S3 bucket before we proceed to check the ACL value. Note that we use the after attribute of the change here, since that’s the state of interest. Finally, if all the above conditions have been met we conclude that a resource has violated our policy and we create a human readable message with the bucket name, and add it to the set.
While the above policy is simple, it serves well to demonstrate how most infrastructure policies are structured: iterate over the planned changes and determine if any of the new values are in violation of policy. Some other useful examples include calculating blast radius of changes, checking for the existence and/or format of tags, or using security policies to minimize friction in code reviews.
Also, just as with other types of policy, decisions can be made not just based on the data provided to OPA as input (i.e. the Terraform plan) but also from external sources and facts like the current date and time, whether anyone from the SRE team is on call, or if the changes are determined to be of sensitive nature, such as those attempting to modify roles and permissions.
How could we improve on our policy from here? One immediate observation is how our rule starts with iterating over all the changes in the plan:
changeset := input.resource_changes[_]
If we add more rules, they’ll likely need to start with the same iteration. Iterating over the full list of changes in every rule isn’t terribly efficient, but performance is not our main concern here: first, the number of changes is normally small enough for this not to be a problem, and secondly, evaluating Terraform plans is rarely a performance critical operation the way, say, application authorization is. Rather, wouldn’t it be nice if we could group the changes by the type of resource being modified, and then have our rules evaluated only when the resource type is applicable? Using an object comprehension, we can group the changes by the type of resource they apply to:
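One possible sketch of such a comprehension (the variable names are my own):

```rego
# Map from resource type to the list of planned changes of that type.
changes_by_type := {type: changes |
    some i
    type := input.resource_changes[i].type
    changes := [change |
        change := input.resource_changes[_]
        change.type == type
    ]
}
```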
When evaluating the changes_by_type variable, we’ll now find an object keyed by the resource type, and the values containing a list of changes for that particular type:
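An abbreviated example of what that could look like (resource names here are hypothetical, and most attributes are omitted):

```json
{
    "aws_s3_bucket": [
        {
            "type": "aws_s3_bucket",
            "name": "logs",
            "change": {
                "actions": ["create"],
                "after": {"acl": "public-read"}
            }
        }
    ],
    "aws_iam_role": []
}
```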
With this data structure available, we could choose to put it in a separate “util” policy and import it in each of our policies:
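A sketch of that arrangement, assuming a package named util:

```rego
package util

# Map from resource type to the list of planned changes of that type.
changes_by_type := {type: changes |
    some i
    type := input.resource_changes[i].type
    changes := [change |
        change := input.resource_changes[_]
        change.type == type
    ]
}
```

The S3 bucket policy could then import and use it like this:

```rego
package policy

import data.util.changes_by_type

# Deny creation or update of any S3 bucket whose ACL is not private.
deny[msg] {
    changeset := changes_by_type.aws_s3_bucket[_]
    is_create_or_update(changeset.change.actions)
    changeset.change.after.acl != "private"
    msg := sprintf("S3 bucket %q does not have a private ACL", [changeset.name])
}

is_create_or_update(actions) { actions[_] == "create" }
is_create_or_update(actions) { actions[_] == "update" }
```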
Note how we no longer need the changeset.type == "aws_s3_bucket" check as that is already ensured by iterating only over the changes relevant to the resource type.
Another option made available by this change is to use dynamic policy composition to let a “main” policy service all incoming requests, and dynamically dispatch policy evaluation only to the policies where the corresponding resource has changed. A main policy routing request might look something like this:
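One way to sketch this, assuming each resource-specific policy lives in a package named after its resource type (e.g. terraform.aws_s3_bucket):

```rego
package main

import data.util.changes_by_type

# Dispatch to data.terraform.<resource_type>.deny only for resource
# types that actually appear in the plan.
deny[msg] {
    some resource_type
    changes_by_type[resource_type]
    msg := data.terraform[resource_type].deny[_]
}
```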
If we wanted to, we could overwrite input entirely and only provide each policy with the changes applicable to that resource type. There could however be cases where the planned change of one resource depends on planned changes in other resources.
Additionally, not all of our rules apply only to a single resource type. If we for example would like to ensure that any resource deployed had a resource tag to identify the owner of the resource, we wouldn’t want to repeat that rule in each and every resource policy. We’d probably rather extend the main policy, or dispatch to another policy for “common” rules:
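A sketch of such a “common” policy (the tag name and package layout are assumptions on my part):

```rego
package terraform.common

# Require an "owner" tag on every resource a plan would create.
deny[msg] {
    changeset := input.resource_changes[_]
    changeset.change.actions[_] == "create"
    not changeset.change.after.tags.owner
    msg := sprintf("%v %q is missing an owner tag", [changeset.type, changeset.name])
}
```

The main policy would then always include these rules, regardless of which resource types changed:

```rego
package main

# Always evaluate the "common" rules in addition to any
# resource-specific ones.
deny[msg] {
    msg := data.terraform.common.deny[_]
}
```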
With a few basic policies to extend from, how would we run OPA as part of infrastructure build or deployment steps?
OPA's different modes of operation
One thing noticeably different when running OPA in CI/CD pipelines is how to deploy or execute OPA for policy evaluation. If you’ve run OPA to make authorization decisions for your microservices, or perhaps configured OPA for Kubernetes admission control, you’ll remember that OPA commonly runs as a server. Clients—whether they’re microservices, applications or the Kubernetes API server—then query the OPA REST API for decisions.
CI/CD processes on the other hand are often fairly short lived. A build task, for example, is normally processed inside of a container or perhaps in a serverless function, and after the build is completed the container, function or process that contained the build has served its purpose and is terminated. While it would be possible to start the OPA server inside of such a process—or make requests to an OPA server running outside of it—it is often preferable to let OPA run as a short-lived one-off process too. In order to do this, we have a couple of options to choose from.
The first one is opa eval. This command is often referred to as the “Swiss army knife of OPA” and it truly is a versatile tool. Using the terraform plan as input, we could use it to evaluate our policy like this:
opa eval --fail-defined --format raw --input tfplan.json --data policy 'data.main.deny[x]'
So, what’s going on here? The first flag provided (--fail-defined) tells OPA to exit with a non-zero (i.e. “fail”) exit code on anything but an empty result. This is useful with “deny” style rules—or other set-generating rules—since the ideal outcome is then an empty response (no violations). Next, we use the --format flag to make the output readable for humans; which output format is best largely depends on the context, and on whether another tool is meant to parse the output later on. We then provide the Terraform plan as input, and our policy directory as data. Finally, we evaluate the data.main.deny[x] query. If there are any violations, they’ll be printed to the screen before OPA exits with an error code.
Another option is to use Conftest. From its own description, “Conftest is a utility to help you write tests against structured configuration data.” The structured configuration data in this case would then be the same Terraform plan previously used as input to opa eval. As for policies, conftest by default scans for those in a directory called policy, which is also what we’ve used in our examples. Calling conftest would thus be as simple as:
conftest test tfplan.json
As a tool built for testing, there’s no need to explicitly tell conftest to “fail” when there are failures in the test. Additionally, the default output format is both readable and pretty. When running conftest in CI/CD pipelines we may even benefit from a number of additional output formats commonly found in these contexts, such as JUnit or TAP.
As we’ve seen by now, bringing policy into infrastructure as code adds the guardrails and safety nets essential for continuously deploying infrastructure without fear of breaking things, all while knowing that what we deploy conforms to organizational policies.
I hope this blog has not only taught you why this is important, but also provided some hands-on experience examining what some real infrastructure policies and deployment scenarios might look like.
With infrastructure as code becoming ever more prevalent, I’m sure we’ll see a lot of interesting future developments in this field. While it’s hard to predict exactly what these changes will entail, one thing is for certain—we’re going to need policy to follow suit!
Here's an excellent repo to demonstrate OPA policies.