🔥Azure DevOps & Terraform: Breaking Up The Monolith — Strategy
Hey all!
Azure DevOps is a CI/CD automation platform from Microsoft ($MSFT). It hosts repositories and runs builds, tests, and all sorts of other automation against the code in them (among many, many other functions). This includes Terraform, a tool that converts scripted, declarative configurations into real resources in cloud (and other) providers via API calls.
Terraform has been an excellent tool for us so far, and is starting to be adopted by other teams, for other purposes, to manage more accounts and resources. Which means the model we selected, a single Terraform file (with a single .tfstate file) that defines all resources and their configurations for an environment, is quickly getting strained.
Here’s an example. Say you have the environment described above, with a single file. You have a dozen developers working in parallel, building projects and adding them to the single monolithic file. Changes might get through the PR process without being properly vetted. Devs might push changes to the Terraform repo and not deploy them yet: maybe the changes aren’t ready, or maybe they shouldn’t be deployed yet for some dependency reason. Now you want to push a tiny little change, maybe to change the size of an instance. You push your PR, run a terraform plan, and it wants to change 22 resources in 3 different time zones. Would you push the approval through? If you’re an experienced engineer, heck no you wouldn’t. You could break any number of things.
So that’s a scary situation, and probably an eventuality for most companies that start using Terraform and don’t plan an extensible way to manage these files. But that’s okay. For better or worse, the best driver of innovation is impending failure.
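To make the "22 resources" surprise concrete, here is a minimal Python sketch of a guard that counts how many resources a plan wants to touch before anyone approves it. It assumes the plan was saved with terraform plan -out and exported with terraform show -json, which lists each planned change under resource_changes. The function names and the max_changes threshold are my own illustration, not part of Terraform or Azure DevOps.

```python
import json


def summarize_plan(plan_json):
    """Count planned changes by action type from `terraform show -json` output."""
    plan = json.loads(plan_json)
    counts = {}
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        # "no-op" and "read" entries do not modify real infrastructure.
        if actions in (["no-op"], ["read"]):
            continue
        # A replacement shows up as two actions, e.g. ["delete", "create"].
        key = "+".join(actions)
        counts[key] = counts.get(key, 0) + 1
    return counts


def is_safe_to_apply(plan_json, max_changes=1):
    """Hypothetical policy: allow the apply only if few resources change."""
    return sum(summarize_plan(plan_json).values()) <= max_changes
```

With a check like this in front of the apply step, a "tiny" PR that drags along a pile of queued changes fails loudly instead of relying on someone reading the full plan output.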
What Options Do We Have?
So how can we fix this problem? I have a few different strategies I want to discuss.
Option A: A few more TF files
We could of course break the single monolithic TF file and its .tfstate file into a few TF files, each in its own directory with its own state (Terraform treats every .tf file in a directory as one configuration). For instance, put all servers into one file and all databases into another. This has the benefit of minimizing changes to process, and putting off the point where many changes are queued up for TF apply-ing.
This also has the benefit of being easily supported by Azure DevOps. You can point the native Terraform plan/apply steps at the different files, and even run them in different concurrent stages of the Terraform release. They can all run automatically, and boom, you’re in business.
The big con here is that the problem is only delayed. You have expanded the ability for your processes to scale, but you’re still queueing up changes within a single file. And you’re going to need to do this again and again in the future.
A real solution to the problem would be better than a band-aid. So what else can we do?
Option B: Many project TF files, Terragrunt recursion
A problem with Azure DevOps and Terraform in general is that each Terraform step must be pointed at a single directory, and Terraform doesn’t recurse into subdirectories. Which means if you have half a dozen TF files that need to be run, your TF release pipeline is going to be relatively complex. But if you have hundreds? It’d be untenable. Not to mention that every time a project is added, your release pipeline would need to be updated.
Which is exactly the gap that Terragrunt looks to fill. It natively supports recursion and complex deployments, and it offers lots of tools to keep your configuration DRY (humorously, Don’t Repeat Yourself).
A pro here is that now you can expand ad infinitum with Terraform stacks. You can tell your devs that if they drop their Terraform code into a folder tree you specify, their code will be executed on the next run.
There are still some downsides. Terragrunt, because of its additional deployment logic, requires new files to be added, and some changes to your TF stack config. If you already have lots of files, that’s not great. And learning a new tool just for this problem isn’t ideal either. One complication that seems trivial (but probably isn’t) is that the Azure DevOps tasks that consume a Service Principal are built for Terraform in particular, not for any other command, even a very similar one like Terragrunt. Which means you’re looking for a Terragrunt deployment module, which… doesn’t exist (yet). So you’re deploying code with straight-up terminal commands and handling the service principal authentication yourself, which isn’t a security best practice.
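To show what "recursion over a folder tree" means in practice, here is a small Python sketch that walks a tree and lists every directory holding Terraform code. This is not how Terragrunt works internally, just an illustration of the discovery step. It assumes every directory that directly contains a .tf file is a deployable stack (reusable modules would need to live elsewhere or be filtered out), and the name find_stacks is hypothetical.

```python
import os


def find_stacks(root):
    """Return directories under `root` that directly contain a .tf file.

    Paths are returned relative to `root`, sorted for a stable run order.
    """
    stacks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden folders such as .terraform (provider/module cache)
        # and .git so they are never treated as stacks.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        if any(name.endswith(".tf") for name in filenames):
            stacks.append(os.path.relpath(dirpath, root))
    return sorted(stacks)
```

A new project then only has to add a folder. Nothing in the release definition changes, which is the property that makes the recursive approach attractive.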
And one of our big initial drawbacks remains — when an “apply” is run against the top-level of the folder structure, all changes that have been queued up by PR approvals in the terraform repo will be executed. Again, we might end up pushing out dozens of changes if devs haven’t been applying their changes right after getting PRs approved. Still not ideal.
Ideally, we’d be able to get all the benefits from Option B (Recursive Terragrunt) without learning and implementing a new tool, and without applying changes en masse during a single run. And what a monster I’d be if I didn’t present something that satisfied those criteria.
Option C: Targeted, custom Azure DevOps release pipelines
What many companies do is implement Jenkins, an extensible CI/CD tool that permits more customization of releases, including setting variables that can target particular files for jobs. This is used to target and run specific Terraform file updates.
Thankfully, Azure DevOps supports similar functionality. The functionality is relatively recent and still in development, so documentation isn’t great. However, we can piece together enough disparate features to make this work well.
When initiating a TF release pipeline, we can surface a variable that can be consumed by our TF steps within the pipeline to target specific files for execution. Combine individual TF files, each with its own state file, with a release pipeline that executes a single TF file at a time, and we can scale out indefinitely (thousands of TF files) and programmatically define where the state file for each TF file is stored.
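As a rough sketch of the targeting idea, the Python below turns one pipeline variable into the Terraform commands for a single stack and derives that stack's state file name from its path. The variable name TF_TARGET_STACK and the function names are hypothetical. It also assumes each stack declares a backend whose key is left out of the config, so it can be supplied at init time with -backend-config, and that the pipeline exposes the variable to scripts as an environment variable.

```python
import os
import posixpath


def state_key_for(stack_dir):
    """Derive a unique state file name, e.g. apps/web -> apps/web.tfstate."""
    normalized = posixpath.normpath(stack_dir.replace("\\", "/")).strip("/")
    if normalized in ("", ".") or normalized.startswith(".."):
        raise ValueError("stack must be a folder inside the repo: %r" % stack_dir)
    return normalized + ".tfstate"


def build_commands(stack_dir):
    """Build init/plan/apply argument lists for exactly one stack."""
    chdir = "-chdir=" + stack_dir  # global option, goes before the subcommand
    key = state_key_for(stack_dir)
    return [
        ["terraform", chdir, "init", "-input=false", "-backend-config=key=" + key],
        ["terraform", chdir, "plan", "-input=false", "-out=tfplan"],
        ["terraform", chdir, "apply", "-input=false", "tfplan"],
    ]


def commands_from_environment(environ=os.environ):
    """Read the (hypothetical) TF_TARGET_STACK variable set at release time."""
    stack = environ.get("TF_TARGET_STACK", "").strip()
    if not stack:
        raise ValueError("TF_TARGET_STACK is not set; refusing to run everything")
    return build_commands(stack)
```

Refusing to run when the variable is empty is a deliberate choice here: the whole point is that a release touches one stack, never the entire tree by accident. Each argument list could be handed to subprocess.run, or the same flags could be passed to the native Terraform steps.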
Conclusions
The output of all this:
We can scale out TF files indefinitely: TF files now stand alone, and aren’t all tied back to a single file that can become cluttered and queue up many changes.
Changes can be applied carefully and methodically: TF updates aren’t applied all at once for an entire folder structure. They are targeted, and only a single stack is updated at a time.
No new tooling has to be implemented: we can rely on native Azure DevOps and Terraform functionality. There’s no need to teach your team an entirely new tool and methodology.
In future blog posts I’ll be looking at Terragrunt to implement TF recursively in a folder structure, and separately at customizing Azure DevOps release pipelines with custom variables to permit releases targeting only a single arbitrary TF file.
Good luck out there!
kyler