🔥HashiTalks2020: Enterprise Deployment to Azure and AWS in Azure DevOps
Hey all
This is a talk I gave as part of HashiTalks2020. It’s about my own and my company’s journey from a single-cloud, single-provisioning-tool environment to multi-cloud, cloud-agnostic Terraform provisioning and management.
The full talk is 25 minutes long and you can watch it here:
Alternatively, this is the text of the speech. These are cleaned up notes, and may not be perfect or make the most sense without the visual aids.
Hey everyone! This is “Enterprise Deployment to Azure and AWS in Azure DevOps”. It’s my group’s journey from NoOps to DevOps, which includes establishing technical tooling, building business processes, and helping internal teams take advantage of the automation infrastructure we built.
It’s our story of making mistakes, forming communities, and empowering our developers to iterate faster than ever before.
I’m Kyler Middleton. I’m currently a DevOps and network engineer. Two years ago I hadn’t even heard of CloudFormation or terraform. This is my DevOps trial by fire.
I’ve just about got this… oh no
Like everyone watching this recording, I witnessed what was happening to the global IT infrastructure field over the past few years. Physical data centers were shutting down, workloads were moving to something called “The Cloud”, and automation tools were springing up left and right. Frankly, that kind of thing terrified me. I could hold my own with bi-directional route redistribution on a Cisco router and configuring an AnyConnect VPN on an ASA, but programming? I could “hello world” in bash, but that was about it.
So when I had an offer to join a company that treated infrastructure as code, I jumped at the chance.
I loved it. Six months in, I was taking every opportunity I could to solve problems with DevOps tooling. I was slow, and needed a lot of handholding, but I was making progress!
This company was successful with a single large product — a SaaS application for health care professionals to store and manage medical data. Everything was built on top of AWS. There were mature DevOps processes built around SparkleFormation that spit out CloudFormation stacks, which were deployed by Jenkins. Our product was reasonably successful — successful enough to attract the attention of a larger company (I’ll affectionately call them the mothership), which purchased us.
The mothership decided that, with our experience running user-facing applications and securely managing health care data (by far the scariest kind of data to store on your servers), we were a good fit for their newly minted medical data science division. All the applications and acquisitions they made in that field were assigned to the business division that we owned.
Which was fantastic, but also…a problem.
Most of these tools were built on Azure, and we had only worked with AWS to that point. Our infrastructure provisioning tool was CloudFormation, which only worked for AWS. But before we dig into the challenges in front of us, let’s talk about what we had going for us.
What We Already Had Going For Us
Our shop was already strongly infrastructure-as-code. Most environments were protected by process from manual changes; instead, we used DevOps tooling to push out updates. All changes to infrastructure config had to be made through pull requests, meaning they were reviewed and approved before going live. All of this would continue, and it helped ease the transition to the new platform.
The business was committed to up-skilling staff on DevOps tooling. Many IT operations team members are new to, or wary of, DevOps processes. Programming can be scary, and it takes concerted support and training to make operations team members feel comfortable with these new tools. The business had already decided to make the required time investment, which was (and is) a tremendous help.
Problems to Solve
The first rule of business is that change or chaos are just other words for opportunity. And we had all of those in spades.
Firstly, the new applications (new to us anyway) were built on Azure, and wanted to stay that way. The teams weren’t interested in migrating their application over to AWS, and why should they be? It’s far less time consuming to train an Ops team on a new cloud provider than rebuild an entire application for another provider. So it was time for us to learn Azure, and we needed to know it yesterday.
In the past we operated exclusively within AWS, and our tooling reflected that. We used SparkleFormation and CloudFormation exclusively, provisioning tooling that only supports AWS. So not only did our skills need an update, our tooling did too. We needed to select an infrastructure provisioning tool that would support not only Azure, but AWS and other clouds as well, so we wouldn’t have this problem again in a year or two when the next major cloud needs to be supported.
Our previous CI/CD was Jenkins, but that Jenkins was exclusively owned by the single-product company’s DevOps group. We had the option to build our own Jenkins servers, but we didn’t have anyone on our team with strong DevOps skills, so we had the opportunity to look around for a new CI/CD platform. If we’re going to learn it from the ground up anyway, let’s find the new hotness.
Since we deal with so much health care data, we had to be 100% absolutely sure that data was protected. The default for most SaaS CI/CD platforms is to execute code on an ephemeral builder hosted in the cloud. It’s hard to be sure that our very sensitive data would remain secure there, and we needed to be sure.
Within our new CI/CD, we needed to build pipelines that the wider group could understand and use to run our new tooling and deploy resources. They’d need to be integrated with a repo accessible to the group, with configuration stored there.
The DevOps group I mentioned earlier was assigned exclusively to one of the products, and didn’t have cycles to help us build this greenfield deployment. This is stated as a problem (because we were on our own) but it was also a tremendous opportunity — how many engineers get the opportunity to design and build a complete CI/CD and DevOps pipeline themselves? Let’s do this.
Problems to Solve #2
Skilling up on Azure cloud was by far the easiest problem to solve here. We hired in an expert! That architect has been fantastic, and resolved this major concern. 1 down, 5 to go.
Terraform
The first decision we made was on our tooling. Over the last few years HashiCorp’s Terraform has dominated the market for infrastructure provisioning.
It has several key strengths that drew us to it.
Firstly, the open source nature of the tool. We are a highly open-source-centric company, so the ability to contribute to the source code, troubleshoot our problems within a community, and maybe even contribute back to the tooling to add features we needed was and is a huge draw for us.
Free, as in beer and speech. It was important to us that the tool was free. This goes beyond the dollars we’d need to pay out to buy a tool. When a product is expensive, it limits the community’s ability to interface with it, get interested in it, and contribute to it. Free means a world-wide community of developers can get exposed to it.
Broad support for providers, particularly cloud providers, is an absolute requirement for us. The ability to support more providers in the future, and the rapid pace of innovation and provider involvement, make this a very exciting tool.
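As a tiny illustration of that breadth, a single configuration and workflow can target both of our clouds at once. This is a minimal sketch, not our production code; the bucket name, resource group, and regions are placeholders.

```hcl
# Minimal two-cloud sketch. Names and regions are placeholders.
provider "aws" {
  region = "us-east-1"
}

provider "azurerm" {
  features {}
}

# Same HCL syntax and the same init/plan/apply workflow for both clouds
resource "aws_s3_bucket" "logs" {
  bucket = "example-cross-cloud-logs"
}

resource "azurerm_resource_group" "logs" {
  name     = "rg-cross-cloud-logs"
  location = "eastus2"
}
```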
Weaknesses. Like any tool, Terraform is not without its weaknesses.
Terraform requires a remote state file which must be managed. This gives it some powerful attributes, like tracking the real-world state of resources, but it means we need a way to store that file and provide secure remote access to it from our ephemeral CI/CD builder machines.
Which is a good segue into the next weakness: the remote state file contains many clear-text secrets. HashiCorp is working on this, but the current state of the art is that if you want your state file encrypted, you’re building a custom solution yourself. This makes it critical to protect and secure the remote state file; it contains the keys to the castle.
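For context, here’s roughly what wiring up a remote state backend looks like. This is an illustrative sketch rather than our actual configuration; the bucket, table, and key names are made up.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"      # hypothetical bucket name
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                           # server-side encryption at rest
    dynamodb_table = "terraform-state-lock"         # state locking between builders
  }
}
```

Even with encryption at rest, anyone who can read the state can read the secrets inside it, so access to the backend itself has to be locked down tightly.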
Problems to Solve #3
This is a HashiTalk, so you knew where this was going. Infrastructure provisioning tool selected — Terraform!
Now that we’ve picked a tool, we need to run it from somewhere.
Azure DevOps
Because terraform requires a remote state file accessible to all users, and we wanted any of our half-dozen development teams to be able to access the configuration and execute terraform from anywhere, we knew we needed a CI/CD. We considered deploying a self-hosted Jenkins into one of our own accounts, but our needs weren’t (and aren’t) simple. We have a half-dozen teams working concurrently on the same infrastructure, deploying applications to different environments, and somehow tying it all back to the same network so all access is internal and secure. We’d need separation of application code (but not of environment terraform code), permissions would be complex and multi-layered, and we’d need approvals by different groups for different pipelines.
After deciding it would take us about 4 years to build something workable in-house, we decided to hire out. We tested a TON of other CI/CDs to see how they compared, and we haven’t seen any that fit our needs so well. HashiCorp’s Terraform Enterprise runs terraform in an incredibly simple way, but can’t run other programming languages; if you need to compile .NET or run a PowerShell post-script you’re out of luck, or left messing around with a webhook to another CI/CD. GitHub Actions didn’t support the complex pipeline approvals we’d require, and TravisCI didn’t have team segmentation in a workable way.
Azure DevOps (you’ll hear me call this ADO frequently during this talk) has a good reputation as an enterprise CI/CD platform, and on paper seemed to fit our requirements, so we dove in and haven’t looked back.
All in all, the cost of the service is entirely reasonable, particularly for the excellent enterprise feature-set. One feature that we require, since our applications sometimes deal with health care data, is keeping all execution and processing of data within BAA-approved and locked down infrastructure. By which I mean to say — we wanted to configure complex pipelines on ADO, but all application builds and data processing needed to happen on hosts that we controlled. So let’s talk about that.
By the way, I took the picture on this slide from an official Microsoft blog. Because building a CI/CD is like launching a rocket while brewing some type of potion, while constructing a tree out of… rings. Or something.
Problems to Solve #4
We have a home for our pipeline configuration and orchestration. We don’t have anything configured yet, but at least we know where we’ll configure it. Next, we need to register builders to our orchestrator that will execute code in a highly secure way.
Code Execution Hosts
Azure DevOps is a centralized orchestrator for all pipelines and workloads. All configuration of pipelines happens here regardless of where the workload is executed, be it a Microsoft-hosted builder or a self-hosted local linux or windows builder.
As I mentioned earlier, ADO’s default for running and executing code is Microsoft ephemeral servers hosted by Microsoft themselves. Our InfoSec team made a compelling point — what if some health care data passed through these machines, something went wrong, and the data was exposed to another user of ADO’s ephemeral builders? For the low low fee of running our own builders, we could avoid the possibility of that ever happening.
Terraform can execute on any type of host for which a binary exists, which gave us a choice of what type of hosts we’d like to run it on. I respect the hold Windows has on the consumer market, but my heart will always run on *nix, so we deployed Linux builders within every AWS account to execute terraform plans and applies.
For Azure we had no such security concerns. We wanted to offer both a Windows-based cluster that could compile .NET or run PowerShell natively, as well as a Linux cluster that is license-free and natively supports Linux tooling. Because there would be many more of these hosts (handling ALL workloads requires a lot more horsepower than the infrequent terraform planning and applying of the AWS hosts), we went with a more sophisticated solution: Azure scale sets.
Scale sets, in Azure lingo, are hosts built from a shared golden image that are spun up dynamically based on configuration or schedule. On a regular basis (this can be quite frequent if you’d like) we build a new image with updated operating system patches and updated tooling versions. A few hosts are deployed as canaries, and we watch their error rates as they execute code. If issues are identified, the change is rolled back. If error rates remain low, all hosts are migrated to the new image.
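To give a feel for what that looks like in Terraform, here’s a minimal sketch of a Linux agent scale set built from a golden image. The resource names, sizes, counts, and the image and subnet variables are hypothetical, not our real configuration.

```hcl
variable "golden_image_id" {
  type        = string
  description = "ID of the latest baked agent image (e.g. a Shared Image Gallery version)"
}

variable "agent_subnet_id" {
  type        = string
  description = "Subnet the build agents attach to"
}

resource "azurerm_linux_virtual_machine_scale_set" "build_agents" {
  name                = "vmss-ado-build-agents"   # hypothetical name
  resource_group_name = "rg-ado-agents"           # hypothetical resource group
  location            = "eastus2"
  sku                 = "Standard_D2s_v3"
  instances           = 4
  admin_username      = "adoagent"

  # Rolling out a new golden image means updating this one reference
  source_image_id = var.golden_image_id

  admin_ssh_key {
    username   = "adoagent"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  network_interface {
    name    = "agent-nic"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = var.agent_subnet_id
    }
  }
}
```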
Authentication in Azure
Azure DevOps, as you would expect, supports the Azure cloud most easily. Azure DevOps supports registration of Service Connections, which can be built either manually or via an automated wizard built into ADO.
Permissions can be assigned globally, by subscription (AWS users would call this an account), by resource group, or to individual resources.
These service connections link to Service Principals, which are a fancy name for service accounts: machine-focused resource accounts in Azure that are assigned permissions. If this sounds complex, it’s really not. Despite the inherent complexities of running an enterprise-scale permissions scheme, Azure handles it remarkably well. Compare that to AWS IAM, a complex and often unwieldy system that’s been in the news recently for its role in security incidents.
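As a hedged example of that scoping, granting the service principal behind a service connection rights over just one resource group might look something like this in Terraform; the resource group name and role are placeholders.

```hcl
variable "service_principal_object_id" {
  type        = string
  description = "Object ID of the service principal behind the ADO service connection"
}

data "azurerm_resource_group" "app" {
  name = "rg-app-prod"   # hypothetical resource group
}

# Contributor on a single resource group, rather than the whole subscription
resource "azurerm_role_assignment" "pipeline_contributor" {
  scope                = data.azurerm_resource_group.app.id
  role_definition_name = "Contributor"
  principal_id         = var.service_principal_object_id
}
```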
Authentication in AWS
As you might expect, Azure DevOps doesn’t hand-hold you through AWS authorization like it does for Azure. We played around with several add-on modules for ADO that grant the ability to authenticate to AWS accounts, but all were too inflexible for our use case.
The only way we could find to authorize AWS from outside the accounts was to use static IAM credentials. When we explained the basically limitless permissions we’d be providing to Terraform, and also that we’d be using static credentials, the InfoSec team just laughed, and laughed. Definitely a no-go.
We played around with IAM and found a neat and much more secure method of authenticating.
We spin up a linux builder host within the AWS account we want to manage. That ec2 is assigned an IAM role that has basically no permissions. This is illustrated on-screen by the Joker hat on our ec2 instance. The only permission we grant is the ability to assume a higher IAM role when required. Most of the time, however, this ec2 instance has very limited permissions. This is a careful security measure in case this host is compromised somehow.
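Sketched in Terraform, the two-role setup might look roughly like this. The role names and the use of the AWS-managed AdministratorAccess policy are illustrative assumptions, not a copy of our configuration.

```hcl
# Role attached to the builder EC2 instance: nearly powerless on its own
resource "aws_iam_role" "builder" {
  name = "terraform-builder"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# The only permission the builder gets: assume the privileged deploy role
resource "aws_iam_role_policy" "allow_assume" {
  name = "assume-deploy-role"
  role = aws_iam_role.builder.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sts:AssumeRole"
      Resource = aws_iam_role.deploy.arn
    }]
  })
}

# The privileged role Terraform assumes only for the duration of a run
resource "aws_iam_role" "deploy" {
  name = "terraform-deploy"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = aws_iam_role.builder.arn }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "deploy_admin" {
  role       = aws_iam_role.deploy.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}

# Instance profile so the EC2 builder can carry the low-privilege role
resource "aws_iam_instance_profile" "builder" {
  name = "terraform-builder"
  role = aws_iam_role.builder.name
}
```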
When terraform runs, the AWS provider block instructs it to assume an IAM role. It engages the STS service — the Security Token Service — and assumes an administrative IAM role.
It’s good to be the king. While the terraform action is running, the ec2 instance has the ability to do just about anything within the account. As soon as the terraform run is complete, however, the IAM role is released and the ec2 instance returns to being just a joker.
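The provider-side half of that dance is just a few lines; this is a minimal sketch with a placeholder account ID and role name.

```hcl
provider "aws" {
  region = "us-east-1"

  # Elevate to the privileged role only for the duration of this run
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/terraform-deploy"  # placeholder
    session_name = "terraform-ado-pipeline"
  }
}
```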
All of this is done via terraform natively, which is super cool. We didn’t realize this functionality existed at first, so I wrote a bash script that’d do all the heavy lifting. We didn’t end up needing it, but if you do, the link in the bottom right is where to find it.
Problems to Solve #5
With our local hosts executing code securely and locally both within AWS and Azure, our secure code execution problem was solved! Our next challenge was to build the terraform pipelines with ADO.
Terraform GUI Pipelines
For our Terraform ADO pipelines, we decided on two stages that are run at different times.
The “Terraform Plan” stage is a validation stage. It executes terraform init, validate, and plan. It’s run when a pull request is opened in ADO. If any of those validation stages returns a non-zero code, indicating a failure, the pull request is blocked from being merged.
The second stage is for code that has been merged, and only runs when the pipeline is kicked off manually. It first runs the terraform plan stage again to generate a more recent terraform plan, then pauses and waits for a manual approval.
We could automate the “terraform apply” stage kick off — say, when a PR is merged. However, we decided for our business we’d rather a real human read the terraform plan and hit an approval button.
Problems to Solve #6
With our terraform pipelines built and deployed, we were in business. Life was good, and then a problem crept up on us.
Pipeline Construction and Management
This was initially a green-field deployment, so we started with the simplest solution.
Within a single Azure DevOps project, we built release pipelines using the graphical engine. Look, there’s one of our pipelines now. These pipelines are drag and drop, making them incredibly easy for beginners to get started with.
There are obvious limitations to this simple model though.
An early warning sign of the trouble we were in was when we had about 20 individual environment pipelines, and each one defined a version of Terraform to install and prepare for code execution.
As most of you know, Terraform tends to release a new version of their code every few weeks. And every few weeks, I’d need to go into each pipeline, in each stage (plan & apply, remember) and set the new version of terraform, which would take 30 uninterrupted minutes to accomplish.
As we onboarded teams, they needed new environments, so we built new terraform pipelines.
We also started to have teams build their own pipelines in their own projects.
Our goal was to have lots of validation against terraform code within our main terraform project, but have our dev teams execute the code from their projects, which meant they needed their own pipelines.
Their pipelines were exactly the same as our validation pipelines, except theirs said "terraform apply" instead of "terraform plan".
In other words, we were building and maintaining EXACTLY DUPLICATED pipelines.
Suddenly, the number of places we were maintaining terraform pipeline code doubled, then tripled. Our DevOps team’s time gradually filled up with maintaining pipelines.
This couldn’t possibly continue. What would happen when we had 200 pipelines? 500? Something had to change.
Zen
Here’s how DevOps should make us feel — in control, better able to handle expansive workloads through automation. DevOps should free us up to do more interesting things. However…
Ahhhh
At this point we were individually managing about 70 pipelines. We were entirely overwhelmed, both by where we were then and by the prospect of adding anything to the terraform pipelines. If we wanted to add a single item to each pipeline we’d need to make the change 70 times! Because the terraform version is specified in both the "plan" and "apply" stages of each terraform pipeline, something as simple as updating the terraform version means changing the value in 140 places!
Not to mention we can only hope to make the exact same changes when doing this repetitively by hand over a hundred times. More likely, we make at least one mistake, and only learn about it when things break. This is not how to do DevOps well.
This seemed to be an insurmountable problem. Did we need to switch to another CI/CD? Am I really a fraud at this whole DevOps thing?
Problems to Solve #7
At this point we’re forced to walk backwards a little bit. We have a working model for our pipelines, but we’ve reached a growth point where our solution is no longer working. We need a paradigm shift — some different way of managing the pipelines that can be done at scale, either through shared components, templates, or a better way to build pipelines.
Thankfully ADO had a feature come out of beta only a few months before we reached this point.
Pipelines As Code
ADO released a feature that rescued us from this dire situation: YAML pipelines. If infrastructure-as-code is powerful, imagine how powerful pipelines-as-code are.
We could write the same pipelines we had in the GUI, but embed them in code. This sounds small, but it has SO MANY BENEFITS.
For one, managing these pipelines at scale becomes a heck of a lot easier. Updating terraform versions used to take 30 minutes, and now it takes 15 seconds of find-replace against the pipeline directory.
Adding validation steps to all 70 terraform pipelines would have taken a full day, and now it can be done in 20 minutes of copy and paste.
Also, every change to the pipelines needs to go through a peer approval process in a pull request, making the chance of a bad change dramatically lower.
At this point I’m jumping up and down with excitement. And then we found ANOTHER benefit to these code-based pipelines. Remember when I said we were building nearly the EXACT SAME pipelines in multiple projects? We found that the YAML files that define pipelines can be referenced BETWEEN projects, letting us build a single pipeline definition file, then reference it from both places.
With this change we dramatically reduced the amount of time required to maintain these pipelines. We have time to build cool things again! And we have time for the Zen DevOps should make us feel.
Problems to Solve #8
Thank goodness we were able to solve that problem. YAML management of pipelines cleared our roadblock and permitted us to keep growing. There will certainly be other growth-related problems in the future, but for now, we’re able to move forward from putting out fires and into adding cool new things to improve the quality of our code through validation.
Pre-Merge Validation
Terraform as a tool is amazing. It can translate text config into API calls for a huge range of providers. It can do some validation (there is a terraform validate command, after all). However, some provider issues, like capitalization rules for names, aren’t identified as a problem by terraform. The terraform syntax is right, so terraform happily approves the code as valid.
Which isn’t a great situation to be in. In a perfect world, any terraform code we commit to our master branch is fully valid and guaranteed to apply successfully. We want to run as many validations against pull-requested code as we possibly can.
In short, terraform could use some help. Thankfully, the open source community is on it. A tool called TFLint endeavors to validate entered values to make sure they’re solid. If any problems are detected, TFLint throws an error and our validation fails, alerting the PR opener to the problem. The project only supports AWS at this point, but other provider support is coming!
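As a small illustration of the gap TFLint fills, the following is syntactically valid Terraform that `terraform validate` happily accepts, but the misspelled instance type would only blow up at apply time; TFLint’s instance-type rule flags it before merge. The AMI ID is a placeholder.

```hcl
resource "aws_instance" "api" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI
  instance_type = "t2.mirco"               # typo: not a real instance type;
                                           # terraform validate passes, TFLint fails
}
```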
Because of our YAML pipelines, we were able to add this validation step to around 30 AWS pipelines in 10 minutes.
What’s Next For Us?
Today all our modules and environment code are in the same repo. This is great for simplicity, but we’d love to be able to call modules and environment definitions by version. Git has a great solution for this in "git tags": you can mark a version of code, frozen in time, with a tag. These are commonly used in software development as code is promoted through environments to prod. However, git tags are repo-wide, so with every module living in one big repo, it’d be impossible for our many teams to version their modules independently.
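The payoff we’re after looks something like this: an environment pins a module repo to a released tag instead of whatever happens to be on master. The repo URL, tag, and module inputs here are hypothetical.

```hcl
module "vnet" {
  # Pin to the v1.2.0 tag of a module living in its own Azure Repos repo
  source = "git::https://dev.azure.com/example-org/terraform-modules/_git/vnet?ref=v1.2.0"

  # Hypothetical module inputs
  name          = "vnet-prod"
  address_space = ["10.20.0.0/16"]
}
```

Each team could then move to a newer tag on its own schedule, instead of everyone riding the tip of one shared repo.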
We’d love to move each module to its own repo, but that’d require building around 100 repos. Managing them by hand would be our pipeline dilemma all over again, so we need to automate the repo build process in some way. We’re looking at Terraform and hoping it can handle this in the near future.
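One possible shape for that automation, assuming a Terraform provider for Azure DevOps fits the bill: the provider source, resource names, and module list below are assumptions on our part, not something we’ve built yet.

```hcl
terraform {
  required_providers {
    azuredevops = {
      source = "microsoft/azuredevops"
    }
  }
}

variable "module_names" {
  type    = list(string)
  default = ["vnet", "storage-account", "aks-cluster"]  # hypothetical module list
}

variable "project_id" {
  type        = string
  description = "ID of the ADO project that will hold the module repos"
}

# One repo per module, created and managed by Terraform itself
resource "azuredevops_git_repository" "module" {
  for_each   = toset(var.module_names)
  project_id = var.project_id
  name       = "terraform-module-${each.key}"

  initialization {
    init_type = "Clean"
  }
}
```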
Right now when we write new versions of modules, we have to manually validate them. And in the wise words of Bruce Eckel, “If it’s not tested, it’s broken”. We would love to start writing abstract tests against our modules. That’s hard to do in a static way, but what if you could deploy a resource with an example config to a sandbox, run validations against it, and then tear the resource down? That’s exactly what a tool called Terratest aims to do. Terratest is written in Go and can be used to run a barrage of real-world tests against a deployed resource. Word on the street is, this is what HashiCorp proper does against new versions of the modules in their module library.
In the same way we’re testing terraform modules, we’d like to validate that changes to our YAML pipelines are valid before merging a pull request against them. We’re not yet aware of any testing suite for YAML pipelines, but if one exists we’d like to deploy it. If you know of any, please reach out to me! This type of testing would help make a larger portion of our team comfortable making changes to pipelines, which is a win for our whole team.
Problems to Solve #9
Oh, and there’s one more problem to mark as resolved — we did this on our own.
Closing
Growth and transformation are challenges that make engineers stronger. Through this process I’ve become stronger as an engineer, and I hope the lessons presented here help you to become stronger also.
If you have any feedback you’d like to share you can find me on twitter @KyMidd. I’m Kyler Middleton. Thanks everyone!
kyler