Automating Retry for Failed Terraform Launches
We work with a lot of Terraform users to orchestrate and deploy application environments.
As part of the deployment process, we saw a recurring challenge among Terraform users. A number of transient errors with Terraform or the cloud service provider caused specific commands to fail.
Even though simply retrying the Terraform command tends to fix the problem, it was still just as frustrating to our team as it was to the Terraform users we worked with. Users who typically deploy environments in a matter of minutes were pulled away from their day-to-day work just to retry Terraform deployments. It was the kind of additional step that we try to remove for our users.
So we looked into what we can do to automate the process.
First, it helps to understand how our platform works. Quali Torque orchestrates YAML files — which we call blueprints — for application environments directly from the IaC modules defined in Git. Once an administrator is comfortable with the inputs, dependencies, and outputs for the environment defined in the YAML, they can “publish” it, or release it to the teams to deploy. Publishing a blueprint lists it on a self-service catalog within Quali Torque (and makes it available to integrate with other tools). Once published, users can deploy the environment based on that blueprint as many times as needed.
For every Terraform module in each environment, the deployment process executes the Terraform plan as defined in Git. Any failed Terraform command would break this process down. This also applies to the destroy command — any failure in this function will leave environments running (and accruing costs) longer than the developer intended, even though they attempted to terminate it.
To help address this problem, we spoke with a few of our customers to understand some of the most common transient errors that were fixed after simply retrying the command.
Once we settled on an initial list of common errors, we pushed an update that instructed Torque to recognize a failed command caused by one of those errors and automatically retry the Terraform plan in response. Essentially, the platform automates that step so the DevOps engineer doesn’t get pulled away from more important work.
While it sounds like a simple automation, our users have already reported improved success rate for environment deployments, fewer idle cloud resources, and better overall resiliency as a result. This kind of automation is just one way DevOps teams can cut out redundant manual work.