Tech blog

Overcoming Terraform state locking issues with ECS tasks

4-minute read

Vincenzo Zambianchi

7 August 2020


At Simply Business, we've integrated Terraform with our automated deployment pipeline as an easy way of building, configuring and versioning programmable infrastructure for our applications on AWS.

Terraform has many useful features, such as the ability to create templates, modules and provision resources across providers. However, it does not come without some challenges, amongst which is state file locking. In this post, I'll explain how we've implemented running Terraform within AWS ECS tasks to overcome some issues we've encountered with Terraform's state file locking.

The need to maintain a state

One of Terraform's features is its ability to prevent concurrent runs of the Terraform binary, pointing at the same state file, from persisting changes that would leave that state inconsistent. To draw a parallel with a field in which I'm by no means an expert, the problem Terraform's developers faced is the same one that, in database management theory, underpins the ACID guarantees of operations: atomicity, consistency, isolation and durability. In object-oriented programming, it's similar to the concept of 'thread-safe' operations on an object.

By way of example, if we have a single entity that several workers need to manipulate, how do we ensure that every worker starts from a known good state, and that this assumption holds throughout the operation's lifecycle? How do we ensure that the manipulated entity is left in a consistent state that any other worker can pick up? For decades, the answer has been locking, or mutual exclusion (mutexing).
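As a generic illustration of the idea (nothing Terraform-specific, and assuming a Linux system where `flock` from util-linux is available), only one process at a time can enter a section guarded by a lock file:

```shell
# Generic illustration of mutual exclusion: flock guards a critical section
# so that only one process at a time can enter it.
run_exclusive() {
  (
    # -n: fail immediately instead of blocking if another process holds the lock
    flock -n 9 || { echo "lock held elsewhere"; exit 1; }
    echo "holding the lock"
    # ...critical section: mutate the shared entity here...
  ) 9>/tmp/example.lock
}
```

A second process calling `run_exclusive` while the first is still inside the subshell would print "lock held elsewhere" and exit, which is essentially the behaviour Terraform implements for its state file.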

Terraform allows a lock for its state file to be stored on a shared resource such as an AWS DynamoDB table. If a Terraform binary attempts to acquire a lock on a state file that is already locked, an exception is raised and the Terraform run exits.
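As a sketch, a Terraform S3 backend configured for DynamoDB state locking looks something like this (the bucket, key and table names are hypothetical; the table's primary key must be a string attribute named `LockID`):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"        # hypothetical state bucket
    key            = "my-app/terraform.tfstate"  # hypothetical state file path
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"           # table holding the lock entries
  }
}
```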

This is not particularly graceful, especially within a continuous integration / continuous deployment (CI/CD) environment. Software engineers, or in my case security engineers, raise changes that occasionally fail because of this behaviour: there's nothing wrong with the code, it's just a nuisance that causes builds to be rerun unnecessarily.


How we use Terraform

At Simply Business, our CI/CD deployment pipeline integrates with the Jenkins automation server to deploy our applications on AWS. Instead of running Terraform directly using Jenkins, it's run as an AWS ECS task.

This setup has a couple of advantages:

1 - Terraform can run with a dedicated AWS Identity and Access Management (IAM) role, distinct from the role assigned to Jenkins. Permissions are limited to the project's scope, which addresses a common bad practice in AWS management: relying on the Jenkins-assigned IAM role for all infrastructure changes.

2 - The load on Jenkins is considerably reduced. ECS is a highly scalable service; Jenkins is less so, even when deployed as an auto-scaling group from an instance template. The result is that Jenkins agents are protected as the scarce resource, and spinning up ECS tasks becomes a cheap offloading operation.
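For illustration, a trimmed ECS task definition might attach the dedicated role via the `taskRoleArn` field (all names, the account ID and the image tag here are hypothetical, and a real task definition needs additional fields such as CPU and memory):

```json
{
  "family": "my-app-terraform",
  "taskRoleArn": "arn:aws:iam::123456789012:role/my-app-terraform-deploy",
  "containerDefinitions": [
    {
      "name": "my-app-terraform",
      "image": "hashicorp/terraform:0.12.29",
      "command": ["plan"]
    }
  ]
}
```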

There is also an additional by-product: the AWS APIs can be used to check the status of running tasks. We will explore this with an example.

Scenario

Consider a scenario in which two separate builds are started at roughly the same time. The builds progress through the Jenkins pipeline until a Terraform plan or apply operation in either build finds a Terraform state file locked by the other branch and forces an exit. We would like to have a way of overcoming these conflicts.

Let’s consider a couple of options to understand why running Terraform as an ECS task provides a solution to the stated problem.

What if, in our scenario, we were running Terraform on Jenkins as a local container? Containers running on the same host can be listed by querying the local Docker daemon; however, our Jenkins agents are not set up as a Docker cluster (think Kubernetes or Docker services). What happens if the two builds are assigned to different Jenkins agents? In most cases, that would be the norm. How, then, can one build check what's running on a separate Jenkins agent, when each build is completely unaware of the other?

Solution

The solution we've come up with is to check for the status of running ECS tasks and recently exited tasks. If no tasks related to the same project are found, a Terraform action (plan or apply) can be executed.

It's worth noting at this point that locking the Jenkins stage at which Terraform is run is not really an option. Terraform plan and Terraform apply are run in separate Jenkins stages.

Why? This ensures that a Terraform plan action can be run on an integration branch, but the Terraform apply action (and Terraform import, for completeness) can be run only on the main branch. A software engineer must be able to view the planned changes before applying them to the main branch.
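The branch gating described above can be sketched as a small shell helper (a minimal sketch with an assumed helper name; `BRANCH_NAME` is the branch variable Jenkins exposes to the build):

```shell
# Hypothetical sketch: choose the Terraform action based on the branch name.
tf_action() {
  if [ "$1" = "main" ]; then
    echo "apply"   # only the main branch may change infrastructure
  else
    echo "plan"    # every other branch gets a read-only plan
  fi
}
```

In the pipeline, this would drive the actual invocation, along the lines of `terraform "$(tf_action "${BRANCH_NAME}")"`.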

By running Terraform as an ECS task, some of the limitations of state file locking can be overcome.

We cannot use a lock on the Jenkins stages where Terraform is executed: when our main branch progresses to the apply stage, an integration branch may start executing the plan stage, and Jenkins stage locking would allow that. The two branches then compete for the state file lock, and one of them will fail to acquire it.

So we can instead look at which ECS tasks are executing. If the task currently running for our project has not yet returned a valid numeric exit code, we can wait, effectively making our tasks synchronous across different Jenkins stages and agents. That's a useful by-product of running Terraform as ECS tasks, in my opinion.

A couple of shortcomings of this approach derive from how ECS tasks are listed. There is no guarantee on how long it takes for the list-tasks endpoint to include a newly launched task. It is also only possible to filter the results of the list-tasks operation by task definition family name, which is not particularly intuitive and took some time to realise. The safest approach is to introduce a delay of a few seconds before polling the list-tasks endpoint, to let any newly launched tasks appear. No doubt it won't take long before AWS announces new features that cover both of these points.

How to implement ECS tasks for Terraform

And now, time for some code. Here's a sample bash script that implements ECS task synchronisation when running Terraform:

# Get the ARN of the ECS cluster for this environment.
CLUSTER=$(aws ecs list-clusters --region "${REGION}" \
    | jq -r ".clusterArns[] | select(contains(\"${ENVIRONMENT}-${APP_STAGE}\"))")

# Wait while any listed task in the cluster is related to the current app,
# then rerun this check before launching a new task.
waitbool="true"
while [[ "${waitbool}" == "true" ]]
do
    TASKS=$(aws ecs list-tasks --cluster "${CLUSTER}" --region "${REGION}" | jq -r '.taskArns[]')
    # Assume no app-related task is running; reset to true only if one is found.
    waitbool="false"
    for TASK in ${TASKS}
    do
        echo "### Debugging: task ARN is ${TASK}"
        TASK_DESC=$(aws ecs describe-tasks --region "${REGION}" --cluster "${CLUSTER}" --tasks "${TASK}")
        TASK_APP_NAME=$(jq -r '.tasks[0].containers[0].name' <<< "${TASK_DESC}")
        echo "### Debugging: the task app name is ${TASK_APP_NAME}"
        if [[ "${TASK_APP_NAME}" =~ "${APP_NAME}" ]]
        then
            # Poll until the task's container reports a numeric exit code,
            # i.e. until the task has actually finished.
            re='^[0-9]+$'
            CONTAINER_EXIT_CODE=''
            until [[ "${CONTAINER_EXIT_CODE}" =~ ${re} ]]
            do
                echo "### Waiting for a numeric exit code; sleeping for 5s."
                sleep 5
                TASK_DESC=$(aws ecs describe-tasks --region "${REGION}" --cluster "${CLUSTER}" --tasks "${TASK}")
                CONTAINER_EXIT_CODE=$(jq -r '.tasks[0].containers[0].exitCode' <<< "${TASK_DESC}")
            done
            # A matching task was found: re-check the task list before proceeding.
            waitbool="true"
        fi
        echo "### Debugging: the waitbool value is ${waitbool}"
    done
    if [[ "${waitbool}" == "true" ]]
    then
        # Back off for 30-45 seconds at random so that two waiting builds
        # don't re-check the task list in lockstep.
        echo "### Sleeping for a random number of seconds between 30 and 45."
        sleep $(( (RANDOM % 16) + 30 ))
    fi
done

Hopefully these tips on how to overcome Terraform state file locking issues by using ECS tasks will be useful in your infrastructure as code projects.
