Introduction
There is a saying among Linux administrators: with great power comes great responsibility.
The same is true if you decide to use low-level tools (like Boto3) to manage cloud resources. You are in full control of your AWS services, and your responsibility is to handle them properly.
Race condition
A race condition can be difficult to reproduce because the end result is nondeterministic and depends on the relative timing between interfering processes.
In an automated deployment of cloud resources, race conditions often happen when services depend on each other.
The root cause of race condition issues is timing. If a script fires a series of AWS API calls (via Boto3) in quick succession, it may try to use or modify a resource that is not yet fully deployed. This can leave your cloud environment in a corrupted state and cause issues that are hard to track and reproduce.
Tools
You do not have to worry about race conditions when you deploy infrastructure using services like AWS CloudFormation. In that case, the provisioning service is responsible for resolving dependencies between managed resources.
But the deployment of some types of systems cannot be automated using CloudFormation. For instance, Internet of Things (IoT) systems require resources that are not (fully) supported by the CloudFormation service. In those cases, you need to use low-level tools like the AWS SDK for Python (Boto3).
Boto3 Clients provide a low-level interface to AWS (closely related to service APIs). Clients are generated from a JSON service definition file, so they can manage all aspects of a given service.
Examples
Typical cases in which a race condition can appear:
- AWS Identity and Access Management (IAM): you create a Role and immediately invoke AWS Security Token Service (STS) to generate a set of temporary security credentials for that Role
- you generate a certificate for an IoT Thing and immediately try to attach an IoT Policy to it
- you associate a Client device with a newly created Greengrass Core device
- you create a new Greengrass Deployment and immediately run Greengrass discovery to obtain connection information for the Greengrass Core device (the Greengrass Core definition might be out of date and the Greengrass Deployment might still be in progress)
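The first case in the list can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the helper name assume_role_with_retry, its parameters, and the backoff values are hypothetical, and it assumes that a Role which has not yet propagated surfaces as an "AccessDenied" error code from STS. The client is passed in (for example boto3.client("sts")), so the helper itself needs no imports beyond the standard library.

```python
import time


def assume_role_with_retry(sts_client, role_arn, session_name,
                           attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call sts.assume_role, retrying while the new Role is not yet usable."""
    for attempt in range(1, attempts + 1):
        try:
            response = sts_client.assume_role(
                RoleArn=role_arn, RoleSessionName=session_name
            )
            return response["Credentials"]
        except sts_client.exceptions.ClientError as exc:
            code = exc.response["Error"]["Code"]
            # Assumption: a freshly created Role shows up as AccessDenied
            # until IAM has propagated it. Anything else is a real error.
            if code != "AccessDenied" or attempt == attempts:
                raise
            # Linear backoff: wait a little longer after each failure.
            sleep(delay_seconds * attempt)
```

Calling assume_role directly after create_role is exactly the pattern that tends to fail intermittently; wrapping it this way turns the timing problem into a bounded wait.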
Solution
To avoid most of the above problems, you should understand the relationships between AWS services and invoke API calls in the proper order.
In crucial places, add artificial delays to your script. This gives AWS enough time to finish deploying the related services.
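A fixed sleep works, but a delay that polls for readiness usually wastes less time and fails more clearly. The sketch below shows one way to do that, using the Greengrass Deployment case as the example. The helper names wait_until and deployment_succeeded are hypothetical, as are the timeout values; the status strings assume the Greengrass (V1) get_deployment_status call, which reports values such as "Building", "InProgress", "Success", and "Failure". For some services Boto3 also ships ready-made waiters (for example the IAM role_exists waiter), which are worth checking first.

```python
import time


def wait_until(check, timeout_seconds=60.0, interval_seconds=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a truthy value or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("resource was not ready in time")
        sleep(interval_seconds)


def deployment_succeeded(greengrass_client, group_id, deployment_id):
    """Return True once a Greengrass (V1) Deployment reports success."""
    status = greengrass_client.get_deployment_status(
        GroupId=group_id, DeploymentId=deployment_id
    )["DeploymentStatus"]
    if status == "Failure":
        # No point in waiting any longer for a deployment that failed.
        raise RuntimeError("deployment failed")
    return status == "Success"
```

A script would then call wait_until(lambda: deployment_succeeded(client, group_id, deployment_id)) before running discovery, instead of sleeping for a guessed number of seconds.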
Carefully catch and handle exceptions raised by Boto3 Clients: add automated retries where appropriate, and roll back changes in the AWS environment if you detect a corrupted state.
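One possible way to structure the roll-back is to register an undo step right after each successful create call, so a failure halfway through removes whatever was already created. The sketch below does this with contextlib.ExitStack for the IoT certificate case. The function name provision_thing and its parameters are hypothetical, the choice of which resources to undo is an assumption, and the client is expected to be a Boto3 IoT client (boto3.client("iot")) passed in by the caller.

```python
from contextlib import ExitStack


def provision_thing(iot_client, thing_name, policy_name):
    """Create a Thing and a certificate, attach a Policy, undo all on failure."""
    with ExitStack() as rollback:
        iot_client.create_thing(thingName=thing_name)
        rollback.callback(iot_client.delete_thing, thingName=thing_name)

        cert = iot_client.create_keys_and_certificate(setAsActive=True)
        cert_id = cert["certificateId"]
        # Callbacks run in reverse order of registration, so the
        # certificate is deactivated first and deleted afterwards.
        rollback.callback(iot_client.delete_certificate,
                          certificateId=cert_id, forceDelete=True)
        rollback.callback(iot_client.update_certificate,
                          certificateId=cert_id, newStatus="INACTIVE")

        iot_client.attach_policy(policyName=policy_name,
                                 target=cert["certificateArn"])

        # Everything succeeded: discard the undo steps and keep the resources.
        rollback.pop_all()
        return cert
```

If attach_policy raises, the stack unwinds and the exception still propagates to the caller, which can then decide whether to retry the whole sequence.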
Links
Race condition - [link]
Boto3 low-level clients - [link]
AWS CloudFormation - [link]