Generic waiter custom resource
2020-10-17 21:43
In this article, I would like to share a pattern that we’ve employed to untangle the dependencies between CloudFormation stacks/stacksets.
The best way to set up a new AWS account is to drop it into an organizational unit (OU) and have various stacksets deployed to that OU. In this way, the new account quickly picks up the configuration that we have planned for it. However, one of the annoyances of using stacksets is the dependency between them. For example, we would like to have one stackset that enables AWS Config for all accounts/regions, another stackset that deploys the custom config rules, and yet another stackset for the conformance packs; the latter two stacksets depend on the deployment of the first one. However, when stackset instances are deployed to the new account, CloudFormation ignores this dependency, and quite possibly the deployment of the second or third stackset will fail because the first one has not finished deploying. We resolved this dependency issue using a lambda-based custom resource, which is generic enough that in most cases you can apply the snippet to your CloudFormation template without modification. We will not talk about the basics of custom resources in CloudFormation here; please refer to the official documentation for details.
In the following example, we need to create an event rule for an eventbridge that is created in another stackset. We would like to write an !ImportValue and get the name of the eventbridge like this:
SomeRule:
  Type: AWS::Events::Rule
  Properties:
    EventBusName: !ImportValue stackset-event-bridge-name
    EventPattern:
      account:
        - !Ref SomeAccountId
    State: ENABLED
    Targets:
      - Arn: !GetAtt SomeLambda.Arn
        Id: run-some-lambda
This stackset is likely to fail because the !ImportValue items are checked before the deployment starts. We could, no doubt, pass in the name of the eventbridge as a template parameter, but the deployment may still fail if the eventbridge has not been created by the time we create this rule. Logically, we would like to wait there until the eventbridge is ready. So it is natural to think of a lambda that lists all the eventbridge buses and does the waiting. However, the drawback of this approach is its lack of generality: for each service type we have to write some logic to do the query, and we also need to change the permissions of the lambda role to grant that access. So instead, what we have done here is to ask the user to provide an export name like this:
Waiter:
  Type: 'Custom::WaitResources'
  Version: '1.0'
  Properties:
    ServiceToken: !GetAtt WaiterLambda.Arn
    WaitingFor:
      - Id: Eventbridge
        ExportName: stackset-infra-event-bridge-name
The advantages are threefold:
- We can pretty much fix the lambda role for this WaiterLambda. The only policy we need there is:
  - PolicyName: allow-list-stack-exports
    PolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Action:
            - cloudformation:ListExports
          Resource: '*'
- We can apply the same logic to all the queries, which means we don’t have to change the code in our lambda function either.
- We can wait on multiple things and return the values of the exports from this custom resource. As an example, we can change our earlier eventbridge rule to:
SomeRule:
  Type: AWS::Events::Rule
  Properties:
    EventBusName: !GetAtt Waiter.Eventbridge
    EventPattern:
      account:
        - !Ref SomeAccountId
    State: ENABLED
    Targets:
      - Arn: !GetAtt SomeLambda.Arn
        Id: run-some-lambda
Here, the .Eventbridge attribute is returned in the Data field of the custom resource response object. The full lambda code is probably of little interest to most readers. However, I would like to present the function that handles the CloudFormation CREATE event here:
import logging
import time


def create(event):
    """Handle the CloudFormation CREATE event: block until every requested export exists."""
    waiting_for = event["ResourceProperties"]["WaitingFor"]
    data = {}
    while True:
        # Fetch the exports once per polling round instead of once per resource.
        exports = get_exports()
        logging.info(f"Current exports: {exports}.")
        for resource in waiting_for:
            export_name = resource["ExportName"]
            export_id = resource["Id"]
            if export_name in exports:
                # Expose the export value under the user-chosen Id.
                data[export_id] = exports[export_name]
        if len(data) == len(waiting_for):
            logging.info(f"All exports are found: {data}")
            break
        logging.info(f"Some exports are not found. Current data: {data}")
        time.sleep(30)
    return {
        "status": "SUCCESS",
        "PhysicalResourceId": "cfn-waiter",
        "Data": data,
    }
It’s quite straightforward: we have an infinite loop that looks at all the stack exports every 30 seconds, and once all the requested export names are found among them, we break out of the loop and return the values of the exports in a dictionary. Although we have not shown the source code of the get_exports function here, it should be easy to implement; the only advice is not to forget the pagination of exports.
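For readers who want a starting point, here is a minimal sketch of what such a get_exports function could look like. It is not the code from our lambda: it assumes boto3 is available (as it is in the Lambda Python runtime) and uses the list_exports paginator so that every page of exports is read. The optional client parameter is my own addition, there only so the function can be exercised with a stub.

```python
def get_exports(client=None):
    """Return all CloudFormation exports in this account/region as {name: value}.

    `client` is an assumption of this sketch (not part of the original code);
    when omitted, a real boto3 CloudFormation client is created.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be used without boto3

        client = boto3.client("cloudformation")

    exports = {}
    # ListExports is paginated; the paginator follows NextToken for us.
    paginator = client.get_paginator("list_exports")
    for page in paginator.paginate():
        for export in page.get("Exports", []):
            exports[export["Name"]] = export["Value"]
    return exports
```

Returning a plain dictionary keeps the `export_name in exports` check and the `exports[export_name]` lookup in create unchanged.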
Finally, a few caveats for this solution:
- It is to be noted that we are not using !ImportValue, so the lower stack can be modified or removed, which may cause the upper stack to stop functioning properly.
- It is advised to set the timeout of the lambda function to 900 seconds (the maximum). We fully expect the deployment of the stacksets to finish before this timeout, but as far as I know there’s no SLA on it. It’s a good enough approach, not a silver bullet.
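One possible refinement for the timeout caveat, sketched below under my own assumptions rather than taken from our lambda: instead of letting the function be killed at 900 seconds without answering CloudFormation, the loop can watch the remaining execution time and give up early enough to report a failure. The names wait_for_exports, fetch_exports, remaining_ms and safety_margin_ms are hypothetical; in a real handler remaining_ms would typically be the Lambda context’s get_remaining_time_in_millis method.

```python
import time


def wait_for_exports(waiting_for, fetch_exports, remaining_ms,
                     poll_seconds=30, safety_margin_ms=60_000, sleep=time.sleep):
    """Poll until every export is found or the lambda is about to time out.

    Returns (found_all, data). All parameter names are assumptions of this sketch.
    """
    data = {}
    while True:
        exports = fetch_exports()
        for resource in waiting_for:
            if resource["ExportName"] in exports:
                data[resource["Id"]] = exports[resource["ExportName"]]
        if len(data) == len(waiting_for):
            return True, data
        # Give up if another sleep would leave too little time to send a response.
        if remaining_ms() - poll_seconds * 1000 < safety_margin_ms:
            return False, data
        sleep(poll_seconds)
```

In the handler this could be called as `wait_for_exports(event["ResourceProperties"]["WaitingFor"], get_exports, context.get_remaining_time_in_millis)`, sending a FAILED response when the first element of the result is False. The deployment still fails in that case, but it fails promptly with a clear reason instead of hanging until CloudFormation gives up on the custom resource.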