The day that I found a bug on AWS: Use of Capacity Provider

Mar 13, 2020

TL;DR: https://github.com/aws/containers-roadmap/issues/784

With Amazon Web Services we can build amazing blocks on the cloud that can do anything. However, we may get used to this "anything" that "just works". I still find it challenging to explain to a coworker or friend all the setup underlying the infrastructure that I am working on at CyberLabs. Because "it just works". However, this level of abstraction can hurt you badly when you need to debug an issue. We also forget that underneath everything there is software, and software has bugs. Because ALL software has bugs.

Capacity Provider is a feature that AWS launched for ECS at the end of 2019. Its purpose is to make our lives easier when maintaining EC2 instances with Auto Scaling inside an ECS Cluster. The first time I tried to set up an EC2 environment with a Capacity Provider, it was hell. And I gave up. I used the "traditional" EC2 setup for ECS that was used before Capacity Provider came along.

But now, I needed to go back to it. And since technology evolves so fast, even in 3 months, I was willing to give the Capacity Provider a second try.

After 3 exhausting days reading AWS docs and understanding all the underlying setup that you need for a Capacity Provider to work within a cluster (this is a subject for another post), I got the following problem on the ECS Console:

service lays-rabbit was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your capacity provider. For more information, see the Troubleshooting section.

So naturally, I went to the Troubleshooting section. No luck there. At this point, I already had the Capacity Provider registered to the cluster and configured on my ECS Service, which needed to use that specific Capacity Provider to run. And every time I was getting that error.

I got in contact with AWS Tech Support and, together with Mitch, tried to figure out what was going on: if I started the task from the Task Definitions menu, it started without a problem, but if it was started from the service, it didn't work.

Mitch and I ended our chat with him planning to replicate my environment and see if he could find any problem, and I went to bed. Because PST and UTC-3 times don't match.

The next day, I read Mitch's reply asking me to try a couple of things using the AWS CLI and a task definition using Nginx. So far I had been using my Terraform script to set up everything. So I gave it a try. Running a task definition from the AWS CLI and setting up the service worked out of the box. But still, my RabbitMQ didn't.

Then a friend of mine (kudos Henrique =D) told me to run a describe on both ECS Services and check the differences between them, so we could figure out the problem. And that was how I found the bug.
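Here is a minimal sketch of that kind of comparison in Python. It is not the exact procedure we followed. The `flatten` and `diff_services` helpers and the sample data are hypothetical, and `describe_service` assumes that boto3 and AWS credentials are available. The idea is simply to flatten both service descriptions and print only the fields that differ.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b[0].c": value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        items[prefix] = obj
    return items


def diff_services(working, broken):
    """Return {path: (working_value, broken_value)} for every differing field."""
    a, b = flatten(working), flatten(broken)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


def describe_service(cluster, service):
    """Fetch one service description from ECS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the diff helpers work without it

    response = boto3.client("ecs").describe_services(
        cluster=cluster, services=[service]
    )
    return response["services"][0]


if __name__ == "__main__":
    # Hypothetical, trimmed-down descriptions; real ones have many more fields.
    working = {
        "desiredCount": 1,
        "capacityProviderStrategy": [
            {"capacityProvider": "my-provider", "weight": 1}
        ],
    }
    broken = {
        "desiredCount": 1,
        "capacityProviderStrategy": [
            {
                "capacityProvider": "arn:aws:ecs:us-east-1:123456789012:capacity-provider/my-provider",
                "weight": 1,
            }
        ],
    }
    for path, (good, bad) in diff_services(working, broken).items():
        print(f"{path}: {good!r} != {bad!r}")
```

With real descriptions, fields such as the service ARN, timestamps and events will always differ, so expect some noise before you reach the interesting line.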

When using the AWS CLI or Terraform and specifying the Capacity Provider in a service, you can use the full ARN of the resource or just the name. I was using the full ARN, and when I changed it to just the name of the Capacity Provider, everything started to work.
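To illustrate the workaround, here is a small Python sketch that reduces a Capacity Provider reference to its bare name before it goes into a service's strategy. The helper names and the sample ARN are hypothetical, and it assumes the ARN ends in `capacity-provider/<name>`. The `capacityProvider`, `weight` and `base` keys are the ones the ECS API uses for a strategy item.

```python
def capacity_provider_name(reference):
    """Return the bare name whether given a name or a full ARN.

    Assumes ARNs look like
    arn:aws:ecs:<region>:<account>:capacity-provider/<name>.
    """
    if reference.startswith("arn:"):
        return reference.rsplit("/", 1)[-1]
    return reference


def strategy_item(reference, weight=1, base=0):
    """Build one capacityProviderStrategy entry that uses the bare name."""
    return {
        "capacityProvider": capacity_provider_name(reference),
        "weight": weight,
        "base": base,
    }


if __name__ == "__main__":
    # Hypothetical ARN, only for illustration.
    arn = "arn:aws:ecs:us-east-1:123456789012:capacity-provider/my-provider"
    print(strategy_item(arn))
    # prints {'capacityProvider': 'my-provider', 'weight': 1, 'base': 0}
```

In Terraform the equivalent change would be to pass the Capacity Provider's name, not its ARN, in the service's strategy block.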

But no one else on the internet seems to have hit this error, so apparently I am the lucky one who gets these corner-case bugs with AWS.

You may be thinking now: "Ah, I was reading all of this for just a simple bug in the documentation?" Yes. It could be a problem in the docs, or maybe in the software, where we may have other problems in the future.

Later I discovered that I didn't need to specify the Capacity Provider on my ECS Service, since it is already available on the Cluster. I would probably need that only if I needed some specific resource that only a specific Capacity Provider has.
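As a rough sketch of that idea, the Python helper below builds the arguments for an ECS `create_service` call and only adds a `capacityProviderStrategy` when one is explicitly requested. The function name and the sample values are hypothetical. It assumes the cluster already has a default capacity provider strategy configured, which is what ECS falls back to when a service sets neither a strategy nor a launch type.

```python
def create_service_kwargs(cluster, service_name, task_definition,
                          desired_count=1, capacity_provider=None):
    """Build keyword arguments for boto3's ecs.create_service.

    When capacity_provider is None, no strategy (and no launchType) is set,
    so the cluster's default capacity provider strategy applies.
    """
    kwargs = {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": desired_count,
    }
    if capacity_provider is not None:
        # Use the bare name here, not the full ARN.
        kwargs["capacityProviderStrategy"] = [
            {"capacityProvider": capacity_provider, "weight": 1}
        ]
    return kwargs


if __name__ == "__main__":
    # Hypothetical names, only for illustration.
    print(create_service_kwargs("my-cluster", "lays-rabbit", "rabbitmq:1"))
    # To actually create the service (needs boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("ecs").create_service(**create_service_kwargs(...))
```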

I have filed a bug report on the AWS Containers Roadmap, and you can read the full report in the issue linked at the top of this post.

Lays’s on the Clouds