Et tu, AWS?
Today I want to share with you an event that happened to me last week.
From my previous posts, you may have gathered that I work with ECS and deploy applications with Fargate. However, Fargate doesn't support GPU instances, at least for now, and I needed to deploy a TensorFlow model for the application that I am working on here at CyberLabs. So the only way to have that managed by ECS was to use the EC2 launch type. So far so good.
I followed the documentation for setting up an Auto Scaling group for my EC2 instances. I suffered a little bit, but survived.
The next step was to configure the ECS Agent on the instances of that Auto Scaling group so that they would register with the ECS Cluster. And that was where the adventure started.
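For readers who haven't done this step before, here is a minimal sketch of one common way to point the ECS Agent at a cluster: instance user data that writes `ECS_CLUSTER` into `/etc/ecs/ecs.config`. This is not necessarily the exact setup I used. The helper name `render_ecs_user_data` and the cluster name `gpu-cluster` are made up for illustration.

```python
def render_ecs_user_data(cluster_name):
    """Return a user data shell script that tells the ECS Agent which cluster to join.

    Hypothetical helper for illustration only; the cluster name is supplied by the caller.
    """
    if not cluster_name or any(ch.isspace() for ch in cluster_name):
        raise ValueError("cluster name must be non-empty and contain no whitespace")
    # The ECS Agent reads its settings from /etc/ecs/ecs.config on startup.
    return (
        "#!/bin/bash\n"
        f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config\n"
    )


# Example: the resulting script would go into the launch template's user data.
user_data = render_ecs_user_data("gpu-cluster")
```

The returned string is what you would paste (or base64-encode) into the user data field of the launch template or launch configuration behind the Auto Scaling group.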
After all the setup I got this error on the ECS Task:
Timeout waiting for network interface provisioning to complete.
Obviously, the first thing I did was search Google for that error. I found nothing. After that, I dug through the ECS logs and the ECS Agent logs on the EC2 instance and didn't find any useful information, so I decided to open a ticket with AWS Support.
The AWS employee who attended to me on the chat said, after looking at the logs attached to the ticket, that he would need more time to find the issue and that he didn't want to leave me hanging on the chat. So we closed the chat, and I waited about 4 hours until he replied on the ticket, saying:
Above issue can happen due to some issue with the instance health. To confirm, in my lab environment I ran tasks with similar configuration and I was able to see tasks running without any issues.
In order to bring tasks to Running status I suggest to try replacing this container instance with another and let me know if the issue continues to happen.
In other words: "Turn it off and on again."
Et tu, AWS? Well, I wasn't expecting that. Not from AWS. All the instance metrics looked good, but the task wasn't coming alive. And surprisingly, after I terminated my instance and waited for the Auto Scaling group to launch another one for me, the task was up and running. So yeah, maybe you can solve a bug on AWS by turning it off and on again.
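If you would rather script the "off and on again" than click through the console, a rough sketch could look like the one below. It assumes a boto3 Auto Scaling client is passed in. The function name `replace_instance` and the instance ID in the comment are placeholders, not something from my actual setup.

```python
def replace_instance(autoscaling_client, instance_id):
    """Terminate one instance and let its Auto Scaling group launch a replacement.

    `autoscaling_client` is assumed to be a boto3 client created with
    boto3.client("autoscaling"); it is injected so the function is easy to test.
    """
    response = autoscaling_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Keep the desired capacity unchanged so the group starts a new instance.
        ShouldDecrementDesiredCapacity=False,
    )
    # The response describes the scaling activity that the termination started.
    return response.get("Activity", {})


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   activity = replace_instance(boto3.client("autoscaling"), "i-0123456789abcdef0")
```

Leaving the desired capacity alone is what makes the group launch a fresh instance, which is the same thing that happened when I terminated mine by hand.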
That's all, folks.