How To: Set a time limit on your AWS ECS tasks
You know how you can set timeouts in Lambda? Well, with a bit of bash magic, you can do the same in ECS as well.
Just run your code from another script with timeout functionality. This article will use a `bash` script, but you can do this with other languages too.
Introduction
WHERE TO RUN YOUR CODE ON AWS!?!?!?
LAMBDA vs ECS
OOH, THE BIG, BIG QUESTION
... Well, the one thing that ECS has over Lambda is that it doesn't have a hard time limit.
- Lambda has a maximum time limit of 15 minutes.
- ECS doesn't have a time limit at all. Your task can run for days, months or years.
This is one of the selling points for ECS over Lambda. Usually.
But recently I stumbled across a situation where I wanted a hard timeout on my ECS tasks. I wanted my tasks to be stopped forcibly if they ran for too long.
It turns out, ECS doesn't support this out of the box. So I went out of my way to implement a hard timeout on my ECS tasks with some scripting magic. Today I will share this with you all.
Case Study
I manage a project where we run some simulations of a mine. This simulation runs every minute. The results of the simulations are relayed to the people at the mines and are used for operation control. So, the simulations have to be timely and up-to-date at all times.
At first, we were planning to use Lambda for running the simulation. However, the Python script that runs the simulation has some multiprocessing operations that aren't supported in the Lambda execution environment. Therefore, we opted to use ECS on Fargate instead.
After the migration to ECS, the project was running smoothly. We were pretty happy with how things turned out. As a bonus, we found some hidden advantages of ECS over Lambda as well:
- We had more control over the compute allocated to each Fargate task, which sped up each simulation run.
- We implemented a SOCI (Seekable OCI) index to speed up container image pull times.
Our simulation process takes less than a minute to run, meaning our people at the mines are happy.
HOWEVER... All good times must come to an end 😭
One Monday morning, I came into work and was told that the simulation hadn't been updating for a few days. Timidly, I opened up the simulation dashboard to see the words:
"Simulation age: 35k seconds"
Bright red, signalling that there was an issue in the simulation process.
I scrambled around and opened a new tab to access the AWS console. I went straight to CloudWatch and checked the ECS task log group.
Nothing. No simulation logs whatsoever.
Where are my ECS tasks?
In another tab, I opened up EventBridge, which was responsible for spawning new ECS tasks. I wanted to see if the schedule was tampered with.
It should be set to run a simulation every minute. But maybe the settings for the dev environment, with a much lower simulation frequency, were accidentally applied to the prod environment as well.
But no. The EventBridge scheduler was indeed displaying a schedule of `rate(1 minutes)`.
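(If you'd rather check this from the terminal: assuming the schedule lives in EventBridge Scheduler and is called something like `prod-simulation-schedule` - both assumptions on my part - a quick sanity check might look like the snippet below. For a classic EventBridge rule, `aws events describe-rule` would be the equivalent.)

```sh
# Hypothetical schedule name - print the rate expression actually deployed
aws scheduler get-schedule --name prod-simulation-schedule \
    --query 'ScheduleExpression' --output text
```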
Running out of options, I opened up the ECS console to check the ECS service health.
And there I saw: 100 ECS tasks, all 3 days old, still in RUNNING status.
I changed the filters to show stopped tasks as well. It turns out EventBridge was doing its job after all - there was a new task spawning every minute. But each one was failing straight away because there were no free IP addresses left in the subnets the tasks launch into.
So why were there 3-day-old tasks still hanging around in a running state? I opened one of the tasks and found Python exceptions in the logs. It seems one of the data sources had been temporarily unreachable. However, this was only happening in some of the worker processes spawned by the multiprocessing code. The parent process was waiting indefinitely for the simulation results from the failed workers.
The fix was easy - I forcibly stopped all the tasks that were running in the cluster. That freed up the IP addresses again. Since the data source issue was temporary, the new tasks all ran their simulations without issue, and the system was back online.
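In case you ever need to do the same cleanup from the CLI, here is a rough sketch. The cluster name is hypothetical, and this stops every RUNNING task in the cluster, so use with care:

```sh
#!/bin/sh
# Hypothetical cluster name - adjust to your environment
CLUSTER=simulation-cluster

# List every RUNNING task in the cluster, then stop each one
for TASK in $(aws ecs list-tasks --cluster "$CLUSTER" \
    --desired-status RUNNING --query 'taskArns[]' --output text); do
    aws ecs stop-task --cluster "$CLUSTER" --task "$TASK" \
        --reason "Manual cleanup: task hung past its expected runtime"
done
```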
However, we couldn't afford to have this happen again. Therefore, I opted to implement a hard timeout on my simulation tasks.
3...2...1... Timeout
My simulations are supposed to be short-lived anyway. They should finish in around 1 minute. If it takes any longer than that, it usually means there is some sort of network issue.
Not only that, but the longer it takes for the simulation to run, the less value the simulation provides. In this project, the freshness of our simulations is most important. We don't need simulations from 5-10 minutes ago - we need live data.
So to prevent long-lived simulations, it made sense to implement a hard time limit on our simulation tasks. I decided to set a hard timeout of 2 minutes. If my ECS tasks take any longer than that, the task is forcefully killed.
... And so we come back to the introduction of this post. AWS ECS does NOT support a `timeout` parameter at the moment. (There is an open GitHub issue for this - go and show your support for the cause!) Unlike Lambda, we need some ✨magic✨ to make this happen.
The solution
My Dockerfile used to call my Python script as the `ENTRYPOINT` directly:
```dockerfile
...

# COPY my code to the Docker image
COPY main.py ${LAMBDA_TASK_ROOT}/main.py

# Set the entrypoint to the Python script
ENTRYPOINT ["/usr/local/bin/python", "main.py"]
```
To implement the timeout, I created a bash script that calls my Python script instead. I use the `timeout` command, which forcefully kills my Python script if it takes more than 120 seconds.
```sh
#!/bin/sh

# Run the simulation with a timeout of 2 minutes
timeout 120s /usr/local/bin/python main.py

# Check if the task timed out
STATUS=$?
if [ $STATUS -eq 124 ]; then
    echo "ERROR: The simulation timed out."
fi

exit $STATUS
```
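One thing to be aware of: `timeout` sends SIGTERM by default, and a truly wedged process can ignore that. If that's a concern, a slightly more aggressive variant escalates to SIGKILL. This is a sketch assuming the GNU coreutils version of `timeout` (which the Debian-based official Python images ship with):

```sh
#!/bin/sh
# 120s soft limit (SIGTERM), then SIGKILL 30s later if the process is still alive
timeout --kill-after=30s 120s /usr/local/bin/python main.py

STATUS=$?
# 124 = timed out and exited on SIGTERM; 137 (128 + 9) = killed by the follow-up SIGKILL
if [ $STATUS -eq 124 ] || [ $STATUS -eq 137 ]; then
    echo "ERROR: The simulation timed out."
fi

exit $STATUS
```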
I changed my Dockerfile to call my new bash script as the `ENTRYPOINT`:
```dockerfile
...

# Copy the main function code and the entrypoint script
COPY main.py ${LAMBDA_TASK_ROOT}/main.py
COPY docker-entrypoint.sh ${LAMBDA_TASK_ROOT}/docker-entrypoint.sh

# Set the entrypoint of the container to be the bash script
ENTRYPOINT ["/bin/sh", "./docker-entrypoint.sh"]
```
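If you want to confirm the timeout is actually firing, the script's exit code surfaces on the stopped ECS task. Something like the following (the cluster name and `$TASK_ARN` are placeholders) should show an exit code of 124 when the timeout kicked in:

```sh
# Placeholders: substitute your own cluster name and stopped task ARN
aws ecs describe-tasks --cluster simulation-cluster --tasks "$TASK_ARN" \
    --query 'tasks[0].containers[0].{exitCode: exitCode, reason: reason}'
```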
And that's it!

Looking back, I could have implemented the timeout inside the `main.py` file as well (with something like Python's built-in signal alarms). That would save me from creating a new bash script... Oh well! 🤷
Conclusion
AWS ECS team pls implement timeouts 🙏