How To: Set a time limit on your AWS ECS tasks

You know how you can set timeouts in Lambda? Well, with a bit of bash magic, you can do the same in ECS as well.

💡
TL;DR
Just run your code from another script that enforces a timeout. This article uses a bash script, but you can do the same with other languages too.

Introduction

WHERE TO RUN YOUR CODE ON AWS!?!?!?

LAMBDA vs ECS

OOH, THE BIG, BIG QUESTION

... Well, the one thing that ECS has over Lambda is that it doesn't have a hard time limit.

  • Lambda has a maximum time limit of 15 minutes.
  • ECS doesn't have a time limit at all. Your task can run for days, months or years.

This is one of the selling points for ECS over Lambda. Usually.

But recently I stumbled across a situation where I wanted a hard timeout on my ECS tasks. I wanted my tasks to be stopped forcibly if they ran for too long.

It turns out, ECS doesn't support this out of the box. So I went out of my way to implement a hard timeout on my ECS tasks with some scripting magic. Today I will share this with you all.

Case Study

I manage a project where we run some simulations of a mine. This simulation runs every minute. The results of the simulations are relayed to the people at the mines and are used for operation control. So, the simulations have to be timely and up-to-date at all times.

At first, we were planning to use Lambda for running the simulation. However, the Python script that runs the simulation relies on some multiprocessing operations that aren't supported in the Lambda execution environment. Therefore, we opted to use ECS and Fargate instead.

After the migration to ECS, the project was running smoothly. We were pretty happy with how things turned out. As a bonus, we also found some hidden advantages of ECS over Lambda:

  • We had more control over the compute power of the Fargate instances, which sped up each simulation task.
  • We implemented a SOCI (Seekable OCI) index to speed up container image pull times.

Our simulation process takes less than a minute to run, meaning our people at the mines are happy.

HOWEVER... All good times must come to an end 😭

One Monday morning, I came into work and was told that the simulation hadn't been updating for a few days. Timidly, I opened up the simulation dashboard to see the words:

"Simulation age: 35k seconds"

Bright red, signalling that there was an issue in the simulation process.

I scrambled around and opened a new tab to access the AWS console. I went straight to CloudWatch and checked the ECS task log group.

Nothing. No simulation logs whatsoever.

Where are my ECS tasks?

In another tab, I opened up EventBridge, which was responsible for spawning new ECS tasks. I wanted to see if the schedule had been tampered with.

It should be set to run a simulation every minute. But maybe the settings for the dev environment, with a much lower simulation frequency, were accidentally applied to the prod environment as well.

But no. The EventBridge scheduler was indeed showing a schedule of rate(1 minutes).
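
(As an aside, you can verify the same thing from the AWS CLI. A minimal sketch, assuming an EventBridge Scheduler schedule named simulation-schedule, which is a placeholder name:)

# Print the schedule expression of the (hypothetical) schedule
aws scheduler get-schedule \
  --name simulation-schedule \
  --query 'ScheduleExpression' \
  --output text
# Should print something like: rate(1 minutes)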

Running out of options, I opened up the ECS console to check the ECS service health.

And there I saw it: 100 ECS tasks, all 3 days old, still in RUNNING status.

I changed the filters to show stopped tasks as well. It turns out EventBridge was doing its job after all - there was a new task spawning every minute. But it was failing straight away because there were no more IP addresses available in the account.

So why were there 3-day-old tasks still hanging around in a running state? I opened one of the tasks and found Python exceptions in the logs. It seems one of the data sources had been temporarily unreachable. However, the failure only occurred in some of the workers spawned by the multiprocessing code. The parent process of the multiprocessing script was waiting indefinitely for the simulation results from the failed workers.

The fix was easy - I forcibly stopped all tasks that were running in the cluster. That freed up all the IP addresses in the account. Since the data source issue was temporary, all new tasks were running the simulations without issue, and the system was back online.
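
For anyone facing a similar pile-up, the force-stop can be scripted with the AWS CLI. A rough sketch, assuming a cluster named simulation-cluster (a placeholder):

#!/bin/sh

# Stop every RUNNING task in the cluster.
# "simulation-cluster" is a placeholder cluster name.
for TASK_ARN in $(aws ecs list-tasks \
    --cluster simulation-cluster \
    --desired-status RUNNING \
    --query 'taskArns[]' \
    --output text); do
  aws ecs stop-task \
    --cluster simulation-cluster \
    --task "$TASK_ARN" \
    --reason "Manual cleanup: task exceeded expected runtime" > /dev/null
done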

However, we couldn't afford to have this happen again. Therefore, I opted to implement a hard timeout on my simulation tasks.

3...2...1... Timeout

My simulations are supposed to be short-lived anyway. They should finish in around 1 minute. If they take any longer than that, it usually means there is some sort of network issue.

Not only that, but the longer it takes for the simulation to run, the less value the simulation provides. In this project, the freshness of our simulations is most important. We don't need simulations from 5-10 minutes ago - we need live data.

So to prevent long-lived simulations, it made sense to implement a hard time limit on our simulation tasks. I decided to set a hard timeout of 2 minutes. If a task takes any longer than that, it is forcefully killed.

... And so we come back to the introduction of this post. AWS ECS does NOT support a timeout parameter at the moment. (There is an open GitHub issue for this, go and show your support for the cause!) Unlike Lambda, we need some ✨magic✨ to make this happen.

The solution

My Dockerfile used to call my Python script as the ENTRYPOINT directly:

...

# COPY my code to the Docker image
COPY main.py ${LAMBDA_TASK_ROOT}/main.py

# Set the entrypoint to the Python script
ENTRYPOINT ["/usr/local/bin/python", "main.py"]

To implement the timeout, I created a bash script that calls my Python script instead. It uses the timeout command, which forcefully terminates my Python script if it runs for more than 120 seconds.

#!/bin/sh

# Run the simulation with a timeout of 2 minutes
timeout 120s /usr/local/bin/python main.py

# Check if the task timed out
STATUS=$?
if [ $STATUS -eq 124 ]; then
  echo "ERROR: The simulation timed out."
fi

exit $STATUS
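
One caveat: timeout sends SIGTERM by default, which a stuck process can in principle catch or ignore. If your base image ships the GNU coreutils version of timeout (an assumption worth checking), you can follow up with a SIGKILL for a truly forceful stop. A hedged variation of the script above:

#!/bin/sh

# Run the simulation with a timeout of 2 minutes.
# --kill-after=10s sends SIGKILL if the process is still alive
# 10 seconds after the initial SIGTERM (GNU coreutils timeout only).
timeout --kill-after=10s 120s /usr/local/bin/python main.py

# 124 = timed out (SIGTERM worked), 137 = killed with SIGKILL
STATUS=$?
if [ $STATUS -eq 124 ] || [ $STATUS -eq 137 ]; then
  echo "ERROR: The simulation timed out."
fi

exit $STATUS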

I changed my Dockerfile to call my new bash script as the ENTRYPOINT:

...

# Copy the main function code and the entrypoint script
COPY main.py ${LAMBDA_TASK_ROOT}/main.py
COPY docker-entrypoint.sh ${LAMBDA_TASK_ROOT}/docker-entrypoint.sh

# Set the entrypoint of the container to be the bash script
ENTRYPOINT ["/bin/sh", "./docker-entrypoint.sh"]

And that's it!
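
If you want to sanity-check the behaviour before deploying, you can build and run the image locally and inspect the exit code. A quick sketch, with simulation:latest as a placeholder tag:

# Build the image and run it locally.
# If main.py runs past the 2-minute limit, the container
# should exit with status 124 and log the timeout error.
docker build -t simulation:latest .
docker run --rm simulation:latest
echo "Exit code: $?"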

💡
Looking back, I think I could've implemented the timeout function inside my main.py file as well. That would have saved me from creating a new bash script... Oh well! 🤷

Conclusion

AWS ECS team pls implement timeouts 🙏
