How To: Set a time limit on your AWS ECS tasks

You know how you can set timeouts in Lambda? Well, with a bit of bash magic, you can do the same in ECS as well.

💡
TL;DR
Just run your code from another script with a timeout functionality. This article will use a bash script, but you can do this with other languages too.

Introduction

WHERE TO RUN YOUR CODE ON AWS!?!?!?

LAMBDA vs ECS

OOH, THE BIG, BIG QUESTION

... Well, the one thing that ECS has over Lambda is that it doesn't have a hard time limit.

  • Lambda has a maximum time limit of 15 minutes.
  • ECS doesn't have a time limit at all. Your task can run for days, months, or years.

This is one of the selling points for ECS over Lambda. Usually.

But recently I stumbled across a situation where I wanted a hard timeout on my ECS tasks: I wanted runaway tasks to be stopped forcibly.

It turns out, ECS doesn't support this out of the box. So I went out of my way to implement a hard timeout on my ECS tasks with some scripting magic. Today I will share this with you all.

Case Study

I manage a project where we run some simulations of a mine. This simulation runs every minute. The results of the simulations are relayed to the people at the mines and are used for operation control. So, the simulations have to be timely and up-to-date at all times.

At first, we were planning to use Lambda for running the simulation. However, the Python script that runs the simulation relies on multiprocessing operations that aren't supported by Lambda's execution environment. Therefore, we opted to use ECS and Fargate instead.

After the migration to ECS, the project was running smoothly. We were pretty happy with how things turned out. As a bonus, we found some hidden advantages of ECS over Lambda as well:

  • We had more control over the compute power of the Fargate instances, which sped up each simulation task.
  • We implemented a SOCI (Seekable OCI) index to speed up the container image pull times.

Our simulation process takes less than a minute to run, meaning our people at the mines are happy.

HOWEVER... All good times must come to an end 😭

One Monday morning, I came into work and was told that the simulation hadn't been updating for a few days. Timidly, I opened up the simulation dashboard to see the words:

"Simulation age: 35k seconds"

Bright red, signalling that there is an issue in the simulation process.

I scrambled to open a new tab and access the AWS console. I went straight to CloudWatch and checked the ECS task log group.

Nothing. No simulation logs whatsoever.

Where are my ECS tasks?

In another tab, I opened up EventBridge, which was responsible for spawning new ECS tasks. I wanted to see if the schedule was tampered with.

It should be set to run a simulation every minute. But maybe the settings for the dev environment, with a much lower simulation frequency, were accidentally applied to the prod environment as well.

But no. The EventBridge scheduler was indeed displaying a schedule of rate(1 minute).

Running out of options, I opened up the ECS console to check the ECS service health.

And there I saw 100 ECS tasks, all three days old and still in RUNNING status.

I changed the filters to show stopped tasks as well. It turns out EventBridge was doing its job after all - there was a new task spawning every minute. But it was failing straight away because there were no more IP addresses available in the account.

So why were there 3-day-old tasks still hanging around, active? I opened one of the tasks and found Python exceptions in the logs. It seems one of the data sources had been temporarily unreachable. However, this was only occurring in some of the worker processes spawned by the multiprocessing code. The master process was waiting indefinitely for the simulation results from the failed workers.

The fix was easy - I forcibly stopped all tasks that were running in the cluster. That freed up all the IP addresses in the account. Since the data source issue was temporary, all new tasks were running the simulations without issue, and the system was back up online.

However, we couldn't afford to have this happen again. Therefore, I opted to implement a hard timeout on my simulation tasks.

3...2...1... Timeout

My simulations are supposed to be short-lived anyway. They should finish in around 1 minute. If it takes any longer than that, it usually means there is some sort of network issue.

Not only that, but the longer it takes for the simulation to run, the less value the simulation provides. In this project, the freshness of our simulations is most important. We don't need simulations from 5-10 minutes ago - we need live data.

So to prevent long-lived simulations, it made sense to implement a hard time limit on our simulation tasks. I decided to set a hard timeout of 2 minutes. If my ECS tasks take any longer than that, the task is forcefully killed.

... And so we come back to the introduction of this post. AWS ECS does NOT support a timeout parameter at the moment. (There is an open GitHub issue for this, go and show your support for the cause!) Unlike Lambda, we need some ✨magic✨ to make this happen.

The solution

My Dockerfile used to call my Python script as the ENTRYPOINT directly:

...

# COPY my code to the Docker image
COPY main.py ${LAMBDA_TASK_ROOT}/main.py

# Set the entrypoint to the Python script
ENTRYPOINT ["/usr/local/bin/python", "main.py"]

To implement the timeout, I created a bash script that calls my Python script instead. It uses the timeout command, which sends my Python script a SIGTERM if it runs for more than 120 seconds.

#!/bin/sh

# Run the simulation with a timeout of 2 minutes
timeout 120s /usr/local/bin/python main.py

# Check if the task timed out
STATUS=$?
if [ $STATUS -eq 124 ]; then
  echo "ERROR: The simulation timed out."
fi

exit $STATUS
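Before baking this into an image, it's easy to sanity-check timeout's exit codes locally. A minimal sketch, using sleep as a stand-in for the simulation script:

```shell
#!/bin/sh
# Quick local check of timeout's exit codes (sleep stands in for main.py).

timeout 5s sleep 1        # finishes within the limit
FAST_STATUS=$?            # the command's own exit status: 0

timeout 1s sleep 10       # exceeds the limit, gets sent SIGTERM
SLOW_STATUS=$?            # GNU timeout's "timed out" status: 124
```

Note that 124 is the behaviour of GNU coreutils' timeout, which Debian-based Python images ship. If your base image uses a different implementation (BusyBox, for example), double-check the exit status it reports on a timeout before relying on it.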

I changed my Dockerfile to call the new bash script as the ENTRYPOINT:

...

# Copy the main function code and the entrypoint script
COPY main.py ${LAMBDA_TASK_ROOT}/main.py
COPY docker-entrypoint.sh ${LAMBDA_TASK_ROOT}/docker-entrypoint.sh

# Set the entrypoint of the container to be the bash script
ENTRYPOINT ["/bin/sh", "./docker-entrypoint.sh"]

And that's it!

💡
Looking back, I think I could've implemented the timeout inside my main.py file as well. That would have saved me from creating a new bash script... Oh well! 🤷
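For the curious, here's a minimal sketch of what an in-process timeout could look like, using Python's signal.alarm. This is Unix-only, and the helper name run_with_timeout is my own invention for illustration, not code from the actual project:

```python
import signal


def run_with_timeout(func, seconds):
    """Run func(); return True if it finished before the alarm, False if it timed out."""

    class _Timeout(Exception):
        pass

    def _handler(signum, frame):
        # Raised in the main thread when the alarm fires, interrupting func().
        raise _Timeout()

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)  # schedule a SIGALRM in `seconds` seconds
    try:
        func()
        return True
    except _Timeout:
        return False
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

One caveat: signal.alarm only interrupts the main thread of the main process, so workers spawned by multiprocessing that hang on their own would still need separate handling. That's part of why I'm happy with the external timeout wrapper.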

Conclusion

AWS ECS team pls implement timeouts 🙏
