AWS ParallelCluster Monitoring

Proactive AWS ParallelCluster Monitoring with Slack

This is an ongoing experiment to test and learn about different ways to monitor Slurm HPC clusters built with AWS ParallelCluster. I will keep this page updated, and if there is interest, we can publish the repo with the Terraform plans and Lambda helpers that we use for the current implementation.

Questions or comments? Feel free to reach out directly at dag@bioteam.net

ParallelCluster Monitoring Pain Points

Not all HPC errors are user- or job-related, especially on AWS, where computational resources are often dynamically created with a lot of “invisible” AWS services orchestrating in the background.

Scientific end-users don’t have visibility into HPC cluster problems involving infrastructure or AWS issues such as vCPU quotas or capacity shortfalls. Instead, users are presented with error responses like this:

ubuntu@login:~$ srun -p s-mgpu-48c-192g-3t --pty /bin/bash -i
srun: error: Node failure on s-mgpu-48c-192g-3t-dy-g612xl-1
srun: error: Nodes s-mgpu-48c-192g-3t-dy-g612xl-1 are still not ready
srun: error: Something is wrong with the boot of the nodes.

Compounding this is that transient problems can become semi-permanent ones requiring human intervention to resolve and repair. When certain failure counts exceed a configurable threshold, AWS ParallelCluster will proactively place a cluster or Slurm partition into a special “Protected Mode” status that can ONLY be cleared manually by an HPC administrator. The default threshold appears to be 10 failures before Protected Mode is triggered.

This means that even “transient” AWS issues, like temporary EC2 capacity shortfalls in a regional AZ that clear over time, can still result in an unusable Slurm partition until an administrator clears the protected status manually.

Example

Here is a recent example from a cluster that was undergoing saturation load testing. Several of the Slurm partitions on this cluster can scale up to 256 compute nodes. By default there are zero compute nodes as they are all dynamically created only when jobs are pending. 

Although the AWS account in question had ample EC2 vCPU quota to run far more than 250 concurrent servers, AWS was unable to fulfill the fleet creation request because it had no additional capacity at the time the tests were being run. 

As a result, after 10 consecutive failures to create new compute nodes, the cluster was placed into Protected status and the Slurm partitions were marked as down/inactive, rendering them unable to run any jobs at all. 

What the scientific end-user sees:

ubuntu@login:~$ sinfo
PARTITION          AVAIL  TIMELIMIT  NODES  STATE NODELIST
default-q*            up   infinite      1  idle~ default-q-dy-c62xl-1
default-q*            up   infinite      1   idle default-q-st-c62xl-1
s-2c-4g-118g          up 3-00:00:00     20  idle~ s-2c-4g-118g-dy-c6l-[1-20]
d-2c-4g-118g          up 3-00:00:00     20  idle~ d-2c-4g-118g-dy-c6l-[1-20]
s-4c-8g-474g          up 3-00:00:00     20  idle~ s-4c-8g-474g-dy-c62xl-[1-20]

d-4c-8g-237g       inact 3-00:00:00    240  idle~ d-4c-8g-237g-dy-c6xl-[1-7,9-11,13-19,21-52,54-55,57-66,68-144,146-147,149-151,153-177,179-189,191-202,204-205,207-228,230-233,235-251,253-256]
d-4c-8g-237g       inact 3-00:00:00     16  down~ d-4c-8g-237g-dy-c6xl-[8,12,20,53,56,67,145,148,152,178,190,203,206,229,234,252]
d-4c-16g-237g      inact 3-00:00:00    229  idle~ d-4c-16g-237g-dy-m6xl-[1-20,22-41,43-71,73-82,84-105,108-113,115-120,122-128,130,132-148,150,153-154,156,158-170,172-183,185-189,191,193,195-197,199-214,217-218,220-235,237-245,247,249-256]

d-4c-16g-237g      inact 3-00:00:00     27  down~ d-4c-16g-237g-dy-m6xl-[21,42,72,83,106-107,114,121,129,131,149,151-152,155,157,171,184,190,192,194,198,215-216,219,236,246,248]
d-8c-61g-2t           up 3-00:00:00     20  idle~ d-8c-61g-2t-dy-i32xl-[1-20]
d-12c-96g-7t          up 3-00:00:00     20  idle~ d-12c-96g-7t-dy-i33xl-[1-20]
d-16c-128g-300g       up 3-00:00:00     10  idle~ d-16c-128g-300g-dy-r54xl-[1-10]
s-gpu-4c-8g-250g      up 3-00:00:00     10  idle~ s-gpu-4c-8g-250g-dy-g6xl-[1-10]
s-mgpu-48c-192g-3t    up 3-00:00:00      2  idle~ s-mgpu-48c-192g-3t-dy-g612xl-[1-2]
d-gpu-4c-8g-250g      up 3-00:00:00     10  idle~ d-gpu-4c-8g-250g-dy-g6xl-[1-10]
d-mgpu-48c-192g-3t    up 3-00:00:00      2  idle~ d-mgpu-48c-192g-3t-dy-g612xl-[1-2]
d-128c-256g-7t        up 3-00:00:00      1  idle~ d-128c-256g-7t-dy-c632xl-1
d-48c-96g-3t          up 3-00:00:00      2  idle~ d-48c-96g-3t-dy-c612xl-[1-2]
s-128c-256g-7t        up 3-00:00:00      1  idle~ s-128c-256g-7t-dy-c632xl-1
s-48c-96g-3t          up 3-00:00:00      2  idle~ s-48c-96g-3t-dy-c612xl-[1-2]

That is what the user sees — Slurm partitions that are down and unusable.

What IT, HPC and Cloud Admins can see (if they know where to look!)

The “reason” for the inactive Slurm partition is buried inside one of the many ParallelCluster log files created on the HeadNode and streamed into a CloudWatch Log Group (if that setting is enabled):

2026-01-23 18:06:34,184 - [slurm_plugin.clustermgtd:_handle_protected_mode_process] - WARNING - Cluster is in protected mode due to failures detected in node provisioning. Please investigate the issue and then use 'pcluster update-compute-fleet --status START_REQUESTED' command to re-enable the fleet
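The log message points at the fix: recovering is a manual step for the administrator. With the ParallelCluster v3 CLI it looks something like this (the cluster name and region below are placeholders):

```shell
# Re-enable the compute fleet after investigating and clearing the root cause
pcluster update-compute-fleet \
  --cluster-name my-cluster \
  --region us-east-1 \
  --status START_REQUESTED
```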

Why this is embarrassing

As someone who has spent his entire career supporting scientists working on data-intensive life science infrastructure, it’s always embarrassing when end users notice a problem before IT does. These are the IT problems I like to get ahead of when I can, because it never feels good when the first sign of a problem is end users being inconvenienced or pipelines grinding to a halt. 

ParallelCluster is great at Monitoring but it does not Alert humans by default

I love ParallelCluster; it gets better with every release. It’s been great to see how much the CloudWatch dashboards, log groups, composite alarms and metrics have improved over time. Every cluster I deploy has monitoring turned on, although with a very conservative retention time, because CloudWatch Log Groups that never expire are financially wasteful when not actually needed for compliance, audit or regulatory reasons.

My cluster config files usually contain this Monitoring block:

Monitoring:
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 7

I find that seven-day retention is reasonable for most use cases; it’s rare for me to want anything older than that. Beyond seven days I usually only care about Slurm accounting logs, not cluster log files.

To see what AWS ParallelCluster does for you by default, review this URL: https://docs.aws.amazon.com/parallelcluster/latest/ug/monitoring-overview.html.

Out of the box you get the following:

  • All of the interesting logs streamed into an organized format under a per-cluster CloudWatch Log Group
  • A custom CloudWatch Dashboard for each cluster with easy access to log streams, metrics and alarms
Image source: https://docs.aws.amazon.com/images/parallelcluster/latest/ug/images/CW-dashboard.png

The things I’m most interested in seeing, however, are somewhat hidden away as graphed composite metrics:

[Screenshot: graphed composite metrics on the per-cluster CloudWatch dashboard]

Each one of those graphed composite items is something that I, as an HPC support person or cluster operator, would like to know about, ideally as soon as possible. I am very interested in:

  • Instance Provisioning Errors
  • Unhealthy Instance Errors
  • Custom Action Errors

My ParallelCluster Monitoring/Alerting Goal

  1. Learn about ParallelCluster issues before end-users report them
  2. Real-time or near-real-time alerts for the following HPC cluster or compute fleet conditions
    • Cluster enters PROTECTED MODE state for any reason
    • Compute fleet enters STOPPED state for any reason
    • Bootstrap failures when using s3:// scripts configured as part of an OnNodeStart CustomAction
    • Node creation issues originating from the AWS EC2 service due to
      • Insufficient Capacity errors
      • vCPU quota limit breached

Turning Monitoring/Alerting Goals into Terraform

Step 1 – SNS Topic

The first set of resources we need to create follows the standard design pattern for receiving messages/alerts and letting different consumers subscribe to them: an AWS SNS Topic. We have two different subscribers to the SNS Topic we created:

  • Standard email subscribers who will get email each time a message is delivered to the Topic
  • A simple Python Lambda function is also subscribed to the SNS Topic. The purpose of this function is to receive the message, format it for better readability, and deliver it to a Slack channel via a webhook HTTPS address
[Screenshot: SNS Topic for receiving ParallelCluster monitoring messages and alerts]
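As a sketch, the Terraform for this pattern might look like the following. Resource and endpoint names here are illustrative, and it assumes the Slack-forwarding Lambda is defined elsewhere as aws_lambda_function.slack_forwarder:

```hcl
# Illustrative sketch: SNS Topic with an email subscriber and a Lambda subscriber
resource "aws_sns_topic" "pcluster_alerts" {
  name = "pcluster-monitoring-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.pcluster_alerts.arn
  protocol  = "email"
  endpoint  = "hpc-admins@example.org" # placeholder address
}

resource "aws_sns_topic_subscription" "slack_lambda" {
  topic_arn = aws_sns_topic.pcluster_alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_forwarder.arn
}

# SNS also needs permission to invoke the Lambda
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.slack_forwarder.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.pcluster_alerts.arn
}
```

Note that email subscriptions stay in a pending state until the recipient confirms them.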

Step 2 – CloudWatch Metric/Alarm Pairs and a CloudWatch Subscription Filter

I had to break my monitoring setup into two different styles of monitored conditions:

  1. Simple conditions where basic “ALARM” or “OK” status was sufficient, with no additional contextual details needed. 
  2. More complex conditions where I wanted to send detailed information extracted from JSON-formatted log entries into the Alert Message

For the simple conditions, the well-established CloudWatch Metric Filter / CloudWatch Alarm pairing is sufficient. We create a new metric, set its default value to 0, and scan the log group for the patterns we care about; each time a pattern is found, the Sum statistic adds 1 to the metric value. The paired alarm is even simpler: it is in OK status when the metric value is 0 and in ALARM status when the value is >= 1.

Whenever the Alarm is triggered we notify the SNS Topic and whenever the Alarm clears back to OK status we also notify the SNS Topic. 

This simple “Alarm” and “OK” status works great for two core conditions I want alerts on:

  • Cluster in PROTECTED mode
  • Compute fleet in STOPPED state
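A minimal Terraform sketch of one such metric/alarm pair, for the Protected Mode condition. The log group name, namespace and alarm name are assumptions, and the SNS Topic is assumed to exist as aws_sns_topic.pcluster_alerts:

```hcl
resource "aws_cloudwatch_log_metric_filter" "protected_mode" {
  name           = "pcluster-protected-mode"
  log_group_name = "/aws/parallelcluster/my-cluster" # placeholder
  pattern        = "%WARNING - Cluster is in protected mode%"

  metric_transformation {
    name          = "ProtectedModeEvents"
    namespace     = "ParallelClusterMonitoring"
    value         = "1"
    default_value = "0"
  }
}

resource "aws_cloudwatch_metric_alarm" "protected_mode" {
  alarm_name          = "pcluster-protected-mode"
  alarm_description   = "ParallelCluster entered Protected Mode"
  namespace           = "ParallelClusterMonitoring"
  metric_name         = "ProtectedModeEvents"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.pcluster_alerts.arn]
  ok_actions          = [aws_sns_topic.pcluster_alerts.arn]
}
```

With the Sum statistic over a one-minute period, a single matching log line is enough to flip the alarm to ALARM, and it falls back to OK once the pattern stops appearing.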

However, there are more complex things I want to monitor, and most importantly I want to extract data from the log entries and send that information to the SNS Topic for delivery to my Slack channel or email inbox. CloudWatch Metric Filters can’t do that.

To handle the more complex scenario, we use CloudWatch Subscription Filters, which enable real-time monitoring and (even better) forward the contents of log messages to various AWS services for downstream handling and processing. This involves:

  1. Creating a Subscription Filter that looks for JSON-formatted log messages
  2. When detected, the Subscription Filter sends the log payload to a lambda function that parses the JSON log entry, makes a more “human readable” summary, and then forwards that on to the SNS Topic for delivery to the email inbox or Slack channel.

This is what it looks like as a simple architecture diagram:

[Diagram: CloudWatch Log Group → Subscription Filter → Lambda → SNS Topic → email and Slack]

The image gallery below shows what the detection of “PROTECTED MODE” status looks like in both the AWS CloudWatch console and the Terraform code that creates the Metric Filter and Alarm.

It’s very simple, even cheesy; you will note that we are filtering for a basic pattern:

%WARNING - Cluster is in protected mode%

It’s simple but it works. Whenever “WARNING - Cluster is in protected mode” shows up in a log stream, we increase the metric count and generate an alarm.
 

What an Alarm announcement looks like in Slack

It’s just a simple “ALARM” or “OK” status, with some extra information about the environment and the alarm description added in by the lambda that takes messages from the SNS Topic and sends them to the Slack webhook URL:

[Screenshot: a simple ALARM/OK status message posted to Slack]
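A minimal sketch of what that lambda can look like, assuming the standard CloudWatch Alarm JSON that SNS delivers. The function names and the SLACK_WEBHOOK_URL environment variable are my own choices, not necessarily what the repo uses:

```python
import json
import os
import urllib.request

def format_alarm(sns_message: str) -> dict:
    """Turn a CloudWatch Alarm notification (a JSON string delivered by SNS)
    into a simple Slack webhook payload."""
    alarm = json.loads(sns_message)
    emoji = ":red_circle:" if alarm.get("NewStateValue") == "ALARM" else ":large_green_circle:"
    text = (
        f"{emoji} *{alarm.get('NewStateValue')}* - {alarm.get('AlarmName')}\n"
        f"{alarm.get('AlarmDescription') or ''}\n"
        f"Reason: {alarm.get('NewStateReason')}"
    )
    return {"text": text}

def lambda_handler(event, context):
    # SNS delivers the alarm JSON in Records[0].Sns.Message
    payload = format_alarm(event["Records"][0]["Sns"]["Message"])
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Using the stdlib urllib keeps the function free of bundled dependencies; a Slack incoming webhook only needs a JSON body with a "text" field.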

Handling the more complex error conditions

For anything related to node bootstrap errors, vCPU quota errors or errors thrown by AWS for insufficient capacity, we need to be a bit more sophisticated in how we handle things. This is because we (ideally) want to extract contextual information from the log entry itself and send that along to the SNS Topic.

This is where CloudWatch Subscription Filters come in — their primary purpose is to forward log data to various AWS services for downstream processing. 

The good news is that, when looking at ParallelCluster error logs, I noticed that for almost everything I care about, ParallelCluster generates a beautiful, fully loaded JSON error message with great detail. 

Here is an example error message from a compute node bootstrap failure when the IAM Instance Role on the node did not have permission to download the bootstrap script from the S3 bucket:

[Screenshot: JSON-formatted ParallelCluster error log entry from a compute node bootstrap failure]

That is a FANTASTIC error log entry! It tells us exactly where the error happened (during a CustomAction OnNodeStart event) and what happened (a permission denied on an S3 download attempt). Getting this level of detail into our alert message is essential, as it saves tons of time by eliminating the need to scan log files for “what went wrong”.

This is also really straightforward to set up. We create a CloudWatch Log Group Subscription Filter that looks for a simple pattern matching the JSON payload that ParallelCluster commonly logs.

Whenever we hit that pattern, we send the payload to a Python lambda function that reformats the JSON log entry into a prettier, more human-readable message, which is then delivered to the SNS Topic responsible for sending messages to email inboxes and the Slack webhook URL.
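A sketch of the receiving side, assuming the standard CloudWatch Logs subscription payload (base64-encoded, gzip-compressed JSON) and JSON-formatted log messages. The field names used in summarize() are illustrative, not guaranteed to match every ParallelCluster log entry:

```python
import base64
import gzip
import json

def decode_subscription_payload(event: dict) -> list:
    """CloudWatch Logs subscription filters deliver a base64-encoded,
    gzip-compressed JSON document under event['awslogs']['data']."""
    raw = base64.b64decode(event["awslogs"]["data"])
    doc = json.loads(gzip.decompress(raw))
    return doc.get("logEvents", [])

def summarize(log_event: dict) -> str:
    """Hypothetical reformatter: flatten a JSON-formatted log entry
    into a one-line, human-readable summary for SNS/Slack."""
    entry = json.loads(log_event["message"])
    return f"[{entry.get('level')}] {entry.get('event-type', 'unknown')}: {entry.get('message')}"
```

The decode step is the only part that is fixed by AWS; everything after it is just string formatting tailored to the fields ParallelCluster emits.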

Subscription Filter Design Attempt 1 (ERROR messages only)

{ $.level = "ERROR" }

In the first iteration of this project I thought that I only cared about ERROR-class events, so the subscription filter reflected that thinking.

That worked for a while, until I started testing alerts for things like vCPU-related quota launch errors. It turns out the log entry for an event like that is level=WARNING, so my first filter was missing all of those events and not sending alerts.
 
Subscription Filter Design Attempt 2 (ALL JSON messages)
{ $.level = * }

Great! Now I get the vCPU-related failure alerts that the prior filter missed because it only triggered on level=ERROR; the level=WARNING entries come through as well.

That solved the problem of not seeing alerts from vCPU quota errors, but it introduced a new issue: a ton of “noise” alerts generated by clustermgtd with event type “compute-node-idle-time”. Those events are not useful to me personally, were coming in constantly, and were polluting my Slack channel.
Subscription Filter Design Attempt 3 (ALL JSON messages but filter inside the lambda)
{ $.level = * }
OK, so now we know that we are interested in ALL of the JSON messages; however, we want to ignore or filter out some noisy things like event_type=compute-node-idle-time.
 
That presented a problem: it is difficult to place logical AND/OR operators into a subscription filter, and you are limited in the number of subscription filters that can be applied to a monitored CloudWatch Log Group.
 
So this is where I decided that the path of least resistance was to keep sending ALL of the JSON logs into the lambda, but add a bit of code on the lambda side to ignore the events we don’t care about.
 
So in the current filter we are operating like this:
 
Log Group Subscription Filter
 
{ $.level = * }
 
However, the json_error_forwarder.py function now has a few extra lines of Python code in it:

# ignore logs where event-type == "compute-node-idle-time"
if event_type == 'compute-node-idle-time':
    print("Skipping event-type: compute-node-idle-time")
    continue
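Pulled out as a standalone helper, that lambda-side filtering might look like this (the function name and the set of ignored event types are mine, not from the repo):

```python
import json

# Event types we never want forwarded to Slack (an assumption; extend as needed)
IGNORED_EVENT_TYPES = {"compute-node-idle-time"}

def should_forward(message: str) -> bool:
    """Decide whether a raw log message is worth alerting on:
    it must parse as JSON and not carry an ignored event type."""
    try:
        entry = json.loads(message)
    except json.JSONDecodeError:
        return False  # not a JSON-formatted log line
    return entry.get("event-type") not in IGNORED_EVENT_TYPES
```

Keeping the ignore list as a module-level set makes it cheap to extend the next time a noisy event type shows up.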
The downside to this is obvious, though: we are triggering an event and firing up a lambda only to have the lambda ignore the log entry and shut down after logging an “I’m ignoring this…” message. At scale, this can be wasteful and costly.
 
However, given the reasonable rate of ParallelCluster log messages in the HPC clusters I work in, I’m OK with this (for now). Our Lambda usage is still well below the AWS Free Tier threshold of 1 million free requests and 400,000 GB-seconds of compute time per month, so I’m not overly worried at the moment. I will, however, be monitoring the cost of this stack over time to see how it can be improved.
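A quick back-of-the-envelope check of that headroom, assuming a 128 MB lambda that runs for roughly 200 ms per invocation (both numbers are guesses, not measurements):

```python
# Assumed per-invocation footprint of the log-forwarding lambda
memory_gb = 128 / 1024   # 0.125 GB (assumed memory setting)
duration_s = 0.2         # ~200 ms per invocation (guess)
gb_seconds_per_invocation = memory_gb * duration_s  # 0.025 GB-s

# AWS Lambda Free Tier allowances per month (quoted above)
free_tier_requests = 1_000_000
free_tier_gb_seconds = 400_000

# How many invocations the compute allowance covers at this footprint
max_invocations_by_compute = free_tier_gb_seconds / gb_seconds_per_invocation
print(f"{max_invocations_by_compute:,.0f} invocations covered by compute allowance")
```

At this footprint the compute allowance covers roughly 16 million invocations, so the 1 million request cap, not compute time, is the limit you would hit first.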

What the “complex” Alert message looks like in Slack

The value here is that we go beyond simple “ALARM” and “OK” status. By using a CloudWatch Subscription Filter and a lambda to process the log entry and turn it into a formatted message, we get much more useful and actionable alerts.

The alert below clearly shows that a compute node failed to start because the configured path to the s3:// hosted bootstrap script had a typo in it, resulting in a “404 Not Found” error when the download was attempted.

[Screenshot: detailed bootstrap-failure alert delivered to Slack]

Thanks for reading this far!

If you have any questions, comments or feedback feel free to reach out to me at dag@bioteam.net.

 
