Listen to our newest podcast with Ruth Marinshaw, CTO of Research Computing at Stanford University.

Close this search box.

AWS ParallelCluster Private Deployment in Hardened VPCs

Image: Firing up a public cluster configured to deploy through a logging Squid proxy was the only way to discover all of the various Internet-based URls and endpoints that AWS Parallelcluster needs in order to successfully complete a full deployment. 

AWS ParallelCluster Private Deployment in Hardened VPCs


Executive Summary

If you are struggling like me to provide a Firewall and InfoSec team with a list of external communication requirements that AWS ParallelCluster needs then skip to the bottom of this post. I had the recent chance to deploy this AWS HPC product via a logging Squid proxy and the proxy logs are super useful. Full raw log data available as well as some curated/context-aware interpretation of the destinations is provided.


I’m an HPC nerd with a biology background. A significant chunk of my professional career in the life sciences has involved building, configuring, tuning, troubleshooting and operating large linux clusters acting as HPC grids to execute life science research workflows.

I loved CfnCluster and I love what CfnCluster became when it was rebranded and refactored as AWS ParallelCluster. Let me list the ways that I love this product stack:

          • AWS ParallelCluster perfectly creates Linux HPC grids on AWS that look, function and operate exactly how on-premise or on-campus grids operate
          • This is fantastic for existing workflows (many of which will never be rewritten or re-architected for serverless execution or converted to using only object storage for data). It also greatly minimizes the training and knowledge transfer required for an HPC cloud migration project
          • This includes all the things we expect like a shared filesystem and functional MPI parallel message passing stack

On top of delivering a sweet HPC environment that looks and feels familiar, ParallelCluster also gives us some nice cloud features that simply don’t exist on-premise:

              • Auto-scaling! My grids now shrink and expand depending on the length of the pending job list
              • Spot market support!  My grids can now leverage super cheap spare AWS capacity when I don’t have time sensitive workloads
              • AWS FSx support! Easy integration with AWS FSx/Lustre if I need parallel scratch filesystem features
              • It’s an officially supported AWS product which means Enterprise Support and Amazon TAMs are interested and will actively help support and troubleshoot issues
              • The developers are amazing and super responsive on their github page

The General Problem

As a person who mainly does AWS work for commercial and industrial clients, my biggest overall complaint with AWS best practices, reference design architectures, tutorials, community HowTo’s and all of the evangelist activities is that they all make the same rosy colored and often inaccurate assumption that:

          • Legacy workflows and requirements simply do not exist and are not worth discussing.
          • Anything and Everything can, should and will be totally rebuilt and rearchitected for serverless or cloud-native design patterns
          • Every user will have the necessary IAM permissions to run cloudformation stacks that may create VPCs, alter route tables, deploy Internet Gateways and attach major infrastructure crap directly to the public Internet

But my biggest complaint of all?

          • 99% of AWS documentation, tutorials, design patterns, reference architectures and evangelist materials assume that unrestricted Internet access is always available, to all destinations for all things sitting in AWS VPC public or private subnets. 

The operating assumption by AWS marketing, technical and PR people that everything always has unrestricted egress access to the full Internet is becoming the bane of my existence. Growing portions of the real world simply do not work this way and implementing “unrestricted internet access for all the things!” is (in my mind) a fairly significant security risk that needs to be accounted for in AWS customer InfoSec threat models.

Current HPC Situation – 100% Private VPC with API and Internet Access Blocked By Default

I can’t divulge many details but the general gist of a hardened AWS footprint for a security conscious client is this:

          1. Production VPCs are 100% private. There are no public-facing subnets in the standard VPC design
          2. No unsolicited inbound traffic from the Internet. At all. Ever.
          3. Systems, Servers and Lambda functions by default have no route to the Internet
          4. Systems, Servers and Lambda functions by default have no route to AWS API endpoints
          5. Lots of firewalls. Lots.  Most of them are capable of SSL decrypt and payload inspection as TLS/HTTPS is becoming a major threat vector for data exfiltration, malware and botnet C&C coordination
          6. Access to AWS APIs or any Internet destination requires approval and an active firewall policy change


CfNCluster / ParallelCluster Past Versions

Prior versions of CfnCluster/ParallelCluster were fairly straightforward to operate even in our “hardened” subnets because the external communication requirements were very simple:

          • ParallelCluster needs to talk to a bunch of AWS APIs for both deployment as well as ongoing operation
          • Core AMIs are custom regional images and contain vast majority of software/code that needs to be brought inside
          • Non-AMI data and assets like chef cookbooks used in the deployment were all stored in regional S3 buckets owned by the ParallelCluster team
          • Access to EPEL repo and the OS/Patch/Update sites for CentOS/Ubuntu/Amazon Linux
          • Access to Python PyPI mirror(s)

So that was pretty easy for us to work with. We just had to whitelist or authorize access to a known set of AWS APIs and we also had to allow HTTPS downloads from known S3 bucket locations.

The “hardest” thing was generating custom AMIs or pre_install bootstrap scripts that replaced the node OS software and EPEL repo files with versions that pointed to our local internal mirrors. Since we needed to touch files in the node OS anyway it was trivial to also install the extra  SSL CA certs our firewalls required.

Simply replacing the software repo files with versions that pulled from private software mirrors reduced the external traffic requirements for ParallelCluster to “S3 downloads + APIs” which was fantastic.

Until it was not …

ParallelCluster Current Version (v2.4.1 as of this writing)

As of now I’m unable to deploy the latest version of ParallelCluster at my client because the latest version has started to require download access to undisclosed website URLs.

It took a lot of time and troubleshooting to figure out why our new HPC grid deployments failed and the cloud formation stacksets rolled back.

This was because nowhere in the Release Notes or website for the project was an announcement that new destinations were required. I’m not super angry about this because it feels like we are an edge case. Remember again — almost all cloud evangelism and design patterns make the core assumption that “internet access is always available from everything to everything” so why would anyone but a bunch of niche market wierdos care about listing out exactly what external access requirements a product like this requires?

Root cause turned out to be …

          • Both master and compute nodes seem to require HTTPS access to during cluster deployment
          • Our firewall was killing these connection attempts via connection-reset packets
          • The failure to download from github triggers the node deploy failure state
          • Eventually the Cloudformation timeout is hit and the entire grid does a rollback and self-deletion

Documenting All ParallelCluster Communication Requirements

I finally had the time to stand up a logging Squid proxy server in a personal AWS account and do a full AWS ParallelCluster deployment through the proxy so I could log all of the HTTPS traffic coming from both the Master and Compute HPC nodes.

Sorting the squid logs was all I needed to generate an API and destination list for our firewall team.

This is not super useful or interesting data but maybe it will help someone else who also needs to work with a Firewall team on managing outbound traffic and API traffic.

Full Raw Proxy Logs From ParallelCluster Deployment

If you want a full .txt file showing all of the Squid proxy logs for a 3-node CentOS-7 ParallelCluster deployment in the US-EAST-1 region then just click this download link:

[download id=”4191″]

Deduplicated List Of All Destinations From ParallelCluster Deployment

If you just want a unique set of external URLs with duplication removed, here is the parsed version of the full squid log:


Context & Category Aware ParallelCluster Destinations:


AWS API and S3 Storage Endpoints:


Other Destinations (Not API, Not S3, Not OS or Package Repo)


CentOS Mirrors, EPEL Mirrors and Python PyPi Mirrors

Note: This is the hardest category from a Firewall perspective. You can see below that these destinations are all CDN networks or redirecting mirror sites that can end up sending your download clients to a different set of download destinations on every request or every cluster deployment.  Our use of internal mirrors for downloads of this type solves this issue neatly for us as I can ignore these log entries.


What My Firewall Team Actually Needs


Thankfully I don’t have to worry about OS, EPEL or Python mirrors on the internet that may use a constantly changing set of mirror hosts or CDN locations.  All of our OS, Update, Patch and Software Package repos are hosted on internal private mirrors.

So in my case I can strip all destinations related to CentOS, EPEL and Python PyPI mirrors because our HPC grids are customized so the sources of those assets always point to internal private mirrors.

So the final set of info that my Firewall team actually needs to see is this curated list which basically boils down to “APIs + S3 downloads + Github downloads”

The above list will be applied to a dynamic firewall policy triggered by special Tags assigned to the EC2 hosts so we don’t have to worry about source IP firewall rules or subnet-wide VPC firewall rules.

Hope this may prove helpful to someone else dealing with a similar problem! Obviously my API destinations are region-specific so you’ll need to adjust accordingly if you are operating outside of US-EAST-1.



Get updates from BioTeam in your inbox.

Trends from the Trenches eBook January 2022

The eBook of BioTeam and Bio-IT World's Most Trending Articles