Modern HPC integration practices for Schrödinger workloads tend to want 10–12 Slurm partitions and matching host entries, because Maestro doesn’t let chemists pass Slurm arguments directly. That design preference collided with AWS Parallel Computing Service on a recent deployment: PCS has a hard, non-adjustable cap of 10 partitions and 10 compute node groups per cluster.
TL;DR
AWS Parallel Computing Service has a hard, non-adjustable cap of 10 compute node groups and 10 queues per cluster. The cap applies at all cluster sizes (SMALL/MEDIUM/LARGE) — going from SMALL to MEDIUM raises managed-instance and tracked-job ceilings but does not raise the NG/queue cap. The number lives in the PCS endpoints/quotas page under a separate “Internal quotas” subsection, and it’s not visible to aws service-quotas list-service-quotas --service-code pcs.
A real-world Schrödinger Suite workload — Maestro chemists running Glide, FEP+, IFD-MD, Desmond, AutoDesigner — naturally wants 10–12 partitions, because Maestro’s “Host” dropdown is the chemist’s only Slurm-side decision and each entry in that dropdown maps to a fixed qargs string in the Job Server’s hosts.yml. Distinct memory profiles, GPU profiles, and on-demand-vs-spot choices all need their own host entry, and on PCS that pushes most of them to their own NG.
We designed for 12 NGs on a sandbox PCS cluster, hit ServiceLimitExceeded at the 11th create, dropped one queue (two NGs — an AutoDesigner-specific spot CPU pool with two-instance-type diversification) to fit, and turned the cluster over to chemists for testing at 10/10.
Why a Schrödinger cluster wants ~12 partitions
Schrödinger Suite is many products with different resource shapes:
| Product class | Examples | Compute shape |
|---|---|---|
| Ligand prep / 1D properties | LigPrep, Epik, ConfGen | CPU, single-core-ish, memory-light |
| Docking | Glide SP/XP, Glide HTVS | CPU-bound subjobs, embarrassingly parallel, spot-tolerant |
| Induced fit | IFD-MD | Single GPU + CPU driver co-located, memory-sensitive |
| Free energy perturbation | FEP+, FEP+ Pose Builder | CPU driver + many single-GPU edge subjobs |
| Molecular dynamics | Desmond, Desmond WaterMap | Single GPU per replica; multi-GPU for replica exchange |
| De novo design | AutoDesigner | Long-running campaign driver + scaled-out CPU spot workers |
| Protein modeling | Prime, BioLuminate | CPU, occasionally large memory |
A “one queue fits all” cluster either over-provisions (configure for the worst case, waste resources on small jobs) or under-serves (configure for the median, watch FEP+ drivers OOM-kill). So you specialize. The only question is how granular.
The granularity isn’t up to you in the way you might expect. Schrödinger’s Job Server (jobserverd) reads a single config file — <jobserver_dir>/config/hosts.yml — that maps user-facing names to Slurm submit arguments. A short excerpt of what an entry looks like:
entries:
  - name: cpu
    tmpdir: /scr
    qargs: --partition=cpu --nodes=1 --ntasks-per-node=%NPROC%
    processors: 2000
    processors_per_node: 8
  - name: gpu
    tmpdir: /scr
    qargs: --partition=gpu --nodes=1 --ntasks-per-node=%NPROC% --gres=gpu:%NPROC%
    processors: 100
    gpgpu:
      - index: 0
        description: NVIDIA L4
When a chemist clicks “Run” in Maestro, they get a Host dropdown populated from these names. They pick one. That’s the entire scope of their Slurm-side decision. Maestro does not surface a --mem field, a --cpus-per-task override, a --time field, or a way to add arbitrary qargs. Job Server takes the selection, looks up the entry, and submits with exactly those qargs.
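The lookup described above can be sketched in a few lines of Python. This is an illustration of the flow, not Schrödinger's implementation: the HOSTS dict stands in for a parsed hosts.yml, build_submit_args and partition_of are hypothetical helper names, and the sketch assumes %NPROC% is simply replaced by the processor count the job requests.

```python
import shlex

# Stand-in for a parsed hosts.yml: host name -> qargs string
# (the two strings are copied from the excerpt above).
HOSTS = {
    "cpu": "--partition=cpu --nodes=1 --ntasks-per-node=%NPROC%",
    "gpu": "--partition=gpu --nodes=1 --ntasks-per-node=%NPROC% --gres=gpu:%NPROC%",
}


def build_submit_args(host_name: str, nproc: int) -> list[str]:
    """Look up the chosen host entry and fill in the %NPROC% placeholder.

    Hypothetical helper: the substitution rule is an assumption.
    """
    qargs = HOSTS[host_name]  # KeyError if the name is not a known host entry
    return shlex.split(qargs.replace("%NPROC%", str(nproc)))


def partition_of(host_name: str) -> str:
    """Return the Slurm partition a host entry is pinned to."""
    for arg in build_submit_args(host_name, 1):
        if arg.startswith("--partition="):
            return arg.split("=", 1)[1]
    raise ValueError(f"host entry {host_name!r} has no --partition argument")


if __name__ == "__main__":
    print(build_submit_args("gpu", 2))
    print(partition_of("cpu"))
```

Nothing in this path lets the caller add or override an argument: the host name is the only input besides the processor count, which is why each distinct set of submit arguments needs its own entry.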
That’s a defensible UX choice — chemists shouldn’t need to learn Slurm — but it has a structural consequence:
Every distinct combination of Slurm submit arguments you want chemists to choose between has to exist as its own named host entry in hosts.yml.
In principle multiple host entries can share a Slurm partition (the cpu and cpu_highmem examples in Schrödinger’s docs both target --partition=cpu, differing only on --mem-per-cpu). In practice on AWS PCS, partitions and node groups are 1:1 most of the time, because:
- A partition associates with exactly one compute node group by default.
- Different memory profiles often want different instance types (e.g., c6i.4xlarge for general CPU vs r6id.2xlarge for an AutoDesigner driver that wants local NVMe), and instance type is fixed per NG.
- Different GPU profiles (single L4 vs four L4s on one node) need separate NGs because GPU GRES is per-instance-type.
- Different purchase options (on-demand vs spot) need separate NGs.
So each meaningfully-distinct workload tier ends up wanting its own host entry, its own partition, and — on PCS — its own NG.
The 12-partition design we drew up
This is what we landed on for a client’s sandbox PCS cluster, before we hit the unchangeable 10-NG cap.
Control-plane layer (3 NGs): a persistent pcs-login node (c6i.large, 1/1, fixed FQDN, no jobs); a small pcs-interactive queue for srun --pty bash and ad-hoc CLI runs; a general pcs-driver pool on c6i.2xlarge for FEP+, IFD-MD orchestrators, CSP, and Meta Workflow Builder drivers. The driver size matters — c6i.xlarge (8 GiB) OOM-crashed multi-edge FEP+ drivers in early testing; c6i.2xlarge (16 GiB) holds.
AutoDesigner driver (1 NG): pcs-ad-driver on r6id.2xlarge, max=1. Two reasons it doesn’t collapse into the general driver pool: AutoDesigner stores intermediate campaign state on local NVMe SSD (the r6id has 474 GB, c6i has none), and the AUTODESIGNER feature is a single-token license, so we want max=1 specifically here rather than max=8 across general drivers.
General CPU compute (3 NGs): pcs-cpu (c6i.4xlarge, on-demand, max=16) for CPU work that isn’t spot-tolerant — long-running multi-stage workflows, license-token-checked-out jobs, anything without restart-from-checkpoint logic. pcs-spot-cpu (c6i.xlarge, spot, max=100) for high-throughput Glide subjobs and LigPrep batches where each subjob is a re-runnable unit. pcs-spot-cpu-large (c6i.4xlarge, spot, max=30) for parallel docking batches that benefit from co-locating multiple subjobs on one node — shared scratch, lower per-job startup overhead.
GPU compute (3 NGs): pcs-l4-gpu on g6.xlarge (1× L4, max=100) — the workhorse single-GPU pool, where most FEP+ edge subjobs and single-replica Desmond runs land. pcs-l4-gpu-ifd-md on g6.8xlarge (1× L4 with 32 vCPU, max=8) — a separate pool for IFD-MD specifically. The MD refinement step in IFD-MD wants more vCPU per GPU than g6.xlarge’s 4 vCPU provides; 32 vCPU keeps the GPU fed during the protein-flexibility CPU portion. pcs-l4-multi-gpu on g6.12xlarge (4× L4, max=12) for multi-GPU FEP+ Pose Builder driver subjobs and large Desmond replica-exchange systems that benefit from intra-node bandwidth across GPUs.
The two NGs we wanted but didn’t build: an AutoDesigner-specific spot CPU pool with two-instance-type diversification (c6id.4xlarge + c5d.4xlarge — both with local NVMe, both feeding a single pcs-ad-spot-cpu queue for spot-fulfillment resilience). One queue, two NGs. Dropping it freed both an NG and a queue at once.
That’s 12 NGs and 11 queues by design. The cap is 10/10. We had to drop something to keep the test moving.
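The tally is easy to check mechanically. The sketch below encodes the design as a queue-to-node-group mapping and compares it with the 10/10 cap. Queue names are the ones used above; the two node group names under pcs-ad-spot-cpu are hypothetical labels, and the helper names are illustrative.

```python
# Per-cluster caps from the PCS "Internal quotas" table.
MAX_NODE_GROUPS = 10
MAX_QUEUES = 10

# Queue -> node groups feeding it. In this design every node group
# belongs to exactly one queue, so counting list entries counts NGs.
DESIGN = {
    "pcs-login": ["pcs-login"],
    "pcs-interactive": ["pcs-interactive"],
    "pcs-driver": ["pcs-driver"],
    "pcs-ad-driver": ["pcs-ad-driver"],
    "pcs-cpu": ["pcs-cpu"],
    "pcs-spot-cpu": ["pcs-spot-cpu"],
    "pcs-spot-cpu-large": ["pcs-spot-cpu-large"],
    "pcs-l4-gpu": ["pcs-l4-gpu"],
    "pcs-l4-gpu-ifd-md": ["pcs-l4-gpu-ifd-md"],
    "pcs-l4-multi-gpu": ["pcs-l4-multi-gpu"],
    # Hypothetical NG labels for the two-instance-type spot pool.
    "pcs-ad-spot-cpu": ["pcs-ad-spot-cpu-c6id", "pcs-ad-spot-cpu-c5d"],
}


def count(design: dict[str, list[str]]) -> tuple[int, int]:
    """Return (node_groups, queues) for a design."""
    node_groups = sum(len(ngs) for ngs in design.values())
    return node_groups, len(design)


def overage(design: dict[str, list[str]]) -> tuple[int, int]:
    """Return how far the design exceeds the NG and queue caps."""
    node_groups, queues = count(design)
    return max(0, node_groups - MAX_NODE_GROUPS), max(0, queues - MAX_QUEUES)


if __name__ == "__main__":
    print(count(DESIGN), overage(DESIGN))
    trimmed = {q: ngs for q, ngs in DESIGN.items() if q != "pcs-ad-spot-cpu"}
    print(count(trimmed), overage(trimmed))
```

Removing the pcs-ad-spot-cpu entry is the only single-queue removal that clears both overages at once, because it is the only queue backed by two node groups.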
A reasonable question: do we really need this many? Couldn’t it be CPU / GPU / spot-CPU / driver, four partitions total, and chemists figure it out? The Maestro-dropdown constraint is the key blocker. Each pair encodes a qargs distinction the chemist has no other way to express:
- pcs-cpu vs pcs-spot-cpu: chemists can’t pick spot vs on-demand per job.
- pcs-spot-cpu vs pcs-spot-cpu-large: chemists can’t request “give me the bigger node.”
- pcs-l4-gpu vs pcs-l4-gpu-ifd-md: chemists running IFD-MD can’t request “32 vCPU for the CPU side, please.”
- pcs-l4-gpu vs pcs-l4-multi-gpu: chemists running multi-GPU Pose Builder can’t request “4 GPUs on one node.”
- pcs-driver vs pcs-cpu: without a dedicated driver pool, FEP+ campaign drivers compete with subjobs for cores and OOM.
Drop any pair and either the workload it served stops running, or chemists have to over-provision against the largest remaining partition.
Hitting the cap
We applied the 12-NG design via Terraform. Ten NGs created cleanly. The eleventh and twelfth came back with:
Error: AWS SDK Go Service Operation Incomplete
StatusMessage: You have reached the quota: ComputeNodeGroup
ErrorCode: ServiceLimitExceeded
The PCS endpoints/quotas page lists this in a separate “Internal quotas” section:
| Name | Default | Adjustable |
|---|---|---|
| Concurrent cluster creation | 1 | No |
| Compute node groups per cluster | 10 | No |
| Queues per cluster | 10 | No |
Two things made this easy to miss. First, the Service Quotas console only exposes the main “Service quotas” table for PCS: aws service-quotas list-service-quotas --service-code pcs returns only Clusters: 5. The 10-NG and 10-queue caps don’t show up in self-service tooling at all; you have to read the docs page. Second, the cluster-size table on the PCS docs only describes instance-count and tracked-job scaling between SMALL/MEDIUM/LARGE, which can leave the impression that NG/queue caps scale with cluster size too. They don’t.
It’s also worth being explicit about what cluster size does and doesn’t change, because the SMALL/MEDIUM/LARGE distinction is easy to read as “bigger size = more of everything.” It isn’t. Going from SMALL to MEDIUM raises managed instances (32 → 512) and tracked jobs (256 → 8192). It does not raise the NG or queue caps. Those stay at 10/10 regardless of cluster size.
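As a worked example of what cluster size does buy, the sketch below adds up the max= values stated for the design's node groups and picks the smallest size whose managed-instance ceiling covers them. Two assumptions are baked in: the ceiling is compared against the sum of node group maximums, and pcs-interactive is left out because its maximum is not stated above. LARGE is omitted because its numbers are not quoted here. The helper names are hypothetical.

```python
# Ceilings quoted in the text for the two sizes it gives numbers for.
SIZE_LIMITS = {
    "SMALL": {"managed_instances": 32, "tracked_jobs": 256},
    "MEDIUM": {"managed_instances": 512, "tracked_jobs": 8192},
}
SIZE_ORDER = ["SMALL", "MEDIUM"]

# The same for every cluster size.
NODE_GROUP_CAP = 10
QUEUE_CAP = 10

# max= per node group as stated in the design (pcs-login is 1/1, the
# general driver pool is max=8). pcs-interactive is omitted: no max given.
MAX_INSTANCES = {
    "pcs-login": 1,
    "pcs-driver": 8,
    "pcs-ad-driver": 1,
    "pcs-cpu": 16,
    "pcs-spot-cpu": 100,
    "pcs-spot-cpu-large": 30,
    "pcs-l4-gpu": 100,
    "pcs-l4-gpu-ifd-md": 8,
    "pcs-l4-multi-gpu": 12,
}


def smallest_size(instances: int, jobs: int = 0):
    """Return the smallest listed size that fits, or None if neither does."""
    for size in SIZE_ORDER:
        limits = SIZE_LIMITS[size]
        if instances <= limits["managed_instances"] and jobs <= limits["tracked_jobs"]:
            return size
    return None


def caps_for(size: str) -> tuple[int, int]:
    """NG and queue caps for a size: identical for every size."""
    if size not in SIZE_LIMITS:
        raise KeyError(size)
    return NODE_GROUP_CAP, QUEUE_CAP


if __name__ == "__main__":
    total = sum(MAX_INSTANCES.values())
    print(total, smallest_size(total), caps_for("SMALL"), caps_for("MEDIUM"))
```

Under those assumptions the stated maximums sum to 276 instances, which rules out SMALL and fits MEDIUM, while the NG and queue caps come back as 10/10 for both.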
So the design met the cap. We dropped the AutoDesigner-specific spot CPU queue — both NGs and the one queue at the same time, exactly closing the gap to 10/10 — and re-applied. AutoDesigner workers run on the general pcs-spot-cpu and pcs-spot-cpu-large pools instead. We lose worker-level NVMe locality (only the AD driver still has it) and lose instance-type diversification within the AD-specific queue. For our workload mix that trade-off is fine; AD campaigns are infrequent enough that general-pool spot capacity covers it. For an AutoDesigner-heavy site, this trade-off would be more painful and might push the design toward one of the workarounds below.
A second complaint: PCS provisioning time
The 10-NG/10-queue cap is the major constraint, but the second thing that surprised us on this build was how long PCS takes to provision a non-trivial cluster — and how the per-NG times grew as the cluster filled out.
Here are the timings from the from-scratch apply that built the 10-NG design (terraform + the awscc provider, us-east-2, MEDIUM cluster). Cluster came up first, then all 10 NG Create API calls were issued in parallel within a few seconds of each other:
| # | Resource | Creation time |
|---|---|---|
| - | awscc_pcs_cluster.sandbox_hpc | 15m47s |
| 1 | compute_node_group.login | 3m48s |
| 2 | compute_node_group.interactive | 6m57s |
| 3 | compute_node_group.spot_cpu_large | 8m19s |
| 4 | compute_node_group.l4_multi_gpu | 10m57s |
| 5 | compute_node_group.cpu | 13m53s |
| 6 | compute_node_group.driver | 17m21s |
| 7 | compute_node_group.ad_driver | 20m16s |
| 8 | compute_node_group.ad_spot_cpu_c5d | 22m37s |
| 9 | compute_node_group.spot_cpu_medium | 25m41s |
| 10 | compute_node_group.ad_spot_cpu_c6id | 28m28s |
The pattern is striking. Terraform issued all 10 CreateComputeNodeGroup calls effectively simultaneously, but the completions came back monotonically — each NG roughly 2–3 minutes after the prior one. The 10th NG took ~7.5x as long as the 1st. After the NGs finished, the queues that depend on them landed in another 24–28 minutes each, similarly serialized.
We did not set out to benchmark this and don’t have a controlled comparison across cluster sizes or regions, so caveat the numbers as one data point. But the shape of the data — parallel API calls, monotonically increasing completion times — is consistent with a backend that’s processing NG creation serially or with a small fixed concurrency, regardless of how many requests are in flight. We don’t have visibility into PCS internals to confirm what’s actually happening; from the outside it looks like there’s a bottleneck somewhere in the control plane that gets exercised once per NG.
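The spacing between completions can be pulled straight out of the table. The sketch below parses the ten node group creation times, computes the gap between each completion and the previous one, and the ratio of the last to the first. The to_seconds helper and the variable names are illustrative; the durations are the ones reported above.

```python
import re

# Creation times for the 10 node groups, in completion order.
NG_TIMES = [
    "3m48s", "6m57s", "8m19s", "10m57s", "13m53s",
    "17m21s", "20m16s", "22m37s", "25m41s", "28m28s",
]


def to_seconds(text: str) -> int:
    """Convert a Terraform-style 'XmYs' duration to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


seconds = [to_seconds(t) for t in NG_TIMES]

# Gap between each completion and the one before it.
gaps = [later - earlier for earlier, later in zip(seconds, seconds[1:])]

mean_gap = sum(gaps) / len(gaps)
ratio_last_to_first = seconds[-1] / seconds[0]
is_monotonic = all(gap > 0 for gap in gaps)

if __name__ == "__main__":
    print("gaps (s):", gaps)
    print("mean gap (s):", round(mean_gap))
    print("last/first:", round(ratio_last_to_first, 2))
    print("monotonic:", is_monotonic)
```

The gaps range from 82 to 208 seconds with a mean of about 164 seconds (roughly 2m44s), and the last node group took about 7.5 times as long as the first.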
The operational cost adds up:
- A from-scratch 10-NG cluster apply takes roughly 45–60 minutes wall-clock before you can submit a job. Cluster (~16m) + last-NG-completion (~28m) + queues serialized behind that (~28m).
- An iterative design loop (“tweak NG sizing, re-apply, test”) pays a meaningful chunk of that on every cycle, even for a single-NG change if a destroy-and-recreate is involved.
- Combined with the NG/queue cap from the previous section, this also makes the “just spin up a second cluster” option (the multiple-clusters workaround described below) more painful than it sounds. Each cluster is a half-hour-plus to stand up.
A few practical adjustments we’d make on the next build:
- Don’t iterate cluster shape via Terraform when you don’t have to. The aws pcs update-cluster and per-NG update paths are dramatically faster than destroy-and-recreate when only attributes are changing.
- Plan apply windows accordingly. A “small” change that touches NG count is not a five-minute apply.
- Build out the cluster once, then iterate workload-side. Get to a “good enough” 10-NG shape, hand it over to users, and let chemists drive iteration on hosts.yml rather than Terraform.
If you’ve measured PCS provisioning behavior under different conditions (bigger account, larger MEDIUM utilization, LARGE clusters, other regions) and seen a different shape, we’d be curious to hear about it.
Workarounds when 10 isn’t enough
A few options if your design genuinely needs more than 10 NGs or queues:
- Heterogeneous NGs via instance_configs. A single PCS compute node group can list multiple instance types in its instance_configs array, and PCS picks at launch time. This is the obvious collapse for an “AD spot c6id + c5d” pair: one NG, two configs. Trade-off: chemists can’t pick which instance type they get, and --gres/--mem constraints have to apply across all configs.
- Queue-level GRES or feature differentiation. One NG can serve multiple queues if you use Slurm AvailableFeatures or GRES specs to differentiate, with chemists submitting against --constraint or --gres requirements. The upside is more queues per NG; the downside is the resource constraint logic gets split between hosts.yml qargs and Slurm partition config, and that’s harder to reason about in a year when somebody else is on call.
- Multiple PCS clusters. Spread workloads across two clusters, say a GPU-heavy one and a CPU-heavy one. Doubles operational complexity (two Job Server topologies, two cluster IDs, two Terraform states) and breaks cross-workload scheduling. Probably not worth it unless you’re scaling past 16–18 logical partition needs.
- AWS Support escalation. The docs say the cap is non-adjustable. We have not personally tested whether AWS Support will grant exceptions, but it’s worth filing if you’ve genuinely exhausted the design optimizations and have a business case for more.
- Accept the trim. Identify the lowest-value partition pair early and treat it as your sacrificial target. For us this was AutoDesigner spot diversification.
- Use AWS ParallelCluster instead. If the 10/10 cap is a real blocker for your design and none of the collapses above work cleanly, ParallelCluster is the obvious fallback. It’s the free, open-source AWS HPC option (you run the controller on EC2 yourself rather than handing it to a managed control plane) and it does not impose a comparably tight partition or queue cap. We’ve stood up Schrödinger workloads on ParallelCluster at multiple sites and it handles 12+ partitions without complaint. The trade-off is real: you give up the managed slurmctld and accounting database that PCS provides, and you maintain the controller AMI and patching yourself. The slow per-NG provisioning behavior we measured on PCS isn’t something you’d inherit on ParallelCluster either. For Schrödinger-heavy shops where the partition count is the binding constraint, this is worth weighing seriously rather than treating PCS as the only choice. Most of BioTeam’s Schrödinger environments are running on ParallelCluster with no plans to transition.
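One thing worth checking before leaning on the instance_configs collapse is how far it actually moves the counts. The sketch below is a counting exercise only, not a PCS API call: it models the 12-NG design as ten single-NG queues plus the two-instance-type AutoDesigner spot queue, and the names counts and fits are hypothetical.

```python
NODE_GROUP_CAP = 10
QUEUE_CAP = 10

# Ten queues backed by one node group each.
SINGLE_NG_QUEUES = [
    "pcs-login", "pcs-interactive", "pcs-driver", "pcs-ad-driver",
    "pcs-cpu", "pcs-spot-cpu", "pcs-spot-cpu-large",
    "pcs-l4-gpu", "pcs-l4-gpu-ifd-md", "pcs-l4-multi-gpu",
]

# Queues backed by one node group per instance type.
MULTI_NG_QUEUES = {"pcs-ad-spot-cpu": ["c6id.4xlarge", "c5d.4xlarge"]}


def counts(collapse_multi: bool) -> tuple[int, int]:
    """Return (node_groups, queues).

    With collapse_multi=True each multi-type queue is served by a single
    node group that lists all of its instance types.
    """
    queues = len(SINGLE_NG_QUEUES) + len(MULTI_NG_QUEUES)
    if collapse_multi:
        multi_ngs = len(MULTI_NG_QUEUES)
    else:
        multi_ngs = sum(len(types) for types in MULTI_NG_QUEUES.values())
    return len(SINGLE_NG_QUEUES) + multi_ngs, queues


def fits(node_groups: int, queues: int) -> bool:
    """True when both counts are within the per-cluster caps."""
    return node_groups <= NODE_GROUP_CAP and queues <= QUEUE_CAP


if __name__ == "__main__":
    for collapse in (False, True):
        ngs, queues = counts(collapse)
        print(collapse, ngs, queues, fits(ngs, queues))
```

For this particular design the collapse takes the count from 12 NGs and 11 queues to 11 and 11, which is still one over on both, so it would need to be paired with one more trim or merge elsewhere.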
What we’d suggest if you’re sizing a PCS cluster for Schrödinger
A few practical takeaways from this build, for HPC admins evaluating or designing a Schrödinger workload on PCS:
- Plan for 8–12 partitions as the natural shape of a multi-product Schrödinger workload. Don’t try to compress to 3–4 — the Maestro UX will fight you, and the chemists will end up over-provisioning to the largest partition that exists.
- Know the 10-NG / 10-queue cap going in. It’s not in self-service Service Quotas; it appears only in the docs page’s “Internal quotas” subsection. Cluster size doesn’t change it.
- Identify a sacrificial partition pair early. Decide which of your design’s pairs could collapse onto general pools without breaking the workload, and have it ready as the trim target.
- Document the Maestro → hosts.yml → qargs → partition chain for your chemists’ admin handoff. It’s the single most common “how does this cluster actually work” question that comes up after deployment.
Wrapping up
The 10-NG and 10-queue per-cluster cap on AWS PCS is a real architectural constraint that meets a real architectural pull from Schrödinger’s hosts.yml-and-Maestro design. The two pulls don’t always fit, but they fit close enough that for most workloads a careful trim lands inside the cap without losing anything chemists actually use. If you’re sizing a PCS cluster for Schrödinger and you’re staring at 11 or 12 partitions on the design, that’s not necessarily wrong — it just means you’ve got one trim decision to make before terraform apply.


