Schrödinger 2026-1 and broken jsc certs

Note: This post is pure Google bait designed to help others who run into the same problem. We hope it helps!

You upgraded jobserverd, your functionality tests ran clean on the Slurm side, and a few hours later your scientists started complaining. Here’s what might have happened, how to confirm it in your environment, and the three-command fix.


Rsa Error 03

Schrödinger JobServer says we have problems, but Slurm shows “All is well!”:

Slurm Ok

TL;DR

Schrödinger’s jobserverd version 73060 and above (shipped with Suite 2025-4 late builds and all 2026-1 releases) is linked against OpenSSL 3.x, whose default security policy rejects RSA keys shorter than 2048 bits. Every Schrödinger user whose client cert was generated before Schrödinger’s own default moved to 2048-bit RSA is now silently broken — Maestro, testapp, and jsc info all fail with “client certificate uses weak RSA key of 1024 bits.”

The confusing part: sbatch submissions still land in Slurm and jobs still run to completion with correct license accounting, so Slurm-side monitoring shows a 100% OK cluster while users’ Maestro sessions encounter errors.

There is no Schrödinger release note or doc warning about this change — the runtime error message is the only documentation. The fix is a simple rotation per affected user (jsc cert list → jsc cert remove → jsc cert get), which replaces the 1024-bit cert with a fresh 2048-bit one. The rotated cert is version-agnostic: one rotation fixes every installed Schrödinger release simultaneously.

Below is the audit script, the rotation procedure, and the verification we used to confirm zero regression for 2025-4 client workloads.


Why?

We just ran a Schrödinger Suite 2026-1 upgrade on a client AWS HPC cluster — new suite install on the shared filesystem, $SCHRODINGER/jsc admin upgrade --dir <jobserver_dir> on the central Job Server host, systemctl start jobserverd, test jobs that request real license tokens, etc. And we hit an issue that took a while to untangle because from a cluster admin perspective all was well.

We worked through this on a live cluster with four registered compchem users across two Job Servers, identified the root cause, confirmed the fix doesn’t break anything for older Schrödinger client releases, and rotated the whole user population quickly. The good news is the fix is fast and backwards compatible with older Schrödinger versions.

How we hit the issue

Our post-upgrade functionality test followed Schrödinger’s own documented procedure: submit a testapp -HOST <queue> -t 90 -l MMLIBS:4 -j <name> job, confirm Slurm picks it up, confirm license-pool accounting updates correctly. We ran it as a standard hpc user, who happened to have an older cert. The job appeared in squeue, ran to completion, sacct reported ExitCode=0:0, and scontrol show lic showed the expected MMLIBS token consumption. Great. Upgrade ticket closed.

But the testapp client itself had exited with code 1 and a long gRPC error, and we almost didn’t notice because every other signal was green. It took reading the error text carefully — literally scrolling up in the terminal — to realise that the upgrade had introduced an entirely new failure mode.


Had a hard time finding documentation about this

We tried to confirm whether the existing docs mentioned this change. We searched the 2026-1 admin guide, the Updating Job Server page, the Authentication architecture page, the release notes, the system requirements page, and the troubleshooting guide.

We searched with a BioTeam-only MCP-backed RAG index running against a full ingest of the 2026-1 doc set (https://hpc-mcp.apps.bioteam.cloud/). Full-text searches for “1024,” “2048,” “weak key,” “RSA,” and “OpenSSL 3” returned only third-party OpenSSL legal notices. Most of the actionable info comes from the error response itself.



What broke under the hood (and, confusingly, what didn’t)

The trickiest part of this issue is that a client holding a 1024-bit cert is not completely broken. Slurm-side machinery, license accounting, and some parts of the jsc tooling continue to function. This is why the problem is so easy to miss in standard post-upgrade verification. Here is the partial-failure matrix we observed on a live AWS ParallelCluster environment — four users, two Job Servers, with live production workloads running alongside.


| Component | Outcome | Why |
| --- | --- | --- |
| sbatch submission lands in Slurm | Works | The submit RPC reaches jobserverd and the Slurm job is created with the correct SchrodingerJobId=<uuid> in its Comment field and Licenses=mmlibs@…:N annotation |
| Slurm runs the job to completion | Works | Compute node picks up the batch script at /tmp/<uuid> and executes it normally |
| License-pool accounting (scontrol show lic mmlibs) | Works | Token count increments on submit, decrements on completion — correct to within one scheduler tick |
| sacct post-completion accounting | Works | Reports ExitCode=0:0, Elapsed time, Start/End timestamps — exactly what you’d expect from a clean run |
| testapp on the affected client | FAILS (exit 1) | Prints the weak-RSA-key gRPC error. But the submission actually landed. The user sees red; the server did the thing. |
| jsc info <jobid> from the affected user on a remote submit host | FAILS | Same gRPC cert error — user cannot poll their own job’s progress |
| jsc info <jobid> run directly on the jobserverd host | Works | The local path appears to bypass the client-cert TLS check. Ops staff debugging on the jobserver see “Status: Completed” while users off the jobserver see “Job launch failed” for the same job. |
| Maestro job-submit / job-monitor panels | FAILS | Same gRPC channel as testapp. Users see authentication errors when submitting or registering. |
| jsc cert list | Works | Reads only the local config; does not touch jobserverd |
| jsc cert remove | Works | Local-only operation — edits jobserver.config to delete the stale entry |
| jsc cert get (to re-register) | Works | Re-registration generates a fresh 2048-bit keypair before opening the new gRPC session, so the weak-key check does not trip |


Root cause

Two separate changes line up to produce this. Schrödinger’s jobserverd, at some point between release 2025-2 and 2025-4, was rebuilt against OpenSSL 3.x. The specific crossing point is jobserverd version 73060. OpenSSL’s security-level mechanism rejects RSA keys shorter than 2048 bits during TLS handshakes at security level 2 — evidently the effective level in these builds (level 1 only requires 1024 bits, but many distributions and vendors raise the default to 2). This is a principled decision by the OpenSSL project: 1024-bit RSA has been considered inadequate against well-funded adversaries for over a decade.

But from the application layer, the rejection looks like a silent-breaking change, because it happens inside the cryptography library before Schrödinger’s own code is involved. Schrödinger’s client-cert generation, at some point in its history, changed its default RSA key size from 1024 bits to 2048.

Users who first registered with a Job Server (via $SCHRODINGER/jsc cert get <host>:<port>) before that change received a 1024-bit cert that is stored locally in ~/.schrodinger/jobserver.config.

Users who registered after got 2048. Two registrations. Same file. Different eras. Everything works until the day jobserverd crosses the 73060 boundary.

At that moment, OpenSSL 3 inside the upgraded server starts rejecting every legacy 1024-bit client cert. And because the rejection is perfectly correlated with “how long ago did this user first use this Job Server,” the affected population is exactly the long-tenured heavy users.
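You can reproduce the rejection with nothing but Python’s ssl module (which links against OpenSSL) and the openssl CLI — a minimal sketch, not Schrödinger’s code, with the security level pinned explicitly since distribution defaults vary:

```python
# Sketch: demonstrate OpenSSL's security-level check on a weak RSA cert.
# Generates a throwaway self-signed cert with the openssl CLI, then asks
# an SSLContext to load it. Level 2 requires RSA keys of at least 2048 bits.
import os, ssl, subprocess, tempfile

def can_load_cert(bits: int, seclevel: int) -> bool:
    """True if a self-signed RSA cert of `bits` loads at `seclevel`."""
    with tempfile.TemporaryDirectory() as d:
        key, crt = os.path.join(d, "k.pem"), os.path.join(d, "c.pem")
        subprocess.run(
            ["openssl", "req", "-x509", "-newkey", f"rsa:{bits}", "-nodes",
             "-keyout", key, "-out", crt, "-days", "1", "-subj", "/CN=demo"],
            check=True, capture_output=True)
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.set_ciphers(f"DEFAULT@SECLEVEL={seclevel}")
        try:
            ctx.load_cert_chain(crt, key)   # OpenSSL vets the EE key here
            return True
        except ssl.SSLError:                # e.g. [SSL: EE_KEY_TOO_SMALL]
            return False
```

At level 2 the 1024-bit cert is refused before any handshake even begins, which is the same class of failure the upgraded jobserverd surfaces to its clients.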



Find out who is affected in your environment

If you’ve done this upgrade, you may want to run a quick audit. You can likely cover the whole user population in under ten seconds. The client cert lives inside ~/.schrodinger/jobserver.config, a JSON file with one entry per registered Job Server. The private key is base64-encoded inside each entry’s auth.private field. Decode it through openssl rsa -text -noout and read the key size out of the first line of output. Adjust the glob to match wherever your user home directories actually live — /home/*, /fsx/home/*, /shared/home/*, whatever your shared-filesystem convention happens to be:

python3 <<'PY'
import json, base64, subprocess, glob
for cfg in glob.glob("/fsx/home/*/.schrodinger/jobserver.config"):
    user = cfg.split("/")[3]   # adjust for your home-path depth
    try:
        data = json.load(open(cfg))
    except Exception as e:
        print(f"{user}: read error ({e})")
        continue
    for e in data:
        host = e.get("hostname", "?"); port = e.get("jobport", "?")
        priv = e.get("auth", {}).get("private", "")
        if not priv:
            print(f"{user:24s} {host}:{port}  no-private-key"); continue
        pem = base64.b64decode(priv)
        r = subprocess.run(
            ["openssl", "rsa", "-in", "/dev/stdin", "-noout", "-text"],
            input=pem, capture_output=True)
        size = "?"
        for ln in (r.stdout + r.stderr).decode().splitlines():
            if "bit" in ln and ("Private-Key" in ln or "Public-Key" in ln):
                size = ln.strip(); break
        print(f"{user:24s} {host}:{port}  {size}")
PY

What you get is a per-user × per-Job-Server matrix. Anything reporting Private-Key: (1024 bit, 2 primes) is a user who is broken today, right now, even if they haven’t reported it yet.

Here’s what the real-world output looked like on our audited cluster, with the names anonymized:

Key Audit 01

Three of four users (75%) were on broken certs against Job Server A. All four were fine against Job Server B, because Job Server B had been upgraded at a different time — never crossing the 73060 boundary while 1024-bit certs existed in the wild, so the registrations against B were all 2048 from the start. User-2 got lucky because they happened to re-register against A recently, after the default-key-size change but before the server upgrade.


The fix, per user

Two commands, plus an optional jsc cert list to confirm the stale entry first. It takes a few seconds per user once the user has whatever interactive auth they need to the Job Server host.

Carbon
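In plain text, the per-user sequence looks roughly like this (host and port are placeholders for your Job Server; exact arguments may vary by release):

```
# Run as the affected user on their submit host:
$SCHRODINGER/jsc cert list                  # optional: confirm which entry is stale
$SCHRODINGER/jsc cert remove <host>:<port>  # delete the stale local entry
$SCHRODINGER/jsc cert get <host>:<port>     # re-register; generates a fresh 2048-bit keypair
```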

jsc cert remove is non-destructive — it only deletes the local config entry. If something goes wrong between remove and get, the user is simply unregistered and can retry jsc cert get at any time.

jsc cert get authenticates via whichever method you have configured on the Job Server. There are two options:

  1. Socket authentication (SSH-based): the user needs a working SSH path into the Job Server host. If they have a working SSH key deployed (or passwordless SSH works), jsc cert get completes silently in under two seconds. If SSH uses password auth, they get prompted for their UNIX password on the Job Server.
  2. LDAP authentication: the user types their LDAP password at the prompt.

For service accounts that don’t have interactive auth — LiveDesign workers, pipeline automation, anything submitted by a daemon — the admin-initiated one-time-password flow works the same way but seeded by the Job Server admin:

# On the Job Server host, as root or the jobserver service user:
sudo -u jobserver bash -l -c \
    "$SCHRODINGER/jsc admin adduser <service-account-name>"
# → prints a single-use password on stdout, one line

The service-account operator then runs jsc cert remove + jsc cert get as the service account and pastes the one-time password at the socket-auth prompt. This is the right flow for anything without interactive SSH — LiveDesign Hub, Pipeline Pilot workers, data-ingest daemons.

What not to do

A couple of tempting shortcuts that make things worse:

  • Do not hand-edit ~/.schrodinger/jobserver.config. The file contains interdependent base64-encoded certs and private keys, and partial edits break authentication in hard-to-diagnose ways. Always use jsc cert remove to delete entries.
  • Do not run jsc admin revoke <user> on the Job Server as part of a routine rotation. revoke invalidates the user’s cert server-side. If the client still has the old cert cached locally, the user will keep hitting auth failures until you also run the client-side jsc cert remove + jsc cert get. revoke is the right command when a user leaves the team and their access should be terminated — not when the goal is to refresh a cert.
  • Do not skip the audit because “our upgrade happened months ago and everything seems fine.” Users who happen to be working today with a 1024-bit cert are on borrowed time — they’ll hit the break on any subsequent session where their client opens a fresh gRPC channel (Maestro restart, new login, any jsc info poll).

Verify the rotation across Schrödinger client versions

This is the important piece to understand: ~/.schrodinger/jobserver.config is version-agnostic. Every installed Schrödinger release reads the same file when it talks to the Job Server — 2026-1, 2025-4, 2025-3, 2024-4, all of them. So one rotation fixes every client version the user has on their path. You don’t have to re-rotate once per suite version.

Prove that to yourself with back-to-back testapp runs under different modules:

Verify 01
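In text form, and with module names and queue assumed to match your site’s conventions, the check is:

```
module purge && module load schrodinger/2026-1
testapp -HOST <queue> -t 90 -j verify-2026-1

module purge && module load schrodinger/2025-4
testapp -HOST <queue> -t 90 -j verify-2025-4
```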

Both should print a JobId: <uuid> line, exit 0, and land cleanly in Slurm. No gRPC errors, no “weak RSA key” warnings. This confirms the rotation worked for the specific Schrödinger version you drove it from and any older-version fallbacks your users still rely on.

On our live cluster, this is the result we saw for the admin user post-rotation:

Verify 03

Both exit 0. Both tracked cleanly through Slurm and Job Server. No gRPC errors. Batch rotation across the remaining three users ran the same way and finished quickly.

Note: If you drive this rotation via Ansible or some other automation, remember that jsc cert get is not idempotent — calling it against a Job Server where the user is already registered will produce an error. Always pair it with jsc cert remove first, and treat the combined sequence as your idempotent unit. Or better yet, gate the whole thing behind the key-size audit: only rotate if the existing key is < 2048 bits.
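A sketch of that gate, reusing the same base64-decode-then-openssl trick as the audit script. The helper names here are our own, not part of the jsc tooling; the jobserver.config structure is as described above (auth.private holds a base64-encoded PEM private key):

```python
# Gate rotation on actual key size, so re-runs of the automation are no-ops
# for users who already carry a 2048-bit key.
import base64, subprocess

def key_bits(pem: bytes) -> int:
    """Ask the openssl CLI for the RSA modulus size of a PEM private key."""
    out = subprocess.run(["openssl", "rsa", "-noout", "-text"],
                         input=pem, capture_output=True, check=True)
    for line in out.stdout.decode().splitlines():
        if "Private-Key" in line:
            # e.g. "Private-Key: (1024 bit, 2 primes)"
            return int(line.split("(")[1].split(" bit")[0])
    raise ValueError("no RSA private key found in input")

def needs_rotation(entry: dict) -> bool:
    """True when a jobserver.config entry carries a key shorter than 2048 bits."""
    pem = base64.b64decode(entry["auth"]["private"])
    return key_bits(pem) < 2048
```

Only entries for which needs_rotation returns True get the remove-then-get pair; everything else is left alone.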


What to change in your post upgrade testing going forward

Three takeaways for any HPC admin running Schrödinger on top of Slurm, applicable well beyond just this one upgrade:

1. Schrödinger post-upgrade verification needs two independent paths

A post-upgrade verification plan needs at least two separate checks that are not co-dependent.

The Slurm-side check (does sbatch land, does the job run, does sacct report clean completion?) is the traditional one. It’s almost always in standard monitoring. It catches most operational regressions.

The client-side check (does testapp exit 0, is jsc info polling succeeding, is Maestro’s Job Monitor panel showing the expected state?) is the one that is harder to instrument. HPC admins are rarely comfortable driving the Maestro GUI, for instance.

2. Automate the key-size audit into your upgrade playbook

The Python + openssl one-liner above is short enough to drop into a Slack bot, a Jenkins check, or an Ansible role that runs before every Schrödinger upgrade and again after. If the pre-upgrade audit already shows 1024-bit certs in the wild, you know the upgrade will trigger the silent break, and you can schedule the rotation before any user sees a symptom. This is the kind of thing that turns a “production incident” into a “Wednesday afternoon routine maintenance task.”
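As one hypothetical shape for that, an Ansible task could run the audit script and fail the play while weak keys remain (the script filename and failure condition here are ours, not Schrödinger’s):

```yaml
# Pre-upgrade gate: fail fast if any registration still carries a 1024-bit key.
# audit_jsc_certs.py is the audit script above, saved to the role's files/ dir.
- name: Audit Schrödinger client-cert key sizes
  ansible.builtin.script: audit_jsc_certs.py
  register: cert_audit
  changed_when: false
  failed_when: "'1024 bit' in cert_audit.stdout"
```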

We’re folding this check into the BioTeam operational-readiness review that shapes our upgrade and post-upgrade procedures.

3. Treat client-cert rotation as part of user lifecycle management

Every time a user is added, changed, or offboarded, the client-cert matrix changes. If you have a local_users Ansible role, or an Okta-to-Slurm-accounting synchronization, or any other user-lifecycle tooling, that’s the right place to also track registered Job Servers per user and key size per registration. Close the gap between the two layers — the Linux user and the Schrödinger user — so the “who is registered where, with what key size” inventory is always discoverable without a six-command audit.


Bottom Line

jobserverd version 73060 and above will silently reject the 1024-bit RSA client certs that older Schrödinger registrations left behind. The failure is partial: Slurm-side integration and license accounting continue to work, while users see client-side gRPC errors in Maestro and testapp. The fix is a three-command rotation (jsc cert list → jsc cert remove → jsc cert get) per affected user, which fixes every installed Schrödinger version in one shot because the cert file is shared across them. There is no Schrödinger-published warning about this; the error message itself is the only documentation. If you upgraded jobserverd in the last few months and haven’t audited your user population’s client-cert key sizes, do that now.
