Why I rotated JWT signing keys without forcing a global re-login.

Most JWT rotation guides assume one signing key and a hard cut: when the key changes, every token issued before the change is invalid. That's tolerable for an internal service where re-auth is a redirect. For a consumer platform with mobile clients, background workers and live WebSocket connections, it's a free Friday-night incident waiting to happen.

The version I shipped at Motosoft rotates JWT signing material across FastAPI + AWS SAM Lambda without forcing a single re-login. Four phases (add → flip → drain → drop), one GitHub Actions workflow per purpose, and a small piece of plumbing on top of Secrets Manager. Here's how it fits together.

The problem with single-key rotation

The naive setup: one secret in Secrets Manager, one signing key in memory, every token signed with that key. When you rotate, you generate a new key, replace the value in Secrets Manager, and restart the service so it picks up the new key.

The restart is where it falls apart. Existing tokens in the wild — in mobile-app local storage, in open WebSocket connections, in background workers re-using a refresh token — were signed with the old key. Your service no longer holds that key, so verification fails. Every active session 401s on its next request. You force a global re-auth.

Worse: if you have a refresh-token flow, refresh tokens were also signed with the old key, so they fail too. Users have to log in from scratch. For a pre-launch beta with active rider sessions, real-time chat, and trip-progress state on the connection, a global re-auth isn't "annoying" — it's a data-loss event for whatever was mid-flight.

The `kid` header: who signed this token?

The fix is to make the verifier capable of holding multiple keys at once. Rotation then becomes: introduce a new key alongside the old one, switch the signer to use the new one, wait for old tokens to expire naturally, drop the old key.

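The specs JWT is built on support this directly. RFC 7515 (JWS) defines an optional header parameter called kid (key ID) whose job is exactly to tell the verifier which key signed this token. Most setups never look at it because they have one key. With multiple keys, it becomes load-bearing. One caveat: the header is read before the signature is checked, so treat kid as untrusted input and use it only as a lookup key into a keyring you control.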

# When signing
payload = {"sub": user_id, "exp": ...}
token = jwt.encode(
    payload,
    key=current_signing_key.material,
    algorithm="HS256",
    headers={"kid": current_signing_key.kid},   # <- the unlock
)

# When verifying
header = jwt.get_unverified_header(token)
kid = header.get("kid")
key = KEYRING.get(kid)   # may be current, or an older key we still trust
if key is None:
    raise InvalidKid()
payload = jwt.decode(token, key.material, algorithms=["HS256"])

The verifier holds a keyring — a map from kid to key material. On rotation, you add a new entry; old entries stay until you decide they're truly drained.
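For illustration, here is a minimal sketch of how KEYRING and current_signing_key could be built from the secret's JSON at startup. The "keys" map mirrors the layout used later in this post; the "current_kid" field and the base64 encoding of the material are assumptions made for the example, not necessarily what the production service uses.

```python
import base64
import json
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SigningKey:
    kid: str
    material: bytes


def load_keyring(secret_string: str) -> Tuple[Dict[str, SigningKey], SigningKey]:
    """Parse the secret JSON into (keyring, current signing key).

    Assumed (hypothetical) layout:
      {"current_kid": "k2", "keys": {"k1": {"material": "<base64>"}, ...}}
    """
    doc = json.loads(secret_string)
    keyring = {
        kid: SigningKey(kid=kid, material=base64.b64decode(entry["material"]))
        for kid, entry in doc["keys"].items()
    }
    current_kid = doc["current_kid"]
    if current_kid not in keyring:
        # Refuse to start rather than sign with a key no verifier can look up.
        raise ValueError(f"current_kid {current_kid!r} is not in the keyring")
    return keyring, keyring[current_kid]
```

Failing fast when current_kid is missing from the keyring guards against signing tokens that no instance can verify.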

Four phases: add → flip → drain → drop

Once you have a keyring, the rotation is mechanical:

Phase 1 — ADD. Generate a new key. Add it to Secrets Manager alongside the existing keys (not replacing). All service instances reload and now have N+1 keys in the keyring. Verifications work for both old and new tokens. Signing still uses the old key.

Phase 2 — FLIP. Update the config that names "the current signing kid" to point at the new key. Service instances reload. New tokens are now signed with the new key; old tokens in the wild still verify because the old key is still in the keyring.

Phase 3 — DRAIN. Wait. How long? Long enough that every token signed with the old key has expired. If your access tokens are 1-hour TTL and your refresh tokens are 30-day TTL, you wait 30 days. (If that sounds long, see the refresh-token wrinkle below.)

Phase 4 — DROP. Remove the old key from Secrets Manager. Service instances reload, the keyring no longer contains the old key, the few stragglers still holding tokens signed with it get rejected. If you sized the drain window correctly, that's a vanishingly small number.

Nothing in this flow forces a re-auth. Existing sessions continue uninterrupted.
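The four phases can be sketched as operations on an in-memory copy of the keyring document. This is a simplified model, not the rotation script itself: the state layout ("current_kid" plus a "keys" map) and the function names are assumptions for the example, and the guards encode the two ordering rules that matter (never flip to a key verifiers have not loaded, never drop the key still used for signing).

```python
def add(state: dict, kid: str, material: str) -> None:
    """ADD: trust a new key for verification; the signer is untouched."""
    state["keys"][kid] = material


def flip(state: dict, kid: str) -> None:
    """FLIP: start signing with a key that is already in the keyring."""
    if kid not in state["keys"]:
        raise ValueError("flip before add: verifiers do not know this kid yet")
    state["current_kid"] = kid


def drop(state: dict, kid: str) -> None:
    """DROP: stop trusting an old key, never the one still used for signing."""
    if kid == state["current_kid"]:
        raise ValueError("refusing to drop the current signing key")
    del state["keys"][kid]


def verifies(state: dict, token_kid: str) -> bool:
    """A token is acceptable only while its kid is still in the keyring."""
    return token_kid in state["keys"]


state = {"current_kid": "k1", "keys": {"k1": "old-material"}}
add(state, "k2", "new-material")  # old and new tokens verify, signer is still k1
flip(state, "k2")                 # new tokens carry kid=k2, k1 tokens still verify
# DRAIN: wait until every token signed with k1 has expired.
drop(state, "k1")                 # k1 tokens are rejected from here on
```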

Secrets Manager: preserve-on-update is the unlock

The first time I shipped this I hit a wall I didn't expect. AWS Secrets Manager's PutSecretValue replaces the secret's value entirely. So if you PutSecretValue with {"new_kid": ...}, the old {"old_kid": ...} is gone. There's no "append" operation.

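The right pattern is read-modify-write, with an idempotency token on the write:

import json
import uuid

import boto3

client = boto3.client("secretsmanager")

# 1. Read the current value.
current = client.get_secret_value(SecretId=SECRET_ARN)
secret  = json.loads(current["SecretString"])

# 2. Modify in place. new_kid, new_key and now come from the rotation
#    script and must be JSON-serializable (strings here).
secret["keys"][new_kid] = {"material": new_key, "created_at": now}

# 3. Write back with a client-request-token derived from the kid.
#    Secrets Manager requires 32-64 characters, so the label is hashed
#    into a UUID (36 characters) instead of being passed as-is.
request_token = str(uuid.uuid5(uuid.NAMESPACE_URL, f"add-kid-{new_kid}"))
client.put_secret_value(
    SecretId=SECRET_ARN,
    SecretString=json.dumps(secret),
    ClientRequestToken=request_token,
)

ClientRequestToken is an idempotency key, and it becomes the ID of the new secret version. If the same write is retried with an identical payload, Secrets Manager ignores the repeat; if a second write reuses the token with a different payload, the call fails with ResourceExistsException. What it does not do is stop two different rotations from overwriting each other, because PutSecretValue has no compare-and-swap on the previous version. That guarantee comes from running every rotation through a GitHub Actions workflow with a concurrency group, so only one read-modify-write is in flight per purpose. Together they give you safe, declarative rotations.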

The other piece: every other secret in the same JSON blob (signing material for other purposes, rotation state, audit trail) gets preserved because the operation is read-modify-write. That's what I mean by "preserve-on-update semantics" — a tiny atomic-update helper sitting on top of GetSecretValue + PutSecretValue that every rotation goes through.
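A minimal sketch of what such a helper might look like follows. The name update_secret and its signature are hypothetical; the client argument is anything that exposes get_secret_value and put_secret_value with the boto3 Secrets Manager keyword arguments, which also makes the helper easy to exercise against a fake.

```python
import json
import uuid
from typing import Any, Callable, Optional


def update_secret(
    client: Any,
    secret_id: str,
    mutate: Callable[[dict], None],
    request_token: Optional[str] = None,
) -> dict:
    """Read the secret, apply `mutate` in place, and write the whole blob back.

    Everything `mutate` does not touch is preserved. The helper does not
    detect a concurrent writer, so callers must still serialize rotations.
    """
    current = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(current["SecretString"])
    mutate(secret)  # change only the entries this rotation owns
    client.put_secret_value(
        SecretId=secret_id,
        SecretString=json.dumps(secret),
        # Secrets Manager expects 32-64 characters; a UUID string is 36.
        ClientRequestToken=request_token or str(uuid.uuid4()),
    )
    return secret
```

An ADD phase would then pass a mutate callback that sets secret["keys"][new_kid] and nothing else, leaving the other purposes, rotation state, and audit trail as they were.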

One workflow per purpose

We sign different things for different purposes — access tokens, refresh tokens, internal service-to-service tokens, signed URLs for media. Each gets its own kid namespace and its own rotation cadence.

The rotation workflow is parameterized by purpose and phase:

# .github/workflows/rotate-jwt.yml
on:
  workflow_dispatch:
    inputs:
      purpose: { type: choice, options: [access, refresh, internal, media] }
      phase:   { type: choice, options: [add, flip, drop] }

concurrency:
  group: rotate-${{ inputs.purpose }}
  cancel-in-progress: false

jobs:
  rotate:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC to AWS
      contents: read    # required by checkout once permissions are set explicitly
    steps:
      - uses: actions/checkout@v4   # ops/rotate_jwt.py lives in the repo
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.ROTATION_ROLE_ARN }}
          aws-region: ap-southeast-2
      - run: >
          python ops/rotate_jwt.py
          --purpose ${{ inputs.purpose }}
          --phase ${{ inputs.phase }}

The concurrency group keys on purpose, so two rotations on different purposes (access + media) run in parallel; two on the same purpose serialize.
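On the script side, the entry point only has to accept the same two parameters. The sketch below is a guess at the shape of ops/rotate_jwt.py rather than its actual contents: the argument names match the workflow, while make_kid and its timestamp-based kid format are hypothetical ways of keeping each purpose in its own kid namespace.

```python
import argparse
from datetime import datetime, timezone
from typing import Optional, Sequence

PURPOSES = ("access", "refresh", "internal", "media")
PHASES = ("add", "flip", "drop")  # drain is a wait, not an action


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run one phase of a key rotation.")
    parser.add_argument("--purpose", required=True, choices=PURPOSES)
    parser.add_argument("--phase", required=True, choices=PHASES)
    return parser.parse_args(argv)


def make_kid(purpose: str, now: datetime) -> str:
    """Hypothetical kid format: the purpose prefix keeps namespaces apart."""
    return f"{purpose}-{now:%Y%m%dT%H%M%SZ}"


args = parse_args(["--purpose", "access", "--phase", "add"])
kid = make_kid(args.purpose, datetime(2025, 1, 1, tzinfo=timezone.utc))
```

Restricting both arguments to a fixed set of choices means a typo fails at argument parsing instead of partway through a rotation.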

The refresh-token wrinkle

The honest version of phase 3 is: you wait for the longest-lived token signed with the old key to expire. For refresh tokens that can be a month. If your security policy says "rotate keys monthly" and refresh tokens live for 30 days, the drain for one rotation is only just finishing when the next rotation starts, so the keyring always holds at least two keys and you're never fully on a single key.

We solved it by giving the refresh-token purpose its own rotation schedule, decoupled from access tokens. Access tokens rotate frequently with a short drain window. Refresh tokens rotate slowly; when they do, the drain matches the refresh-token TTL. In practice access rotations finish in ~4 hours (TTL + buffer); refresh rotations finish over ~31 days. Both are automated, and neither requires a re-login.
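The drain arithmetic is simple enough to write down. In this sketch the function name is hypothetical, and the 1-hour and 30-day TTLs with 3-hour and 1-day buffers are illustrative values chosen to land on the ~4 hour and ~31 day figures above.

```python
from datetime import datetime, timedelta, timezone


def earliest_drop(flipped_at: datetime, max_ttl: timedelta, buffer: timedelta) -> datetime:
    """Earliest safe time to run DROP for a key.

    The last token signed with the old key is issued just before the flip,
    so it expires at flipped_at + max_ttl. The buffer covers clock skew and
    instances that were slow to reload after the flip.
    """
    return flipped_at + max_ttl + buffer


flip_time = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
access_drop = earliest_drop(flip_time, timedelta(hours=1), timedelta(hours=3))
refresh_drop = earliest_drop(flip_time, timedelta(days=30), timedelta(days=1))
```

For flipped_at, the safe choice is the time the last instance finished reloading, not the time the workflow was triggered.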

Closing

The architecture is small — a keyring, a kid header, four workflow phases, a read-modify-write helper. Nothing exotic. But the difference between "rotate keys and force re-auth" and "rotate keys and nobody notices" is the difference between a runbook your on-call dreads and one you fire-and-forget.

If you're building new auth infra, design for multi-key from day one. Retrofitting kid-aware verification later is much harder than building it in.