Multi-instance + version-skew
Compile may run multiple instances of the same producer in parallel:
- Scale-out — many instances behind a load balancer to absorb queue depth.
- Multi-region — separate instances per Railway region for latency.
- Blue/green — old + new at the same time during rollout.
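Because Compile guarantees deterministic bytes, two instances of the same producer must produce byte-identical output for the same input + plan, every time. Any divergence corrupts the cache: an entry written by one instance can be served in place of the different bytes another instance would have produced.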
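One way to spot-check the guarantee is to send the same input + plan to every instance of a producer and compare SHA-256 digests of the returned bytes. The minimal sketch below assumes you have already collected each instance's output; how outputs are fetched is out of scope, and the function name check_determinism is hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def check_determinism(outputs: Dict[str, bytes]) -> List[str]:
    """Return the instances whose bytes differ from the most common digest.

    An empty list means every instance produced byte-identical output.
    """
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}
    if not digests:
        return []
    # The most common digest is treated as the reference output.
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, digest in digests.items() if digest != majority)
```

Any non-empty result means at least one instance is not deterministic and should be drained before it writes to the cache.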
What the determinism guarantee depends on
The cache key (src/compile_pdf/cache.py) includes:
1. compile_version: the Compile package version
2. codex_pdf_package_version: the Codex wheel version
3. color_schema_version: codex_pdf.color.COLOR_SCHEMA_VERSION
4. geom_schema_version: codex_pdf.geom.GEOM_SCHEMA_VERSION
5. codex_document_schema_version: pinned in compile_pdf.version
6. producer: rewrite / marks / impose / trap
7. sha256(canonical_plan)
8. sha256(input_bytes)
If any of (1)–(5) differ between instances, the cache key differs, so the same input + plan can map to different cached entries. That is fine when the difference is intended; it becomes corruption when one instance has been rebuilt against a newer Codex but another has not.
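The real key is built in src/compile_pdf/cache.py; the sketch below is not that code. It assumes the plan is a JSON-serializable dict canonicalized with sorted keys, and that all eight components are hashed together into one SHA-256 hex digest. The names CacheKeyParts, canonical_plan_bytes, and cache_key, and all version values, are hypothetical. It shows the property that components (1) to (5) rely on: change any version field and the same input + plan lands on a different key.

```python
import hashlib
import json
from dataclasses import dataclass, replace

PRODUCERS = {"rewrite", "marks", "impose", "trap"}


@dataclass(frozen=True)
class CacheKeyParts:
    # Components (1) to (5): versions fixed when the instance was built.
    compile_version: str
    codex_pdf_package_version: str
    color_schema_version: str
    geom_schema_version: str
    codex_document_schema_version: str
    # Component (6): which producer handled the job.
    producer: str

    def __post_init__(self) -> None:
        if self.producer not in PRODUCERS:
            raise ValueError(f"unknown producer: {self.producer!r}")


def canonical_plan_bytes(plan: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(plan, sort_keys=True, separators=(",", ":")).encode("utf-8")


def cache_key(parts: CacheKeyParts, plan: dict, input_bytes: bytes) -> str:
    fields = [
        parts.compile_version,
        parts.codex_pdf_package_version,
        parts.color_schema_version,
        parts.geom_schema_version,
        parts.codex_document_schema_version,
        parts.producer,
        hashlib.sha256(canonical_plan_bytes(plan)).hexdigest(),  # component (7)
        hashlib.sha256(input_bytes).hexdigest(),  # component (8)
    ]
    # Encoding the list as JSON keeps field boundaries unambiguous.
    return hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()


# Placeholder versions: two impose instances, one rebuilt against a newer Codex.
old = CacheKeyParts("1.4.0", "2.3.0", "3", "2", "5", "impose")
new = replace(old, codex_pdf_package_version="2.4.0")
plan = {"sheet": "A3", "nup": 2}
pdf = b"%PDF-1.7 example"

assert cache_key(old, plan, pdf) != cache_key(new, plan, pdf)  # skewed build, new key
assert cache_key(old, plan, pdf) == cache_key(old, {"nup": 2, "sheet": "A3"}, pdf)  # key order is irrelevant
```

Because the versions are part of the key, a skewed instance never reads another build's entries by key; the danger is an instance whose baked-in versions no longer describe the Codex it actually runs against.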
The version_skew health field
/v1/healthz.version_skew flips to true when the Codex section versions Compile was built against drift from the versions Codex publishes live. Operators watch this field and:
- Drain the affected instance from the load balancer.
- Rebuild against the new Codex.
- Redeploy and re-add to the LB.
Skew on a single instance is a recoverable state. Skew on all instances of a producer is an outage signal.
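The single-instance versus all-instances distinction can be automated. This is a minimal sketch that assumes /v1/healthz returns a JSON object with a top-level boolean version_skew; the helper names fetch_version_skew and skew_verdict are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def fetch_version_skew(base_url: str, timeout: float = 5.0) -> bool:
    # Assumes /v1/healthz returns JSON with a top-level boolean "version_skew".
    url = f"{base_url.rstrip('/')}/v1/healthz"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return bool(json.load(resp)["version_skew"])


def skew_verdict(skew_by_instance: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Classify the instances of ONE producer by their version_skew flag.

    ("ok", [])         no instance is skewed
    ("drain", [...])   some instances are skewed: drain, rebuild, redeploy them
    ("outage", [...])  every instance is skewed: outage, run the cascade
    """
    skewed = sorted(name for name, skew in skew_by_instance.items() if skew)
    if not skewed:
        return "ok", []
    if len(skewed) == len(skew_by_instance):
        return "outage", skewed
    return "drain", skewed
```

A poller might call skew_verdict({name: fetch_version_skew(url) for name, url in instances.items()}), where instances is a hypothetical mapping from instance name to base URL for a single producer.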
Codex change ripple rule
Any change to codex-pdf (code, schema, image tag, or codex_pdf.version.VERSION) MUST cascade a redeploy of every Compile container that calls Codex. Skipping the cascade silently pins consumers to a stale contract.
Cascade order:
1. Bump codex-pdf and verify that Codex's produce_surface_audit.py passes.
2. Bump the codex pin in compile-pdf/pyproject.toml if the major version moved (otherwise the existing range is fine).
3. Rebuild and redeploy compile-rewrite, compile-marks, compile-impose, and compile-trap (in any order).
4. Rebuild and redeploy compile-sidecar in compile-pdf-marketing.
5. Run compile-pdf health against each environment and confirm version_skew: false.
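The cascade order can be expressed as ordered stages, which makes the conditional pin bump and the "any order" step explicit. This is a minimal sketch that assumes "MAJOR.MINOR.PATCH"-style Codex versions; the function names major and cascade_stages are hypothetical, and the step strings only describe the actions listed above.

```python
from typing import List


def major(version: str) -> int:
    # Assumes "MAJOR.MINOR.PATCH"-style version strings.
    return int(version.split(".")[0])


def cascade_stages(old_codex: str, new_codex: str) -> List[List[str]]:
    """Return the cascade as ordered stages; steps inside one stage may run in any order."""
    stages = [["bump codex-pdf and verify produce_surface_audit.py passes"]]
    if major(new_codex) != major(old_codex):
        # Per the cascade rule, only a major-version move needs a new pin.
        stages.append(["bump the codex pin in compile-pdf/pyproject.toml"])
    producers = ("compile-rewrite", "compile-marks", "compile-impose", "compile-trap")
    stages.append([f"rebuild and redeploy {name}" for name in producers])
    stages.append(["rebuild and redeploy compile-sidecar in compile-pdf-marketing"])
    stages.append(["run compile-pdf health in each environment; confirm version_skew: false"])
    return stages
```

Grouping the four producer containers into one stage mirrors the "any order" note: they can be redeployed in parallel, but all of them must finish before the sidecar and the final health check.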
Operator runbook
| Symptom | Likely cause | Action |
|---|---|---|
| version_skew: true on one instance | Partial rollout in progress | Wait, or drain that instance |
| version_skew: true on all instances | Codex changed but Compile was not rebuilt | Run the cascade |
| cache_hit_rate drops to ~0% | A Codex section bump changed every cache key | Expected; recovers as new entries fill the cache |
| queue_depth grows without bound | Trap engine selection failure (especially ghostscript on a container without the extra) | Check COMPILE_TRAP_ENGINE against the installed extras |
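The last row's check can be scripted as a startup or triage guard. This is a minimal sketch: the engine-to-extra map ENGINE_EXTRAS, the extra's name, and the function trap_engine_mismatch are all hypothetical, and how a container records its installed extras is deployment-specific, so the set is passed in.

```python
from typing import Optional, Set

# Hypothetical map from trap engine name to the optional extra it needs.
# Only ghostscript is named in the runbook; the extra's name is assumed.
ENGINE_EXTRAS = {"ghostscript": "ghostscript"}


def trap_engine_mismatch(engine: Optional[str], installed_extras: Set[str]) -> Optional[str]:
    """Return a problem description, or None if the engine's extra is present."""
    if not engine:
        return None  # COMPILE_TRAP_ENGINE unset: nothing to compare
    needed = ENGINE_EXTRAS.get(engine)
    if needed is not None and needed not in installed_extras:
        return (f"COMPILE_TRAP_ENGINE={engine!r} needs the {needed!r} extra, "
                "which this container does not have")
    return None
```

In a container this would be called as trap_engine_mismatch(os.environ.get("COMPILE_TRAP_ENGINE"), extras), failing fast at startup instead of letting queue_depth climb.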