Reproducing AssayPDF
Step-by-step guide to running AssayPDF end-to-end on your machine and reproducing a published score.
Prerequisites
macOS:

    brew install ghostscript qpdf mupdf-tools exiftool imagemagick

veraPDF is not in core Homebrew. Install it via the headless installer:

    cd /tmp
    curl -L -o verapdf-installer.zip https://software.verapdf.org/rel/verapdf-installer.zip
    unzip -o verapdf-installer.zip
    # Follow the on-screen prompts, or use the auto-install XML approach in scripts/bootstrap.sh

Linux (Debian/Ubuntu):

    sudo apt-get install ghostscript qpdf mupdf-tools libimage-exiftool-perl imagemagick
    # veraPDF: same headless installer as macOS

Python + uv

    curl -LsSf https://astral.sh/uv/install.sh | sh

Clone + sync

    git clone https://github.com/thinkneverland/assay-pdf.git
    cd assay-pdf
    uv sync --all-extras

uv will fetch Python 3.12, install all dependencies, and create .venv/. This takes about 30 seconds on a fresh machine.
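Before generating the corpus, it can help to confirm that the prerequisite tools are actually on PATH. The following is a minimal sketch, not part of AssayPDF: the helper name `missing_tools` is hypothetical, and the executable names (`gs` for Ghostscript, `mutool` for mupdf-tools, `magick` or `convert` for ImageMagick, `verapdf` for veraPDF) are assumptions about a typical install.

```python
import shutil
from typing import Callable, Optional

# Each entry lists acceptable executable names for one prerequisite.
# These names are assumptions about a typical install, not taken from AssayPDF.
REQUIRED_TOOLS = {
    "ghostscript": ("gs",),
    "qpdf": ("qpdf",),
    "mupdf-tools": ("mutool",),
    "exiftool": ("exiftool",),
    "imagemagick": ("magick", "convert"),
    "verapdf": ("verapdf",),
    "uv": ("uv",),
}


def missing_tools(
    required: dict[str, tuple[str, ...]] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the prerequisites for which no candidate executable is on PATH."""
    return [
        name
        for name, candidates in required.items()
        if not any(which(exe) for exe in candidates)
    ]
```

Calling `missing_tools()` with no arguments checks the current PATH; an empty list means every tool was found.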
Generate the corpus
    uv run assay generate

Produces 62 PDFs in corpus/:
- 23 positives (one per variant)
- 39 negatives (21 concrete + 18 stubs documented for v0.1.1)
- corpus/manifest.json with the SHA-256 of every file
The PDFs themselves are gitignored. They’re regenerated deterministically from the manifest + the generator code.
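Since the PDFs are regenerated rather than committed, one way to convince yourself that generation is deterministic is to hash the corpus after two separate runs and compare. This is a sketch under that assumption; `corpus_digests` and `changed_files` are hypothetical helpers, not part of the `assay` CLI.

```python
import hashlib
from pathlib import Path


def corpus_digests(corpus_dir: str) -> dict[str, str]:
    """Map each PDF's path (relative to corpus_dir) to its SHA-256 hex digest."""
    root = Path(corpus_dir)
    return {
        pdf.relative_to(root).as_posix(): hashlib.sha256(pdf.read_bytes()).hexdigest()
        for pdf in sorted(root.rglob("*.pdf"))
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths that were added, removed, or whose digest differs."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Take `corpus_digests("corpus")` after one `uv run assay generate`, regenerate, take it again, and expect `changed_files` to return an empty list.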
Optional: ICC profile setup
For variant-specific colorimetry, install the Adobe ICC profiles:

    # macOS: install Adobe Acrobat or Photoshop, OR download the Adobe ICC profiles directly
    # from https://www.adobe.com/support/downloads/iccprofiles/iccprofiles_mac.html

Without these, AssayPDF falls back to macOS’s Generic CMYK Profile.icc, which is structurally valid but not the spec-recommended ICC profile for any specific variant.
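The fallback behaviour described here amounts to a "first profile that exists wins" lookup. The sketch below illustrates the idea only; it is not AssayPDF's actual resolution code, and the function name and its parameters are assumptions.

```python
from pathlib import Path
from typing import Iterable, Optional


def pick_icc_profile(preferred: Iterable[str], fallback: str) -> Optional[Path]:
    """Return the first preferred profile that exists, else the fallback, else None."""
    for candidate in preferred:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    # No preferred profile found: use the generic fallback if it is present.
    fallback_path = Path(fallback).expanduser()
    return fallback_path if fallback_path.is_file() else None
```

Pass the locations where your Adobe profiles were installed as `preferred` and the path to Generic CMYK Profile.icc as `fallback`; a `None` result means neither is available.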
Run a benchmark
pdfToolbox (callas)

    # Set the path to your GWG 2022 profile directory if it is not the default
    export ASSAY_PDFTOOLBOX_PROFILE_DIR="$HOME/Library/Application Support/callas software/pdfToolbox/Profiles"
    uv run assay benchmark --engine pdftoolbox

PitStop (Enfocus)

    export ASSAY_PITSTOP_PROFILE_DIR="$HOME/Library/Application Support/Enfocus/PitStop Server/Preflight Profiles"
    uv run assay benchmark --engine pitstop

lintPDF

    uv run assay benchmark --engine lintpdf
    # Currently a stub: emits a warning per file. Real integration ships with the lintPDF API.

Each benchmark writes:
- results/<engine>-<timestamp>.json: raw EngineResult per PDF
- results/<engine>-<timestamp>.score.json: confusion matrix per (rule, variant)
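To make the score file concrete: an F1 figure can be derived from confusion-matrix counts. The sketch below assumes each (rule, variant) cell carries true-positive, false-positive, and false-negative counts under the keys `tp`, `fp`, and `fn`, and micro-averages across cells. Both the key names and the averaging choice are assumptions; the actual score JSON schema and AssayPDF's aggregation may differ.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all three counts are zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(cells: list[dict[str, int]]) -> float:
    """Sum counts over all (rule, variant) cells, then compute a single F1."""
    # Assumed cell shape: {"tp": int, "fp": int, "fn": int}
    tp = sum(cell["tp"] for cell in cells)
    fp = sum(cell["fp"] for cell in cells)
    fn = sum(cell["fn"] for cell in cells)
    return f1_score(tp, fp, fn)
```

Micro-averaging weights every PDF equally; averaging per-cell F1 values instead would weight every (rule, variant) pair equally and can give a different number.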
Render the report
    uv run assay report --format md > REPORT.md
    uv run assay report --format html --output REPORT.html

The report aggregates every *.score.json in results/. To compare engines, run the benchmark for each, then re-render.
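Comparing engines means repeating the benchmark command once per engine and then rendering a single report. A minimal sketch of that loop, using only the commands shown on this page; the function names are hypothetical, and the `runner` parameter exists so the sequence can be inspected without actually invoking `uv`.

```python
import subprocess
from typing import Callable, Sequence

ENGINES = ("pdftoolbox", "pitstop", "lintpdf")


def comparison_commands(engines: Sequence[str] = ENGINES) -> list[list[str]]:
    """One benchmark command per engine, followed by a single HTML report render."""
    commands = [["uv", "run", "assay", "benchmark", "--engine", e] for e in engines]
    commands.append(
        ["uv", "run", "assay", "report", "--format", "html", "--output", "REPORT.html"]
    )
    return commands


def run_comparison(
    engines: Sequence[str] = ENGINES,
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each command in order; with subprocess.run, a non-zero exit raises."""
    for command in comparison_commands(engines):
        runner(command, check=True)
```

Run `run_comparison()` from the repository root so that results/ and REPORT.html land where the rest of this guide expects them.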
Verify the corpus
    uv run assay validate

Walks every PDF in corpus/manifest.json and verifies:
- The file exists.
- Its SHA-256 matches the manifest.
- It passes verapdf PDF/X-4 validation.
Failures are reported with the file path and the verapdf message. This check runs in CI on every push.
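The first two checks are easy to reproduce by hand. The sketch below assumes the manifest is a JSON object mapping each file path (relative to the manifest's directory) to its SHA-256 hex digest; the real layout of corpus/manifest.json may differ, so treat that shape and the `verify_manifest` name as assumptions. The veraPDF check is left out.

```python
import hashlib
import json
from pathlib import Path


def verify_manifest(manifest_path: str) -> list[str]:
    """Return one message per file that is missing or whose SHA-256 differs."""
    manifest_file = Path(manifest_path)
    # Assumed shape: {"relative/path.pdf": "<sha256 hex digest>", ...}
    expected: dict[str, str] = json.loads(manifest_file.read_text(encoding="utf-8"))
    problems = []
    for relative_path, expected_digest in expected.items():
        pdf = manifest_file.parent / relative_path
        if not pdf.is_file():
            problems.append(f"{relative_path}: missing")
            continue
        actual_digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        if actual_digest != expected_digest:
            problems.append(f"{relative_path}: SHA-256 mismatch")
    return problems
```

An empty list from `verify_manifest("corpus/manifest.json")` would correspond to the existence and digest checks passing for every entry.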
Reproducing a published score
If a published comparison says “pdfToolbox 16.2 scored 78.3% F1 on corpus v0.1.0”:

    git checkout v0.1.0                # match corpus version
    uv sync                            # match dependency versions
    uv run assay generate              # build the same corpus
    # Have pdfToolbox 16.2 installed (the version recorded in the published score)
    uv run assay benchmark --engine pdftoolbox
    uv run assay report --format md

The F1 should match within rounding (sub-0.5%). If it differs by more than that, file an issue with both score JSONs.
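The "within rounding" criterion can be checked mechanically. This sketch reads "sub-0.5%" as an absolute difference of less than 0.5 percentage points between the published and reproduced F1; that interpretation, and the function name, are assumptions.

```python
def f1_matches(
    published_pct: float, reproduced_pct: float, tolerance_pct: float = 0.5
) -> bool:
    """True when the two F1 percentages differ by less than tolerance_pct points."""
    return abs(published_pct - reproduced_pct) < tolerance_pct
```

For the example above, `f1_matches(78.3, 78.1)` holds while `f1_matches(78.3, 79.0)` does not, which would be the point to file an issue.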