Reproducible Benchmarks: Sharing Quantum Job Performance Data Across Teams

2026-03-07
10 min read

Standardize quantum benchmark archives (.qba) and CI validation to restore data trust—share signed, verifiable job metrics across labs and vendors.

Why your team still can’t trust quantum performance numbers

Sharing quantum job metrics across labs, cloud vendors and academic groups in 2026 should be straightforward — but it isn’t. Teams still face siloed datasets, inconsistent metadata, opaque provenance and mismatched tooling that erode data trust. That mismatch slows research, makes audits costly, and frustrates collaborative reproducibility. If your team’s benchmarking artifacts look more like a pile of CSVs than a portable evidence bundle, this article shows a practical path forward: a reproducible benchmark archive format and a CI-integrated workflow to build, validate and share trustworthy quantum job performance data.

“Low data trust continues to limit how far enterprise AI (and by extension, scientific computing) can scale.” — Salesforce State of Data and Analytics (cited trend, 2025–2026)

The problem in 2026: enterprise lessons apply to quantum benchmarks

Enterprise research through late 2025 and early 2026 continues to show the same root causes: silos, missing provenance, ambiguous metadata and mismatched lifecycle controls. The Salesforce State of Data and Analytics research and recent industry moves (for example, acquisitions that emphasize verification and timing analysis) illustrate how important structured artifact management and strong provenance are for trusting results.

Quantum teams face additional constraints: hardware noise variability, vendor-specific backends, limited job throughput, and SDK changes that affect transpilation and runtime. These make reproducibility harder than for classical benchmarks. But by borrowing enterprise-grade controls — signed manifests, immutable archives, CI gates, provenance standards and deterministic simulation baselines — teams can rebuild trust into quantum job metrics.

What to standardize: the QBench Archive (.qba)

Introduce a small, opinionated archive format designed for portability and verification: the QBench Archive (.qba). A .qba is a compressed, signed bundle that contains everything someone needs to validate a benchmark claim: circuits, compiled circuits, raw results, calibration snapshots, a manifest, and reproducibility metadata.

High-level goals for .qba

  • Provenance-first: include the full environment and calibration snapshot
  • Verifiable: manifest with cryptographic checksums and signatures
  • Reproducible: deterministic seeds, pinned Docker/images and simulator noise models
  • CI-friendly: validators and pipelines that can run automatically
  • Vendor-agnostic: adaptable for Qiskit, Cirq, Q#, PennyLane, etc.
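To make "vendor-agnostic" concrete, a .qba runner can target a tiny backend interface, and CI can substitute a deterministic fake when no hardware is available. This is an illustrative sketch, not part of any published spec; `BenchmarkBackend` and `FakeBackend` are hypothetical names:

```python
import random
from typing import Protocol


class BenchmarkBackend(Protocol):
    """Minimal interface a vendor-agnostic .qba runner could target."""
    name: str

    def run(self, circuit_text: str, shots: int, seed: int) -> dict[str, int]:
        """Execute a circuit and return a mapping of bitstring -> counts."""
        ...


class FakeBackend:
    """Deterministic stand-in for CI validation without hardware access."""
    name = "fake-sim"

    def run(self, circuit_text: str, shots: int, seed: int) -> dict[str, int]:
        # Seeded RNG: identical inputs always yield identical counts,
        # which is exactly what a reproducibility gate needs to assert.
        rng = random.Random(seed)
        zeros = rng.randint(0, shots)
        return {"00": zeros, "11": shots - zeros}
```

Real adapters for Qiskit, Cirq, or PennyLane would satisfy the same protocol, so the validation pipeline never depends on one vendor's SDK.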

Suggested .qba directory layout

The archive is a gzip-compressed tarball (.tar.gz, or optionally .zip) with a strict layout:

  • manifest.json — top-level checksums, version, archive-id, signature placeholder
  • provenance.json — environment, SDKs, hardware identifiers, calibration snapshot
  • circuits/ — original source circuits (.qasm, .py, .ipynb references)
  • compiled/ — compiled gate sequences submitted to hardware
  • raw_results/ — raw job outputs (JSON) and job-metadata including job id
  • metrics.json — computed metrics (fidelity, success_rate, walltime, queue_wait, etc.)
  • notebooks/ — analysis notebooks with deterministic seeds
  • plots/ — generated PNG/SVG for quick inspection
  • CITATION.cff — dataset citation metadata for DOIs
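Packing this layout can be automated so the required files are enforced at build time. A minimal sketch, assuming the layout above (`pack_qba` is a hypothetical helper, not part of any spec):

```python
import tarfile
from pathlib import Path

# The three files every .qba must carry; the directories are optional.
REQUIRED = {"manifest.json", "provenance.json", "metrics.json"}


def pack_qba(src_dir: str, out_path: str) -> None:
    """Bundle a benchmark directory into a .qba (tar.gz) with the strict layout.

    out_path should live outside src_dir so the archive never packs itself.
    """
    src = Path(src_dir)
    missing = REQUIRED - {p.name for p in src.iterdir()}
    if missing:
        raise ValueError(f"missing required files: {sorted(missing)}")
    with tarfile.open(out_path, "w:gz") as tf:
        for p in sorted(src.rglob("*")):
            if p.is_file():
                # Store paths relative to the archive root,
                # e.g. 'raw_results/job-42.json'.
                tf.add(p, arcname=str(p.relative_to(src)))
```

A companion CI step can then run the validator (shown later) against the freshly packed archive before anything is signed or published.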

Example manifest (conceptual)

manifest.json should be machine-friendly and human-readable. Use SHA-256 checksums and include a top-level archive UUID and semantic version.

{
  "archive_id": "qba:2026:org.example:bench:uuid-1234",
  "format_version": "1.0.0",
  "created_at": "2026-01-12T15:23:00Z",
  "files": {
    "provenance.json": "sha256:abc...",
    "raw_results/job-42.json": "sha256:def...",
    "compiled/circuit1.qasm": "sha256:123...",
    "metrics.json": "sha256:456..."
  },
  "signature": "sigstore:... or gpg:..."
}

Define standard metrics and units (so numbers mean the same thing)

Benchmarks are only useful when metrics are comparable. Here’s a minimal, practical metric set every .qba should include:

  • success_rate — fraction of shots meeting target outcome (unitless, 0–1)
  • fidelity_est — estimated state or process fidelity (0–1, method described in provenance)
  • mean_gate_error — average two-qubit gate error (unit: error rate, 0–1)
  • measurement_error — baseline readout error (unit: error rate, 0–1)
  • shots — number of measurement shots used
  • circuit_depth — compiled circuit depth (integer)
  • walltime_seconds — total wall time to execute the job (seconds)
  • queue_wait_seconds — time queued on provider (seconds)

Attach a small metrics.json that also documents the computation method (e.g., whether fidelity is estimated via cross-entropy benchmarking, randomized benchmarking, tomography, etc.).
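A metrics.json following the set above might look like this (values are illustrative; `fidelity_method` is a suggested field name for documenting the estimation technique):

```json
{
  "success_rate": 0.943,
  "fidelity_est": 0.91,
  "fidelity_method": "randomized_benchmarking",
  "mean_gate_error": 0.008,
  "measurement_error": 0.012,
  "shots": 8192,
  "circuit_depth": 37,
  "walltime_seconds": 412.6,
  "queue_wait_seconds": 1820.0
}
```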

Provenance: what to capture (and why)

Provenance is the single biggest contributor to reproducibility. Capture these elements in provenance.json:

  • SDK and tooling: exact package names and pinned versions, e.g., qiskit==0.45.2, cirq==1.4.0
  • runtime environment: OS, Python minor version, container image digest
  • hardware snapshot: backend name, backend-id, calibration timestamp, per-qubit T1/T2, gate errors
  • transpilation settings: optimization level, basis gates, routing seed
  • random seeds: seed for RNGs used in circuit generation, simulator noise
  • job identifiers: vendor job ids, operator names, request timestamps
  • data retention policy: how long raw job logs are kept and where archived

Example provenance fragment (abbreviated):

{
  "sdk": {"qiskit": "0.45.2"},
  "container": {"image": "ghcr.io/org/qbench-runner@sha256:abcdef..."},
  "backend": {"provider": "vendorX", "backend_id": "vendorX_qpu_4v2", "calibration_time": "2026-01-10T03:00:00Z"},
  "calibration": {"t1_us": [50.2, 47.8], "t2_us": [40.1, 39.3], "single_qubit_error": [0.0012, 0.0015]}
}
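The software side of this capture can run automatically at archive-build time; hardware fields (backend id, calibration snapshot) are appended later by the job-submission step. A hedged sketch (`capture_provenance` is a hypothetical helper):

```python
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_provenance(packages: list[str]) -> dict:
    """Record environment facts for provenance.json: OS, Python version,
    and the exact installed versions of the listed SDK packages."""
    sdk = {}
    for name in packages:
        try:
            sdk[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            sdk[name] = None  # package absent in this environment
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "sdk": sdk,
    }
```

Calling it as `capture_provenance(["qiskit", "cirq"])` in the build container yields the pinned-version record the manifest's checksum then freezes in place.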

Signing, verification and transparency

To make .qba credible across organizations, add signatures and encourage transparency logs:

  • Sign manifest.json using Sigstore (recommended) or GPG. Sigstore integrates with CI and shortens onboarding for verification.
  • Publish archive metadata to a transparency log or dataset registry (e.g., Zenodo, DataCite) and mint a DOI for published benchmarks.
  • Use in-toto-style provenance attestation so consumers can verify each stage of creation (build -> compile -> submit -> collect -> analyze).

CI integration: a practical end-to-end workflow

CI is where reproducibility becomes operational. The goal: each PR or release runs a reproducibility pipeline that validates archives, runs deterministic simulators, optionally schedules hardware jobs, and publishes artifacts. Below is a GitHub Actions-style template your team can adapt.

  1. Lint & Validate — ensure manifest format, required files, and checksums.
  2. Build — containerized environment that pins tool versions and builds compiled circuits deterministically.
  3. Simulate — run a deterministic simulator with the same seed and noise model; fail if results diverge beyond tolerance.
  4. Submit (optional) — schedule real hardware runs using scoped credentials via a secure runner or scheduled job (not in public CI by default).
  5. Collect & Compute Metrics — compute standardized metrics.json and attach to artifact.
  6. Sign & Publish — sign manifest (Sigstore) and publish archive to storage/registry; create a release and mint DOI when needed.

Example GitHub Actions snippet (concept)

name: QBench CI
on:
  push:
    paths:
      - 'benchmarks/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Validate QBA manifest
        run: python tools/validate_qba.py benchmarks/experiment.qba
      - name: Run deterministic simulation
        env:
          DOCKER_IMAGE: ghcr.io/org/qbench-runner@sha256:abcdef
        run: |
          docker run --rm -v "$PWD:/work" -w /work $DOCKER_IMAGE \
            python run_sim.py --archive benchmarks/experiment.qba --seed 42
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign manifest with Sigstore (keyless)
        # keyless signing also requires `permissions: id-token: write` on the job
        run: cosign sign-blob --yes benchmarks/experiment/manifest.json
      - name: Publish artifact
        uses: actions/upload-artifact@v4
        with:
          name: experiment-qba
          path: benchmarks/experiment.qba

Notes: avoid using provider account tokens in public CI. Use a private runner or scheduled service account to submit to hardware. The CI should be able to run deterministically in a container to validate the archive without access to real hardware.
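The "fail if results diverge beyond tolerance" check from step 3 can be implemented as a total-variation-distance comparison between shot-count histograms. A minimal sketch (`check_divergence` and the 0.02 default tolerance are illustrative, not prescribed by any spec):

```python
def total_variation(counts_a: dict[str, int], counts_b: dict[str, int]) -> float:
    """Total variation distance between two shot-count histograms.

    0.0 means identical distributions; 1.0 means fully disjoint support.
    """
    shots_a = sum(counts_a.values())
    shots_b = sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(k, 0) / shots_a - counts_b.get(k, 0) / shots_b)
        for k in keys
    )


def check_divergence(reference: dict[str, int],
                     observed: dict[str, int],
                     tolerance: float = 0.02) -> None:
    """Fail the CI step when simulated results drift beyond tolerance."""
    tv = total_variation(reference, observed)
    if tv > tolerance:
        raise SystemExit(f"results diverged: TV distance {tv:.4f} > {tolerance}")
```

Since the reference counts ship inside the .qba and the simulation is seeded, any nonzero distance points at an environment drift (SDK, transpiler, or noise-model change) rather than sampling noise.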

Practical validation scripts (Python)

Below is a simple conceptual validator that checks manifest checksums and enforces required fields. In production, extend to verify signatures and run lightweight simulation checks.

#!/usr/bin/env python3
"""Validate a QBench Archive: required files present, manifest checksums match."""
import hashlib
import json
import sys
import tarfile

def sha256_bytes(b):
    return hashlib.sha256(b).hexdigest()

def validate_qba(tar_path):
    with tarfile.open(tar_path, 'r:gz') as tf:
        members = {m.name: m for m in tf.getmembers()}
        if 'manifest.json' not in members:
            raise SystemExit('manifest.json missing')
        manifest = json.load(tf.extractfile(members['manifest.json']))
        # Every file listed in the manifest must exist and match its checksum.
        for fname, expected in manifest['files'].items():
            if fname not in members:
                raise SystemExit(f'Missing {fname} in archive')
            data = tf.extractfile(members[fname]).read()
            if 'sha256:' + sha256_bytes(data) != expected:
                raise SystemExit(f'Checksum mismatch for {fname}')
    print('QBA validated')

if __name__ == '__main__':
    validate_qba(sys.argv[1])

Handling hardware-only metrics and limited access

Because many teams can’t run vendor hardware directly in CI, adopt a hybrid pattern:

  • Simulate in CI with pinned noise models to check for deterministic behaviour.
  • Schedule hardware runs outside public CI via secure runners or a scheduled job that has access to vendor credentials. Log job IDs to provenance.json.
  • Use a delayed verification job that runs when the hardware provider completes the job and someone with access posts the raw_results to a release pipeline. That pipeline re-runs the validator and attaches the signature and DOI.

This approach preserves security while keeping the validation chain auditable.

Operational practices and governance

Standards and automation matter, but governance ensures adoption. Here are practical steps teams should adopt now:

  • Policy: mandate a .qba for any published benchmark or claim used in internal decision-making.
  • Owner: assign a dataset owner responsible for retention, keys and registry submissions.
  • Rollout: add the QBench CI job to templates so new repos include reproducibility gates by default.
  • Audits: run quarterly integrity audits comparing published DOIs and transparency log entries.

Industry tailwinds in 2026

Late 2025 and early 2026 have pushed several trends favorable to reproducible quantum benchmarks:

  • Tooling for supply-chain security: Sigstore, Cosign and in-toto have become default tools in scientific CI, simplifying signing and attestation.
  • Vendor standardization: open interchange formats (OpenQASM 3.x maturity by 2025–2026) and improved backend metadata APIs make capturing calibration snapshots easier.
  • Dataset registries: tighter integration between repositories and DataCite/Zenodo supports DOI issuance for benchmark archives.
  • Enterprise emphasis on verification: acquisitions and product integrations (e.g., timing/verification toolchains) show that organizations value measured, certified performance claims.

These trends reduce friction for cross-vendor reproducibility and make .qba archives easier to verify and trust.

Advanced strategies and future predictions

For teams looking beyond basic reproducibility, consider these advanced strategies:

  • Continuous benchmark ledger: store a lightweight provenance record in a tamper-evident ledger (blockchain or transparency log) so any archive can be traced to its creation CI run and signatures.
  • Cross-lab regression tests: orchestrate multi-site runs where the same .qba is submitted to different backends to measure portability and vendor variance.
  • Benchmark certification: pursue third-party certification for critical benchmarks, especially for enterprise decision-making (e.g., procurement of cloud QPUs).
  • Standardized noise model interchange: by 2027 expect more vendor-adopted noise model specifications that let simulated CI checks better approximate hardware results.

Prediction: by the end of 2027, dataset registries and vendor APIs will routinely accept signed benchmark archives as the default artifact for performance claims in procurement and research publications.

Actionable checklist — make a reproducible benchmark today

  • Create a QBench Archive (.qba) for your experiment
  • Pin toolchain and container images, record them in provenance.json
  • Compute standardized metrics.json and describe computation methods
  • Sign manifests using Sigstore and publish metadata to a registry
  • Integrate validation & simulation into CI; use scheduled runners for hardware submissions
  • Mint a DOI on release and publish CITATION.cff

Example real-world scenario

Lab A and Lab B want to compare a QAOA circuit across Vendor X and Vendor Y. They agree on a QBench Archive spec. Lab A builds the .qba, runs on Vendor X, and publishes a signed archive and DOI. Lab B pulls the .qba, runs the same compiled circuit on Vendor Y with identical transpiler settings and seeds recorded in provenance.json, and publishes its signed .qba. An independent reviewer runs both archives through the CI validator and a deterministic simulator to verify that metrics were computed identically and that signature chains are intact. Auditors can now trust both results because each archive contains the same structural evidence and cryptographic attestation.

Closing — why this matters now

In 2026, quantum computing is moving from exploratory labs into collaborative, multi-institution workflows and even procurement decisions. Enterprise research into data trust shows how quickly projects can stall when artifacts aren’t trustworthy. The QBench Archive + CI integration pattern gives teams a practical, implementable way to build trust: portable evidence bundles, cryptographic attestation, and reproducible CI checks.

Takeaways: standardize your archives, record rich provenance, sign and publish, and validate automatically in CI. This combination turns unverifiable claims into auditable artifacts that your peers, partners and auditors can rely on.

Call to action

Ready to standardize benchmarks at your organization? Download the QBench Archive spec starter, the CI templates, and the validator scripts from our repo, or contribute to the spec on GitHub to make it vendor-neutral. If you run quantum experiments today, package one benchmark into a .qba, run the CI validator and publish it—then share the DOI with your collaborators to start building measurable, trusted benchmarks together.
