The GATE Conformance Runner: What You Can Automate and What You Cannot

The gate-conformance v1.2.0 runner automates 9 of GATE’s 19 checks against a live evidence store. The other 10 return PARTIAL: not failure, but a structured handoff to the operator.

Andrew Stevens

Jun 17, 2026 · 8 Min read

Most compliance tooling works in binary. A check passes or it fails. That binary is comfortable for auditors and uncomfortable for engineers, because it hides the question that matters most: did the tool actually verify the thing, or did it just find no evidence of failure?

The GATE conformance runner, shipping in gate-conformance v1.2.0, introduces a third status: PARTIAL. It is the most important design decision in the tool and the one worth explaining before you run it.

What PARTIAL means

A PARTIAL result means the runner completed the automatable portion of a check, found no failure in that portion, and determined that the remainder requires something it cannot do from an evidence store query alone: a test execution, a file verification, or a controlled drill with time measurements.

PARTIAL is not failure. It is a structured handoff. Every PARTIAL result carries a manual_steps payload that tells the operator exactly what artefact is still owed: which test to run, what the expected outcome is, and what evidence to collect. Pull that field from the YAML output and you have a checklist, not an open question.

The alternative, marking a check PASS because the query returned no failures, would be worse. A tool that tells you your bypass paths are clean because it could not query for direct tool invocations is producing a false guarantee. The runner does not do that.
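The three-state outcome can be pictured as a small decision function. This is an illustrative sketch only, not the runner's code: `Status`, `CheckResult`, and `resolve_status` are hypothetical names, and the rule it encodes is simply the one described above (never report PASS while manual work is still owed).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    PARTIAL = "PARTIAL"


@dataclass
class CheckResult:
    check_id: str
    status: Status
    manual_steps: list[str] = field(default_factory=list)


def resolve_status(check_id: str, automated_ok: bool, manual_steps: list[str]) -> CheckResult:
    """Hypothetical helper: combine the automated outcome with any owed manual work."""
    if not automated_ok:
        # A failure in the automatable portion is a failure outright.
        return CheckResult(check_id, Status.FAIL)
    if manual_steps:
        # Nothing failed, but a query alone cannot prove the control holds.
        return CheckResult(check_id, Status.PARTIAL, list(manual_steps))
    return CheckResult(check_id, Status.PASS)
```

The ordering matters: the function can only reach PASS when there is nothing left for a human to verify.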

What it actually checks

Of GATE’s 19 conformance checks, 9 are fully automatable from an evidence store query:

Check01: zero tool executions without a policy decision record. A SQL join, target zero rows.
Check03: 100% attestation coverage on privileged requests. A coverage metric query.
Check04: schema validation rejects malformed inputs. Queryable from reject logs.
Check05: ledger hash chain integrity. Integrity report status query.
Check08: budget exhaustion denies tool calls. Enforcement deny event count.
Check09: 100% signature coverage on high-impact tools. A coverage metric query.
Check10: memory ACLs enforced at retrieval time. ACL decision log query.
Check12: evidence chain traversable. A SQL join across five tables.
Check13: policy bundle hash in evidence matches deployed bundle. Hash comparison query.
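
To make "a SQL join, target zero rows" concrete, here is a minimal sketch of a Check01-style query against an in-memory sqlite database. The table and column names (`tool_executions`, `policy_decisions`, `execution_id`) are assumptions made for illustration; the real evidence store schema may differ.

```python
import sqlite3

# Hypothetical schema: the real evidence store tables may be named differently.
ORPHAN_QUERY = """
SELECT e.execution_id
FROM tool_executions AS e
LEFT JOIN policy_decisions AS d
  ON d.execution_id = e.execution_id
WHERE d.execution_id IS NULL
"""


def orphan_executions(conn: sqlite3.Connection) -> list[str]:
    """Return tool executions that have no policy decision record."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tool_executions (execution_id TEXT PRIMARY KEY);
CREATE TABLE policy_decisions (decision_id TEXT PRIMARY KEY, execution_id TEXT);
INSERT INTO tool_executions VALUES ('exec-1'), ('exec-2');
INSERT INTO policy_decisions VALUES ('dec-1', 'exec-1');
""")

orphans = orphan_executions(conn)
status = "PASS" if not orphans else "FAIL"  # the target is zero rows
print(status, orphans)
```

In this toy data `exec-2` has no decision record, so the anti-join returns one row and the sketch reports FAIL.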

The other 10 are PARTIAL. Check02 requires proving no bypass paths exist at the network and IAM level. That is a configuration verification and a test execution, not a query. Check07 requires measuring time-to-containment on a live circuit breaker drill. Check14 requires verifying HITL approval signatures against the signing key. Check19 queries for event type crossover but cannot verify that the C16 and C19 runbooks are stored as separate documents with sign-off records. That requires human review.

The split is honest. 9 automated, 10 structured-manual, 0 faked.

The tier-aware checks

Three of the checks added in GATE v1.3 (Check16, Check17, and Check18) branch internally on autonomy tier. Evidence that earns a PASS at sandbox tier can earn a FAIL at bounded or high_privilege. This is deliberate, and it means the runner needs to know your tier before it runs.

Check16 covers C17 Agent Discovery. At sandbox tier, the runner looks for discovery events emitting and a signed classifier bundle (observe-only is sufficient). At bounded and high_privilege, it requires a reconciliation delta of zero between the C04 inventory and the tool API stream outside the active remediation TTL window. Sandbox and bounded apply different pass criteria to the same evidence.
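
A rough sketch of the Check16 branching, under stated assumptions: the function names and the evidence shape are hypothetical, and "outside the active remediation TTL window" is read here as "first seen longer ago than the TTL". It covers only the automatable portion and returns a boolean rather than a final status.

```python
from datetime import datetime, timedelta, timezone


def unreconciled(inventory: set[str], first_seen: dict[str, datetime],
                 now: datetime, remediation_ttl: timedelta) -> list[str]:
    """Agents seen in the tool API stream, absent from inventory, past the TTL."""
    return sorted(
        agent for agent, seen_at in first_seen.items()
        if agent not in inventory and now - seen_at > remediation_ttl
    )


def check16_automated_ok(tier: str, *, events_emitting: bool, bundle_signed: bool,
                         inventory: set[str], first_seen: dict[str, datetime],
                         now: datetime, remediation_ttl: timedelta) -> bool:
    if tier == "sandbox":
        # Observe-only is sufficient at sandbox tier.
        return events_emitting and bundle_signed
    # bounded and high_privilege: the reconciliation delta must be zero.
    return not unreconciled(inventory, first_seen, now, remediation_ttl)


now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
evidence = dict(
    events_emitting=True,
    bundle_signed=True,
    inventory={"agent-a"},
    first_seen={"agent-a": now - timedelta(days=30), "agent-x": now - timedelta(days=3)},
    now=now,
    remediation_ttl=timedelta(hours=24),
)
sandbox_ok = check16_automated_ok("sandbox", **evidence)
bounded_ok = check16_automated_ok("bounded", **evidence)
```

With this evidence the sandbox branch is satisfied, while the bounded branch is not: `agent-x` is missing from the inventory and was first seen well outside the 24-hour window.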

Check17 covers C18 Data Quality Gates. This is the check most likely to surprise teams who assume that quality_decision events emitting means quality gates are enforced. The runner does not make that assumption. It reads the active quality bundle’s action_matrix and determines whether the configured action for each dimension is “deny” or something softer. At bounded tier, freshness and confidence must be enforced. At high_privilege, provenance must also be enforced. If the quality bundle is configured to flag rather than deny, the check returns FAIL even if events are emitting correctly.
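
The distinction between "events are emitting" and "gates are enforced" can be sketched as a lookup against the action matrix. This is a sketch, not the runner's implementation: the shape of `action_matrix` (a mapping from dimension name to action string) and the empty requirement set for sandbox are assumptions, since only the bounded and high_privilege requirements are stated above.

```python
# Dimensions that must be configured as "deny" at each tier.
# The sandbox entry is an assumption; the bounded and high_privilege
# entries follow the tier rules described above.
REQUIRED_DENY: dict[str, set[str]] = {
    "sandbox": set(),
    "bounded": {"freshness", "confidence"},
    "high_privilege": {"freshness", "confidence", "provenance"},
}


def soft_dimensions(tier: str, action_matrix: dict[str, str]) -> list[str]:
    """Return required dimensions whose configured action is softer than deny."""
    return sorted(
        dim for dim in REQUIRED_DENY[tier]
        if action_matrix.get(dim) != "deny"
    )


# Hypothetical bundle: confidence is flagged rather than denied.
matrix = {"freshness": "deny", "confidence": "flag", "provenance": "deny"}
bounded_gaps = soft_dimensions("bounded", matrix)
sandbox_gaps = soft_dimensions("sandbox", matrix)
```

Any non-empty result corresponds to the FAIL case: the bundle is emitting decisions but not enforcing them on a dimension the tier requires.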

Check18 covers C19 Model Behaviour Monitoring. At bounded tier, the runner requires drift_decision events emitting at the configured cadence. At high_privilege, it additionally requires at least one observed tier_reduction or emergency_stop response_action event in the assessment window. Log-only drift detection is not sufficient at high_privilege. The runner will FAIL a high_privilege deployment that has drift detection configured but no response routing wired.
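
A minimal sketch of the Check18 logic follows. The event dictionaries (`type`, `action`, `at`) are a hypothetical shape, "emitting at the configured cadence" is interpreted as "no gap in the assessment window longer than the cadence", and the sandbox branch is an assumption because no sandbox requirement is stated above.

```python
from datetime import datetime, timedelta

RESPONSE_ACTIONS = {"tier_reduction", "emergency_stop"}


def cadence_met(times: list[datetime], start: datetime, end: datetime,
                cadence: timedelta) -> bool:
    """True if no gap between consecutive drift decisions exceeds the cadence."""
    points = [start, *sorted(t for t in times if start <= t <= end), end]
    return all(b - a <= cadence for a, b in zip(points, points[1:]))


def check18_automated_ok(tier: str, events: list[dict], start: datetime,
                         end: datetime, cadence: timedelta) -> bool:
    if tier == "sandbox":
        return True  # assumption: no sandbox requirement is stated
    drift_times = [e["at"] for e in events if e["type"] == "drift_decision"]
    if not cadence_met(drift_times, start, end, cadence):
        return False
    if tier == "high_privilege":
        # Log-only detection is not enough: a response action must be observed.
        return any(
            e["type"] == "response_action"
            and e.get("action") in RESPONSE_ACTIONS
            and start <= e["at"] <= end
            for e in events
        )
    return True


start = datetime(2026, 6, 17, 0, 0)
end = start + timedelta(hours=4)
drift_only = [{"type": "drift_decision", "at": start + timedelta(hours=h)} for h in (1, 2, 3)]
with_response = drift_only + [
    {"type": "response_action", "action": "tier_reduction", "at": start + timedelta(hours=2)}
]
```

Here `drift_only` satisfies the bounded branch with a one-hour cadence but not the high_privilege branch; `with_response` satisfies both.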

Running it

pip install pyyaml
git clone https://github.com/deterministic-agents/gate-conformance
cd gate-conformance
# Point it at your evidence store
python -m runner.cli run \
  --config gate-conformance.yaml \
  --output-format yaml \
  --output-file report.yaml

The config file specifies your evidence store connection (BigQuery or sqlite for local testing), tenant ID, environment, and autonomy tier. Thresholds for the v1.3 checks are configurable: remediation TTL for Check16, freshness TTL per content class for Check17, drift decision cadence for Check18.

Exit codes follow the standard pattern: 0 means all must-pass checks passed, 1 means one or more failed or errored, 2 means configuration error. Wire it into CI and you have a gate, not just a report.
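
If your CI system wants a readable message alongside the exit code, a thin wrapper is enough. The sketch below assumes only the three exit codes described above; `describe_exit` and `run_gate` are hypothetical helper names, not part of the runner.

```python
import subprocess
import sys

# Exit codes as described above; anything else is treated as unexpected.
EXIT_MEANINGS = {
    0: "all must-pass checks passed",
    1: "one or more checks failed or errored",
    2: "configuration error",
}


def describe_exit(code: int) -> str:
    """Translate a runner exit code into a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(cmd: list[str]) -> int:
    """Run the given command, report what its exit code means, and return it."""
    code = subprocess.run(cmd).returncode
    print(f"gate-conformance: {describe_exit(code)}", file=sys.stderr)
    return code
```

A CI step could then call `sys.exit(run_gate([sys.executable, "-m", "runner.cli", "run", "--config", "gate-conformance.yaml", "--output-format", "yaml", "--output-file", "report.yaml"]))` so that the pipeline fails on any non-zero code.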

What PARTIAL looks like in practice

A PARTIAL result for Check06 (deterministic replay) looks like this in the YAML output:

- check_id: Check06
  status: PARTIAL
  evidence_refs:
    - "replay_success_rate: 97.3% (above 95% threshold)"
    - "replay_traces present for 143 of 143 high-impact runs"
  notes: "Automated portion passed. Manual verification required."
  manual_steps:
    - step: "Select a production run_id from the last 7 days"
      expected_artifact: "Replay report with matching request_hash and response_hash for each tool step"
      rationale: "The runner can verify replay trace presence and success rate metrics but cannot execute the replay harness against live infrastructure"

The automated portion (trace presence and success rate) is done. What is left is executing the harness and confirming hash matching. The operator knows exactly what to do.
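
Turning results like the one above into an operator checklist takes a few lines. This sketch assumes the report's top level is a list of check results with the fields shown in the example; `manual_checklist` and `load_results` are hypothetical helper names, and PyYAML is imported lazily so the formatting function works on already-parsed data.

```python
def manual_checklist(results: list[dict]) -> list[str]:
    """Flatten the manual_steps of every PARTIAL result into checklist lines."""
    lines = []
    for result in results:
        if result.get("status") != "PARTIAL":
            continue
        for step in result.get("manual_steps", []):
            lines.append(
                f"[ ] {result['check_id']}: {step['step']} -> {step['expected_artifact']}"
            )
    return lines


def load_results(path: str) -> list[dict]:
    """Load report.yaml; assumes the top level is a list of check results."""
    import yaml  # PyYAML, installed in the quickstart

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```

Calling `manual_checklist(load_results("report.yaml"))` would then yield one line per owed artefact, ready to paste into a ticket.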

The self-assessment YAML is still the source of truth

The runner and self_assessment.yaml are designed to be used together. The runner handles the query-based subset. The self-assessment handles the rest, including the manual_steps the runner surfaces. When you file a conformance report, the runner output populates the evidence_refs for the automatable checks. You complete the remainder from the self-assessment test procedures.

Neither replaces the other. A runner output without completed manual steps is an incomplete conformance report. A self-assessment without runner output is more work than necessary for the checks that can be automated.
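
The "incomplete report" rule can be expressed as a simple gap check. This is a sketch under assumptions: `completed` is a hypothetical mapping from check ID to the set of manual steps you have evidenced in your self-assessment, and it does not reflect the actual self_assessment.yaml schema.

```python
def outstanding_steps(results: list[dict], completed: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return {check_id: [manual steps still owed]} for PARTIAL results."""
    owed: dict[str, list[str]] = {}
    for result in results:
        if result.get("status") != "PARTIAL":
            continue
        done = completed.get(result["check_id"], set())
        missing = [
            step["step"]
            for step in result.get("manual_steps", [])
            if step["step"] not in done
        ]
        if missing:
            owed[result["check_id"]] = missing
    return owed
```

An empty result means every manual step surfaced by the runner has a matching entry; anything else lists what still blocks the report.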

Regulatory alignment

The runner was not designed to a specific regulation. Its output does, however, land cleanly on a small set of obligations engineers are increasingly asked to evidence under the EU AI Act and the EU Cyber Resilience Act. This is a starting map, not a legal opinion: every applicable mapping depends on your role under each regulation, the system you operate, your jurisdiction, and how a supervisory authority interprets the evidence in the specific context of your deployment.

EU AI Act (Regulation (EU) 2024/1689). The runner’s runtime evidence is directly relevant to four articles for providers and deployers of high-risk AI systems.

Article 12 (record-keeping): the hash-chained ledger plus Check01, Check05, and Check12 produce the automatic logging Article 12 requires, with lifecycle traceability. The same evidence stream feeds technical documentation under Article 11.
Article 14 (human oversight): Check14’s HITL approval signature verification is the runtime substrate Article 14 demands. The PARTIAL handoff makes the human verification step explicit rather than implicit.
Article 15 (accuracy, robustness, cybersecurity): Checks 02, 03, 08, 09, 10, and 13 together cover the runtime portion of Article 15’s cybersecurity expectations: bypass prevention, attestation, budget enforcement, signing, memory ACLs, and deployed-bundle provenance. Regulatory Stack Part 4 covers the engineering substrate this rests on.
Article 72 (post-market monitoring): for providers of high-risk AI systems under Annex III, Check18’s tier-aware semantics around C19 drift detection map cleanly to Article 72’s continuous post-market monitoring obligation, including the response-action wiring requirement at high_privilege tier. Deployers and general-purpose AI model providers sit under different obligations (Article 26 and Articles 53 to 55 respectively); the runner evidence is supportive but not directly mapped to those regimes.

EU Cyber Resilience Act (Regulation (EU) 2024/2847). Most substantive obligations apply from 11 December 2027; the Article 14 reporting obligations apply earlier, from 11 September 2026. The runner’s evidence supports:

Annex I Part I (essential cybersecurity requirements): Checks 03, 05, 09, 10, 13, and 17 provide running-system evidence for unauthorised-access protection, confidentiality, integrity, and (at high_privilege tier) data minimisation through provenance enforcement.
Annex I Part II (vulnerability handling): the PARTIAL handoffs for Check07 (circuit breaker drill), Check14 (HITL signing key), and Check19 (runbook separation) document the “appropriate procedures” Part II requires.
Article 14 (reporting obligations): runner failures wired into the incident pipeline can serve as the detection trigger for the 24-hour notification clock on actively exploited vulnerabilities.

What the runner does not produce. Neither regulation is a pure runtime exercise. Conformity assessments under AI Act Articles 43-44 and CRA Article 32 are manufacturer-level documentation processes. Technical documentation, risk management system records, quality management system evidence, SBOM, and CE marking artefacts all sit outside the runner’s scope. High-risk classification under AI Act Article 6 is a legal categorisation, not a technical check. The runner produces the runtime layer of evidence these regimes assume exists; it does not replace the broader compliance programme that wraps it.

The source is at github.com/deterministic-agents/gate-conformance. The runner/README.md covers the full quickstart, the per-tier behaviour notes, and how to add a custom check or evidence-store adapter.

For the architecture behind the controls the runner verifies, see “GATE: The Missing Infrastructure Layer for Agentic AI”. For the v1.3 release that introduced the runner and the three tier-aware checks, see “GATE v1.3: New Controls for Shadow Agents, Data Quality, and Model Drift”.

Not legal advice. The regulatory mappings in this post are general engineering guidance for teams implementing GATE-aligned agent governance. They are not legal advice and are not a substitute for advice from qualified counsel. Specific obligations under the EU AI Act and the EU Cyber Resilience Act depend on your role (provider, deployer, importer, distributor, manufacturer), the system you operate, your jurisdiction, and the position of your supervisory authority. Validate every mapping against independent legal advice, your own conformity assessment process, and current supervisory authority guidance before relying on it for a regulatory filing or attestation.

Andrew Stevens is CTO and CISO at Sakura Sky. GATE is published at deterministicagents.ai. The strategic companion to this framework is the Trustworthy Agentic AI Blueprint, co-authored with Sakura Sky.