Three Task Families · v0.2 Public Launch

Can AI systems conform to schema definitions?

ConformBench measures how reliably AI systems respect schema field boundaries — refusing when a field doesn't exist, coercing types correctly, satisfying constraints. Bare LLMs hallucinate. Schema-native architectures don't.

274 benchmark tasks · 3 task families · v0.2 dataset version

Core Claim

Bare LLMs and vector RAG systems cannot reliably refuse queries about schema fields that don't exist. Schema-native architectures — systems that ground generation in explicit schema definitions — achieve perfect conformance. ConformBench provides the empirical evidence.

Leaderboard v0.2 · All Families
| # | System | Accuracy | Refusal F1 | Cost / run | Notes |
|---|--------|----------|------------|------------|-------|
| 1 | ARAMAI SCR Stub, schema-native oracle | 1.000 | 1.000 | $0.00 | Perfect oracle: proves SCR architectural claim · macro across all 3 families |
|   | GPT-5 (bare), bare LLM baseline | pending · api-keys | pending · api-keys |   | Post-launch run once API keys available in CI |
|   | Claude Opus 4.7 (bare), bare LLM baseline | pending · api-keys | pending · api-keys |   | Post-launch run once API keys available in CI |
|   | Keci KGE, knowledge graph embedding | pending · api-keys | pending · api-keys |   | Post-launch run once API keys available in CI |
|   | ComplEx KGE, knowledge graph embedding | pending · api-keys | pending · api-keys |   | Post-launch run once API keys available in CI |

v0.2 results · Dataset seed 0xC0FFEE · 274 tasks across 3 families · Generated 2026-06-02
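
The page does not define how the leaderboard columns are computed, so the following is only a sketch of one plausible reading. It assumes that Refusal F1 treats REFUSE as the positive class and that "macro" means an unweighted mean of per-family scores. The function names are hypothetical and are not taken from the ConformBench codebase.

```python
from collections import Counter

ANSWER, REFUSE = "ANSWER", "REFUSE"

def confusion_matrix(gold, pred):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, pred))

def accuracy(gold, pred):
    """Fraction of tasks where the predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def refusal_f1(gold, pred):
    """F1 with REFUSE as the positive class (an assumption, see above)."""
    cm = confusion_matrix(gold, pred)
    tp = cm[(REFUSE, REFUSE)]   # correctly refused
    fp = cm[(ANSWER, REFUSE)]   # refused although an answer was expected
    fn = cm[(REFUSE, ANSWER)]   # answered although a refusal was expected
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_average(per_family_scores):
    """Unweighted mean over families, regardless of family size."""
    return sum(per_family_scores.values()) / len(per_family_scores)
```

Under this reading, a system that refuses everything would reach full refusal recall but low precision, which is why accuracy and the confusion matrix are reported alongside F1.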

Task Families
RefusalCorrectness
v0.1 · live
Given a schema and a field query, the system must either return the field value (ANSWER) or refuse gracefully (REFUSE) when the field doesn't exist in the schema. Tests the core schema-native property: grounding responses in explicit definitions.
Tasks: 100
Split: 50 ANSWER / 50 REFUSE
Seed: 0xC0FFEE
Metrics: accuracy, refusal_f1, confusion_matrix
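
As a rough illustration of the ANSWER / REFUSE decision described above, a schema-grounded responder might look like the sketch below. The task format is not shown on this page, so the `Response` type, the argument names, and the use of a plain set of field names as the schema are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    label: str            # "ANSWER" or "REFUSE"
    value: object = None  # populated only for ANSWER

def answer_field_query(schema_fields, record, field_name):
    """Answer only when the queried field is declared in the schema."""
    if field_name not in schema_fields:
        # The field is not part of the schema, so refuse instead of guessing.
        return Response("REFUSE")
    return Response("ANSWER", record.get(field_name))

# Hypothetical usage
schema_fields = {"id", "email"}
record = {"id": 7, "email": "a@example.com"}
in_schema = answer_field_query(schema_fields, record, "email")
not_in_schema = answer_field_query(schema_fields, record, "phone")
```

The key design point is that the refusal decision depends only on the schema, never on whether a plausible value could be produced.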
TypeCoercion
v0.2 · live
Given a schema field with a declared type and a candidate value, the system must either confirm the value is type-valid (ANSWER) or refuse when the type doesn't match (REFUSE). Tests whether systems correctly identify type mismatches for schema fields.
Tasks: 100
Split: 50 ANSWER / 50 REFUSE
Seed: 0xC0FFEE
Metrics: accuracy, refusal_f1, confusion_matrix
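
A type-validity check of this kind could be sketched as follows. The type names (`string`, `integer`, `number`, `boolean`) are borrowed from JSON-Schema-style vocabularies as an assumption; the page does not list the types ConformBench actually uses.

```python
# Hypothetical mapping from declared type names to validity checks.
# bool is a subclass of int in Python, so it is excluded explicitly
# from the numeric checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

def check_type(declared_type, value):
    """Return ANSWER when the value matches the declared type, else REFUSE."""
    check = TYPE_CHECKS.get(declared_type)
    if check is None or not check(value):
        # Unknown declared types are refused rather than guessed at.
        return "REFUSE"
    return "ANSWER"
```

Note that this sketch validates values rather than converting them; the string "3" is refused for an integer field instead of being coerced to 3.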
ConstraintSatisfaction
v0.2 · live
Given a schema field with constraints (enum values, min/max bounds, format rules) and a candidate value, the system must identify whether the value satisfies the constraints (ANSWER) or violates them (REFUSE). Tests enum, range, and format constraint enforcement.
Tasks: 74
Split: 31 ANSWER / 43 REFUSE
Seed: 0xC0FFEE
Metrics: accuracy, refusal_f1, confusion_matrix
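
The three constraint kinds named above (enum, range, format) can be illustrated with a small checker. The constraint keys `enum`, `min`, `max`, and `pattern` are hypothetical, and modeling format rules as regular expressions is an assumption of this sketch rather than something the page specifies.

```python
import re

def satisfies(constraints, value):
    """Check a value against enum, min/max, and pattern constraints."""
    if "enum" in constraints and value not in constraints["enum"]:
        return False
    if "min" in constraints and value < constraints["min"]:
        return False
    if "max" in constraints and value > constraints["max"]:
        return False
    if "pattern" in constraints:
        # Format rules are modeled as full-string regex matches here.
        if not isinstance(value, str):
            return False
        if re.fullmatch(constraints["pattern"], value) is None:
            return False
    return True

def check_constraints(constraints, value):
    """Return ANSWER when every constraint holds, else REFUSE."""
    return "ANSWER" if satisfies(constraints, value) else "REFUSE"
```

For example, a field constrained to `{"min": 0, "max": 120}` accepts 42 and refuses 130 under this sketch.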
CrossSchemaAlignment
v0.3 · planned
Cross-schema field mapping and alignment tasks. Tests whether systems can correctly identify equivalent fields across heterogeneous schemas.
About

Project Ownership

ConformBench is a Schematica project. ARAMAI funds development and provides ongoing maintenance. Code and tasks are released under the MIT license to support reproducibility.

Reproducibility

All v0.2 tasks are deterministically generated from seed 0xC0FFEE. Running conformbench generate --version v0.2 --family all produces the same 274 tasks across all three families on any machine.
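
To illustrate the general idea of seed-based determinism, the sketch below draws all randomness from a private generator seeded with 0xC0FFEE. It is not the ConformBench generator: `generate_tasks` and the task dictionary layout are hypothetical, and the exact sequence produced by Python's `random` module may differ between Python versions.

```python
import random

SEED = 0xC0FFEE

def generate_tasks(n_answer, n_refuse, seed=SEED):
    """Build a shuffled list of task stubs from a fixed seed."""
    # A private Random instance avoids any dependence on global RNG state.
    rng = random.Random(seed)
    labels = ["ANSWER"] * n_answer + ["REFUSE"] * n_refuse
    rng.shuffle(labels)
    return [
        {"id": f"task-{i:03d}", "expected": label}
        for i, label in enumerate(labels)
    ]

# Two independent calls with the same seed yield identical task lists.
first = generate_tasks(50, 50)
second = generate_tasks(50, 50)
```

Because no global state or wall-clock input is involved, repeated runs in the same environment produce the same tasks in the same order.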

Paper

The paper "ConformBench: Schema Conformance as a Benchmark for Schema-Native Architectures" is targeted at SEMANTiCS 2026, with ConformBench as its empirical foundation.

Governance

v0.1 and v0.2 decisions rest with ARAMAI / Schematica maintainers. Community governance and open submission procedures are planned for v0.3. See GOVERNANCE.md.