Three Task Families · v0.2 Public Launch

Can AI systems conform to schema definitions?

ConformBench measures how reliably AI systems respect schema field boundaries — refusing when a field doesn't exist, coercing types correctly, satisfying constraints. Bare LLMs hallucinate. Schema-native architectures don't.

274 benchmark tasks · 3 task families · v0.2 dataset version

Core Claim

Bare LLMs and vector RAG systems cannot reliably refuse queries about schema fields that don't exist. Schema-native architectures — systems that ground generation in explicit schema definitions — achieve perfect conformance. ConformBench provides the empirical evidence.

Leaderboard v0.2 · All Families

#	System	Type	Accuracy	Refusal F1	Cost / run	Notes
1	ARAMAI SCR Stub	schema-native oracle	1.000	1.000	$0.00	Perfect oracle: proves SCR architectural claim · macro average across all 3 families
-	GPT-5 (bare)	bare LLM baseline	pending · api-keys	pending · api-keys	-	Post-launch run once API keys are available in CI
-	Claude Opus 4.7 (bare)	bare LLM baseline	pending · api-keys	pending · api-keys	-	Post-launch run once API keys are available in CI
-	Keci KGE	knowledge graph embedding	pending · api-keys	pending · api-keys	-	Post-launch run once API keys are available in CI
-	ComplEx KGE	knowledge graph embedding	pending · api-keys	pending · api-keys	-	Post-launch run once API keys are available in CI

v0.2 results · Dataset seed 0xC0FFEE · 274 tasks across 3 families · Generated 2026-06-02
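The Accuracy and Refusal F1 columns can be illustrated with a small sketch. It assumes that Refusal F1 treats REFUSE as the positive class and that the macro score is the unweighted mean over families. Neither definition is spelled out here, so the function names and conventions below are hypothetical and may differ from the ConformBench methodology.

```python
from collections import Counter

ANSWER, REFUSE = "ANSWER", "REFUSE"

def confusion_matrix(gold, pred):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, pred))

def accuracy(gold, pred):
    """Fraction of tasks where the predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def refusal_f1(gold, pred):
    """F1 with REFUSE as the positive class (assumed convention)."""
    cm = confusion_matrix(gold, pred)
    tp = cm[(REFUSE, REFUSE)]   # correctly refused
    fp = cm[(ANSWER, REFUSE)]   # refused although an answer was expected
    fn = cm[(REFUSE, ANSWER)]   # answered although a refusal was expected
    if tp == 0:
        return 0.0              # no correct refusals: F1 is taken as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_average(per_family_scores):
    """Unweighted mean of per-family scores (assumed macro convention)."""
    return sum(per_family_scores) / len(per_family_scores)

gold = [ANSWER, ANSWER, REFUSE, REFUSE]
pred = [ANSWER, REFUSE, REFUSE, REFUSE]
print(accuracy(gold, pred))    # 0.75
print(refusal_f1(gold, pred))  # approximately 0.8
```

In this toy run the system over-refuses once, so recall on REFUSE is perfect while precision drops to 2/3.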

Task Families

RefusalCorrectness

v0.1 · live

Given a schema and a field query, the system must either return the field value (ANSWER) or refuse gracefully (REFUSE) when the field doesn't exist in the schema. Tests the core schema-native property: grounding responses in explicit definitions.

Tasks: 100

Split: 50 ANSWER / 50 REFUSE

Seed: 0xC0FFEE

Metrics: accuracy, refusal_f1, confusion_matrix
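As a rough illustration of the RefusalCorrectness decision described above, the sketch below answers only when the queried field is declared in the schema. The schema layout, the `answer_or_refuse` helper, and the example fields are all hypothetical and are not taken from the ConformBench task format.

```python
def answer_or_refuse(schema: dict, record: dict, field: str) -> dict:
    """Return the field value only if the schema declares the field."""
    if field not in schema["fields"]:
        # The field is not part of the schema: refuse instead of guessing.
        return {"label": "REFUSE", "value": None}
    return {"label": "ANSWER", "value": record.get(field)}

# Hypothetical schema and record, for illustration only.
schema = {"name": "Customer", "fields": {"id": "integer", "email": "string"}}
record = {"id": 42, "email": "a@example.com"}

print(answer_or_refuse(schema, record, "email"))
print(answer_or_refuse(schema, record, "loyalty_tier"))
```

The second query asks for a field the schema never defines, which is the case a bare LLM is expected to get wrong by inventing a value.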

TypeCoercion

v0.2 · live

Given a schema field with a declared type and a candidate value, the system must either confirm the value is type-valid (ANSWER) or refuse when the type doesn't match (REFUSE). Tests whether systems correctly identify type mismatches for schema fields.

Tasks: 100

Split: 50 ANSWER / 50 REFUSE

Seed: 0xC0FFEE

Metrics: accuracy, refusal_f1, confusion_matrix
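A TypeCoercion check of the kind described above could look like the following sketch. It performs strict type validation rather than any actual coercion, and the type names (`string`, `integer`, `number`, `boolean`) are assumed, JSON-Schema-like labels rather than the benchmark's own vocabulary.

```python
# Hypothetical mapping from declared type names to validity checks.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

def check_type(declared_type: str, value) -> str:
    """Return ANSWER if the value is valid for the declared type, else REFUSE."""
    check = TYPE_CHECKS.get(declared_type)
    if check is None:
        return "REFUSE"  # unknown declared type: nothing to validate against
    return "ANSWER" if check(value) else "REFUSE"

print(check_type("integer", 7))    # ANSWER
print(check_type("integer", "7"))  # REFUSE
```

Treating the string `"7"` as a mismatch for `integer` is a design choice of this sketch; a benchmark task could equally define it as coercible.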

ConstraintSatisfaction

v0.2 · live

Given a schema field with constraints (enum values, min/max bounds, format rules) and a candidate value, the system must identify whether the value satisfies the constraints (ANSWER) or violates them (REFUSE). Tests enum, range, and format constraint enforcement.

Tasks: 74

Split: 31 ANSWER / 43 REFUSE

Seed: 0xC0FFEE

Metrics: accuracy, refusal_f1, confusion_matrix
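The three constraint kinds named above (enum, range, format) can be sketched as a single check. The constraint keys `enum`, `min`, `max`, and `pattern` are assumptions made for this example, and format rules are modeled here as regular expressions, which may not match how ConformBench encodes them.

```python
import re

def satisfies(constraints: dict, value) -> str:
    """Return ANSWER if the value meets every listed constraint, else REFUSE."""
    if "enum" in constraints and value not in constraints["enum"]:
        return "REFUSE"
    # min/max assume the value is comparable with the bound (e.g. both numeric).
    if "min" in constraints and value < constraints["min"]:
        return "REFUSE"
    if "max" in constraints and value > constraints["max"]:
        return "REFUSE"
    # Format rule modeled as a regex that must match the whole value.
    if "pattern" in constraints and not re.fullmatch(constraints["pattern"], str(value)):
        return "REFUSE"
    return "ANSWER"

print(satisfies({"enum": ["red", "green"]}, "blue"))                 # REFUSE
print(satisfies({"min": 0, "max": 120}, 35))                         # ANSWER
print(satisfies({"pattern": r"\d{4}-\d{2}-\d{2}"}, "2026-06-02"))    # ANSWER
```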

CrossSchemaAlignment

v0.3 · planned

Cross-schema field mapping and alignment tasks. Tests whether systems can correctly identify equivalent fields across heterogeneous schemas.

About

Project Ownership

ConformBench is a Schematica project. ARAMAI funds development and provides ongoing maintenance. Code and tasks are released under the MIT license to support reproducibility.

Reproducibility

All v0.2 tasks are deterministically generated from seed 0xC0FFEE. Running conformbench generate --version v0.2 --family all produces the same 274 tasks across all three families on any machine.
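Seeded generation of this kind can be sketched with a private random number generator. The `generate_tasks` function and the task fields below are hypothetical and do not reproduce the real ConformBench generator; the sketch only shows why a fixed seed yields the same tasks on every run, at least within one Python version.

```python
import random

SEED = 0xC0FFEE  # the seed value quoted above

def generate_tasks(n_answer: int, n_refuse: int, seed: int = SEED) -> list:
    """Build a shuffled task list whose content depends only on the seed."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    labels = ["ANSWER"] * n_answer + ["REFUSE"] * n_refuse
    rng.shuffle(labels)
    return [
        {"id": i, "label": label, "field": f"field_{rng.randrange(10_000)}"}
        for i, label in enumerate(labels)
    ]

first = generate_tasks(50, 50)
second = generate_tasks(50, 50)
print(first == second)  # True: same seed, same tasks
```

Using `random.Random(seed)` instead of the module-level functions keeps the output unaffected by any other code that touches the global random state.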

Paper

The paper "ConformBench: Schema Conformance as a Benchmark for Schema-Native Architectures" targets SEMANTiCS 2026, with the benchmark results as its empirical foundation.

Governance

v0.1 and v0.2 decisions rest with ARAMAI / Schematica maintainers. Community governance and open submission procedures are planned for v0.3. See GOVERNANCE.md.
