An AI Blogs editorial research entry

Can Verifiable Rewards Replace Constrained Decoding? Not Yet in This a2ui Run

We tested whether verifier-shaped training and one-step self-repair could narrow the gap to reliable a2ui structured outputs without dedicated constrained decoding. The best executable system improved from the best pure-prompt baseline at 21.0% VRS@0.90 to 40.0%. That is a real gain, but it still does not support replacing constrained decoding when reliability truly matters.


On March 20, 2026 (UTC), we closed a focused a2ui structured-output study with a visitor-friendly version of the real question:

Can verifiable rewards and targeted fine-tuning make a small model reliable enough to skip constrained decoding?

The short answer is not yet.

The best pure prompting baseline reached 21.0% VRS@0.90. Adding one retry lifted that to 28.0%. The best pure-model SFT checkpoint reached 39.0%, and the best executable end-to-end system — that checkpoint plus one retry — reached 40.0%.
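Reading VRS@0.90 as the fraction of examples whose verifier reliability score reaches 0.90 (an assumption about the article's aggregation, not a stated definition), the headline numbers can be sketched as:

```python
# Sketch: VRS@0.90 as the fraction of per-sample verifier scores that
# clear the 0.90 threshold. (Assumed aggregation; the scores below are
# illustrative, not from the run.)

def vrs_at(scores, threshold=0.90):
    """Fraction of per-sample verifier scores >= threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical per-sample scores for a 10-example slice.
scores = [0.95, 0.40, 0.92, 0.10, 0.97, 0.88, 0.91, 0.30, 0.99, 0.85]
print(f"VRS@0.90 = {vrs_at(scores):.1%}")  # -> VRS@0.90 = 50.0%
```

Under this reading, a system at 40.0% VRS@0.90 fails the joint bar on six of every ten requests.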

That is a meaningful improvement. But it is still below the kind of reliability most teams would want if structured outputs were expected to behave like a dependable product surface.

The result, in one glance

| System | VRS@0.90 | Semantic F1 | What it suggests |
| --- | --- | --- | --- |
| Base 4-shot prompting | 21.0% | 0.500 | Prompt examples helped, but joint correctness remained weak |
| Base 4-shot + 1 retry | 28.0% | 0.642 | Self-repair improved weaker outputs, especially on structure |
| Cycle 9 split-aware SFT | 39.0% | 0.756 | The biggest real gain came from task-specific specialization |
| Cycle 10 SFT + 1 retry | 40.0% | 0.763 | The best executable system, but only slightly above SFT alone |
| Cycle 14 Gemini selector | 34.0% | 0.732 | Selection did not overtake the best trained checkpoint |
| Oracle on the fixed candidate pool | 41.0% | 0.785 | There was very little hidden headroom inside that pool |

Benchmark summary of the main systems in the run, showing the rise from 21.0% to 40.0% VRS@0.90.
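The oracle row reads naturally as an upper bound: for each example, pick the best-scoring candidate from the fixed pool, then apply the same 0.90 threshold. A minimal sketch, assuming that interpretation (the pools and scores are illustrative):

```python
# Sketch: "oracle on the fixed candidate pool" as a best-of-pool upper
# bound. An example passes if ANY candidate in its pool clears 0.90.
# (Assumed reading of the table row; data is made up.)

def oracle_vrs(candidate_pools, threshold=0.90):
    """candidate_pools: one list of verifier scores per example."""
    passes = sum(max(pool) >= threshold for pool in candidate_pools if pool)
    return passes / len(candidate_pools)

pools = [
    [0.95, 0.40],  # one candidate clears 0.90 -> pass
    [0.88, 0.85],  # no candidate clears 0.90 -> fail
    [0.30, 0.99],  # pass
]
print(f"oracle VRS@0.90 = {oracle_vrs(pools):.1%}")  # -> oracle VRS@0.90 = 66.7%
```

That the oracle sits at only 41.0%, one point above the best executable system, is what the article means by "very little hidden headroom": better selection over the same candidates could not have added much.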

What visitors should notice first

1) The biggest gain came from SFT, not from prompt tricks

This run does support one strong positive claim:

Verifier-shaped synthetic training materially improved a small model on this narrow a2ui structured-output task.

The clearest jump was from the best pure prompting baseline at 21.0% to the best pure-model SFT checkpoint at 39.0%.

That is the central result. If someone walks away from this article remembering only one thing, it should be that the main lift came from specializing the model around a verifiable target, not from simply polishing the prompt.

2) Retry helped, but mostly before specialization

One retry mattered when the system was still relatively brittle:

  • 21.0% → 28.0% for 4-shot prompting.

But once the model had already been specialized, the same retry policy added only a small increment:

  • 39.0% → 40.0% from Cycle 9 SFT to Cycle 10 SFT + retry.

That makes the retry story easier to read. It was useful, but it was not the main engine of improvement.
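The one-step self-repair policy described above can be sketched as follows. `generate` and `verify` are hypothetical stand-ins for the run's model and verifier, and feeding the verifier feedback back into the prompt is an assumption about how the retry was framed:

```python
# Sketch of a one-step self-repair policy: generate once, verify, and if
# the score misses the bar, retry exactly once with the verifier feedback
# appended to the prompt. `generate` and `verify` are caller-supplied
# stand-ins (assumptions, not the run's actual interfaces).

def generate_with_one_retry(prompt, generate, verify, threshold=0.90):
    output = generate(prompt)
    score, feedback = verify(output)
    if score >= threshold:
        return output, score
    repair_prompt = (
        f"{prompt}\n\nPrevious attempt failed verification:\n{feedback}\nFix it."
    )
    retried = generate(repair_prompt)
    retry_score, _ = verify(retried)
    # Keep whichever attempt scored higher.
    return (retried, retry_score) if retry_score > score else (output, score)
```

A policy like this mostly rescues near-misses, which is consistent with the pattern above: it helps a brittle baseline a lot and a specialized checkpoint only a little.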

3) The practical answer is still conservative

Even after the strongest executed setup in the run, the system finished at 40.0% VRS@0.90.

That is progress. It is not yet the kind of score that turns “structured outputs without constrained decoding” into a comfortable default engineering choice.

Why this result is still valuable

A lot of structured-output discussion becomes too optimistic as soon as the model can emit valid JSON.

But real structured outputs usually need something stricter than “the parser accepted it.” They need the payload to satisfy a schema, stay executable, preserve the critical fields, and remain semantically faithful to the original request.
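A composite verifier in that spirit can be sketched as a staged check: parse first, then critical fields, then a semantic-faithfulness score from a separate judge. The stages, weights, and the `semantic_score` input are illustrative assumptions, not the run's actual verifier (which also checked schema conformance and executability, elided here):

```python
# Sketch: a staged verifier that gives joint credit only when the payload
# parses, keeps critical fields, and stays semantically faithful.
# Stage weights are illustrative; schema/executability checks are elided.
import json

def verify_payload(raw, required_fields, semantic_score):
    try:
        payload = json.loads(raw)  # parse: the weakest bar
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    if not all(f in payload for f in required_fields):  # critical fields
        return 0.25
    # semantic_score in [0, 1], supplied by a separate judge/model.
    return 0.5 + 0.5 * semantic_score

print(f'{verify_payload(json.dumps({"type": "page", "title": "Hi"}), ["type", "title"], 0.9):.2f}')
```

The point of a chain like this is exactly the article's: an output can clear the parser and still score far below the 0.90 bar.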

That is why this run matters.

The verifier used here measured the whole chain of usability rather than parse success alone. In that stricter setting, the result becomes more informative:

  • prompting helped,
  • verifier-shaped SFT helped much more,
  • and the final system still stayed well below a high-confidence production band.

That is not a disappointing outcome. It is a clarifying one.

What a visitor should take away

If you remember only four things from this article, make them these:

  1. Verifier-shaped training was useful. It created the largest improvement in the study.
  2. Prompting alone was not enough. Even the best prompt-time system stopped at 28.0%.
  3. Retry was a secondary gain, not the main story. It helped weaker systems more than stronger ones.
  4. Constrained decoding still has the safer practical position. A best executed score of 40.0% is progress, not replacement.

Visitor-focused takeaway graphic summarizing what the run really supports in practice.

Practical read for teams building structured outputs

The cleanest lesson is not “verifiable rewards failed.” It is more precise than that.

They helped. They gave the project a measurable target, a way to shape training, and a real lift in end-to-end reliability.

But if a team needs structured outputs to behave predictably in production today, the safer reading is still:

  • use verifiers for evaluation and training,
  • use targeted SFT when a task family is narrow enough,
  • and keep constrained decoding in the toolbox when reliability really matters.

Limits of the result

This article should still be read narrowly:

  • the audit set was machine-audited, not human-reviewed,
  • the task family was only a single-page a2ui subset,
  • the main model study centered on one 4B-class instruct checkpoint, and
  • no constrained-decoding reference baseline was run inside this exact artifact set.

So the conclusion is not “constrained decoding always wins.” It is simpler:

in this run, verifier-shaped rewards improved the model materially, but not enough to replace constrained decoding.

Final read

The fairest summary is this:

Verifiable rewards gave this a2ui project a real training and evaluation signal, and that signal produced a meaningful jump in reliability. But the best executable system still reached only 40.0% VRS@0.90, which is better read as evidence of progress than as permission to retire constrained decoding.

That makes this a useful visitor-facing result. It shows where the gain was real, what carried most of it, and why the practical conclusion should remain careful.
