AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

tl;dr

AutoRubric-T2I learns a compact set of weighted natural-language rubrics from image preference data, enabling interpretable VLM-based reward modeling without fine-tuning—using less than 0.01% of the annotated preference data that standard reward models require.

62.5%: Overall accuracy on MMRB2 (SoTA for 8B models without fine-tuning)

<0.01%: Annotated preference data vs. standard reward models

Only 256 Pairs: Seed preference pairs for rubric learning

The Pipeline

AutoRubric-T2I operates in two phases: seed rule generation from a small set of preference pairs, followed by iterative refinement that discovers and retains the most predictive rubrics through sparse weighting and hard-pair mining.

Phase 0: Seed Rule Generation (obtain R_seed)

1. Diversity-Aware Seed Selection

All Training Pairs: full preference dataset
Composite Scoring: proxy reward × prompt diversity
Top 256 Seed Pairs: high-margin, diverse seeds
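
The seed selection step above can be sketched as follows. The page only states that a composite score combines a proxy reward with prompt diversity, so everything else here is an assumption made for illustration: the greedy selection scheme, the word-overlap (Jaccard) diversity measure, and the names `Pair`, `margin`, and `select_seeds`.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    prompt: str
    margin: float  # hypothetical proxy-reward gap between preferred and rejected image


def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity between two prompts, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_seeds(pairs: list[Pair], k: int = 256) -> list[Pair]:
    """Greedily pick up to k pairs, maximizing margin * novelty at each step."""
    tokens = [set(p.prompt.lower().split()) for p in pairs]
    remaining = list(range(len(pairs)))
    chosen: list[int] = []
    while remaining and len(chosen) < k:
        def composite(i: int) -> float:
            # Novelty is 1 minus the highest similarity to any already chosen prompt.
            sim = max((jaccard(tokens[i], tokens[j]) for j in chosen), default=0.0)
            return pairs[i].margin * (1.0 - sim)

        best = max(remaining, key=composite)
        chosen.append(best)
        remaining.remove(best)
    return [pairs[i] for i in chosen]
```

With this scoring, a duplicate prompt contributes zero novelty once its twin is selected, so a lower-margin pair with a different prompt is picked ahead of it.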

2. T2I-Adapted CoT Rule Generation

256 Seed Pairs: selected preference pairs
VLM Judge: CoT prompting per pair
Deduplicate & Aggregate: merge similar rules
R_seed: initial seed rule pool
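
A minimal sketch of the "Deduplicate & Aggregate" step, assuming rules are plain strings and that near-duplicates can be detected by character-level similarity. The page does not say how similar rules are merged (it may well be done by a language model), so the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function names are illustrative assumptions.

```python
from difflib import SequenceMatcher


def normalize(rule: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(rule.lower().split())


def deduplicate(rules: list[str], threshold: float = 0.8) -> list[tuple[str, int]]:
    """Return (representative rule, support count) for each cluster of near-duplicates."""
    clusters: list[list[str]] = []
    for rule in rules:
        for cluster in clusters:
            ratio = SequenceMatcher(None, normalize(rule), normalize(cluster[0])).ratio()
            if ratio >= threshold:
                cluster.append(rule)
                break
        else:
            # No existing cluster is similar enough, so this rule starts a new one.
            clusters.append([rule])
    # Rules proposed for many pairs come first.
    clusters.sort(key=len, reverse=True)
    return [(c[0], len(c)) for c in clusters]
```

The support count gives a rough signal of how often the judge proposed the same idea across seed pairs, which could be used to order the seed pool.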

Iterative Refinement (for r = 1, 2, …, R)

Score Training Pairs: VLM scores pairs with current rules ℝ
ℓ₁ Regression: fit logistic regression, select top-N rules
Find Hard Pairs: mine failures by difficulty & margin
Propose New Rules: VLM diagnoses hard cases, drafts rules

↻ Form new rule set: ℝ ← ℝ_ret ∪ New Rules

Output: Best Rule Set R_best (compact, weighted natural-language rubrics)
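
One refinement round can be sketched in pure Python as below. This is not the authors' implementation: it assumes each training pair is represented by the per-rubric score difference (preferred minus rejected), fits the weights by proximal gradient descent, and treats any pair with a non-positive weighted margin as hard. The hyperparameters and function names are hypothetical.

```python
import math


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_l1_logistic(diffs, lam=0.05, lr=0.5, steps=500):
    """Minimize mean log(1 + exp(-w.x)) + lam * ||w||_1 by proximal gradient descent.

    Each row of diffs holds rubric scores of the preferred image minus the rejected one.
    """
    n, d = len(diffs), len(diffs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x in diffs:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)); clamp m to avoid overflow.
            coeff = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for k in range(d):
                grad[k] += coeff * x[k] / n
        w = [soft_threshold(wi - lr * gi, lr * lam) for wi, gi in zip(w, grad)]
    return w


def top_n_rules(w, n):
    """Indices of up to n rules with the largest-magnitude nonzero weights."""
    ranked = sorted(range(len(w)), key=lambda k: abs(w[k]), reverse=True)
    return [k for k in ranked[:n] if w[k] != 0.0]


def hard_pairs(diffs, w, margin_threshold=0.0):
    """Indices of pairs the weighted rubrics misrank or barely rank, worst first."""
    margins = [sum(wi * xi for wi, xi in zip(w, x)) for x in diffs]
    hard = [i for i, m in enumerate(margins) if m <= margin_threshold]
    return sorted(hard, key=lambda i: margins[i])
```

The ℓ₁ penalty drives the weights of uninformative rubrics to exactly zero, which is what keeps the retained rule set compact. The hard pairs returned at the end are the cases a VLM would then be asked to diagnose when proposing new rules.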


**Figure 1.** Original pipeline diagram from the paper.

Key Contributions

Sparse Rubric Learning

A framework that learns a compact, weighted set of natural-language rubrics from image preference data.

Failure-Driven Refinement

Rubric selection formulated as $\ell_1$-regularized logistic regression over VLM-scored rubric features, with iterative expansion through hard-pair mining.

Strong Downstream RL

Achieves leading preference prediction on MMRB2 and improves downstream T2I-RL on TIIF and UniGenBench++ using Flow-GRPO.
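
The $\ell_1$-regularized rubric selection named in the contributions above can be written, under one common formulation, as a pairwise logistic loss over rubric score differences. The exact objective used in the paper is not given on this page, so the notation below is an assumption: $s_k(x, y)$ is the VLM score of image $y$ for prompt $x$ under rubric $k$, $y_i^{+}$ and $y_i^{-}$ are the preferred and rejected images of pair $i$, $w$ is the rubric weight vector, and $\lambda$ controls sparsity.

$$
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\Big(1 + \exp\big(-w^{\top} \Delta_i\big)\Big) + \lambda \lVert w \rVert_1,
\qquad
\Delta_{i,k} = s_k(x_i, y_i^{+}) - s_k(x_i, y_i^{-})
$$

Rubrics whose weight is driven to zero by the penalty are dropped, and the remaining weights define the reward as a weighted sum of rubric scores.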

Motivation: Reward Hacking in Scalar Rewards

Standard scalar reward models compress semantic fidelity, object correctness, spatial layout, and perceptual quality into a single number. During RL optimization, the policy can exploit spurious shortcuts—such as adding human subjects or aesthetic flourishes—that inflate the scalar reward without satisfying prompt constraints. AutoRubric-T2I provides decomposed, interpretable feedback that mitigates this failure mode.

**Figure 2.** HPSv3 optimization attains a high scalar reward while violating prompt-specific constraints (e.g., inserting an unnecessary human subject). AutoRubric-T2I favors the rubric-aligned generation, with rubric-level scores revealing which dimensions fail.

Preference Benchmark Results

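We evaluate on the out-of-domain MMRB2 benchmark (covering EvalMuse, OneIG-Bench, R2I-Bench, RealUnify, and WISE) and on the in-domain PickScore and HPSv3 test sets. On MMRB2 overall accuracy, AutoRubric-T2I outperforms both the fine-tuned scalar reward models and the existing rubric-generation baselines with either VLM judge. The in-domain picture is mixed: the fine-tuned HPSv3 model remains strongest on its own HPSv3 test set, and the fine-tuned models lead on PickScore when the judge is Qwen3-VL-8B.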

Model	EvalMuse	OneIG	R2I	RealUnify	WISE	Overall	PickScore	HPSv3
Fine-Tuned Scalar Reward Models (Qwen2.5-VL-7B)
HPSv3	54.0	60.4	68.0	68.9	56.7	59.4	67.3	74.0
UnifiedReward	56.9	62.1	56.8	67.8	56.0	59.8	68.8	65.8
Pointwise VLM Judge: Qwen3-VL-8B
AutoRule (HPSv3)	56.9	60.4	64.1	62.4	55.0	59.1	61.1	62.8
AutoRubric (HPSv3)	54.1	61.9	61.7	59.1	52.3	57.5	57.1	58.1
AutoRubric-T2I (HPSv3)	58.5	67.3	64.1	65.6	60.4	62.5	61.7	63.9
AutoRubric-T2I (PickScore)	60.8	66.2	65.6	61.3	55.9	62.4	63.2	61.5
Pointwise VLM Judge: Gemini-3-Flash
AutoRule (HPSv3)	67.7	68.4	67.2	72.0	62.2	67.6	67.2	62.6
AutoRubric (HPSv3)	61.2	62.8	66.6	74.2	58.7	63.8	65.0	62.2
AutoRubric-T2I (HPSv3)	70.2	71.3	70.8	78.7	64.9	70.8	69.0	70.0
AutoRubric-T2I (PickScore)	70.7	71.9	71.1	79.5	66.0	71.4	70.3	66.8

Table 1 (abridged). Comparison across MMRB2 out-of-domain and in-domain benchmarks. Bold indicates best within each VLM pointwise block.
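
The numbers in Table 1 are preference accuracies. A minimal sketch of how such a number can be computed from pointwise judge scores is shown below. How ties are handled in the paper's evaluation is not stated on this page, so the `tie_credit` parameter is an assumption.

```python
def preference_accuracy(scored_pairs, tie_credit=0.5):
    """Fraction of pairs where the human-preferred image gets the higher score.

    scored_pairs: iterable of (score_preferred, score_rejected).
    tie_credit: credit given to exact ties (an assumption, not from the paper).
    """
    total = 0.0
    count = 0
    for preferred, rejected in scored_pairs:
        count += 1
        if preferred > rejected:
            total += 1.0
        elif preferred == rejected:
            total += tie_credit
    if count == 0:
        raise ValueError("no pairs to evaluate")
    return total / count
```

For example, `preference_accuracy([(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.8, 0.3)])` ranks three of four pairs correctly and returns 0.75.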

Downstream T2I Reinforcement Learning

We apply AutoRubric-T2I as a reward signal for fine-tuning SD-3.5-Medium via Flow-GRPO. The learned rubrics provide more targeted feedback than scalar rewards, with particularly strong gains in reasoning, relational composition, and text generation categories.
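
As a rough illustration of how weighted rubrics could feed a GRPO-style update, the sketch below aggregates per-rubric scores into one reward per image and then standardizes rewards within the group of samples drawn for the same prompt. The rubric names, weights, and the exact normalization are hypothetical; the page does not describe how Flow-GRPO consumes the reward.

```python
from statistics import mean, pstdev


def rubric_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-rubric scores for one generated image."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's sample group (group-relative advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric weights and per-image rubric scores for a single prompt.
weights = {"object_count_correct": 1.0, "text_legible": 0.5}
group_scores = [
    {"object_count_correct": 1.0, "text_legible": 1.0},
    {"object_count_correct": 1.0, "text_legible": 0.0},
    {"object_count_correct": 0.0, "text_legible": 1.0},
]
rewards = [rubric_reward(s, weights) for s in group_scores]
advantages = group_advantages(rewards)
```

Because the per-rubric scores are kept before aggregation, it stays visible which rubric caused a low reward, unlike with a single scalar reward model.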

TIIF Results

Model	Attr.	Rel.	Reason.	Text	Overall
SD-3.5-Medium	78.0	76.6	66.8	45.7	65.3
+ HPSv3	83.5	75.5	65.8	62.0	68.8
+ AutoRule (HPSv3)	85.0	72.4	70.0	55.2	69.1
+ Ours (HPSv3)	84.5	76.2	74.6	65.2	71.6

TIIF (short prompts). Training with AutoRubric-T2I raises the SD-3.5-Medium overall score from 65.3% to 71.6%.

UniGenBench++ Results

Model	Style	Rel.	Compound	Layout	Overall
SD-3.5-Medium	91.7	70.4	62.2	74.6	64.0
+ HPSv3	82.7	69.2	60.1	73.3	62.6
+ AutoRule (HPSv3)	91.4	70.6	66.3	75.6	66.3
+ Ours (HPSv3)	90.7	71.0	66.9	78.3	66.9

UniGenBench++ (long prompts). Strongest gains in layout and compound reasoning.

**Figure 3.** Qualitative comparison of RL outputs. Scalar rewards improve visual appeal but miss fine-grained prompt constraints; AutoRubric-T2I better preserves requested objects, relations, and scene structure.

Citation

If you find this work useful, please cite:

@article{kao2026autorubric,
  title   = {AutoRubric-T2I: Robust Rule-Based Reward Model
             for Text-to-Image Alignment},
  author  = {Kao, Kuei-Chun and Huo, Daixuan and Ban, Yuanhao
             and Hsieh, Cho-Jui},
  journal = {arXiv preprint arXiv:2605.17602},
  year    = {2026}
}