arXiv Preprint · 2026

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Kuei-Chun Kao¹, Daixuan Huo¹, Yuanhao Ban¹,², Cho-Jui Hsieh¹,²
¹University of California, Los Angeles    ²Arena
tl;dr

AutoRubric-T2I learns a compact set of weighted natural-language rubrics from image preference data, enabling interpretable VLM-based reward modeling without fine-tuning—using less than 0.01% of the annotated preference data that standard reward models require.

62.5%: overall accuracy on MMRB2 (SoTA for 8B models without fine-tuning)
<0.01%: annotated preference data vs. standard reward models
Only 256 pairs: seed preference pairs for rubric learning

The Pipeline

AutoRubric-T2I operates in two phases: seed rule generation from a small set of preference pairs, followed by iterative refinement that discovers and retains the most predictive rubrics through sparse weighting and hard-pair mining.

Phase 0: Seed Rule Generation (obtain R_seed)

1. Diversity-Aware Seed Selection
   All Training Pairs (full preference dataset)
   → Composite Scoring (proxy reward × prompt diversity)
   → Top 256 Seed Pairs (high-margin, diverse seeds)
2. T2I-Adapted CoT Rule Generation
   256 Seed Pairs (selected preference pairs)
   → VLM Judge (CoT prompting per pair)
   → Deduplicate & Aggregate (merge similar rules)
   → R_seed (initial seed rule pool)
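
The seed-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the composite score is taken to be the proxy-reward margin multiplied by a diversity term, and diversity is approximated as one minus the highest token-level Jaccard similarity to any prompt already selected. The function names, the `margin` field, and the greedy strategy are all hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity between two prompts."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_seeds(pairs, k=256):
    """Greedily pick up to k pairs maximizing margin * diversity.

    pairs: list of dicts with keys "prompt" (str) and "margin" (float, the
    proxy-reward gap between preferred and rejected image). Both keys are
    assumptions made for this sketch.
    """
    tokens = [set(p["prompt"].lower().split()) for p in pairs]
    selected = []                      # indices of chosen pairs, in pick order
    remaining = set(range(len(pairs)))
    while remaining and len(selected) < k:
        best_i, best_score = None, float("-inf")
        for i in sorted(remaining):
            # Diversity: 1 - max similarity to any already selected prompt.
            sim = max((jaccard(tokens[i], tokens[j]) for j in selected),
                      default=0.0)
            score = pairs[i]["margin"] * (1.0 - sim)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return [pairs[i] for i in selected]


pairs = [
    {"prompt": "a red cube on a blue sphere", "margin": 0.9},
    {"prompt": "a red cube on a blue sphere", "margin": 0.8},  # duplicate prompt
    {"prompt": "three cats reading a newspaper", "margin": 0.5},
]
seeds = select_seeds(pairs, k=2)
print([s["prompt"] for s in seeds])
```

In this toy run the duplicate prompt is skipped despite its higher margin, because its diversity term drops to zero once the first copy is selected.
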

Iterative Refinement (for r = 1, 2, …, R)

a. Score Training Pairs: VLM scores pairs with current rules ℝ
b. ℓ₁ Regression: fit logistic regression, select top-N rules
c. Find Hard Pairs: mine failures by difficulty & margin
d. Propose New Rules: VLM diagnoses hard cases, drafts rules

Form new rule set: ℝ ← ℝ_ret ∪ New Rules

Best Rule Set R_best: compact, weighted natural-language rubrics
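
Step (c) of the loop can be illustrated with a short sketch. It assumes each pair is represented by per-rubric score differences (preferred minus rejected) and that a pair is "hard" when its weighted margin falls below a threshold. The threshold, the function names, and the sort order are hypothetical choices for illustration, not details taken from the paper.

```python
def pair_margin(weights, diff):
    """Weighted rubric-score gap: positive means the preferred image wins."""
    return sum(w * d for w, d in zip(weights, diff))


def mine_hard_pairs(weights, diffs, max_pairs=2, margin_threshold=0.1):
    """Return indices of the pairs the current rubric set handles worst.

    diffs[i][k] = score of rubric k on the preferred image minus its score
    on the rejected image (an assumed feature layout). A pair counts as hard
    when its margin is below margin_threshold; the lowest margins come first.
    """
    margins = [(pair_margin(weights, d), i) for i, d in enumerate(diffs)]
    hard = sorted((m, i) for m, i in margins if m < margin_threshold)
    return [i for _, i in hard[:max_pairs]]


weights = [1.0, 0.5, 0.0]      # third rubric was zeroed out by the l1 penalty
diffs = [
    [1.0, 1.0, 0.0],           # margin  1.5: ranked correctly
    [-1.0, 0.0, 1.0],          # margin -1.0: misranked
    [0.0, 0.0, 1.0],           # margin  0.0: undecided
    [0.5, -1.0, 0.0],          # margin  0.0: undecided
]
hard = mine_hard_pairs(weights, diffs)
print(hard)
```

Pair 1 is misranked because the only rubric favoring the preferred image carries zero weight. Cases like this are what step (d) would pass to the VLM when drafting new rules.
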
Figure 1. Original pipeline diagram from the paper.

Key Contributions

1. Sparse Rubric Learning

A framework that learns a compact, weighted set of natural-language rubrics from image preference data.

2. Failure-Driven Refinement

Rubric selection formulated as ℓ₁-regularized logistic regression over VLM-scored rubric features, with iterative expansion through hard-pair mining.

3. Strong Downstream RL

Achieves leading preference prediction on MMRB2 and improves downstream T2I-RL on TIIF and UniGenBench++ using Flow-GRPO.
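
To make the ℓ₁-regularized selection concrete, here is a minimal pure-Python sketch. It assumes a pairwise logistic loss over rubric score differences (preferred minus rejected) with no intercept, fitted by proximal gradient descent. The loss form, the hyperparameters, and all names are assumptions for illustration; the paper's exact formulation and solver may differ.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_l1_logistic(diffs, lam=0.15, lr=0.5, steps=500):
    """Minimize mean(log(1 + exp(-w.d))) + lam * ||w||_1 by proximal gradient.

    diffs[i][k] = rubric k's score on the preferred image minus its score on
    the rejected image (an assumed feature layout, no intercept).
    """
    n, k = len(diffs), len(diffs[0])
    w = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for d in diffs:
            z = sum(wj * dj for wj, dj in zip(w, d))
            g = -1.0 / (1.0 + math.exp(z))  # derivative of log(1 + exp(-z))
            for j in range(k):
                grad[j] += g * d[j] / n
        # Gradient step on the smooth loss, then shrinkage for the l1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data: rubric 0 always agrees with the human label, rubric 1 agrees on
# 3 of 5 pairs, rubric 2 never distinguishes the two images.
diffs = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [1.0, -1.0, 0.0],
]
weights = fit_l1_logistic(diffs)
# Keep rubrics with nonzero weight, ordered by absolute weight.
retained = [j for j in sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
            if weights[j] != 0.0]
print(weights, retained)
```

On this toy data only rubric 0 keeps a nonzero weight: the penalty drives the weakly informative and the uninformative rubric to exactly zero, which is the sparsity effect the selection step relies on. In practice a library solver (for example scikit-learn's `LogisticRegression` with `penalty="l1"`) would be the more usual choice.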

Motivation: Reward Hacking in Scalar Rewards

Standard scalar reward models compress semantic fidelity, object correctness, spatial layout, and perceptual quality into a single number. During RL optimization, the policy can exploit spurious shortcuts—such as adding human subjects or aesthetic flourishes—that inflate the scalar reward without satisfying prompt constraints. AutoRubric-T2I provides decomposed, interpretable feedback that mitigates this failure mode.

Figure 2. HPSv3 optimization attains a high scalar reward while violating prompt-specific constraints (e.g., inserting an unnecessary human subject). AutoRubric-T2I favors the rubric-aligned generation, with rubric-level scores revealing which dimensions fail.

Preference Benchmark Results

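We evaluate on the out-of-domain MMRB2 benchmark (covering EvalMuse, OneIG-Bench, R2I-Bench, RealUnify, and WISE) and on the in-domain PickScore and HPSv3 test sets. On overall MMRB2 accuracy, AutoRubric-T2I outperforms both the fine-tuned scalar reward models and the existing rubric-generation baselines under both VLM judges (Qwen3-VL-8B and Gemini-3-Flash). In-domain results are mixed: with the Qwen3-VL-8B judge the fine-tuned scalar models remain ahead on PickScore, and HPSv3 keeps the best score on its own test set (74.0).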

Model                       EvalMuse  OneIG  R2I   RealUnify  WISE  Overall  PickScore  HPSv3

Fine-Tuned Scalar Reward Models (Qwen2.5-VL-7B)
HPSv3                       54.0      60.4   68.0  68.9       56.7  59.4     67.3       74.0
UnifiedReward               56.9      62.1   56.8  67.8       56.0  59.8     68.8       65.8

Pointwise VLM Judge: Qwen3-VL-8B
AutoRule (HPSv3)            56.9      60.4   64.1  62.4       55.0  59.1     61.1       62.8
AutoRubric (HPSv3)          54.1      61.9   61.7  59.1       52.3  57.5     57.1       58.1
AutoRubric-T2I (HPSv3)      58.5      67.3   64.1  65.6       60.4  62.5     61.7       63.9
AutoRubric-T2I (PickScore)  60.8      66.2   65.6  61.3       55.9  62.4     63.2       61.5

Pointwise VLM Judge: Gemini-3-Flash
AutoRule (HPSv3)            67.7      68.4   67.2  72.0       62.2  67.6     67.2       62.6
AutoRubric (HPSv3)          61.2      62.8   66.6  74.2       58.7  63.8     65.0       62.2
AutoRubric-T2I (HPSv3)      70.2      71.3   70.8  78.7       64.9  70.8     69.0       70.0
AutoRubric-T2I (PickScore)  70.7      71.9   71.1  79.5       66.0  71.4     70.3       66.8

Table 1 (abridged). Comparison across MMRB2 out-of-domain and in-domain benchmarks. Bold indicates best within each VLM pointwise block.
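
For readers unfamiliar with how a pointwise judge yields a pairwise accuracy, the sketch below shows one plausible computation. It assumes the reward is a weighted sum of per-rubric VLM scores and that ties receive half credit; the tie rule and all names are assumptions, since this page does not state how ties are handled.

```python
def rubric_reward(weights, scores):
    """Aggregate per-rubric scores into one reward via a weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))


def preference_accuracy(weights, pairs):
    """Fraction of pairs where the human-preferred image scores higher.

    pairs: list of (preferred_scores, rejected_scores), each a list with one
    VLM score per rubric. Ties get half credit (an assumption).
    """
    correct = 0.0
    for preferred, rejected in pairs:
        gap = rubric_reward(weights, preferred) - rubric_reward(weights, rejected)
        if gap > 0:
            correct += 1.0
        elif gap == 0:
            correct += 0.5
    return correct / len(pairs)


weights = [2.0, 1.0]
pairs = [
    ([1, 0], [0, 1]),   # 2.0 vs 1.0: correct
    ([0, 1], [1, 0]),   # 1.0 vs 2.0: wrong
    ([1, 1], [1, 1]),   # 3.0 vs 3.0: tie
    ([1, 1], [0, 0]),   # 3.0 vs 0.0: correct
    ([1, 0], [0, 0]),   # 2.0 vs 0.0: correct
]
acc = preference_accuracy(weights, pairs)
print(f"{acc:.2f}")
```

The toy scores above are made up solely to exercise the three branches (correct, wrong, tie) and are unrelated to the benchmark numbers in Table 1.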

Downstream T2I Reinforcement Learning

We apply AutoRubric-T2I as a reward signal for fine-tuning SD-3.5-Medium via Flow-GRPO. The learned rubrics provide more targeted feedback than scalar rewards, with particularly strong gains in reasoning, relational composition, and text generation categories.
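
As a rough illustration of how a rubric-based reward can feed a GRPO-style update, the sketch below turns per-rubric scores for a group of images sampled from one prompt into group-relative advantages. It assumes a weighted-sum reward and population standard deviation with a small epsilon. The names are hypothetical, and the sketch does not reproduce Flow-GRPO's actual training code.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Rows = images sampled for one prompt, columns = rubric scores (toy values).
weights = [1.0, 0.5]
group_scores = [[1, 1], [1, 0], [0, 1], [0, 0]]

# Weighted-sum reward per image, then normalization within the group.
rewards = [sum(w * s for w, s in zip(weights, row)) for row in group_scores]
adv = group_advantages(rewards)
print(rewards)
print([round(a, 3) for a in adv])
```

Because the advantage depends only on the ordering and spread within the group, an image that satisfies more of the heavily weighted rubrics is pushed up relative to its siblings, and the per-rubric scores remain available for inspecting why.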

TIIF Results

Model                 Attr.  Rel.  Reason.  Text  Overall
SD-3.5-Medium         78.0   76.6  66.8     45.7  65.3
+ HPSv3               83.5   75.5  65.8     62.0  68.8
+ AutoRule (HPSv3)    85.0   72.4  70.0     55.2  69.1
+ Ours (HPSv3)        84.5   76.2  74.6     65.2  71.6

TIIF (short prompts). AutoRubric-T2I improves from 65.3% to 71.6% overall.

UniGenBench++ Results

Model                 Style  Rel.  Compound  Layout  Overall
SD-3.5-Medium         91.7   70.4  62.2      74.6    64.0
+ HPSv3               82.7   69.2  60.1      73.3    62.6
+ AutoRule (HPSv3)    91.4   70.6  66.3      75.6    66.3
+ Ours (HPSv3)        90.7   71.0  66.9      78.3    66.9

UniGenBench++ (long prompts). Strongest gains in layout and compound reasoning.

Figure 3. Qualitative comparison of RL outputs. Scalar rewards improve visual appeal but miss fine-grained prompt constraints; AutoRubric-T2I better preserves requested objects, relations, and scene structure.

Citation

If you find this work useful, please cite:

@article{kao2026autorubric,
  title   = {AutoRubric-T2I: Robust Rule-Based Reward Model
             for Text-to-Image Alignment},
  author  = {Kao, Kuei-Chun and Huo, Daixuan and Ban, Yuanhao
             and Hsieh, Cho-Jui},
  journal = {arXiv preprint arXiv:2605.17602},
  year    = {2026}
}