Accessibility Score by Model
A single score from 0 to 100; higher means a model is less likely to generate accessibility barriers. Clean (error-free) outputs carry the most weight in the score.
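The exact weighting behind the score is not published in this section. Purely as a hypothetical illustration of a composite that weights clean outputs most heavily, it might look something like this:

```ts
// Hypothetical illustration only: the benchmark's actual scoring formula
// is not specified here. This sketch weights the clean output rate most
// heavily, consistent with "clean outputs carry the most weight".
function accessibilityScore(cleanOutputRate: number, errorsPer100Elements: number): number {
  const CLEAN_WEIGHT = 0.7;   // assumed weight, not the benchmark's
  const DENSITY_WEIGHT = 0.3; // assumed weight, not the benchmark's
  // Map error density onto a 0..1 penalty, saturating at 10 errors per 100 elements.
  const densityPenalty = Math.min(errorsPer100Elements / 10, 1);
  return Math.round(100 * (CLEAN_WEIGHT * cleanOutputRate + DENSITY_WEIGHT * (1 - densityPenalty)));
}
```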
Average Accessibility Defects
How many barriers each model produces on average per run.
Clean Output Rate (Zero Detectable Errors)
How often a model produced code with no detectable accessibility errors.
Errors per 100 DOM elements
Error counts normalized by the number of generated DOM elements, so models that generate smaller interfaces can be compared fairly against models that generate larger ones.
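The normalization itself is simple arithmetic. A minimal sketch follows; exactly which nodes count as "DOM elements" is an assumption here, since the benchmark's scoping rules are not given:

```ts
// Minimal sketch of the per-100-elements normalization. How elements are
// counted is an assumption; the benchmark's exact DOM-scoping rules may differ.
function errorsPer100Elements(errorCount: number, domElementCount: number): number {
  if (domElementCount === 0) return 0; // empty page: nothing to normalize against
  return (errorCount / domElementCount) * 100;
}

// e.g. 6 detected errors on a 240-element page -> 2.5 errors per 100 elements
console.log(errorsPer100Elements(6, 240)); // 2.5
```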
How We Measured This
We ran the same 150 distinct UI tasks under three prompt conditions:
- Unguided: accessibility is not mentioned in the task
- Little guidance: the prompt says, "Make the UI accessible"
- Expert guidance: task-specific accessibility requirements written by a human accessibility specialist
Each prompt is repeated across multiple runs, under default reasoning settings, to reduce noise. Generated pages are then scanned with automated WCAG-oriented checks.
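The exact scanner is not named above. As a hedged sketch of what an automated WCAG-oriented scan could look like, here is one possible setup using Playwright and axe-core (an assumed toolchain, not the benchmark's confirmed one); it also returns the DOM element count used by the normalized metric above:

```ts
// Sketch only: Playwright + @axe-core/playwright are assumed tooling,
// not necessarily what this benchmark uses.
import { chromium } from 'playwright';
import AxeBuilder from '@axe-core/playwright';

async function scanGeneratedPage(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Run automated WCAG 2.x A/AA rules against the rendered DOM.
  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa'])
    .analyze();

  // Count individual failing nodes rather than rule categories, so a page
  // that repeats the same barrier many times is counted accordingly.
  const errorCount = results.violations.reduce((sum, v) => sum + v.nodes.length, 0);
  const domElementCount = await page.locator('*').count();

  await browser.close();
  return { errorCount, domElementCount };
}
```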
This benchmark measures recurring technical risk at scale. It does not replace full manual accessibility audits with assistive technology users.
Model-by-Model Evidence
Open each model card to see detailed benchmark metrics for that model.
- Download benchmark whitepaper (PDF)
- Download raw benchmark JSON