Public Evidence

Benchmark Results: Accessibility Performance Across Models

Compare model behavior across guidance levels and inspect where barriers are still generated by default.

Accessibility Score by Model

A single score from 0 to 100; higher means the model is less likely to generate accessibility barriers. Runs with zero detectable defects weigh most heavily in the score.
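The exact scoring formula is not published in this section, so the sketch below is only an illustration of how a 0-to-100 score could combine the clean-output rate with a penalized defect density. The `RunResult` shape, the `cleanWeight` parameter, and the weighting itself are assumptions, not the benchmark's actual method.

```ts
// Hypothetical per-run result shape; field names are assumptions.
interface RunResult {
  defects: number;      // detectable accessibility errors in this run
  domElements: number;  // size of the generated interface
}

// Illustrative 0-100 score: reward clean runs, penalize defect density.
// The real benchmark's weighting is not published here.
function accessibilityScore(runs: RunResult[], cleanWeight = 0.7): number {
  if (runs.length === 0) return 0;

  const cleanRate =
    runs.filter((r) => r.defects === 0).length / runs.length;

  // Average defect density per 100 elements, capped so a single very bad
  // run cannot push the penalty past 1.
  const avgDensity =
    runs.reduce((sum, r) => sum + r.defects / Math.max(r.domElements, 1), 0) /
    runs.length;
  const densityPenalty = Math.min((avgDensity * 100) / 10, 1);

  const score =
    cleanWeight * cleanRate + (1 - cleanWeight) * (1 - densityPenalty);
  return Math.round(score * 100);
}
```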

Average Accessibility Defects

How many barriers each model produces on average per run.

Clean Output Rate (Zero Detectable Errors)

How often a model produces code with no detectable accessibility errors.

Errors per 100 DOM elements

Error counts normalized by DOM size, so models that generate smaller and larger interfaces can be compared on equal footing.
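To show how these three aggregate metrics relate to each other, the sketch below computes average defects per run, clean-output rate, and errors per 100 DOM elements from a list of per-run results. The `RunResult` shape and its field names are assumptions for the example, not the benchmark's actual data model.

```ts
// Hypothetical per-run result; field names are assumptions for illustration.
interface RunResult {
  defects: number;      // detectable accessibility errors in this run
  domElements: number;  // number of elements in the generated page
}

interface ModelMetrics {
  avgDefectsPerRun: number;
  cleanOutputRate: number;       // share of runs with zero detectable errors
  errorsPer100Elements: number;  // defect density, normalized by DOM size
}

function aggregate(runs: RunResult[]): ModelMetrics {
  const totalDefects = runs.reduce((sum, r) => sum + r.defects, 0);
  const totalElements = runs.reduce((sum, r) => sum + r.domElements, 0);
  const cleanRuns = runs.filter((r) => r.defects === 0).length;

  return {
    avgDefectsPerRun: totalDefects / runs.length,
    cleanOutputRate: cleanRuns / runs.length,
    // Normalizing by total DOM size keeps small and large interfaces comparable.
    errorsPer100Elements: (totalDefects / Math.max(totalElements, 1)) * 100,
  };
}
```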

How We Measured This

We ran the same 150 distinct UI tasks under three prompt conditions:

  • Unguided: accessibility is not mentioned in the task
  • Little guidance: the prompt says, "Make the UI accessible"
  • Expert guidance: task-specific accessibility requirements written by a human accessibility specialist

Each prompt is run multiple times under default reasoning settings to reduce noise. Generated pages are scanned using automated WCAG-oriented checks; a minimal harness sketch follows below.
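To make the setup concrete, here is a minimal harness sketch under stated assumptions: the `Task` shape, the `generateUi` model call, the `scanForWcagViolations` checker, and the repeat count are hypothetical placeholders rather than the benchmark's actual implementation (in practice an automated engine such as axe-core would fill the scanning role).

```ts
// Hypothetical harness: names, signatures, and repeat count are assumptions.
type Guidance = "unguided" | "little" | "expert";

interface Task {
  id: string;
  prompt: string;             // the base UI task, with no accessibility mention
  expertRequirements: string; // task-specific requirements from a specialist
}

// Placeholder for a model call that returns generated HTML.
declare function generateUi(prompt: string): Promise<string>;
// Placeholder for an automated WCAG-oriented scan returning a defect count.
declare function scanForWcagViolations(html: string): Promise<number>;

function buildPrompt(task: Task, guidance: Guidance): string {
  switch (guidance) {
    case "unguided":
      return task.prompt;
    case "little":
      return `${task.prompt}\n\nMake the UI accessible.`;
    case "expert":
      return `${task.prompt}\n\n${task.expertRequirements}`;
  }
}

async function runBenchmark(tasks: Task[], repeats = 3) {
  const results: { task: string; guidance: Guidance; defects: number }[] = [];
  for (const task of tasks) {
    for (const guidance of ["unguided", "little", "expert"] as Guidance[]) {
      for (let i = 0; i < repeats; i++) {
        const html = await generateUi(buildPrompt(task, guidance));
        const defects = await scanForWcagViolations(html);
        results.push({ task: task.id, guidance, defects });
      }
    }
  }
  return results;
}
```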

This benchmark measures recurring technical risk at scale. It does not replace full manual accessibility audits with assistive technology users.

Model-by-Model Evidence

Open each model card to see detailed benchmark metrics for that model.