Public Evidence

Benchmark Results: Accessibility Performance Across Models

Compare model behavior across guidance levels and inspect where barriers are still generated by default.

Accessibility Score by Model

A single score from 0 to 100; higher means the model is less likely to generate accessibility barriers. Runs with zero detectable defects weigh most heavily in the score.
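The exact scoring formula is not published in this section, so the sketch below is only an illustration of how a 0-to-100 score could combine the clean-output rate with a penalized defect density. The `RunResult` shape, the `cleanWeight` parameter, and the weighting itself are assumptions, not the benchmark's actual method.

```ts
// Hypothetical per-run result shape; field names are assumptions.
interface RunResult {
  defects: number;      // detectable accessibility errors in this run
  domElements: number;  // size of the generated interface
}

// Illustrative 0-100 score: reward clean runs, penalize defect density.
// The real benchmark's weighting is not published here.
function accessibilityScore(runs: RunResult[], cleanWeight = 0.7): number {
  if (runs.length === 0) return 0;

  const cleanRate =
    runs.filter((r) => r.defects === 0).length / runs.length;

  // Average defect density per 100 elements, capped so a single very bad
  // run cannot push the penalty past 1.
  const avgDensity =
    runs.reduce((sum, r) => sum + r.defects / Math.max(r.domElements, 1), 0) /
    runs.length;
  const densityPenalty = Math.min((avgDensity * 100) / 10, 1);

  const score =
    cleanWeight * cleanRate + (1 - cleanWeight) * (1 - densityPenalty);
  return Math.round(score * 100);
}
```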

Average Accessibility Defects

How many barriers each model produces on average per run.

Clean Output Rate (Zero Detectable Errors)

How often a model produces code with no detectable accessibility errors.

Errors per 100 DOM elements

Error counts normalized by DOM size, so models that generate smaller and larger interfaces can be compared on equal footing.
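To show how these three aggregate metrics relate to each other, the sketch below computes average defects per run, clean-output rate, and errors per 100 DOM elements from a list of per-run results. The `RunResult` shape and its field names are assumptions for the example, not the benchmark's actual data model.

```ts
// Hypothetical per-run result; field names are assumptions for illustration.
interface RunResult {
  defects: number;      // detectable accessibility errors in this run
  domElements: number;  // number of elements in the generated page
}

interface ModelMetrics {
  avgDefectsPerRun: number;
  cleanOutputRate: number;       // share of runs with zero detectable errors
  errorsPer100Elements: number;  // defect density, normalized by DOM size
}

function aggregate(runs: RunResult[]): ModelMetrics {
  const totalDefects = runs.reduce((sum, r) => sum + r.defects, 0);
  const totalElements = runs.reduce((sum, r) => sum + r.domElements, 0);
  const cleanRuns = runs.filter((r) => r.defects === 0).length;

  return {
    avgDefectsPerRun: totalDefects / runs.length,
    cleanOutputRate: cleanRuns / runs.length,
    // Normalizing by total DOM size keeps small and large interfaces comparable.
    errorsPer100Elements: (totalDefects / Math.max(totalElements, 1)) * 100,
  };
}
```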

How We Measured This

We ran the same 150 distinct UI tasks under three prompt conditions:

  • Unguided: accessibility is not mentioned in the task
  • Little guidance: the prompt says, "Make the UI accessible"
  • Expert guidance: task-specific accessibility requirements written by a human accessibility specialist

Each prompt is run multiple times under default reasoning settings to reduce noise. Generated pages are scanned using automated WCAG-oriented checks; a minimal harness sketch follows below.
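To make the setup concrete, here is a minimal harness sketch under stated assumptions: the `Task` shape, the `generateUi` model call, the `scanForWcagViolations` checker, and the repeat count are hypothetical placeholders rather than the benchmark's actual implementation (in practice an automated engine such as axe-core would fill the scanning role).

```ts
// Hypothetical harness: names, signatures, and repeat count are assumptions.
type Guidance = "unguided" | "little" | "expert";

interface Task {
  id: string;
  prompt: string;             // the base UI task, with no accessibility mention
  expertRequirements: string; // task-specific requirements from a specialist
}

// Placeholder for a model call that returns generated HTML.
declare function generateUi(prompt: string): Promise<string>;
// Placeholder for an automated WCAG-oriented scan returning a defect count.
declare function scanForWcagViolations(html: string): Promise<number>;

function buildPrompt(task: Task, guidance: Guidance): string {
  switch (guidance) {
    case "unguided":
      return task.prompt;
    case "little":
      return `${task.prompt}\n\nMake the UI accessible.`;
    case "expert":
      return `${task.prompt}\n\n${task.expertRequirements}`;
  }
}

async function runBenchmark(tasks: Task[], repeats = 3) {
  const results: { task: string; guidance: Guidance; defects: number }[] = [];
  for (const task of tasks) {
    for (const guidance of ["unguided", "little", "expert"] as Guidance[]) {
      for (let i = 0; i < repeats; i++) {
        const html = await generateUi(buildPrompt(task, guidance));
        const defects = await scanForWcagViolations(html);
        results.push({ task: task.id, guidance, defects });
      }
    }
  }
  return results;
}
```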

This benchmark measures recurring technical risk at scale. It does not replace full manual accessibility audits with assistive technology users.

Model-by-Model Evidence

Open each model card to see detailed benchmark metrics for that model.