Benchmarking AI against real human work
OpenAI is quietly collecting actual deliverables from contractors' past jobs (Word docs, spreadsheets, PDFs) to create realistic benchmarks for how well its upcoming AI agents can handle complex, real office tasks. This isn't about synthetic datasets; it's about grounding evaluation in real professional work.
What's unusual about this project
- Contractors are being asked to upload actual work artifacts, not summaries, with personally identifiable information redacted, a shift away from simulated or artificial evaluation sets.
- The goal is to compare AI agent performance to human baseline outputs across a spectrum of real tasks, from analysis to creation; a minimal sketch of what such a comparison might look like follows this list.
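
To make the comparison concrete, here is a minimal sketch of scoring an agent's output against a human baseline deliverable, one task at a time. Everything in it (the `Task` structure, the `grade` function, the token-overlap scoring) is an illustrative assumption, not OpenAI's actual methodology.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One real-world task paired with a human-produced reference deliverable."""
    name: str
    prompt: str
    human_deliverable: str  # text extracted from the original artifact


def grade(candidate: str, reference: str) -> float:
    """Toy similarity score in [0, 1] based on token overlap.

    A real benchmark would use expert rubrics or model-based grading,
    not word overlap; this only shows the shape of the comparison.
    """
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / max(len(ref), 1)


def run_benchmark(tasks: list[Task], agent: Callable[[str], str]) -> dict[str, float]:
    """Score an agent's attempt against the human baseline for each task."""
    return {t.name: grade(agent(t.prompt), t.human_deliverable) for t in tasks}


if __name__ == "__main__":
    tasks = [
        Task(
            name="quarterly-summary",
            prompt="Summarize Q3 revenue drivers in one paragraph.",
            human_deliverable="Revenue grew on subscription renewals and new enterprise deals.",
        )
    ]
    # Stand-in agent for the example; a real run would call an AI agent here.
    echo_agent = lambda prompt: "Revenue grew on enterprise deals and renewals."
    print(run_benchmark(tasks, echo_agent))
```

The point of the sketch is the structure, not the scorer: one human reference per task, one agent attempt, one score per task that can be aggregated into a human-vs-agent comparison.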
Why this matters for developers and businesses
Benchmarking against real work could yield harder performance targets and clearer signals about where AI still lags behind human professionals. For enterprises evaluating AI agents for automation, these metrics could be decisive in procurement and deployment decisions. But the approach also raises questions about data privacy, consent, and corporate IP boundaries in training and evaluation workflows; a toy redaction pass is sketched below to make the privacy step concrete.
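
Redaction typically has to happen before an artifact ever enters an evaluation set. The pass below uses regex patterns for emails, phone numbers, and US SSNs purely as an illustration; production pipelines rely on dedicated PII detection (NER models, document-type-specific rules, human review), and nothing here reflects OpenAI's actual process.

```python
import re

# Illustrative patterns only; regexes alone are not sufficient for real PII removal.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders before a document
    is added to an evaluation set."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or 555-867-5309 re: the audit."
    print(redact(sample))
```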
