Vivold Consulting

OpenAI solicits real task files from contractors to benchmark and train next-gen AI agents

Key Insights

OpenAI has asked contractors to upload real past job deliverables to help it evaluate next-generation AI agents against human performance. Workers must anonymize sensitive information before uploading. The initiative is part of an effort to establish benchmarks for real-world task capabilities.

Benchmarking AI against real human work


OpenAI is quietly collecting actual deliverables from contractors' past jobs (Word docs, spreadsheets, PDFs) to create realistic benchmarks for how well its upcoming AI agents can handle complex, real office tasks. This isn't about synthetic datasets; it's about grounding evaluation in real professional work.

What's unusual about this project


- Contractors are being asked to upload actual work artifacts, not summaries, with personally identifiable information redacted, a shift away from simulated or artificial evaluation sets (a minimal redaction sketch follows this list).
- The goal is to compare AI agent performance to human baseline outputs across a spectrum of real tasks, from analysis to creation.
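To illustrate what the required anonymization step might involve, here is a minimal sketch of pre-upload redaction in Python. The file names, regex patterns, and placeholder labels are assumptions for illustration, not OpenAI's actual tooling or requirements; real deliverables in Word, Excel, or PDF formats would need format-specific handling and manual review.

```python
# Minimal sketch of pre-upload anonymization for a plain-text export of a deliverable.
# Patterns and file names are illustrative only; a real workflow would also need
# manual review and handling for docx, xlsx, and pdf formats.
import re
from pathlib import Path

# Hypothetical patterns for common identifiers; deliberately not exhaustive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

if __name__ == "__main__":
    source = Path("deliverable.txt")  # hypothetical exported deliverable
    cleaned = redact(source.read_text(encoding="utf-8"))
    Path("deliverable_redacted.txt").write_text(cleaned, encoding="utf-8")
```

Pattern-based redaction alone is rarely sufficient: names, client identifiers, and contextual details that reveal the employer usually require human review before anything is uploaded.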

Why this matters for developers and businesses


Benchmarking against real work could yield harder performance targets and clearer signals about where AI still lags human professionals. For enterprises evaluating AI agents for automation, these metrics could be decisive in procurement and deployment decisions. But the practice also raises questions about data privacy, consent, and corporate IP boundaries in training and evaluation workflows.