Remote Labor Index: AI Automation of Remote Work Explained


Artificial intelligence is advancing at an extraordinary pace. Every week brings new model releases, record-breaking benchmark scores, and bold claims about AI’s ability to replace human workers. Yet a critical question has remained unanswered: how much of the remote labor economy can AI actually automate today? The Remote Labor Index was created specifically to answer that question with data, not speculation.

This guide covers everything you need to know about the Remote Labor Index: what it is, how it works, what the results reveal about the current state of AI automation, and what it means for employers, freelancers, and the future of work.

What Is the Remote Labor Index (RLI)?

The Remote Labor Index (RLI) is a rigorous, empirical benchmark that measures the capability of AI agents to perform real-world, economically valuable remote work. Unlike most AI benchmarks, which test isolated skills (writing a poem, solving a math problem, answering trivia), the RLI evaluates whether AI can complete entire professional projects from start to finish, to a standard that a paying client would actually accept.

It was developed through a collaboration between the Center for AI Safety (CAIS) and Scale AI, with contributions from 47 researchers across academia and industry. The research was published on October 30, 2025, under the arXiv identifier 2510.26787.

The name reflects its purpose precisely: it indexes the degree to which remote labor (work that can be done via a computer) can be automated by modern AI systems.

Key Facts at a Glance

  • 240 real-world freelance projects from platforms like Upwork
  • 23 sectors covered, including design, architecture, data analysis, game development, video animation, and more
  • Over 6,000 hours of combined human labor represented
  • Total project value exceeding $140,000
  • Mean human completion time: 28.9 hours per project (median: 11.5 hours)
  • Average project cost: $632.60 (median: $200)
  • Evaluated across multiple frontier AI models and agent frameworks
  • All evaluations performed manually by trained domain experts

Why Was the Remote Labor Index Created?

The motivation for the RLI stems from a fundamental problem with existing AI benchmarks: they do not reflect economic reality. Most AI evaluation frameworks test narrow, isolated capabilities such as answering multiple-choice questions, generating code snippets, or summarizing text. While these tests are useful for comparing models in controlled settings, they tell us very little about whether an AI agent could realistically replace a professional doing a complex, multi-step job.

Consider a benchmark that tests whether an AI can write clean Python code. Scoring well on such a benchmark does not tell you whether the AI could successfully take a client’s vague design brief, build an entire game prototype, iterate on feedback, and deliver production-ready files. The gap between benchmark performance and real-world automation capability is enormous, and the RLI was created precisely to measure and track it.

The researchers behind the RLI identified three specific limitations of existing benchmarks that they set out to address:

  • Lack of economic grounding: Most benchmarks use synthetic or academic tasks with no real monetary value attached. RLI uses projects with documented cost and labor data from actual freelance market transactions.
  • Narrow task scope: Existing agent benchmarks typically evaluate single tasks, not end-to-end professional projects that require multi-step planning, file management, and client-ready output.
  • Limited sector diversity: Prior benchmarks focus heavily on coding and text tasks. RLI spans 23 diverse categories, from 3D modeling to video production to architectural CAD, reflecting the true breadth of computer-based work.

How Does the Remote Labor Index Benchmark Work?

Understanding RLI’s methodology is essential to interpreting its results. The benchmark was carefully designed to be both representative of real-world remote labor and rigorous enough to produce statistically reliable findings.

Project Selection and Sourcing

Every project in the RLI is sourced directly from actual freelance platforms, primarily Upwork. The methodology is explicitly bottom-up: researchers engaged directly with human professionals who agreed to contribute their completed project briefs and deliverables as research samples. This approach ensures that the projects are grounded in real economic transactions not hypothetical tasks invented by researchers.

Each project in the dataset contains three components:

  • The Brief: A text document describing the work to be done, exactly as the client provided it.
  • Input Files: All reference materials, assets, and files necessary to complete the project.
  • The Gold-Standard Deliverable: The actual completed work submitted by the human professional, which serves as the quality benchmark.

The 240 projects span an extraordinary range of complexity, with costs ranging from small jobs worth under $50 to complex engagements valued at over $10,000, and completion times ranging from a couple of hours to more than 100 hours.

The Four Primary Metrics

RLI uses four distinct metrics to evaluate AI agent performance, each capturing a different dimension of automation capability:

| Metric | What It Measures |
| --- | --- |
| Automation Rate | % of projects where AI output matches human quality (the primary metric) |
| Elo Score | Relative quality ranking from pairwise comparisons (human baseline = 1,000) |
| Dollars Earned | Total dollar value of the projects the AI successfully completes |
| Cost Savings (Autoflation) | % reduction in project cost when using AI vs. human labor |

The Automation Rate is the most important metric. It defines success as whether a “reasonable client” would accept the AI’s deliverable as equivalent to commissioned professional work. This standard is intentionally high: it is not enough for an AI to produce something that looks passable; it must meet the bar that a paying client would consider satisfactory.
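The two simplest metrics above can be sketched in code. The project records, field names, and values below are invented for illustration; they are not the RLI dataset's actual schema:

```python
# Illustrative computation of two RLI-style metrics from hypothetical
# evaluation records. Each record pairs a project's market value with
# whether the AI deliverable was judged client-acceptable.
projects = [
    {"value_usd": 200.0, "automated": True},
    {"value_usd": 1500.0, "automated": False},
    {"value_usd": 75.0, "automated": True},
    {"value_usd": 980.0, "automated": False},
]

automated = [p for p in projects if p["automated"]]

# Automation Rate: share of projects delivered at client-acceptable quality.
automation_rate = len(automated) / len(projects)  # 0.5 for this toy data

# Dollars Earned: total value of the projects the agent actually completed.
dollars_earned = sum(p["value_usd"] for p in automated)  # 275.0
```

Note that Dollars Earned depends on which specific projects pass, not just how many, which is why it is tracked as a separate metric from the automation rate.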

The Evaluation Process

Because the projects span dozens of file formats (3D models, video files, CAD drawings, layered design documents, audio files, and complex codebases), automated evaluation is not feasible. All RLI evaluations are therefore performed manually by trained human experts using a custom-built, open-source evaluation platform.

Reviewers conduct pairwise comparisons between AI-generated deliverables, using the human professional’s deliverable as the quality reference, and assess each submission on two axes: closeness to project completion and overall quality. A majority vote from three independent evaluators then determines whether each project counts as automated.

The result is impressively reliable: inter-annotator agreement on pass/fail automation ratings reaches 94.4%, a strong signal that the evaluation process is consistent and reproducible.
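The decision rule described above (three evaluators, majority vote) can be sketched as follows. The vote data is invented, and the pairwise-agreement formula is an illustrative assumption; the paper's exact agreement statistic may be computed differently:

```python
# Sketch of a three-evaluator majority vote and a simple pairwise
# inter-annotator agreement figure. All vote data is hypothetical.
from collections import Counter

# Each inner list holds three evaluators' pass/fail votes for one project.
votes = [
    [True, True, False],
    [False, False, False],
    [True, True, True],
]

def majority(vs):
    """Return the rating given by the majority of evaluators."""
    return Counter(vs).most_common(1)[0][0]

decisions = [majority(vs) for vs in votes]

def pairwise_agreement(vs):
    """Fraction of evaluator pairs that gave the same rating."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    return sum(vs[i] == vs[j] for i, j in pairs) / len(pairs)

# Average agreement over all projects (7/9 for this toy data).
agreement = sum(pairwise_agreement(vs) for vs in votes) / len(votes)
```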

Remote Labor Index Results: How Are AI Agents Performing in 2025?

The RLI results are striking and humbling for proponents of imminent AI-driven labor displacement. Despite the rapid progress AI systems have demonstrated on knowledge benchmarks and reasoning tests, current AI agents perform near the floor on real-world remote work projects.

Automation Rates by Model

Here are the benchmark results for leading frontier AI models tested on the Remote Labor Index:

| AI Agent / Model | Automation Rate (%) |
| --- | --- |
| Manus (best performer) | 2.5% |
| Grok 4 (xAI) | 2.1% |
| Claude Sonnet 4.5 (Anthropic) | 2.1% |
| GPT-5 (OpenAI) | 1.7% |
| ChatGPT Agent (OpenAI) | 1.3% |
| Gemini 2.5 Pro (Google) | 0.8% |

To put these numbers in perspective: even the top-performing AI system, Manus, successfully completed only 6 out of 240 professional projects at an acceptable quality standard. That is an automation rate of 2.5%, meaning that 97.5% of real remote work projects are beyond the reach of today’s most advanced AI agents.

Key Insight: The Benchmark Gap

The same AI models that score impressively on standard academic and coding benchmarks, sometimes surpassing human-level performance, achieve near-zero automation rates on RLI. This reveals an important truth: benchmark performance and real economic automation capability are fundamentally different things. Passing a multiple-choice knowledge test is nothing like successfully delivering a complete architectural design package or a polished animated video.

Where AI Agents Succeed

While the overall automation rate is very low, AI agents do show meaningful performance in specific categories. Analysis of successful completions reveals that AI currently performs best in:

  • Audio generation and basic music composition tasks
  • Image generation and simple graphic design
  • Data scraping and basic data processing
  • Straightforward writing and content generation tasks

These successes share a common thread: they involve well-defined outputs with limited multi-step complexity and don’t require iterative refinement based on nuanced client feedback.

Common AI Failure Modes

A detailed error analysis of failed projects reveals four primary categories of failure (a single project can exhibit more than one failure mode, which is why the percentages below sum to more than 100%):

  • Incompleteness (36%): AI agents frequently fail to deliver all required components of a project. They may produce part of what was requested but leave critical elements unfinished or missing.
  • Sub-professional quality (46%): Deliverables are submitted but fall below the quality threshold that a paying client would accept. This is the largest failure category.
  • File errors (18%): Output files are broken, corrupted, or in the wrong format, making them unusable regardless of the underlying quality of the work.
  • Inconsistencies (15%): Deliverables contain internal contradictions, misaligned elements, or logical errors that would require substantial rework.

What Do RLI Results Mean for Employers and Businesses?

For business leaders and HR professionals, the Remote Labor Index offers grounded, empirical guidance at a time when AI hype can make strategic planning difficult.

The headline finding, that AI automates fewer than 3% of professional remote work projects, should not be read as a reason to dismiss AI adoption. Rather, it invites a more nuanced and accurate view of where AI genuinely delivers value today versus where human expertise remains indispensable.

Where Businesses Should Invest in AI Automation Now

RLI results suggest that AI provides the most reliable value as a task-level augmentation tool rather than a full project replacement. Businesses are likely to see the strongest ROI from AI in:

  • First-draft content generation that human professionals then refine
  • Data analysis and structured reporting tasks
  • Image and graphic generation for ideation and mood boards
  • Code generation for well-scoped, defined functions within a larger project
  • Summarization, transcription, and document processing workflows

What RLI Results Mean for Workforce Planning

The steady improvement in Elo scores, even as absolute automation rates remain very low, indicates that AI capabilities are advancing in a measurable and consistent direction. Businesses should treat this as a signal to invest now in understanding where and how AI will eventually intersect with their specific workflows, so they are prepared to adapt rather than react.

Workforce planning teams should focus on roles that involve clear project deliverables, multi-file complexity, client communication, and domain expertise, all areas where human professionals will remain essential for the foreseeable future based on RLI evidence.

What Does the Remote Labor Index Mean for Remote Workers and Freelancers?

For remote workers and freelancers, the RLI delivers more reassuring news than much of the mainstream AI coverage might suggest. The data directly challenges the narrative that AI is on the verge of replacing skilled remote professionals.

A 2.5% automation rate means that 97.5% of the real, complex, paid work on platforms like Upwork currently remains beyond AI’s reach. The research shows that the types of work most vulnerable to near-term AI automation are relatively simple, well-defined tasks, not the complex, multi-faceted projects for which experienced freelancers command premium rates.

Skills That Remain AI-Resistant

Based on RLI’s failure analysis, the following professional capabilities remain strongly protected from near-term automation:

  • End-to-end project management and client relationship skills
  • Domain-specific expertise requiring iterative judgment (architecture, advanced design, engineering)
  • Creative direction and holistic aesthetic decision-making
  • Multi-file, multi-format project delivery with complex interdependencies
  • Adaptation to ambiguous or evolving client briefs
  • Quality assurance and professional self-review of complex deliverables

How Freelancers Should Respond

The smartest response for skilled freelancers is not to ignore AI, but to learn to use it strategically as a productivity multiplier. Freelancers who integrate AI tools for the task-level components of their work (research, first drafts, asset generation) while applying their own expertise to the full project can operate more efficiently and deliver higher value. This creates a competitive advantage that pure AI agents currently cannot replicate.

How Does RLI Compare to Other AI Benchmarks?

The Remote Labor Index occupies a unique position in the AI evaluation landscape. To understand what makes it distinctive, it helps to compare it with other major benchmark types:

| Benchmark Type | What It Tests | RLI Difference |
| --- | --- | --- |
| Knowledge benchmarks (MMLU, GPQA) | Factual recall, reasoning, Q&A | RLI tests complete real-world project delivery, not knowledge retrieval |
| Coding benchmarks (SWE-bench, HumanEval) | Code generation for defined problems | RLI includes code tasks but also 22 other diverse professional sectors |
| Computer use benchmarks (OSWorld) | Navigating GUIs and desktop tasks | RLI focuses on economically valuable outputs, not navigation capability |
| Agentic task benchmarks (GAIA) | Multi-step task completion | RLI grounds tasks in real economic transactions with real monetary value |

The RLI’s defining characteristic is its economic grounding. Every project in the benchmark has a real monetary value, a real client brief, and a real human-produced gold standard. This means that RLI scores translate directly into economic statements: an automation rate of 2.5% means AI can replace about $3,500 worth of the $140,000+ project bundle evaluated.

The Future of AI Automation: What the RLI Trajectory Tells Us

While current automation rates are near zero, the RLI was designed to be a longitudinal tracking tool, not a one-time snapshot. The research team specifically built Elo scoring into the methodology to detect incremental, granular improvements even while absolute automation rates remain very low.

The early Elo data confirms that models are consistently improving in the quality and completeness of their output, even if they have not yet crossed the threshold for client-acceptable work at meaningful rates. Newer frontier models consistently outrank older ones in Elo comparisons, meaning real progress is happening beneath the surface of the automation rate metric.
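As a rough illustration of how pairwise Elo rankings of this kind work, here is the standard Elo update rule with the human baseline pinned at 1,000. The K-factor, starting rating, and match outcomes are invented; the RLI's actual rating procedure may differ in detail:

```python
# Standard Elo update: an agent's rating shifts toward its observed
# win/loss results in pairwise comparisons. All parameters here are
# illustrative assumptions, not values from the RLI paper.
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_agent: float, r_opponent: float, won: bool, k: float = 32) -> float:
    """Move the agent's rating toward its actual result by up to k points."""
    return r_agent + k * ((1.0 if won else 0.0) - expected_score(r_agent, r_opponent))

rating = 900.0  # hypothetical agent starting below the human baseline
for won in [False, False, True]:  # invented outcomes vs. the 1,000-rated baseline
    rating = elo_update(rating, 1000.0, won)
```

This is why Elo can register progress even while the automation rate stays flat: an agent that loses comparisons less decisively, or wins occasionally, climbs the scale without ever crossing the client-acceptable threshold.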

The ‘Autoflation’ Concept

One of the most forward-looking contributions of the RLI research is the concept of “autoflation”: the percentage reduction in cost that occurs as AI completes projects at a lower price than human labor. Even at today’s low automation rates, AI is already creating price pressure in the specific task categories it can handle reliably. As automation rates increase, autoflation will become one of the most economically significant metrics to track for labor market forecasting.
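As a sketch, autoflation for a single project can be expressed as a fractional cost reduction. The function and the pricing inputs below are illustrative assumptions, not the paper's formula:

```python
# Sketch of per-project "autoflation": the fraction by which cost falls
# when AI delivers the work in place of human labor. Inputs are invented.
def autoflation(human_cost: float, ai_cost: float) -> float:
    """Return the fractional cost reduction from substituting AI for a human."""
    return (human_cost - ai_cost) / human_cost

# e.g. a $200 project delivered by an agent for $8 of inference cost:
saving = autoflation(200.0, 8.0)  # 0.96, i.e. a 96% cost reduction
```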

What Progress Would Look Like

For the RLI automation rate to rise from 2.5% to even 10%, AI agents would need to successfully complete approximately 24 diverse professional projects across categories like architectural design, complex data analysis, and game development, all at a quality level a client would pay for. Given the current failure modes (incompleteness, sub-professional quality, file errors), reaching even this threshold would represent a fundamental advancement in agentic AI capability.

Frequently Asked Questions About the Remote Labor Index

What is the Remote Labor Index?

The Remote Labor Index (RLI) is an empirical benchmark developed by the Center for AI Safety and Scale AI to measure how effectively AI agents can automate real-world, economically valuable remote work. It consists of 240 professional freelance projects worth over $140,000, covering 23 sectors. AI performance is measured by ‘automation rate’: the percentage of projects completed to a standard a paying client would accept.

Who created the Remote Labor Index?

The RLI was created through a collaboration between the Center for AI Safety (CAIS) and Scale AI, with 47 researchers contributing. The lead authors include Mantas Mazeika and Alice Gatti (CAIS), along with researchers from Scale AI including Udari Madhushani Sehwag and Shivam Singhal. The research was published on October 30, 2025 (arXiv:2510.26787).

How is the automation rate calculated in RLI?

The automation rate is the percentage of projects where an AI agent’s deliverable is judged at least as good as the human professional’s gold-standard deliverable. Three independent trained evaluators conduct pairwise comparisons, and a majority vote determines the outcome. An inter-annotator agreement of 94.4% validates the consistency of these judgments.

Which AI model performs best on the Remote Labor Index?

As of the 2025 benchmark results, Manus is the top-performing AI agent on the RLI with a 2.5% automation rate. It is followed by Grok 4 and Claude Sonnet 4.5 (both at 2.1%), GPT-5 (1.7%), ChatGPT Agent (1.3%), and Gemini 2.5 Pro (0.8%).

Can AI replace remote workers based on RLI findings?

Not at a meaningful scale today. RLI findings show that the best AI agents can only complete 2.5% of professional remote work projects at client-acceptable quality. The research demonstrates that 97.5% of economically valuable remote work remains beyond current AI capability. However, AI can assist with specific task-level activities within larger projects.

What types of projects are included in the RLI?

The RLI includes 240 projects across 23 Upwork categories, including game development, product design, architectural design, data analysis, video animation, audio production, graphic design, 3D modeling, copywriting, and software development, among others. Projects range from a few hours to over 100 hours in completion time.

How does the RLI differ from other AI benchmarks?

Unlike academic benchmarks that test isolated skills (knowledge retrieval, code generation, reasoning), the RLI evaluates complete end-to-end project delivery grounded in real economic transactions. Every project has a real monetary value, an authentic client brief, and a human professional’s deliverable as the quality standard. This makes it uniquely relevant for understanding actual labor market automation.

Is the Remote Labor Index open source?

The evaluation platform developed for RLI is open-source and available on GitHub (github.com/centerforaisafety/rli_evaluation_platform). The research paper is freely available at arXiv (arXiv:2510.26787). Live leaderboard results are maintained at dashboard.safe.ai.

Conclusion: The Remote Labor Index as a Reality Check for AI Automation

The Remote Labor Index arrives at exactly the right moment: AI capabilities are genuinely impressive, but public understanding of those capabilities is distorted by hype in both directions. Some believe AI is already replacing professional workers en masse; others dismiss the automation risk entirely.

RLI cuts through both extremes with data. Its conclusion is clear and empirically grounded: today’s frontier AI agents can automate less than 3% of real, economically valuable remote work. But Elo scores show steady improvement. The index provides exactly what the field has lacked: a stable, economically calibrated reference point for tracking the trajectory of AI automation as it actually unfolds, not as we fear or hope it will.

For employers, policymakers, researchers, and workers alike, the Remote Labor Index is now the definitive empirical foundation for answering the question that matters most, “What can AI actually do, in the real economy, right now?”, and for tracking how that answer changes over time.

Sources & References

  • Mazeika, M. et al. (2025). Remote Labor Index: Measuring AI Automation of Remote Work. arXiv:2510.26787.
  • Center for AI Safety (CAIS). remotelabor.ai
  • Scale AI Research. The Remote Labor Index: Measuring the Automation of Work. scale.com/blog/rli
  • Live Leaderboard: dashboard.safe.ai | GitHub: github.com/centerforaisafety/rli_evaluation_platform
