The Data Supplier Behind the Large Models: Surge AI

I first heard of Surge AI from a podcast interview with Edwin Chen, right as they were raising their first round of funding. A few of Edwin’s points left a strong impression on me:

  • He dislikes playing the Silicon Valley hype game, preferring to do the right thing over things that merely look impressive
  • He opposes the approach of throwing labor at data labeling, instead focusing on producing the highest-quality data and improving efficiency by optimizing the process
  • He argues that public benchmarks (such as LMArena, which ranks models by user preference) are unreliable

I strongly agree with his views. For one, I also dislike formulaic startup paths. On top of that, I once spent a stretch of time working on AutoML, where I trained many models intensively to push performance, so I know firsthand how important understanding the data is to a model’s results. Because the company interests me, this article surveys Surge AI’s development, team, business situation, and core views.

Surge AI’s Development and Team

Surge AI was founded in 2020 (after the release of GPT-3). Unlike most Silicon Valley companies, founder Edwin didn’t take VC money; instead, he built the company with savings from his job at a big tech firm and made it profitable from the very start. In the interview, Edwin said he dislikes Silicon Valley’s status game (fundraising, hype, tweeting, building cliques, and so on) and that he wanted to do the right things on his own terms, without being beholden to anyone after taking VC money.

From the beginning, Surge AI offered high-quality data labeling (unlike cheap, large-scale crowdsourcing). Later it added RLHF data and workflows, along with professional services such as quality control, complex rubric design, domain-expert labeling, fast experimentation interfaces, and red-teaming tools. These services were deeply involved in training Claude.

The company’s 2024 revenue had already surpassed $1 billion, and in July 2025 it planned to raise $1 billion for the first time, at a valuation of around $30 billion.

CEO Edwin graduated from MIT and previously did data-science and machine-learning work at Twitter (search/ad quality), Google, and Facebook/Meta, giving him a deep understanding of data. The whole company’s headcount is extremely lean: the core team is fewer than 100 people, focused on building the toolchain, the platform, and evaluation systems. Counting part-time staff and consultants, it’s around 250 people. Beyond that, Surge AI also partners with other contractors to handle labor-intensive tasks.

Business and Case Studies

There isn’t much public information about Surge AI, but a few case studies can be found on their blog:

The GSM8K dataset built for OpenAI (2022)

This is a dataset of 8,500 grade-school math problems, all of them word problems, for example:

After Bobby gets his paycheck, Darren will have twice as much money as Bobby. Bobby currently has $40 and will receive a $16 paycheck. How much money does Darren have now?
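
As a quick check of the arithmetic, here is a minimal sketch that solves the sample problem. It reads the amounts as $40 on hand and a $16 paycheck, and it assumes Darren’s balance does not change in the meantime. The function name is hypothetical and is not part of GSM8K.

```python
def darren_money(bobby_now: int, paycheck: int, ratio: int = 2) -> int:
    """Return Darren's money, assuming it equals `ratio` times Bobby's
    money after the paycheck and that Darren's own balance stays constant."""
    bobby_after = bobby_now + paycheck
    return ratio * bobby_after


print(darren_money(40, 16))  # 2 * (40 + 16) = 112
```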

Building this dataset required attention to both diversity and correctness; see the link for details.

Training and evaluating Claude (2023)

In Claude’s training and evaluation, these Surge AI features were used:

  • Quality control
  • Domain-expert labeling
  • Fast experimentation interfaces
  • Red-teaming tools

There aren’t many details, though; see the original blog post.

Building AdvancedIF for Meta Superintelligence (2025)

Today’s public benchmarks are actually disconnected from real-world scenarios; everyone focuses on metrics that are easy to evaluate (such as character counts) rather than metrics that are genuinely useful. When building this evaluation system, Surge AI followed these principles:

  • Build true evaluation targets rather than simple proxy targets (for example, character count is easy to measure, but truly evaluating writing ability is hard)
  • Use query data written by human experts rather than synthetic data
  • Make the evaluation flexible enough, instead of treating it as a simple multiple-choice question
  • Treat messy, multi-turn conversations as a feature rather than noise, because real users are messy in every turn of a conversation
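
To make the first principle concrete, the sketch below contrasts a character-count proxy with a rubric-style score on the same two responses. Everything in it is an illustrative assumption: the `Criterion` class, the sample prompt, and the string checks are hypothetical stand-ins, and in a real evaluation the rubric items would more likely be judged by human experts than by simple string matching. It is not a description of how AdvancedIF is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item: a description plus a check over the response text."""
    description: str
    check: Callable[[str], bool]


def proxy_score(response: str, min_chars: int = 200) -> float:
    """Easy-to-measure proxy: passes whenever the response is long enough."""
    return 1.0 if len(response) >= min_chars else 0.0


def rubric_score(response: str, rubric: List[Criterion]) -> float:
    """Fraction of rubric criteria that the response satisfies."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    passed = sum(1 for c in rubric if c.check(response))
    return passed / len(rubric)


# Hypothetical rubric for the prompt: "Summarize the refund policy in
# exactly two sentences and mention the 30-day limit."
rubric = [
    Criterion("mentions the 30-day limit",
              lambda r: "30-day" in r or "30 days" in r),
    Criterion("is exactly two sentences",
              lambda r: r.count(".") == 2),
]

padded = "Refunds are great. " * 20  # long, but ignores the instructions
concise = "Refunds are accepted within 30 days. A receipt is required."

for name, text in [("padded", padded), ("concise", concise)]:
    print(name, proxy_score(text), rubric_score(text, rubric))
```

Under these assumptions the padded response passes the proxy while failing every rubric item, and the concise response does the opposite, which is the gap between an easy metric and a useful one.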

See the original post for details.

Beyond the customer case studies above, Surge AI’s blog has plenty of interesting analyses, such as model evaluations on finance problems and coding problems. When asked to produce a slide deck on financial risk and to make financial forecasts, GPT-5, Claude, and Gemini each ran into problems such as hallucinations, formatting errors, and broken formulas. In the Bash-Only test on SWE-Bench, Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5 all hallucinated, but each followed a distinctly different hallucination trajectory.

Comparison of hallucination behavior across GPT-5, Claude, and Gemini on finance and coding tasks

There’s also a simpler case: plain PDF text extraction, which both ChatGPT and Gemini failed at.

Core Views

Public benchmarks have many problems

Take LMArena: you enter a question, compare two answers, and then pick the better one.

This optimizes for human attention rather than for ground truth, because ordinary users don’t fact-check: whichever answer is longer or has flashier emojis gets the vote (ChatGPT has this mechanism too). That is quite unreasonable, yet the leaderboard is hugely popular in the community and treated as gospel. Customers and investors look at performance on it, which forces developers to optimize toward a flawed evaluation system.
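
The effect can be illustrated with a small simulation. The sketch below uses a generic Elo-style update, chosen only for illustration; it is not a description of LMArena’s actual rating method. The two model names, the starting rating of 1000, the `k` factor, and the `p_prefer_long` parameter are all hypothetical.

```python
import random


def expected_win(r_a: float, r_b: float) -> float:
    """Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, a_won: bool, k: float = 8.0):
    """Return the updated (r_a, r_b) after one pairwise vote."""
    e_a = expected_win(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta


def simulate(votes: int, p_prefer_long: float, seed: int = 0):
    """Simulate votes between a 'verbose' model (long answers) and a
    'terse' model (short answers). Voters never fact-check: they pick the
    longer answer with probability p_prefer_long, whatever its accuracy."""
    rng = random.Random(seed)
    verbose, terse = 1000.0, 1000.0
    for _ in range(votes):
        verbose_won = rng.random() < p_prefer_long
        verbose, terse = update(verbose, terse, verbose_won)
    return verbose, terse


verbose, terse = simulate(votes=2000, p_prefer_long=0.8)
print(f"verbose: {verbose:.0f}, terse: {terse:.0f}")
```

Because correctness never enters the vote in this toy model, the verbose model climbs the leaderboard on length alone.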

Many other benchmarks have serious problems of their own. For example, 30% of the answers in “Humanity’s Last Exam” are wrong, and 36% of the answers in HellaSwag are wrong. If even the benchmarks are flawed, then models optimized against them won’t be good either.
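
A rough way to see why this matters: if a share of the answer key is wrong, even a model that is always right cannot score above the share of correctly keyed items. The sketch below takes the 30% and 36% figures above at face value and assumes key errors are independent of model errors; the function and its parameters are hypothetical.

```python
def measured_accuracy(true_accuracy: float, key_error_rate: float,
                      p_match_wrong_key: float = 0.0) -> float:
    """Expected benchmark score when part of the answer key is wrong.

    Assumptions (illustrative only):
    - on a correctly keyed item, the model scores when it is right;
    - on a wrongly keyed item, a right answer is marked wrong, and a wrong
      answer scores only if it happens to match the wrong key.
    """
    for name, value in [("true_accuracy", true_accuracy),
                        ("key_error_rate", key_error_rate),
                        ("p_match_wrong_key", p_match_wrong_key)]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    good_items = (1.0 - key_error_rate) * true_accuracy
    bad_items = key_error_rate * (1.0 - true_accuracy) * p_match_wrong_key
    return good_items + bad_items


# Score ceiling for a model that is always right.
print(f"{measured_accuracy(1.0, 0.30):.2f}")  # 0.70
print(f"{measured_accuracy(1.0, 0.36):.2f}")  # 0.64
```

Under these assumptions, any score above that ceiling can only come from reproducing the key’s mistakes.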

Original post

Today’s agents are still far from human common sense

If we divide agent capabilities into tiers, from lowest to highest, current models still fall short of the top tier, human-level common-sense reasoning:

  • Basic tool use, planning, and goal setting
  • The ability to adapt to real-world conditions (for example, correctly handling a query in which the customer has made a typo)
  • Staying grounded in the actual environment (for example, staying tightly aligned with the current conversation’s date and context)
  • Common-sense reasoning (for example, in a customer-service task, proactively looking up the membership tier based on the account information the user provides, in order to offer personalized pricing)
Agent capability tier framework: from basic tool use to human-level common-sense reasoning

Original post