Evals Are the Product: Building Useful AI Beyond the Benchmark
Why we’re entering AI’s second half — and why evaluation is not just an afterthought, but the foundation of real-world utility.
The Second Half of AI
Recently, I read an interesting article by OpenAI researcher Shunyu Yao titled The Second Half [1]. It makes a compelling case that the AI industry is now entering its “second half” — a shift from improving models and methods to building evaluations and benchmarks.
In reinforcement learning, three components drive performance: the algorithm, prior knowledge, and the environment. Over the past few years, we’ve seen rapid progress in the first two. Researchers have optimized architectures, tuning strategies, and training pipelines to squeeze more capability out of transformer models. At the same time, companies have fed these models massive amounts of data — Common Crawl, codebases, textbooks, product manuals, and more. That’s given us general-purpose LLMs that can pass bar exams, solve advanced math problems, and write well-structured, accurate code.
However, Shunyu points out a critical insight: there’s a growing disconnect between benchmark performance and real-world utility. As model intelligence improves, utility does not necessarily scale with it. Why? Because we’re no longer bottlenecked by the model’s capability, but by the environment — and in the context of AI applications, that means evaluations.
What Are We Really Evaluating For?
This is where evaluation — “evals” — becomes more than a performance check. It becomes the bridge between intelligence and usefulness.
This idea is echoed in another piece I found insightful: An LLM-as-Judge Won’t Save the Product by Eugene Yan [2]. His message is simple but critical: evals are not a one-time benchmark — they are an iterative process.
In traditional software, we use test-driven development to catch regressions early, clarify requirements, and evolve our understanding of edge cases. AI systems are no different. The tests — the evals — must evolve with the system.
You don’t need to get it perfect from the start. In a talk at LangChain’s Interrupt conference, Andrew Ng shared that he often begins with a rough, even crude, metric — just something that gives a signal [3]. That first draft helps him understand where the model fails. He then improves the evaluation criteria in response to those failures. That feedback loop is the real magic.
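To make that loop concrete, here is a minimal sketch in Python. Everything in it is a made-up placeholder (the keyword-based metric, the fake model, the test cases); the point is only the shape of the loop: score roughly, collect failures, study them, then tighten the metric and run again.

```python
# A minimal sketch of the "start rough, then iterate" eval loop described
# above. The task, metric, and model call are all hypothetical placeholders.

def rough_metric(output: str, expected_keywords: list[str]) -> bool:
    """First-draft metric: does the output mention the expected keywords?

    Crude, but it gives a signal we can inspect and refine later."""
    return all(kw.lower() in output.lower() for kw in expected_keywords)


def run_eval(model_fn, test_cases: list[dict]) -> list[dict]:
    """Run every case and keep the failures so we can study them."""
    failures = []
    for case in test_cases:
        output = model_fn(case["prompt"])
        if not rough_metric(output, case["expected_keywords"]):
            failures.append({"prompt": case["prompt"], "output": output})
    return failures


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:  # stand-in for a real LLM call
        return "Refunds are processed within 5 business days."

    cases = [{"prompt": "How long do refunds take?",
              "expected_keywords": ["refund", "business days"]}]
    # Inspect failures, then tighten rough_metric (tone, format,
    # factuality, ...) and re-run.
    print(run_eval(fake_model, cases))  # [] -> no failures on this case
```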

The X-Bench Approach: Tech-Market Fit for Agents
One practical example of this shift is X-Bench, a benchmark developed by the VC firm HongShan (formerly Sequoia China) [4][5]. X-Bench doesn’t evaluate agents on abstract academic tasks. Instead, it focuses on whether AI agents can deliver economic value in professional domains.
X-Bench introduces the concept of Technology-Market Fit. It’s not just about asking, “Is this model smart?” — it’s asking, “Where does this model actually work in the market?”
To measure this, the benchmark selects domains like:
- Recruitment: measures how well an agent can screen resumes, assess candidate fit, or draft interview questions.
- Marketing: evaluates an agent’s ability to generate targeted ad copy, customize outreach, or conduct competitive research.
These metrics go beyond general intelligence. They reflect whether an agent can perform concrete tasks within a professional context — tasks that someone would otherwise pay a human to do.
This kind of profession-aligned evaluation is a big step forward. It offers companies, investors, and builders a more actionable way to judge readiness and opportunity. It also reminds us: performance only matters when it’s tied to outcomes that matter.


Why Coding Is an Outlier
A question often comes up: why are coding agents such as Cursor, Windsurf, and Devin getting so much traction?
It’s not just that developers are building tools for themselves. It’s because coding has a built-in evaluation mechanism. You write code, and the compiler or runtime tells you immediately whether it works. You write a test, and it either passes or fails. This instant, objective feedback loop makes iteration faster and evaluation easier.
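To see how low-friction that feedback is, here is a trivial, made-up example, runnable with pytest: the function and the test are invented for illustration, but the test either passes or fails with no human judgment in the loop.

```python
# Coding's built-in evaluator: a test either passes or fails.
# The function and test are made-up examples, runnable with pytest.

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    return round(price * (1 - percent / 100), 2)


def test_apply_discount():
    # Objective, instant feedback: no subject-matter expert required.
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(59.99, 0) == 59.99
```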
Unfortunately, most domains aren’t like that. For example:
- In finance, how do you judge if an AI co-pilot gave the right advice?
- In law, how do you ensure a chatbot stayed compliant with jurisdiction-specific rules?
- In healthcare, what counts as a safe and helpful recommendation?
These industries lack a clear “compiler.” Evaluation becomes a fuzzy area, requiring subject-matter experts to weigh in. But involving humans introduces friction — it’s expensive, slow, and hard to scale.
To get around this, some teams use LLMs to evaluate LLMs. But as Eugene Yan warns, using a model as a judge doesn’t absolve you from building a good evaluation process. If your underlying evaluation criteria and process are flawed, you’ll still get garbage-in, garbage-out behavior — even if the LLM judge sounds smart.
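As a rough illustration of what that means in practice, here is a hedged sketch of an LLM-as-judge setup. The rubric, the PASS/FAIL labels, and the call_llm placeholder are all assumptions for illustration, not any specific vendor API; the takeaway is that the judge is only as good as the written criteria and the human spot-checks around it.

```python
# A hedged sketch of an LLM-as-judge setup. The rubric, the labels, and
# `call_llm` are illustrative placeholders, not any specific vendor API.
# The judge inherits the quality of the criteria you write and the review
# process around it.

RUBRIC = """You are grading a finance co-pilot's answer.
Score each criterion as PASS or FAIL and explain briefly:
1. Cites the specific account data it relied on.
2. Avoids giving regulated investment advice.
3. States uncertainty when data is missing.
Return one line per criterion, e.g. "1: PASS - ..."."""


def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (swap in your own client here)."""
    return "1: PASS - cites the March statement\n2: PASS\n3: FAIL - no caveat"


def judge(question: str, answer: str) -> str:
    """Ask the judge model to grade an answer against the written rubric."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"
    return call_llm(prompt)


# Even with a judge in place, a sample of its verdicts should be spot-checked
# by humans so the rubric itself keeps improving.
print(judge("Can I afford to hire in Q3?", "Based on your March statement..."))
```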
Case Study: MedHELM and Taxonomy-Driven Evals
A strong example comes from Stanford’s MedHELM project, which takes a structured, task-based approach to evaluating LLMs in healthcare [6].
Rather than using generic metrics, MedHELM:
- Identifies real clinical workflows, like “clinical decision support” and “patient communication”.
- Breaks each workflow into concrete tasks, such as “classify medical condition from notes” or “draft patient education messages.”
- Defines who the task is for (e.g., clinician), when it occurs, what data is involved, and what success looks like.
This creates a taxonomy-driven eval framework. It makes evaluation repeatable, interpretable, and grounded in professional reality.
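As a rough sketch of what such a taxonomy could look like in code, here is one way to encode it as plain data. The fields mirror the structure described above; the example entries are illustrative and not drawn from the actual MedHELM benchmark.

```python
# A rough sketch of how a taxonomy-driven eval framework might be encoded
# as data. Field names and entries are illustrative assumptions, not the
# actual MedHELM schema.

from dataclasses import dataclass


@dataclass
class EvalTask:
    workflow: str  # the clinical workflow the task belongs to
    task: str      # the concrete task being evaluated
    user: str      # who the output is for
    inputs: str    # what data the model sees
    success: str   # what a good result looks like


TAXONOMY = [
    EvalTask(
        workflow="Clinical decision support",
        task="Classify medical condition from notes",
        user="Clinician",
        inputs="De-identified progress notes",
        success="Condition matches the chart diagnosis",
    ),
    EvalTask(
        workflow="Patient communication",
        task="Draft patient education message",
        user="Patient",
        inputs="Discharge summary",
        success="Accurate, plain-language, appropriate reading level",
    ),
]
```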
This model could be replicated in other industries. For example, the same structure applied to the finance industry might cover:
- Fraud Detection (transaction classification, anomaly scoring)
- Credit Risk (cashflow pattern modeling, default prediction)
- Customer Support (compliance-aware response generation)
- SMB Finance Copilots (forecasting cashflow, surfacing insights from bank data)
The benefit? A clear map of what to evaluate, why it matters, and how it connects to business value.


A Mental Shift: Evals Are the Product
If you can’t evaluate your model’s behavior — on the right tasks, for the right users, with meaningful metrics — then you don’t really have a product. You just have a demo.
To build useful AI systems, we need to:
- Design use-case–specific metrics
- Validate outputs with real-world feedback
- Improve iteratively alongside the model, UI, and data
Evals are no longer just afterthought tooling. They’re the lens through which we understand whether our systems actually work for people.
Conclusion
We’re past the phase where better models alone drive better outcomes. We’re in the second half now — where success depends on whether your AI can deliver value in context.
You don’t need a perfect eval to get started. Instead:
- Start simple
- Iterate fast
- Align with real-world outcomes
Because in this next chapter of AI, how you evaluate is what you build.
References
[1] Shunyu Yao, The Second Half, https://ysymyth.github.io/The-Second-Half/
[2] Eugene Yan, An LLM-as-Judge Won’t Save The Product — Fixing Your Process Will, https://eugeneyan.com/writing/eval-process/
[3] Andrew Ng, State of AI Agents | LangChain Interrupt, https://www.youtube.com/watch?v=4pYzYmSdSH4
[4] X-Bench Leaderboard, “Evergreen” Benchmark for AI Agents, https://xbench.org
[5] HongShan (formerly Sequoia China), X-BENCH: Tracking Agents Productivity — Scaling with Profession-Aligned Real-World Evaluations, https://xbench.org/files/xbench_profession_v2.4.pdf
[6] Stanford Center for Research on Foundation Models (CRFM), MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks, https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard