AGENTS.md Beats Skills: 100% Next.js Agent Evals with an 8KB Docs Index

Vercel’s evals on Next.js 16 tasks found that agents rarely invoked skills, leaving pass rates stuck at baseline (53%). Adding explicit instructions helped (79%) but was brittle to wording. Embedding an 8KB versioned docs index in AGENTS.md achieved 100% across Build, Lint, and Test by delivering passive, always-present, retrieval-led context.
Key Points
- Skills underperformed because agents often failed to invoke them (56% non-trigger rate), yielding no gain over baseline (both 53%).
- Explicit instructions to use the skill improved pass rate to 79%, but outcomes were fragile to wording and task sequencing.
- Embedding a compressed, version-matched Next.js docs index in AGENTS.md drove 100% pass rates across Build, Lint, and Test.
- Passive, always-on context outperformed active retrieval due to no decision point, consistent availability, and no ordering issues.
- A CLI (`npx @next/codemod@canary agents-md`) automates the setup, downloading the docs to `.next-docs/` and injecting an ~8KB index into AGENTS.md to enable retrieval-led reasoning.
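To make the mechanism concrete: after running the codemod, AGENTS.md carries a compact map from topics to local doc files, so the model's reasoning starts from retrieval rather than recall. The excerpt below is a hypothetical sketch of what such an index could look like; the headings and file paths are illustrative assumptions, not the tool's actual output.

```markdown
<!-- AGENTS.md: hypothetical excerpt of an injected docs index -->
## Next.js Docs Index (version-matched; full files in .next-docs/)
- Routing and layouts: .next-docs/routing.md
- Data fetching and caching: .next-docs/data-fetching.md
- Server and client components: .next-docs/components.md

Before writing Next.js code, open the matching file above and follow it.
```

Because the index is always in context, there is no invocation decision for the agent to get wrong; the only active step left is reading the referenced file.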
Sentiment
The community is mildly positive but notably skeptical. Practitioners broadly validate that skills have reliability issues and compressed context works well in practice, but significant pushback exists around the article's framing, methodology, and scalability claims. Many commenters view the finding as obvious or overstated, and the prevailing practical consensus favors combining both approaches rather than choosing one over the other.
In Agreement
- Skills frequently fail to activate: multiple practitioners report 5-56% non-invocation rates, confirming the article's core finding that agents often skip available tools.
- The "no decision point" advantage is real: having documentation pointers always in context eliminates the non-deterministic step of the model choosing whether to invoke a skill.
- Compressed doc indexes and table-of-contents approaches in system prompts are validated by multiple practitioners as effective for improving agent performance.
- First-person prompt framing in AGENTS.md significantly improves adherence over imperative commands, with one commenter providing controlled test data showing the difference.
- Vercel's eval-driven approach is praised as a valuable methodology that more engineering organizations should adopt.
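The first-person-framing point is easiest to see side by side. The wording below is an illustrative sketch, not the commenter's actual test prompts:

```markdown
<!-- Imperative framing (reported to be followed less reliably) -->
Always read the relevant file in .next-docs/ before writing Next.js code.

<!-- First-person framing (reported to improve adherence) -->
I always read the relevant file in .next-docs/ before writing Next.js code.
```

The claim is that stating the behavior as something the agent already does, rather than as a command it may weigh against other instructions, measurably raises compliance.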
Opposed
- The comparison is misleading or tautological: of course always-in-context information outperforms optional tool-calling; skills and AGENTS.md serve fundamentally different purposes.
- The methodology is weak: small sample sizes without confidence intervals, no model disclosure, and screenshots that reportedly contradict the published numbers.
- This approach doesn't scale: as projects grow more complex, with many specialized workflows, you cannot fit everything into AGENTS.md without degrading model performance through context bloat.
- Skills are new and will improve as models are better trained on tool invocation through reinforcement learning; this is a temporary limitation, not a fundamental one.
- The article conflates skill design quality with the skill mechanism itself: poorly written skill descriptions explain the low invocation rates, not a flaw in the skills concept.