Eleven at night. I'm watching Claude Code work on my SharePoint daemon — the Python distributed job processor I've been wiring up for the past few weeks, ETag locks and worker identity logging and all the boring infra around it. I ask it to find every place a job can transition into the failed_permanent state.
It doesn't pull anything from a vector index. There is no vector index. It runs grep -r failed_permanent --include="*.py". Gets seventeen hits. Reads four of them. Runs another grep for the state transition helpers. Reads the state machine. Comes back with the answer in maybe thirty seconds.
I had to think for a second to register what hadn't happened. The cold-start "let me embed your repo" prompt. The "indexing complete (10,432 files)" toast. None of it. Claude Code just grepped my code like a slightly-too-polite junior engineer with admin access to my filesystem.
And last week's post was, in retrospect, only half the story.
That post argued the coding agents had quietly figured out industrial-strength retrieval (chunking, hybrid retrieval, reranking, knowledge graphs, the works) and that the playbook generalizes well outside code. I think that's still right. What I left out is the part where they don't agree on the spine of the thing. Three of them publicly quit RAG. Three of them doubled down on it. Both halves are shipping winning products. The split is real and the reasons are interesting, so this is the follow-up.
The breakup nobody covered
Boris Cherny, who runs Claude Code at Anthropic, posted this on X at the end of January:
Early versions of Claude Code used RAG + a local vector db, but we found pretty quickly that agentic search generally works better. It is also simpler and doesn't have the same issues around security, privacy, staleness, and reliability.1
A million views, twenty-six hundred bookmarks. On the Latent Space podcast he was less diplomatic: agentic search outperformed RAG "by a lot," and the alternatives they tried (local vector DBs, recursive model-based indexing, the obvious moves) all lost.2 The Anthropic engineer who answered on Hacker News said the gap was "surprising." Internal A/B numbers, not public.
You could read that as one team's preference. Except the same week, Nick Pash at Cline shipped a blog post titled "Why Cline Doesn't Index Your Codebase" with the line that immediately became a meme inside the coding-agent crowd: RAG is a mind virus.3 His argument, abbreviated: when you chunk code for embeddings, you tear its logic apart. A function call ends up in chunk 47, its definition in chunk 892, and the model never sees both at once.
Then Sourcegraph (fifteen years of code-intelligence infrastructure, the company that invented the modern SCIP code-graph format) publicly abandoned embeddings for Cody Enterprise at GA.4 Their engineering blog, paraphrased: we'd been embedding code since beta, we shipped Cody Enterprise, and the embeddings layer is going to the bin. Reasons listed: didn't want to send customer code to OpenAI for embedding, vector DB maintenance was punishing past 100k repos per customer, multi-repo context never really worked. They went back to BM25 plus structural search over their existing SCIP graph and called it a win.
Three teams ran the experiment independently. Three teams said the same thing in public.
"By a lot" is not a benchmark. But three teams publishing the same negative result in the same quarter is at least a signal.
The reaction in the other camp wasn't to pivot. It was to double down. Cursor shipped a deeply engineered RAG pipeline with Merkle-tree-synced chunks, custom code-trained embeddings, and simhash-based index sharing between teammates. GitHub's Copilot Chat sits on top of Blackbird, a Rust-based code search engine that indexes 115 TB of code across 53 billion source files at GitHub-scale.5 Windsurf's whole product positioning was "Cascade reads from the M-Query index before it writes a line." JetBrains' Junie sits on twenty years of compiler-grade PSI tooling and wires their existing static analysis into the AI loop.
Same problem. Two architectures. Both shipping. Both winning, depending on the user.
Why code is the worst possible RAG target
It helps to remember that code, as a body of text to retrieve over, is genuinely awful for naive embeddings. Three reasons, all widely documented at this point.
The first is structural. A function and its caller can live in different files. Chunk those files, embed the chunks, and the embeddings sit nowhere near each other in vector space; they share neither vocabulary nor surface patterns. A retriever returns the chunk with def process_payment(...) and confidently misses the one fifty files away that actually calls it with the bad argument. The retrieval was technically correct. The model has half the puzzle.
The second is semantic. In concurrent code, throttle and Semaphore point at the same behavior: one names the intent, the other the mechanism. An embedding model trained on natural language doesn't connect them. When a user asks "where do we throttle requests" and the codebase uses Semaphore-based limiters, RAG returns near-matches for the word throttle from comments and string literals while completely missing the actual implementation. Cline's blog post is full of these.
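To make the mismatch concrete, here's a minimal, hypothetical limiter of the kind that trips this up. The names are mine, not from any codebase discussed here; the point is that the code is unambiguously a request throttle and never says so.

```python
import asyncio

# Hypothetical limiter: this is "throttling" in every sense that matters,
# but the word never appears, so neither a keyword match nor an embedding
# of the user's phrasing has much to latch onto.
class RequestLimiter:
    def __init__(self, max_concurrent: int = 10):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro):
        # Blocks once max_concurrent requests are already in flight.
        async with self._sem:
            return await coro
```

An agent that greps for Semaphore, limiter, or asyncio after a failed first search finds this in one extra tool call; a single-shot retriever doesn't get a second try.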
The third is staleness. Code changes every git push. To keep a vector index current you have to detect what changed, re-chunk it, re-embed it, reconcile with the existing index, and do all of that fast enough that the index isn't lying by the time the user asks a question. Cursor's Merkle-tree-based incremental indexing is impressive precisely because this is the hard problem they've solved. But it's a problem that doesn't exist if you skip the index entirely.
Boris Cherny got the agentic-search idea from watching Instagram engineers use grep. Not from a paper. Click-to-definition was broken in Meta's in-house editor for a long stretch — Boris later led the Dev Infra team that fixed it — and in the meantime, the engineers around him just used grep to navigate huge codebases. The thing that scaled there became the thing that scaled in Claude Code. The lineage of one of the more contested design decisions in coding agents traces back to a broken click-to-definition in someone else's editor.2
There's a fourth reason that doesn't get mentioned as often. Code is your most sensitive IP. Sending it to a third-party embedding API, even if the API drops it after embedding, leaves a vector that, per Cursor's own security docs, can sometimes be inverted back to the original code by published embedding-inversion attacks.6 Cursor flags this themselves and mitigates it with privacy mode and path obfuscation. The risk is small, but it isn't zero, and for an enterprise security team it doesn't have to be large to be disqualifying. Anthropic apparently decided their own code was too sensitive to put through their own RAG pipeline, which, if true, is the loudest possible vote against the architecture.
The two architectures, side by side
Here's the mechanical difference. Take a query like "add rate limiting to the API endpoints" against a 50-file Flask app.
The agentic side is a loop with the model in it. Maybe nine tool calls. Thirty seconds of wall time. The model decides each next step, so token consumption scales with task complexity, not codebase size. A five-file repo and a fifty-thousand-file repo cost roughly the same if the answer is in five files.
The RAG side is a pipeline. One retrieval round-trip, two hundred to five hundred milliseconds, then one LLM call. The model never has to decide where to look — the index did that. Token cost is constant per query regardless of repo size. The catch is that if the top-k didn't contain the answer, the model has no fallback except to ask the user for more context, or to issue a grep tool call anyway, at which point you've paid for both architectures.
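For readers who think in code, here's a deliberately tiny sketch of the two spines. Every callable is a stand-in I'm assuming (an LLM client, an embedder, a vector store, a tool runner); neither function is any vendor's real implementation.

```python
from typing import Callable, Optional, Tuple

# Stand-ins only: llm_step, run_tool, embed, search, and generate are assumed
# interfaces, not real APIs from any of the products discussed here.

def agentic_answer(question: str,
                   llm_step: Callable[[list], Tuple[str, Optional[dict]]],
                   run_tool: Callable[[dict], str],
                   max_steps: int = 30) -> str:
    """Model-in-the-loop: the LLM picks each next lookup (grep, read, glob, ...).
    Cost scales with how many steps the task needs, not with corpus size."""
    transcript = [("user", question)]
    for _ in range(max_steps):
        text, tool_call = llm_step(transcript)     # (answer, None) means it's done
        if tool_call is None:
            return text
        transcript.append(("tool", run_tool(tool_call)))
    return llm_step(transcript)[0]                 # out of budget: answer anyway

def rag_answer(question: str,
               embed: Callable[[str], list],
               search: Callable[[list, int], list],
               generate: Callable[[str], str],
               top_k: int = 20) -> str:
    """Index-in-front: one retrieval round-trip, then one generation call.
    Cost is constant per query; accuracy depends on top-k containing the answer."""
    chunks = search(embed(question), top_k)        # precomputed vector index
    context = "\n\n".join(chunks)
    return generate(f"{context}\n\nQuestion: {question}")
```

The interesting engineering all lives in what these stubs hide, but the control flow really is this different.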
Neither side is naive about this. Cursor and Copilot both layer agentic tools on top of their indexes; their pipelines aren't "one shot of vector search and done." Claude Code and Cline have considered and rejected precomputed indexes in favor of stronger grep. The split is about the spine of the thing: what runs first, what makes the routing decision, what the architecture assumes the bottleneck is. Not the leaf operations.
Inside the agentic camp
Five tools, roughly the same approach.
Claude Code runs zero pre-indexing. The Anthropic team exposes about fifteen tools (Read, Grep as a ripgrep wrapper, Glob, Bash, Task for sub-agents, WebFetch, Edit) and lets the model orchestrate them in a ReAct-style loop. Hierarchical CLAUDE.md files load from four levels (managed → user → project → directory-scoped), giving the model context without bloating tokens. The clever bit is the Task tool: sub-agents have their own context windows, and only their summary returns to the parent. So a research sub-agent can read twenty files and hand back five hundred tokens. On SWE-bench Verified the latest numbers put Claude Code at 80.8%, top of the leaderboard as of May 2026.7
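The sub-agent pattern is worth seeing as code, because it's what keeps the agentic loop from drowning in its own reading. This is a sketch of the pattern as publicly described, with names of my own; it isn't Anthropic's implementation.

```python
from typing import Callable

def research_subtask(task: str,
                     llm: Callable[..., str],
                     read_file: Callable[[str], str],
                     files: list,
                     summary_budget: int = 500) -> str:
    """Child agent gets a fresh transcript, burns its own context window on
    reading, and only a short summary crosses back into the parent's window."""
    child_transcript = [("user", task)]
    for path in files:
        child_transcript.append(("tool", read_file(path)))  # tokens spent here stay here
    return llm(child_transcript,
               instruction=f"Summarize the findings in under {summary_budget} tokens.")

# The parent only ever sees the return value, so twenty files of reading costs it
# a few hundred tokens instead of the files' full contents.
```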
Cline is the purest version of the philosophy and rejects RAG most loudly. Plan mode separates exploration from execution. The whole product surface fits inside VS Code as an extension. Sub-agents are explicitly read-only.
OpenAI Codex (the 2025 cloud agent and CLI, not the 2021 model) runs each task in an isolated codex-universal container, clones the repo in, and explores via ripgrep, fuzzy filename search, AST grep, and bash.8 AGENTS.md files do what CLAUDE.md files do. GPT-5.2-Codex on SWE-bench is at 72.8%. The cold start is real (twenty to thirty seconds per task to spin up the container), but for asynchronous batch work it's a fine tradeoff.
Sourcegraph Amp is the successor to Cody Free/Pro, which Sourcegraph sunset in July 2025. Amp uses model-driven tool exploration plus a Librarian sub-agent that can hit Sourcegraph's cross-repo SCIP graph for "search and read all public code on GitHub as well as your private repositories." Architecturally it's closer to Claude Code than to old Cody.
Antigravity is the genuinely surprising case, and it gets its own subsection because the story is too weird to bury here.
Google paid $2.4 billion for the Windsurf team without buying the company. On July 14, 2025 — the day the OpenAI exclusivity period expired — Google executed what TechCrunch called a "reverse acquihire." Half the money went to investors (~$1.2B), half went to compensation packages for the forty Windsurf engineers Google hired, including CEO Varun Mohan and co-founder Douglas Chen.9 The Windsurf company kept its name, IP, and roughly 250 remaining employees. Cognition AI bought the rest of it a few days later. Then Google's new ex-Windsurf team forked VS Code and shipped Antigravity, which, improbably, doesn't appear to use the M-Query RAG pipeline Windsurf was famous for. The same engineers built two opposite architectures inside the same year.
Google Antigravity is what happens when you give the M-Query team a 2M-token context window and ask them to start over. The public docs describe filesystem tools, @-mentions, Knowledge Items extracted by a Knowledge Subagent at the end of every conversation, and Skills loaded via progressive disclosure. No embedding index. No semantic graph. No M-Query.10 The bet seems to be: with Gemini 3 Pro's 1M-2M window, you don't need precomputed retrieval. Brute-force loading plus agent tool-use does the job. Whether it actually does at production scale on million-line monorepos is the open question. It's a fascinating natural experiment, because the people running it are the same people who shipped the heaviest RAG-for-code stack in the industry six months ago.
Inside the RAG camp
Three tools, three quite different architectures, all heavily engineered.
Cursor is the canonical RAG-for-code design and the one whose internals are most publicly documented. The pipeline:
- Client builds a Merkle tree of SHA-256 file hashes locally. For a 50K-file repo this is about 3.2 MB.
- Files are chunked via tree-sitter at function and class boundaries, not fixed character windows.
- Chunks ship to Cursor's server, get embedded by a code-trained proprietary model (with OpenAI fallback), and land in Turbopuffer on GCP. Each path segment is encrypted client-side with a 6-byte nonce; the server only sees vectors and obfuscated metadata.
- On query: embed the query, NN search Turbopuffer, return obfuscated paths and line ranges, client reads the actual code locally, ships only those snippets to the LLM. Plaintext never persists server-side in privacy mode.
- Incremental re-indexing every ~5 minutes via Merkle diff. Only branches with changed hashes get re-embedded (there's a minimal sketch of the idea right after this list).
- New teammate joins → upload a simhash summary of the Merkle tree → server searches existing team simhashes → if any match above threshold, the existing index is reused. Cursor reports clones of the same codebase average 92% similarity across users in an org.11
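That incremental step, heavily simplified and with names of my own (Cursor's real implementation rolls the hashes up into a Merkle tree so it can skip whole unchanged subtrees instead of comparing every file):

```python
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Content hash per tracked file; the real thing rolls these up into a Merkle tree."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*.py") if p.is_file()
    }

def needs_reembedding(old: dict, new: dict) -> list:
    """Files that are new or modified since the last sync; only these get
    re-chunked and re-embedded. Deletions would be handled separately."""
    return [path for path, digest in new.items() if old.get(path) != digest]
```

Run snapshot() every few minutes, diff against the stored copy, and the re-embedding bill tracks churn rather than repo size.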
Cursor's own published numbers on Cursor Context Bench: their semantic search delivers 12.5% higher accuracy answering questions, with a range of 6.5%-23.5% depending on the model. The baseline was keyword search, not modern agentic search with ripgrep + tree-sitter; that head-to-head isn't in the public record.
GitHub Copilot sits on top of Blackbird, the most impressive infrastructure piece in the AI coding stack. Written in Rust. Sharded by Git blob OID rather than by repo (deduplicates files, evenly distributes load). Custom ngram inverted indexes — not trigrams, which are too unselective on code keywords like for. The disclosed cluster runs on 5184 vCPUs, 40 TB of RAM, 1.25 PB of storage, ingests roughly 120,000 documents per second, and serves around 640 queries per second at ~100ms p99 per shard.12 It indexes 53 billion source files across 45 million repos.
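A toy version of the selectivity problem, nothing like Blackbird's actual structures: index every trigram to the set of documents containing it, answer a query by intersecting posting lists, and watch what happens when a query trigram is a language keyword.

```python
from collections import defaultdict

def trigrams(text: str) -> set:
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs: dict) -> dict:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def candidates(index: dict, query: str) -> set:
    """Docs that could contain the query: intersection of its trigrams' posting lists."""
    postings = [index.get(g, set()) for g in trigrams(query)]
    return set.intersection(*postings) if postings else set()

# The problem: grams like "for", "def", and "ret" occur in nearly every source
# file, so their posting lists are enormous and filter almost nothing. Longer,
# rarer ngrams are what make the index selective on code.
```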
That's the lexical layer. On top sits a separate remote semantic-search index, embedded by a proprietary code-tuned transformer GitHub doesn't publish details on. The semantic index updates within seconds of a push (down from five minutes in early 2025). Copilot's coding agent (March 2026) layers tool-use on top of both. So when you ask Copilot Chat a question, a small GPT-4o-mini classifier routes the query to some mix of Blackbird text search, the semantic index, LSP symbol lookup, and brute-force workspace chunking (only if the workspace is small).
Windsurf went the heaviest on RAG for code. Cascade's retrieval pipeline runs language-specific AST parsers at project open to build a semantic graph of symbols, imports, type references, and call relationships. On top of that sits M-Query, their proprietary retrieval method layered on cosine similarity. On top of that sits real-time flow awareness: Cascade tracks edits, terminal commands, clipboard activity, and navigation history within the session as additional context signals. Plus auto-generated memories that persist across sessions.
The downside, per developer reports, is that on files exceeding 800-1000 lines, Cascade's local attention degrades and the agent starts hallucinating in the middle of long files. The architecture stuffs context aggressively rather than refactoring, so big files become a known failure mode.13
The numbers, with the right asterisks
SWE-bench Verified is the closest thing the field has to a fair benchmark: real GitHub issues, ground-truth fixes, sandboxed evaluation. The recent scores come from the official leaderboard and vendor announcements.
The top of the leaderboard leans agentic. Claude Code and Codex both run no pre-index. Cursor sits in the middle and gets there with a RAG pipeline plus a strong model. Copilot and Windsurf sit lower, but neither of those scores comes from a head-to-head with the same backing model, so the comparison is messier than it looks.
What the scores understate: workplace adoption tells a different story. GitHub Copilot has about 29% workplace adoption against 18% each for Claude Code and Cursor.14 Copilot doesn't lead on capability and it doesn't have to. It leads on procurement, on price ($10 a month versus $20), and on the fact that the security team already approved it. SWE-bench measures capability; adoption is capability mixed with friction-to-deploy, and Copilot loses the first while winning the second.
The cost math nobody wants to publish
The argument for RAG is fundamentally about cost and latency. An embedding lookup returns in 10ms. An agentic loop pays for five to thirty tool-call round-trips per query, each one a separate LLM inference. At Cursor's scale (1 billion+ accepted lines of code per day), that delta in inference cost is the whole P&L.
For an individual developer the picture inverts. One Claude Max plan user measured agentic exploration consuming about 13% of their plan's monthly inference budget. Meaningful but not crushing. For an enterprise paying per-token, the math depends on traffic patterns: low-volume technical work tilts toward agentic (cheap when idle, accurate when running); high-volume inline completion tilts toward RAG (cheap per query, fits the sub-100ms autocomplete budget).
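To make the shape of the tradeoff concrete, here's a back-of-envelope sketch. Every price and token count below is an assumption of mine for illustration; none of them come from the vendors, and the transcript-growth model is deliberately crude.

```python
# Assumed prices and token counts, for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.003    # frontier model, assumed
PRICE_PER_1K_EMBED_TOKENS = 0.0001   # embedding model, assumed

def agentic_query_cost(tool_calls: int = 10, tokens_per_step: int = 8_000) -> float:
    """Each loop iteration re-sends the growing transcript, so cost is roughly
    quadratic in the number of steps the task needs."""
    return sum(tokens_per_step * step / 1000 * PRICE_PER_1K_INPUT_TOKENS
               for step in range(1, tool_calls + 1))

def rag_query_cost(query_tokens: int = 50, context_tokens: int = 6_000) -> float:
    """One embedding call plus one generation call, regardless of repo size."""
    return (query_tokens / 1000 * PRICE_PER_1K_EMBED_TOKENS
            + context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS)

# With these made-up numbers: roughly $1.30 for the agentic query versus about
# $0.02 for the RAG query. Multiply by millions of queries a day and the RAG
# camp's position writes itself.
```

Prompt caching, smaller routing models, and transcript trimming all push the agentic number down in practice, which is part of why nobody publishes the crossover.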
Boris is on record that even Anthropic's own codebase was too sensitive to upload to a third-party index.15 If Anthropic won't trust the RAG security model with their own code, that's the loudest signal in the room, and not the only one: Sourcegraph independently said the same thing, citing customer concerns about sending code to OpenAI for embedding. A vendor declining to run its own code through the architecture it once shipped is hard to spin.
What this generalizes to
The same fight is about to land in every adjacent domain.
Research papers, legal documents, internal company docs, medical records: every domain where a coding agent's playbook would apply now has to make the same architectural call. Do you precompute an index over the entire corpus, accept the staleness and the security exposure, and pay the indexing cost up front? Or do you keep the corpus on disk, expose it to the agent through tool calls, and pay the latency cost at query time?
The previous post argued the layers generalize. Chunking, hybrid retrieval, reranking, knowledge graphs, RAPTOR — they do. What I missed is that the spine generalizes too. The agentic-versus-RAG fight is going to play out in legal AI, in scientific lit-review tools, in enterprise document search, in the next generation of personal-document assistants. The same arguments (staleness, security, salience-not-capacity, the agent-loop-substitutes-for-index gambit) apply unchanged.
Which side wins in those domains is probably going to depend on the same variables: how often the corpus changes (frequent → agentic, rare → RAG), how sensitive it is (sensitive → agentic, public → either), how big it is (huge → RAG still wins on cost), how much latency the user tolerates (sub-second → RAG, multi-second-OK → agentic). None of these are mysterious. The vendors are going to converge on hybrids because the variables aren't independent.
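Written as a blunt heuristic, the decision looks something like the sketch below. This is my own framing of the variables above, not anyone's actual router, and the thresholds are placeholders rather than measurements.

```python
from dataclasses import dataclass

@dataclass
class Corpus:
    doc_count: int
    changes_per_day: int
    sensitive: bool
    latency_budget_ms: int

def pick_spine(c: Corpus) -> str:
    if c.latency_budget_ms < 1_000:
        return "rag"        # sub-second answers need a precomputed index
    if c.sensitive:
        return "agentic"    # nothing leaves disk except what the agent reads
    if c.doc_count > 1_000_000:
        return "rag"        # brute-force exploration stops being economical
    if c.changes_per_day > 100:
        return "agentic"    # the index would spend its life going stale
    return "hybrid"         # where the vendors are converging anyway
```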
Sourcegraph's SCIP-based code graph predates LLMs by years. They've been doing compiler-grade cross-repo code intelligence since roughly 2013. When Cody Enterprise dropped embeddings, they didn't have to build a replacement retrieval layer. They just pointed the LLM at the symbol graph they were already indexing for go-to-definition. The lesson buried in there: a coding agent that ships on top of fifteen years of structural indexing has a different cost curve than one that has to build the index from scratch. The best place to start an embedding pipeline is to not need one.
What I'm doing about it
Two specific changes to how I'm building the cross-document retrieval stack I described last week.
First: I'm dropping the embedding-first design for the system that sits on top of the document substrate, at least for corpora under a few thousand pages. Most user queries don't actually require semantic retrieval; they require navigation. Where is the table that disagrees with this claim? Which section of this paper defines the term? Which footnote points at which other paper? Those are structural questions. An agentic loop with grep, structure-aware tools, and access to the document tree answers them more reliably than any vector index I've gotten working. Embeddings stay in the stack, but as a tool the agent can call when the query is genuinely about concepts rather than locations. Not as the spine.
Second: I'm investing in better tools, not better indexes. The agentic-camp lesson is that a strong model with good primitives (a real Read, a real Grep, a real Glob, a way to spawn read-only sub-agents) outperforms a clever index with bad primitives. Most of the failure cases I hit last week were tool-quality problems, not retrieval-quality problems. The reranker wasn't what was broken. The thing that lets the agent ask "what's near this passage" was.
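For what it's worth, the tool surface I'm planning to hand the agent looks roughly like this. The corpus layout and the document-tree shape are assumptions about my own system, not something borrowed from the tools above.

```python
import re
import subprocess

def grep_corpus(pattern: str, root: str = "corpus/") -> list:
    """Line-level matches across the extracted text, via ripgrep."""
    out = subprocess.run(["rg", "--no-heading", "-n", pattern, root],
                         capture_output=True, text=True)
    return out.stdout.splitlines()

def find_sections(doc_tree: dict, term: str) -> list:
    """Which sections even mention the term: navigation, not semantics."""
    return [heading for heading, body in doc_tree.items()
            if re.search(term, body, re.IGNORECASE)]

def neighbors(doc_tree: dict, heading: str, window: int = 1) -> list:
    """'What's near this passage': adjacent sections in document order."""
    order = list(doc_tree)
    i = order.index(heading)
    return order[max(0, i - window): i + window + 1]
```

Three dumb primitives, but between them they cover the structural questions above without an index ever existing.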
I'm shipping both changes this week and seeing what survives contact with three angry PDFs.
If the agentic path holds up, the cross-document substrate gets a lot simpler than what I was planning. If it doesn't, I'll know within a few sessions, and the indexing infrastructure I already have isn't going anywhere.
The thing I genuinely don't know yet is where the crossover happens. At what corpus size and edit frequency does precomputed indexing start beating agentic exploration on accuracy, not just on cost? Cursor and Copilot have private numbers on that. Anthropic and Cline have the opposite private numbers. Nobody's publishing the crossover.
That number is the question of the next two years.
Footnotes
1. Boris Cherny, X post, January 31, 2026. https://x.com/bcherny/status/2017824286489383315. The post hit 1.1M views and roughly 5K bookmarks.
2. Latent Space podcast interview with Boris Cherny, May 2025. The "by a lot" framing is from Pash on X (https://x.com/pashmerepat/status/1926717705660375463), summarizing the same interview. Boris also discusses the Instagram-grep origin story on the Pragmatic Engineer podcast, March 2026.
3. Nick Pash, "Why Cline Doesn't Index Your Codebase," Cline blog. The "RAG is a mind virus" line is from his follow-up commentary.
4. Sourcegraph engineering blog, "Deprecating embeddings for Cody Enterprise GA." Full quote: "Embeddings have been at the backbone of Cody's retrieval stack since we launched the product in beta, and now that Cody Enterprise is generally available, we're leaving them behind (for now)." Cody Free/Pro were discontinued July 23, 2025; the successor is Amp.
5. GitHub engineering blog, "The technology behind GitHub's new code search." Blackbird's cluster size, ingest rate, and query rate are all from their published numbers.
6. Cursor security documentation. The warning reads: "academic work has shown that reversing embeddings is possible in some cases." Cursor mitigates with privacy mode and client-side path obfuscation, but explicitly flags the risk.
7. SWE-bench Verified leaderboard, May 2026. Claude Opus 4.6 + Claude Code: 80.8%. Numbers from official vendor announcements and the swebench.com leaderboard.
8. "Reverse engineering OpenAI Codex's retrieval architecture," preprints.org/manuscript/202510.0924, October 2025. Codex's retrieval: ripgrep + fuzzy filename + AST grep + bash. No embeddings index.
9. TechCrunch, "More details emerge on how Windsurf's VCs and founders got paid from the Google deal," August 1, 2025. The $2.4B was split evenly between investors (~$1.2B) and compensation packages for ~40 hired engineers. Cognition acquired the remaining company days later.
10. Google Antigravity official docs and codelab. Antigravity's documented context sources: filesystem tools, @-mentions, Knowledge Items extracted by a Knowledge Subagent at the end of a conversation, Skills loaded via progressive disclosure, and Gemini 3 Pro's 1M-2M token window. No embedding index documented.
11. Cursor engineering blog, "Securely indexing large codebases," May 2025. The 92% cross-clone similarity number, the Turbopuffer storage detail, and the simhash team-sharing mechanism are all from Cursor's own published numbers.
12. GitHub engineering blog, "The technology behind GitHub's new code search." Blackbird's full statistics are public.
13. r/Codeium and developer threads documenting Windsurf's degradation on files over 800-1000 lines. Widely reproduced; Windsurf has acknowledged it.
14. 2026 developer usage reports (Uvik, NxCode). Workplace adoption: Copilot 29%, Claude Code 18%, Cursor 18%, Codex 3%. Awareness: Copilot 76%, Cursor 69%, Claude Code 57%, Codex 27%.
15. Boris Cherny on the Latent Space podcast, May 2025: Anthropic's own codebase was too sensitive to upload to a third-party RAG index. If true, this is the loudest possible signal against the RAG security model.