I use AI coding tools every day. Claude Code does most of my actual work. I've spent real time with the alternatives - Gemini, Codex, a few open-source models - and I keep drifting back to Claude, not out of any particular loyalty but because the others keep failing me in the same specific way.
The loop goes like this: a new model drops, tops the benchmarks, developers try it for a few days, complain, and quietly go back to Claude. This has happened three or four times now. The pattern is consistent enough that it deserves a better explanation than "hype cycle."
What benchmarks actually measure
When a new model tops the coding benchmarks, the benchmarks aren't lying. The model really does produce better code on isolated problems - higher accuracy on HumanEval, cleaner solutions on LeetCode-style tasks, all of it. HumanEval-era benchmarks give you a single function to write, graded on whether it passes unit tests. Newer ones like SWE-bench move closer to real development - actual GitHub issues, actual repos, patches scored against real test suites.
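To make "a single function, graded on whether it passes unit tests" concrete, here is a minimal sketch of that style of grading. Everything in it is hypothetical and made up for illustration: the `running_max` task, the hand-written stand-in for a model completion, and the `grade` helper. Real harnesses also sandbox the execution and apply timeouts, which this sketch skips.

```python
# Hypothetical task prompt: a signature and docstring the model must complete.
PROMPT = '''def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

# Stand-in for a model completion (hand-written for this sketch).
CANDIDATE = PROMPT + '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''


def check(fn):
    """Hidden unit tests for the hypothetical task."""
    assert fn([]) == []
    assert fn([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert fn([-2, -5]) == [-2, -2]


def grade(candidate_source, entry_point, tests):
    """Return True only if the candidate defines entry_point and passes every test."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # no sandboxing here; illustration only
        tests(namespace[entry_point])
    except Exception:
        return False
    return True


print(grade(CANDIDATE, "running_max", check))  # True
```

The score is a single pass/fail on one self-contained function. Nothing in it depends on which files were read, how an edit was applied, or whether the model stayed on task.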
Even SWE-bench is still a controlled run, though. Real coding work has a lot more going on at once: managing a conversation with the user, picking which files to read and which to skip, making targeted edits that don't break the surrounding code, hitting an unexpected error and deciding whether to ask for help or try something else, staying on task across twenty-plus steps without drifting into whatever looks shiny at step eleven. That kind of sustained, interactive workflow is very hard to capture in any single score.
The frame that finally made this click for me: Anthropic seems to have trained Claude heavily on the process of coding rather than just the output - the workflow itself, the sequence of small decisions a competent developer makes in a real codebase. Every major coding agent can read files, edit code, run terminal commands; Codex, Antigravity, Gemini CLI all ship these capabilities. The difference is how consistently the model behind them executes the workflow - reading the right files before changing anything, making targeted edits instead of blowing up the whole file, knowing when to act and when to stop and ask, staying on the original task instead of wandering into an unrequested refactor three files over.
All these tools can do it. Claude does it more reliably. Other models produce excellent code - sometimes, on a per-snippet basis, arguably better than Claude's. The gap doesn't show up in any individual output; it shows up across a full task. The other models loop more often, lose track of what they were doing mid-sequence, make edits that silently break surrounding context, and need more steering to stay on track. Not always, but often enough to change how much you can trust the tool to run unsupervised.
The difference isn't raw intelligence. It's process discipline. And that's harder to train for than most people realize.
Generating correct code is maybe forty percent of what an AI coding assistant needs to do well. The other sixty percent is everything around the code - editing files without corrupting what's next to them, reading the right files before making changes, completing a multi-step task without losing the thread, communicating clearly about what it did and what it found, knowing when to ask instead of assuming, and staying inside the actual scope instead of drifting into unrelated files. Every major coding agent attempts all of that. The question is how often they succeed at each one across a full task. In my daily work with Claude Code - building API endpoints, debugging production issues, refactoring components - it hits those consistently enough that I don't feel like I need to watch every step.
With other tools, I find myself intervening more. The code they generate is often just as good, but somewhere in the middle of a multi-file task something slips - a file gets partially overwritten, or the model wanders off and starts "improving" something I didn't ask about. That's the gap. It's less about raw capability and more about how often the tool stays on track without you having to course-correct.
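A back-of-the-envelope sketch of why small per-step differences add up over a long task. It assumes, purely for illustration, that each step succeeds independently with the same probability and that one failed step sinks the task. The per-step rates are invented numbers, not measurements of any tool.

```python
def task_success_rate(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    return per_step ** steps


# Hypothetical per-step reliabilities over a twenty-step task.
for per_step in (0.99, 0.97, 0.95):
    rate = task_success_rate(per_step, 20)
    print(f"{per_step:.2f} per step over 20 steps -> {rate:.2f} per task")
```

Under those assumptions the three rates come out to roughly 0.82, 0.54, and 0.36 per task. A few points of per-step reliability, invisible on any single output, become the difference between a tool you can leave alone and one you have to watch.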
Where things actually stand
To be fair to Google: Gemini writes excellent code. The underlying model is clearly very capable, and given a well-contained problem with a clear spec, it'll produce a good solution and sometimes a great one. The problem looks structural rather than intellectual. Google is at its core a search and general-use company, and their models are optimized across a massive surface of tasks - translation, summarization, multimodal understanding, general conversation. Agentic software development is a narrow, specific workflow that needs its own focused training: long sequences of tool calls, graceful recovery from errors mid-sequence, context held across many steps without drifting. That takes targeted reinforcement learning on exactly that scenario, not just scaling up the base model.
Anthropic published research on agent autonomy showing that software engineering accounts for nearly 50% of all agentic activity on their API. Roughly half of their agentic usage is coding. When that's your reality, you train for it - you optimize the tool use, the file editing, the multi-step workflows - because that's what your paying users are actually doing all day. Google doesn't have that same pressure: their models serve search, translation, multimodal understanding, and general chat, and coding is one use case among dozens. Anthropic's model lives or dies by how well it codes.
My honest assessment after using these tools for real work every day:
Claude is my primary tool. Claude Code handles everything from scaffolding new features to debugging tricky production issues, and the workflow is reliable enough that I can hand it tasks I don't want to babysit. Codex has gotten a lot better at agentic work - the gap has closed more than I expected over the past few months, and while it's not as reliable as Claude yet, it's worth watching. Gemini is capable on isolated tasks and I've had it produce genuinely impressive code for well-specified problems, but as an agentic system operating independently across multi-step tasks, it still struggles. The loops, the getting stuck, the needing constant redirection - those are real, consistent failure modes I hit regularly.
I've seen people try the "plan in one model, execute in another" approach - use Gemini for architectural thinking, then switch to Claude for the actual work. In practice it adds friction without adding value. You might as well just stay in Claude for the whole thing.
None of this is permanent. The benchmark leaders will keep changing, a new model will top the leaderboard next month, some people will switch, most will drift back. Google has the resources to fix the process-discipline problem if they decide it's the priority, and OpenAI is already taking agentic workflows seriously with Codex. What Anthropic figured out - training for the workflow, not just the output - is a real insight, but it's not a secret, and other labs will have to explicitly replicate that focus to close the gap. Bigger models alone won't do it. You can have the smartest model in the world, and it won't matter if it can't edit a file without breaking the one next to it.
The benchmarks will tell you one thing. The developers who use these tools every day will tell you another. Usually you should listen to the developers.