Smarter Models Hallucinate More Tools, Vercel Got Owned Through a Third-Party OAuth Token, and a Plumber Bot Is Worth a Billion Dollars | ai-best.deals

title: "Smarter Models Hallucinate More Tools, Vercel Got Owned Through a Third-Party OAuth Token, and a Plumber Bot Is Worth a Billion Dollars"
description: "From an ICLR paper that dismantles AI procurement logic to a supply chain attack that traveled through a SaaS vendor's Google Workspace tokens, today's Hacker News front page had no shortage of things that should make you reconsider some assumptions. Alex Rivera reads the threads so you don't have to."
publishedAt: "2026-04-29"
author: "Alex Rivera"
category: "news"
tags: ["hacker-news", "ai-agents", "security", "startups", "llms", "supply-chain"]

The paper out of ICLR 2026 that hit HN's front page this morning is the kind of result that makes you put your coffee down and stare at the wall for a minute. A team of researchers found that as you train a language model to reason harder — specifically, as you push it with reinforcement learning to improve performance on agentic tasks — its rate of tool hallucination goes up in direct proportion to its capability gains. Not down. Up. The smarter the agent gets at completing tasks, the more confidently it invents tool calls that don't exist.

The paper is called "The Reasoning Trap," and the name earns its keep. The core finding is not subtle: reinforcement learning targeting task performance does not simultaneously teach restraint. The model doesn't learn to be uncertain about tools it doesn't have access to. It learns to reach for more tools, because in the training distribution, reaching worked. When you deploy that model in a real environment with a real, bounded toolset, you get a very capable, very confident agent that will casually fabricate calls to endpoints that have never existed — and do it with the same fluency it uses when calling real ones.

HN commenters responded with that particular brand of unsurprise that means they saw the problem coming but are still irritated it showed up in their production logs. The most upvoted response essentially said: this is Goodhart's Law with a reward signal. You optimize for "completes tasks," you get an agent that completes tasks by whatever means the policy space allows — including inventing the means. Several engineers in the thread noted that prompt engineering and DPO help at the margins, but neither closes the gap. The paper's authors are honest about this, framing it as a "fundamental reliability-capability trade-off," which in researcher-speak means "we have demonstrated the problem exists and we do not yet have a solution."

What makes this commercially significant is how directly it contradicts the pitch logic that has been moving AI agent contracts for the past eighteen months. The standard sales narrative says: deploy our newest, most reasoning-capable model and your automation becomes more reliable. Reliability and capability improve together. You get both. The ICLR paper says you get a tradeoff you weren't told about and probably aren't measuring. If you're running agents in production where the available toolset is constrained — which is every real deployment — you should be actively testing tool hallucination as a regression metric every time you upgrade a model version. Not assuming it goes away because the benchmark numbers improved. The benchmark numbers are measuring a different thing than what you care about.
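If "test tool hallucination as a regression metric" sounds abstract, here is roughly what I mean. This is a minimal sketch, not the paper's methodology: the `ToolCall` record, the tool names, and the one-point tolerance are all assumptions made up for illustration. The idea is to replay the same evaluation episodes against the old and new model, count the calls that name a tool outside your deployed toolset, and refuse the upgrade if that rate climbs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation pulled from an agent trace (hypothetical shape)."""
    episode_id: str
    tool_name: str


def hallucination_rate(calls: list[ToolCall], allowed_tools: set[str]) -> float:
    """Fraction of tool calls that name a tool outside the deployed toolset."""
    if not calls:
        return 0.0
    invented = sum(1 for call in calls if call.tool_name not in allowed_tools)
    return invented / len(calls)


def check_upgrade(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """Return True if the candidate's rate stays within tolerance of the baseline."""
    return candidate <= baseline + tolerance


# Hypothetical toolset and traces, invented for this example.
ALLOWED = {"search_orders", "issue_refund", "send_email"}

old_model_calls = [
    ToolCall("e1", "search_orders"),
    ToolCall("e1", "issue_refund"),
    ToolCall("e2", "send_email"),
    ToolCall("e2", "search_orders"),
]
new_model_calls = [
    ToolCall("e1", "search_orders"),
    ToolCall("e1", "lookup_customer_ltv"),  # not in ALLOWED: an invented tool
    ToolCall("e2", "send_email"),
    ToolCall("e2", "escalate_to_manager"),  # not in ALLOWED: an invented tool
]

baseline_rate = hallucination_rate(old_model_calls, ALLOWED)
candidate_rate = hallucination_rate(new_model_calls, ALLOWED)
upgrade_ok = check_upgrade(baseline_rate, candidate_rate)
```

The useful property is that this number is independent of task success. A model can finish more episodes and still fail this check, which is exactly the case the benchmark numbers hide.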

The Vercel story is a different kind of uncomfortable, because it's the kind that has been structurally inevitable for a long time. In April, Vercel confirmed a supply chain attack that started at a third-party AI productivity tool called Context.ai, traveled through an employee's Google Workspace OAuth tokens, and eventually reached Vercel's internal systems, where customer project environment variables — including, in some cases, production API keys — were exposed. A threat actor using the ShinyHunters name is claiming to sell the data on BreachForums for two million dollars. Vercel says the impact was limited to a subset of customers and non-sensitive variables. The threat actor says otherwise. We've seen this script before.

The interesting part isn't the breach itself. It's what the HN thread actually said about it. A comment that accumulated several hundred upvotes made a pointed observation: the engineering industry ritually tests job candidates on their understanding of single responsibility, separation of concerns, and minimal privilege — and then builds a business model that funnels authentication, build orchestration, secret management, runtime execution, and deployment provenance through a single CLI owned by a single company. The concentration is the risk model. It's not a bug in one vendor's implementation. It's load-bearing architecture.

This attack did not require a zero-day. It required a Lumma Stealer infection at Context.ai sometime in early 2026: one employee opened something they shouldn't have, a Google Workspace OAuth token was harvested, and that token was used to gain a foothold in Vercel's internal systems. No sophisticated exploit. Just the OAuth trust chain functioning exactly as designed, in a context where one trusted node had already been quietly compromised for weeks before anyone noticed. The supply chain in "supply chain attack" here isn't software packages. It's the human chain of SaaS tools that employees use at work and organizations have largely stopped auditing.

What you should actually do with this: rotate any environment variables that have ever lived in Vercel project settings, especially anything that decrypts to plaintext in their dashboard. Do this even if you weren't in the notified subset — limited subsets have a consistent track record of expanding as investigations continue. More importantly, pull up your Google Workspace OAuth app access page right now and look at every third-party tool that has been granted access over the past two years. The ones you installed and forgot about are exactly the ones you need to look at. Context.ai won't be the last one.
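If the grant list is too long to eyeball, the triage can be scripted. The sketch below assumes you have already pulled the grants into records. The `Grant` shape, the app names, and the 90-day staleness cutoff are all hypothetical, and which scopes count as "broad" is a policy choice rather than anything Google defines. It flags grants that hold a wide scope, grants nobody has used recently, or both.

```python
from dataclasses import dataclass
from datetime import date

# Scopes treated as high-risk for this sketch; extend to match your own policy.
BROAD_SCOPES = {
    "https://mail.google.com/",
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/admin.directory.user",
}


@dataclass(frozen=True)
class Grant:
    """One third-party OAuth grant (hypothetical record, e.g. parsed from an export)."""
    app_name: str
    user: str
    scopes: frozenset
    last_used: date


def flag_grants(grants, today, stale_after_days=90):
    """Return (grant, reasons) pairs for grants that are broad, stale, or both."""
    flagged = []
    for grant in grants:
        reasons = []
        if grant.scopes & BROAD_SCOPES:
            reasons.append("broad scope")
        if (today - grant.last_used).days > stale_after_days:
            reasons.append("stale")
        if reasons:
            flagged.append((grant, reasons))
    return flagged


# Invented example data; none of these apps are real.
grants = [
    Grant("notes-helper", "dev1@example.com",
          frozenset({"https://www.googleapis.com/auth/drive"}), date(2025, 11, 3)),
    Grant("calendar-widget", "dev2@example.com",
          frozenset({"https://www.googleapis.com/auth/userinfo.email"}), date(2026, 4, 25)),
    Grant("inbox-summarizer", "dev3@example.com",
          frozenset({"https://mail.google.com/"}), date(2026, 4, 20)),
]

report = flag_grants(grants, today=date(2026, 4, 29))
summary = [(grant.app_name, reasons) for grant, reasons in report]
```

Anything that shows up as both broad and stale is the forgotten install described above, and is the first candidate for revocation.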

While we're on the subject of AI infrastructure as attack surface: CVE-2026-33626 in LMDeploy was disclosed and exploited within thirteen hours. Thirteen. Sysdig's honeypot caught the first exploitation attempt twelve hours and thirty-one minutes after the GitHub advisory went public. The vulnerability is an SSRF in the load_image() function — the vision-language image loader fetches arbitrary URLs without validating whether those URLs resolve to internal resources. The attacker used it to port-scan the internal network behind the model server, probe the AWS Instance Metadata Service, hit Redis and MySQL ports, and exfiltrate data out-of-band through DNS. In a single eight-minute session.

This is not a novel vulnerability class. Server-Side Request Forgery has been biting web developers for well over a decade. The reason it keeps showing up in AI inference tools is that these tools were built to run on a researcher's local machine and then shipped as production infrastructure without going through anything resembling the security review that an equivalent web application would receive. Nobody's threat modeling their vision-language image loader. The pattern of exploitation within hours of CVE publication has been consistent across AI tooling throughout this year, and the security community on HN was frank about why: the people who built these tools were not thinking about adversarial inputs from people who want to reach your cloud metadata service.
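For a sense of how small the missing check is, here is a sketch of the kind of validation an image loader could run before fetching anything. To be clear, this is not LMDeploy's code or its patch. `check_image_url` and the injectable `resolver` parameter are names I made up, and the policy (http or https only, and every resolved address must be public) is an assumption about what a reasonable default looks like.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_internal_address(ip_text: str) -> bool:
    """True for addresses a public image fetcher should never contact."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # unparseable address: fail closed
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254, the cloud metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_host(host: str) -> list:
    """Resolve a hostname to every IP address it maps to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def check_image_url(url: str, resolver=resolve_host) -> bool:
    """Return True only if the URL is http(s) and every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolver(parsed.hostname)
    except OSError:
        return False
    return bool(addresses) and not any(is_internal_address(a) for a in addresses)


# Stub resolver so the examples run without touching DNS.
def fake_dns(host):
    table = {
        "images.example.com": ["93.184.216.34"],
        "metadata.attacker.example": ["169.254.169.254"],
        "mixed.attacker.example": ["93.184.216.34", "10.0.0.5"],
    }
    return table.get(host, [host])


public_ok = check_image_url("https://images.example.com/cat.png", resolver=fake_dns)
metadata_blocked = check_image_url("http://metadata.attacker.example/latest/", resolver=fake_dns)
mixed_blocked = check_image_url("http://mixed.attacker.example/x.png", resolver=fake_dns)
redis_blocked = check_image_url("http://127.0.0.1:6379/", resolver=fake_dns)
file_blocked = check_image_url("file:///etc/passwd", resolver=fake_dns)
```

A check like this is necessary but not sufficient. A hostname can resolve differently between the check and the fetch, and a redirect can point somewhere internal after the first hop, so a production fix also has to connect to the address it validated and re-validate every redirect target.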

If you're running LMDeploy in any environment that touches real infrastructure, patch immediately. If you're running any open source LLM inference stack — vLLM, TGI, Ollama with network exposure, anything — go read its issue tracker today. Not next week.

The Stanford AI Index landed this week and IEEE Spectrum published a useful twelve-graph summary. The headline numbers are as dizzying as they always are: global AI compute capacity up thirty-fold since 2021. Agentic AI benchmarks showing the steepest performance curves of any category. OpenAI's o1 scored 8.8 percent on Humanity's Last Exam at launch; the current best models are at 38.3 percent.

The graph that HN actually argued about was ClockBench. It tests whether AI models can read analog clocks and understand calendar layouts. The best available model is sitting at roughly fifty-fifty accuracy. Fifty percent on reading a clock. After thirty times more compute than we had four years ago, and after every major lab published triumphal press releases about superhuman performance on graduate-level reasoning. Several commenters pointed out, correctly, that this is not a curiosity. Clocks and calendars are how humans sequence events in the physical world. An agent that cannot reliably read a clock is an agent that will confidently schedule your HVAC maintenance for the wrong day while simultaneously writing you a detailed explanation of why HVAC maintenance is important. The capability gaps are real and they are not located where the benchmark leaderboards suggest they are.

The China robotics number got significant attention in the thread: 295,000 industrial robots installed in 2024, versus 44,500 in Japan and 34,200 in the United States. The US leads on model releases, benchmark scores, and research publications. China leads on deployment at scale in physical environments. These are different competitions, running on different timelines, with different feedback loops about what actually works when the model has to interact with the real world.

Avoca raised $125 million at a billion-dollar valuation this week, and I mention it not to celebrate another unicorn but because it is probably the most operationally honest AI story of the week. Avoca builds AI voice agents for HVAC companies, plumbers, roofers, and electricians. The agents answer inbound calls, schedule jobs, follow up on estimates, and run dispatch. Over 800 customers. On track to book a billion dollars in jobs this year. YC-backed, with Kleiner Perkins leading the Series A and Meritech plus General Catalyst leading the Series B.

The reason this story is refreshing is that the value proposition is completely legible without a whitepaper. A plumbing company gets fifteen calls a day and cannot afford someone sitting by the phone at 9pm on a Sunday when a pipe bursts. Avoca picks up, figures out what the customer needs, gets it into the scheduling system, and sends a confirmation. That is a real problem. It has a real dollar value attached to it. There is no "AI will transform your workflows in ways that will become apparent over the next several quarters." There is "you were losing jobs because nobody picked up the phone, and now you pick up the phone."

The Fortune piece this week noted that the founders found their first customers at trade shows — which is to say, they went to where the actual customers were, in the actual industry, instead of listing on Product Hunt and waiting for the industry to discover them. This is what vertical AI deployment looks like when it's working: not horizontal platforms promising everything to everyone, but a narrow tool doing one specific job well inside an industry that platform builders never considered interesting enough to optimize for. The margins in HVAC dispatch are real. The competition from other AI vendors who thought plumbing was beneath them is nonexistent.

Briefly, and without the energy it deserves: Bluesky's twenty-four-hour outage this month, claimed by an Iran-linked group calling itself the 313 Team, prompted a characteristically weary HN discussion about the gap between how decentralized protocols are marketed and how they actually operate under attack. The protocol is open. The infrastructure most users route through is not distributed in any meaningful sense. If you ran your own Personal Data Server, you weren't affected. Most users do not run their own PDS, because running your own PDS is the kind of thing that sounds great in a whitepaper and is operationally annoying in practice. Mastodon got hit shortly after. A "decentralized" platform resists DDoS only as well as its actual infrastructure is distributed, and in most cases that infrastructure is "hosted on someone else's cloud."

The week in aggregate: agents hallucinate more tools as they get smarter, the OAuth trust chain that connects your developer tools to your production secrets is longer and weaker than you've accounted for, AI infrastructure inherits a decade of unresolved web security debt, and the most grounded AI deployment story this week is a company answering phones for plumbers. These things coexist without contradiction if you've been watching this space for more than a year. Plan accordingly.

Alex Rivera is a former CTO and recovering hype-cycle survivor who writes about AI tools and the deals that actually matter at ai-best.deals. He reads the HN comment threads so you don't have to.
