# Aria's performance improvements with Opus 4.7
Aria now runs on Anthropic's Opus 4.7: deeper reasoning, lower first-token latency, and sharper recall across long engagement memory. Here's what changed, measured end-to-end on real customer workloads.
Today we shipped the largest model upgrade Aria has had since the product launched. Every Aria engagement — new and existing — is now running on Anthropic's Opus 4.7.
This is a short, technical note on what changed, what it cost, and what customers should actually feel.
## What Aria runs on
Aria isn't a single prompt. One end-to-end engagement turn involves anywhere from 6 to 40 model calls across four planes:
- Observation — classifying Slack / Jira / Salesforce / Teams / Calendar / email traffic into workflow events
- Reasoning — composing the belief state, writing the thesis, and generating devil's-advocate counter-arguments
- Dispatch — writing interview invitations, scoring consent drift, composing post-interview summaries
- Deploy — generating agent configurations, writing rollback playbooks, extracting kill-threshold criteria
Different planes historically ran on different tiers — Haiku for classification, Sonnet for reasoning, targeted Opus passes for the devil's-advocate layer. With 4.7, we re-ran the tier evaluation against our internal benchmark suite. The result: Opus 4.7 now owns the full reasoning plane, Haiku 4.5 handles classification, and Sonnet drops out of our production path.
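As a sketch, per-plane routing can be as simple as a static plane-to-model map. The model identifiers below are illustrative, and the dispatch/deploy assignments are an assumption on our part, not the exact production table:

```python
# Illustrative plane-to-model routing table. Model IDs are placeholders,
# and the dispatch/deploy tier assignments are assumed, not confirmed.
PLANE_MODELS = {
    "observation": "haiku-4.5",  # classification: cheap and fast
    "reasoning":   "opus-4.7",   # belief state, thesis, devil's advocate
    "dispatch":    "opus-4.7",
    "deploy":      "opus-4.7",
}

def model_for(plane: str) -> str:
    """Resolve the model tier for a plane; unknown planes fail loudly."""
    try:
        return PLANE_MODELS[plane]
    except KeyError:
        raise ValueError(f"unknown plane: {plane!r}")
```

The "auto" routing policy described later in this post is, at its core, this lookup plus per-call overrides.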
## The benchmarks that mattered
We care about three things, in order: whether the thesis is right, whether the citation chain holds, and how long the customer waits.
### Thesis correctness — Vantage Industrial replay
Our headline internal benchmark is a replay of the Vantage Industrial assessment — 340 hours of Slack, 4,108 Jira tickets, 1,247 Salesforce events, 12 stakeholder interviews, 6 weeks of calendar traffic. We ask each model to surface the three biggest cost centers in dollars. The ground truth is Margaret Chen's actual engagement outcome.
| Model | Top-3 recall | Dollar accuracy (±%) | Confidence calibration |
|---|---|---|---|
| Sonnet 4.6 | 2 / 3 | ±18% | Over-confident |
| Opus 4.6 | 3 / 3 | ±11% | Well-calibrated |
| Opus 4.7 | 3 / 3 | ±6% | Well-calibrated |
Opus 4.7 now gets all three within six percent of the true dollar figure. The confidence calibration didn't move — it was already honest. What moved was the dollar precision: the model is now citing from a tighter provenance chain and ignoring noise traffic that Sonnet was counting as signal.
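For concreteness, a minimal version of the replay scorer looks like this. This is a sketch, not our actual harness; the cost-center-to-dollars dict shape is assumed:

```python
def score_thesis(predicted, truth, top_n=3):
    """Score a cost-center thesis against ground truth (illustrative).

    predicted / truth: dicts mapping cost-center name -> dollar estimate.
    Returns (recall, worst_error): how many of the true top-N centers
    appear in the predicted top-N, and the worst relative dollar
    deviation among the matched centers.
    """
    top = lambda d: sorted(d, key=d.get, reverse=True)[:top_n]
    true_top, pred_top = top(truth), top(predicted)
    matched = [c for c in true_top if c in pred_top]
    errors = [abs(predicted[c] - truth[c]) / truth[c] for c in matched]
    return len(matched), (max(errors) if errors else None)
```

The "±6%" column in the table above corresponds to `worst_error` in this framing: the largest relative miss across the correctly surfaced cost centers.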
### Citation chain integrity
Every claim Aria makes on a dashboard has to link back to source evidence — a Slack thread, a Jira ticket, a meeting trace. We score this as the percentage of claims where every cited source actually supports the claim when a human reviewer re-reads it.
| Model | Citation integrity | Fabricated citations |
|---|---|---|
| Sonnet 4.6 | 94.2% | 1.8% |
| Opus 4.6 | 97.1% | 0.6% |
| Opus 4.7 | 99.3% | 0.1% |
This is the number that matters most for the "traceable, not oracular" principle. Fabricated citations are now statistically rare. After the upgrade, every claim on the Vantage dashboard is clickable back to real source data, and a human auditor re-reading the trail confirmed 99.3% of those claims.
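A stripped-down version of this scorer, assuming a simplified claim schema (the `exists` / `supports` fields are illustrative, not Aria's actual data model):

```python
def score_citations(claims):
    """Score a batch of claims for citation integrity (illustrative).

    Each claim: {"citations": [{"exists": bool, "supports": bool}, ...]}.
    Integrity = fraction of claims where every cited source exists and
    supports the claim. Fabrication rate = fraction of all citations
    that point at a source which doesn't exist.
    """
    total_cites = sum(len(c["citations"]) for c in claims)
    fabricated = sum(1 for c in claims
                     for s in c["citations"] if not s["exists"])
    intact = sum(1 for c in claims
                 if all(s["exists"] and s["supports"] for s in c["citations"]))
    return intact / len(claims), fabricated / total_cites
```

The strictness matters: a claim with four good citations and one bad one counts as broken, which is why integrity is scored per claim rather than per citation.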
### Latency — what the customer actually feels
Aria is an always-on product. Latency is the tax on every interaction. We measure two things: time-to-first-token on a chat reply, and time-to-agent-decision on a deploy-gate question.
| Metric | Sonnet 4.6 | Opus 4.6 | Opus 4.7 |
|---|---|---|---|
| Time-to-first-token (chat, p50) | 620 ms | 840 ms | 480 ms |
| Time-to-first-token (chat, p95) | 1.2 s | 1.6 s | 0.9 s |
| Time-to-decision (deploy gate, p50) | 4.1 s | 5.8 s | 3.2 s |
| Voice Mode first audio (p50) | — | — | 420 ms |
Opus 4.7 is faster than Sonnet 4.6 at p50, despite being a larger model, because of improved speculative decoding and a streaming output path we helped Anthropic test during the preview window. For Voice Mode specifically, we're now comfortably under the 500 ms first-audio target that makes the conversation feel synchronous rather than transactional.
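Measuring time-to-first-token is straightforward once output is streamed. A minimal harness over any token iterator, independent of any particular SDK:

```python
import time

def time_to_first_token(stream):
    """Clock from call start to the first non-empty streamed chunk.

    `stream` is any iterator yielding text chunks as the model streams.
    Returns elapsed seconds, or None if the stream produced nothing.
    """
    start = time.monotonic()
    for chunk in stream:
        if chunk:
            return time.monotonic() - start
    return None
```

In production you'd wrap the SDK's streaming response with this and aggregate p50/p95 over traffic; `time.monotonic()` is used so wall-clock adjustments can't skew the measurement.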
## Long-context: the "stay and get smarter" problem
The single biggest shift for our use case is recall at depth. Aria keeps six months of engagement context — every thread, every interview transcript, every deploy decision, every kill-threshold event. "Stay and get smarter" only works if the model can actually pull relevant threads from that growing pool.
We use a needle-in-a-haystack benchmark built for Aria: we plant a specific operational insight in week 3 of a six-month engagement, then ask Aria about that insight in week 24, with 180k tokens of unrelated traffic in between.
| Model | Recall at 180k context | Hallucination rate |
|---|---|---|
| Sonnet 4.6 | 76% | 4.2% |
| Opus 4.6 | 88% | 1.1% |
| Opus 4.7 | 96% | 0.2% |
This was the finding that made the upgrade non-negotiable for us. Hitting 96% recall at 180k context, with a 0.2% hallucination rate, is the first time we can honestly tell a customer that Aria's memory is the product.
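The harness itself is simple. Here's a sketch of how such a haystack might be assembled; the week tagging and insertion logic are illustrative, not our production generator:

```python
import random

def build_haystack(needle, filler_lines, needle_week=3, total_weeks=24, seed=0):
    """Assemble a needle-in-a-haystack transcript (illustrative harness).

    Plants `needle` (one operational insight) among `filler_lines` of
    unrelated traffic. Each filler line gets a random week tag; the
    needle is inserted proportionally early, at `needle_week`, so the
    week-24 probe has to reach back across the full context.
    """
    rng = random.Random(seed)  # fixed seed keeps the benchmark replayable
    lines = [f"[week {rng.randint(1, total_weeks)}] {t}" for t in filler_lines]
    insert_at = len(lines) * needle_week // total_weeks
    lines.insert(insert_at, f"[week {needle_week}] {needle}")
    return "\n".join(lines)
```

The probe then asks the model about the needle and scores exact recall of the planted insight; anything asserted about the needle that contradicts the planted text counts toward the hallucination rate.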
## What it cost
Model upgrades are a margin question. Opus 4.7 is priced above Sonnet 4.6, and we didn't want to pass that on to customers. Two things kept us margin-neutral:
- Prompt caching. We're now caching the system prompt, belief-state snapshot, and integration schemas at the start of every engagement turn. Cache hit rate across production traffic is 86% — effectively, most of the per-turn input cost disappears after the first turn.
- Haiku 4.5 for classification. Every classification-plane call dropped from Sonnet to Haiku 4.5, which is 4× cheaper with equivalent accuracy on our classification tasks. This offset the Opus upcharge on the reasoning plane almost exactly.
Net effect: per-engagement model cost is 6% lower than it was on Sonnet 4.6, while quality is higher on every metric that matters.
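Here's roughly what the caching setup looks like as a request payload, following Anthropic's published prompt-caching block format; the model name and prompt contents are placeholders:

```python
def cached_turn_request(system_prompt, belief_state, schemas, user_msg, model):
    """Build a Messages-API-shaped request with the stable prefix marked
    cacheable via cache_control blocks (contents are placeholders).
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            # Stable prefix: served from cache after the first turn.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": belief_state,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": schemas,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the per-turn user message is uncached input.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The key property is ordering: everything stable goes before the cache breakpoints, everything per-turn goes after, so the 86% hit rate applies to the bulk of each turn's input tokens.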
## What customers should feel
Three things, in order of how fast you'll notice them:
- Chat is faster. p50 response time is sub-500 ms. Voice Mode crosses the threshold where it feels conversational, not transactional.
- Citations are tighter. Hover any number on the Aria dashboard. The source-traceback chain is now almost entirely hallucination-free. If Aria can't source a claim, Aria now declines to make it rather than inventing the citation.
- Six-month memory actually works. Ask Aria next quarter what a stakeholder said in the first week of the engagement. That question now has a real answer.
## The rollout
Opus 4.7 is now the default on every engagement. There is no customer action required: no re-training, no re-ingestion, no cutover window. If you're in an active engagement, the next prompt you send runs on Opus 4.7.
For customers who want to pin to a specific model for reproducibility (regulated industries, mostly), the model selector in /assess/settings now exposes Sonnet 4.6, Opus 4.6, Opus 4.7, and an "auto" policy that lets Aria route per-plane automatically. Auto is the default.
## Credit where it's due
Thanks to the Anthropic research team for the preview access on 4.7 and for patience during the streaming-output integration — that work directly produced the latency improvements we're seeing now. Thanks to the Vantage Industrial operating team for letting us replay their six-week engagement as a benchmark set; every number in this post was measured against work that actually shipped.
If you want to see Aria on Opus 4.7 running against a real operating problem, the live demo at /demo is now on the new model, with full provenance turned on. Ask Aria anything about the Vantage assessment — the traces are clickable.