# Aria's performance improvements with Opus 4.7
Aria now runs on Anthropic's Opus 4.7: deeper reasoning, lower first-token latency, and sharper recall across long engagement memory. Here's what changed, measured end-to-end on real customer workloads.
Today we shipped the largest model upgrade Aria has had since the product launched. Every Aria engagement — new and existing — is now running on Anthropic's Opus 4.7.
This is a short, technical note on what changed, what it cost, and what customers should actually feel.
## What Aria runs on
Aria isn't a single prompt. One end-to-end engagement turn involves anywhere from 6 to 40 model calls across four planes:
- Observation — classifying Slack / Jira / Salesforce / Teams / Calendar / email traffic into workflow events
- Reasoning — composing the belief state, writing the thesis, and generating devil's-advocate counter-arguments
- Dispatch — writing interview invitations, scoring consent drift, composing post-interview summaries
- Deploy — generating agent configurations, writing rollback playbooks, extracting kill-threshold criteria
Different planes historically ran on different tiers — Haiku for classification, Sonnet for reasoning, targeted Opus passes for the devil's-advocate layer. With 4.7, we re-ran the tier evaluation against our internal benchmark suite. The result: Opus 4.7 now owns the full reasoning plane, Haiku 4.5 handles classification, and Sonnet drops out of our production path.
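As a sketch, per-plane routing can be as simple as a static plane-to-model map. The model identifiers below are illustrative, and the dispatch/deploy assignments are an assumption on our part, not the exact production table:

```python
# Illustrative plane-to-model routing table. Model IDs are placeholders,
# and the dispatch/deploy tier assignments are assumed, not confirmed.
PLANE_MODELS = {
    "observation": "haiku-4.5",  # classification: cheap and fast
    "reasoning":   "opus-4.7",   # belief state, thesis, devil's advocate
    "dispatch":    "opus-4.7",
    "deploy":      "opus-4.7",
}

def model_for(plane: str) -> str:
    """Resolve the model tier for a plane; unknown planes fail loudly."""
    try:
        return PLANE_MODELS[plane]
    except KeyError:
        raise ValueError(f"unknown plane: {plane!r}")
```

The "auto" routing policy described later in this post is, at its core, this lookup plus per-call overrides.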
## The benchmarks that mattered
We care about three things, in order: whether the thesis is right, whether the citation chain holds, and how long the customer waits.
### Thesis correctness — Vantage Industrial replay
Our headline internal benchmark is a replay of the Vantage Industrial assessment — 340 hours of Slack, 4,108 Jira tickets, 1,247 Salesforce events, 12 stakeholder interviews, 6 weeks of calendar traffic. We ask each model to surface the three biggest cost centers in dollars. The ground truth is Margaret Chen's actual engagement outcome.
| Model | Top-3 recall | Dollar accuracy (±%) | Confidence calibration |
|---|---|---|---|
| Sonnet 4.6 | 2 / 3 | ±18% | Over-confident |
| Opus 4.6 | 3 / 3 | ±11% | Well-calibrated |
| Opus 4.7 | 3 / 3 | ±6% | Well-calibrated |
Opus 4.7 now gets all three within six percent of the true dollar figure. The confidence calibration didn't move — it was already honest. What moved was the dollar precision: the model is now citing from a tighter provenance chain and ignoring noise traffic that Sonnet was counting as signal.
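For concreteness, a minimal version of the replay scorer looks like this. This is a sketch, not our actual harness; the cost-center-to-dollars dict shape is assumed:

```python
def score_thesis(predicted, truth, top_n=3):
    """Score a cost-center thesis against ground truth (illustrative).

    predicted / truth: dicts mapping cost-center name -> dollar estimate.
    Returns (recall, worst_error): how many of the true top-N centers
    appear in the predicted top-N, and the worst relative dollar
    deviation among the matched centers.
    """
    top = lambda d: sorted(d, key=d.get, reverse=True)[:top_n]
    true_top, pred_top = top(truth), top(predicted)
    matched = [c for c in true_top if c in pred_top]
    errors = [abs(predicted[c] - truth[c]) / truth[c] for c in matched]
    return len(matched), (max(errors) if errors else None)
```

The "±6%" column in the table above corresponds to `worst_error` in this framing: the largest relative miss across the correctly surfaced cost centers.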
### Citation chain integrity
Every claim Aria makes on a dashboard has to link back to source evidence — a Slack thread, a Jira ticket, a meeting trace. We score this as the percentage of claims where every cited source actually supports the claim when a human reviewer re-reads it.
| Model | Citation integrity | Fabricated citations |
|---|---|---|
| Sonnet 4.6 | 94.2% | 1.8% |
| Opus 4.6 | 97.1% | 0.6% |
| Opus 4.7 | 99.3% | 0.1% |
This is the number that matters most for the "traceable, not oracular" principle. Fabricated citations are now statistically rare. After the upgrade, every claim on the Vantage dashboard is clickable back to real source data, and a human auditor re-reading the trail confirmed 99.3% of those claims.
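A stripped-down version of this scorer, assuming a simplified claim schema (the `exists` / `supports` fields are illustrative, not Aria's actual data model):

```python
def score_citations(claims):
    """Score a batch of claims for citation integrity (illustrative).

    Each claim: {"citations": [{"exists": bool, "supports": bool}, ...]}.
    Integrity = fraction of claims where every cited source exists and
    supports the claim. Fabrication rate = fraction of all citations
    that point at a source which doesn't exist.
    """
    total_cites = sum(len(c["citations"]) for c in claims)
    fabricated = sum(1 for c in claims
                     for s in c["citations"] if not s["exists"])
    intact = sum(1 for c in claims
                 if all(s["exists"] and s["supports"] for s in c["citations"]))
    return intact / len(claims), fabricated / total_cites
```

The strictness matters: a claim with four good citations and one bad one counts as broken, which is why integrity is scored per claim rather than per citation.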
### Latency — what the customer actually feels
Aria is an always-on product. Latency is the tax on every interaction. We measure two things: time-to-first-token on a chat reply, and time-to-agent-decision on a deploy-gate question.
| Metric | Sonnet 4.6 | Opus 4.6 | Opus 4.7 |
|---|---|---|---|
| Time-to-first-token (chat, p50) | 620 ms | 840 ms | 480 ms |
| Time-to-first-token (chat, p95) | 1.2 s | 1.6 s | 0.9 s |
| Time-to-decision (deploy gate, p50) | 4.1 s | 5.8 s | 3.2 s |
| Voice Mode first audio (p50) | — | — | 420 ms |
Opus 4.7 is faster than Sonnet 4.6 at p50, despite being a larger model, because of improved speculative decoding and a streaming output path we helped Anthropic test during the preview window. For Voice Mode specifically, we're now comfortably under the 500 ms first-audio target that makes the conversation feel synchronous rather than transactional.
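Measuring time-to-first-token is straightforward once output is streamed. A minimal harness over any token iterator, independent of any particular SDK:

```python
import time

def time_to_first_token(stream):
    """Clock from call start to the first non-empty streamed chunk.

    `stream` is any iterator yielding text chunks as the model streams.
    Returns elapsed seconds, or None if the stream produced nothing.
    """
    start = time.monotonic()
    for chunk in stream:
        if chunk:
            return time.monotonic() - start
    return None
```

In production you'd wrap the SDK's streaming response with this and aggregate p50/p95 over traffic; `time.monotonic()` is used so wall-clock adjustments can't skew the measurement.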
## Long-context: the "stay and get smarter" problem
The single biggest shift for our use case is recall at depth. Aria keeps six months of engagement context — every thread, every interview transcript, every deploy decision, every kill-threshold event. "Stay and get smarter" only works if the model can actually pull relevant threads from that growing pool.
We use a needle-in-a-haystack benchmark built for Aria: we plant a specific operational insight in week 3 of a six-month engagement, then ask Aria about that insight in week 24, with 180k tokens of unrelated traffic in between.
| Model | Recall at 180k context | Hallucination rate |
|---|---|---|
| Sonnet 4.6 | 76% | 4.2% |
| Opus 4.6 | 88% | 1.1% |
| Opus 4.7 | 96% | 0.2% |
This was the finding that made the upgrade non-negotiable for us. Hitting 96% recall at 180k context, with a 0.2% hallucination rate, is the first time we can honestly tell a customer that Aria's memory is the product.
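The harness itself is simple. Here's a sketch of how such a haystack might be assembled; the week tagging and insertion logic are illustrative, not our production generator:

```python
import random

def build_haystack(needle, filler_lines, needle_week=3, total_weeks=24, seed=0):
    """Assemble a needle-in-a-haystack transcript (illustrative harness).

    Plants `needle` (one operational insight) among `filler_lines` of
    unrelated traffic. Each filler line gets a random week tag; the
    needle is inserted proportionally early, at `needle_week`, so the
    week-24 probe has to reach back across the full context.
    """
    rng = random.Random(seed)  # fixed seed keeps the benchmark replayable
    lines = [f"[week {rng.randint(1, total_weeks)}] {t}" for t in filler_lines]
    insert_at = len(lines) * needle_week // total_weeks
    lines.insert(insert_at, f"[week {needle_week}] {needle}")
    return "\n".join(lines)
```

The probe then asks the model about the needle and scores exact recall of the planted insight; anything asserted about the needle that contradicts the planted text counts toward the hallucination rate.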
## What it cost
Model upgrades are a margin question. Opus 4.7 is priced above Sonnet 4.6, and we didn't want to pass that on to customers. Two things kept us margin-neutral:
- Prompt caching. We're now caching the system prompt, belief-state snapshot, and integration schemas at the start of every engagement turn. Cache hit rate across production traffic is 86% — effectively, most of the per-turn input cost disappears after the first turn.
- Haiku 4.5 for classification. Every classification-plane call dropped from Sonnet to Haiku 4.5, which is 4× cheaper with equivalent accuracy on our classification tasks. This offset the Opus upcharge on the reasoning plane almost exactly.
Net effect: per-engagement model cost is 6% lower than it was on Sonnet 4.6, while quality is higher on every metric that matters.
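Here's roughly what the caching setup looks like as a request payload, following Anthropic's published prompt-caching block format; the model name and prompt contents are placeholders:

```python
def cached_turn_request(system_prompt, belief_state, schemas, user_msg, model):
    """Build a Messages-API-shaped request with the stable prefix marked
    cacheable via cache_control blocks (contents are placeholders).
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            # Stable prefix: served from cache after the first turn.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": belief_state,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": schemas,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Only the per-turn user message is uncached input.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The key property is ordering: everything stable goes before the cache breakpoints, everything per-turn goes after, so the 86% hit rate applies to the bulk of each turn's input tokens.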
## What customers should feel
Three things, in order of how fast you'll notice them:
- Chat is faster. p50 response time is sub-500 ms. Voice Mode crosses the threshold where it feels conversational, not transactional.
- Citations are tighter. Hover any number on the Aria dashboard. The source-traceback chain is now almost entirely hallucination-free. If Aria can't source a claim, Aria now declines to make it rather than inventing the citation.
- Six-month memory actually works. Ask Aria next quarter what a stakeholder said in the first week of the engagement. That question now has a real answer.
## The rollout
Opus 4.7 is now the default on every engagement. There is no customer action required: no re-training, no re-ingestion, no cutover window. If you're in an active engagement, the next prompt you send runs on Opus 4.7.
For customers who want to pin to a specific model for reproducibility (regulated industries, mostly), the model selector in /assess/settings now exposes Sonnet 4.6, Opus 4.6, Opus 4.7, and an "auto" policy that lets Aria route per-plane automatically. Auto is the default.
## Credit where it's due
Thanks to the Anthropic research team for the preview access on 4.7 and for patience during the streaming-output integration — that work directly produced the latency improvements we're seeing now. Thanks to the Vantage Industrial operating team for letting us replay their six-week engagement as a benchmark set; every number in this post was measured against work that actually shipped.
If you want to see Aria on Opus 4.7 running against a real operating problem, the live demo at /demo is now on the new model, with full provenance turned on. Ask Aria anything about the Vantage assessment — the traces are clickable.