Model & caching
One model, two effort profiles, two ephemeral-cached prompt blocks. This page is the operator's reference for latency, cost, and rollout discipline.
Model
- Model: claude-opus-4-7. Pinned by name in src/app/api/chat/route.ts.
- SDK: @anthropic-ai/sdk at ^0.80.0. Event shapes are SDK-version-sensitive; server-side integrations must pin compatibly.
- Transport: Anthropic Messages stream API. Chunks arrive as text_delta events; the final message is resolved via stream.finalMessage() for stop_reason and usage.
Effort profiles
The product runs two distinct effort profiles — one for authenticated live traffic, one for the public demo. Effort controls how hard the model thinks per turn; it is the primary lever for reasoning depth.
Live — POST /api/chat
- max_tokens: 4096. Sufficient for senior-partner-quality answers with supporting structure.
- output_config.effort: high. First-draft conclusions are stronger; output token cost and latency are modestly higher.
- thinking: { type: 'disabled' }. Extended thinking is off today to minimize time-to-first-byte on streamed responses.
Demo — POST /api/chat/demo
- max_tokens: 2048. Demo answers are scoped to a fictional engagement; shorter by design.
- output_config.effort: medium. Reasonable depth for a public demo with a tighter cost envelope.
- thinking: { type: 'disabled' }. Same rationale as live.
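The two profiles differ only in max_tokens and effort. A minimal sketch as plain config objects — the ChatProfile type and constant names here are illustrative, not identifiers from the codebase:

```typescript
// Illustrative sketch of the two request profiles described above.
// Only max_tokens and output_config.effort differ between routes.
type ChatProfile = {
  model: string;
  max_tokens: number;
  output_config: { effort: "high" | "medium" };
  thinking: { type: "disabled" };
};

// Live: POST /api/chat — authenticated traffic, deeper reasoning.
const LIVE_PROFILE: ChatProfile = {
  model: "claude-opus-4-7",
  max_tokens: 4096,
  output_config: { effort: "high" },
  thinking: { type: "disabled" },
};

// Demo: POST /api/chat/demo — public, tighter cost envelope.
const DEMO_PROFILE: ChatProfile = {
  model: "claude-opus-4-7",
  max_tokens: 2048,
  output_config: { effort: "medium" },
  thinking: { type: "disabled" },
};
```

Keeping the two profiles as data rather than branching logic makes the delta between routes auditable at a glance.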
Extended thinking (roadmap)
The planned move is to thinking: { type: "adaptive", display: "summarized" }. Aria will decide per turn whether to engage extended thinking; summaries surface in the UI via an "Aria is considering …" affordance. The SSE upgrade described in Live chat ships in parallel — new typed events stream thinking summaries without blocking on the full reply.
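Side by side, today's setting and the planned one — the adaptive values are taken from the roadmap note above and are not live yet:

```typescript
// Current thinking config on both routes.
const thinkingToday = { type: "disabled" } as const;

// Planned roadmap config: Aria decides per turn whether to engage
// extended thinking, and summaries surface in the UI. Not shipped yet.
const thinkingPlanned = { type: "adaptive", display: "summarized" } as const;
```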
Streaming lifecycle
On every request the server:
- Resolves the session and orgId, checks the per-user rate limit, and persists the user message.
- Composes the 8-layer system prompt.
- Starts the model stream with messages.stream, passing the composed system array with ephemeral cache control on both blocks.
- Iterates over content_block_delta/text_delta events and forwards the text to the client as a chunked plain-text response.
- Awaits stream.finalMessage(), persists the assistant message with full token accounting, bumps Conversation.lastMessageAt, and closes the response.
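The per-event filtering in the loop above can be sketched as a pure helper. The event shape mirrors the SDK's content_block_delta / text_delta chunks, but the StreamEvent type here is a simplified stand-in, not the SDK's own types:

```typescript
// Simplified stand-in for the stream events the forwarding loop cares about.
type StreamEvent = {
  type: string;
  delta?: { type: string; text?: string };
};

// Return the streamed text for a content_block_delta/text_delta event,
// or null for every other event type (which the loop skips).
function textFromEvent(event: StreamEvent): string | null {
  if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
    return event.delta.text ?? null;
  }
  return null;
}
```

Isolating this as a pure function keeps the route handler's loop trivial and makes the event filtering unit-testable without a live stream.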
Prompt caching
Aria passes two text blocks to messages.stream, each marked cache_control: { type: "ephemeral" }. Anthropic's ephemeral cache keeps a cached prefix warm for roughly five minutes after its last use; sustained traffic refreshes the window on every hit.
```typescript
return [
  {
    type: "text",
    text: STATIC_LAYERS, // identity + methodology + benchmarks + guardrails + outputFormat
    cache_control: { type: "ephemeral" },
  },
  {
    type: "text",
    text: dynamicBlock, // companyContext + userContext + engagementState
    cache_control: { type: "ephemeral" },
  },
];
```

Why two blocks:
- The static block never varies. It stays warm across every organization, every user, every turn.
- The dynamic block varies per org and per engagement update. It can invalidate independently — one organization's engagement-state update does not cold-start the static block for everyone else.
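Under those constraints the composition reduces to a small pure function. A sketch under assumed names — buildSystemBlocks is not a real helper in the codebase:

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control: { type: "ephemeral" };
};

// Compose the two-block system array: the static block is identical for
// every org and stays warm globally; the dynamic block varies per org
// and per engagement update, so it invalidates independently.
function buildSystemBlocks(staticLayers: string, dynamicBlock: string): SystemBlock[] {
  const ephemeral = { type: "ephemeral" } as const;
  return [
    { type: "text", text: staticLayers, cache_control: ephemeral },
    { type: "text", text: dynamicBlock, cache_control: ephemeral },
  ];
}
```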
Cache hit rate as a first-class metric
Every assistant message persists four token counts: inputTokens, outputTokens, cacheReadTokens, cacheCreateTokens. Cache hit rate is treated as a first-class performance metric — a warm session reads the vast majority of the system prompt from cache, which drops time-to-first-byte and cost proportionally.
```sql
-- Cache hit rate for the last 24 hours, per org
SELECT
  "orgId",
  SUM("cacheReadTokens")::float /
    NULLIF(SUM("cacheReadTokens") + SUM("cacheCreateTokens"), 0) AS cache_hit_rate,
  COUNT(*) AS assistant_messages
FROM "ChatMessage"
WHERE "role" = 'ASSISTANT'
  AND "createdAt" > NOW() - INTERVAL '24 hours'
GROUP BY "orgId"
ORDER BY cache_hit_rate DESC;
```

Schema version rollouts
Bumping PROMPT_SCHEMA_VERSION cold-starts the global cache. Every request for the first ~5 minutes after a rollout will pay the full cacheCreateTokens cost until the cache warms back up. Rollout discipline:
- Deploy during low-traffic hours when possible.
- Pre-warm by issuing a small number of synthetic requests from a monitored account immediately after deploy.
- Watch cacheReadTokens / (cacheReadTokens + cacheCreateTokens) for the 10 minutes post-deploy. The ratio should return to steady state within a few minutes.
- Every assistant message persists promptVersion, so you can correlate cache behavior to the exact version that produced each turn.
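The watched ratio is simple enough to compute anywhere the persisted token counts are available — in a dashboard, an alert, or an ad-hoc script. A minimal helper; the function name is illustrative:

```typescript
// Post-deploy health ratio: cacheReadTokens / (cacheReadTokens + cacheCreateTokens).
// Returns null when no cache traffic has been observed yet, so callers can
// distinguish "no data" from a genuinely cold cache (rate 0).
function cacheHitRate(readTokens: number, createTokens: number): number | null {
  const total = readTokens + createTokens;
  return total === 0 ? null : readTokens / total;
}
```

Immediately after a schema-version bump the ratio drops toward 0 (every request pays cacheCreateTokens), then climbs back toward its steady state as the cache warms.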
Do not include per-user identifiers in the static block
The static block is shared across every organization. Anything that should only be visible to one org belongs in the dynamic block. The layer boundaries are in place to make this mistake hard to make — resist working around them.
Observability
- X-Request-Id: echoed from the client or issued server-side. Propagates through all server logs.
- X-Prompt-Version: the composed system-prompt version at the time of the turn.
- X-Conversation-Id: the conversation this turn was appended to.
- promptSchemaVersion: logged server-side alongside upstreamRequestId (Anthropic's request ID) for cross-correlation.
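A sketch of attaching the response headers above to the outgoing turn. The header names come from this page; the helper name and the meta shape are assumptions for illustration:

```typescript
// Hypothetical helper: stamp the per-turn observability headers onto a
// response's Headers object before streaming begins.
function withObservabilityHeaders(
  headers: Headers,
  meta: { requestId: string; promptVersion: string; conversationId: string },
): Headers {
  headers.set("X-Request-Id", meta.requestId);
  headers.set("X-Prompt-Version", meta.promptVersion);
  headers.set("X-Conversation-Id", meta.conversationId);
  return headers;
}
```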
Next: Engagement memory.