Model & caching
One model, two effort profiles, two ephemeral-cached prompt blocks. This page is the operator's reference for latency, cost, and rollout discipline.
Model
- Model: claude-opus-4-7. Pinned by name in src/app/api/chat/route.ts.
- SDK: @anthropic-ai/sdk at ^0.80.0. Event shapes are SDK-version-sensitive; server-side integrations must pin compatibly.
- Transport: Anthropic Messages stream API. Chunks arrive as text_delta events; the final message is resolved via stream.finalMessage() for stop_reason and usage.
Effort profiles
The product runs two distinct effort profiles — one for authenticated live traffic, one for the public demo. Effort controls how hard the model thinks per turn; it is the primary lever for reasoning depth.
Live — POST /api/chat
- max_tokens: 4096. Sufficient for senior-partner-quality answers with supporting structure.
- output_config.effort: high. First-draft conclusions are stronger; output token cost and latency are modestly higher.
- thinking: { type: 'disabled' }. Extended thinking is off today to minimize time-to-first-byte on streamed responses.
Demo — POST /api/chat/demo
- max_tokens: 2048. Demo answers are scoped to a fictional engagement; shorter by design.
- output_config.effort: medium. Reasonable depth for a public demo with a tighter cost envelope.
- thinking: { type: 'disabled' }. Same rationale as live.
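The two profiles differ only in max_tokens and effort. A minimal sketch as plain config objects — the ChatProfile type and constant names here are illustrative, not identifiers from the codebase:

```typescript
// Illustrative sketch of the two request profiles described above.
// Only max_tokens and output_config.effort differ between routes.
type ChatProfile = {
  model: string;
  max_tokens: number;
  output_config: { effort: "high" | "medium" };
  thinking: { type: "disabled" };
};

// Live: POST /api/chat — authenticated traffic, deeper reasoning.
const LIVE_PROFILE: ChatProfile = {
  model: "claude-opus-4-7",
  max_tokens: 4096,
  output_config: { effort: "high" },
  thinking: { type: "disabled" },
};

// Demo: POST /api/chat/demo — public, tighter cost envelope.
const DEMO_PROFILE: ChatProfile = {
  model: "claude-opus-4-7",
  max_tokens: 2048,
  output_config: { effort: "medium" },
  thinking: { type: "disabled" },
};
```

Keeping the two profiles as data rather than branching logic makes the delta between routes auditable at a glance.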
Extended thinking (roadmap)
The planned move is to thinking: { type: "adaptive", display: "summarized" }. Aria will decide per turn whether to engage extended thinking; summaries surface in the UI via an "Aria is considering …" affordance. The SSE upgrade described in Live chat ships in parallel — new typed events stream thinking summaries without blocking on the full reply.
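Side by side, today's setting and the planned one — the adaptive values are taken from the roadmap note above and are not live yet:

```typescript
// Current thinking config on both routes.
const thinkingToday = { type: "disabled" } as const;

// Planned roadmap config: Aria decides per turn whether to engage
// extended thinking, and summaries surface in the UI. Not shipped yet.
const thinkingPlanned = { type: "adaptive", display: "summarized" } as const;
```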
Streaming lifecycle
On every request the server:
- Resolves the session and orgId, checks the per-user rate limit, and persists the user message.
- Composes the 8-layer system prompt.
- Starts the model stream with messages.stream, passing the composed system array with ephemeral cache control on both blocks.
- Iterates over content_block_delta/text_delta events and forwards the text to the client as a chunked plain-text response.
- Awaits stream.finalMessage(), persists the assistant message with full token accounting, bumps Conversation.lastMessageAt, and closes the response.
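The per-event filtering in the loop above can be sketched as a pure helper. The event shape mirrors the SDK's content_block_delta / text_delta chunks, but the StreamEvent type here is a simplified stand-in, not the SDK's own types:

```typescript
// Simplified stand-in for the stream events the forwarding loop cares about.
type StreamEvent = {
  type: string;
  delta?: { type: string; text?: string };
};

// Return the streamed text for a content_block_delta/text_delta event,
// or null for every other event type (which the loop skips).
function textFromEvent(event: StreamEvent): string | null {
  if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
    return event.delta.text ?? null;
  }
  return null;
}
```

Isolating this as a pure function keeps the route handler's loop trivial and makes the event filtering unit-testable without a live stream.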
Prompt caching
Aria passes two text blocks to messages.stream, each marked cache_control: { type: "ephemeral" }. Anthropic's ephemeral cache keeps a cached prefix warm for roughly five minutes after its last use; sustained traffic refreshes the window on every hit.
```typescript
return [
  {
    type: "text",
    text: STATIC_LAYERS, // identity + methodology + benchmarks + guardrails + outputFormat
    cache_control: { type: "ephemeral" },
  },
  {
    type: "text",
    text: dynamicBlock, // companyContext + userContext + engagementState
    cache_control: { type: "ephemeral" },
  },
];
```

Why two blocks:
- The static block never varies. It stays warm across every organization, every user, every turn.
- The dynamic block varies per org and per engagement update. It can invalidate independently — one organization's engagement-state update does not cold-start the static block for everyone else.
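Under those constraints the composition reduces to a small pure function. A sketch under assumed names — buildSystemBlocks is not a real helper in the codebase:

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control: { type: "ephemeral" };
};

// Compose the two-block system array: the static block is identical for
// every org and stays warm globally; the dynamic block varies per org
// and per engagement update, so it invalidates independently.
function buildSystemBlocks(staticLayers: string, dynamicBlock: string): SystemBlock[] {
  const ephemeral = { type: "ephemeral" } as const;
  return [
    { type: "text", text: staticLayers, cache_control: ephemeral },
    { type: "text", text: dynamicBlock, cache_control: ephemeral },
  ];
}
```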
Cache hit rate as a first-class metric
Every assistant message persists four token counts: inputTokens, outputTokens, cacheReadTokens, cacheCreateTokens. Cache hit rate is treated as a first-class performance metric — a warm session reads the vast majority of the system prompt from cache, which drops time-to-first-byte and cost proportionally.
```sql
-- Cache hit rate for the last 24 hours, per org
SELECT
  "orgId",
  SUM("cacheReadTokens")::float /
    NULLIF(SUM("cacheReadTokens") + SUM("cacheCreateTokens"), 0) AS cache_hit_rate,
  COUNT(*) AS assistant_messages
FROM "ChatMessage"
WHERE "role" = 'ASSISTANT'
  AND "createdAt" > NOW() - INTERVAL '24 hours'
GROUP BY "orgId"
ORDER BY cache_hit_rate DESC;
```

Schema version rollouts
Bumping PROMPT_SCHEMA_VERSION cold-starts the global cache. Every request for the first ~5 minutes after a rollout will pay the full cacheCreateTokens cost until the cache warms back up. Rollout discipline:
- Deploy during low-traffic hours when possible.
- Pre-warm by issuing a small number of synthetic requests from a monitored account immediately after deploy.
- Watch cacheReadTokens / (cacheReadTokens + cacheCreateTokens) for the 10 minutes post-deploy. The ratio should return to steady state within a few minutes.
- Every assistant message persists promptVersion, so you can correlate cache behavior to the exact version that produced each turn.
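The watched ratio is simple enough to compute anywhere the persisted token counts are available — in a dashboard, an alert, or an ad-hoc script. A minimal helper; the function name is illustrative:

```typescript
// Post-deploy health ratio: cacheReadTokens / (cacheReadTokens + cacheCreateTokens).
// Returns null when no cache traffic has been observed yet, so callers can
// distinguish "no data" from a genuinely cold cache (rate 0).
function cacheHitRate(readTokens: number, createTokens: number): number | null {
  const total = readTokens + createTokens;
  return total === 0 ? null : readTokens / total;
}
```

Immediately after a schema-version bump the ratio drops toward 0 (every request pays cacheCreateTokens), then climbs back toward its steady state as the cache warms.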
Do not include per-user identifiers in the static block
The static block is shared across every organization. Anything that should only be visible to one org belongs in the dynamic block. The layer boundaries are in place to make this mistake hard to make — resist working around them.
Observability
- X-Request-Id: echoed from the client or issued server-side. Propagates through all server logs.
- X-Prompt-Version: the composed system-prompt version at the time of the turn.
- X-Conversation-Id: the conversation this turn was appended to.
- promptSchemaVersion: logged server-side alongside upstreamRequestId (Anthropic's request ID) for cross-correlation.
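A sketch of attaching the response headers above to the outgoing turn. The header names come from this page; the helper name and the meta shape are assumptions for illustration:

```typescript
// Hypothetical helper: stamp the per-turn observability headers onto a
// response's Headers object before streaming begins.
function withObservabilityHeaders(
  headers: Headers,
  meta: { requestId: string; promptVersion: string; conversationId: string },
): Headers {
  headers.set("X-Request-Id", meta.requestId);
  headers.set("X-Prompt-Version", meta.promptVersion);
  headers.set("X-Conversation-Id", meta.conversationId);
  return headers;
}
```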
Next: Engagement memory.