These are the docs for the beta version of Evalite. Install with pnpm add evalite@beta

Vercel AI SDK

Deep integration with Vercel’s AI SDK for automatic tracing and caching of LLM calls.

Setup

Wrap your AI SDK models with wrapAISDKModel to enable tracing and caching:

import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";
const model = wrapAISDKModel(openai("gpt-4o-mini"));

This single wrapper provides both automatic tracing and intelligent caching of LLM responses.
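
The wrapped model is a drop-in replacement for the original, so it can be passed to any AI SDK call. A minimal sketch, assuming generateText and an illustrative prompt:

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { wrapAISDKModel } from "evalite/ai-sdk";

const model = wrapAISDKModel(openai("gpt-4o-mini"));

// Use the wrapped model exactly like the unwrapped one;
// tracing and caching kick in when this runs inside an eval.
const { text } = await generateText({
  model,
  prompt: "What's the capital of France?",
});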

Tracing

wrapAISDKModel automatically captures all LLM calls made through the AI SDK, including:

  • Full prompt/messages
  • Model responses (text and tool calls)
  • Token usage
  • Timing information

Viewing Traces

Traces appear in the Evalite UI under each test case:

  1. Navigate to an eval result
  2. Click on a specific test case
  3. View the “Traces” section to see all LLM calls
  4. Inspect input, output, and timing for each trace

Example with Tracing

import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";
evalite("Test Capitals", {
data: async () => [
{
input: `What's the capital of France?`,
expected: "Paris",
},
{
input: `What's the capital of Germany?`,
expected: "Berlin",
},
],
task: async (input) => {
const result = streamText({
model: wrapAISDKModel(openai("gpt-4o-mini")),
system: `Answer the question concisely.`,
prompt: input,
});
// All calls are automatically traced
return await result.text;
},
scorers: [
{
name: "Exact Match",
scorer: ({ output, expected }) => (output === expected ? 1 : 0),
},
],
});

Manual Traces with reportTrace()

For non-AI SDK calls or custom processing steps, use reportTrace():

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";
import { reportTrace } from "evalite/traces";

evalite("Multi-Step Analysis", {
  data: [{ input: "Analyze this text" }],
  task: async (input) => {
    // Custom processing step (preprocess is your own function)
    const preprocessed = preprocess(input);
    reportTrace({
      input: { raw: input },
      output: { preprocessed },
    });
    // AI SDK call (automatically traced)
    const result = await generateText({
      model: wrapAISDKModel(openai("gpt-4")),
      prompt: preprocessed,
    });
    return result.text;
  },
});

Caching

wrapAISDKModel automatically caches LLM responses to:

  • Reduce costs - Avoid redundant API calls
  • Speed up development - Instant responses for repeated inputs
  • Improve reliability - Consistent outputs during testing

Caching works for both tasks and scorers. Cache hits are tracked separately and displayed in the UI.

How Caching Works

When enabled, Evalite:

  1. Generates a cache key from model + parameters + prompt (see the sketch after this list)
  2. Checks if a response exists for that key
  3. Returns cached response (0 tokens used) or executes call
  4. Stores new responses in cache (24 hour TTL)
  5. Shows cache hits in UI with saved duration
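
Conceptually, the flow looks something like the sketch below. This is a simplified assumption, not Evalite's actual implementation: the key derivation, in-memory storage, and TTL handling are illustrative only.

import { createHash } from "node:crypto";

// Hypothetical in-memory cache, for illustration only.
const cache = new Map<string, { response: string; expiresAt: number }>();
const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour TTL

async function cachedCall(
  modelId: string,
  params: Record<string, unknown>,
  prompt: string,
  execute: () => Promise<string>,
): Promise<string> {
  // 1. Derive a cache key from model + parameters + prompt
  const key = createHash("sha256")
    .update(JSON.stringify({ modelId, params, prompt }))
    .digest("hex");

  // 2-3. Return the cached response if present and not expired
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.response;

  // 3-4. Otherwise execute the call and store the fresh response
  const response = await execute();
  cache.set(key, { response, expiresAt: Date.now() + TTL_MS });
  return response;
}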

Configuration

Config file (evalite.config.ts):

import { defineConfig } from "evalite/config";

export default defineConfig({
  cache: false, // Disable caching
});

CLI flag:

evalite --no-cache # Disable for single run
evalite watch --no-cache # Disable in watch mode

Runtime (programmatic usage):

import { runEvalite } from "evalite";

await runEvalite({
  cacheEnabled: false,
  mode: "run-once-and-exit",
});

Precedence: Runtime > Config > Default (true)

Cache Indicators in UI

The UI shows:

  • Cache hit icon (⚡) next to evals with cached responses
  • Count of cache hits per eval
  • Separate tracking for task vs scorer cache hits
  • Saved duration in milliseconds

Per-Model Configuration

Disable caching for specific models while keeping it enabled globally:

import { wrapAISDKModel } from "evalite/ai-sdk";
import { openai } from "@ai-sdk/openai";

// Caching disabled for this model only
const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  caching: false,
});

Disable tracing for specific models:

const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  tracing: false,
});

Complete Example

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { faithfulness } from "evalite/scorers";
import { wrapAISDKModel } from "evalite/ai-sdk";

// Wrap once, use everywhere
const model = wrapAISDKModel(openai("gpt-4o-mini"));

evalite("RAG System", {
  data: async () => [
    {
      input: "What is Evalite?",
      expected: {
        groundTruth: ["Evalite is a tool for testing LLM applications."],
      },
    },
  ],
  task: async (input) => {
    // Both calls are traced and cached
    const context = await generateText({
      model,
      prompt: `Retrieve context for: ${input}`,
    });
    const result = await generateText({
      model,
      prompt: `Answer using context: ${context.text}\n\nQuestion: ${input}`,
    });
    return result.text;
  },
  scorers: [
    {
      scorer: ({ input, output, expected }) =>
        // Scorer LLM calls are also cached
        faithfulness({
          question: input,
          answer: output,
          groundTruth: expected.groundTruth,
          model,
        }),
    },
  ],
});

Best Practices

  1. Wrap models once - Create wrapped models at module level and reuse them across evals (see the sketch after this list)
  2. Keep caching enabled during development - Speeds up iteration and reduces costs
  3. Disable cache for production runs - Use --no-cache for final evaluation runs
  4. Use tracing for debugging - Inspect traces to understand multi-step LLM workflows
  5. Cache is safe for deterministic tests - Same inputs always produce same cached outputs
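
For the first practice, a minimal sketch is a small shared module that wraps each model once (the file and export names here are placeholders, not an Evalite convention):

// models.ts - shared module that wraps each model once
import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";

export const gpt4oMini = wrapAISDKModel(openai("gpt-4o-mini"));

Eval files then import gpt4oMini instead of wrapping the model themselves, so every eval uses the same wrapped model.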

Caching with Trial Count

For non-deterministic evaluations, caching can give false confidence: a lucky correct answer gets cached and replayed on every run, making the eval look more reliable than it is. Measuring quality and accuracy properly requires many samples.

To solve this, combine caching with trialCount: each trial busts the cache and runs fresh, so you get multiple samples while still benefiting from caching during development:

evalite("Non-deterministic Eval", {
data: [...],
task: async (input) => {
const model = wrapAISDKModel(openai("gpt-4"));
// ...
},
trialCount: 3, // Runs 3 times, cache busted each trial
});
