Vercel AI SDK

Deep integration with Vercel’s AI SDK for automatic tracing and caching of LLM calls.

```bash
pnpm add evalite@beta
```
Setup
Wrap your AI SDK models with wrapAISDKModel to enable tracing and caching:
```ts
import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";

const model = wrapAISDKModel(openai("gpt-4o-mini"));
```

This single wrapper provides both automatic tracing and intelligent caching of LLM responses.
Tracing
wrapAISDKModel automatically captures all LLM calls made through the AI SDK, including:
- Full prompt/messages
- Model responses (text and tool calls)
- Token usage
- Timing information
Viewing Traces
Traces appear in the Evalite UI under each test case:
- Navigate to an eval result
- Click on a specific test case
- View the “Traces” section to see all LLM calls
- Inspect input, output, and timing for each trace
Example with Tracing
```ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";

evalite("Test Capitals", {
  data: async () => [
    {
      input: `What's the capital of France?`,
      expected: "Paris",
    },
    {
      input: `What's the capital of Germany?`,
      expected: "Berlin",
    },
  ],
  task: async (input) => {
    const result = streamText({
      model: wrapAISDKModel(openai("gpt-4o-mini")),
      system: `Answer the question concisely.`,
      prompt: input,
    });

    // All calls are automatically traced
    return await result.text;
  },
  scorers: [
    {
      name: "Exact Match",
      scorer: ({ output, expected }) => (output === expected ? 1 : 0),
    },
  ],
});
```

Manual Traces with reportTrace()
For non-AI SDK calls or custom processing steps, use reportTrace():
```ts
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";
import { reportTrace } from "evalite/traces";

evalite("Multi-Step Analysis", {
  data: [{ input: "Analyze this text" }],
  task: async (input) => {
    // Custom processing step
    const preprocessed = preprocess(input);
    reportTrace({
      input: { raw: input },
      output: { preprocessed },
    });

    // AI SDK call (automatically traced)
    const result = await generateText({
      model: wrapAISDKModel(openai("gpt-4")),
      prompt: preprocessed,
    });

    return result.text;
  },
});
```

Caching
wrapAISDKModel automatically caches LLM responses to:
- Reduce costs - Avoid redundant API calls
- Speed up development - Instant responses for repeated inputs
- Improve reliability - Consistent outputs during testing
Caching works for both tasks and scorers. Cache hits are tracked separately and displayed in the UI.
How Caching Works
When enabled, Evalite:
- Generates a cache key from model + parameters + prompt (see the sketch after this list)
- Checks if a response exists for that key
- Returns cached response (0 tokens used) or executes call
- Stores new responses in cache (24 hour TTL)
- Shows cache hits in UI with saved duration
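The cache key in step 1 can be pictured as a stable hash over the model identifier, the call parameters, and the prompt. The sketch below is not Evalite’s actual implementation; the makeCacheKey helper is hypothetical and only illustrates why identical calls hit the same cache entry:

```ts
import { createHash } from "node:crypto";

// Hypothetical sketch: derive a cache key from model + parameters + prompt.
// Evalite's real key derivation may differ; this only illustrates the idea.
function makeCacheKey(
  modelId: string,
  params: Record<string, unknown>,
  prompt: unknown,
): string {
  const payload = JSON.stringify({ modelId, params, prompt });
  return createHash("sha256").update(payload).digest("hex");
}

// Identical calls map to the same key, so the second call can be served from cache.
const a = makeCacheKey("gpt-4o-mini", { temperature: 0 }, "What's the capital of France?");
const b = makeCacheKey("gpt-4o-mini", { temperature: 0 }, "What's the capital of France?");
console.log(a === b); // true
```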
Configuration
Config file (evalite.config.ts):
```ts
import { defineConfig } from "evalite/config";

export default defineConfig({
  cache: false, // Disable caching
});
```

CLI flag:

```bash
evalite --no-cache        # Disable for single run
evalite watch --no-cache  # Disable in watch mode
```

Runtime (programmatic usage):

```ts
import { runEvalite } from "evalite";

await runEvalite({
  cacheEnabled: false,
  mode: "run-once-and-exit",
});
```

Precedence: Runtime > Config > Default (true)
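As a concrete example of the precedence rule, a runtime option overrides the config file. This sketch combines the two APIs shown above and assumes evalite.config.ts sets cache: false as in the config example:

```ts
// Sketch of precedence: the runtime option wins over the config file.
// Assume evalite.config.ts sets `cache: false` (as shown above).
import { runEvalite } from "evalite";

await runEvalite({
  cacheEnabled: true, // overrides the config file, so caching is ON for this run
  mode: "run-once-and-exit",
});
```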
Cache Indicators in UI
The UI shows:
- Cache hit icon (⚡) next to evals with cached responses
- Count of cache hits per eval
- Separate tracking for task vs scorer cache hits
- Saved duration in milliseconds
Per-Model Configuration
Disable caching for specific models while keeping it enabled globally:
```ts
import { wrapAISDKModel } from "evalite/ai-sdk";
import { openai } from "@ai-sdk/openai";

// Caching disabled for this model only
const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  caching: false,
});
```

Disable tracing for specific models:

```ts
const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  tracing: false,
});
```

Complete Example
```ts
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { faithfulness } from "evalite/scorers";
import { wrapAISDKModel } from "evalite/ai-sdk";

// Wrap once, use everywhere
const model = wrapAISDKModel(openai("gpt-4o-mini"));

evalite("RAG System", {
  data: async () => [
    {
      input: "What is Evalite?",
      expected: {
        groundTruth: ["Evalite is a tool for testing LLM applications."],
      },
    },
  ],
  task: async (input) => {
    // Both calls are traced and cached
    const context = await generateText({
      model,
      prompt: `Retrieve context for: ${input}`,
    });

    const result = await generateText({
      model,
      prompt: `Answer using context: ${context.text}\n\nQuestion: ${input}`,
    });

    return result.text;
  },
  scorers: [
    {
      scorer: ({ input, output, expected }) =>
        // Scorer LLM calls are also cached
        faithfulness({
          question: input,
          answer: output,
          groundTruth: expected.groundTruth,
          model,
        }),
    },
  ],
});
```

Best Practices
- Wrap models once - Create wrapped models at module level, reuse across evals (see the sketch after this list)
- Keep caching enabled during development - Speeds up iteration and reduces costs
- Disable cache for production runs - Use --no-cache for final evaluation runs
- Use tracing for debugging - Inspect traces to understand multi-step LLM workflows
- Cache is safe for deterministic tests - Same inputs always produce same cached outputs
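For the first practice, a common pattern is to create wrapped models in one shared module and import them from each eval file. A minimal sketch, assuming hypothetical models.ts and capitals.eval.ts files (the file names are illustrative):

```ts
// models.ts (hypothetical shared module) - wrap once at module level
import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";

export const miniModel = wrapAISDKModel(openai("gpt-4o-mini"));
```

```ts
// capitals.eval.ts (hypothetical eval file) - reuse the wrapped model
import { generateText } from "ai";
import { evalite } from "evalite";
import { miniModel } from "./models";

evalite("Reuses Wrapped Model", {
  data: [{ input: "What's the capital of France?", expected: "Paris" }],
  task: async (input) => {
    // Traced and cached via the shared wrapped model
    const result = await generateText({ model: miniModel, prompt: input });
    return result.text;
  },
  scorers: [
    {
      name: "Exact Match",
      scorer: ({ output, expected }) => (output === expected ? 1 : 0),
    },
  ],
});
```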
Caching with Trial Count
For non-deterministic evaluations, you might worry that a lucky correct answer gets cached, giving false confidence in reliability. Quality and accuracy require many samples to measure properly.
Combine caching with trialCount to solve this - each trial busts the cache and runs fresh, so you get multiple independent samples while still benefiting from caching during development:
evalite("Non-deterministic Eval", { data: [...], task: async (input) => { const model = wrapAISDKModel(openai("gpt-4")); // ... }, trialCount: 3, // Runs 3 times, cache busted each trial});See Also
- wrapAISDKModel() API Reference - Full API documentation
- reportTrace() API Reference - Manual trace reporting
- Configuration Guide - Cache configuration options
- CLI Reference - Command-line flags