These are the docs for the beta version of Evalite. Install with pnpm add evalite@beta

Vercel AI SDK

Deep integration with Vercel’s AI SDK for automatic tracing and caching of LLM calls.

Setup

Wrap your AI SDK models with wrapAISDKModel to enable tracing and caching:

import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";
const model = wrapAISDKModel(openai("gpt-4o-mini"));

This single wrapper provides both automatic tracing and intelligent caching of LLM responses.
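
The wrapped model is a drop-in replacement for the original, so it can be passed to any AI SDK call. A minimal sketch, assuming generateText and an illustrative prompt:

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { wrapAISDKModel } from "evalite/ai-sdk";

const model = wrapAISDKModel(openai("gpt-4o-mini"));

// Use the wrapped model exactly like the unwrapped one;
// tracing and caching kick in when this runs inside an eval.
const { text } = await generateText({
  model,
  prompt: "What's the capital of France?",
});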

Tracing

wrapAISDKModel automatically captures all LLM calls made through the AI SDK, including:

  • Full prompt/messages
  • Model responses (text and tool calls)
  • Token usage
  • Timing information

Viewing Traces

Traces appear in the Evalite UI under each test case:

  1. Navigate to an eval result
  2. Click on a specific test case
  3. View the “Traces” section to see all LLM calls
  4. Inspect input, output, and timing for each trace

Example with Tracing

import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";
evalite("Test Capitals", {
data: async () => [
{
input: `What's the capital of France?`,
expected: "Paris",
},
{
input: `What's the capital of Germany?`,
expected: "Berlin",
},
],
task: async (input) => {
const result = streamText({
model: wrapAISDKModel(openai("gpt-4o-mini")),
system: `Answer the question concisely.`,
prompt: input,
});
// All calls are automatically traced
return await result.text;
},
scorers: [
{
name: "Exact Match",
scorer: ({ output, expected }) => (output === expected ? 1 : 0),
},
],
});

Manual Traces with reportTrace()

For non-AI SDK calls or custom processing steps, use reportTrace():

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { wrapAISDKModel } from "evalite/ai-sdk";
import { reportTrace } from "evalite/traces";

evalite("Multi-Step Analysis", {
  data: [{ input: "Analyze this text" }],
  task: async (input) => {
    // Custom processing step (preprocess is your own function)
    const preprocessed = preprocess(input);
    reportTrace({
      input: { raw: input },
      output: { preprocessed },
    });
    // AI SDK call (automatically traced)
    const result = await generateText({
      model: wrapAISDKModel(openai("gpt-4")),
      prompt: preprocessed,
    });
    return result.text;
  },
});

Caching

wrapAISDKModel automatically caches LLM responses to:

  • Reduce costs - Avoid redundant API calls
  • Speed up development - Instant responses for repeated inputs
  • Improve reliability - Consistent outputs during testing

Caching works for both tasks and scorers. Cache hits are tracked separately and displayed in the UI.

How Caching Works

When enabled, Evalite:

  1. Generates a cache key from model + parameters + prompt (see the sketch after this list)
  2. Checks if a response exists for that key
  3. Returns cached response (0 tokens used) or executes call
  4. Stores new responses in cache (24 hour TTL)
  5. Shows cache hits in UI with saved duration
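
Conceptually, the flow looks something like the sketch below. This is a simplified assumption, not Evalite's actual implementation: the key derivation, in-memory storage, and TTL handling are illustrative only.

import { createHash } from "node:crypto";

// Hypothetical in-memory cache, for illustration only.
const cache = new Map<string, { response: string; expiresAt: number }>();
const TTL_MS = 24 * 60 * 60 * 1000; // 24-hour TTL

async function cachedCall(
  modelId: string,
  params: Record<string, unknown>,
  prompt: string,
  execute: () => Promise<string>,
): Promise<string> {
  // 1. Derive a cache key from model + parameters + prompt
  const key = createHash("sha256")
    .update(JSON.stringify({ modelId, params, prompt }))
    .digest("hex");

  // 2-3. Return the cached response if present and not expired
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.response;

  // 3-4. Otherwise execute the call and store the fresh response
  const response = await execute();
  cache.set(key, { response, expiresAt: Date.now() + TTL_MS });
  return response;
}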

Configuration

Config file (evalite.config.ts):

import { defineConfig } from "evalite/config";

export default defineConfig({
  cache: false, // Disable caching
});

CLI flag:

evalite --no-cache # Disable for single run
evalite watch --no-cache # Disable in watch mode

Runtime (programmatic usage):

import { runEvalite } from "evalite";

await runEvalite({
  cacheEnabled: false,
  mode: "run-once-and-exit",
});

Precedence: Runtime > Config > Default (true)

Cache Indicators in UI

The UI shows:

  • Cache hit icon (⚡) next to evals with cached responses
  • Count of cache hits per eval
  • Separate tracking for task vs scorer cache hits
  • Saved duration in milliseconds

Per-Model Configuration

Disable caching for specific models while keeping it enabled globally:

import { wrapAISDKModel } from "evalite/ai-sdk";
import { openai } from "@ai-sdk/openai";

// Caching disabled for this model only
const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  caching: false,
});

Disable tracing for specific models:

const model = wrapAISDKModel(openai("gpt-4o-mini"), {
  tracing: false,
});

Complete Example

import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";
import { evalite } from "evalite";
import { faithfulness } from "evalite/scorers";
import { wrapAISDKModel } from "evalite/ai-sdk";

// Wrap once, use everywhere
const model = wrapAISDKModel(openai("gpt-4o-mini"));

evalite("RAG System", {
  data: async () => [
    {
      input: "What is Evalite?",
      expected: {
        groundTruth: ["Evalite is a tool for testing LLM applications."],
      },
    },
  ],
  task: async (input) => {
    // Both calls are traced and cached
    const context = await generateText({
      model,
      prompt: `Retrieve context for: ${input}`,
    });
    const result = await generateText({
      model,
      prompt: `Answer using context: ${context.text}\n\nQuestion: ${input}`,
    });
    return result.text;
  },
  scorers: [
    {
      scorer: ({ input, output, expected }) =>
        // Scorer LLM calls are also cached
        faithfulness({
          question: input,
          answer: output,
          groundTruth: expected.groundTruth,
          model,
        }),
    },
  ],
});

Best Practices

  1. Wrap models once - Create wrapped models at module level and reuse them across evals (see the sketch after this list)
  2. Keep caching enabled during development - Speeds up iteration and reduces costs
  3. Disable cache for production runs - Use --no-cache for final evaluation runs
  4. Use tracing for debugging - Inspect traces to understand multi-step LLM workflows
  5. Cache is safe for deterministic tests - Same inputs always produce same cached outputs
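
For the first practice, a minimal sketch is a small shared module that wraps each model once (the file and export names here are placeholders, not an Evalite convention):

// models.ts - shared module that wraps each model once
import { openai } from "@ai-sdk/openai";
import { wrapAISDKModel } from "evalite/ai-sdk";

export const gpt4oMini = wrapAISDKModel(openai("gpt-4o-mini"));

Eval files then import gpt4oMini instead of wrapping the model themselves, so every eval uses the same wrapped model.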

Caching with Trial Count

For non-deterministic evaluations, caching can give false confidence: a lucky correct answer gets cached and replayed on every run, making the eval look more reliable than it is. Measuring quality and accuracy properly requires many samples.

To solve this, combine caching with trialCount: each trial busts the cache and runs fresh, so you get multiple samples while still benefiting from caching during development:

evalite("Non-deterministic Eval", {
data: [...],
task: async (input) => {
const model = wrapAISDKModel(openai("gpt-4"));
// ...
},
trialCount: 3, // Runs 3 times, cache busted each trial
});
