```bash
pnpm add evalite@beta
```

evalite()
The main function for defining evaluations in .eval.ts files.
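At a glance, an eval pairs a dataset, a task, and one or more scorers. Here is a minimal .eval.ts sketch before each option is covered in detail below (greet() is a hypothetical stand-in for the function under test):

```ts
// greeting.eval.ts — a minimal sketch; greet() is a hypothetical function under test
import { evalite } from "evalite";
import { exactMatch } from "evalite/scorers";

evalite("Greeting Generator", {
  data: [{ input: "Hello", expected: "Hi there!" }],
  task: async (input) => greet(input),
  scorers: [
    { scorer: ({ output, expected }) => exactMatch({ actual: output, expected }) },
  ],
});
```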
Signature
```ts
evalite<TInput, TOutput, TExpected = TOutput>(
  evalName: string,
  opts: {
    data:
      | Array<{ input: TInput; expected?: TExpected; only?: boolean }>
      | (() => Promise<Array<{ input: TInput; expected?: TExpected; only?: boolean }>>);
    task: (input: TInput) => Promise<TOutput> | TOutput;
    scorers?: Array<
      Scorer<TInput, TOutput, TExpected> | ScorerOpts<TInput, TOutput, TExpected>
    >;
    columns?: (opts: {
      input: TInput;
      output: TOutput;
      expected?: TExpected;
    }) =>
      | Promise<Array<{ label: string; value: unknown }>>
      | Array<{ label: string; value: unknown }>;
    trialCount?: number;
  }
): void
```

Parameters
evalName
Type: string
The name of your evaluation. This appears in the UI and test output.
```ts
evalite("Greeting Generator", {
  // ...
});
```

opts.data
Type: Array<{ input: TInput; expected?: TExpected; only?: boolean }> or () => Promise<Array<...>>
The dataset for your evaluation. Each item becomes a separate test case.
Can be an array or an async function that returns an array.
```ts
// Static array
evalite("My Eval", {
  data: [
    { input: "Hello", expected: "Hi there!" },
    { input: "Goodbye", expected: "See you later!" },
  ],
  // ...
});
```
```ts
// Async function
evalite("My Eval", {
  data: async () => {
    const dataset = await fetch("/api/dataset").then((r) => r.json());
    return dataset;
  },
  // ...
});
```

only flag: Mark specific data points to run exclusively during development:
```ts
evalite("My Eval", {
  data: [
    { input: "test1", only: true }, // Only this will run
    { input: "test2" },
    { input: "test3" },
  ],
  // ...
});
```

opts.task
Type: (input: TInput) => Promise<TOutput> | TOutput
The function to test. Receives input from data, returns output to be scored.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }],
    });
    return response.choices[0].message.content;
  },
  // ...
});
```

opts.scorers
Type: Array<Scorer | ScorerOpts> (optional)
Functions that evaluate the output quality. Each scorer returns a score between 0 and 1.
```ts
evalite("My Eval", {
  data: [{ input: "Hello", expected: "Hi" }],
  task: async (input) => callLLM(input),
  scorers: [
    // Inline scorer
    {
      name: "Exact Match",
      scorer: ({ output, expected }) => {
        return output === expected ? 1 : 0;
      },
    },
    // Using createScorer
    createScorer({
      name: "Length Check",
      scorer: ({ output }) => {
        return output.length > 10 ? 1 : 0;
      },
    }),
  ],
});
```

See createScorer() for more details.
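Because scorers built with createScorer() are plain values, they can be factored into a shared module and reused across eval files. A sketch of that pattern, assuming createScorer is exported from the evalite package (the file layout and the String() coercion are illustrative choices, not part of the API):

```ts
// scorers.ts — a reusable scorer factored out of an eval file (illustrative layout)
import { createScorer } from "evalite";

export const lengthCheck = createScorer({
  name: "Length Check",
  // String() keeps the sketch type-safe without pinning down createScorer's generics
  scorer: ({ output }) => (String(output).length > 10 ? 1 : 0),
});
```

```ts
// greetings.eval.ts — the shared scorer plugs straight into the scorers array
import { evalite } from "evalite";
import { lengthCheck } from "./scorers";

evalite("Greeting Generator", {
  data: [{ input: "Hello", expected: "Hi there!" }],
  task: async (input) => callLLM(input),
  scorers: [lengthCheck],
});
```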
opts.columns
Type: (opts: { input, output, expected }) => Promise<Array<{ label, value }>> | Array<{ label, value }> (optional)
Custom columns to display in the UI alongside input/output/expected.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
  columns: ({ output }) => [
    { label: "Word Count", value: output.split(" ").length },
    { label: "Has Emoji", value: /\p{Emoji}/u.test(output) },
  ],
});
```

opts.trialCount
Type: number (optional, default: 1)
Number of times to run each test case. Useful for measuring variance in non-deterministic evaluations.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
  trialCount: 5, // Run each data point 5 times
});
```

Can also be set globally in defineConfig().
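For a global trial count, a hedged sketch of that config is below; the evalite.config.ts filename, the evalite/config import path, and the exact option shape are assumptions here — see the defineConfig() reference for the authoritative form:

```ts
// evalite.config.ts — a sketch only; filename, import path, and option name are assumptions
import { defineConfig } from "evalite/config";

export default defineConfig({
  trialCount: 3, // run every data point in every eval 3 times unless overridden per-eval
});
```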
Methods
evalite.skip()
Skip an entire evaluation.
```ts
evalite.skip("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
});
```

evalite.each()
Run the same evaluation with different variants (e.g., comparing models or prompts).
```ts
evalite.each([
  { name: "gpt-4", input: "gpt-4" },
  { name: "gpt-3.5-turbo", input: "gpt-3.5-turbo" },
])("Model Comparison", {
  data: [{ input: "Hello" }],
  task: async (input, model) => {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: "user", content: input }],
    });
    return response.choices[0].message.content;
  },
});
```

only flag: Mark specific variants to run exclusively during development:
```ts
evalite.each([
  { name: "gpt-4", input: "gpt-4" },
  { name: "gpt-3.5-turbo", input: "gpt-3.5-turbo", only: true }, // Only this variant will run
])("Model Comparison", {
  data: [{ input: "Hello" }],
  task: async (input, model) => {
    // ...
  },
});
```

See Comparing Different Approaches for more details.
Example
```ts
import { evalite } from "evalite";
import { exactMatch } from "evalite/scorers";

evalite("Greeting Generator", {
  data: async () => {
    return [
      { input: "Hello", expected: "Hi there!" },
      { input: "Good morning", expected: "Good morning to you!" },
      { input: "Howdy", expected: "Howdy partner!" },
    ];
  },
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        {
          role: "system",
          content: "Generate a friendly greeting response.",
        },
        { role: "user", content: input },
      ],
    });
    return response.choices[0].message.content;
  },
  scorers: [
    {
      scorer: ({ output, expected }) =>
        exactMatch({ actual: output, expected }),
    },
  ],
  columns: ({ output }) => [{ label: "Length", value: output.length }],
});
```