```bash
pnpm add evalite@beta
```

evalite()
The main function for defining evaluations in .eval.ts files.
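At a glance, an eval pairs a dataset, a task, and one or more scorers. Here is a minimal .eval.ts sketch before each option is covered in detail below (greet() is a hypothetical stand-in for the function under test):

```ts
// greeting.eval.ts — a minimal sketch; greet() is a hypothetical function under test
import { evalite } from "evalite";
import { exactMatch } from "evalite/scorers";

evalite("Greeting Generator", {
  data: [{ input: "Hello", expected: "Hi there!" }],
  task: async (input) => greet(input),
  scorers: [
    { scorer: ({ output, expected }) => exactMatch({ actual: output, expected }) },
  ],
});
```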
Signature
```ts
evalite<TInput, TOutput, TExpected = TOutput>(
  evalName: string,
  opts: {
    data:
      | Array<{ input: TInput; expected?: TExpected; only?: boolean }>
      | (() => Promise<Array<{ input: TInput; expected?: TExpected; only?: boolean }>>);
    task: (input: TInput) => Promise<TOutput> | TOutput;
    scorers?: Array<
      Scorer<TInput, TOutput, TExpected> | ScorerOpts<TInput, TOutput, TExpected>
    >;
    columns?: (opts: {
      input: TInput;
      output: TOutput;
      expected?: TExpected;
    }) =>
      | Promise<Array<{ label: string; value: unknown }>>
      | Array<{ label: string; value: unknown }>;
    trialCount?: number;
  }
): void
```

Parameters
evalName
Type: string
The name of your evaluation. This appears in the UI and test output.
```ts
evalite("Greeting Generator", {
  // ...
});
```

opts.data
Type: Array<{ input: TInput; expected?: TExpected; only?: boolean }> or () => Promise<Array<...>>
The dataset for your evaluation. Each item becomes a separate test case.
Can be an array or an async function that returns an array.
```ts
// Static array
evalite("My Eval", {
  data: [
    { input: "Hello", expected: "Hi there!" },
    { input: "Goodbye", expected: "See you later!" },
  ],
  // ...
});
```
```ts
// Async function
evalite("My Eval", {
  data: async () => {
    const dataset = await fetch("/api/dataset").then((r) => r.json());
    return dataset;
  },
  // ...
});
```

only flag: Mark specific data points to run exclusively during development:
```ts
evalite("My Eval", {
  data: [
    { input: "test1", only: true }, // Only this will run
    { input: "test2" },
    { input: "test3" },
  ],
  // ...
});
```

opts.task
Type: (input: TInput) => Promise<TOutput> | TOutput
The function to test. Receives input from data, returns output to be scored.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: input }],
    });
    return response.choices[0].message.content;
  },
  // ...
});
```

opts.scorers
Type: Array<Scorer | ScorerOpts> (optional)
Functions that evaluate the output quality. Each scorer returns a score between 0 and 1.
```ts
evalite("My Eval", {
  data: [{ input: "Hello", expected: "Hi" }],
  task: async (input) => callLLM(input),
  scorers: [
    // Inline scorer
    {
      name: "Exact Match",
      scorer: ({ output, expected }) => {
        return output === expected ? 1 : 0;
      },
    },
    // Using createScorer
    createScorer({
      name: "Length Check",
      scorer: ({ output }) => {
        return output.length > 10 ? 1 : 0;
      },
    }),
  ],
});
```

See createScorer() for more details.
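Because scorers built with createScorer() are plain values, they can be factored into a shared module and reused across eval files. A sketch of that pattern, assuming createScorer is exported from the evalite package (the file layout and the String() coercion are illustrative choices, not part of the API):

```ts
// scorers.ts — a reusable scorer factored out of an eval file (illustrative layout)
import { createScorer } from "evalite";

export const lengthCheck = createScorer({
  name: "Length Check",
  // String() keeps the sketch type-safe without pinning down createScorer's generics
  scorer: ({ output }) => (String(output).length > 10 ? 1 : 0),
});
```

```ts
// greetings.eval.ts — the shared scorer plugs straight into the scorers array
import { evalite } from "evalite";
import { lengthCheck } from "./scorers";

evalite("Greeting Generator", {
  data: [{ input: "Hello", expected: "Hi there!" }],
  task: async (input) => callLLM(input),
  scorers: [lengthCheck],
});
```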
opts.columns
Type: (opts: { input, output, expected }) => Promise<Array<{ label, value }>> | Array<{ label, value }> (optional)
Custom columns to display in the UI alongside input/output/expected.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
  columns: ({ output }) => [
    { label: "Word Count", value: output.split(" ").length },
    { label: "Has Emoji", value: /\p{Emoji}/u.test(output) },
  ],
});
```

opts.trialCount
Type: number (optional, default: 1)
Number of times to run each test case. Useful for measuring variance in non-deterministic evaluations.
```ts
evalite("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
  trialCount: 5, // Run each data point 5 times
});
```

Can also be set globally in defineConfig().
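For a global trial count, a hedged sketch of that config is below; the evalite.config.ts filename, the evalite/config import path, and the exact option shape are assumptions here — see the defineConfig() reference for the authoritative form:

```ts
// evalite.config.ts — a sketch only; filename, import path, and option name are assumptions
import { defineConfig } from "evalite/config";

export default defineConfig({
  trialCount: 3, // run every data point in every eval 3 times unless overridden per-eval
});
```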
Methods
evalite.skip()
Skip an entire evaluation.
```ts
evalite.skip("My Eval", {
  data: [{ input: "Hello" }],
  task: async (input) => callLLM(input),
});
```

evalite.each()
Run the same evaluation with different variants (e.g., comparing models or prompts).
```ts
evalite.each([
  { name: "gpt-4", input: "gpt-4" },
  { name: "gpt-3.5-turbo", input: "gpt-3.5-turbo" },
])("Model Comparison", {
  data: [{ input: "Hello" }],
  task: async (input, model) => {
    const response = await openai.chat.completions.create({
      model,
      messages: [{ role: "user", content: input }],
    });
    return response.choices[0].message.content;
  },
});
```

only flag: Mark specific variants to run exclusively during development:
```ts
evalite.each([
  { name: "gpt-4", input: "gpt-4" },
  { name: "gpt-3.5-turbo", input: "gpt-3.5-turbo", only: true }, // Only this variant will run
])("Model Comparison", {
  data: [{ input: "Hello" }],
  task: async (input, model) => {
    // ...
  },
});
```

See Comparing Different Approaches for more details.
Example
```ts
import { evalite } from "evalite";
import { exactMatch } from "evalite/scorers";

evalite("Greeting Generator", {
  data: async () => {
    return [
      { input: "Hello", expected: "Hi there!" },
      { input: "Good morning", expected: "Good morning to you!" },
      { input: "Howdy", expected: "Howdy partner!" },
    ];
  },
  task: async (input) => {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        {
          role: "system",
          content: "Generate a friendly greeting response.",
        },
        { role: "user", content: input },
      ],
    });
    return response.choices[0].message.content;
  },
  scorers: [
    {
      scorer: ({ output, expected }) =>
        exactMatch({ actual: output, expected }),
    },
  ],
  columns: ({ output }) => [{ label: "Length", value: output.length }],
});
```