These are the docs for the beta version of Evalite. Install with pnpm add evalite@beta

toolCallAccuracy

Checks if your AI is calling the right functions with the right parameters.

When to use: For AI agents that need to call external functions/APIs. Essential for:

Multi-step agents that orchestrate API calls
Assistants that execute user commands via tools
Workflow automation systems
Testing function calling reliability

When NOT to use: For simple text generation tasks without function calling. Use exactMatch or answerSimilarity instead.

Example

import { openai } from "@ai-sdk/openai";
import { generateText, tool } from "ai";
import { evalite } from "evalite";
import { toolCallAccuracy } from "evalite/scorers/deterministic";
import { z } from "zod";

evalite("Tool Calls", {
  data: [
    {
      input: "What calendar events do I have today?",
      expected: [
        {
          toolName: "getCalendarEvents",
          input: {
            date: "2024-04-27",
          },
        },
      ],
    },
  ],
  task: async (input) => {
    const result = await generateText({
      model: openai("gpt-4o-mini"),
      prompt: input,
      tools: {
        getCalendarEvents: tool({
          description: "Get calendar events",
          inputSchema: z.object({
            date: z.string(),
          }),
          execute: async ({ date }) => {
            return `You have a meeting with John on ${date}`;
          },
        }),
      },
      system: `You are a helpful assistant. Today's date is 2024-04-27.`,
    });

    return result.toolCalls;
  },
  scorers: [
    {
      scorer: ({ output, expected }) => {
        return toolCallAccuracy({
          actualCalls: output,
          expectedCalls: expected,
        });
      },
    },
  ],
});

Signature

async function toolCallAccuracy(opts: {
  actualCalls: ToolCall[];
  expectedCalls: ToolCall[];
  mode?: "exact" | "flexible";
  weights?: Partial<{
    exact: number;
    nameOnly: number;
    extraPenalty: number;
    wrongPenalty: number;
  }>;
}): Promise<{
  name: string;
  description: string;
  score: number;
  metadata: {
    mode: "exact" | "flexible";
    exactMatches: number;
    nameOnlyMatches: number;
    // Mode-specific fields
  };
}>;

type ToolCall = {
  toolName: string;
  input?: unknown;
};

Parameters

actualCalls

Type: ToolCall[]

The actual tool calls made by the AI. Typically the toolCalls array from your AI SDK response.

expectedCalls

Type: ToolCall[]

Expected tool calls as an array of objects with toolName and optional input properties.

mode

Type: "exact" | "flexible" Default: "exact"

Comparison mode:

"exact": Calls must happen in exact order with exact arguments
"flexible": Calls can happen in any order, only checks correct functions were called

weights

Type: Partial<ToolCallAccuracyWeights>

Custom weights for scoring components:

exact (default: 1.0) - Full credit for correct function + correct arguments
nameOnly (default: 0.5) - Partial credit for correct function + wrong arguments
extraPenalty (default: 0.25) - Penalty for extra calls (flexible mode only)
wrongPenalty (default: 0.25) - Penalty for wrong/missing calls (exact mode only)

Scoring Modes

Exact Mode

Enforces strict ordering. Calls must appear in the same sequence as expected.

toolCallAccuracy({
  actualCalls: [
    { toolName: "getTasks" },
    { toolName: "createTask", input: { title: "Buy milk" } },
  ],
  expectedCalls: [
    { toolName: "getTasks" },
    { toolName: "createTask", input: { title: "Buy milk" } },
  ],
  mode: "exact",
});
// Score: 1.0 (perfect match)

Flexible Mode

Allows calls in any order. Useful for multi-step agents where execution order varies.

toolCallAccuracy({
  actualCalls: [
    { toolName: "createTask", input: { title: "Buy milk" } },
    { toolName: "getTasks" }, // Order swapped
  ],
  expectedCalls: [
    { toolName: "getTasks" },
    { toolName: "createTask", input: { title: "Buy milk" } },
  ],
  mode: "flexible",
});
// Score: 1.0 (still perfect - order doesn't matter)

Return Value

Returns an object with:

name: “Tool Call Accuracy”
description: “Checks if the tool calls are correct”
score: Number between 0-1
metadata: Object with match counts and mode-specific details

Exact Mode Metadata

{
  mode: "exact",
  exactMatches: number,
  nameOnlyMatches: number,
  wrongOrMissing: number,
  totals: {
    reference: number,
    output: number
  }
}

Flexible Mode Metadata

{
  mode: "flexible",
  exactMatches: number,
  nameOnlyMatches: number,
  extras: number,
  missing: number,
  totals: {
    reference: number,
    output: number
  },
  details: {
    matches: ToolCall[],
    nameOnlyMatches: ToolCall[],
    extras: ToolCall[],
    missingToolCalls: ToolCall[]
  }
}