Evaluation is the process of running your production logic against a dataset and scoring its results with another function.

Why do we need evaluations?

TL;DR: Evaluations help bring rigor to the AI development process.

During development, it is natural to “vibe-check” the outputs of LLMs, and it is a good way to get a sense of how well your model or prompt is performing. However, vibe-checking is neither scalable nor reliable. We like to think of evaluations as something like unit tests that run your model against a dataset. The difference is that the result is not binary (pass/fail) but a score or a set of scores.

This approach allows you to control most variables and check one thing in isolation, for example, how a new prompt performs compared to the existing one.

Key concepts

  • Executor – the function being evaluated, often your prompt or (to-be-)production logic
  • Evaluator – the function evaluating the results
  • Dataset – collection of datapoints to run executors and evaluators against
  • Datapoint – two values, data and target, each represented as free-form JSON.
    • data – data sent to the executor. Required.
    • target – data sent to the evaluator. Usually contains the expected or target outputs of certain parts of your executor. Optional.
  • Evaluation group – a group of evaluations that evaluate one feature or one part of your application. Results of evaluations in the same group are aggregated and visualized together.

Example datapoint

{
  "data": {"topic": "flowers"},
  "target": "This is a good poem about flowers"
}
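The datapoint shape described above can be sketched as a TypeScript type. The names Json, Datapoint, and isDatapoint are hypothetical and chosen for illustration; the SDK's own type names may differ.

```typescript
// Hypothetical types for illustration; the SDK's own type names may differ.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface Datapoint {
    data: Json;     // required: passed to the executor
    target?: Json;  // optional: passed to the evaluator
}

// Runtime check that a parsed value is an object with a `data` field.
function isDatapoint(value: unknown): value is Datapoint {
    return typeof value === 'object' && value !== null && 'data' in value;
}

// The example datapoint from above, parsed from its JSON form.
const example: Datapoint = JSON.parse(
    '{"data": {"topic": "flowers"}, "target": "This is a good poem about flowers"}',
);
```

Because target is optional, a value such as { data: { topic: 'cats' } } also passes the check, while an object without data does not.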

Flow overview

For every datapoint in the dataset, evaluation does the following:

  1. Pass the data as an argument to the executor.
  2. Run the executor.
  3. Store the executor output.
  4. Pass the executor output and the target to the evaluator function.
  5. Store the evaluator's result, which is either a numeric score or a JSON object with several numeric scores, in the results of the evaluation.
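The steps above can be sketched as a plain loop. This is a simplified, synchronous stand-in written for illustration, not the SDK's evaluate function: runEvaluation, Result, and Scores are hypothetical names, and a real executor that calls an LLM would typically be asynchronous.

```typescript
// Hypothetical stand-in for illustration; this is not the SDK's `evaluate`.
type Scores = number | Record<string, number>;

interface Datapoint<D, T> {
    data: D;
    target?: T;
}

interface Result<O> {
    output: O;
    scores: Record<string, Scores>;
}

function runEvaluation<D, T, O>(
    dataset: Datapoint<D, T>[],
    executor: (data: D) => O,
    evaluators: Record<string, (output: O, target?: T) => Scores>,
): Result<O>[] {
    return dataset.map(({ data, target }) => {
        // Steps 1-3: pass the data to the executor, run it, keep the output.
        const output = executor(data);
        // Steps 4-5: pass output and target to each evaluator, keep the scores.
        const scores: Record<string, Scores> = {};
        for (const name of Object.keys(evaluators)) {
            scores[name] = evaluators[name](output, target);
        }
        return { output, scores };
    });
}

const results = runEvaluation(
    [{ data: { topic: 'flowers' }, target: 'flowers' }],
    (data: { topic: string }) => `A poem about ${data.topic}`,
    {
        mentions_topic: (output: string, target?: string) =>
            target !== undefined && output.includes(target) ? 1 : 0,
    },
);
```

Each entry in results holds the stored executor output and one score per evaluator name, which mirrors how scores are keyed by evaluator name in the evaluation results.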

Example

import { evaluate } from '@lmnr-ai/lmnr';

// Executor: receives the datapoint's data
const writePoem = (data: {topic: string}) => {
    // ...
}

// Evaluator: receives the executor output and the datapoint's target
const containsPoem = (output: string, target: string) => {
    // ...
}

evaluate({
    data: [{
        data: { topic: 'flowers' },
        target: 'This is a good poem about flowers'
    }],
    executor: writePoem,
    evaluators: { "contains_poem": containsPoem }
})

In this example, the writePoem function is the executor, and the containsPoem function is the evaluator. The score produced by the evaluator is stored in the evaluation results under the contains_poem key.


Quickstart

Prerequisites

To get a project API key, go to the Laminar dashboard, open the project settings, and generate a project API key.

Specify the key at Laminar initialization. If not specified, Laminar will look for the key in the LMNR_PROJECT_API_KEY environment variable.
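The lookup order described above can be sketched as follows. resolveProjectApiKey is a hypothetical helper, not part of the Laminar SDK, and throwing when no key is found is an assumption made for this sketch rather than documented SDK behavior.

```typescript
// Hypothetical helper for illustration; not part of the Laminar SDK.
function resolveProjectApiKey(
    explicitKey: string | undefined,
    env: Record<string, string | undefined>,
): string {
    // An explicitly passed key wins; otherwise fall back to the environment.
    const key = explicitKey ?? env['LMNR_PROJECT_API_KEY'];
    if (!key) {
        // Assumption: fail loudly when no key is available.
        throw new Error('No project API key: pass one or set LMNR_PROJECT_API_KEY');
    }
    return key;
}
```

In Node.js, process.env would be passed as the second argument.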

Example. Running and registering evaluations

1. Create an evaluation file.

Create a file named my-first-evaluation.ts and add the following code:

my-first-evaluation.ts
import { evaluate } from '@lmnr-ai/lmnr';

const writePoem = (data: {topic: string}) => {
    // replace this with your LLM call or custom logic
    return `This is a good poem about ${data.topic}`
}

const containsPoem = (output: string, target: string): number => {
    return output.includes(target) ? 1 : 0
}

evaluate({
    data: [
        {
            data: { topic: 'flowers' },
            target: 'This is a good poem about flowers',
        },
        {
            data: { topic: 'cats' },
            target: 'I like cats',
        },
    ],
    executor: writePoem,
    evaluators: { contains_poem: containsPoem }
})

2. Run the evaluation

You can run evaluations both from the Laminar CLI and from code. To run the evaluation from the CLI, execute the following command:

export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
npx lmnr eval my-first-evaluation.ts

If you want to run multiple files, place them in the evals directory and give them names that end with .eval.ts or .eval.js.

├─ src/
├─ evals/
│  ├── my-first-evaluation.eval.ts
│  ├── my-second-evaluation.eval.ts
│  ├── ...

Then, run npx lmnr eval to run all the evaluation files in the evals directory.
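To check which file names follow this convention, a small filter can help. isEvalFile is a hypothetical helper that mirrors only the naming rule stated above; it is not the CLI's actual discovery logic.

```typescript
// Hypothetical helper; mirrors the naming convention only, not the CLI's discovery logic.
const isEvalFile = (fileName: string): boolean => /\.eval\.(ts|js)$/.test(fileName);

const files = [
    'my-first-evaluation.eval.ts',
    'my-second-evaluation.eval.js',
    'my-first-evaluation.ts', // no .eval suffix, so it does not match
    'helpers.ts',
];

// Keeps only the names ending in .eval.ts or .eval.js.
const evalFiles = files.filter(isEvalFile);
```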

To run the evaluation directly from code, call the evaluate function, that is, import or run the file created in the previous step.

An evaluator returns either a single numeric score or, for multiple scores, a JSON object with string keys and numeric values.
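As a minimal sketch of the multi-score form, the evaluator below returns an object with two numeric scores. The function name poemQuality, the score names, and the scoring rules are all assumptions made for illustration.

```typescript
// Hypothetical evaluator for illustration; names and scoring rules are assumptions.
const poemQuality = (output: string, target: string): Record<string, number> => ({
    // 1 if the target string appears in the output, 0 otherwise.
    contains_target: output.includes(target) ? 1 : 0,
    // Length ratio capped at 1, so a longer output does not score above 1.
    length_ratio: target.length === 0 ? 0 : Math.min(output.length / target.length, 1),
});

const scores = poemQuality('This is a good poem about cats', 'I like cats');
```

An evaluator of this shape can be passed in the evaluators object in the same way as a single-score evaluator.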

Viewing evaluation results and traces

When you run an evaluation from the CLI, Laminar will output the link to the dashboard where you can view the evaluation results.

Laminar stores every evaluation result. The run for each datapoint is represented as a trace. You can view the results and the corresponding traces on the evaluations page.

Example evaluation

In this example, we can see that the score for the first datapoint is 1, and the score for the second one is 0. This is because our evaluator function containsPoem returns 1 if the target string is found in the output string, and 0 otherwise.

We can also see the full execution trace for each datapoint. If you call an LLM in the executor or the evaluator, you will also see the LLM spans in the trace.

Here the executor called OpenAI, and there are three evaluator functions.

Grouping evaluations

To track score progression over time or compare evaluations side-by-side, you need to group them together. You do this by passing the groupId parameter to the evaluate function.

import { evaluate, LaminarDataset } from '@lmnr-ai/lmnr';

evaluate({
    data: new LaminarDataset("name_of_your_dataset"),
    executor: yourExecutorFunction,
    evaluators: {evaluatorName: yourEvaluator},
    groupId: "evals_group_1",
    config: {
        projectApiKey: process.env.LMNR_PROJECT_API_KEY,
        // ... other optional parameters
    }
});

Tracking evaluation score progression over time

You can see how the score changes over time for a given group by clicking on the group name on the evaluations page.

Example evaluation score progression over time.

Comparing evaluations side-by-side

Laminar allows you to compare evaluations side-by-side in the UI.

Example comparison of two evaluation runs.