Evaluation is the process of running your production logic against a dataset and scoring its results with another function.

Why do we need evaluations?

TL;DR: Evaluations bring rigor to the AI development process.

During development, it is natural to “vibe-check” the outputs of LLMs. It is a good way to get a sense of how well your model or prompt is performing, but it is neither scalable nor reliable. We like to think of evaluations as something close to unit tests, where you test your models against a dataset. The difference is that the output is not binary (pass/fail), but a score or a set of scores.

This approach allows you to control most variables and check one thing in isolation, e.g., how a new prompt performs compared to the existing one.

Key concepts

  • Executor – the function being evaluated, often your prompt or (to-be-)production logic
  • Evaluator – the function evaluating the results
  • Dataset – collection of datapoints to run executors and evaluators against
  • Datapoint – a pair of values, data and target, each represented as free-form JSON.
    • data – the data sent to the executor. Required.
    • target – the data sent to the evaluator. Usually contains the expected or target outputs of certain parts of your executor. Optional.

Example datapoint

{
  "data": {"topic": "flowers"},
  "target": "This is a good poem about flowers"
}

Flow overview

For every datapoint in the dataset, evaluation does the following:

  1. The data value of the datapoint is passed as an argument to the executor.
  2. The executor is run.
  3. The executor's output is stored.
  4. The executor's output and the target are passed to the evaluator function.
  5. The evaluator produces either a single numeric score or a JSON object with several numeric scores. This is stored in the results of the evaluation.
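
For intuition, the loop can be sketched roughly as follows. This is a simplified illustration, not Laminar's actual implementation, which also records traces and uploads the results.

def run_evaluation(dataset: list, executor, evaluators: dict) -> list:
    # Simplified sketch of the evaluation loop described above
    results = []
    for datapoint in dataset:
        output = executor(datapoint["data"])                      # steps 1-3
        scores = {
            name: evaluator(output, datapoint.get("target"))      # steps 4-5
            for name, evaluator in evaluators.items()
        }
        results.append({"output": output, "scores": scores})
    return results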

Example:

from lmnr import evaluate

def write_poem(data: dict) -> str:
    # your executor: an LLM call or custom logic
    ...

def contains_poem(output: str, target: str) -> int:
    # your evaluator: returns a numeric score
    ...

evaluate(
    data=[{
        "data": {"topic": "flowers"},
        "target": "This is a good poem about flowers"
    }],
    executor=write_poem,
    evaluators={"contains_poem": contains_poem}
)

In this example, the write_poem function is the executor, and the contains_poem function is the evaluator. The score that is produced by the evaluator is stored in the results of the evaluation under the contains_poem key.

[Screenshot: example evaluation]

Quickstart

Prerequisites

To get a project API key, go to the Laminar dashboard, open the project settings, and generate a project API key.

Specify the key when initializing Laminar. If it is not specified, Laminar will look for the key in the LMNR_PROJECT_API_KEY environment variable.
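
For example, you can pass the key explicitly or rely on the environment variable. This is a minimal sketch, assuming the SDK exposes Laminar.initialize with a project_api_key parameter:

from lmnr import Laminar

# Option 1: pass the key explicitly at initialization
Laminar.initialize(project_api_key="<YOUR_PROJECT_API_KEY>")

# Option 2: set LMNR_PROJECT_API_KEY in the environment and omit the argument
# Laminar.initialize()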

Example. Running and registering evaluations

1. Create an evaluation file.

Create a file named my-first-evaluation.py and add the following code:

my-first-evaluation.py
from lmnr import evaluate

def write_poem(data: dict) -> str:
    # replace this with your LLM call or custom logic
    return f"This is a good poem about {data['topic']}"

def contains_poem(output: str, target: str) -> int:
    return 1 if target in output else 0

evaluate(
    data=[
        {
            "data": {"topic": "flowers"},
            "target": "This is a good poem about flowers"
        },
        {
            "data": {"topic": "cats"},
            "target": "I like cats"
        },
    ],
    executor=write_poem,
    evaluators={"contains_poem": contains_poem}
)

2. Run the evaluation

You can run evaluations either from the Laminar CLI or from code. To run the evaluation from the CLI, execute the following commands:

# 1. Make sure `lmnr` is installed in a virtual environment
# lmnr --help
# 2. Run the evaluation
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
lmnr eval my-first-evaluation.py

To run the evaluation directly from code, call the evaluate function by importing or running the file created in the previous step.
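
For example, since evaluate is called at the top level of my-first-evaluation.py, running the file with Python is enough:

# Assumes LMNR_PROJECT_API_KEY is set in the environment
python my-first-evaluation.py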

An evaluator returns either a single numeric score, or a JSON object / dict with string keys and numeric values when it produces multiple scores.
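
For example, a single evaluator can return several named scores at once (an illustrative sketch; the function and key names are arbitrary):

def poem_scores(output: str, target: str) -> dict:
    # Multiple numeric scores under string keys
    return {
        "contains_target": 1 if target in output else 0,
        "length_ok": 1 if len(output) < 500 else 0,
    }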

Viewing evaluation results and traces

When you run an evaluation from the CLI, Laminar prints a link to the dashboard where you can view the evaluation results.

Laminar stores every evaluation result. The run for each datapoint is represented as a trace. You can view the results and corresponding traces on the evaluations page.

[Screenshot: example evaluation results]

In this example, we can see that the score for the first datapoint is 1, and for the second one is 0. This is because our evaluator function contains_poem returns 1 if the target string is found in the output string, and 0 otherwise.

We can also see the full execution trace for each datapoint. If you call an LLM in the executor or the evaluator, you will also see the LLM spans in the trace.

In this trace, the executor called OpenAI, and there are three evaluator functions.
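
As an illustration, an executor that calls OpenAI could look like the sketch below. The model name and prompt are placeholders, and it assumes the openai Python client is installed and OPENAI_API_KEY is set:

from openai import OpenAI

client = OpenAI()

def write_poem(data: dict) -> str:
    # This LLM call will appear as a span in the datapoint's trace
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"Write a poem about {data['topic']}"}],
    )
    return response.choices[0].message.content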