Introduction

Evaluation is the process of running your production logic against a dataset and scoring its results with another function.

Why do we need evaluations?

TL;DR: Evaluations bring rigor to the AI development process.

During development, it is natural to “vibe-check” the outputs of LLMs. This is a good way to get a sense of how well your model or prompt is performing, but it is neither scalable nor reliable. We like to think of evaluations as something close to unit tests, where you test your models against a dataset. The difference is that the output is not binary (pass/fail), but a score or a set of scores.

This near-scientific approach lets you control most variables and check one thing in isolation, e.g., how a new prompt performs compared to the existing one.

Key concepts

  • Executor – the function being evaluated, often your prompt or (to-be-)production logic
  • Evaluator – the function that scores the executor's results
  • Dataset – a collection of datapoints to run executors and evaluators against
  • Datapoint – a pair of fixed JSON maps, data and target, each with arbitrary keys (see the example below).
    • data – the data sent to the executor.
    • target – the data sent to the evaluator. Usually contains the expected or target outputs of certain parts of your executor.
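
For example, a datapoint for a poem-writing executor might look like the sketch below; the keys inside data and target are arbitrary and up to you.

datapoint = {
    "data": {"topic": "flowers"},  # sent to the executor
    "target": {"poem": "This is a good poem about flowers"},  # sent to the evaluator
}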

Flow overview

High-level overview of the evaluation flow

For every datapoint in the dataset, evaluation does the following:

  1. The data map is passed as an argument to the executor.
  2. The executor runs.
  3. The executor's output is stored.
  4. The executor's output and the target are passed to the evaluator function.
  5. The evaluator produces either a single numeric score or a JSON object with several numeric scores. This is stored in the results of the evaluation (see the sketch after this list).
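
Conceptually, the loop looks roughly like the following sketch. This is illustrative only; run_evaluation and the result shape here are assumptions, not part of the library.

# Illustrative sketch of the evaluation loop (not Laminar's actual internals)
def run_evaluation(dataset, executor, evaluators):
    results = []
    for datapoint in dataset:
        # Steps 1-3: pass `data` to the executor, run it, and store the output
        output = executor(datapoint["data"])
        scores = {}
        for name, evaluator in evaluators.items():
            # Step 4: the executor output and `target` go to the evaluator
            score = evaluator(output, datapoint["target"])
            # Step 5: a single number, or a dict of named numeric scores
            scores.update(score if isinstance(score, dict) else {name: score})
        results.append({"output": output, "scores": scores})
    return results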

Quickstart

Prerequisites

Install the package from PyPI.

pip install lmnr

Example: running and registering evaluations

1. Create an evaluation file.

Create a file named my-first-evaluation.py and add the following code:

# my-first-evaluation.py
from lmnr import evaluate

def write_poem(data: dict) -> str:
    # replace this with your LLM call or custom logic
    return f"This is a good poem about {data['topic']}"

def contains_poem(output: str, target: dict) -> int:
    # check whether the expected poem text appears in the executor output
    return 1 if target["poem"] in output else 0

evaluate(
    data=[
        {
            "data": {"topic": "flowers"},
            "target": {"poem": "This is a good poem about flowers"}
        },
        {
            "data": {"topic": "cats"},
            "target": {"poem": "I like cats"}
        },
    ],
    executor=write_poem,
    evaluators={"containsPoem": contains_poem}
)

2. Run the evaluation

You can run evaluations either from the Laminar CLI or from code. To run the evaluation from the CLI, execute the following command:

# 1. Make sure `lmnr` is installed in a virtual environment
# lmnr --help
# 2. Run the evaluation
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
lmnr eval my-first-evaluation.py

To run the evaluation directly from code, call the evaluate function by importing or running the file created in the previous step, as shown below.
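
For example, assuming your project API key is exported in the environment, you can run the file directly with Python:

export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
python my-first-evaluation.py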

Concepts

  • Executor takes in a Map<string, any> parameter (set to the value of data) and returns anything
  • Evaluator takes in the output of the executor and a Map<string, any> parameter (set to the value of target)
  • Evaluator returns either a single numeric score or a JSON object / dict with string keys and numeric values for multiple scores (see the example below)
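
For example, a hypothetical evaluator returning multiple named scores could look like the following; the function and score names are illustrative.

def grade_poem(output: str, target: dict) -> dict:
    # Each key becomes a separate numeric score in the evaluation results
    return {
        "contains_poem": 1 if target["poem"] in output else 0,
        "output_length": len(output),
    }

You would pass it to evaluate the same way as above, e.g. evaluators={"gradePoem": grade_poem}.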

Viewing evaluation results and traces

When you run an evaluation from the CLI, Laminar will output the link to the dashboard where you can view the evaluation results.

Laminar stores every evaluation result. The run for each datapoint is represented as a trace. You can view the results and the corresponding traces on the evaluations page.