Introduction
Evaluate the results of your LLM applications with Laminar
Evaluation is the process of running your production logic against a dataset and evaluating its results with another function.
Why do we need evaluations?
TL;DR: Evaluations help bring rigor to the AI development process.
During development, it is natural to “vibe-check” the outputs of LLMs. This is a good way to get a sense of how well your model or prompt is performing, but it is neither scalable nor reliable. We like to think of evaluations as unit tests for your models, run against a dataset; the difference is that the output is not binary (pass/fail) but a score or a set of scores.
This more scientific approach lets you control most variables and check one thing in isolation, e.g. how a new prompt performs compared to the existing one.
Key concepts
- Executor – the function being evaluated, often your prompt or (to-be-)production logic
- Evaluator – the function evaluating the results
- Dataset – a collection of datapoints to run executors and evaluators against
- Datapoint – two fixed JSON maps, `data` and `target`, each with any keys. `data` is sent to the executor; `target` is sent to the evaluator and usually contains the expected or reference outputs of certain parts of your executor (example below).
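For example, a single datapoint might look like the following (this borrows the poem example from the quickstart below):

# One datapoint: `data` feeds the executor, `target` feeds the evaluator.
datapoint = {
    "data": {"topic": "flowers"},
    "target": {"poem": "This is a good poem about flowers"},
}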
Flow overview
High-level overview of the evaluation flow
For every datapoint in the dataset, an evaluation does the following:
- Passes `data` as an argument to the executor.
- Runs the executor.
- Stores the executor's output.
- Passes the executor's output and `target` to the evaluator function.
- Stores the evaluator's result (a numeric score or a JSON object with several numeric scores) in the evaluation results.
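In pseudocode, a run over the dataset looks roughly like this (a simplified sketch, not Laminar's actual implementation; executor, evaluator, and dataset stand for the pieces defined above):

# Simplified sketch of what an evaluation run does per datapoint.
results = []
for datapoint in dataset:
    output = executor(datapoint["data"])             # run the executor on `data`
    scores = evaluator(output, datapoint["target"])  # score the output against `target`
    results.append({"output": output, "scores": scores})  # stored in the evaluation results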
Quickstart
Prerequisites
You need the `lmnr` Python package installed (pip install lmnr) and a Laminar project API key.
Example. Running and registering evaluations
1. Create an evaluation file.
Create a file named my-first-evaluation.py and add the following code:
# my-first-evaluation.py
from lmnr import evaluate

def write_poem(data: dict) -> str:
    # replace this with your LLM call or custom logic
    return f"This is a good poem about {data['topic']}"

def contains_poem(output: str, target: dict) -> int:
    # receives the executor's output and the datapoint's `target` map
    return 1 if target["poem"] in output else 0

evaluate(
    data=[
        {
            "data": {"topic": "flowers"},
            "target": {"poem": "This is a good poem about flowers"}
        },
        {
            "data": {"topic": "cats"},
            "target": {"poem": "I like cats"}
        },
    ],
    executor=write_poem,
    evaluators={"containsPoem": contains_poem}
)
2. Run the evaluation
You can run evaluations both from the Laminar CLI and from code. To run an evaluation from the CLI, execute the following command:
# 1. Make sure `lmnr` is installed in a virtual environment
# lmnr --help
# 2. Run the evaluation
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
lmnr eval my-first-evaluation.py
To run the evaluation directly from code, simply call the evaluate function, i.e. import or run the file created in the previous step.
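If you prefer to trigger it from another Python script, one option is to execute the file programmatically (a hypothetical runner, assuming LMNR_PROJECT_API_KEY is set in the environment):

# run_eval.py - hypothetical runner that executes the evaluation file as a script,
# which in turn calls evaluate(...) at the top level.
import runpy

runpy.run_path("my-first-evaluation.py", run_name="__main__")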
Concepts
- Executor takes in a `Map<string, any>` parameter (set to the value of `data`) and returns anything
- Evaluator takes in a `Map<string, any>` parameter (set to the value of `target`) and the output of the executor
- Evaluator returns either a single numeric score or a JSON object / dict with string keys and numeric values for multiple scores
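For example, an evaluator that returns multiple scores could look like this (a hypothetical sketch; the score names and metrics are arbitrary):

# Hypothetical evaluator returning several named numeric scores.
def score_poem(output: str, target: dict) -> dict:
    return {
        "containsPoem": 1 if target["poem"] in output else 0,                # exact substring check
        "lengthRatio": min(len(output) / max(len(target["poem"]), 1), 1.0),  # crude length comparison
    }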
Viewing evaluation results and traces
When you run an evaluation from the CLI, Laminar outputs a link to the dashboard where you can view the evaluation results.
Laminar stores every evaluation result, and each datapoint's run is represented as a trace. You can view the results and the corresponding traces on the evaluations page.