Introduction
Evaluate the results of your LLM applications with Laminar
Evaluation means running your production logic against a dataset and scoring its results with another function.
Why do we need evaluations?
TL;DR: Evaluations help bring rigor to the AI development process.
During the development process, it is natural to “vibe-check” the outputs of LLMs. It is a good way to get a sense of how well your model or prompt is performing. However, this approach is neither scalable nor reliable. We like to think of evaluations as something close to unit tests, where you test your models against a dataset. The difference is that the output is not binary (pass/fail), but a score or a set of scores.
This approach allows you to control most variables and check only one thing in isolation, e.g., how the new prompt performs compared to the existing one.
Key concepts
- Executor – the function being evaluated, often your prompt or (to-be-)production logic
- Evaluator – the function evaluating the results
- Dataset – collection of datapoints to run executors and evaluators against
- Datapoint – two values, data and target, each represented as free-form JSON.
  - data – data sent to the executor. Required.
  - target – data sent to the evaluator. Usually contains the expected or target outputs of certain parts of your executor. Optional.
- Evaluation group – a group of evaluations that evaluate one feature or one part of your application. Results of evaluations in the same group are aggregated and visualized together.
Example datapoint
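As an illustration of the definition above, a datapoint could look like the sketch below. Only the data and target fields come from the text; the Datapoint type name and the values inside the fields (a poem topic and an expected poem) are assumptions made up for this example.

```typescript
// Hypothetical datapoint shape: both fields hold free-form JSON.
type Datapoint = {
  data: Record<string, unknown>; // sent to the executor (required)
  target?: Record<string, unknown>; // sent to the evaluator (optional)
};

// Made-up values for illustration only.
const exampleDatapoint: Datapoint = {
  data: { topic: "flowers" },
  target: { poem: "This is a good poem about flowers" },
};

console.log(JSON.stringify(exampleDatapoint, null, 2));
```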
Flow overview
For every datapoint in the dataset, evaluation does the following:
- Pass the data as an argument to the executor.
- Run the executor.
- Store the executor output.
- Pass the executor output and the target to the evaluator function.
- The evaluator produces a numeric output or a JSON object with several numeric outputs. This is stored in the results of the evaluation.
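The flow above can be sketched as a small local loop. This is not Laminar's implementation: runEvaluation, EvaluationResult, and the toy executor and evaluator are hypothetical names used only to show the order of the steps, and the sketch is synchronous for brevity, whereas a real executor that calls an LLM would usually be asynchronous.

```typescript
// A score is either one number or an object of named numeric scores.
type Scores = number | Record<string, number>;

interface Datapoint<D, T> {
  data: D;
  target?: T;
}

interface EvaluationResult<O> {
  executorOutput: O;
  scores: Scores;
}

// Hypothetical stand-in for the evaluation loop described above.
function runEvaluation<D, T, O>(
  dataset: Datapoint<D, T>[],
  executor: (data: D) => O,
  evaluator: (output: O, target: T | undefined) => Scores,
): EvaluationResult<O>[] {
  return dataset.map((datapoint) => {
    // Pass data to the executor, run it, and keep its output.
    const executorOutput = executor(datapoint.data);
    // Pass the output and the target to the evaluator and keep the scores.
    const scores = evaluator(executorOutput, datapoint.target);
    return { executorOutput, scores };
  });
}

// Toy usage: the executor doubles a number, the evaluator checks the target.
const results = runEvaluation(
  [
    { data: 2, target: 4 },
    { data: 3, target: 10 },
  ],
  (n: number) => n * 2,
  (output: number, target: number | undefined) => (output === target ? 1 : 0),
);

console.log(results);
```

Keeping the executor output next to the scores mirrors the flow, where both are stored for every datapoint.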
Example
In this example, the write_poem function is the executor, and the contains_poem function is the evaluator.
The score that is produced by the evaluator is stored in the results of the evaluation under the contains_poem key.
Example evaluation
Quickstart
Prerequisites
To get the project API key, go to the Laminar dashboard, click the project settings, and generate a project API key.
Specify the key at Laminar initialization. If not specified, Laminar will look for the key in the LMNR_PROJECT_API_KEY environment variable.
Example. Running and registering evaluations
1. Create an evaluation file.
Create a file named my-first-evaluation.ts and add the following code:
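A minimal sketch of what such a file might contain is shown below. It assumes that evaluate is exported by the @lmnr-ai/lmnr package and accepts data, executor, and evaluators; the import and the call are left as comments so the sketch runs without the SDK or an API key. The camelCase function names, the PoemInput and PoemTarget types, and the sample datapoints are illustrative choices, and the executor returns a fixed string instead of calling an LLM.

```typescript
// my-first-evaluation.ts (illustrative sketch)
// With the SDK installed, the import would presumably be:
// import { evaluate } from "@lmnr-ai/lmnr";

type PoemInput = { topic: string };
type PoemTarget = { poem: string };

// Executor: stands in for real production logic such as an LLM call.
const writePoem = ({ topic }: PoemInput): string =>
  `This is a good poem about ${topic}`;

// Evaluator: 1 if the target string is found in the output, 0 otherwise.
const containsPoem = (output: string, target: PoemTarget): number =>
  output.includes(target.poem) ? 1 : 0;

// Made-up dataset with two datapoints.
const data: { data: PoemInput; target: PoemTarget }[] = [
  { data: { topic: "flowers" }, target: { poem: "This is a good poem about flowers" } },
  { data: { topic: "cars" }, target: { poem: "I like cars" } },
];

// Options as they might be passed to evaluate. The evaluator is registered
// under the contains_poem key, which is the key its score is stored under.
const evaluationOptions = {
  data,
  executor: writePoem,
  evaluators: { contains_poem: containsPoem },
};

// evaluate(evaluationOptions);
console.log(Object.keys(evaluationOptions.evaluators));
```

With these datapoints the evaluator gives 1 for the first datapoint and 0 for the second, since only the first target string appears in the executor output.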
2. Run the evaluation
You can run evaluations from the Laminar CLI or from code. To run the evaluation from the CLI, pass the file to the eval command: npx lmnr eval my-first-evaluation.ts
If you want to run multiple files, place them in the evals directory, and name the files such that they end with .eval.{ts,js}.
Then, run npx lmnr eval to run all the evaluation files in the evals directory.
To run the evaluation directly from code, simply call the evaluate function, i.e. import or run the file created in the previous step.
An evaluator returns either a single numeric score or, for multiple scores, a JSON object (dict) with string keys and numeric values.
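The two return shapes can be illustrated with a pair of small evaluators. Both functions and their score names (exact_match, length_ratio) are hypothetical; they only show a single numeric score versus an object with string keys and numeric values.

```typescript
// Single-score evaluator: returns one number.
const exactMatch = (output: string, target: string): number =>
  output === target ? 1 : 0;

// Multi-score evaluator: returns string keys mapped to numeric values.
const matchAndLength = (output: string, target: string): Record<string, number> => ({
  exact_match: output === target ? 1 : 0,
  // Ratio of output length to target length; 0 when the target is empty.
  length_ratio: target.length === 0 ? 0 : output.length / target.length,
});

console.log(exactMatch("hello", "hello"));
console.log(matchAndLength("hello", "hello world"));
```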
Viewing evaluation results and traces
When you run an evaluation from the CLI, Laminar will output the link to the dashboard where you can view the evaluation results.
Laminar stores every evaluation result. A run for every datapoint is represented as a trace. You can view the results and corresponding traces in the evaluations page.
Example evaluation
In this example, we can see that the score for the first datapoint is 1, and for the second one is 0.
This is because our evaluator function contains_poem returns 1 if the target string is found in the output string, and 0 otherwise.
We can also see the full execution trace for each datapoint. If you actually call an LLM in the executor, or the evaluator, you will also see the LLM spans in the trace.
Here the executor called OpenAI, and there are three evaluator functions.
Grouping evaluations
To be able to track the score progression over time, or compare evaluations side-by-side, you need to group them together. This can be achieved by passing the groupId parameter to the evaluate function.
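As a rough sketch, two runs of the same feature could share a group like this. Only the groupId parameter name comes from the text; the EvaluationOptions interface, the "poem-writer" value, and the two prompt variants are assumptions, and the evaluate call is left as a comment so the sketch runs without the SDK.

```typescript
// Hypothetical options shape; only groupId is taken from the text above.
interface EvaluationOptions {
  data: { data: { topic: string } }[];
  executor: (input: { topic: string }) => string;
  evaluators: Record<string, (output: string) => number>;
  groupId?: string;
}

const data = [{ data: { topic: "flowers" } }];
const nonEmpty = (output: string): number => (output.length > 0 ? 1 : 0);

// First prompt variant, tagged with a made-up group id.
const promptV1: EvaluationOptions = {
  data,
  executor: ({ topic }) => `A poem about ${topic}`,
  evaluators: { non_empty: nonEmpty },
  groupId: "poem-writer",
};

// Second variant reuses everything, including the group id, except the executor.
const promptV2: EvaluationOptions = {
  ...promptV1,
  executor: ({ topic }) => `An ode to ${topic}`,
};

// With the SDK installed, each object would be passed to evaluate:
// evaluate(promptV1); evaluate(promptV2);
console.log(promptV1.groupId === promptV2.groupId);
```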
Tracking evaluation score progression over time
You can see how the score changes over time for a given group by clicking on the group name in the evaluations page.
Example evaluation score progression over time.
Comparing evaluations side-by-side
Laminar allows you to compare evaluations side-by-side in the UI.
Example comparison of two evaluation runs.