Basic correctness evaluation
In this example, our executor function calls an LLM to get the capital of a country. We then evaluate the correctness of the prediction by checking for an exact match with the target capital.
1. Define an executor function
The executor function calls OpenAI to get the capital of a country.
The prompt also asks the model to name only the city and nothing else. In a real scenario,
you will likely want to use structured output to get the city name only.
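A minimal sketch of such an executor follows. To keep it self-contained, the LLM call is injected as a function rather than made through a real client: CompleteFn, CountryInput, buildCapitalPrompt, and makeCapitalExecutor are hypothetical names used for illustration, and in my-eval.ts the injected function would wrap an OpenAI chat completion request.

```typescript
// Hypothetical stand-in for an LLM call: takes a prompt, resolves to the model's text.
// In a real my-eval.ts this would wrap an OpenAI chat completion request.
type CompleteFn = (prompt: string) => Promise<string>;

// Input shape assumed for this example.
interface CountryInput {
  country: string;
}

// Build a prompt that asks for the city name and nothing else.
function buildCapitalPrompt(input: CountryInput): string {
  return (
    `What is the capital of ${input.country}? ` +
    "Answer with the name of the city only, and nothing else."
  );
}

// Create an executor bound to a given completion function.
function makeCapitalExecutor(complete: CompleteFn) {
  return async (input: CountryInput): Promise<string> => {
    const answer = await complete(buildCapitalPrompt(input));
    // Trim stray whitespace so that an exact-match comparison is not thrown off.
    return answer.trim();
  };
}
```

Injecting the completion function also makes the executor easy to exercise with a stub before spending tokens on a real model.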
2. Define an evaluator function
The evaluator function checks for exact match and returns 1 if the executor output
matches the target, and 0 otherwise.
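The exact-match check described above can be sketched as a small pure function. The target shape with a capital field and the name checkCapitalCorrectness are assumptions made for this example.

```typescript
// Exact-match evaluator: 1 if the predicted capital equals the target, 0 otherwise.
// The `{ capital: string }` target shape is an assumption for this example.
function checkCapitalCorrectness(
  output: string,
  target: { capital: string }
): number {
  return output === target.capital ? 1 : 0;
}
```

Exact match is strict: an output such as "paris" or "Paris." scores 0, which is why the prompt asks for the city name and nothing else.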
3. Define data and run the evaluation
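As a rough sketch, the data can be a list of datapoints, each with a data part for the executor and a target part for the evaluator. The field names country and capital are assumptions. The runLocally helper below is a hypothetical local stand-in that only illustrates how the pieces fit together; it is not the Laminar SDK, which takes the same data, executor, and evaluators and also records the results.

```typescript
// Datapoint shape assumed for this example: `data` feeds the executor,
// `target` feeds the evaluator.
interface CapitalDatapoint {
  data: { country: string };
  target: { capital: string };
}

const capitalDatapoints: CapitalDatapoint[] = [
  { data: { country: "France" }, target: { capital: "Paris" } },
  { data: { country: "Germany" }, target: { capital: "Berlin" } },
  { data: { country: "Japan" }, target: { capital: "Tokyo" } },
];

// Hypothetical local runner, NOT the Laminar SDK: runs the executor on each
// datapoint, scores the output against the target, and returns the average score.
async function runLocally(
  datapoints: CapitalDatapoint[],
  executor: (input: { country: string }) => Promise<string>,
  evaluator: (output: string, target: { capital: string }) => number
): Promise<number> {
  let total = 0;
  for (const point of datapoints) {
    const output = await executor(point.data);
    total += evaluator(output, point.target);
  }
  return datapoints.length === 0 ? 0 : total / datapoints.length;
}
```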
Put the evaluation in my-eval.ts and run it with ts-node my-eval.ts or npx lmnr eval my-eval.ts.
LLM as a judge offline evaluation
In this example, our executor will write short summaries of news articles, and the evaluator will check whether each summary is correct and grade it from 1 to 5.
1. Prepare your data
The trick here is that the evaluator function needs to see the original article to evaluate the summary.
That is why we have to duplicate the article from data into target prior to running the evaluation.
The data may look something like the following:
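Here is a small sketch of that duplication step. The article field name, the SummaryDatapoint type, and the toSummaryDatapoints helper are assumptions for illustration, and the article strings are placeholders rather than real data.

```typescript
// Field name assumed for this example.
interface ArticlePayload {
  article: string;
}

// The same article appears twice: in `data` for the executor to summarize,
// and in `target` so the evaluator can compare the summary against it.
interface SummaryDatapoint {
  data: ArticlePayload;
  target: ArticlePayload;
}

// Copy each article into both `data` and `target`.
function toSummaryDatapoints(articles: string[]): SummaryDatapoint[] {
  return articles.map((article) => ({
    data: { article },
    target: { article },
  }));
}

const summaryDatapoints = toSummaryDatapoints([
  "Full text of the first news article.",
  "Full text of the second news article.",
]);
```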
2. Define an executor function
An executor function calls OpenAI to summarize a news article. It returns a single string, the summary.
3. Define an evaluator function
An evaluator function grades the summary from 1 to 5. It returns an integer.
We’ve simply asked OpenAI to respond in JSON, but you may want to use
structured output or BAML instead.
We also ask the LLM to give a comment on the summary. Even though we don’t use it in the evaluation,
it may be useful for debugging or further analysis. In addition, LLMs often perform better
when given a chance to explain their reasoning.
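The part of the evaluator that does not depend on the model can be sketched as follows: building the judge prompt and turning the JSON reply into an integer grade. The JudgeVerdict shape with score and comment fields, the prompt wording, and both function names are assumptions; the actual call to OpenAI is left out.

```typescript
// Reply shape we ask the judge to produce (assumed for this example).
interface JudgeVerdict {
  score: number;
  comment: string;
}

// Build a prompt asking the judge to grade a summary against its article.
function buildJudgePrompt(summary: string, article: string): string {
  return [
    "Grade the summary of the article below from 1 (worst) to 5 (best).",
    'Respond in JSON: {"comment": "<your reasoning>", "score": <integer 1-5>}',
    `Article: ${article}`,
    `Summary: ${summary}`,
  ].join("\n");
}

// Parse the judge's raw JSON reply and return the integer grade.
// The comment is discarded here, but could be logged for debugging.
function parseJudgeScore(raw: string): number {
  const parsed = JSON.parse(raw) as Partial<JudgeVerdict>;
  const score = Number(parsed.score);
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error(`Judge returned an invalid score: ${raw}`);
  }
  return score;
}
```

Validating the range up front means a malformed reply fails loudly instead of silently skewing the average grade.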
4. Run the evaluation
Put the evaluation in my-eval.ts and run it with ts-node my-eval.ts or npx lmnr eval my-eval.ts.
Evaluation with no target
Sometimes you may want to run evaluations on the output of the executor without a target. This can be useful, for example, to check whether the output of the executor is in the correct format, or to use an LLM as a judge evaluator that assesses the output on its own.
This is as simple as not passing target to your evaluator functions and leaving out the target field in your data. For example:
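The sketch below shows a format check of this kind. The datapoints carry only data, and the evaluator takes only the executor output. The 50-word limit, the article field, and the name isWellFormedSummary are assumptions chosen for illustration.

```typescript
// Datapoints carry only `data`; there is no `target` field.
const untargetedDatapoints = [
  { data: { article: "Full text of the first news article." } },
  { data: { article: "Full text of the second news article." } },
];

// Hypothetical upper bound on summary length, in words.
const MAX_SUMMARY_WORDS = 50;

// Evaluator with no target: it only inspects the executor output.
// Returns 1 if the summary is non-empty and within the word limit, 0 otherwise.
function isWellFormedSummary(output: string): number {
  const wordCount = output
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0).length;
  return wordCount > 0 && wordCount <= MAX_SUMMARY_WORDS ? 1 : 0;
}
```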