Comparing evaluations side-by-side

Laminar allows you to compare evaluations side-by-side in the UI.

Example comparison of two evaluation runs.

To do this, you need to group evaluations together. This can be achieved by passing the group_id parameter to the evaluate function.

from lmnr import evaluate, LaminarDataset
import os

evaluate(
    data=LaminarDataset("name_of_your_dataset"),
    executor=your_executor_function,
    evaluators={"evaluator_name": your_evaluator},
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
    group_id="evals_group_1",
    # ... other optional parameters
)
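To have something to compare against, run a second evaluation with the same group_id. The sketch below is illustrative: your_alternative_executor is a hypothetical variant of your pipeline (for example, a different model or prompt); both runs will then show up side-by-side in the UI.

from lmnr import evaluate, LaminarDataset
import os

# Hypothetical second run: same dataset and evaluators, different executor.
evaluate(
    data=LaminarDataset("name_of_your_dataset"),
    executor=your_alternative_executor,  # e.g. a different model or prompt
    evaluators={"evaluator_name": your_evaluator},
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
    group_id="evals_group_1",  # same group_id as the first run
)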

Basic correctness evaluation

In this example, our executor function calls an LLM to get the capital of a country. We then evaluate the correctness of the prediction by checking for an exact match with the target capital.

1. Define an executor function

An executor function calls OpenAI to get the capital of a country. The prompt asks the model to name only the city and nothing else. In a real scenario, you will likely want to use structured output to get just the city name (see the sketch after the code below).

import os

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_capital(data):
    country = data["country"]
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": f"What is the capital of {country}? Just name the "
                "city and nothing else",
            },
        ],
    )
    return response.choices[0].message.content.strip()
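If you prefer structured output over prompt instructions, a minimal sketch could look like the following. It assumes a recent openai Python SDK that exposes beta.chat.completions.parse; CapitalAnswer is a hypothetical schema introduced here for illustration.

from pydantic import BaseModel

class CapitalAnswer(BaseModel):
    # Hypothetical schema: only the city name, nothing else.
    city: str

async def get_capital_structured(data):
    country = data["country"]
    # Assumes an SDK version that supports parsing into a Pydantic model.
    response = await openai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": f"What is the capital of {country}?"},
        ],
        response_format=CapitalAnswer,
    )
    return response.choices[0].message.parsed.city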

2. Define an evaluator function

An evaluator function checks for exact match and returns 1 if the executor output matches the target, and 0 otherwise.

def evaluator(output, target):
    return 1 if output == target["capital"] else 0
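Exact match is strict. If you want to tolerate case and whitespace differences, a purely illustrative variation could normalize both sides before comparing:

def evaluator_normalized(output, target):
    # Illustrative variant: case- and whitespace-insensitive comparison.
    return 1 if output.strip().lower() == target["capital"].strip().lower() else 0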

3. Define data and run the evaluation

from lmnr import evaluate
import os

data = [
    {"data": {"country": "Germany"}, "target": {"capital": "Berlin"}},
    {"data": {"country": "Canada"}, "target": {"capital": "Ottawa"}},
    {"data": {"country": "Tanzania"}, "target": {"capital": "Dodoma"}},
]

evaluate(
    data=data,
    executor=get_capital,
    evaluators={'check_capital_correctness': evaluator},
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
)

Running evaluations on a previously collected dataset

It is quite common to run evaluations on datasets that were previously collected and may contain LLM inputs, LLM outputs, and additional custom data, e.g. human labels.

The key difference here is that the executor function extracts the LLM output from the dataset rather than calling an LLM.

Let’s assume we have a dataset with the following structure:

[
    {
        "data": {
            "country": "Germany",
            "llm_output": "Berlin",
            "llm_input": "What is the capital of Germany?",
            "human_label": "correct"
        },
        "target": {
            "capital": "Berlin"
        }
    },
    {
        "data": {
            "country": "Canada",
            "llm_output": "Ottawa",
            "llm_input": "What is the capital of Canada?",
            "human_label": "correct"
        },
        "target": {
            "capital": "Ottawa"
        }
    },
    {
        "data": {
            "country": "Kazakhstan",
            "llm_output": "Nur-Sultan",
            "llm_input": "What is the capital of Kazakhstan?",
            "human_label": "incorrect"
        },
        "target": {
            "capital": "Astana"
        }
    }
]

* It is common for LLMs of roughly the GPT-4 and Claude 3 generation to name the capital of Kazakhstan as “Nur-Sultan” instead of “Astana”, because for a few years prior to their data cut-off the capital was indeed called Nur-Sultan.

1. Define an executor function

Since the dataset already contains the LLM output, we can simply extract it from the dataset instead of calling the LLM.

async def get_capital(data):
    return data["llm_output"]

2. Define an evaluator function

An evaluator function checks for exact match and returns 1 if the executor output matches the target, and 0 otherwise.

def evaluator(output, target):
    return 1 if output == target["capital"] else 0

3. Define data and run the evaluation

from lmnr import evaluate
import os

data = [
    # ... your dataset here
]

evaluate(
    data=data,
    executor=get_capital,
    evaluators={'check_capital_correctness': evaluator},
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
)

LLM as a judge offline evaluation

In this example, our executor writes short summaries of news articles, and the evaluator checks whether each summary is correct and grades it from 1 to 5.

1. Prepare your data

The trick here is that the evaluator function needs to see the original article to evaluate the summary. That is why we duplicate the article from data into target before running the evaluation.

The data may look something like the following:

[
    {
        "data": {
            "article": "Laminar has released a new feature. ...",
        },
        "target": {
            "article": "Laminar has released a new feature. ...",
        }
    }
]
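If you start from a plain list of articles, one way to build this structure (a minimal sketch; the articles list here is hypothetical) is to copy each article into both data and target:

articles = [
    "Laminar has released a new feature. ...",
    # ... more articles
]

data = [
    {"data": {"article": article}, "target": {"article": article}}
    for article in articles
]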

2. Define an executor function

An executor function calls OpenAI to summarize a news article. It returns a single string, the summary.

import os

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_summary(data: dict[str, str]) -> str:
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Summarize the articles that the user sends you"
            }, {
                "role": "user",
                "content": data["article"],
            },
        ],
    )
    return response.choices[0].message.content.strip()

3. Define an evaluator function

An evaluator function grades the summary from 1 to 5. It returns an integer. We’ve simply asked OpenAI to respond in JSON, but you may want to use structured output or BAML instead.

We also ask the LLM to give a comment on the summary. Even though we don’t use it in the evaluation, it may be useful for debugging or further analysis. In addition, LLMs are known to perform better when given a chance to explain their reasoning.

import json

async def grade_summary(summary: str, target: dict[str, str]) -> int:
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": "Given an article and its summary, grade the "
                    "summary from 1 to 5. Answer in json format. For example: "
                    '{"grade": 3, "comment": "The summary is missing key points"} '
                    f"Article: {target['article']}. Summary: {summary}",
            },
        ],
    )
    return int(json.loads(response.choices[0].message.content.strip())["grade"])
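To make the JSON response more reliable, one option is to request JSON mode explicitly and keep the comment around for debugging. This is a sketch, assuming the model supports OpenAI's response_format={"type": "json_object"}; it reuses openai_client and json from above.

async def grade_summary_json(summary: str, target: dict[str, str]) -> int:
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        # Assumes the model supports JSON mode.
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": "Given an article and its summary, grade the summary "
                    'from 1 to 5. Answer in JSON, e.g. {"grade": 3, "comment": '
                    '"The summary is missing key points"}. '
                    f"Article: {target['article']}. Summary: {summary}",
            },
        ],
    )
    result = json.loads(response.choices[0].message.content)
    print(result["comment"])  # keep the model's comment for debugging
    return int(result["grade"])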

4. Run the evaluation

from lmnr import evaluate
import os

data = [
    { "data": { "article": '...' }, "target": { "article": '...' } },
    { "data": { "article": '...' }, "target": { "article": '...' } },
    { "data": { "article": '...' }, "target": { "article": '...' } },
]

evaluate(
    data=data,
    executor=get_summary,
    evaluators={'grade_summary': grade_summary},
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
)