The Laminar Manual Evaluation SDK and API provide you with granular control over the evaluation process, allowing you to integrate Laminar directly into your existing evaluation pipeline or create flexible and complex evaluation workflows.

Manual vs. SDK Evaluation

Use manual evaluation when you need granular control over the evaluation lifecycle or custom tracing, or when you want to integrate evaluations into complex workflows.

The manual evaluation approach gives you fine-grained control over:

  • Step-by-step execution tracking
  • Flexible evaluation logic and scoring
  • Integration with existing systems and workflows
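For comparison, the standard SDK approach wraps this entire lifecycle in a single call. The snippet below is only an illustrative sketch: the evaluate helper and its data / executor / evaluators parameters are assumptions about the SDK's higher-level API, so check the SDK evaluation docs for the exact signature.

import { evaluate } from "@lmnr-ai/lmnr";

// Illustrative sketch of the higher-level SDK approach (assumed signature).
// Laminar creates the evaluation, runs the executor and evaluators,
// and stores the datapoints for you.
evaluate({
    data: [
        { data: { country: 'France' }, target: 'Paris' },
    ],
    executor: async (data) => {
        // your core logic goes here
        return 'Paris';
    },
    evaluators: {
        accuracy: async (output, target) => (output === target ? 1 : 0),
    },
});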

How Manual Evaluation Works

Manual evaluation follows a structured workflow with three core components (see the sketch after this list):

  1. Create Evaluation - Initialize a new evaluation
  2. Execute and Evaluate - Run your logic and evaluate it
    • Run your core logic
    • Run your evaluation logic
  3. Save and Update - Store datapoints and evaluation results
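Put together, these three steps map onto the client calls used throughout this guide. The sketch below is minimal and omits tracing and error handling; myExecutor and myEvaluator are placeholders for your own logic:

import { LaminarClient } from "@lmnr-ai/lmnr";

const client = new LaminarClient({ projectApiKey: 'your_project_api_key' });

// Placeholder implementations; replace with your own logic.
const myExecutor = async (data) => `output for ${data.country}`;
const myEvaluator = async (output, target) => (output.includes(target) ? 1 : 0);

async function lifecycleSketch() {
    // 1. Create Evaluation
    const evalId = await client.evals.create({ name: "My Eval" });

    // 2. Execute and Evaluate (repeat per test case)
    const datapointId = await client.evals.createDatapoint({
        evalId,
        data: { country: 'France' },
        target: 'Paris',
        index: 0,
    });
    const output = await myExecutor({ country: 'France' });
    const score = await myEvaluator(output, 'Paris');

    // 3. Save and Update
    await client.evals.updateDatapoint({
        evalId,
        datapointId,
        executorOutput: { response: output },
        scores: { accuracy: score },
    });
}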

Quickstart

Let’s walk through implementing manual evaluation with tracing, breaking down each component:

Step 1: Setup and Initialization

First, initialize Laminar and create your evaluation clients:

import { Laminar, LaminarClient, observe } from "@lmnr-ai/lmnr";
import { OpenAI } from "openai";

Laminar.initialize({
    projectApiKey: 'your_project_api_key',
    instrumentModules: {
        openAI: OpenAI  // Automatically traces OpenAI calls
    }
});

const client = new LaminarClient({
    projectApiKey: 'your_project_api_key',
});

const openai = new OpenAI({ apiKey: 'your_openai_api_key' });
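In practice, you'll probably load API keys from environment variables rather than hardcoding them. A minimal variant of the setup above, assuming you export LMNR_PROJECT_API_KEY and OPENAI_API_KEY in your environment:

// Same setup as above, reading keys from environment variables
// (the variable names here are our own choice).
Laminar.initialize({
    projectApiKey: process.env.LMNR_PROJECT_API_KEY,
    instrumentModules: {
        openAI: OpenAI
    }
});

const client = new LaminarClient({
    projectApiKey: process.env.LMNR_PROJECT_API_KEY,
});

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });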

Step 2: Create Your Executor and Evaluator Logic

  • Our executor function makes a call to OpenAI and is wrapped with observe for tracing:
const executeTestCase = async (testCase) => {
    return await observe(
        { name: 'executor', spanType: 'EXECUTOR', input: testCase.data },
        async () => {
            const response = await openai.chat.completions.create({
                model: 'gpt-4o-mini',
                messages: [
                    {
                        role: 'user',
                        content: `What is the capital of ${testCase.data.country}? ` +
                          'Answer only with the capital, no other text.'
                    }
                ]
            });
            return response.choices[0].message.content || '';
        }
    );
};
  • Then our evaluator function accuracy measures the accuracy of the executor's output:
const accuracy = async (output, target) => {
    return await observe(
        { name: 'accuracy', spanType: 'EVALUATOR', input: { output, target } },
        async () => {
            if (!target) return 0;
            return output.includes(target) ? 1 : 0;
        }
    );
};
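If you're writing this in TypeScript, it can help to give the test cases an explicit shape. The interface below is a hypothetical type matching the data used in this guide, not something exported by the SDK:

// Hypothetical shape of a test case as used in this guide (not part of the SDK).
interface TestCase {
    data: { country: string };
    target?: string;
}

const testData: TestCase[] = [
    { data: { country: 'France' }, target: 'Paris' },
    { data: { country: 'Germany' }, target: 'Berlin' },
];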

Step 3: Create Evaluation and Datapoints

You first create an evaluation, then create datapoints with your test data, and finally update them with execution results and scores.

Create Evaluation

Before creating datapoints, you must initialize an evaluation session:

const evalId = await client.evals.create({
    name: "Capital of Country Manual Eval", 
    groupName: "Manual API - Capital Cities" 
});

Create/Update Datapoints

The most important aspect is connecting your evaluation datapoints to the current execution trace using trace_id / traceId. This attaches the spans created during execution to the datapoint's trace, which you can then inspect later in the trace view.

You need to call Laminar.getTraceId() / Laminar.get_trace_id() inside the context of a span to get the ID of the current trace.

We strongly advise calling createDatapoint / create_datapoint before running your executor and evaluator, so that the trace is associated with the evaluation trace type.

const datapointId = await client.evals.createDatapoint({
    evalId,
    data: testCase.data,
    target: testCase.target,
    index: i,
    // Must be called within span context
    traceId: Laminar.getTraceId(),
});

await client.evals.updateDatapoint({
    evalId,
    datapointId,
    scores: { accuracy: accuracyScore },
    executorOutput: { 
        response: output, 
        model: 'gpt-4o-mini',
        country: testCase.data.country
    },
});
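Since Laminar.getTraceId() must be called within a span context, the createDatapoint call above is meant to run inside an observe block; the complete example below does exactly this. A condensed sketch of that wrapping:

await observe(
    { name: 'evaluation', spanType: 'EVALUATION', input: { testCase } },
    async () => {
        // Inside the span context, getTraceId() returns the ID of the current trace.
        const datapointId = await client.evals.createDatapoint({
            evalId,
            data: testCase.data,
            target: testCase.target,
            index: i,
            traceId: Laminar.getTraceId(),
        });
        // ... run your executor and evaluator, then call updateDatapoint
    }
);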

Complete Example

import { Laminar, LaminarClient, observe } from "@lmnr-ai/lmnr";
import { OpenAI } from "openai";

Laminar.initialize({
    projectApiKey: 'your_project_api_key',
    instrumentModules: {
        openAI: OpenAI
    }
});

const client = new LaminarClient({
    projectApiKey: 'your_project_api_key',
});
const openai = new OpenAI({ apiKey: 'your_openai_api_key' });

const executeTestCase = async (testCase) => {
    return await observe(
        { name: 'executor', spanType: 'EXECUTOR', input: testCase.data },
        async () => {
            const response = await openai.chat.completions.create({
                model: 'gpt-4o-mini',
                messages: [
                    {
                        role: 'user',
                        content: `What is the capital of ${testCase.data.country}? ` +
                          'Answer only with the capital, no other text.'
                    }
                ]
            });
            return response.choices[0].message.content || '';
        }
    );
};

const accuracy = async (output, target) => {
    return await observe(
        { name: 'accuracy', spanType: 'EVALUATOR', input: { output, target } },
        async () => {
            if (!target) return 0;
            return output.includes(target) ? 1 : 0;
        }
    );
};

async function runEvaluation() {
    try {
        const testData = [
            {
                data: { country: 'France' },
                target: 'Paris',
            },
            {
                data: { country: 'Germany' },
                target: 'Berlin',
            },
        ];

        const evalId = await client.evals.create({ 
            name: "Capital of Country Manual Eval", 
            groupName: "Manual API - Capital Cities" 
        });
        
        for (let i = 0; i < testData.length; i++) {
            await observe(
                { name: 'evaluation', spanType: 'EVALUATION', input: { testCase: testData[i] } },
                async () => {
                    const testCase = testData[i];

                    // Save datapoint first to associate trace as evaluation type
                    const datapointId = await client.evals.createDatapoint({
                        evalId,
                        data: testCase.data,
                        target: testCase.target,
                        index: i,
                        // Must be called within span context
                        traceId: Laminar.getTraceId(),
                    });

                    const output = await executeTestCase(testCase);

                    const accuracyScore = await accuracy(output, testCase.target);

                    await client.evals.updateDatapoint({
                        evalId,
                        datapointId,
                        scores: { accuracy: accuracyScore },
                        executorOutput: { 
                            response: output, 
                            model: 'gpt-4o-mini',
                            country: testCase.data.country
                        },
                    });
                }
            );
        }

        await Laminar.flush();
    } catch (error) {
        console.error("Error:", error.message);
    }
}

runEvaluation();

Evaluation Results

When you run the example above, you’ll see detailed tracing and evaluation results in your Laminar dashboard.

API Reference

For detailed API specifications, including request/response schemas, visit: