Overview

In this guide, we’ll follow the complete journey of building and improving a Data Analysis Assistant - an AI agent that helps users analyze their data, create visualizations, and generate insights. This example showcases how Laminar’s end-to-end platform helps you build reliable tool-calling agents.

Why This Guide Matters

Tool-calling agents are powerful but complex - they need to select the right tools, use correct parameters, and handle multi-step workflows. Unlike simple text generation, evaluating these agents requires understanding their decision-making process, and improving them requires systematic iteration based on real user interactions.

What You’ll Learn

Step 1: Tracing the agent in production
Capture real user interactions automatically to understand how the agent behaves in production.

Step 2: Collecting user feedback
Collect user feedback via tagging with the Laminar SDK to understand when the agent helps users and when it frustrates them.

Step 3: Identifying failure patterns with SQL query editor
Identify systematic issues by querying the traced interactions to find failure patterns.

Step 4: Label data and create evaluation dataset
Label the problematic cases and create an evaluation dataset using Laminar’s labeling queue interface.

Step 5: Running evaluations
Experiment with the agent’s system prompt and run evaluations against the labeled dataset to see whether the new prompt improves the agent.

This end-to-end approach ensures our Data Analysis Assistant continuously improves based on real user interactions rather than hypothetical test cases.

The Data Analysis Assistant

Our assistant helps users analyze data through natural language queries like:

  • “How did our revenue perform last quarter?”
  • “Show me user engagement trends over time”
  • “Find any anomalies in our conversion rates”

Available Tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "query_database",
            "description": "Execute SQL queries to retrieve data from the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "SQL query to execute"},
                    "database": {"type": "string", "enum": ["analytics", "sales", "users"]}
                },
                "required": ["query", "database"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "create_visualization",
            "description": "Create charts and graphs from data",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {"type": "string", "description": "Data to visualize"},
                    "chart_type": {"type": "string", "enum": ["line", "bar", "pie", "scatter"]},
                    "title": {"type": "string", "description": "Chart title"}
                },
                "required": ["data", "chart_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "generate_summary",
            "description": "Generate insights and summary from analysis",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {"type": "string", "description": "Data to summarize"},
                    "focus": {"type": "string", "description": "What to focus the summary on"}
                },
                "required": ["data"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "compare_periods",
            "description": "Compare metrics across different time periods",
            "parameters": {
                "type": "object",
                "properties": {
                    "metric": {"type": "string", "description": "Metric to compare"},
                    "period1": {"type": "string", "description": "First time period"},
                    "period2": {"type": "string", "description": "Second time period"}
                },
                "required": ["metric", "period1", "period2"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "detect_anomalies",
            "description": "Identify unusual patterns or outliers in data",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {"type": "string", "description": "Data to analyze for anomalies"},
                    "sensitivity": {"type": "string", "enum": ["low", "medium", "high"]}
                },
                "required": ["data"]
            }
        }
    }
]

Step 1: Production Tracing with Laminar

First, let’s set up automatic tracing for our Data Analysis Assistant in production:

production_agent.py
import os
import json
from openai import OpenAI
from lmnr import Laminar, observe
from dotenv import load_dotenv

load_dotenv()

# Initialize Laminar for automatic tracing
Laminar.initialize()

client = OpenAI()

@observe(name="analyze_data")
def analyze_data(messages: list):
    """Main agent function that accepts input messages and handles tool calls"""
    
    # Hardcode system message at the beginning
    full_messages = [
        {
            "role": "system",
            "content": """You are a data analysis assistant. Use the available tools to help users analyze their data and generate insights. Always:
1. Query the appropriate database first
2. Create visualizations when helpful
3. Provide clear summaries of findings
4. Compare time periods when relevant

IMPORTANT: Always start by understanding what data you need, then query it, then process it."""
        }
    ]
    
    # Add input messages (skip any existing system messages)
    for msg in messages:
        if msg["role"] != "system":
            full_messages.append(msg)
    
    # Loop to handle multiple rounds of tool calls
    max_iterations = 5  # Prevent infinite loops
    iteration = 0
    
    while iteration < max_iterations:
        iteration += 1
        
        # Make LLM call
        response = client.chat.completions.create(
            model="o4-mini",
            messages=full_messages,
            tools=tools,
            tool_choice="auto"
        )
        
        # Add assistant's response to messages
        assistant_message = {
            "role": "assistant",
            "content": response.choices[0].message.content
        }
        if response.choices[0].message.tool_calls:
            assistant_message["tool_calls"] = response.choices[0].message.tool_calls
        full_messages.append(assistant_message)
        
        # If no tool calls, we have our final response
        if not response.choices[0].message.tool_calls:
            return response.choices[0].message.content
        
        # Execute tool calls
        for tool_call in response.choices[0].message.tool_calls:
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)
            
            # Simulate tool execution (replace with actual implementations)
            result = simulate_tool_execution(tool_name, tool_args)
            
            # Add tool result to messages
            full_messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": tool_name,
                "content": result
            })
    
    # If we've hit max iterations, return what we have
    return response.choices[0].message.content or "Analysis completed with multiple tool calls."

def simulate_tool_execution(tool_name, args):
    """Simulate tool execution - replace with real implementations"""
    # span_type="TOOL" marks this span as a tool span in the trace
    with Laminar.start_as_current_span(name=tool_name, span_type="TOOL"):
        if tool_name == "query_database":
            return "Sample data: Revenue Q4 2024: $1.2M, Q3 2024: $1.0M"
        elif tool_name == "create_visualization":
            return "Chart created successfully"
        elif tool_name == "generate_summary":
            return "Key insight: 20% revenue growth quarter-over-quarter"
        elif tool_name == "compare_periods":
            return "Q4 vs Q3: +20% increase in revenue"
        elif tool_name == "detect_anomalies":
            return "No significant anomalies detected"
        
        return "Tool executed successfully"

# Example usage in production
if __name__ == "__main__":
    messages = [
        {
            "role": "user",
            "content": "How did our revenue perform last quarter compared to the previous quarter?"
        }
    ]
    result = analyze_data(messages)
    print(result)

How Laminar Tracing Works

This agent demonstrates several key tracing concepts:

  1. Top-Level Span: The @observe(name="analyze_data") decorator creates a top-level span that captures the entire agent execution
  2. Automatic LLM Tracing: Laminar automatically traces LLM calls made via the OpenAI API (and other supported LLM SDKs and frameworks, such as Anthropic and LangChain)
  3. Manual Tool Spans: Each tool execution is wrapped in Laminar.start_as_current_span(name=tool_name, span_type="TOOL") to create dedicated tool spans

With this setup, every interaction creates a hierarchical trace in Laminar:

📊 analyze_data (top-level span)
├── 🤖 LLM call #1 (automatic)
├── 🔧 query_database (manual tool span)  
├── 🤖 LLM call #2 (automatic)
├── 🔧 create_visualization (manual tool span)
└── 🤖 LLM call #3 (automatic)

This captures:

  • Which tools were called and in what sequence
  • Tool execution results and timing
  • All LLM requests/responses throughout the conversation
  • Performance metrics and any errors

Here’s a screenshot of the traced interactions in Laminar:

Traced interactions in Laminar

Step 2: Capturing User Feedback

Now let’s add user feedback collection using Laminar’s tagging system. The key is to save the trace ID during execution and tag the trace later when you receive user feedback.

production_agent_with_feedback.py
import os
import json
from openai import OpenAI
from lmnr import Laminar, LaminarClient, observe
from dotenv import load_dotenv

load_dotenv()

# Initialize Laminar for automatic tracing
Laminar.initialize(project_api_key=os.environ["LMNR_PROJECT_API_KEY"])
laminar_client = LaminarClient()

client = OpenAI()

@observe("analyze_data")
def analyze_data(user_query: str):
    
    # ... rest of the code ...

    # get trace id from the current span
    trace_id = Laminar.get_trace_id()
    
    # add trace id to the response alongside the agent's answer
    return {"trace_id": trace_id, "response": response}  # "response" is produced by the elided agent code above

# Example of how to add tag with feedback to the trace
def add_negative_feedback(trace_id: str):
    """Tag the trace with user feedback using the trace ID"""
    
    # Tag the entire trace with user feedback
    # This applies the tag to the top-level span of the trace
    laminar_client.tags.tag(trace_id, "unhelpful")

Key Points About Tagging

  1. Get Trace ID in span context: Call Laminar.get_trace_id() inside the @observe-decorated function or inside a manually created span to capture the trace ID
  2. Store for Later: Save the trace ID along with your session data so you can tag it when feedback arrives
  3. Tag the Trace: Use laminar_client.tags.tag(trace_id, tag_name) to apply tags to the entire trace. For example, you can have a user feedback feature in your app that tags the trace with unhelpful when the user provides feedback.
  4. Top-Level Span: When you tag a trace, it applies the tag to the top-level span, making it easy to filter in SQL queries

This approach allows you to collect feedback asynchronously - users can provide feedback minutes or hours after the interaction, and you can still associate it with the correct trace.
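
As a concrete illustration of points 2 and 3, here is a minimal sketch of wiring this into an application. The in-memory feedback_store dictionary and the handle_user_feedback function are hypothetical placeholders for however your app persists session data and receives feedback; the Laminar calls are the same ones used above.

from lmnr import Laminar, LaminarClient, observe

laminar_client = LaminarClient()

# Hypothetical in-memory store mapping session IDs to trace IDs.
# In a real application this would live in your database or session storage.
feedback_store = {}

@observe(name="analyze_data")
def analyze_data(session_id: str, messages: list):
    # ... run the agent exactly as in Step 1 ...
    # Capture the trace ID inside the observed function and store it for later
    feedback_store[session_id] = Laminar.get_trace_id()
    return "..."  # the agent's final response

def handle_user_feedback(session_id: str, helpful: bool):
    """Hypothetical callback invoked when the user clicks a feedback button"""
    trace_id = feedback_store.get(session_id)
    if trace_id and not helpful:
        # Tag the whole trace so it shows up in the SQL queries in Step 3
        laminar_client.tags.tag(trace_id, "unhelpful")

The same pattern works with any persistence layer; the only requirements are that the trace ID is captured inside the observed span and stored somewhere you can look it up when feedback arrives.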

Step 3: Analyzing Problematic Cases with SQL Editor

After collecting feedback in production, use Laminar’s SQL Editor to identify patterns in unsuccessful interactions. The SQL editor is available from any page in Laminar; open it by clicking the “console” button in the top-right corner of the navigation bar.

-- Get the LLM spans from traces tagged as "unhelpful"
SELECT 
  input,
  output
FROM spans
WHERE span_type = 'LLM' 
  AND trace_id IN (SELECT trace_id FROM spans WHERE tag = 'unhelpful')
ORDER BY start_time DESC;

Example of using Laminar's SQL Editor to query traced interactions

Then click the export to dataset button to export the results to a dataset. Drag the output field to the target column and the input field to the data column, then click the create dataset button to create a new dataset.

Exporting query results to dataset for labeling

Push dataset to labeling queue

Now navigate to the newly created dataset from the previous step. In the dataset view:

  1. Export to labeling queue: Click the “Add all to labeling queue” button
  2. Choose labeling queue: Either create a new labeling queue or select an existing one

This process moves your problematic cases from the dataset into a structured labeling workflow where human annotators can provide the correct expected outputs for tool calls.

Step 4: Label data and create evaluation dataset

Laminar provides a convenient split-screen labeling interface that makes it easy to quickly label your data and build evaluation datasets. In this step, we will use the labeling queue from the previous step to label the data and create an evaluation dataset.

The Labeling Interface

When you open your labeling queue, you’ll see an efficient split-screen interface:

Left Panel - Payload View: Shows the full JSON structure of the current item, including the user’s original query in the data field and the agent’s problematic response in the target field.

Right Panel - Target Editor: This is where you provide the correct expected output. You can:

  • Edit the existing response to fix tool selection issues
  • Write completely new target responses with correct tool calls
  • Use proper JSON formatting with syntax highlighting

Laminar's labeling interface for tool call evaluation

Labeling Workflow

For each problematic tool call case:

  1. Review the original query in the payload view (left panel)
  2. Analyze what went wrong with the agent’s response
  3. Write the correct expected output in the target editor (right panel)
  4. Select your target dataset from the dropdown. You can create a new dataset or select an existing one.
  5. Click “Complete” to save and move to the next item

Each completed datapoint will be automatically added to the target dataset. This dataset will be used for evaluation in the next step.

As you type in the target editor, the left panel updates in real-time, showing exactly what will be saved to your evaluation dataset. This immediate feedback helps ensure accuracy in your labeling.

Once your dataset is created, we can reference it in evaluations using LaminarDataset("eval_dataset").

Step 5: Running Tool Call Evaluations

With our labeled dataset ready, we can now run systematic evaluations. The key insight is that our dataset contains the original conversation messages in the OpenAI format, which allows us to test different system prompts while keeping the user queries consistent.

Understanding the Dataset Structure

From the labeling process, each datapoint in our evaluation dataset has this structure:

{
  "data": {
    "input": [
      {"role": "system", "content": "You are a data analysis assistant..."},
      {"role": "user", "content": "Show me sales performance compared to last quarter"},
      {"role": "assistant", "content": "..."},
      {"role": "tool", "tool_call_id": "call_9tNpqAft21sjn8UgSojJN5b8"}
    ]
  },
  "target": {
    "output": [
        {
            "id": "call_9tNpqAft21sjn8UgSojJN5b8",
            "name": "create_visualization",
            "arguments": {
            "data": "[{\"quarter\":\"Q3 2024\",\"revenue\":1000000},{\"quarter\":\"Q4 2024\",\"revenue\":1200000}]",
            "title": "Revenue by Quarter: Q3 vs Q4 2024",
            "chart_type": "bar"
            }
        }
    ]
  }
}

This structure allows us to:

  1. Test new system prompts by replacing the system message
  2. Keep user queries consistent for fair comparison
  3. Compare against expected tool calls from human labeling

Create evaluation directory structure

├── src/
├── evals/
│   ├── eval_tool_selection.py

Main evaluation file

evals/eval_tool_selection.py
from lmnr import evaluate, LaminarDataset
from openai import OpenAI
import json
import os
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()

# Same tools as production - consistency is key
tools = [
    {
        "type": "function",
        "function": {
            "name": "query_database",
            "description": "Execute SQL queries to retrieve data from the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "SQL query to execute"},
                    "database": {"type": "string", "enum": ["analytics", "sales", "users"]}
                },
                "required": ["query", "database"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "create_visualization",
            "description": "Create charts and graphs from data",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {"type": "string", "description": "Data to visualize"},
                    "chart_type": {"type": "string", "enum": ["line", "bar", "pie", "scatter"]},
                    "title": {"type": "string", "description": "Chart title"}
                },
                "required": ["data", "chart_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "compare_periods",
            "description": "Compare metrics across different time periods",
            "parameters": {
                "type": "object",
                "properties": {
                    "metric": {"type": "string", "description": "Metric to compare"},
                    "period1": {"type": "string", "description": "First time period"},
                    "period2": {"type": "string", "description": "Second time period"}
                },
                "required": ["metric", "period1", "period2"]
            }
        }
    }
]

def data_analysis_agent(data):
    """Executor function that tests new system prompts"""
    
    # Get the original messages from the dataset
    original_messages = data["input"]
    
    # Create new messages with improved system prompt
    # This is where you test prompt improvements!
    messages = [
        {
            "role": "system", 
            "content": """You are a data analysis assistant. Use the available tools to help users analyze their data and generate insights. Always:
1. Query the appropriate database first
2. Create visualizations when helpful  
3. Provide clear summaries of findings
4. Compare time periods when relevant

IMPORTANT: Always start by understanding what data you need, then query it, then process it."""
        }
    ]
    
    # Add all non-system messages from the original conversation
    for msg in original_messages:
        if msg["role"] != "system":
            messages.append(msg)
    
    response = client.chat.completions.create(
        model="o4-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    return response

def evaluate_tool_selection(output, target):
    """Evaluator to check if correct tools were selected"""
    
    # Extract actual tool calls from the response
    actual_tool_calls = []
    if hasattr(output.choices[0].message, 'tool_calls') and output.choices[0].message.tool_calls:
        actual_tool_calls = [
            {
                "name": call.function.name,
                "arguments": json.loads(call.function.arguments)
            }
            for call in output.choices[0].message.tool_calls
        ]
    
    # Extract expected tool calls from target, handling the different shapes it may take
    if isinstance(target, dict):
        expected_tool_calls = target.get("output", [])
        # Handle nested list structure: [[{tool_objects}]] -> [{tool_objects}]
        if expected_tool_calls and isinstance(expected_tool_calls[0], list):
            expected_tool_calls = expected_tool_calls[0]
    elif isinstance(target, list):
        expected_tool_calls = target
    else:
        # If target is something else (like a string), try to parse it
        try:
            if isinstance(target, str):
                parsed_target = json.loads(target)
                expected_tool_calls = parsed_target.get("output", []) if isinstance(parsed_target, dict) else parsed_target
            else:
                expected_tool_calls = []
        except json.JSONDecodeError:
            expected_tool_calls = []
    
    # Check if we called the expected tools
    expected_tool_names = [tool["name"] for tool in expected_tool_calls]
    actual_tool_names = [tool["name"] for tool in actual_tool_calls]

    # Simple binary check: did we call all expected tools?
    for expected_tool in expected_tool_names:
        if expected_tool not in actual_tool_names:
            return 0
    
    return 1

# Run the evaluation
evaluate(
    data=LaminarDataset("eval_dataset"),
    executor=data_analysis_agent,
    evaluators={
        "tool_selection": evaluate_tool_selection
    },
    project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
    group_name="improved_system_prompt_v1"
)
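
The evaluator above only scores tool selection. If you also care about whether the agent passed the right arguments, you can register a second, stricter evaluator alongside it. The following is a sketch that assumes the flat target structure shown earlier; exact equality on every labeled argument is deliberately strict, and you may prefer to compare only key fields such as chart_type.

def evaluate_tool_arguments(output, target):
    """Stricter check: every expected tool call must appear with matching argument values"""
    message = output.choices[0].message
    actual = {
        call.function.name: json.loads(call.function.arguments)
        for call in (message.tool_calls or [])
    }

    if isinstance(target, dict):
        expected_tool_calls = target.get("output", [])
    else:
        expected_tool_calls = target or []
    # Unwrap the nested-list shape ([[...]]) handled in the evaluator above
    if expected_tool_calls and isinstance(expected_tool_calls[0], list):
        expected_tool_calls = expected_tool_calls[0]

    for expected in expected_tool_calls:
        actual_args = actual.get(expected["name"])
        if actual_args is None:
            return 0
        # Every labeled argument must be present with the same value
        for key, value in expected.get("arguments", {}).items():
            if actual_args.get(key) != value:
                return 0
    return 1

To use it, add it to the evaluators dictionary in the evaluate() call, e.g. evaluators={"tool_selection": evaluate_tool_selection, "tool_arguments": evaluate_tool_arguments}.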

Testing Different System Prompts

The power of this approach is that you can easily test different system prompts:

# Version 1: Original prompt
system_prompt_v1 = """You are a data analysis assistant. Use available tools to analyze data."""

# Version 2: More detailed instructions  
system_prompt_v2 = """You are a data analysis assistant. Always:
1. Query appropriate database first
2. Create visualizations when helpful
3. Provide clear summaries"""

# Version 3: Emphasize tool sequencing
system_prompt_v3 = """You are a data analysis assistant. CRITICAL: Always follow this sequence:
1. Understand what data is needed
2. Query the database to get that data  
3. Process/analyze the results
4. Create visualizations if helpful
5. Summarize findings clearly"""

Simply swap the content field in the system message to test different versions and see which performs better on your real production failure cases.
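
One convenient way to run this comparison (a sketch, not the only option) is to parameterize the executor with the prompt under test and launch one evaluation per version, each under its own group_name so the runs are easy to tell apart in Laminar. The data_analysis_agent_with_prompt helper and the prompt_versions dictionary below are assumptions for illustration; this assumes the executor is called with the datapoint’s data as its only argument, as in the example above, and that the file is run as a standalone script.

from functools import partial

def data_analysis_agent_with_prompt(data, system_prompt: str):
    """Same executor as above, parameterized by the system prompt under test"""
    messages = [{"role": "system", "content": system_prompt}]
    messages += [msg for msg in data["input"] if msg["role"] != "system"]
    return client.chat.completions.create(
        model="o4-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

prompt_versions = {
    "prompt_v1": system_prompt_v1,
    "prompt_v2": system_prompt_v2,
    "prompt_v3": system_prompt_v3,
}

for name, prompt in prompt_versions.items():
    evaluate(
        data=LaminarDataset("eval_dataset"),
        executor=partial(data_analysis_agent_with_prompt, system_prompt=prompt),
        evaluators={"tool_selection": evaluate_tool_selection},
        project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
        group_name=name,  # one evaluation group per prompt version
    )

Each run then appears under its own group name in the Laminar evaluation view, which makes it easy to compare scores across prompt versions.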

Running the evaluation

You can run this evaluation in two ways:

Using the CLI

# Set your project API key
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY> # skip if you set LMNR_PROJECT_API_KEY in .env file

# Run single evaluation file
lmnr eval evals/eval_tool_selection.py

# Or run all evaluations in the evals directory
lmnr eval

Running as a standalone script

python evals/eval_tool_selection.py

Viewing Results and Iteration

Evaluation results in Laminar

After running evaluations, the Laminar evaluation view shows:

  1. Overall Scores: How your agent performs across all metrics
  2. Individual Cases: Traces of each evaluation
  3. Performance Trends: How scores improve over iterations

Use these insights to:

  • Improve Prompts: Refine system prompts based on failure patterns
  • Adjust Tool Selection Logic: Update tool descriptions or parameters
  • Add New Tools: Identify missing capabilities from user feedback
  • Update Training Data: Create more diverse evaluation cases

This approach ensures the Data Analysis Assistant continuously improves based on real user interactions and systematic evaluation, leading to more reliable and useful AI agents.

Learn More

To dive deeper into the concepts covered in this guide:

  • Tracing Documentation: Learn more about automatic instrumentation, manual span creation, and advanced tracing patterns
  • Evaluations Documentation: Explore advanced evaluation patterns, custom evaluators, and evaluation best practices