Human evaluators enable you to incorporate human judgment into your evaluation pipeline. Unlike automated evaluators that provide immediate scores, human evaluators create evaluation tasks that require manual scoring through Laminar’s evaluation interface.
Use case: evaluating LLM-as-a-judge
One of the most common use cases for human evaluators is creating reference data to evaluate and calibrate LLM-as-a-judge evaluators.
When dealing with complex tasks that can’t be scored with simple code, you often need to prompt a strong reasoning LLM to act as an evaluator. However, you can’t assume your judge prompt is optimal or aligned with human judgment. Human evaluators provide the ground truth needed to:
- Validate LLM judge accuracy - Compare LLM scores against human expert judgment
- Optimize judge prompts - Test different prompting strategies and select the best performing ones against human judgment
When to use human evaluators
Human evaluators are particularly valuable when:
- Creating reference data for LLM judges - The primary use case for validating and improving automated LLM evaluators
- Subjective quality assessment - Evaluating creativity, tone, or style where human judgment is essential
- Domain expertise required - Evaluating specialized content that requires expert knowledge
How human evaluators work
When you include a HumanEvaluator in your evaluation:
- Evaluation runs normally - All automated evaluators execute immediately
- Human evaluation tasks created - Each datapoint generates a task requiring manual scoring
- Deferred scoring - Human evaluators appear as pending in the evaluation results
- Manual scoring through UI - Evaluators access the evaluation interface to assign scores
- Results integration - Human scores are incorporated into the overall evaluation metrics
Basic usage
Here’s a simple example that combines automated and human evaluation.
We use code to check that the story stays under the max_words limit, and a human evaluator to get a human assessment of the story quality.
from lmnr import evaluate, HumanEvaluator
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def generate_story(data: dict) -> str:
    """Generate a creative story based on the prompt"""
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": f"Write a creative short story about: {data['prompt']}. Keep it under {data['max_words']} words."
            }
        ]
    ).choices[0].message.content
def check_length(output: str, target: dict) -> int:
    """Automated check that the story is under the max_words limit"""
    return 1 if len(output.split()) <= target.get('max_words', 100) else 0
evaluate(
    data=[
        {
            "data": {
                "prompt": "A robot learning to paint",
                "max_words": 100
            },
            "target": {"max_words": 150},
            "metadata": {
                "title": "Robot Artist Story",
                "category": "sci-fi"
            }
        },
        {
            "data": {
                "prompt": "A time traveler's first day in medieval times",
                "max_words": 100
            },
            "target": {"max_words": 150},
            "metadata": {
                "title": "Time Travel Adventure",
                "category": "historical-fiction"
            }
        },
    ],
    executor=generate_story,
    evaluators={
        "length_check": check_length,
        "story_quality": HumanEvaluator()
    },
)
After you run the evaluation, you’ll see the results in the dashboard. Notice how story_quality scores are pending, while length_check is completed.
When you click on a single evaluation run, you’ll see the human evaluator span. You can see the data that was sent to this human evaluator and manually score it.
After you score the human evaluator, you’ll see the scores in the dashboard.
Validating LLM-as-a-judge evaluators
Here’s a practical example of using human evaluators to create reference data for validating and improving LLM-as-a-judge evaluators:
from lmnr import evaluate, HumanEvaluator
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()
def generate_customer_response(data: dict) -> str:
    """Generate customer service responses"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful customer service representative."
            },
            {
                "role": "user",
                "content": data["customer_inquiry"]
            }
        ]
    )
    return response.choices[0].message.content
def llm_judge_helpfulness(output: str, target: dict) -> float:
    """LLM-as-a-judge evaluator for helpfulness (1-3 scale, normalized to 0-1)"""
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {
                "role": "system",
                "content": """Rate the helpfulness of this customer service response on a scale of 1-3:
1 = Not helpful at all
2 = Moderately helpful
3 = Very helpful
Consider: Does it address the customer's concern? Is it clear and actionable?
Respond with only the number."""
            },
            {
                "role": "user",
                "content": f"Customer inquiry: {target['customer_inquiry']}\n\nResponse: {output}"
            }
        ]
    )
    return int(response.choices[0].message.content.strip()) / 3
# Step 1: Create reference data with human evaluators
evaluate(
    data=[
        {
            "data": {"customer_inquiry": "My order hasn't arrived yet, it's been 2 weeks"},
            "target": {"customer_inquiry": "My order hasn't arrived yet, it's been 2 weeks"}
        },
        {
            "data": {"customer_inquiry": "I need to return a damaged product"},
            "target": {"customer_inquiry": "I need to return a damaged product"}
        },
    ],
    executor=generate_customer_response,
    evaluators={
        "human_helpfulness": HumanEvaluator(),  # Creates reference scores
        "llm_judge_helpfulness": llm_judge_helpfulness,  # LLM judge to validate
    },
    group_name="llm_judge_calibration"
)
After collecting human reference scores, you can:
- Compare correlation - Measure how well LLM judge scores correlate with human scores (see the sketch after this list)
- Identify disagreements - Find cases where LLM and human scores differ significantly
- Iterate on prompts - Refine your judge prompt based on disagreement patterns
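For example, once you have both sets of scores side by side, a few lines of Python are enough to quantify agreement. The following is a minimal sketch: the human_scores and judge_scores lists are hypothetical placeholders for values you would export from your evaluation results, and Pearson correlation is just one possible agreement metric.

from statistics import correlation  # Pearson correlation, Python 3.10+

# Hypothetical normalized scores, one entry per datapoint, in the same order.
human_scores = [1.0, 0.33, 0.67, 1.0]
judge_scores = [1.0, 0.67, 0.67, 0.67]

# Overall agreement between the LLM judge and the human reference scores.
print(f"Pearson correlation: {correlation(human_scores, judge_scores):.2f}")

# Rank datapoints by how strongly the judge disagrees with the human score;
# these are the cases to inspect first when refining the judge prompt.
disagreements = sorted(
    range(len(human_scores)),
    key=lambda i: abs(human_scores[i] - judge_scores[i]),
    reverse=True,
)
for i in disagreements[:3]:
    print(f"datapoint {i}: human={human_scores[i]}, judge={judge_scores[i]}")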
Let’s explore how to collect human evaluator data into datasets and use it to validate your LLM-as-a-judge evaluator.
Collecting human evaluator data into datasets
The SQL Editor is a powerful tool for analyzing your human evaluator results and creating datasets for training or validating LLM-as-a-judge evaluators. Here’s how to leverage it:
Finding human evaluator spans
Human evaluator spans are stored with span_type = 'HUMAN_EVALUATOR' and can be identified by their evaluator name. Here’s a basic query to find all human evaluator spans for a specific evaluation ID:
SELECT
input,
output
FROM spans
WHERE span_type = 'HUMAN_EVALUATOR'
AND evaluation_id = '<evaluation_id>' -- Replace with your evaluation id
ORDER BY start_time DESC
Creating reference datasets from human scores
After running this query, click “Export to Dataset” to:
- Create a new dataset or add to an existing one
- Map the input to the dataset data field
- Map the output to the dataset target field
Then you can use this dataset to run an evaluation of your LLM-as-a-judge evaluator, using the human score stored in the target field as the expected score.
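For example, a validation run against the exported dataset could look roughly like the sketch below. It assumes the human evaluator’s input was exported into the data field with keys such as customer_inquiry and response, and that the human score ended up in target under a human_score key; these field names are assumptions about how you mapped the export, so adjust them to match your dataset. The sketch reuses llm_judge_helpfulness from the example above.

from lmnr import evaluate

# Hypothetical datapoints in the shape produced by the dataset export;
# in practice these come from the dataset created in the previous step.
reference_data = [
    {
        "data": {
            "customer_inquiry": "My order hasn't arrived yet, it's been 2 weeks",
            "response": "I'm sorry for the delay. Here is how we can track your order...",
        },
        "target": {"human_score": 0.67},  # normalized human reference score
    },
]

def judge_executor(data: dict) -> float:
    # Re-run the LLM judge on the stored response instead of generating a new one.
    return llm_judge_helpfulness(
        output=data["response"],
        target={"customer_inquiry": data["customer_inquiry"]},
    )

def judge_matches_human(output: float, target: dict) -> int:
    # 1 if the judge lands within one scale step (1/3) of the human score.
    return 1 if abs(output - float(target["human_score"])) <= 0.34 else 0

evaluate(
    data=reference_data,
    executor=judge_executor,
    evaluators={"judge_matches_human": judge_matches_human},
    group_name="llm_judge_validation",
)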