Human Evaluators
How to use human evaluators in your evaluations.
Human evaluators enable you to incorporate human judgment into your evaluation pipeline. Unlike automated evaluators that provide immediate scores, human evaluators create evaluation tasks that require manual scoring through Laminar’s evaluation interface.
Use case: evaluating LLM-as-a-judge
One of the most common use cases for human evaluators is creating reference data to evaluate and calibrate LLM-as-a-judge evaluators.
When dealing with complex tasks that can’t be scored with simple code, you often need to prompt a strong reasoning LLM to act as an evaluator. However, you can’t assume your judge prompt is optimal or aligned with human judgment. Human evaluators provide the ground truth needed to:
- Validate LLM judge accuracy - Compare LLM scores against human expert judgment
- Optimize judge prompts - Test different prompting strategies and select the ones that best match human judgment
When to use human evaluators
Human evaluators are particularly valuable when:
- Creating reference data for LLM judges - The primary use case for validating and improving automated LLM evaluators
- Subjective quality assessment - Evaluating creativity, tone, or style where human judgment is essential
- Domain expertise required - Evaluating specialized content that requires expert knowledge
How human evaluators work
When you include a HumanEvaluator in your evaluation:
- Evaluation runs normally - All automated evaluators execute immediately
- Human evaluation tasks created - Each datapoint generates a task requiring manual scoring
- Deferred scoring - Human evaluators appear as pending in the evaluation results
- Manual scoring through UI - Evaluators access the evaluation interface to assign scores
- Results integration - Human scores are incorporated into the overall evaluation metrics
Basic usage
Here’s a simple example that combines automated and human evaluation: we use code to check that the story is under MAX_WORDS, and a human evaluator to assess the story’s quality.
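A minimal sketch of this setup in Python is shown below. It assumes the lmnr SDK’s evaluate() helper and HumanEvaluator class, with OpenAI as the example model provider; the exact datapoint format and constructor arguments may differ from your SDK version, so treat the specifics as illustrative and check the SDK reference.

```python
from lmnr import HumanEvaluator, evaluate
from openai import OpenAI

MAX_WORDS = 200
client = OpenAI()


def write_story(data: dict) -> str:
    # Executor: generate a short story for the given topic.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short story (under {MAX_WORDS} words) about {data['topic']}.",
        }],
    )
    return response.choices[0].message.content


def length_check(output: str, target=None) -> int:
    # Automated evaluator: 1 if the story is under MAX_WORDS, else 0.
    return 1 if len(output.split()) < MAX_WORDS else 0


evaluate(
    data=[
        {"data": {"topic": "a lighthouse keeper"}},
        {"data": {"topic": "a city without cars"}},
    ],
    executor=write_story,
    evaluators={
        "length_check": length_check,       # scored immediately
        "story_quality": HumanEvaluator(),  # creates a pending task for manual scoring
    },
)
```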
After you run the evaluation, you’ll see the results in the dashboard. Notice how story_quality scores are pending, while length_check is completed.
When you click on a single evaluation run, you’ll see the human evaluator span. You can see the data that was sent to this human evaluator and manually score it.
After you score the human evaluator, you’ll see the scores in the dashboard.
Validating LLM-as-a-judge evaluators
Here’s a practical example of using human evaluators to create reference data for validating and improving LLM-as-a-judge evaluators:
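The sketch below shows one way to set this up, again assuming the lmnr evaluate() helper and HumanEvaluator: the same criterion is scored twice, once by an LLM judge evaluator and once by a human, so the human scores become the reference for the judge. The judge prompt, model names, and datapoint shape are illustrative assumptions.

```python
import json

from lmnr import HumanEvaluator, evaluate
from openai import OpenAI

client = OpenAI()


def summarize(data: dict) -> str:
    # Executor under evaluation: produce a summary of the input text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize the following text:\n\n{data['text']}"}],
    )
    return response.choices[0].message.content


def llm_judge_quality(output: str, target=None) -> int:
    # LLM-as-a-judge evaluator: ask a stronger model to grade the summary from 1 to 5.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rate the following summary from 1 (poor) to 5 (excellent). "
                'Respond with JSON like {"score": 3}.\n\n' + output
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["score"]


evaluate(
    data=[{"data": {"text": "Laminar is an open-source platform for tracing and evaluating LLM applications."}}],
    executor=summarize,
    evaluators={
        "judge_quality": llm_judge_quality,  # automated LLM judge
        "human_quality": HumanEvaluator(),   # human reference score for the same criterion
    },
)
```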
Analyzing judge performance
After collecting human reference scores, you can:
- Compare correlation - Measure how well LLM judge scores correlate with human scores
- Identify disagreements - Find cases where LLM and human scores differ significantly
- Iterate on prompts - Refine your judge prompt based on disagreement patterns (see the sketch below)
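For example, once you have paired judge and human scores for the same datapoints, a short offline script can quantify agreement and surface disagreements. The scores below are hypothetical; plug in your own results.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical paired scores for the same datapoints, e.g. pulled from the
# evaluation results or exported via the SQL Editor.
judge_scores = [4, 2, 5, 3, 4, 1]
human_scores = [5, 2, 4, 3, 2, 1]

# Pearson correlation between the LLM judge and the human reference.
print(f"correlation: {correlation(judge_scores, human_scores):.2f}")

# Flag datapoints where the judge and the human disagree by 2 or more points;
# these are the cases to inspect when refining the judge prompt.
disagreements = [
    (i, judge, human)
    for i, (judge, human) in enumerate(zip(judge_scores, human_scores))
    if abs(judge - human) >= 2
]
print("large disagreements (index, judge, human):", disagreements)
```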
Let’s explore how to collect human evaluator data into datasets and use it to validate your LLM-as-a-judge evaluator.
Collecting human evaluator data into datasets
The SQL Editor is a powerful tool for analyzing your human evaluator results and creating datasets for training or validating LLM-as-a-judge evaluators. Here’s how to leverage it:
Finding human evaluator spans
Human evaluator spans are stored with span_type = 'HUMAN_EVALUATOR' and can be identified by their evaluator name. Here’s a basic query to find all human evaluator spans in an evaluation with a given ID:
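A sketch of such a query is below. Only span_type = 'HUMAN_EVALUATOR' comes from this page; the table name, the remaining columns, and the way the evaluation ID is filtered are assumptions, so check the schema shown in the SQL Editor for your project.

```sql
-- Sketch: human evaluator spans for one evaluation (schema assumed, verify in the SQL Editor).
SELECT
    name,    -- evaluator name, e.g. 'story_quality'
    input,   -- the data that was sent to the human evaluator
    output   -- the score assigned by the human
FROM spans
WHERE span_type = 'HUMAN_EVALUATOR'
  AND name = 'story_quality'
  AND evaluation_id = '<your-evaluation-id>';
```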
Creating reference datasets from human scores
After running this query, click “Export to Dataset” to:
- Create a new dataset or add to an existing one
- Map the input to the dataset data field
- Map the output to the dataset target field
Then you can use this dataset to run an evaluation of your LLM-as-a-judge evaluator, using the human score in the target field as the expected score.
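A minimal sketch of that validation run is below. It assumes the lmnr LaminarDataset helper for loading the exported dataset (the dataset name here is hypothetical) and an exact-match agreement evaluator; the shape of data and target depends on how you mapped the columns during export, so adjust the parsing accordingly.

```python
import json

from lmnr import LaminarDataset, evaluate
from openai import OpenAI

client = OpenAI()


def run_judge(data: dict) -> int:
    # Executor: apply the judge prompt to the content the human originally scored.
    # 'data' holds the exported human evaluator input; its structure is an assumption.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rate the following story from 1 (poor) to 5 (excellent). "
                'Respond with JSON like {"score": 3}.\n\n' + json.dumps(data)
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["score"]


def agreement_with_human(output: int, target) -> int:
    # 1 if the judge matches the human reference score exactly, else 0.
    # The 'score' key is an assumption about the export mapping; swap in a
    # tolerance or correlation metric if exact match is too strict.
    human_score = int(target["score"]) if isinstance(target, dict) else int(target)
    return 1 if output == human_score else 0


evaluate(
    data=LaminarDataset("human-reference-scores"),  # the dataset exported above (hypothetical name)
    executor=run_judge,
    evaluators={"agreement_with_human": agreement_with_human},
)
```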