Evaluator metadata

An evaluator may have an optional metadata field associated with it. Metadata is intended to store information that is common across all datapoints, i.e. global to the entire evaluator. For example, for the regex evaluator, the regular expression pattern is stored in its metadata. Evaluator metadata is configured when you create or update the evaluator.
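
For illustration, evaluator metadata can be thought of as a small, evaluator-wide configuration object. The shape and field name below are assumptions for the sake of the example, not Laminar's actual schema:

```python
# Hypothetical illustration of evaluator metadata: a small object that is
# global to the evaluator and shared by every datapoint it scores.
regex_evaluator_metadata = {
    "regex": r"\d{4}-\d{2}-\d{2}",  # the pattern applied to every output
}
```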

Evaluator types

Available evaluator types are:

  • Exact
  • Regex
  • Semantic similarity
  • Pipeline-based

Exact matcher

This is the simplest evaluator: it checks two strings for exact equality. It does NOT trim or strip any surrounding whitespace, and it is case-sensitive.

Requires target data?

Yes.

Metadata

No.

Scoring

If the two strings are identical, the score is 1.0; otherwise it is 0.0. The score is then averaged across all outputs of the pipeline.
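
Conceptually, the scoring works like the following sketch (illustrative only, not Laminar's actual implementation):

```python
def exact_match_score(target: str, actual: str) -> float:
    """1.0 only for byte-for-byte equality: no trimming, case-sensitive."""
    return 1.0 if target == actual else 0.0

def average_score(targets: list[str], actuals: list[str]) -> float:
    """Average the per-output scores across all outputs of the pipeline."""
    scores = [exact_match_score(t, a) for t, a in zip(targets, actuals)]
    return sum(scores) / len(scores)

assert exact_match_score("Hello!", "Hello!") == 1.0
assert exact_match_score("Hello!", "hello!") == 0.0    # case-sensitive
assert exact_match_score("Hello!", "Hello!\n") == 0.0  # whitespace is not stripped
```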

Examples

| Target data | Actual data | Score |
| --- | --- | --- |
| "Hello!" | "Hello!" | 1.0 |
| "Hello, world!" | "Hello!" | 0.0 |
| "Hello!" | "hello!" | 0.0 |
| "Hello!" | "Hello!\n" | 0.0 |

Regex matcher

Matches every datapoint's output against the matcher's regular expression pattern.

Requires target data?

No.

Metadata

Regex pattern

Scoring

If the matcher's regex matches anywhere in the given text, the score is 1.0; otherwise it is 0.0. The score is then averaged across all outputs of the pipeline.
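
A minimal sketch of this scoring logic, assuming standard regex search semantics (a match anywhere in the text counts):

```python
import re

def regex_match_score(pattern: str, actual: str) -> float:
    """1.0 if the pattern matches anywhere in the text, else 0.0."""
    return 1.0 if re.search(pattern, actual) else 0.0

assert regex_match_score(r"\w+!", "Hello!") == 1.0
assert regex_match_score(r"Hel+o, \w+!", "Hello, world!") == 1.0
assert regex_match_score(r"\d+", "Hello!") == 0.0
```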

Examples

| Regex (in evaluator metadata) | Actual data | Score |
| --- | --- | --- |
| "\w+!" | "Hello!" | 1.0 |
| "Hel+o, \w+!" | "Hello, world!" | 1.0 |
| "\d+" | "Hello!" | 0.0 |

Semantic similarity evaluator

Compares two strings based on their semantic similarity in the embedding space.

Requires target data?

Yes.

Metadata

No.

Scoring

Cosine similarity between the embedding vectors of the two strings, scaled to the range 0.0 to 1.0. More specifically, it computes the cosine of the angle between the two embedding vectors, adds 1, and divides the result by 2.

The score is then averaged for all outputs of the pipeline.
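
A minimal sketch of the scaling step, assuming the two embedding vectors have already been computed (the embedding model itself is not specified here):

```python
import math

def scaled_cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, mapped into 0.0-1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = dot / (norm_a * norm_b)   # in [-1, 1]
    return (cosine + 1) / 2            # shifted and scaled into [0, 1]
```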

Examples

The first two rows are gpt-3.5-turbo-16k-generated haikus about cheese; the last row is an example of dissimilar strings.

| Target data | Actual data | Score |
| --- | --- | --- |
| "In creamy embrace,\nMelted dreams upon my tongue,\nCheese, my mouth's delight" | "Milky moon unfolds,\nCascading curds, divine treat,\nCheese dreams weave their spell" | 0.803100049495697 |
| "Soft as clouds, it rests,\nGouda melts on eager tongue,\nCheese dreams come alive." | "Melted on bread's crust,\nSavoring bliss, taste transcends,\nCheese's golden touch." | 0.7639690637588501 |
| "Roses are red" | "Are you gonna buy a flight ticket?" | 0.4335864782333374 |

Pipeline-based evaluator

Runs another pipeline that assesses the outputs of the pipeline being evaluated. This is useful when grading less intuitive metrics, or generally when using an LLM to assess an LLM.

IMPORTANT: make sure that the input node names of the evaluator pipeline exactly match the output node names of the pipeline being evaluated. It is fine if there are fewer inputs than provided outputs, but the evaluation will not run if there are fewer outputs than required inputs.
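
In other words, the evaluator's required inputs must be a subset of the evaluated pipeline's outputs. A hypothetical illustration of that rule (the function and node names are made up for this sketch):

```python
def check_evaluator_compatibility(evaluated_outputs: set[str],
                                  evaluator_inputs: set[str]) -> None:
    """The evaluator may ignore some outputs, but every one of its inputs
    must be present among the evaluated pipeline's output node names."""
    missing = evaluator_inputs - evaluated_outputs
    if missing:
        raise ValueError(f"Evaluated pipeline does not produce outputs: {missing}")

# OK: the evaluator uses only a subset of the provided outputs.
check_evaluator_compatibility({"answer", "search"}, {"answer"})

# Would fail: the evaluator expects "question", which is not an output node.
# check_evaluator_compatibility({"answer"}, {"question", "answer"})
```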

Requires target data?

No.

Metadata

The ID of the evaluator pipeline version.

Scoring

Currently, the output of the evaluator pipeline has to be a number only, so that Laminar can parse it and use it for scoring.

Make sure the output value of the evaluator pipeline is formatted correctly, so that it does not contain redundant information.
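
Put differently, the final output should be parseable as a bare number. The sketch below illustrates that constraint; it is not Laminar's parsing code:

```python
def parse_evaluator_output(output: str) -> float:
    """The evaluator pipeline's output must parse as a plain number;
    extra text or a JSON wrapper would break scoring."""
    return float(output.strip())

parse_evaluator_output("4")           # -> 4.0
# parse_evaluator_output("score: 4")  # would raise ValueError
```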

Example

The pipeline below evaluates a generic RAG pipeline. It takes the user question, the retrieved search context, and the LLM response as inputs, and produces a score on a scale of 1 to 5 based on how relevant the response is to the question.

Example RAG Evaluator

The prompt used:

You are given the user’s question, the retrieved search context, and the RAG flow output. Evaluate, on a scale of 1 to 5, how relevant the RAG flow output is to the user’s question. The score represents the fraction of claims in the output that are relevant to the question over the total claims in the output. A high score means most retrieved content directly answers the question. A low score indicates excess or irrelevant information.

You MUST produce the answer in the following format:

```
{
  "score": a single integer between 1 and 5 inclusive,
  "reasoning": explanation of why you gave this score
}
```

User’s question: {{question}};

Retrieved context: {{search}};

RAG output: {{answer}}
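
Since the scoring step needs a bare number, the JSON produced by this prompt has to be reduced to just the score before the evaluator pipeline returns its output. A sketch of such an extraction step (the function and wiring are hypothetical, not Laminar's API):

```python
import json

def extract_score(llm_output: str) -> str:
    """Pull the integer score out of the LLM's JSON answer and return it
    as a bare string, which is what the scoring step expects."""
    parsed = json.loads(llm_output)
    return str(parsed["score"])

extract_score('{"score": 4, "reasoning": "Most claims answer the question."}')  # -> "4"
```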