Concept

An evaluation runs a given pipeline version against a given dataset and then evaluates the outputs of the pipeline.

For every datapoint in the dataset:

  1. The inputs of the pipeline being tested are filled in with values from the datapoint, matched by input node name. For example, if a pipeline has inputs "colour" and "season", the evaluation run looks for the keys "colour" and "season" in every datapoint and substitutes their values into the corresponding pipeline inputs.
  2. The pipeline is run. All required env variables must be set for this to succeed.
  3. All outputs of the pipeline are extracted and fed to the evaluator alongside all values in the datapoint.
  4. The evaluator scores the outputs and, if it needs target data, optionally looks up values in the datapoint by output name.

You can run the same evaluation against different pipelines and pipeline versions.
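Putting the steps above together, here is a minimal sketch of the per-datapoint flow. The pipeline and evaluator objects are illustrative assumptions, not the actual product API:

```python
# Illustrative sketch only; `pipeline_version.run` and `evaluator.evaluate`
# are hypothetical stand-ins for the real pipeline and evaluator interfaces.

def run_evaluation(pipeline_version, dataset, evaluator, input_names):
    results = []
    for datapoint in dataset:
        # 1. Fill pipeline inputs from the datapoint, matched by input node name.
        inputs = {name: datapoint[name] for name in input_names}

        # 2. Run the pipeline (env variables must already be set).
        outputs = pipeline_version.run(inputs)

        # 3-4. Hand all outputs and the full datapoint to the evaluator, which may
        #      read target values from the datapoint by output name.
        results.append(evaluator.evaluate(outputs, datapoint))
    return results
```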

Target data

Depending on the evaluator type, it may or may not require target data to run. If the evaluator compares the actual data against a baseline that differs for every input datapoint (e.g. match formatting, compute similarity), it requires target data. If the evaluator compares the data to a static baseline common across all datapoints (e.g. validate formatting) or can evaluate outputs in place (e.g. a pipeline-based evaluator with another ‘critic’ LLM), then it does not require target data. Learn more about whether each evaluator requires target data in Evaluators.
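As a minimal sketch of the distinction, assuming simple function-style evaluators (not the actual evaluator interface):

```python
import json

def exact_match_evaluator(output: str, target: str) -> float:
    # Requires target data: compares the output to a per-datapoint baseline.
    return 1.0 if output.strip() == target.strip() else 0.0

def json_format_evaluator(output: str) -> float:
    # Requires no target data: validates the output against a static rule.
    try:
        json.loads(output)
        return 1.0
    except ValueError:
        return 0.0
```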

Example

Consider the following dataset:

| colour | season | recommendation                                |
| ------ | ------ | --------------------------------------------- |
| brown  | autumn | brown mid-heel shoes for a casual office look |
| yellow | summer | light yellow dress for a beach party look     |

And a pipeline that takes the inputs colour and season, and produces a look recommendation with the output name recommendation.

You could evaluate the similarity of the pipeline outputs to the ground truth recommendations above. This means that your evaluation requires target data, and you need to make sure that the name of the ground truth column matches the output node name of the pipeline you are testing.

By contrast, if you want to match every recommendation against a static value, for example the regex look$, your evaluation does not require target data. In that case, the evaluator will grade ALL output nodes in the pipeline.
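To make the contrast concrete, here is a rough Python illustration using one datapoint from the table above; the standard-library SequenceMatcher and re stand in for real similarity and regex evaluators:

```python
import re
from difflib import SequenceMatcher

# One datapoint from the dataset above and a hypothetical pipeline output.
datapoint = {
    "colour": "brown",
    "season": "autumn",
    "recommendation": "brown mid-heel shoes for a casual office look",
}
pipeline_outputs = {"recommendation": "dark brown loafers for a smart casual office look"}

# With target data: compare the output node to the dataset column of the same name.
similarity = SequenceMatcher(
    None,
    pipeline_outputs["recommendation"],
    datapoint["recommendation"],
).ratio()

# Without target data: match every output node against the static regex look$.
matches = all(re.search(r"look$", value) for value in pipeline_outputs.values())
```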

Creating an evaluation

  1. Go to the “evaluations” page in the navbar
  2. Click “New evaluation” at the top of the page
  3. Give the evaluation a name
  4. Select the dataset
  5. Select and configure an evaluator

Running an evaluation

  1. From the evaluations page, click on an evaluation
  2. In the evaluation page, click “Run evaluation”
  3. Select the pipeline and the version you would like to test
  4. IMPORTANT: If your pipeline has nodes that require env variables, make sure they are set on the API keys page.
  5. Click “Run evaluation”
  6. The result will appear on the page.

Note that the evaluation will run only against the datapoints that contain all the keys required to fill in the pipeline’s input nodes.
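In other words, datapoints missing any required input key are skipped. A small illustrative sketch (names are hypothetical):

```python
def runnable_datapoints(dataset, required_inputs):
    # Keep only datapoints that contain every key needed by the pipeline's input nodes.
    return [dp for dp in dataset if required_inputs <= dp.keys()]

# The second datapoint is skipped because it lacks the "season" key.
runnable = runnable_datapoints(
    [{"colour": "brown", "season": "autumn"}, {"colour": "yellow"}],
    required_inputs={"colour", "season"},
)
```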

Status and results of the evaluation

Possible statuses of evaluation jobs:

  • Running
  • Success
  • Error

Here, Error means that an error occurred while executing the pipeline version.