Concept

A dataset is a collection of datapoints. It can be used to:
  1. Store data for future fine-tuning or prompt-tuning.
  2. Provide inputs and expected outputs for Evaluations.

Format

Every datapoint has two fixed JSON objects, data and target, each with arbitrary keys, and can also carry a metadata object. target is only used in evaluations.
  • data – the actual datapoint data,
  • target – data additionally sent to the evaluator function,
  • metadata – arbitrary key-value metadata about the datapoint.
For every key inside data and target, the value can be any JSON value.

Example

This is an example of a valid datapoint.
{
    "data": {
        "color": "red",
        "size": "large",
        "messages": [
            {
                "role": "user",
                "content": "Hello, can you help me choose a T-shirt?"
            },
            {
                "role": "assistant",
                "content": "I'm afraid, we don't sell T-shirts"
            }
        ]
    },
    "target": {
        "expected_output": "Of course! What size and color are you looking for?"
    }
}
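
The shape described above can be checked before uploading data. The following is a minimal sketch, not part of any SDK: validate_datapoint is a hypothetical helper, and it assumes that data is required while target and metadata may be omitted.

```python
import json
from typing import Any


def validate_datapoint(raw: str) -> dict[str, Any]:
    """Parse a JSON string and check it has the datapoint shape (hypothetical helper)."""
    datapoint = json.loads(raw)
    if not isinstance(datapoint, dict):
        raise ValueError("datapoint must be a JSON object")
    # Assumption: "data" is always required and must be a JSON object.
    if not isinstance(datapoint.get("data"), dict):
        raise ValueError('"data" must be a JSON object')
    # Assumption: "target" and "metadata" are optional, but must be objects when present.
    for key in ("target", "metadata"):
        if key in datapoint and not isinstance(datapoint[key], dict):
            raise ValueError(f'"{key}" must be a JSON object')
    return datapoint


example = validate_datapoint(
    '{"data": {"color": "red", "size": "large"}, '
    '"target": {"expected_output": "Of course!"}}'
)
print(sorted(example))  # ['data', 'target']
```

Values inside data and target are not checked further, since any JSON value is allowed there.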

Use case: Evaluations

Datasets can be used in evaluations to specify inputs and expected outputs. Make sure the dataset keys match the input and output node names of your pipelines. See more on the Evaluations page.

Editing

Datasets are editable. You can edit a datapoint by clicking on it and editing its data as JSON. Each change is saved as a new datapoint version.

Versioning

Each datapoint has a unique id and a created_at timestamp. Every time you edit a datapoint, a new datapoint version is created under the hood with the same id and an updated created_at timestamp. The version stack is push-only: when you revert to a previous version, a copy of that version is created and pushed as the current version. Example:
  • Initial version (v1):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-01T00:00:00.000Z",
  "data": { "key": "initial value" }
}
  • Version 2 (v2):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-05T00:00:05.000Z",
  "data": { "key": "value at v2" }
}
  • Version 3 (v3):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-10T00:00:10.000Z",
  "data": { "key": "value at v3" }
}
Suppose you now revert to version 1 (the initial version). This creates a new version (v4) with the same id, the data from v1, and an updated created_at timestamp.
  • Version 4 (v4):
{
  "id": "019a3122-ca78-7d75-91a7-a860526895b2",
  "created_at": "2025-01-15T00:00:15.000Z",
  "data": { "key": "initial value" }
}
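
The push-only behavior can be modeled in a few lines. This is an illustrative sketch only: DatapointHistory, push, and revert_to are hypothetical names and do not reflect how versions are actually stored.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class DatapointHistory:
    """Hypothetical in-memory model of a push-only version stack."""

    id: str
    versions: list[dict[str, Any]] = field(default_factory=list)

    def push(self, data: dict[str, Any], created_at: Optional[datetime] = None) -> dict[str, Any]:
        # Every edit appends a new version; existing versions are never modified.
        stamp = created_at or datetime.now(timezone.utc)
        version = {"id": self.id, "created_at": stamp.isoformat(), "data": data}
        self.versions.append(version)
        return version

    def revert_to(self, index: int, created_at: Optional[datetime] = None) -> dict[str, Any]:
        # Reverting copies the old data and pushes it as a new current version.
        return self.push(copy.deepcopy(self.versions[index]["data"]), created_at)

    @property
    def current(self) -> dict[str, Any]:
        return self.versions[-1]


history = DatapointHistory(id="019a3122-ca78-7d75-91a7-a860526895b2")
history.push({"key": "initial value"})
history.push({"key": "value at v2"})
history.push({"key": "value at v3"})
history.revert_to(0)  # creates v4 with the data from v1
print(len(history.versions), history.current["data"])
```

After the revert, the stack holds four versions: v1 through v3 are untouched, and v4 carries the same data as v1.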

Datapoint id

When you push a new datapoint to a dataset, a UUIDv7 is generated for it. Because a UUIDv7 begins with a timestamp, sorting datapoints by id recovers their creation order and so preserves the order of insertion.
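
To see why the ids sort by creation time, note that the first 48 bits of a UUIDv7 hold a Unix timestamp in milliseconds. The sketch below uses only the Python standard library; uuid7_timestamp_ms and uuid7_datetime are hypothetical helper names, and the second id is made up for illustration.

```python
import uuid
from datetime import datetime, timezone


def uuid7_timestamp_ms(datapoint_id: str) -> int:
    """Return the millisecond Unix timestamp embedded in a UUIDv7."""
    parsed = uuid.UUID(datapoint_id)
    if parsed.version != 7:
        raise ValueError(f"{datapoint_id} is not a UUIDv7")
    # A UUID is 128 bits; the top 48 bits are the timestamp.
    return parsed.int >> 80


def uuid7_datetime(datapoint_id: str) -> datetime:
    """Convert the embedded timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(uuid7_timestamp_ms(datapoint_id) / 1000, tz=timezone.utc)


ids = [
    "019a3122-ca79-7000-8000-000000000000",  # made-up id, one millisecond later
    "019a3122-ca78-7d75-91a7-a860526895b2",  # id from the versioning example
]
# Sorting by the embedded timestamp puts the earlier datapoint first.
ordered = sorted(ids, key=uuid7_timestamp_ms)
print(ordered[0], uuid7_datetime(ordered[0]).isoformat())
```

Because the timestamp occupies the leading hex digits, sorting the id strings directly gives the same order here. How ids created within the same millisecond are ordered depends on the generator, so treat this as a sketch rather than a guarantee.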