Configuration¶

This page describes how to configure the data used to assess the model.

Options¶

Name	Allowed	Default	Description
`name`	`string`	`"imagenet_samples"`	Name shown in outputs and tracking metadata.
`description`	`string, null`	`null`	Optional human-readable dataset description.
`source`	`string, null`	`null`	Path to a local data directory or file, a URL, or a named sample set such as `"imagenet_samples"`. Sample names download on first use into `~/.cache/raitap/<name>/`; the same name is also accepted by `labels.source` when the sample bundles ground-truth labels.
`forward_batch_size`	`int, null`	`null`	Optional batch size for the initial model forward pass used to produce predictions for metrics, report sample ranking, and `auto_pred` explainer targets. If omitted, RAITAP uses its pipeline default of `32`. This does not control explainer attribution batching; use `transparency.<explainer>.raitap.batch_size` for that.
`preprocessing`	`null, "model-bundled", path string`	`null`	Data-side preprocessing applied in the loader, before the batch reaches the model. Per-image for image sources; per-batch on the stacked `(N, F)` tensor for tabular sources. Typical contents: Resize + CenterCrop for images; feature scaling or encoding for tabular. `null` leaves the loader untouched; `"model-bundled"` pulls Resize + CenterCrop from the model's bundled torchvision preset (image models only); a `.py` path loads a factory decorated with `@raitap_preprocessing_factory`. The `.py` path requires consent — see `--allow-preprocessing-exec` and Preprocessing.
`model_input_transformation`	`null, "model-bundled", path string`	`null`	Transformation applied at the model boundary, on every forward pass. Typical contents: Normalize. For Torch backends it stays inside autograd so attribution and adversarial budgets see the user-facing input space; for ONNX backends it runs on the tensor call path before the ONNX session. `"model-bundled"` pulls Normalize from the model's bundled torchvision preset (Torch backends only — ONNX has no torchvision lineage to derive from); a `.py` path loads a factory decorated with `@raitap_model_input_transformation_factory` and works for both Torch and ONNX backends. When both this and `preprocessing` are `null` and inputs are images, a loud warning fires — silence it with `--acknowledge-preprocessing-off`. See Preprocessing.
`labels.source`	`string, null`	`null`	Optional path to a labels file (CSV, TSV, or Parquet), URL, or named sample set. When set to a sample name (e.g. `"imagenet_samples"`), raitap resolves to the labels CSV bundled with that sample. Sample sets without bundled labels raise an error. The reserved value `"directory"` derives classification labels from each sample's top-level class subdirectory (torchvision `ImageFolder` style; no labels file) — see Using your own data or built-in samples. In that mode `id_column` and `id_strategy` do not apply.
`labels.id_column`	`string, null`	`null`	Optional sample-ID column used to align labels with filenames, for example `"image"`.
`labels.column`	`string, null`	`null`	Optional class-label column. If omitted, one-hot numeric columns are reduced with `argmax`.
`labels.encoding`	`"index", "one_hot", "argmax", null`	`null`	Optional label parsing strategy.
`labels.id_strategy`	`"auto", "relative_path", "stem"`	`"auto"`	How label-file ids are matched against discovered sample files. `"auto"` (default) inspects the id column and switches to `"relative_path"` if any value contains `/` or `\`, otherwise falls back to `"stem"`. `"relative_path"` keeps directory components and supports nested ImageFolder layouts (e.g. `NORMAL/IM-0001.jpeg`) — required when filename stems collide across class subdirs. `"stem"` matches by basename only (flat-dir layouts).
`input_metadata`	`dict, null`	`null`	Input modality + layout hints, normally auto-inferred from `data.source` for image and tabular sources. Set this explicitly when the auto-inference cannot resolve the layout (e.g. raw tensors, custom loaders, or ONNX/Torch models whose declared input shape differs from the on-disk sample shape). Keys: `kind` (one of `image`, `tabular`, `text`, `time_series`), `layout` (`NCHW`, `(B,F)`, `(B,T,C)`, `TOKENS`), `feature_names`, and `shape`.
`input_metadata.shape`	`list[int], null`	`null`	The non-batch per-sample layout expected by the model (batch dim is implicit and resolved at runtime). Beyond informing output-space inference, `shape` also controls the model-input reshape performed at the backend boundary: per-sample inputs whose `numel` matches `shape` are reshaped to `(N, *shape)` before being passed to the model. For ONNX models the backend auto-derives the expected shape from `session.get_inputs()[0].shape` — concrete dims are respected as-is and symbolic / unknown dims (e.g. `"batch"`) become dynamic. `input_metadata.shape` overrides the auto-derived value, useful when an ONNX graph declares the batch dim as a fixed `1` even though the model accepts arbitrary batches, or when you want to feed flatter inputs than the graph declares. For Torch models, `nn.Module` declares no shape, so setting `input_metadata.shape` is the only way to enable backend reshape — otherwise inputs are passed through unchanged. A `raitap.utils.errors.ModelInputShapeError` is raised when per-sample `numel` mismatches the expected shape, or when an ONNX graph declares two or more dynamic dims (ambiguous — supply `shape` to disambiguate).

Examples¶

YAML

data:
  name: "my-dataset"
  description: "Internal validation set"
  source: "./data/images"
  forward_batch_size: 32
  preprocessing: model-bundled
  model_input_transformation: model-bundled
  labels:
    source: "./data/labels.csv"
    id_column: "image"
    column: "label"
    encoding: "index"
    id_strategy: "auto"

Python

from raitap.data import (
    DataConfig,
    IdStrategy,
    LabelEncoding,
    LabelsConfig,
    Preprocessing,
)

data = DataConfig(
    name="my-dataset",
    description="Internal validation set",
    source="./data/images",
    forward_batch_size=32,
    preprocessing=Preprocessing.model_bundled,
    model_input_transformation=Preprocessing.model_bundled,
    labels=LabelsConfig(
        source="./data/labels.csv",
        id_column="image",
        column="label",
        encoding=LabelEncoding.index,
        id_strategy=IdStrategy.auto,
    ),
)

CLI

uv run raitap data.source="./data/images" data.preprocessing=model-bundled data.model_input_transformation=model-bundled data.labels.source="./data/labels.csv" data.labels.column=label

pip

raitap data.source="./data/images" data.preprocessing=model-bundled data.model_input_transformation=model-bundled data.labels.source="./data/labels.csv" data.labels.column=label

For tabular models whose backend expects an unusual per-sample layout (such as ACAS Xu, a Torch network whose forward takes (N, 1, 1, 5)), supply input_metadata.shape explicitly so the pipeline reshapes the flat feature vectors before the forward pass:

data:
  input_metadata:
    kind: tabular
    layout: "(B,F)"
    feature_names: [rho, theta, psi, v_own, v_int]
    shape: [1, 1, 5]   # non-batch dims; reshapes (N, 5) -> (N, 1, 1, 5)

For nested ImageFolder-style layouts (e.g. data/test/<class>/<file>.jpg) see Using your own data or built-in samples.