Preprocessing¶
This page explains how RAITAP preprocessing works. By default, no preprocessing is applied, so pretrained image models that expect ImageNet normalization, or tabular models that expect z-scored features, will produce wrong outputs until you configure it.
Two config keys are available:
| Knob | Where | Typical contents |
|---|---|---|
| data.preprocessing | loader, before batch reaches model | Resize + CenterCrop (images), feature scaling (tabular) |
| data.model_input_transformation | model boundary, every forward pass. ONLY FOR TORCH MODELS. | Normalize, learnable input layers |
preprocessing runs in the loader so mixed-size image folders can stack at all, and so the work is outside autograd. model_input_transformation runs inside autograd so Captum/SHAP attribution and PGD/FGSM attacks operate on the same input space you do.
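To make the "same input space" point concrete, the arithmetic below shows how a perturbation budget changes meaning depending on which side of Normalize an attack operates. It is plain Python, not RAITAP code; the 8/255 budget and the use of the ImageNet red-channel std are illustrative assumptions.

```python
# Illustrative arithmetic only: an L-infinity budget expressed in pixel space
# versus the same budget after Normalize has been applied.
EPS_PIXEL = 8 / 255   # assumed attack budget on [0, 1] pixel values
STD_RED = 0.229       # ImageNet std for the red channel


def to_normalized_budget(eps_pixel: float, std: float) -> float:
    """Normalize computes (x - mean) / std, so a step of eps becomes eps / std."""
    return eps_pixel / std


eps_normalized = to_normalized_budget(EPS_PIXEL, STD_RED)
print(f"pixel-space budget:      {EPS_PIXEL:.4f}")
print(f"normalized-space budget: {eps_normalized:.4f}")
```

With Normalize on the model side, an attack or attribution method sees raw pixel values, so a budget such as 8/255 keeps its usual meaning.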
Warning
Not supported for object detection. Detection models take native per-image inputs and resize/normalise internally (e.g. torchvision GeneralizedRCNNTransform), so neither knob applies: a data-side resize/crop/pad would corrupt the ground-truth box coordinates (labels are not transformed with the pixels), and a model-side transform would double-process the input. Setting data.preprocessing or data.model_input_transformation for a detection model raises an error — leave both unset.
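The box-corruption problem can be seen with a little arithmetic. The sketch below is not RAITAP code; the helper scale_box and the example sizes are hypothetical, chosen only to show what a label transform would have to do if pixels were resized on the data side.

```python
def scale_box(box, old_size, new_size):
    """Rescale an (x_min, y_min, x_max, y_max) box from old (W, H) to new (W, H)."""
    sx = new_size[0] / old_size[0]
    sy = new_size[1] / old_size[1]
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


original_size = (640, 480)         # hypothetical image size (W, H)
resized_size = (320, 240)          # size after a data-side resize
label_box = (100, 50, 200, 150)    # ground-truth box in original pixels

# What the label would need to become to stay aligned with the resized pixels.
correct_box = scale_box(label_box, original_size, resized_size)

# A pixels-only resize leaves the label untouched, so it no longer matches.
stale_box = label_box
print("needed:", correct_box, "kept:", stale_box)
```

Because labels are not transformed with the pixels, the kept box would point at the wrong region of the resized image.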
The following values are allowed for both keys:
- null (default): no preprocessing
- "model-bundled": use the preprocessing bundled inside the model file (e.g. torchvision models)
- path to a .py file: load a user factory decorated with the matching RAITAP decorator (see custom examples below). Requires consent; see --allow-preprocessing-exec.
Examples¶
Torchvision image model, bundled both sides¶
data:
  source: ./data/images
  preprocessing: model-bundled
  model_input_transformation: model-bundled
model:
  source: resnet50

from raitap.data import DataConfig, Preprocessing
from raitap.models import ModelConfig

data = DataConfig(
    source="./data/images",
    preprocessing=Preprocessing.model_bundled,
    model_input_transformation=Preprocessing.model_bundled,
)
model = ModelConfig(source="resnet50")
Handles mixed-size folders: Resize + CenterCrop run per-image as the loader
stacks the batch; Normalize runs on every forward pass. Works whenever
model.source (or model.arch) names a built-in torchvision model.
model-bundled is Torch-only on both sides — it derives from torchvision
weights lineage, which ONNX exports don't carry. For ONNX, set
model_input_transformation to a .py path (custom-file model-side
transformation is wired through the ONNX backend's tensor call path).
Data-side preprocessing via .py works for ONNX too.
Tabular model, custom feature scaling¶
data:
  source: ./data/features.csv
  preprocessing: ./scale.py
  input_metadata:
    kind: tabular

from raitap.data import DataConfig

data = DataConfig(
    source="./data/features.csv",
    preprocessing="./scale.py",
    input_metadata={"kind": "tabular"},
)

# scale.py
import torch
from torch import nn

from raitap.data import raitap_preprocessing_factory


class ZScore(nn.Module):
    def __init__(self, mean: list[float], std: list[float]) -> None:
        super().__init__()
        # Buffers follow the module across devices but are not trained.
        self.register_buffer("mean", torch.tensor(mean))
        self.register_buffer("std", torch.tensor(std))

    def forward(self, x):  # x: (N, F)
        return (x - self.mean) / self.std


@raitap_preprocessing_factory
def standardise() -> nn.Module:
    return ZScore(mean=[0.0, 1.0, 2.0], std=[0.5, 0.5, 0.5])
The module is invoked once on the stacked (N, F) batch — rows are uniform
so per-sample iteration would just waste work. Leave the model-side knob
unset if your network handles its own input layer.
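The mean and std lists passed to ZScore have to come from somewhere, usually the training split. The sketch below shows one way to derive them with the standard library. It is an assumption-laden illustration: the inline CSV, the column names, and the helper column_stats are hypothetical, and it uses the population standard deviation, which may differ from what your training pipeline used.

```python
import csv
import io
import statistics

# Stand-in for a training CSV; in practice you would read your own file.
TRAINING_CSV = """f0,f1,f2
0.0,1.0,2.0
1.0,2.0,3.0
2.0,3.0,4.0
"""


def column_stats(text: str) -> tuple[list[float], list[float]]:
    """Return per-column means and population standard deviations."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = list(rows[0].keys())
    columns = [[float(row[name]) for row in rows] for name in names]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


mean, std = column_stats(TRAINING_CSV)
print("mean:", mean)
print("std: ", std)
```

A constant column yields a std of zero, which would make the division in forward blow up, so check for that before hard-coding the values.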
Bundled Resize/Crop + custom Normalize¶
If you fine-tuned a torchvision architecture on non-ImageNet data, keep the geometry and swap the stats:
data:
  source: ./data/images
  preprocessing: model-bundled
  model_input_transformation: ./my_normalize.py
model:
  source: resnet50

from raitap.data import DataConfig, Preprocessing

data = DataConfig(
    source="./data/images",
    preprocessing=Preprocessing.model_bundled,
    model_input_transformation="./my_normalize.py",
)

# my_normalize.py
from torch import nn
from torchvision.transforms import v2

from raitap.data import raitap_model_input_transformation_factory


@raitap_model_input_transformation_factory
def normalize_for_my_domain() -> nn.Module:
    return v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.2, 0.2, 0.2])
Both sides custom, single file¶
data:
  source: ./data/images
  preprocessing: ./preprocessing.py
  model_input_transformation: ./preprocessing.py

from raitap.data import DataConfig

data = DataConfig(
    source="./data/images",
    preprocessing="./preprocessing.py",
    model_input_transformation="./preprocessing.py",
)

# preprocessing.py
from torch import nn
from torchvision.transforms import v2

from raitap.data import (
    raitap_model_input_transformation_factory,
    raitap_preprocessing_factory,
)


@raitap_preprocessing_factory
def resize_and_crop() -> nn.Module:
    return nn.Sequential(
        v2.Resize(232, antialias=True),
        v2.CenterCrop(224),
    )


@raitap_model_input_transformation_factory
def normalize() -> nn.Module:
    return v2.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    )
When both knobs point at the same file, it is imported and hashed once. Run with the
consent flag (--allow-preprocessing-exec):
uv run raitap --config-name assessment -yp
raitap --config-name assessment -yp
Already preprocessed upstream¶
Your dataloader emits normalized tensors? Leave both knobs unset and
acknowledge at invocation (--acknowledge-preprocessing-off):
uv run raitap --config-name assessment --acknowledge-preprocessing-off
# Or if installed as a console script:
raitap --config-name assessment --acknowledge-preprocessing-off
from raitap import run
run(config, acknowledge_preprocessing_off=True)
Non-image kinds (tabular, text, time_series declared via
input_metadata.kind) auto-suppress the warning.
Custom-file rules¶
A decorated factory must:
- carry @raitap_preprocessing_factory (data side) or @raitap_model_input_transformation_factory (model side),
- take no required arguments,
- return an nn.Module.
One factory per side per file. Pointing a knob at a file with no matching decorator raises before the model is built. Two matching factories raise with their names.
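The "exactly one matching factory" rule can be pictured as a marker-and-scan pattern. The sketch below is not RAITAP's implementation: the marker attribute, the preprocessing_factory decorator, and find_factory are hypothetical names that only illustrate how zero or multiple matches could be rejected.

```python
import types

_MARK = "_is_preprocessing_factory"  # hypothetical marker attribute


def preprocessing_factory(fn):
    """Hypothetical stand-in for a marking decorator."""
    setattr(fn, _MARK, True)
    return fn


def find_factory(module: types.ModuleType):
    """Return the single marked factory in a module, or raise."""
    found = [
        obj
        for obj in vars(module).values()
        if callable(obj) and getattr(obj, _MARK, False)
    ]
    if not found:
        raise LookupError(f"no decorated factory in {module.__name__}")
    if len(found) > 1:
        names = ", ".join(sorted(f.__name__ for f in found))
        raise LookupError(f"multiple decorated factories: {names}")
    return found[0]


# Build a throwaway module to exercise the lookup.
demo = types.ModuleType("demo")


@preprocessing_factory
def resize_and_crop():
    return "module-goes-here"


demo.resize_and_crop = resize_and_crop
factory = find_factory(demo)
print(factory.__name__)
```

Failing at lookup time, before any model is built, keeps a misconfigured file from surfacing as a confusing error deep inside a run.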
Two Protocol types ship for static analysis:
from raitap.data import DataPreprocessingFactory, ModelInputTransformationFactory
_data_check: DataPreprocessingFactory = resize_and_crop
_model_check: ModelInputTransformationFactory = normalize
RAITAP records each file's path and SHA-256 in tracking metadata so changes between runs surface in your history.
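If you want to check locally whether a preprocessing file changed between runs, a SHA-256 digest is easy to compute yourself. This is a minimal sketch using only the standard library; file_sha256 is a hypothetical helper, and how RAITAP computes or stores its own hash is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "preprocessing.py"
    script.write_text("MEAN = [0.5, 0.5, 0.5]\n")
    before = file_sha256(script)

    # Any edit to the file, however small, produces a different digest.
    script.write_text("MEAN = [0.4, 0.4, 0.4]\n")
    after = file_sha256(script)

print(before != after)
```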
When does data-side run per-image?¶
Image sources (.jpg, .png, …) load per-image — the data-side module is
lifted to a single-image (C, H, W) → Tensor callable and applied as each
file is read. That's the only way a directory of varied-size images can be
stacked into one batch.
Tabular sources (.csv, .tsv, .parquet) load as (N, F) in one shot —
the data-side module runs once on the full batch.
If you set data.preprocessing: null on an image source, every file must
already be the same height and width.
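The shape arithmetic behind this is worth spelling out. The sketch below computes output sizes only, with no image data; the helper names and the three example sizes are hypothetical, and the rounding is a simplification of what a real resize does.

```python
def resize_shorter_side(height: int, width: int, target: int) -> tuple[int, int]:
    """Scale so the shorter side equals target, keeping the aspect ratio."""
    if height <= width:
        return target, int(target * width / height)
    return int(target * height / width), target


def center_crop(height: int, width: int, crop: int) -> tuple[int, int]:
    """Output size of a center crop; requires both sides to be at least crop."""
    if height < crop or width < crop:
        raise ValueError("image smaller than crop size")
    return crop, crop


raw_sizes = [(480, 640), (1024, 768), (300, 300)]  # (H, W) of three files
final_sizes = [
    center_crop(*resize_shorter_side(h, w, 232), 224) for h, w in raw_sizes
]
print(final_sizes)
```

Three different input shapes collapse to one, which is what lets the loader stack them into a single batch; without the data-side step the shapes stay distinct and stacking fails.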