Using your own data or built-in samples
Your own data
RAITAP can load your own data by pointing data.source to a local file or
directory.
Supported inputs include:
- A directory of images such as .jpg, .png, .bmp, or .webp (walked recursively; nested ImageFolder-style layouts are supported)
- A single image file
- A CSV, TSV, or Parquet file
- A directory containing CSV, TSV, or Parquet files
Example (flat directory):
data:
  source: "./data/images" # a directory of images
  labels:
    source: "./data/labels.csv"
    id_column: "image"
    column: "label"
from raitap.data import DataConfig, LabelsConfig
data = DataConfig(
    source="./data/images",  # a directory of images
    labels=LabelsConfig(
        source="./data/labels.csv",
        id_column="image",
        column="label",
    ),
)
with labels.csv rows like:
image,label
IM-0001.jpeg,0
IM-0002.jpeg,1
The ids are bare filenames (no path separators), so auto matches by
stem. See How labels match to samples below.
Example (nested ImageFolder layout — data/test/<class>/<file>.jpg):
data/test/
├── NORMAL/IM-0001.jpeg
├── NORMAL/IM-0002.jpeg
└── PNEUMONIA/IM-0001.jpeg # colliding stem with NORMAL/
data:
  source: "./data/test"
  labels:
    source: "./data/labels.csv"
    id_column: "image"
    column: "label"
    # id_strategy: "auto" # default; relative paths auto-detected
from raitap.data import DataConfig, IdStrategy, LabelsConfig
data = DataConfig(
    source="./data/test",
    labels=LabelsConfig(
        source="./data/labels.csv",
        id_column="image",
        column="label",
        # id_strategy=IdStrategy.auto,  # default; relative paths auto-detected
    ),
)
with labels.csv rows like:
image,label
NORMAL/IM-0001.jpeg,0
NORMAL/IM-0002.jpeg,0
PNEUMONIA/IM-0001.jpeg,1
The default labels.id_strategy: "auto" detects the path separators and
matches by relative path (extension is stripped during comparison, so
NORMAL/IM-0001.jpeg and NORMAL/IM-0001 both work). Sample order is
sorted by relative posix path. See Configuration for the full
id_strategy reference.
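A manifest like the one above can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; it is not part of RAITAP. The helper names (list_relative_ids, write_manifest), the IMAGE_SUFFIXES set, and the labels mapping are hypothetical and only illustrate how relative posix ids can be produced for a nested layout.

```python
import csv
from pathlib import Path

# Assumption: the suffixes to treat as images; adjust to your own data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_relative_ids(source: Path) -> list[str]:
    """Return image paths relative to `source` as posix strings, sorted."""
    return sorted(
        p.relative_to(source).as_posix()
        for p in source.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def write_manifest(source: Path, labels: dict[str, int], out_csv: Path) -> None:
    """Write an image,label CSV; `labels` maps each relative id to a class id."""
    with out_csv.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image", "label"])
        for rel_id in list_relative_ids(source):
            writer.writerow([rel_id, labels[rel_id]])
```

Because the ids keep their directory part, colliding filenames such as NORMAL/IM-0001.jpeg and PNEUMONIA/IM-0001.jpeg stay distinct in the resulting file.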
How labels match to samples
Labels come from the data.labels file, not from folder names. RAITAP does
not infer the class from a parent directory the way torchvision ImageFolder
does. Each labels row is tied to a sample in one of two ways:
- By id: when labels.id_column is set, its value is matched against the discovered sample files.
- By order: when no id_column is set, labels align to samples by sorted file order (row 1 to the first file, and so on).
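Order-based alignment can be pictured with a short sketch. This is an illustration only, not RAITAP's implementation: align_by_order is a hypothetical helper, plain string sorting stands in for the sorted file order, and treating a length mismatch as an error is an assumption of the sketch.

```python
def align_by_order(sample_paths: list[str], labels: list[int]) -> dict[str, int]:
    """Pair labels with samples by sorted path order (row 1 -> first file)."""
    ordered = sorted(sample_paths)
    if len(ordered) != len(labels):
        # Assumption: a length mismatch is treated as an error in this sketch.
        raise ValueError(f"{len(ordered)} samples but {len(labels)} label rows")
    return dict(zip(ordered, labels))
```

The pairing depends entirely on sort order, so adding or renaming a file shifts every label after it. Matching by id avoids that fragility.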
For id matching, id_strategy controls how both sides are normalised before
the lookup. The same rule is applied to the sample paths (from disk) and the
label ids (from the file):
| id_strategy | strips | NORMAL/IM-0001.jpeg becomes | use when |
|---|---|---|---|
| relative_path | extension only | NORMAL/IM-0001 | label ids carry the directory (manifest) |
| stem | directory + ext | IM-0001 | flat dir, label ids are bare filenames |
| auto | picks per id | depends on the id column | leave it; detects which form fits |
auto switches to relative_path as soon as any label id contains / or
\, otherwise stem. The manifest form (relative_path/auto) is the
conventional one and the safe default. stem collapses ids that share a
filename across subdirs (NORMAL/IM-0001.jpeg and PNEUMONIA/IM-0001.jpeg
both become IM-0001), which raises a duplicate-id error.
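The normalisation rules described above can be sketched in a few lines of standard-library Python. This is an illustration of the described behaviour, not RAITAP's actual code: normalise, resolve_auto, and duplicates are hypothetical names, and the strategies are passed as plain strings.

```python
from collections import Counter
from pathlib import PurePosixPath


def normalise(raw: str, strategy: str) -> str:
    """Normalise one id (a sample path or a label id) under a strategy."""
    path = PurePosixPath(raw.replace("\\", "/"))  # accept either separator
    if strategy == "relative_path":
        return path.with_suffix("").as_posix()  # strip the extension only
    if strategy == "stem":
        return path.stem  # strip the directory and the extension
    raise ValueError(f"unknown strategy: {strategy}")


def resolve_auto(label_ids: list[str]) -> str:
    """Pick relative_path if any label id has a path separator, else stem."""
    has_separator = any("/" in i or "\\" in i for i in label_ids)
    return "relative_path" if has_separator else "stem"


def duplicates(ids: list[str], strategy: str) -> list[str]:
    """Return normalised ids that occur more than once under a strategy."""
    counts = Counter(normalise(i, strategy) for i in ids)
    return sorted(key for key, n in counts.items() if n > 1)
```

For the nested example, duplicates(["NORMAL/IM-0001.jpeg", "PNEUMONIA/IM-0001.jpeg"], "stem") reports IM-0001 as a collision, while the same ids under "relative_path" stay distinct.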
Labels from directory structure
If your images are already organised into one folder per class (the
torchvision ImageFolder convention), set labels.source: "directory" to use
the folder names as labels. No labels file needed.
data/train/
├── NORMAL/IM-0001.jpeg # label: NORMAL
├── NORMAL/IM-0002.jpeg
└── PNEUMONIA/IM-0001.jpeg # label: PNEUMONIA
data:
  source: "./data/train"
  labels:
    source: "directory"
from raitap.data import DIRECTORY_LABELS_SOURCE, DataConfig, LabelsConfig
data = DataConfig(
    source="./data/train",
    labels=LabelsConfig(source=DIRECTORY_LABELS_SOURCE),  # == "directory"
)
The class is each sample's top-level subdirectory; nesting within a class
folder is fine. Class ids are assigned alphabetically (NORMAL to 0,
PNEUMONIA to 1). id_column and id_strategy do not apply. If images sit
directly under the source with no class subdirs, RAITAP warns and falls back to
predictions as metric targets.
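The class assignment described above can be sketched as follows. This is a hypothetical illustration, not RAITAP's implementation: directory_labels is an invented helper, and "alphabetically" is approximated with Python's default string sort, which is case-sensitive.

```python
from pathlib import PurePosixPath


def directory_labels(relative_paths: list[str]) -> dict[str, int]:
    """Map each sample to a class id derived from its top-level folder."""
    class_of: dict[str, str] = {}
    for rel in relative_paths:
        parts = PurePosixPath(rel).parts
        if len(parts) < 2:
            continue  # file sits directly under the source: no class folder
        class_of[rel] = parts[0]  # deeper nesting inside a class is ignored
    # Assumption: default string sort stands in for alphabetical order.
    class_ids = {name: i for i, name in enumerate(sorted(set(class_of.values())))}
    return {rel: class_ids[name] for rel, name in class_of.items()}
```

With the layout shown earlier, every file under NORMAL/ maps to 0 and every file under PNEUMONIA/ maps to 1, regardless of how deeply it is nested inside its class folder.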
If you want to evaluate metrics against ground-truth labels, configure the
optional data.labels block as described in Configuration.
Built-in samples
RAITAP also includes a few built-in sample datasets that can be referenced by
name through data.source.
Available sample names (registered in src/raitap/data/samples.py) are:
- imagenet_samples: 4 ImageNet test images (tench, shih-tzu, golden retriever, tiger cat), with bundled labels.csv.
- isic2018: small ISIC 2018 dermoscopy subset (no labels).
- malaria: small malaria thin-blood-smear subset (no labels).
- acas_xu_n1_1: ACAS Xu N1_1 tabular sample (no labels).
- UdacitySelfDriving: small Udacity self-driving subset (no labels).
Example:
data:
  source: "imagenet_samples"
from raitap.data import DataConfig
data = DataConfig(source="imagenet_samples")
Built-in samples are useful for quickly testing the pipeline without preparing your own dataset first. RAITAP downloads them to a local cache automatically when needed.