Visualisers

Each visualiser renders one figure per call. All of them are declared per-assessor, either in YAML:

robustness:
  marabou:
    _target_: "MarabouAssessor"
    visualisers:
      - _target_: "OutputBoundsCohortVisualiser"
      - _target_: "OutputBoundsMarginHeatmapVisualiser"

or with the equivalent Python config:

from raitap.robustness import marabou, output_bounds_cohort, output_bounds_margin_heatmap

robustness = {
    "marabou": marabou(
        visualisers=[output_bounds_cohort(), output_bounds_margin_heatmap()],
    ),
}

Visualisers declare which AssessmentKind they support; the factory rejects mismatches at YAML parse time.
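The kind check described above can be pictured with a small stand-alone sketch. It is not raitap's implementation: the `supported_kinds` attribute, the `check_compatible` helper and the `VerdictSummarySketch` class are hypothetical names chosen for illustration, and only the three `AssessmentKind` members mentioned on this page are modelled.

```python
from enum import Enum, auto


class AssessmentKind(Enum):
    # Member names taken from this page; the real enum may define more.
    EMPIRICAL_ATTACK = auto()
    FORMAL_VERIFICATION = auto()
    STATISTICAL_SAMPLING = auto()


class VerdictSummarySketch:
    # Hypothetical stand-in for a visualiser class.
    supported_kinds = frozenset({AssessmentKind.FORMAL_VERIFICATION})


def check_compatible(visualiser_cls, assessor_kind):
    """Raise early if the visualiser cannot render this assessor's results."""
    if assessor_kind not in visualiser_cls.supported_kinds:
        supported = ", ".join(sorted(k.name for k in visualiser_cls.supported_kinds))
        raise ValueError(
            f"{visualiser_cls.__name__} supports [{supported}], "
            f"not {assessor_kind.name}"
        )


# A matching pair passes silently; a mismatch would raise ValueError.
check_compatible(VerdictSummarySketch, AssessmentKind.FORMAL_VERIFICATION)
```

Failing at configuration time rather than at render time means a mismatched pairing is reported before any expensive assessment has run.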

Empirical attack

ImagePairVisualiser

Renders the first few samples as three side-by-side panels — clean input, perturbed input, and a signed perturbation heatmap — so you can eyeball whether the attack produced a visually plausible counter-example or just noise. Use it as the default first-pass sanity check for any image-modality attack run.

How to read it. Each row is one sample; the three columns, left to right, are the clean input, the perturbed (adversarial) input, and their signed difference. In the difference panel, red/blue marks the per-pixel perturbation direction on the dominant channel and white means untouched. Structure that traces the object means a directed attack; uniform speckle means an undirected one.

| Kwarg | Default | Meaning |
| --- | --- | --- |
| max_samples | 4 | Maximum number of rows (samples) to render. |
| cmap | "RdBu_r" | Diverging colormap used for the signed perturbation panel. |
| diff_scale | None | Fixed symmetric vmin/vmax for the perturbation panel. None means auto-fit per-figure. |

Supports AssessmentKind.EMPIRICAL_ATTACK. Rejects non-image results (InputSpec.kind != IMAGE).
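The effect of diff_scale can be sketched with NumPy. This is an illustration under assumptions, not the library's code: the helper name `symmetric_limits` is hypothetical, and the auto-fit rule (largest absolute deviation in the figure) is a guess at what "auto-fit per-figure" means.

```python
import numpy as np


def symmetric_limits(diff, diff_scale=None):
    """Return (vmin, vmax) centred on zero for a signed perturbation map."""
    if diff_scale is None:
        # Assumed auto-fit rule: largest absolute deviation in this figure.
        diff_scale = float(np.max(np.abs(diff)))
    return -diff_scale, diff_scale


clean = np.array([[0.2, 0.5], [0.7, 0.1]])
perturbed = np.array([[0.2, 0.53], [0.62, 0.1]])
diff = perturbed - clean

auto_limits = symmetric_limits(diff)         # fitted to this figure only
fixed_limits = symmetric_limits(diff, 0.1)   # same scale for every figure
```

A fixed scale keeps colours comparable across runs, whereas auto-fit stretches each figure to its own maximum and so can make a tiny perturbation look as intense as a large one.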

[Figure: ImagePairVisualiser preview]

PerturbationHeatmapVisualiser

Per-sample diverging heatmap of the perturbation only. Useful when you do not need the clean / perturbed comparison and want a denser grid focused on the attack's spatial signature.

How to read it. One panel per sample; colour is the perturbation reduced to a signed scalar per pixel (red/blue = + / − direction, white = untouched) and brightness is magnitude. It is the signed-difference column of the image-pair view on its own. The default reduction (signed_dominant) keeps the signed value of the channel with the largest absolute deviation, so red/blue track the attack's direction instead of cancelling to zero on opposing channels.
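The signed_dominant reduction described above can be sketched in NumPy as follows. The channel-first (C, H, W) layout and the function name are assumptions made for this example; the library's internal layout may differ.

```python
import numpy as np


def signed_dominant(delta):
    """Reduce a (C, H, W) perturbation to (H, W), keeping for each pixel the
    signed value of the channel with the largest absolute deviation."""
    idx = np.abs(delta).argmax(axis=0)  # (H, W) index of the dominant channel
    return np.take_along_axis(delta, idx[None], axis=0)[0]


# One row of two pixels, three channels.
delta = np.array([
    [[0.5, 0.0]],      # channel 0
    [[-0.25, 0.0]],    # channel 1
    [[-0.25, -0.125]], # channel 2
])

reduced = signed_dominant(delta)   # keeps 0.5 and -0.125
plain_mean = delta.mean(axis=0)    # first pixel cancels to 0.0
```

The first pixel shows why the default is not a plain mean: opposing channels average to zero and the pixel would look untouched, while signed_dominant still reports the strongest push.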

| Kwarg | Allowed | Default | Meaning |
| --- | --- | --- | --- |
| max_samples |  | 4 | Maximum number of samples to render. |
| cmap |  | "seismic" | Diverging colormap for the perturbation. |
| aggregate_channels | signed_dominant, mean, mean_abs, max_abs | "signed_dominant" | Per-pixel channel reduction applied before colouring. |

Supports AssessmentKind.EMPIRICAL_ATTACK. Rejects non-image results.

[Figure: PerturbationHeatmapVisualiser preview]

Formal verification

VerdictSummaryVisualiser

Two-panel summary of a verifier batch: a bar chart of per-verdict counts plus a histogram of runtime_per_sample. Look at it first after a verifier run to see how the run went before drilling into bound widths.

How to read it. Left panel — bar height is the number of samples per verdict (VERIFIED / FALSIFIED / UNKNOWN / ERROR); a tall VERIFIED bar means a robust batch. Right panel — x is per-sample verifier runtime in seconds, y is sample count; a long right tail flags samples the verifier struggled on.

| Kwarg | Default | Meaning |
| --- | --- | --- |
| runtime_bins | 20 | Histogram bin count for the runtime panel. |

Supports AssessmentKind.FORMAL_VERIFICATION.
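The data behind the two panels can be sketched without any plotting. The verdict strings and runtimes below are made-up sample values, and the variable names are illustrative rather than the library's own.

```python
from collections import Counter

import numpy as np

verdicts = ["VERIFIED", "VERIFIED", "UNKNOWN", "VERIFIED", "FALSIFIED", "ERROR"]
runtime_per_sample = np.array([0.12, 0.10, 4.80, 0.15, 0.90, 0.05])  # seconds

# Left panel: one bar per verdict, in a fixed order so figures are comparable.
order = ["VERIFIED", "FALSIFIED", "UNKNOWN", "ERROR"]
counts = Counter(verdicts)
bar_heights = [counts.get(v, 0) for v in order]

# Right panel: runtime histogram; runtime_bins mirrors the kwarg above.
runtime_bins = 20
hist, edges = np.histogram(runtime_per_sample, bins=runtime_bins)
```

In this toy batch the single 4.80 s sample lands alone in the last bin, which is the kind of long right tail the text tells you to look for.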

[Figure: VerdictSummaryVisualiser preview]

OutputBoundsCohortVisualiser

One boxplot per output class summarising the certified upper[i, k] - lower[i, k] widths across the verified batch. Reach for it when you want a single figure that says "the certified regions are tight for most samples, but logit 3 has a long tail".

How to read it. The x-axis is the output class k; each box summarises the certified interval width (upper - lower) over the batch for that class. Lower, tighter boxes mean more confident bounds; a long upper whisker or tail means a few samples have loose bounds on that class. Width is uncertainty, not correctness: a wide box does not mean the class is wrong, only that the verifier could not pin its logit tightly.

| Kwarg | Default | Meaning |
| --- | --- | --- |
| whis | 1.5 | Matplotlib whisker length (multiple of IQR). |
| show_outliers | True | Whether to render flier points beyond the whiskers. |

Supports AssessmentKind.FORMAL_VERIFICATION. Renders a placeholder figure when result.output_bounds is None or every row is NaN.
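A rough sketch of the numbers each box summarises, including the placeholder condition, might look like this. The arrays stand in for result.output_bounds, whose actual layout is not specified here; the (samples, classes) shape is an assumption.

```python
import numpy as np

# Assumed layout: one row per sample, one column per class; NaN rows have no bound.
lower = np.array([[1.0, -2.0], [0.5, -1.0], [np.nan, np.nan]])
upper = np.array([[1.5, -1.0], [1.5, 1.0], [np.nan, np.nan]])

width = upper - lower
has_bounds = ~np.isnan(width).all(axis=1)  # rows with at least one finite width

if not has_bounds.any():
    per_class_quartiles = None  # nothing to plot: this is the placeholder case
else:
    # 25th, 50th and 75th percentile of the width, per class (column).
    per_class_quartiles = np.nanpercentile(width[has_bounds], [25, 50, 75], axis=0)
```

Here the second class has the wider box (median width 1.5 against 0.75), which reads as looser bounds on that logit, not as a wrong prediction.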

[Figure: OutputBoundsCohortVisualiser preview]

OutputBoundsPinnedVisualiser

One sub-plot per pinned (or first-finite) sample, showing the certified [lower_k, upper_k] interval for each output class with the target class highlighted. Use it to examine specific samples by index — e.g. "what does the bound for sample 17 look like?"

How to read it. The x-axis (certified value) is the range each class's logit can take under any perturbation inside the budget; each bar spans [lower_k, upper_k]. The target class is red, competitors blue. A sample is VERIFIED exactly when the red bar lies entirely to the right of every blue bar: the target's certified lower bound exceeds every competitor's certified upper bound, so no input in the budget can change the top class. Any overlap means the bound cannot rule out a competitor overtaking the target, so the verdict is UNKNOWN (bound propagation is sound but incomplete, so it never reports FALSIFIED).
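The "red bar right of every blue bar" rule reduces to one comparison per sample. The helper below is a minimal sketch with a hypothetical name, assuming one-dimensional bound arrays for a single sample with at least two classes.

```python
import numpy as np


def bounds_certify_target(lower, upper, target):
    """True when the target's lower bound exceeds every competitor's upper bound."""
    competitors = np.delete(upper, target)  # upper bounds of all other classes
    return bool(lower[target] > competitors.max())


lower = np.array([2.0, -1.0, 0.5])
upper = np.array([3.0, 0.5, 1.8])
tight = bounds_certify_target(lower, upper, target=0)        # True: 2.0 > 1.8

upper_loose = np.array([3.0, 0.5, 2.4])
loose = bounds_certify_target(lower, upper_loose, target=0)  # False: 2.0 < 2.4
```

In the second case the bars for classes 0 and 2 overlap, so the bounds alone cannot certify the sample even though the model may still classify every perturbed input correctly.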

| Kwarg | Default | Meaning |
| --- | --- | --- |
| max_samples | 4 | Maximum number of samples when sample_indices is not provided. |
| max_classes | 20 | Maximum classes drawn per sub-plot. Above it, shows the target plus the classes with the largest certified upper bounds (the closest competitors), so many-class models like ImageNet stay legible instead of collapsing into a wall of rows. |
| target_color | "#d62728" | Bar colour for the target class. |
| bar_color | "#1f77b4" | Bar colour for non-target classes. |
| sample_indices | None | Optional explicit list of row indices to pin. |

Supports AssessmentKind.FORMAL_VERIFICATION. Falls back to a placeholder when bounds are absent.
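The max_classes truncation can be sketched as a small selection routine. The function name is hypothetical, and it assumes the target counts towards the max_classes budget, which this page does not state explicitly.

```python
def select_classes(upper, target, max_classes=20):
    """Pick the class indices to draw for one sample: the target plus the
    competitors with the largest certified upper bounds."""
    n_classes = len(upper)
    if n_classes <= max_classes:
        return list(range(n_classes))
    competitors = [k for k in range(n_classes) if k != target]
    # Largest upper bound first: these are the closest competitors.
    competitors.sort(key=lambda k: upper[k], reverse=True)
    # Assumption: the target takes one of the max_classes slots.
    return [target] + competitors[: max_classes - 1]


upper = [0.1, 2.0, 1.5, -0.3, 1.9]
shown = select_classes(upper, target=3, max_classes=3)   # [3, 1, 4]
```

Ranking by upper bound keeps exactly the classes that could overlap the target's bar, so the truncated plot still shows every competitor that matters for the verdict.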

[Figure: OutputBoundsPinnedVisualiser preview]

OutputBoundsWidthHeatmapVisualiser

A samples-by-classes heatmap whose cell value is the certified width upper - lower. Pick this over the cohort boxplot when batch size is small enough that per-sample visibility is more useful than per-class aggregate stats.

How to read it. Rows are samples, columns are classes; cell colour is the certified width (upper - lower), and brighter per the colormap means a wider, looser bound. Grey cells are NaN rows (FALSIFIED / UNKNOWN / ERROR) with no bound, so the grey pattern doubles as a coverage map of which samples the verifier actually certified.
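The grey "coverage map" falls out of the NaN pattern in the width matrix. The following is a sketch with made-up bounds, assuming uncertified rows are stored as all-NaN as described above.

```python
import numpy as np

lower = np.array([[0.0, 1.0], [np.nan, np.nan], [0.2, 0.4]])
upper = np.array([[0.5, 3.0], [np.nan, np.nan], [0.3, 0.5]])

width = upper - lower
# Masking NaN cells lets a plotting library draw them in a neutral colour.
masked = np.ma.masked_invalid(width)
# Rows with at least one finite width are the samples that carry a bound.
covered_rows = ~np.isnan(width).all(axis=1)
```

Here the middle sample has no bound, so its row would render grey while the other two rows are coloured by width.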

| Kwarg | Default | Meaning |
| --- | --- | --- |
| cmap | "viridis" | Sequential colormap for widths. |
| max_samples | None | Truncate to the first N rows. None renders every row. |
| figsize | None | Manual override; None picks an auto size from sample / class counts. |

Supports AssessmentKind.FORMAL_VERIFICATION.

[Figure: OutputBoundsWidthHeatmapVisualiser preview]

OutputBoundsMarginHeatmapVisualiser

A samples-by-classes heatmap of the per-class certified margin against the target class. Use it to answer "is this batch robustly classified, or merely verified-with-room-to-flip?"

How to read it. Rows are samples, columns are classes; each cell is margin[i, k] = lower[i, target_i] - upper[i, k]. Blue (positive) means the target class is provably above class k everywhere in the certified region; red (negative) means class k could overtake the target — a flip risk. An all-blue row is comfortably robust; any red cell is where robustness is not certified. The target's own column is masked grey.
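The margin formula above vectorises directly in NumPy. The arrays below are invented sample values, and `row_certified` is an illustrative name for the "all-blue row" check, not a field of the library's result object.

```python
import numpy as np

lower = np.array([[2.0, -1.0, 0.5], [0.0, 1.0, -0.5]])
upper = np.array([[3.0, 0.5, 1.8], [0.8, 1.5, 1.2]])
targets = np.array([0, 1])

rows = np.arange(lower.shape[0])
# margin[i, k] = lower[i, target_i] - upper[i, k]
margin = lower[rows, targets][:, None] - upper
# The target's own column carries no information, so blank it out.
margin[rows, targets] = np.nan

# An all-positive (all-blue) row means every competitor is provably below the target.
row_certified = np.nanmin(margin, axis=1) > 0
```

The first sample has competitor margins 1.5 and 0.2, so it is certified with little room on class 2; the second has a margin of -0.2 against class 2, which is the red cell marking a flip risk.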

| Kwarg | Default | Meaning |
| --- | --- | --- |
| cmap | "RdBu" | Diverging colormap. |
| max_samples | None | Truncate to the first N rows. None renders every row. |
| figsize | None | Manual override; None picks an auto size. |

Supports AssessmentKind.FORMAL_VERIFICATION. Falls back to a placeholder when result.targets is missing or shaped wrong.

[Figure: OutputBoundsMarginHeatmapVisualiser preview]

Statistical sampling

CorruptionAccuracyVisualiser

Clean vs corrupted accuracy bars with a CI whisker on the corrupted bar. Use it to see at a glance whether a corruption degrades accuracy and how tight the estimate is.

How to read it. Two bars — clean accuracy vs accuracy under the corruption; the whisker on the corrupted bar is its confidence interval. A big drop means the corruption hurts; a wide whisker means few samples, so treat the estimate cautiously. Annotated with the corruption name, severity, and sample count (N).

Supports AssessmentKind.STATISTICAL_SAMPLING. No image-modality requirement. Renders as an assessor-level figure (one chart for the whole assessment) rather than per-sample.
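To see why a small N produces a wide whisker, here is a sketch using the Wilson score interval. This page does not say which interval the assessor computes, so Wilson is an assumption chosen for illustration, as are the function name and the sample counts.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one sample")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


small = wilson_interval(correct=14, n=20)      # 70% accuracy, few samples
large = wilson_interval(correct=700, n=1000)   # 70% accuracy, many samples
```

Both runs have the same corrupted accuracy, but the 20-sample interval is several times wider, which is why the sample count N is annotated on the figure.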

[Figure: CorruptionAccuracyVisualiser preview]