Visualization & Evaluation in Classification

Discriminative Power

The purpose of confidence scores is to discriminate between In-Distribution (InD) and Out-of-Distribution (OoD) samples. To compare how well different scores perform this task, we introduce the concept of Discriminative Power, which denotes any metric satisfying the following conditions.

  1. Be invariant under any strictly increasing (even nonlinear) rescaling of the score. This implies that it depends only on the set of \((FP, TP)\) tuples obtained by thresholding the score at every possible threshold.

  2. Only depend on the Pareto front of all the possible \((FP, TP)\) tuples.

  3. Satisfy basic monotonicity principles (e.g. if an InD sample’s score is increased, all else being equal, then the discriminative power cannot decrease).
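Conditions 1 and 2 can be checked on a toy example. The sketch below is not the library's ROC class: the helper names `roc_points` and `pareto_front` are hypothetical, and it assumes that label 1 marks InD (the positive class), label 0 marks OoD, and a sample is predicted InD when its score is at or above the threshold.

```python
import math


def roc_points(labels, scores):
    """Return the sorted set of (FP, TP) counts over all thresholds.

    Assumption: label 1 is InD (positive), label 0 is OoD, and a sample
    is predicted InD when its score is >= the threshold.
    """
    points = {(0, 0)}  # threshold above every score: nothing predicted InD
    for t in set(scores):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.add((fp, tp))
    return sorted(points)


def pareto_front(points):
    """Keep the points that no other point matches or beats on both axes
    (fewer or equal FP and more or equal TP)."""
    front = []
    for fp, tp in points:
        dominated = any(
            fp2 <= fp and tp2 >= tp and (fp2, tp2) != (fp, tp)
            for fp2, tp2 in points
        )
        if not dominated:
            front.append((fp, tp))
    return front


labels = [1, 1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.4, 0.3, 0.1]

# A strictly increasing, nonlinear rescaling preserves the sample ordering.
rescaled = [math.exp(5.0 * s) for s in scores]

points = roc_points(labels, scores)
points_rescaled = roc_points(labels, rescaled)

print(points == points_rescaled)  # the (FP, TP) set is unchanged
print(pareto_front(points))
```

Because the rescaling keeps the ordering of the samples, every threshold on the original score has a counterpart on the rescaled score that yields the same confusion counts, so any metric computed from these tuples satisfies condition 1.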

In light of the above characterization, we provide a ROC utility class that allows a simple and formal definition of Discriminative Power classes.

ROC(labels, scores)

ROC utility for Discriminative Power and visualization.

Implemented Discriminative Power metrics

We also provide implementations of a few standard metrics of this kind, all derived from the BaseDiscriminativePower base class.

BaseDiscriminativePower

Base class for discriminative power metrics.

AUC

Area under the ROC curve (AUC), optionally partial, in which case it is normalized.

MCC

Maximum Matthews Correlation Coefficient.

TNR

True Negative Rate.

TPR

True Positive Rate.
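To make two of these metrics concrete, here is a minimal sketch of the full (non-partial) AUC and the maximum MCC, computed directly from InD and OoD score lists. It does not use the classes listed above: the function names are hypothetical, and it assumes that InD is the positive class and that higher scores mean "more InD".

```python
import math


def auc(ind_scores, ood_scores):
    """Full ROC AUC: probability that an InD sample outscores an OoD sample,
    with ties counted as one half."""
    wins = 0.0
    for s_ind in ind_scores:
        for s_ood in ood_scores:
            if s_ind > s_ood:
                wins += 1.0
            elif s_ind == s_ood:
                wins += 0.5
    return wins / (len(ind_scores) * len(ood_scores))


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def max_mcc(ind_scores, ood_scores):
    """Largest MCC over all thresholds taken from the observed scores."""
    best = -1.0
    for t in set(ind_scores) | set(ood_scores):
        tp = sum(s >= t for s in ind_scores)  # InD predicted InD
        fp = sum(s >= t for s in ood_scores)  # OoD predicted InD
        fn = len(ind_scores) - tp
        tn = len(ood_scores) - fp
        best = max(best, mcc(tp, fp, tn, fn))
    return best


ind = [0.9, 0.8, 0.6, 0.4]
ood = [0.7, 0.3, 0.1]

print(auc(ind, ood))      # 10 of the 12 (InD, OoD) pairs are correctly ordered
print(max_mcc(ind, ood))  # reached at threshold 0.4
```

Both quantities depend only on how the InD and OoD scores are ordered relative to each other, which is why they are unaffected by a strictly increasing rescaling of the score.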

Visualization & Evaluation

Running experiments

Evaluating scoring algorithms requires gathering several datasets, specifically one InD dataset and possibly several OoD datasets, in order to fit score instances and collect their outputs. This may be done with the following functions.

compute_confidence(scores_fit, *, ind, oods)

Compute the prediction confidence scores for given InD and OoDs.

fit_scores(scores_and_layers, net, ...[, ...])

Fit confidence scores given net and calibration data.

Visualization

We provide two visualization possibilities: histograms and ROC curves.

histogram_oods(conf_ind, conf_oods, *[, ...])

For a given score, plot histograms over all OoD sets.

roc_scores(confs_ind, confs_ood, *[, ...])

For a given OoD set, plot ROCs over all scores.

summary_plot(confs_ind, confs_oods, *[, ...])

Plot and show histograms for each score, ROCs for each OoD set.

Evaluation

Given experimental results, the following functions help quantify the Discriminative Power.

compute_metrics(confs_ind, confs_oods, metrics)

From precomputed confidence scores, evaluate scores.

summary_table(evals, *[, scores_and_layers, ...])

Print a summary of the score evaluation results in a rich table.

topk_evals(evals[, k, baseline])

Identify the scores that perform best in at least one scenario.
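The evaluation workflow can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: the nested-dictionary layout of the confidences, the helper names `evaluate_all` and `best_in_some_scenario`, and the choice of AUC as the only metric. It is not the signature or behavior of the functions listed above.

```python
def auc(ind_scores, ood_scores):
    """Probability that an InD sample outscores an OoD sample (ties count half)."""
    wins = 0.0
    for s_ind in ind_scores:
        for s_ood in ood_scores:
            wins += 1.0 if s_ind > s_ood else 0.5 if s_ind == s_ood else 0.0
    return wins / (len(ind_scores) * len(ood_scores))


def evaluate_all(confs_ind, confs_oods):
    """Return {score_name: {ood_name: auc}}.

    Hypothetical layout: confs_ind[score] is a list of InD confidences and
    confs_oods[score][ood_name] is a list of OoD confidences.
    """
    return {
        name: {
            ood_name: auc(confs_ind[name], confs)
            for ood_name, confs in confs_oods[name].items()
        }
        for name in confs_ind
    }


def best_in_some_scenario(evals):
    """Names of the scores that achieve the top value for at least one OoD set."""
    ood_names = next(iter(evals.values())).keys()
    return {max(evals, key=lambda name: evals[name][ood]) for ood in ood_names}


# Toy confidences for two hypothetical scores and two hypothetical OoD sets.
confs_ind = {"score_a": [0.9, 0.8, 0.7], "score_b": [0.6, 0.5, 0.9]}
confs_oods = {
    "score_a": {"near": [0.75, 0.2], "far": [0.1, 0.3]},
    "score_b": {"near": [0.1, 0.2], "far": [0.7, 0.4]},
}

evals = evaluate_all(confs_ind, confs_oods)
winners = best_in_some_scenario(evals)

for name, per_ood in evals.items():
    print(name, {ood: round(value, 3) for ood, value in per_ood.items()})
print(sorted(winners))
```

In this toy setup each score wins on one OoD set, so neither would be discarded by a "best in at least one scenario" filter.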

Visualization & Evaluation

You can do both at the same time with the following.

summary(confs_ind, confs_oods, *[, ...])

Print evaluation table, plot and show histograms and ROCs.