Gram#

class scio.scores.Gram(*, act_norm=None, mode='raw', max_gram_order=8, cut_off=0.1, separate_diagonal=False, calib_labels='pred')[source]#

Bases: BaseScoreClassif

Gram for classification.

Parameters:

max_gram_order (int) – Max order of Gram matrices to use. In [SO20], the authors use 10. We set the default to 8 and do not recommend going higher: the improvement is marginal at best, while both the required computational resources and the potential for overflows increase.
cut_off (float) – Following OODEEL’s approach, instead of using (min, max) for deviation computation, we use the (cut_off, 1-cut_off) quantiles. We also evaluate the expected deviation on the same calibration data, so cut_off should not be 0. Defaults to 0.1; a small ablation study suggests the exact value is not critical.
separate_diagonal (bool) – Whether to separate diagonal correlations from the off-diagonal sum. See [SO20, appendix C]. Defaults to False.
calib_labels (LabelKindLike) – See LabelKind. Defaults to "pred".
mode – See BaseScoreClassif.
act_norm – See BaseScore.

Notes

In [SO20, eq. (2)], the authors define \(G_p := (F^p\times{}^{\mathrm t}F^p)^{1/p}\), which is ill-defined for negative correlations. We fix this here by taking absolute values; the issue is never discussed in the article.
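
As an illustration of the fix above, here is a minimal pure-Python sketch of an order-\(p\) Gram matrix with the absolute value taken before the \(p\)-th root. Plain nested lists stand in for tensors, and the name gram_power as well as the exact placement of the absolute value are assumptions made for this example, not the library's implementation.

```python
def gram_power(feats, p):
    """Order-p Gram matrix of ``feats`` (one row per channel).

    Hypothetical helper: abs() is applied to each correlation before
    the p-th root, so negative correlations no longer break the root.
    """
    powered = [[x ** p for x in row] for row in feats]
    return [
        [
            abs(sum(a * b for a, b in zip(row_i, row_j))) ** (1.0 / p)
            for row_j in powered
        ]
        for row_i in powered
    ]


# Two channels, two spatial positions (made-up values).
feats = [[1.0, -2.0], [1.0, 1.0]]

# For p = 3 the raw off-diagonal correlation is 1*1 + (-8)*1 = -7,
# whose real cube root is not what a float power would return.
gram_3 = gram_power(feats, 3)
```

With these values, the off-diagonal entry is the cube root of 7 instead of an undefined (or complex) result.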

The above definition may generate many inf or nan values, and the sane way to treat them is not at all straightforward. Indeed, one could either use nanmax() to ignore them or consider the max to be nan if one sample is nan. The latter deactivates many correlations, while the former feels inconsistent (nan correlations would be treated as \(0\) deviation). Treating them as infinite deviation is degenerate, as it would render the total deviation infinite for too many samples. We decide to ignore nan correlations here. Sadly, this means that a test sample whose correlations are all nan because of overflow gets a total deviation of \(0\), making it a “normal” sample. I found no sound way to fix this overflow issue other than using lower Gram matrix orders (not \(10\) as in the article) and taking (q, 1-q) quantiles instead of (min, max) at calibration, as in OODEEL. The quantile approach may help avoid distribution tails that contain nan, but it feels very optimistic.

The rescaling [SO20, eq. (5)] is done using the expected value on some data. If there are a few extreme samples in the training data used to compute mins and maxs, no validation sample will have a nonzero deviation, making the expected deviation \(0\) and breaking the rescaling. I think OODEEL’s approach of using quantiles instead of the min and max solves this problem in a sound way. As such, we propose to use the (10, 90) percentiles of the calibration correlations to compute the deviation and expected deviation.
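
The quantile bounds described above can be sketched as follows, in pure Python. The function names, the linear-interpolation rule and the choice to drop non-finite values before taking quantiles are all assumptions made for this example; the library may compute its bounds differently.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of the finite entries of ``values``."""
    finite = sorted(v for v in values if math.isfinite(v))
    if not finite:
        return math.nan
    pos = q * (len(finite) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return finite[lo] + (finite[hi] - finite[lo]) * (pos - lo)


def low_high_quantiles(calib_stats, cut_off=0.1):
    """Per-statistic (low, high) bounds, returned as [lows, highs]."""
    columns = list(zip(*calib_stats))
    lows = [quantile(col, cut_off) for col in columns]
    highs = [quantile(col, 1 - cut_off) for col in columns]
    return [lows, highs]


# Made-up calibration stats: 11 samples, 2 statistics each.
# The last sample is extreme in column 0 and nan in column 1.
calib = [[float(i), 2.0] for i in range(10)] + [[1000.0, float("nan")]]
bounds = low_high_quantiles(calib, cut_off=0.1)
```

Here the upper bound of the first statistic stays at 9 rather than 1000, so ordinary validation samples can still show a nonzero deviation.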

As mentioned in [SO20], the method scales very poorly with the number of channels. The authors propose to use a row-wise max instead of keeping all pairwise correlations, which makes it linear in the number of channels. It feels somewhat arbitrary and ad hoc, but we use it, like OODEEL does, because it quickly becomes necessary.
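
A minimal sketch of the row-wise max reduction, assuming a Gram matrix stored as nested lists. The helper name and the values are hypothetical, and no nan handling is attempted here.

```python
def rowwise_max(gram):
    """Keep one statistic per channel: the largest correlation in its row."""
    return [max(row) for row in gram]


# Made-up 3-channel Gram matrix (symmetric).
gram = [
    [1.0, 2.0, 0.5],
    [2.0, 5.0, 1.0],
    [0.5, 1.0, 3.0],
]

# 3 values are kept instead of the 9 pairwise entries.
stats = rowwise_max(gram)
```

The number of retained statistics grows with the number of channels rather than with its square.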

References

[SO20]

Chandramouli Shama Sastry and Sageev Oore. Detecting Out-of-Distribution examples with Gram matrices. In International Conference on Machine Learning, volume 119, 8491–8501. 2020. URL: http://proceedings.mlr.press/v119/sastry20a/sastry20a.pdf.

Hint

Below this point, the documentation is meant for development purposes only. Manual use of any listed member is highly discouraged. For usage, see Inferring with Confidence.

Useful methods defined here

`deviation`(low_high, stats)	Deviation from [SO20], patched as in OODEEL.
`layer_stats`(activations)	Concatenated upper Gram coefficients of all orders.

static deviation(low_high, stats)[source]#

Deviation from [SO20], patched as in OODEEL.

Infinite or nan elements are treated as 0. This workaround is mentioned in the module’s documentation.

Parameters:

low_high (Tensor) – Shape (2, stat_dim), first axis encodes (low, high).
stats (Tensor) – A batch of stats to test of shape (n_samples, stat_dim).

Returns:

out (Tensor) – Deviations. Shape (n_samples,).
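
For intuition, here is a pure-Python sketch of such a deviation, following our reading of the piecewise definition in [SO20]: a statistic inside [low, high] contributes 0, and a statistic outside contributes its distance to the violated bound, relative to the magnitude of that bound. The names deviation_sketch and safe_div are hypothetical, and nested lists stand in for tensors; this is not the library's code.

```python
import math


def safe_div(num, den):
    """Division that returns inf instead of raising on a zero denominator."""
    try:
        return num / den
    except ZeroDivisionError:
        return math.inf


def deviation_sketch(low_high, stats):
    """Total deviation per sample; non-finite terms are counted as 0."""
    lows, highs = low_high
    out = []
    for sample in stats:
        total = 0.0
        for low, high, value in zip(lows, highs, sample):
            if value < low:
                dev = safe_div(low - value, abs(low))
            elif value > high:
                dev = safe_div(value - high, abs(high))
            else:
                # In range, or nan (all comparisons with nan are False).
                dev = 0.0
            if math.isfinite(dev):
                total += dev
        out.append(total)
    return out


low_high = [[1.0, 2.0], [3.0, 4.0]]  # shape (2, stat_dim)
stats = [
    [2.0, 3.0],                    # inside both ranges
    [0.5, 6.0],                    # below low, above high
    [float("nan"), float("inf")],  # overflowed sample
]
devs = deviation_sketch(low_high, stats)
```

The last sample shows the caveat from the Notes: a sample made only of nan or inf statistics ends up with a total deviation of 0.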

layer_stats(activations)[source]#

Concatenated upper Gram coefficients of all orders.

Operates on the given layer’s batched activations.

Parameters:

activations (Tensor) – Shape (n_samples, *sample_activations_shape). Per-sample activations must be \(1\)D, \(2\)D or \(3\)D tensors, the first two cases being treated as implicitly single-channel.

Returns:

out (Tensor) – Shape (n_samples, max_gram_order * n_channels ** 2 / 2).

Raises:

ValueError – If not 2 <= activations.ndim <= 4.