Gram
- class scio.scores.Gram(*, act_norm=None, mode='raw', max_gram_order=8, cut_off=0.1, separate_diagonal=False, calib_labels='pred')
Bases: BaseScoreClassif
Gram for classification.
- Parameters:
max_gram_order (int) – Maximum order of the Gram matrices to use. In [SO20], the authors use 10. We set the default to 8 and do not recommend going higher: the improvement is marginal, if any, while the required computational resources and the potential for overflows both increase.
cut_off (float) – Following OODEEL’s approach, instead of using (min, max) for the deviation computation, we use the quantiles (cut_off, 1-cut_off). We also evaluate the expected deviation on the same calibration data, so cut_off should not be 0. Defaults to 0.1; a small ablation study suggests the exact value is not critical.
separate_diagonal (bool) – Whether to separate diagonal correlations from the off-diagonal sum. See [SO20, appendix C]. Defaults to False.
calib_labels (LabelKindLike) – See LabelKind. Defaults to "pred".
mode – See BaseScoreClassif.
act_norm – See BaseScore.
Notes
In [SO20, eq. (2)], the authors define \(Gp := (F^p\times{}^{\mathrm t}F^p)^{1/p}\), which is ill-defined for negative correlations. We fix this here using absolute values, but this is never discussed in the article.
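To illustrate the issue, here is a minimal plain-Python sketch of an order-\(p\) Gram matrix on nested lists. The helper name `gram_order_p` and its `use_abs` flag are hypothetical, and applying the absolute value to the correlation just before the root is an assumption about where the fix goes, not a description of the actual implementation.

```python
import math


def gram_order_p(features, p, use_abs=True):
    """Order-p Gram matrix of `features` (rows are channels, columns are positions).

    Hypothetical helper for illustration only.
    """
    powered = [[x ** p for x in row] for row in features]
    gram = []
    for row_i in powered:
        out_row = []
        for row_j in powered:
            corr = sum(a * b for a, b in zip(row_i, row_j))
            if use_abs:
                # Assumed placement of the absolute-value fix.
                corr = abs(corr)
            # A negative base has no real fractional power; report it as nan,
            # as floating-point tensor libraries such as PyTorch do.
            out_row.append(corr ** (1.0 / p) if corr >= 0 else math.nan)
        gram.append(out_row)
    return gram


# Two perfectly anti-correlated channels, odd order p=3.
features = [[1.0, 2.0], [-1.0, -2.0]]
raw = gram_order_p(features, 3, use_abs=False)   # off-diagonal entries are nan
fixed = gram_order_p(features, 3, use_abs=True)  # all entries are finite
```

With odd \(p\), the powered features keep their sign, so the off-diagonal correlation of the two channels is negative and its \(p\)-th root is undefined unless the absolute value is taken.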
The above definition may generate many inf or nan. The sane way to treat them is not at all straightforward. One could either use nanmax() to ignore them, or consider the max to be nan as soon as one sample is nan. The latter deactivates many correlations, while the former feels inconsistent (nan correlations would be treated as \(0\) deviation). Treating them as infinite deviation is degenerate, as it would render the total deviation infinite for too many samples. We decide to ignore nan correlations here. Sadly, this means that if a test sample only has nan correlations because of overflow, it will have a total deviation of \(0\), making it a “normal” sample… I found no sound way to fix this overflow issue except using lower Gram matrix orders (not \(10\) as in the article) and taking (q, 1-q) quantiles instead of (min, max) at calibration, as in OODEEL. The quantile approach may keep the nan-containing tails of the distribution out of the bounds, but it feels very optimistic.
The rescaling [SO20, eq. (5)] is done using the expected value on some data. If there are a few extreme samples in the training data used to compute the mins and maxs, no validation sample will have a nonzero deviation, making the expected deviation \(0\), which raises an issue. I think OODEEL’s approach of using quantiles instead of the min and max solves this problem in a sound way. As such, we propose to use the (10, 90) percentiles of the calibration correlations to compute the deviation and the expected deviation.
As mentioned in [SO20], the method scales very poorly with the number of channels. The authors propose to use a row-wise max instead of keeping all pairwise correlations, which makes it linear in the number of channels. It feels a bit arbitrary and monkeypatchy, but we use it, as in OODEEL, since it quickly becomes necessary.
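The effect of replacing (min, max) with (cut_off, 1-cut_off) quantiles can be sketched in plain Python as follows. The function names `quantile` and `calibration_bounds` are hypothetical, the linear interpolation rule is an assumption, and dropping non-finite values before computing the bounds is one possible reading of "ignoring nan correlations", not necessarily what the class does internally.

```python
import math


def quantile(sorted_vals, q):
    """Linearly interpolated q-quantile of a sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def calibration_bounds(values, cut_off=0.1):
    """(low, high) bounds for one statistic, ignoring nan and inf entries."""
    finite = sorted(v for v in values if math.isfinite(v))
    if not finite:
        # Nothing usable was observed for this statistic.
        return (math.nan, math.nan)
    return (quantile(finite, cut_off), quantile(finite, 1 - cut_off))


# Calibration values for one statistic: ten ordinary values, one extreme
# outlier, and two overflowed entries.
values = [float(v) for v in range(1, 11)] + [1e6, math.nan, math.inf]
finite = [v for v in values if math.isfinite(v)]
minmax_bounds = (min(finite), max(finite))
quantile_bounds = calibration_bounds(values, cut_off=0.1)
```

In this toy case the (min, max) bounds stretch up to the outlier, so ordinary validation samples would never deviate, whereas the quantile bounds stay within the bulk of the data.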
References
Hint
Below this point, the documentation is meant for development purposes only. Manual use of any listed member is highly discouraged. For usage, see Inferring with Confidence.
Useful methods defined here
deviation(low_high, stats) – Deviation from [SO20], patched as in OODEEL.
layer_stats(activations) – Concatenated upper Gram coefficients of all orders.
- static deviation(low_high, stats)
Deviation from [SO20], patched as in OODEEL.
Infinite or nan elements are treated as 0. This monkey patch is mentioned in the module’s doc.
- Parameters:
low_high (Tensor) – Shape (2, stat_dim), first axis encodes (low, high).
stats (Tensor) – A batch of stats to test, of shape (n_samples, stat_dim).
- Returns:
out (Tensor) – Deviations. Shape (n_samples,).
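As a rough illustration of this contract, the following plain-Python sketch works on nested lists instead of tensors. It assumes the relative deviation form of [SO20] (distance to the violated bound divided by the magnitude of that bound) and a sum over the stat_dim axis; the `eps` guard is a hypothetical addition, and none of this is taken from the actual source code.

```python
import math


def deviation(low_high, stats, eps=1e-12):
    """Per-sample total deviation of `stats` from the [low, high] bounds.

    low_high: pair (low, high), each a list of length stat_dim.
    stats: list of n_samples lists, each of length stat_dim.
    """
    low, high = low_high
    out = []
    for sample in stats:
        total = 0.0
        for lo, hi, value in zip(low, high, sample):
            if value < lo:
                dev = (lo - value) / (abs(lo) + eps)
            elif value > hi:
                dev = (value - hi) / (abs(hi) + eps)
            else:
                # Inside the bounds. A nan value also lands here, since every
                # comparison involving nan is False.
                dev = 0.0
            # Infinite or nan deviations are treated as 0.
            total += dev if math.isfinite(dev) else 0.0
        out.append(total)
    return out


low_high = ([1.0, 2.0], [3.0, 4.0])
stats = [
    [2.0, 3.0],            # inside both bounds
    [0.0, 6.0],            # below the first bound, above the second
    [math.nan, math.inf],  # overflowed sample
]
deviations = deviation(low_high, stats)
```

Note that the overflowed sample ends up with a total deviation of 0, which is exactly the caveat raised in the notes above.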
- layer_stats(activations)
Concatenated upper Gram coefficients of all orders.
Operates on the given layer’s batched activations.
- Parameters:
activations (Tensor) – Shape (n_samples, *sample_activations_shape). Per-sample activations must be \(1\)D, \(2\)D or \(3\)D tensors, the first two cases being treated as implicitly single-channel.
- Returns:
out (Tensor) – Shape (n_samples, max_gram_order * n_channels ** 2 / 2).
- Raises:
ValueError – If not 2 <= activations.ndim <= 4.
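For intuition, a per-sample version of "concatenated upper Gram coefficients of all orders" might look like the sketch below, written in plain Python on nested lists. The helper name `upper_gram_stats` is hypothetical. Starting the orders at 1, including the diagonal in the upper triangle, and taking the absolute value before the root are all assumptions, and the row-wise max reduction mentioned in the notes is deliberately left out.

```python
def upper_gram_stats(features, max_gram_order=2):
    """Concatenate upper-triangular Gram entries for orders 1..max_gram_order.

    features: one sample, as n_channels lists of flattened spatial values.
    """
    n_channels = len(features)
    stats = []
    for p in range(1, max_gram_order + 1):
        powered = [[x ** p for x in row] for row in features]
        for i in range(n_channels):
            # j starts at i: the diagonal is kept (an assumption).
            for j in range(i, n_channels):
                corr = sum(a * b for a, b in zip(powered[i], powered[j]))
                stats.append(abs(corr) ** (1.0 / p))
    return stats


# One sample with 3 channels and 2 spatial positions.
sample = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]
stats = upper_gram_stats(sample, max_gram_order=2)
# Under these assumptions each order contributes
# n_channels * (n_channels + 1) / 2 entries.
```

The quadratic growth in the number of channels is visible directly in the double loop, which is what motivates a cheaper per-row reduction for wide layers.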