Metrics

A collection of metrics for assessing model performance in Targeted Syntactic Evaluation. Many of these are available through the GrewTSEvaluator, which calculates them automatically when it is called after a Minimal-Pair Dataset has been provided. Alternatively, you may call them directly through this module.

grewtse.evaluators.compute_surprisal(p, is_log=False)
Computes -log2(p), otherwise known as ‘surprisal’.
In the context of a language model, surprisal measures how strongly the model expects a particular word or token: the lower the surprisal, the more expected the token. This helps us discern how confident a model is in choosing grammatical over ungrammatical forms.
Return type:

float

Returns:

surprisal value
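
As an illustration of the formula above, here is a minimal sketch of -log2(p) for a single probability. The name surprisal_sketch is hypothetical and this is not the library's implementation; the is_log option is left out because its behaviour is not described here.

```python
import math


def surprisal_sketch(p: float) -> float:
    """Return the surprisal of probability p in bits, i.e. -log2(p).

    Hypothetical helper for illustration only.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be a probability in (0, 1]")
    return -math.log2(p)


# A token the model finds likely carries little surprisal ...
print(surprisal_sketch(0.5))    # 1.0 bit
# ... while an unlikely token carries more.
print(surprisal_sketch(0.125))  # 3.0 bits
```

Halving the probability adds exactly one bit of surprisal, which is why the scores are easier to compare than raw probabilities.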

grewtse.evaluators.compute_average_surprisal(probs)
Applies the surprisal function across all probabilities in a Pandas Series object and returns the mean.
Parameters:

probs (Series) – a Pandas Series of probabilities.

Return type:

float

Returns:

the mean of all surprisal values.
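
The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: average_surprisal_sketch is a hypothetical name, and the sketch accepts any iterable of probabilities (a Pandas Series iterates over its values, so it would also work here).

```python
import math
from typing import Iterable


def average_surprisal_sketch(probs: Iterable[float]) -> float:
    """Mean of -log2(p) over all probabilities (hypothetical helper)."""
    surprisals = [-math.log2(p) for p in probs]
    if not surprisals:
        raise ValueError("probs must contain at least one probability")
    return sum(surprisals) / len(surprisals)


# Surprisals are 1, 2 and 3 bits, so the mean is 2 bits.
print(average_surprisal_sketch([0.5, 0.25, 0.125]))  # 2.0
```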

grewtse.evaluators.compute_average_surprisal_difference(correct_form_probs, wrong_form_probs)
Subtracts the average model surprisal of the grammatical forms from the average model surprisal of the ungrammatical forms.
In general, it is better if the surprisal is low for grammatical words and high for ungrammatical ones, except in unusual experiments where the opposite is desired.

The difference is set up so that a higher value is better (i.e. average surprisal is higher for ungrammatical items) and a lower value is worse.

Parameters:
  • correct_form_probs (Series) – Pandas Series of probabilities for each correct / grammatical form.

  • wrong_form_probs (Series) – Pandas Series of probabilities for each incorrect / ungrammatical form.

Return type:

float

Returns:

A float corresponding to the model’s average certainty in the grammatical form. Higher is better.
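
The direction of the subtraction can be made concrete with a small sketch. The function names below are hypothetical and the code only illustrates the description above (mean ungrammatical surprisal minus mean grammatical surprisal); it is not taken from the library.

```python
import math
from typing import Iterable


def _mean_surprisal(probs: Iterable[float]) -> float:
    """Mean of -log2(p); hypothetical helper for this sketch."""
    surprisals = [-math.log2(p) for p in probs]
    return sum(surprisals) / len(surprisals)


def average_surprisal_difference_sketch(
    correct_form_probs: Iterable[float],
    wrong_form_probs: Iterable[float],
) -> float:
    """Mean surprisal of wrong forms minus mean surprisal of correct forms."""
    return _mean_surprisal(wrong_form_probs) - _mean_surprisal(correct_form_probs)


correct = [0.5, 0.25]     # surprisals: 1 and 2 bits, mean 1.5
wrong = [0.125, 0.0625]   # surprisals: 3 and 4 bits, mean 3.5
print(average_surprisal_difference_sketch(correct, wrong))  # 2.0
```

A positive result means the ungrammatical forms were, on average, more surprising to the model than the grammatical ones; swapping the two arguments flips the sign.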

grewtse.evaluators.compute_normalised_surprisal_difference(correct_form_probs, wrong_form_probs)
Similar to compute_average_surprisal_difference, but with a further normalisation step.
Parameters:
  • correct_form_probs (Series) – Pandas Series of probabilities for each correct / grammatical form.

  • wrong_form_probs (Series) – Pandas Series of probabilities for each incorrect / ungrammatical form.

Return type:

float

Returns:

A float corresponding to the model’s normalised average certainty in the grammatical form. Higher is better.

grewtse.evaluators.compute_accuracy(grammatical_form_probs, ungrammatical_form_probs)

Calculate accuracy: proportion of correct predictions. Assumes the model should always predict grammatical form (label = 1).

Return type:

float
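
One plausible reading of this metric is sketched below: a pair counts as a correct prediction when the grammatical form receives a higher probability than the ungrammatical one. That comparison rule, the treatment of ties as incorrect, and the name accuracy_sketch are all assumptions made for illustration; the library's own rule may differ.

```python
from typing import Iterable


def accuracy_sketch(
    grammatical_form_probs: Iterable[float],
    ungrammatical_form_probs: Iterable[float],
) -> float:
    """Proportion of pairs where the grammatical form is more probable.

    Assumption: a strict comparison is used, so ties count as incorrect.
    """
    grammatical = list(grammatical_form_probs)
    ungrammatical = list(ungrammatical_form_probs)
    if not grammatical or len(grammatical) != len(ungrammatical):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for g, u in zip(grammatical, ungrammatical) if g > u)
    return correct / len(grammatical)


grammatical = [0.6, 0.2, 0.9, 0.5]
ungrammatical = [0.3, 0.4, 0.1, 0.5]
# Pairs 1 and 3 are correct; pair 2 is wrong and pair 4 is a tie.
print(accuracy_sketch(grammatical, ungrammatical))  # 0.5
```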

grewtse.evaluators.compute_entropy(probs, k=None)

Compute entropy of a probability distribution.

Higher entropy indicates more uncertainty (flatter distribution). Lower entropy indicates more certainty (peaked distribution).

Parameters:
  • probs – Array-like of probabilities (can be numpy array, torch tensor, or pandas Series)

  • k – Optional number of top probabilities to consider. If provided, only the top-k probabilities are used and renormalized.

Returns:

Raw entropy (in nats if using natural log)
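
To illustrate the description above, here is a minimal sketch of Shannon entropy in nats with optional top-k renormalisation. The name entropy_sketch is hypothetical, the sketch works on any iterable of floats, and the handling of zero probabilities (skipped) is an assumption rather than documented library behaviour.

```python
import math
from typing import Iterable, Optional


def entropy_sketch(probs: Iterable[float], k: Optional[int] = None) -> float:
    """Shannon entropy in nats, optionally over the renormalised top-k values."""
    values = sorted((float(p) for p in probs), reverse=True)
    if k is not None:
        values = values[:k]                   # keep the k largest probabilities
        total = sum(values)
        if total <= 0.0:
            raise ValueError("top-k probabilities must have positive mass")
        values = [v / total for v in values]  # renormalise so they sum to 1
    # Assumption: terms with p == 0 are skipped, since p * log(p) tends to 0.
    return -sum(v * math.log(v) for v in values if v > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy_sketch(uniform))      # close to log(4): the flattest case
print(entropy_sketch(peaked))       # much lower: the model is confident
print(entropy_sketch(peaked, k=2))  # entropy over the two largest values only
```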

grewtse.evaluators.compute_entropy_based_certainty(probs, k=None)
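Computes a certainty score from normalised entropy: H_norm = H / H_max, where H_max = log(n) and n is the number of probabilities considered.
The result is returned as 1 - H_norm, so that a higher value means more certainty.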
Parameters:
  • probs (Series) – Array-like of probabilities (can be numpy array, torch tensor, or pandas Series)

  • k (int | None) – Optional number of top probabilities to consider. If provided, only the top-k probabilities are used and renormalized.

Returns:

Certainty score 1 - H_norm, ranging from 0 (uniform distribution) to 1 (all probability mass on a single outcome). Higher is more certain.
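
The normalisation H_norm = H / log(n) and the final 1 - H_norm step can be sketched as below. The name entropy_certainty_sketch is hypothetical, and returning 1.0 when only one probability remains (where log(n) is 0) is an assumption made to avoid a division by zero, not documented library behaviour.

```python
import math
from typing import Iterable, Optional


def entropy_certainty_sketch(probs: Iterable[float], k: Optional[int] = None) -> float:
    """Return 1 - H / log(n), so higher values mean a more peaked distribution."""
    values = sorted((float(p) for p in probs), reverse=True)
    if not values:
        raise ValueError("probs must contain at least one probability")
    if k is not None:
        values = values[:k]                   # keep the k largest probabilities
        total = sum(values)
        values = [v / total for v in values]  # renormalise so they sum to 1
    n = len(values)
    if n == 1:
        return 1.0  # Assumption: a single outcome is treated as fully certain.
    entropy = -sum(v * math.log(v) for v in values if v > 0.0)
    return 1.0 - entropy / math.log(n)


print(entropy_certainty_sketch([0.25, 0.25, 0.25, 0.25]))  # close to 0: maximally uncertain
print(entropy_certainty_sketch([1.0, 0.0, 0.0]))           # 1.0: fully certain
print(entropy_certainty_sketch([0.97, 0.01, 0.01, 0.01]))  # between 0 and 1
```

Dividing by log(n) makes scores comparable across distributions of different sizes, which raw entropy is not.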