Evaluation

A class for carrying out automatic evaluations of models available on the Hugging Face platform with generated Minimal-Pair Datasets. Note that you must install the additional eval dependencies to use these tools.

The primary evaluation metric is accuracy, i.e. the proportion of tests where the model is “correct”. In Targeted Syntactic Evaluation, a model is typically deemed correct when P(Grammatical Item) > P(Ungrammatical Item). How these probabilities are calculated, however, can lead to differing results. This package lets you choose between token-level and sentence-level evaluation: the former takes the joint probability of just the tokens in the target word, while the latter takes the joint probability of all tokens in the sentence.
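The difference between the two evaluation types can be illustrated with a minimal sketch. The per-token probabilities below are invented for illustration and do not come from any model, and the helper names are hypothetical rather than part of Grew-TSE. The joint probability is computed as a sum of log-probabilities, which is a standard way to avoid numerical underflow.

```python
import math

def joint_log_prob(probs):
    """Joint log-probability of a token sequence (sum of per-token log-probabilities)."""
    return sum(math.log(p) for p in probs)

def prefers_grammatical(gram_probs, ungram_probs):
    """True when P(Grammatical Item) > P(Ungrammatical Item)."""
    return joint_log_prob(gram_probs) > joint_log_prob(ungram_probs)

# Hypothetical per-token probabilities for the two sentences of a minimal pair.
# The target word is the single token at index 2 in both sentences.
gram_sentence = [0.20, 0.05, 0.30, 0.01]
ungram_sentence = [0.20, 0.05, 0.20, 0.10]
target = slice(2, 3)

# Token-level: only the tokens of the target word are compared.
token_level_correct = prefers_grammatical(gram_sentence[target], ungram_sentence[target])

# Sentence-level: every token in the sentence contributes.
sentence_level_correct = prefers_grammatical(gram_sentence, ungram_sentence)

print(token_level_correct, sentence_level_correct)
```

In this made-up pair the model counts as correct at token level but incorrect at sentence level, because the token after the target is much less probable in the grammatical sentence.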

class grewtse.evaluators.GrewTSEvaluator

Bases: object

An evaluation class designed specifically for rapid syntactic evaluation of models available on the Hugging Face platform.

evaluate_model(mp_dataset, model_repo, task_type, evaluation_type='token-level', evaluation_cols=['form_grammatical', 'form_ungrammatical'], entropy_topk=100, save_to=None, device='cpu', row_limit=None)

Function for carrying out Targeted Syntactic Evaluation for either encoder or decoder models.

Parameters:
  • mp_dataset (DataFrame) – A DataFrame containing the Minimal-Pair Dataset generated by Grew-TSE.

  • model_repo (str) – the Hugging Face model repository link.

  • task_type (Literal['mlm', 'ntp']) – choose either ‘mlm’ (masked language modelling) or ‘ntp’ (next-token prediction).

  • evaluation_type (Literal['token-level', 'sentence-level']) – choose how to calculate accuracy, ‘token-level’ or ‘sentence-level’. Defaults to ‘token-level’.

  • evaluation_cols (list[str]) – a list of strings indicating the columns containing the target words / tokens. Defaults to form_grammatical and form_ungrammatical, as used by GrewTSEPipe.

  • entropy_topk (int) – how many probabilities are taken into account for calculating the model uncertainty.

  • device (str) – the device that you want to use e.g. cpu, cuda. Defaults to cpu.

  • save_to (Optional[str]) – a path to save the resulting CSV file to.

  • row_limit (Optional[int]) – place a limit on the number of samples / rows evaluated in the Minimal-Pair Dataset.

Return type:

DataFrame

Returns:

A DataFrame containing the evaluation results for each sample.
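A call to evaluate_model might look like the following sketch. It assumes that GrewTSEvaluator can be constructed without arguments, which this page does not state, and the run_evaluation wrapper and the EVAL_OPTIONS dictionary are hypothetical names introduced only for illustration. The import is done lazily so that the sketch can be defined without the eval dependencies installed.

```python
# Keyword options taken from the documented signature; the values are examples only.
EVAL_OPTIONS = {
    "task_type": "ntp",
    "evaluation_type": "sentence-level",
    "evaluation_cols": ["form_grammatical", "form_ungrammatical"],
    "entropy_topk": 100,
    "save_to": "results.csv",  # hypothetical output path
    "device": "cpu",
    "row_limit": 20,
}

def run_evaluation(mp_dataset, model_repo, options=None, evaluator=None):
    """Evaluate a Minimal-Pair Dataset (a DataFrame) and return the per-sample results."""
    if evaluator is None:
        # Lazy import: requires grewtse with its eval dependencies installed.
        from grewtse.evaluators import GrewTSEvaluator
        evaluator = GrewTSEvaluator()  # assumption: no constructor arguments
    opts = dict(EVAL_OPTIONS if options is None else options)
    return evaluator.evaluate_model(mp_dataset, model_repo, **opts)
```

Passing row_limit keeps a first run short, which is useful for checking that the model and dataset load correctly before evaluating every row.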

evaluate_from_filepath(mp_dataset_filepath, model_repo, task_type, evaluation_type='token-level', evaluation_cols=['form_grammatical', 'form_ungrammatical'], entropy_topk=100, save_to=None, device='cpu', row_limit=None)

Carries out model evaluation using a Minimal-Pair Dataset generated by Grew-TSE that is provided as a filepath.

Parameters:
  • mp_dataset_filepath (str) – the filepath pointing to the Minimal-Pair Dataset. Should be a .csv format.

  • model_repo (str) – the Hugging Face model repository link.

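  • task_type (Literal['mlm', 'ntp']) – choose either ‘mlm’ (masked language modelling) or ‘ntp’ (next-token prediction).

  • evaluation_type (Literal['token-level', 'sentence-level']) – choose how to calculate accuracy, ‘token-level’ or ‘sentence-level’. Defaults to ‘token-level’.

  • evaluation_cols (list[str]) – a list of strings indicating the columns containing the target words / tokens. Defaults to form_grammatical and form_ungrammatical, as used by GrewTSEPipe.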

  • entropy_topk (int) – how many probabilities are taken into account for calculating the model uncertainty.

  • save_to (Optional[str]) – a path to save the resulting CSV file to.

  • device (str) – the device that you want to use e.g. cpu, cuda. Defaults to cpu.

  • row_limit (Optional[int]) – place a limit on the number of samples / rows evaluated in the Minimal-Pair Dataset.

Return type:

DataFrame

Returns:

A DataFrame containing the evaluation results for each sample.

load_evaluation_results(evaluation_dataset_filepath)

Load a set of evaluation results so that additional metrics can be computed, e.g. overall accuracy, average surprisal.

Parameters:
  • evaluation_dataset_filepath (str) – the filepath pointing to a set of previously generated evaluation results.

Return type:

None
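The following sketch shows how previously saved results might be loaded and summarised with the metric methods documented on this page. The summarise_saved_results helper is a hypothetical name, and constructing GrewTSEvaluator without arguments is an assumption. Only the default column names are used.

```python
def summarise_saved_results(evaluation_dataset_filepath, evaluator=None):
    """Load saved evaluation results and collect the documented summary metrics."""
    if evaluator is None:
        # Lazy import: requires grewtse with its eval dependencies installed.
        from grewtse.evaluators import GrewTSEvaluator
        evaluator = GrewTSEvaluator()  # assumption: no constructor arguments

    # Returns None; the results are kept on the evaluator for the metric calls below.
    evaluator.load_evaluation_results(evaluation_dataset_filepath)

    return {
        "accuracy": evaluator.get_accuracy(),
        "avg_surprisal_difference": evaluator.get_avg_surprisal_difference(),
        "norm_avg_surprisal_difference": evaluator.get_norm_avg_surprisal_difference(),
        "avg_certainty": evaluator.get_avg_certainty(),
    }
```

A typical call would pass the path that was given as save_to during evaluation, for example summarise_saved_results("results.csv"), where the filename is a placeholder.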

get_avg_surprisal_difference(grammatical_column='p_form_grammatical', ungrammatical_column='p_form_ungrammatical')

Get the average surprisal difference (ASD). A higher score indicates that the model, on average, tends towards being more confident in the grammatical word than in the ungrammatical one. This interpretation is only approximate: as with any average, ASD scores can be skewed by outliers.

Return type:

float

Returns:

the value of the ASD.

get_norm_avg_surprisal_difference(grammatical_column='p_form_grammatical', ungrammatical_column='p_form_ungrammatical')

Get the normalised average surprisal difference (ASD). A higher score indicates that the model, on average, tends towards being more confident in the grammatical word than in the ungrammatical one. This interpretation is only approximate: as with any average, ASD scores can be skewed by outliers.

This normalised version simply calculates (Average Grammatical Surprisal - Average Ungrammatical Surprisal) / Average Grammatical Surprisal.

Parameters:
  • grammatical_column (str) – a string representing the name of the grammatical column.

  • ungrammatical_column (str) – a string representing the name of the ungrammatical column.

Return type:

float

Returns:

the value of the normalised ASD.
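The normalisation formula given for get_norm_avg_surprisal_difference can be sketched in plain Python. This is not the library's implementation: it assumes that the two columns hold probabilities, that surprisal is the negative base-2 logarithm of the probability, and that the results can be represented as a list of dictionaries. The function names and the sample values are hypothetical.

```python
import math

def surprisal(p):
    """Surprisal in bits; the base-2 logarithm is an assumption of this sketch."""
    return -math.log2(p)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def norm_avg_surprisal_difference(
    rows,
    grammatical_column="p_form_grammatical",
    ungrammatical_column="p_form_ungrammatical",
):
    """(avg grammatical surprisal - avg ungrammatical surprisal) / avg grammatical surprisal."""
    avg_gram = mean(surprisal(r[grammatical_column]) for r in rows)
    avg_ungram = mean(surprisal(r[ungrammatical_column]) for r in rows)
    return (avg_gram - avg_ungram) / avg_gram

# Invented probabilities for two samples.
rows = [
    {"p_form_grammatical": 0.5, "p_form_ungrammatical": 0.125},
    {"p_form_grammatical": 0.25, "p_form_ungrammatical": 0.0625},
]
value = norm_avg_surprisal_difference(rows)
print(value)
```

With the formula read literally, as in this sketch, a model that assigns more probability to the grammatical forms produces a negative value, since lower surprisal means higher probability. The library's sign convention may differ, so check a known case before interpreting scores.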

get_avg_certainty()

Average certainty of the model computed over all probability distributions.

Return type:

float

get_accuracy(grammatical_column='p_form_grammatical', ungrammatical_column='p_form_ungrammatical')

Get the proportion of the time that the model predicts the grammatical form over the ungrammatical form. A value of -1 indicates something went wrong.

Parameters:
  • grammatical_column (str) – a string representing the name of the grammatical column.

  • ungrammatical_column (Union[str, List[str]]) – a string, or a list of strings, representing the name(s) of the ungrammatical column(s).

Return type:

float

Returns:

a float between 0 and 1, where 1 is 100% accuracy.
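The accuracy criterion described above can be sketched as follows. This is an illustration rather than the library's code: it assumes a single ungrammatical column, represents the results as a list of dictionaries, and returns -1 for an empty result set as one guess at what "something went wrong" might cover. The accuracy function name and the sample values are hypothetical.

```python
def accuracy(
    rows,
    grammatical_column="p_form_grammatical",
    ungrammatical_column="p_form_ungrammatical",
):
    """Proportion of rows where the grammatical form is more probable than the ungrammatical one."""
    if not rows:
        # Assumption: an empty result set is treated as an error case.
        return -1.0
    correct = sum(1 for r in rows if r[grammatical_column] > r[ungrammatical_column])
    return correct / len(rows)

# Invented probabilities for four samples; the model is "correct" on three of them.
rows = [
    {"p_form_grammatical": 0.40, "p_form_ungrammatical": 0.10},
    {"p_form_grammatical": 0.05, "p_form_ungrammatical": 0.20},
    {"p_form_grammatical": 0.30, "p_form_ungrammatical": 0.25},
    {"p_form_grammatical": 0.60, "p_form_ungrammatical": 0.01},
]
print(accuracy(rows))
```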