Pipeline

class grewtse.GrewTSEPipe

Main pipeline controller for generating prompt-based or masked minimal-pair datasets derived from UD treebanks.

This class acts as a high-level interface for the Grew-TSE workflow:

Parse treebanks to build lexical item datasets.
Generate masked or prompt-based datasets using GREW.
Create minimal pairs for syntactic evaluation.
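
The three steps above might be chained as in the following sketch. `run_workflow` is a hypothetical helper, not part of Grew-TSE: it is duck-typed and only calls the methods documented on this page on whatever pipeline object it receives, for example a `GrewTSEPipe()` instance.

```python
def run_workflow(pipe, treebank_paths, query, target_node, morph_features):
    """Hypothetical helper: chain the three documented pipeline steps.

    `pipe` is assumed to be a GrewTSEPipe instance (or any object exposing
    the same methods); nothing here is part of the library itself.
    """
    # Step 1: parse the treebank(s) into a lexical item set.
    lexicon = pipe.parse_treebank(treebank_paths)
    # Step 2: isolate a construction with a GREW query and mask the target node.
    masked = pipe.generate_masked_dataset(query, target_node)
    # Step 3: derive minimal pairs by changing the requested features.
    pairs = pipe.generate_minimal_pair_dataset(morph_features)
    return lexicon, masked, pairs
```

For decoder models, `generate_prompt_dataset(query, target_node)` would take the place of step 2.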

parse_treebank(filepaths, reset=False)

Parse one or more treebanks and create a lexical item set. A lexical item set is a dataset of words and their features.

Parameters:

filepaths (str | list[str]) – Path or list of paths to treebank files.
reset (bool) – If True, clears existing lexical_items before parsing.

Return type:

DataFrame
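
To make "a dataset of words and their features" concrete, here is a minimal sketch that collects one record per word from CoNLL-U text using only the standard library. It is not the library's parser: the function name, the record keys other than `sentence_id` and `token_id`, and the lowercasing of feature names are assumptions made for illustration.

```python
def lexical_items(conllu_text):
    """Collect one record per word from CoNLL-U text (illustrative only)."""
    items = []
    sentence_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sentence_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # Keep plain word lines only: 10 columns and an integer ID
        # (this skips comments, blank lines, multiword ranges, empty nodes).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        feats = {}
        if cols[5] != "_":
            for pair in cols[5].split("|"):
                key, value = pair.split("=", 1)
                feats[key.lower()] = value  # assumption: lowercase keys
        items.append({
            "sentence_id": sentence_id,
            "token_id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": feats,
        })
    return items


# A two-token sample sentence, built with explicit tab separators.
sample = "\n".join([
    "# sent_id = s1",
    "\t".join(["1", "Hunde", "Hund", "NOUN", "_", "Case=Nom|Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bellen", "bellen", "VERB", "_", "_", "0", "root", "_", "_"]),
])
items = lexical_items(sample)
```

Each record pairs a word form with its lemma, part of speech, and morphological features, keyed by sentence and token position.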

load_lexicon(filepath, treebank_paths)

Load a previously generated lexicon (typically returned from the parse_treebank function) from disk and attach it to the pipeline.

Use this method to resume processing with an existing lexical item set (LI_set) that was generated earlier and saved as a CSV file. It loads the LI_set, validates the required columns, sets the appropriate index, and updates the pipeline and parser with the loaded data.

Parameters:

filepath (str) – Path to the CSV file containing the LI_set to load. The file must contain the columns "sentence_id" and "token_id".
treebank_paths (list[str]) – A list of paths to the treebanks associated with the LI_set. These paths are stored so the pipeline can later reference the corresponding treebanks when generating or analyzing data.

Raises:

FileNotFoundError – If the CSV file cannot be found at the given filepath.
ValueError – If the required index columns ("sentence_id", "token_id") are missing.

Return type:

None

Example

>>> from grewtse import GrewTSEPipe
>>> pipe = GrewTSEPipe()
>>> pipe.load_lexicon("output/li_set.csv", ["treebank1.conllu", "treebank2.conllu"])
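
The validation described above (a missing file, missing index columns) could look roughly like the sketch below. It uses only the standard `csv` module and is a stand-in for illustration, not the library's implementation; `read_lexicon_csv` and `REQUIRED_COLUMNS` are hypothetical names.

```python
import csv

REQUIRED_COLUMNS = ("sentence_id", "token_id")


def read_lexicon_csv(filepath):
    """Hypothetical stand-in for the validation step; not the library's code."""
    # open() raises FileNotFoundError if the path does not exist.
    with open(filepath, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        # Index each row by the (sentence_id, token_id) pair.
        return {(row["sentence_id"], row["token_id"]): row for row in reader}
```

Indexing by the pair of sentence and token identifiers gives each lexical item a unique key, which is presumably why both columns are required.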

generate_masked_dataset(query, target_node, mask_token='[MASK]')

Once a treebank has been parsed, you can generate a masked dataset for evaluating models on masked language modelling (MLM), e.g. encoder models. Provide a GREW query that isolates a particular construction and a target node that identifies the element of that construction you want to test. The target word is replaced by the mask token, which defaults to [MASK].

Parameters:

query (str) – the GREW query that specifies a construction. Queries can be tested at https://universal.grew.fr/
target_node (str) – the variable defined in your GREW query that represents the target word
mask_token (str) – the token that replaces the target word in the masked text. Defaults to '[MASK]'.

Return type:

DataFrame

Returns:

a DataFrame consisting of the sentence ID in the given treebank, the index of the token to be masked in the set of tokens, the list of all tokens, the matched token itself, the original text, and lastly the masked text.
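
As a rough illustration of how the masked text relates to the other returned fields, the sketch below replaces the token at the matched index with the mask token. The function name and the dictionary keys are hypothetical, and joining tokens with single spaces is a simplification that ignores the original spacing.

```python
def mask_sentence(tokens, match_index, mask_token="[MASK]"):
    """Return the sentence with the token at match_index replaced (sketch)."""
    masked = list(tokens)  # copy so the caller's token list is untouched
    masked[match_index] = mask_token
    return " ".join(masked)


tokens = ["Die", "Hunde", "bellen", "."]
row = {
    "match_token": tokens[1],            # hypothetical column name
    "original_text": " ".join(tokens),   # hypothetical column name
    "masked_text": mask_sentence(tokens, 1),
}
```

Passing a different `mask_token`, such as the mask string of the tokenizer you are evaluating, changes only the placeholder.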

generate_prompt_dataset(query, target_node)

Once a treebank has been parsed, you can generate a prompt dataset for evaluating models on next-token prediction (NTP), e.g. decoder models. Provide a GREW query that isolates a particular construction and a target node that identifies the element of that construction you want to test.

Parameters:

query (str) – the GREW query that specifies a construction. Queries can be tested at https://universal.grew.fr/
target_node (str) – the variable defined in your GREW query that represents the target word

Return type:

DataFrame

Returns:

a DataFrame consisting of the sentence ID in the given treebank, the index of the target token, the list of all tokens, the matched token itself, the original text, and the created prompt.
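
One plausible reading of "the created prompt" is the text that precedes the target token, so that a decoder model is asked to predict the target next. The sketch below follows that assumption; `build_prompt` is a hypothetical name and the space-joined tokens are a simplification.

```python
def build_prompt(tokens, target_index):
    """Join every token before the target into a prompt string (sketch)."""
    return " ".join(tokens[:target_index])


tokens = ["The", "keys", "to", "the", "cabinet", "are", "here", "."]
target_index = 5
prompt = build_prompt(tokens, target_index)
target = tokens[target_index]
```

Here the prompt ends just before the verb, so the model's preference between candidate continuations can be compared at that position.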

generate_minimal_pair_dataset(morph_features, ood_pairs=None, has_leading_whitespace=True)

After generating a masked or prompt dataset, you can use this function to add minimal pairs to that dataset by specifying the feature that you would like to change. You can also specify whether additional 'OOD' pairs should be created and whether each minimal-pair item should begin with a leading whitespace.

NOTE: morph_features and upos_features expect lowercase keys; values remain as in the treebank.

Parameters:

morph_features (dict) – the morphological features from the UD treebank that you want to change for the second element of the minimal pair. For example, {'case': 'Dat'} may convert the original target item, e.g. German 'Hunde' (dog.PLUR.NOM / dog.PLUR.ACC), to the dative 'Hunden' (dog.PLUR.DAT), forming the minimal pair (Hunde, Hunden). The exact keys and values depend on the treebank that you're working with.
ood_pairs (int | None) – an optional argument that specifies whether you want alternative (likely semantically implausible) minimal pairs to be provided for each example. These may help in evaluating generalisation performance.
has_leading_whitespace (bool) – specifies whether an additional whitespace is included at the beginning of each element in the minimal pair, e.g. (' is', ' are')

Return type:

DataFrame

Returns:

a DataFrame containing the masked sentences or prompts as well as the minimal pairs
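
The idea behind a feature-based minimal pair can be sketched as a lookup: find the form with the same lemma whose features equal the original's, except for the ones overridden in `morph_features`. This is an illustration under assumed data structures (a list of dictionaries with `form`, `lemma`, and `feats` keys), not the library's algorithm; both function names are hypothetical.

```python
def find_alternative(lexicon, target, morph_features):
    """Return the form sharing the target's lemma with the changed features."""
    # Start from the target's own features and override the requested ones.
    wanted = {**target["feats"], **morph_features}
    for item in lexicon:
        if item["lemma"] == target["lemma"] and item["feats"] == wanted:
            return item["form"]
    return None  # no suitable form in this lexicon


def make_pair(original, alternative, has_leading_whitespace=True):
    """Build the (original, alternative) pair, optionally space-prefixed."""
    prefix = " " if has_leading_whitespace else ""
    return (prefix + original, prefix + alternative)


lexicon = [
    {"form": "Hunde", "lemma": "Hund", "feats": {"case": "Nom", "number": "Plur"}},
    {"form": "Hunden", "lemma": "Hund", "feats": {"case": "Dat", "number": "Plur"}},
]
alternative = find_alternative(lexicon, lexicon[0], {"case": "Dat"})
pair = make_pair("Hunde", alternative)
```

The feature key is lowercase (`case`) while the value keeps its treebank spelling (`Dat`), matching the note above. The leading space matters for tokenizers that treat a word-initial space as part of the token.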

get_morphological_features()

Get a list of all morphological features available in the parsed treebank. The same information can also be found on the treebank's webpage. A treebank must be parsed before this function can be used.

Return type:

list

Returns:

a list of strings with each morphological feature in the treebank.
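
Conceptually, the returned list is the set of feature names that occur anywhere in the parsed lexical items. A minimal sketch under the assumption that each item carries a `feats` dictionary (a hypothetical structure, as is the function name):

```python
def morphological_features(lexicon):
    """Collect every distinct feature name found in the lexicon (sketch)."""
    return sorted({key for item in lexicon for key in item["feats"]})


lexicon = [
    {"form": "Hunde", "feats": {"case": "Nom", "number": "Plur"}},
    {"form": "bellt", "feats": {"number": "Sing", "tense": "Pres"}},
]
features = morphological_features(lexicon)
```

The names in this list are the candidates for keys in `morph_features` when building minimal pairs.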