Tutorial: Georgian Intransitive Nom→Erg Minimal-Pair Tests ==================================================================== This tutorial walks through the process of generating minimal-pair syntactic tests for evaluating how well Language Models perform on intransitive constructions. Specifically, it covers the conversion of a Nominative subject to Ergative, producing grammatical/ungrammatical sentence pairs by swapping that single case feature. The actual evaluation process will be shown in a follow-up tutorial. 1. Setup and Configuration -------------------------- Before running any queries, you need to point Grew-TSE at your treebank files and define where outputs will be written. The Georgian Language Corpus (GLC) ships as three standard Universal Dependencies splits: .. code-block:: python from grewtse.pipeline import GrewTSEPipe import pandas as pd import os TREEBANKS_KARTULI = [ "./treebanks/ka_glc-ud-train.conllu", "./treebanks/ka_glc-ud-dev.conllu", "./treebanks/ka_glc-ud-test.conllu", ] OUTPUT_DIR = "./output" LEXICON_FILE = "./output/georgian-lexicon.csv" All three splits are passed to Grew-TSE together so that minimal pairs can be drawn from the largest possible pool of attested sentences. 2. Creating a Task Configuration --------------------------------- The test is driven by a single configuration dictionary. The helper function ``create_config`` assembles this dictionary from the four things that change between tasks: the Grew query pattern, the target word to modify, the case, and a human-readable prefix for the output files. .. code-block:: python def create_config( query: str, target: str, convert_case_to: str, task_prefix: str ) -> dict: task_name = f"{task_prefix}-{target}-to-{convert_case_to}" results_dir = f"{OUTPUT_DIR}/{task_prefix}" if not os.path.isdir(results_dir): os.makedirs(results_dir, exist_ok=True) return { "treebanks": TREEBANKS_KARTULI, "grew_query": query, "target": target, "apply_leading_space": False, # typically False for MLM, True for NTP "output_dataset": f"{results_dir}/{task_name}.csv", "alternative_morph_features": {"case": convert_case_to}, "save_lexicon_to": f"{OUTPUT_DIR}/{LEXICON_FILE}", "task_name": task_name, } Key fields in the returned dictionary: - **``grew_query``** — the Grew pattern (see Section 3) that identifies grammatical target sentences in the treebank. - **``target``** — the variable name defined inside the query pattern whose case will be swapped. - **``alternative_morph_features``** — specifies the new case value to substitute in. Grew-TSE will use the lexicon to find a surface form that carries this case for the same lemma. - **``apply_leading_space``** — set to ``False`` when testing masked language models (MLM) and ``True`` when testing next-token prediction (NTP) models, because the latter expect a preceding whitespace token. 3. Writing the Grew Query Pattern ---------------------------------- The core of the test is a Grew query that picks out sentences matching a particular syntactic construction. Grew queries use a ``pattern { … }`` block to assert the existence of nodes and arcs, and an optional ``without { … }`` block to exclude structures you do not want. You can play around with Grew queries `here `_ The following pattern matches any sentence containing a verb with a single Nominative subject and *no* direct object: .. code-block:: python intransitive_query = """ pattern { V [upos="VERB"]; SUBJ [Case="Nom"]; V -[nsubj]-> SUBJ; } without { V [upos="VERB"]; V -[nsubj]-> SUBJ; V -[obj]-> OBJ; } """ The ``without`` block is essential here: without it, the pattern would also match transitive sentences (which happen to have a Nominative subject), contaminating the intransitive test set. The single configuration for this tutorial is then: .. code-block:: python config_in_to_erg = create_config( query=intransitive_query, target="SUBJ", convert_case_to="Erg", task_prefix="ka-intransitive", ) 4. Running the Pipeline ------------------------ ``run_config`` takes a single configuration dictionary and executes the three main steps of the Grew-TSE pipeline: lexicon construction (or loading), masked-sentence generation, and minimal-pair generation. .. code-block:: python def run_config(config: dict): grewtse = GrewTSEPipe() # Step 1: build or load the lexicon if not os.path.isfile(config["save_lexicon_to"]): lexicon = grewtse.parse_treebank(config["treebanks"]) lexicon.to_csv(config["save_lexicon_to"]) else: grewtse.load_lexicon(config["save_lexicon_to"], config["treebanks"]) # Step 2: find and mask target nodes masked_df = grewtse.generate_masked_dataset( config["grew_query"], config["target"] ) # Step 3: generate the minimal pairs mp_dataset = grewtse.generate_minimal_pair_dataset( config["alternative_morph_features"], has_leading_whitespace=config["apply_leading_space"], ) mp_dataset.to_csv(config["output_dataset"]) task = config["task_name"] structures_masked = masked_df.shape[0] mps_found = mp_dataset.shape[0] return task, structures_masked, mps_found **Step 1 — Lexicon.** The lexicon maps every lemma in the treebank to its attested surface forms and their morphological features. It is written to disk after the first run; subsequent runs load it from the CSV to avoid redundant parsing. **Step 2 — Masking.** Grew-TSE runs the query against the treebank, collects every sentence that matches, and masks the target node (the one named by ``target``) so that its surface form can be replaced later. **Step 3 — Minimal pairs.** For each masked sentence, Grew-TSE looks up the target lemma in the lexicon and finds a surface form that carries the case specified in ``alternative_morph_features``. If one exists, a grammatical/ungrammatical pair is emitted; if not, the sentence is silently dropped. 5. Running the Task -------------------- The ``main`` function creates the config and runs it, collecting a summary of how many structures were found and how many minimal pairs were successfully generated. .. code-block:: python def main(): if not os.path.isdir(OUTPUT_DIR): os.makedirs(OUTPUT_DIR, exist_ok=True) results = {"task_name": [], "structures_masked": [], "minimal_pairs_found": []} print("Parsing...") task_name, structures_masked, minimal_pairs_found = run_config(config_in_to_erg) results["task_name"].append(task_name) results["structures_masked"].append(structures_masked) results["minimal_pairs_found"].append(minimal_pairs_found) print(f"Completed parsing {task_name}.") results = pd.DataFrame(results) results.to_csv(f"{OUTPUT_DIR}/meta.csv") if __name__ == "__main__": main() The resulting ``meta.csv`` gives you a quick overview of the yield of the task — useful for checking whether the lexicon contains the Ergative surface forms needed to produce pairs.