Tutorial: Georgian Intransitive Nom→Erg Minimal-Pair Tests

This tutorial walks through generating minimal-pair syntactic tests for evaluating how well language models handle Georgian intransitive constructions. Specifically, it covers converting a Nominative subject to Ergative, which produces grammatical/ungrammatical sentence pairs that differ only in that single case feature. Evaluation itself is covered in a follow-up tutorial.

1. Setup and Configuration

Before running any queries, you need to point Grew-TSE at your treebank files and define where outputs will be written. The Georgian Language Corpus (GLC) ships as three standard Universal Dependencies splits:

from grewtse.pipeline import GrewTSEPipe
import pandas as pd
import os

TREEBANKS_KARTULI = [
    "./treebanks/ka_glc-ud-train.conllu",
    "./treebanks/ka_glc-ud-dev.conllu",
    "./treebanks/ka_glc-ud-test.conllu",
]
OUTPUT_DIR = "./output"
LEXICON_FILE = "./output/georgian-lexicon.csv"

All three splits are passed to Grew-TSE together so that minimal pairs can be drawn from the largest possible pool of attested sentences.
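A mistyped path is easier to diagnose before the pipeline starts than in the middle of a run. The sketch below is a small pre-flight check and is not part of Grew-TSE; find_missing is a hypothetical helper name, and the path list simply repeats TREEBANKS_KARTULI so the snippet runs on its own.

```python
import os


def find_missing(paths: list[str]) -> list[str]:
    """Return the subset of paths that do not point at an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Same paths as TREEBANKS_KARTULI above, repeated to keep the snippet standalone.
treebanks = [
    "./treebanks/ka_glc-ud-train.conllu",
    "./treebanks/ka_glc-ud-dev.conllu",
    "./treebanks/ka_glc-ud-test.conllu",
]

missing = find_missing(treebanks)
if missing:
    print("Missing treebank files:", missing)
else:
    print("All treebank files found.")
```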

2. Creating a Task Configuration

The test is driven by a single configuration dictionary. The helper function create_config assembles this dictionary from the four things that change between tasks: the Grew query pattern, the name of the target node to modify, the case to convert that target to, and a human-readable prefix for the output files.

def create_config(
    query: str, target: str, convert_case_to: str, task_prefix: str
) -> dict:
    task_name = f"{task_prefix}-{target}-to-{convert_case_to}"
    results_dir = f"{OUTPUT_DIR}/{task_prefix}"

    # exist_ok=True already tolerates an existing directory, so no isdir check is needed
    os.makedirs(results_dir, exist_ok=True)

    return {
        "treebanks": TREEBANKS_KARTULI,
        "grew_query": query,
        "target": target,
        "apply_leading_space": False,  # typically False for MLM, True for NTP
        "output_dataset": f"{results_dir}/{task_name}.csv",
        "alternative_morph_features": {"case": convert_case_to},
        # LEXICON_FILE already includes the output directory, so use it as-is
        "save_lexicon_to": LEXICON_FILE,
        "task_name": task_name,
    }

Key fields in the returned dictionary:

  • ``grew_query``: the Grew pattern (see Section 3) that identifies grammatical target sentences in the treebank.

  • ``target``: the name of the node variable in the query pattern (SUBJ in this tutorial) whose case will be swapped.

  • ``alternative_morph_features``: specifies the new case value to substitute in. Grew-TSE will use the lexicon to find a surface form that carries this case for the same lemma.

  • ``apply_leading_space``: set to False when testing masked language models (MLM) and typically True when testing next-token prediction (NTP) models, whose tokenizers usually treat the space before a word as part of the token.
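The effect of the leading-space flag can be pictured with plain strings. The sketch below only illustrates the idea and is not Grew-TSE's implementation: with_leading_space is a hypothetical helper, and the assumption that the flag amounts to prepending a single space to each candidate form is made for this example only.

```python
def with_leading_space(form: str, apply_leading_space: bool) -> str:
    """Prepend one space to the candidate form when the flag is set (assumed behaviour)."""
    return f" {form}" if apply_leading_space else form


# MLM-style: the candidate fills a mask slot, so no space is added.
mlm_candidate = with_leading_space("კაცმა", apply_leading_space=False)

# NTP-style: the candidate continues a left context, so it starts with a space.
ntp_candidate = with_leading_space("კაცმა", apply_leading_space=True)

print(repr(mlm_candidate))
print(repr(ntp_candidate))
```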

3. Writing the Grew Query Pattern

The core of the test is a Grew query that picks out sentences matching a particular syntactic construction. Grew queries use a pattern { } block to assert the existence of nodes and arcs, and an optional without { } block to exclude structures you do not want. It is worth experimenting with a query interactively before using it in a task.

The following pattern matches any verb that has a Nominative subject and no direct object:

intransitive_query = """
    pattern {
      V [upos="VERB"];
      SUBJ [Case="Nom"];
      V -[nsubj]-> SUBJ;
    }

    without {
      V [upos="VERB"];
      V -[nsubj]-> SUBJ;
      V -[obj]-> OBJ;
    }
"""

The without block is essential here: without it, the pattern would also match transitive sentences whose subject is Nominative, contaminating the intransitive test set.
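To make the pattern/without logic concrete, here is a plain-Python sketch of the same filter applied to two toy dependency structures. It does not use Grew or Grew-TSE; the token dictionaries and the function name intransitive_nom_subjects are hypothetical, and the toy sentences are hand-built rather than taken from the treebank.

```python
def intransitive_nom_subjects(sentence):
    """Return (verb_id, subject_id) pairs mirroring the pattern/without logic."""
    matches = []
    for verb in sentence:
        if verb["upos"] != "VERB":
            continue
        dependents = [t for t in sentence if t["head"] == verb["id"]]
        # without { V -[obj]-> OBJ }: skip verbs that govern a direct object
        if any(t["deprel"] == "obj" for t in dependents):
            continue
        # pattern { SUBJ [Case="Nom"]; V -[nsubj]-> SUBJ }
        for dep in dependents:
            if dep["deprel"] == "nsubj" and dep["feats"].get("Case") == "Nom":
                matches.append((verb["id"], dep["id"]))
    return matches


# Toy structures: token ids, heads and labels only, no real word forms.
intransitive = [
    {"id": 1, "upos": "NOUN", "head": 2, "deprel": "nsubj", "feats": {"Case": "Nom"}},
    {"id": 2, "upos": "VERB", "head": 0, "deprel": "root", "feats": {}},
]
transitive = [
    {"id": 1, "upos": "NOUN", "head": 3, "deprel": "nsubj", "feats": {"Case": "Nom"}},
    {"id": 2, "upos": "NOUN", "head": 3, "deprel": "obj", "feats": {"Case": "Dat"}},
    {"id": 3, "upos": "VERB", "head": 0, "deprel": "root", "feats": {}},
]

print(intransitive_nom_subjects(intransitive))  # [(2, 1)]
print(intransitive_nom_subjects(transitive))    # []
```

The transitive structure is rejected only because of the obj check, which is the role the without block plays in the Grew query.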

The single configuration for this tutorial is then:

config_in_to_erg = create_config(
    query=intransitive_query,
    target="SUBJ",
    convert_case_to="Erg",
    task_prefix="ka-intransitive",
)
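For these arguments, the names that create_config derives can be worked out by hand. The snippet below repeats only the string formatting from create_config so that it runs on its own; it does not call Grew-TSE.

```python
OUTPUT_DIR = "./output"
task_prefix, target, convert_case_to = "ka-intransitive", "SUBJ", "Erg"

# Same f-strings as in create_config
task_name = f"{task_prefix}-{target}-to-{convert_case_to}"
results_dir = f"{OUTPUT_DIR}/{task_prefix}"
output_dataset = f"{results_dir}/{task_name}.csv"

print(task_name)       # ka-intransitive-SUBJ-to-Erg
print(output_dataset)  # ./output/ka-intransitive/ka-intransitive-SUBJ-to-Erg.csv
```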

4. Running the Pipeline

run_config takes a single configuration dictionary and executes the three main steps of the Grew-TSE pipeline: lexicon construction (or loading), masked-sentence generation, and minimal-pair generation.

def run_config(config: dict):
    grewtse = GrewTSEPipe()

    # Step 1: build or load the lexicon
    if not os.path.isfile(config["save_lexicon_to"]):
        lexicon = grewtse.parse_treebank(config["treebanks"])
        lexicon.to_csv(config["save_lexicon_to"])
    else:
        grewtse.load_lexicon(config["save_lexicon_to"], config["treebanks"])

    # Step 2: find and mask target nodes
    masked_df = grewtse.generate_masked_dataset(
        config["grew_query"], config["target"]
    )

    # Step 3: generate the minimal pairs
    mp_dataset = grewtse.generate_minimal_pair_dataset(
        config["alternative_morph_features"],
        has_leading_whitespace=config["apply_leading_space"],
    )
    mp_dataset.to_csv(config["output_dataset"])

    task = config["task_name"]
    structures_masked = masked_df.shape[0]
    mps_found = mp_dataset.shape[0]
    return task, structures_masked, mps_found

Step 1 — Lexicon. The lexicon maps every lemma in the treebank to its attested surface forms and their morphological features. It is written to disk after the first run; subsequent runs load it from the CSV to avoid redundant parsing.

Step 2 — Masking. Grew-TSE runs the query against the treebank, collects every sentence that matches, and masks the target node (the one named by target) so that its surface form can be replaced later.

Step 3 — Minimal pairs. For each masked sentence, Grew-TSE looks up the target lemma in the lexicon and finds a surface form that carries the case specified in alternative_morph_features. If one exists, a grammatical/ungrammatical pair is emitted; if not, the sentence is silently dropped.
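The lookup-or-drop behaviour of Step 3 can be sketched with a toy lexicon. This is an illustration of the idea only, not Grew-TSE's code: build_pairs, the dictionary layout, and the keys form_grammatical and form_ungrammatical are hypothetical, and the lexicon entries are hand-picked examples rather than treebank data. A real lexicon may also list several forms per lemma and feature set, which this sketch ignores.

```python
# Toy lexicon: (lemma, case) -> surface form. Illustrative entries only.
toy_lexicon = {
    ("კაცი", "Nom"): "კაცი",
    ("კაცი", "Erg"): "კაცმა",
    ("ძაღლი", "Nom"): "ძაღლი",  # no Ergative form listed in this toy lexicon
}

# Targets as they might look after masking: lemma plus the original Nominative form.
masked_targets = [
    {"lemma": "კაცი", "form": "კაცი"},
    {"lemma": "ძაღლი", "form": "ძაღლი"},
]


def build_pairs(targets, lexicon, new_case):
    """Pair each original form with its new_case form; drop targets without one."""
    pairs = []
    for target in targets:
        alternative = lexicon.get((target["lemma"], new_case))
        if alternative is None:
            continue  # no attested form with the requested case: sentence is dropped
        pairs.append(
            {"form_grammatical": target["form"], "form_ungrammatical": alternative}
        )
    return pairs


pairs = build_pairs(masked_targets, toy_lexicon, "Erg")
print(len(masked_targets), "masked,", len(pairs), "pairs")  # 2 masked, 1 pairs
```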

5. Running the Task

The main function creates the config and runs it, collecting a summary of how many structures were found and how many minimal pairs were successfully generated.

def main():
    # exist_ok=True already tolerates an existing directory, so no isdir check is needed
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    results = {"task_name": [], "structures_masked": [], "minimal_pairs_found": []}

    print("Parsing...")
    task_name, structures_masked, minimal_pairs_found = run_config(config_in_to_erg)

    results["task_name"].append(task_name)
    results["structures_masked"].append(structures_masked)
    results["minimal_pairs_found"].append(minimal_pairs_found)
    print(f"Completed parsing {task_name}.")

    results = pd.DataFrame(results)
    results.to_csv(f"{OUTPUT_DIR}/meta.csv")


if __name__ == "__main__":
    main()

The resulting meta.csv gives you a quick overview of the task's yield. If minimal_pairs_found is much lower than structures_masked, the lexicon probably lacks the Ergative surface forms needed for many of the matched subjects.
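To read the yield off meta.csv, it helps to turn the two counts into a ratio. The sketch below parses an in-memory CSV with the same columns that main() writes (including the unnamed index column pandas adds by default); the counts 200 and 150 are made-up placeholders, not results from the GLC treebank.

```python
import csv
import io

# Placeholder contents in the shape of meta.csv; the numbers are invented.
meta_csv = io.StringIO(
    ",task_name,structures_masked,minimal_pairs_found\n"
    "0,ka-intransitive-SUBJ-to-Erg,200,150\n"
)

ratios = {}
for row in csv.DictReader(meta_csv):
    masked = int(row["structures_masked"])
    found = int(row["minimal_pairs_found"])
    # Guard against division by zero when the query matched nothing
    ratios[row["task_name"]] = found / masked if masked else 0.0

for task, ratio in ratios.items():
    print(f"{task}: {ratio:.0%} of masked structures produced a minimal pair")
```

With a real run, replace the StringIO object with an open handle on the meta.csv file in the output directory.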