Tutorial: Georgian Intransitive Nom→Erg Minimal-Pair Tests
This tutorial walks through generating minimal-pair syntactic tests for evaluating how well language models handle Georgian intransitive constructions. Specifically, it converts a Nominative subject to Ergative, producing grammatical/ungrammatical sentence pairs that differ only in that one case feature. The evaluation itself is covered in a follow-up tutorial.
1. Setup and Configuration
Before running any queries, you need to point Grew-TSE at your treebank files and define where outputs will be written. The Georgian Language Corpus (GLC) ships as three standard Universal Dependencies splits:
import os

import pandas as pd
from grewtse.pipeline import GrewTSEPipe

TREEBANKS_KARTULI = [
    "./treebanks/ka_glc-ud-train.conllu",
    "./treebanks/ka_glc-ud-dev.conllu",
    "./treebanks/ka_glc-ud-test.conllu",
]
OUTPUT_DIR = "./output"
LEXICON_FILE = "./output/georgian-lexicon.csv"
All three splits are passed to Grew-TSE together so that minimal pairs can be drawn from the largest possible pool of attested sentences.
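Because a missing split would otherwise only surface later as a parsing error, it can be worth checking the paths up front. The following is a minimal sketch using only the standard library; `find_missing_files` is a hypothetical helper introduced here for illustration and is not part of Grew-TSE.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of `paths` that does not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Same relative paths as in the tutorial; adjust to your own layout.
    treebanks = [
        "./treebanks/ka_glc-ud-train.conllu",
        "./treebanks/ka_glc-ud-dev.conllu",
        "./treebanks/ka_glc-ud-test.conllu",
    ]
    missing = find_missing_files(treebanks)
    if missing:
        print("Missing treebank files:", ", ".join(missing))
    else:
        print("All treebank files found.")
```

Returning the missing paths rather than raising immediately lets the caller decide whether to abort or to continue with a smaller pool of sentences.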
2. Creating a Task Configuration
The test is driven by a single configuration dictionary. The helper function
create_config assembles this dictionary from the four things that change between
tasks: the Grew query pattern, the target word to modify, the case, and a
human-readable prefix for the output files.
def create_config(
    query: str, target: str, convert_case_to: str, task_prefix: str
) -> dict:
    """Assemble the configuration dictionary for one task."""
    task_name = f"{task_prefix}-{target}-to-{convert_case_to}"
    results_dir = f"{OUTPUT_DIR}/{task_prefix}"
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(results_dir, exist_ok=True)
    return {
        "treebanks": TREEBANKS_KARTULI,
        "grew_query": query,
        "target": target,
        "apply_leading_space": False,  # typically False for MLM, True for NTP
        "output_dataset": f"{results_dir}/{task_name}.csv",
        "alternative_morph_features": {"case": convert_case_to},
        # LEXICON_FILE already includes the output directory, so use it as-is
        "save_lexicon_to": LEXICON_FILE,
        "task_name": task_name,
    }
Key fields in the returned dictionary:
grew_query: the Grew pattern (see Section 3) that identifies grammatical target sentences in the treebank.
target: the name of the node variable, defined inside the query pattern, whose case will be swapped.
alternative_morph_features: the new case value to substitute. Grew-TSE uses the lexicon to find a surface form of the same lemma that carries this case.
apply_leading_space: set to False when testing masked language models (MLM) and True when testing next-token prediction (NTP) models, because the latter expect a preceding whitespace token.
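To make the naming scheme concrete, the sketch below reproduces only the string formatting used by create_config, without touching the file system. `describe_task` is a hypothetical helper introduced for illustration; it is not part of the tutorial code or of Grew-TSE.

```python
def describe_task(task_prefix, target, convert_case_to, output_dir="./output"):
    """Mirror the name and path formatting of create_config (illustration only)."""
    task_name = f"{task_prefix}-{target}-to-{convert_case_to}"
    results_dir = f"{output_dir}/{task_prefix}"
    return {
        "task_name": task_name,
        "output_dataset": f"{results_dir}/{task_name}.csv",
        "alternative_morph_features": {"case": convert_case_to},
    }


info = describe_task("ka-intransitive", "SUBJ", "Erg")
print(info["task_name"])       # ka-intransitive-SUBJ-to-Erg
print(info["output_dataset"])  # ./output/ka-intransitive/ka-intransitive-SUBJ-to-Erg.csv
```

Because the prefix appears in both the directory and the file name, outputs from different tasks stay separated even when they share the same target and case.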
3. Writing the Grew Query Pattern
The core of the test is a Grew query that picks out sentences matching a particular
syntactic construction. Grew queries use a pattern { … } block to assert the
existence of nodes and arcs, and an optional without { … } block to exclude
structures you do not want. You can experiment with Grew queries interactively using the online Grew-match tool.
The following pattern matches any verb that has a Nominative subject and no direct object:
intransitive_query = """
pattern {
    V [upos="VERB"];
    SUBJ [Case="Nom"];
    V -[nsubj]-> SUBJ;
}
without {
    V [upos="VERB"];
    V -[nsubj]-> SUBJ;
    V -[obj]-> OBJ;
}
"""
The without block is essential here: without it, the pattern would also match
transitive sentences (which happen to have a Nominative subject), contaminating the
intransitive test set.
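The effect of the two blocks can be approximated in plain Python. The sketch below is not how Grew evaluates queries; it only mimics the logic of this one pattern over two hand-built toy sentences (not taken from the GLC treebank), and `match_intransitive_nom` is a hypothetical helper.

```python
# Toy, hand-built tokens (not taken from the GLC treebank).
# Each token: (id, form, upos, feats, head, deprel)
INTRANSITIVE = [
    (1, "კაცი", "NOUN", {"Case": "Nom"}, 2, "nsubj"),
    (2, "მოვიდა", "VERB", {}, 0, "root"),
]
TRANSITIVE = [
    (1, "კაცი", "NOUN", {"Case": "Nom"}, 3, "nsubj"),
    (2, "წერილს", "NOUN", {"Case": "Dat"}, 3, "obj"),
    (3, "წერს", "VERB", {}, 0, "root"),
]


def match_intransitive_nom(tokens):
    """Return (verb_id, subject_id) pairs that satisfy the pattern block
    and are not excluded by the without block."""
    by_id = {tok[0]: tok for tok in tokens}
    matches = []
    for tok_id, _form, _upos, feats, head, deprel in tokens:
        # pattern: SUBJ [Case="Nom"]; V -[nsubj]-> SUBJ;
        if deprel != "nsubj" or feats.get("Case") != "Nom":
            continue
        # pattern: V [upos="VERB"];
        verb = by_id.get(head)
        if verb is None or verb[2] != "VERB":
            continue
        # without: V -[obj]-> OBJ;
        has_obj = any(t[4] == head and t[5] == "obj" for t in tokens)
        if not has_obj:
            matches.append((head, tok_id))
    return matches


print(match_intransitive_nom(INTRANSITIVE))  # [(2, 1)]
print(match_intransitive_nom(TRANSITIVE))    # []
```

The second sentence has a Nominative subject too, but its verb also governs an obj dependent, so the exclusion step removes it.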
The single configuration for this tutorial is then:
config_in_to_erg = create_config(
    query=intransitive_query,
    target="SUBJ",
    convert_case_to="Erg",
    task_prefix="ka-intransitive",
)
4. Running the Pipeline
run_config takes a single configuration dictionary and executes the three main steps
of the Grew-TSE pipeline: lexicon construction (or loading), masked-sentence generation,
and minimal-pair generation.
def run_config(config: dict):
    grewtse = GrewTSEPipe()

    # Step 1: build or load the lexicon
    if not os.path.isfile(config["save_lexicon_to"]):
        lexicon = grewtse.parse_treebank(config["treebanks"])
        lexicon.to_csv(config["save_lexicon_to"])
    else:
        grewtse.load_lexicon(config["save_lexicon_to"], config["treebanks"])

    # Step 2: find and mask target nodes
    masked_df = grewtse.generate_masked_dataset(
        config["grew_query"], config["target"]
    )

    # Step 3: generate the minimal pairs
    mp_dataset = grewtse.generate_minimal_pair_dataset(
        config["alternative_morph_features"],
        has_leading_whitespace=config["apply_leading_space"],
    )
    mp_dataset.to_csv(config["output_dataset"])

    task = config["task_name"]
    structures_masked = masked_df.shape[0]
    mps_found = mp_dataset.shape[0]
    return task, structures_masked, mps_found
Step 1 — Lexicon. The lexicon maps every lemma in the treebank to its attested surface forms and their morphological features. It is written to disk after the first run; subsequent runs load it from the CSV to avoid redundant parsing.
Step 2 — Masking. Grew-TSE runs the query against the treebank, collects every
sentence that matches, and masks the target node (the one named by
target) so that its surface form can be replaced later.
Step 3 — Minimal pairs. For each masked sentence, Grew-TSE looks up the target
lemma in the lexicon and finds a surface form that carries the case specified in
alternative_morph_features. If one exists, a grammatical/ungrammatical pair is
emitted; if not, the sentence is silently dropped.
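The masking and lookup steps can be pictured with a toy example. This is a sketch of the idea only, under the assumption that the lexicon can be viewed as rows of lemma, form, and case; it is not Grew-TSE's implementation, and the helper names, the `[MASK]` placeholder, and the output keys are all hypothetical.

```python
# Toy lexicon rows (hand-written; not taken from the GLC treebank).
LEXICON = [
    {"lemma": "კაცი", "form": "კაცი", "case": "Nom"},
    {"lemma": "კაცი", "form": "კაცმა", "case": "Erg"},
    {"lemma": "ქალი", "form": "ქალი", "case": "Nom"},  # no Erg form in this toy lexicon
]

MASK = "[MASK]"  # hypothetical placeholder string


def mask_target(tokens, target_index):
    """Replace the token at target_index with a mask placeholder."""
    return [MASK if i == target_index else tok for i, tok in enumerate(tokens)]


def find_form(lexicon, lemma, case):
    """Return the first attested form of `lemma` carrying `case`, or None."""
    for row in lexicon:
        if row["lemma"] == lemma and row["case"] == case:
            return row["form"]
    return None


def make_pair(tokens, target_index, lemma, new_case, lexicon):
    """Build one minimal pair, or return None when no alternative form exists."""
    alternative = find_form(lexicon, lemma, new_case)
    if alternative is None:
        return None  # the sentence is dropped
    return {
        "masked_text": " ".join(mask_target(tokens, target_index)),
        "form_grammatical": tokens[target_index],
        "form_ungrammatical": alternative,
    }


print(make_pair(["კაცი", "მოვიდა"], 0, "კაცი", "Erg", LEXICON))
print(make_pair(["ქალი", "მოვიდა"], 0, "ქალი", "Erg", LEXICON))  # None
```

The second call returns None because the toy lexicon has no Ergative form for that lemma, which is the situation in which a sentence yields no pair.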
5. Running the Task
The main function creates the config and runs it, collecting a summary of how many
structures were found and how many minimal pairs were successfully generated.
def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    results = {"task_name": [], "structures_masked": [], "minimal_pairs_found": []}

    print("Parsing...")
    task_name, structures_masked, minimal_pairs_found = run_config(config_in_to_erg)
    results["task_name"].append(task_name)
    results["structures_masked"].append(structures_masked)
    results["minimal_pairs_found"].append(minimal_pairs_found)
    print(f"Completed parsing {task_name}.")

    results = pd.DataFrame(results)
    results.to_csv(f"{OUTPUT_DIR}/meta.csv")


if __name__ == "__main__":
    main()
The resulting meta.csv gives you a quick overview of the yield of the task —
useful for checking whether the lexicon contains the Ergative surface forms needed
to produce pairs.
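One way to read the summary is as a yield ratio: minimal pairs found divided by structures masked. The sketch below computes that ratio with the standard library from text in the layout that pandas writes by default (an unnamed index column first). The counts are invented placeholders, not results from the GLC, and `pair_yield` is a hypothetical helper.

```python
import csv
import io

# Placeholder contents in the layout pandas writes by default
# (unnamed index column first). The counts are invented for illustration.
META_CSV = """,task_name,structures_masked,minimal_pairs_found
0,ka-intransitive-SUBJ-to-Erg,200,50
"""


def pair_yield(csv_text):
    """Map each task name to minimal_pairs_found / structures_masked."""
    yields = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        masked = int(row["structures_masked"])
        found = int(row["minimal_pairs_found"])
        # Guard against division by zero when the query matched nothing.
        yields[row["task_name"]] = found / masked if masked else 0.0
    return yields


print(pair_yield(META_CSV))  # {'ka-intransitive-SUBJ-to-Erg': 0.25}
```

A low ratio suggests that many matched subjects have no Ergative form attested in the lexicon, while a ratio of zero with a non-zero masked count points at the lexicon rather than at the query.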