Preprocessing

class grewtse.preprocessing.ConlluParser

Bases: object

Parses .conllu files, the standard format for Universal Dependencies (UD) treebanks, for use in Grew-TSE.

build_lexicon(filepaths)

Create a DataFrame containing every word in a UD treebank together with its features. This lexicon is required for the subsequent generation of minimal pairs. The method was not designed to handle treebanks that assign differing names to features, so ensure that all treebank files come from the same treebank or follow the same treebank schema.

Parameters: filepaths (list[str] | str) – the path, or list of paths, to the UD treebank files, e.g. ["german_treebank_part_A.conllu", "german_treebank_part_B.conllu"].
Return type: DataFrame
Returns: a DataFrame with all words and their features.
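
As a rough illustration of what such a lexicon contains, the sketch below reads CoNLL-U text using only the standard library and collects each distinct word form with its lemma, UPOS tag, and morphological features. It is not the ConlluParser implementation: build_lexicon_sketch, parse_feats, and the sample sentences are hypothetical names and data chosen for this example, and plain dicts stand in for DataFrame rows.

```python
# Illustrative sketch only; this is NOT how grewtse builds its lexicon.
# All names below (SAMPLE_CONLLU, parse_feats, build_lexicon_sketch) are hypothetical.

SAMPLE_CONLLU = """\
# sent_id = s1
# text = Der Hund schläft
1\tDer\tder\tDET\t_\tCase=Nom|Gender=Masc|Number=Sing\t2\tdet\t_\t_
2\tHund\tHund\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t3\tnsubj\t_\t_
3\tschläft\tschlafen\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_

# sent_id = s2
# text = Die Hunde schlafen
1\tDie\tder\tDET\t_\tCase=Nom|Number=Plur\t2\tdet\t_\t_
2\tHunde\tHund\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Plur\t3\tnsubj\t_\t_
3\tschlafen\tschlafen\tVERB\t_\tNumber=Plur|Person=3\t0\troot\t_\t_
"""


def parse_feats(feats):
    """Turn a FEATS column such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":  # '_' marks an empty column in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def build_lexicon_sketch(conllu_text):
    """Collect one entry per distinct (form, lemma, UPOS, FEATS) combination."""
    lexicon = {}
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        # Keep only ordinary word lines: 10 columns and a plain integer ID
        # (this skips multiword-token ranges like '1-2' and empty nodes like '1.1').
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        key = (form, lemma, upos, feats)
        lexicon.setdefault(
            key,
            {"form": form, "lemma": lemma, "upos": upos, "feats": parse_feats(feats)},
        )
    return list(lexicon.values())


lexicon = build_lexicon_sketch(SAMPLE_CONLLU)
for entry in lexicon:
    print(entry["form"], entry["lemma"], entry["upos"], entry["feats"])
```

Because feature names are compared literally, two treebanks that label the same category differently would produce entries that cannot be matched against each other, which is why the files should share one schema.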

to_syntactic_feature(sentence_id, token_id, token, alt_morph_constraints, alt_universal_constraints)

The core function for finding minimal pairs. Converts a lexical item taken from a UD treebank sentence into another lexical item of the same lemma that differs in the specified feature(s).

Parameters:

sentence_id (str) – the ID in the treebank of the sentence.
token_id (str) – the token index in the list of tokens corresponding to the isolated target word.
token (str) – the token string itself that is the isolated target word.
alt_morph_constraints (dict) – the alternative morphological feature(s) for the target word.
alt_universal_constraints (dict) – the alternative UPOS feature(s) for the target word.

Return type:

str | None

Returns:

a string representing the converted target word.
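
The idea behind this conversion can be sketched as a lookup over a small lexicon: keep the lemma fixed, overwrite the requested features, and search for a form that matches the result. The code below is a minimal sketch under stated assumptions and not the library's implementation. LEXICON, find_alternative_form, and the "upos" key used in alt_universal_constraints are hypothetical, and the sketch takes the lemma and features directly rather than resolving them from sentence_id and token_id.

```python
# Illustrative sketch only; NOT the grewtse implementation of to_syntactic_feature.
# LEXICON, find_alternative_form, and the "upos" constraint key are hypothetical.

LEXICON = [
    {"form": "Hund", "lemma": "Hund", "upos": "NOUN",
     "feats": {"Case": "Nom", "Gender": "Masc", "Number": "Sing"}},
    {"form": "Hunde", "lemma": "Hund", "upos": "NOUN",
     "feats": {"Case": "Nom", "Gender": "Masc", "Number": "Plur"}},
    {"form": "schläft", "lemma": "schlafen", "upos": "VERB",
     "feats": {"Number": "Sing", "Person": "3"}},
    {"form": "schlafen", "lemma": "schlafen", "upos": "VERB",
     "feats": {"Number": "Plur", "Person": "3"}},
]


def find_alternative_form(lexicon, lemma, upos, feats,
                          alt_morph_constraints, alt_universal_constraints=None):
    """Return a form of `lemma` whose features equal `feats` with the
    alternative constraints applied, or None if the lexicon has no such form."""
    # Overwrite only the features named in the constraints; keep the rest.
    target_feats = {**feats, **alt_morph_constraints}
    # Assumed convention for this sketch: a UPOS change is given under the key "upos".
    target_upos = (alt_universal_constraints or {}).get("upos", upos)
    for entry in lexicon:
        if (entry["lemma"] == lemma
                and entry["upos"] == target_upos
                and entry["feats"] == target_feats):
            return entry["form"]
    return None


singular_feats = {"Case": "Nom", "Gender": "Masc", "Number": "Sing"}
plural_form = find_alternative_form(
    LEXICON, "Hund", "NOUN", singular_feats, {"Number": "Plur"}
)
missing_form = find_alternative_form(
    LEXICON, "Hund", "NOUN", singular_feats, {"Number": "Dual"}
)
print(plural_form, missing_form)
```

In this sketch a missing form yields None, which mirrors the str | None return type documented above; whether the library returns None under exactly the same conditions is not stated here.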