ybracke / transnormer-data

Data preparation for the transnormer project (https://github.com/ybracke/transnormer)

Implement `LanguageToolModifier` #18

Closed ybracke closed 7 months ago

ybracke commented 8 months ago

Implement a Modifier class that applies LanguageTool (via the language_tool_python library) to the raw version of a sample.

In essence, LanguageTool is called with a fixed set of rules:

>>> import language_tool_python
>>> with language_tool_python.LanguageTool(language='de-DE') as tool:
...     # Use only certain rules
...     tool.enabled_rules = {'OLD_SPELLING', 'ZUVIEL'}
...     tool.enabled_rules_only = True
...     text = 'Zuviel Abfluß.'
...     text_corr = tool.correct(text)
...     print(text_corr)
...
Zu viel Abfluss.

The set of enabled rules (the value assigned to tool.enabled_rules) should be passed to the constructor of LanguageToolModifier.
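
A minimal sketch of what such a constructor could look like. The parameter names `rules`, `language` and `tool` are assumptions made for illustration, and the project's actual base class is left out. The optional `tool` argument is only there so that the class can be constructed without starting a LanguageTool server, e.g. in tests.

```python
from typing import Any, Iterable, Optional, Set


class LanguageToolModifier:
    """Hypothetical sketch: holds a LanguageTool instance restricted to a fixed rule set."""

    def __init__(
        self,
        rules: Iterable[str],
        language: str = "de-DE",
        tool: Optional[Any] = None,
    ) -> None:
        self.rules: Set[str] = set(rules)
        self.language = language
        if tool is None:
            # Imported lazily so a stand-in tool can be injected without
            # language_tool_python (and Java) being installed.
            import language_tool_python

            tool = language_tool_python.LanguageTool(language)
        # Restrict the tool to the curated rules, as in the snippet above
        tool.enabled_rules = self.rules
        tool.enabled_rules_only = True
        self.tool = tool
```

With this shape, `LanguageToolModifier(rules={'OLD_SPELLING', 'ZUVIEL'})` would configure the tool once, and `modify_sample` could then call `self.tool.correct(...)` without touching the rule set again.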

(The list of rules must be curated manually and independently of this implementation. I am working on my own list locally, see ~/code/langtool-correct/README.md and ~/data/1_Gold/custom_LT).
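
Since the rule list is maintained outside of this implementation, one option is to read it from a plain text file and hand the result to the constructor. The sketch below assumes a hypothetical file format (one rule ID per line, `#` starting a comment); the function name `load_rule_ids` is likewise made up for illustration and says nothing about how the local list is actually stored.

```python
from pathlib import Path
from typing import Set


def load_rule_ids(path: str) -> Set[str]:
    """Read rule IDs from a text file: one ID per line, '#' starts a comment."""
    rule_ids: Set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Drop trailing comments and surrounding whitespace
        entry = line.split("#", 1)[0].strip()
        if entry:
            rule_ids.add(entry)
    return rule_ids
```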

Updating the raw version of the sample must be followed by updating the tok version and consequently the alignment. Sketch for the modify_sample function:

    def modify_sample(self, sample: Dict) -> Dict:
        # Update raw via LanguageTool
        raw_old = sample[self.raw]
        raw_new = self.tool.correct(raw_old)
        sample[self.raw] = raw_new
        any_changes = (raw_new != raw_old)
        if any_changes:
            self.update_tok_from_raw(
                sample, key_raw=self.raw, key_tok=self.tok, key_ws=self.ws
            )
            # Update spans
            # TODO: Which one of the two? + add arguments
            self.update_spans_and_ws_from_tok_and_raw(
                sample,
                key_tokens=self.tok,
                key_raw=self.raw,
                key_ws=self.ws,
                key_spans=self.spans,
            )
            # OR?? self.update_token_spans()
            # Update alignment
            self.update_alignment(
                sample,
                key_tokens_src=self.tok_src,
                key_tokens_trg=self.tok,
                key_alignment=self.alignment,
            )
        return sample
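
To illustrate why the spans have to be recomputed after the raw text changes, here is a small standalone sketch. `compute_token_spans` is a hypothetical helper, not one of the project's update methods; it assumes every token occurs verbatim in the raw string, in order.

```python
from typing import List, Tuple


def compute_token_spans(tokens: List[str], raw: str) -> List[Tuple[int, int]]:
    """Locate each token in raw, left to right, and return (start, end) offsets."""
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for token in tokens:
        start = raw.find(token, cursor)
        if start == -1:
            raise ValueError(f"Token {token!r} not found in raw after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end  # continue searching after the current token
    return spans


# Before correction: three tokens
print(compute_token_spans(["Zuviel", "Abfluß", "."], "Zuviel Abfluß."))
# [(0, 6), (7, 13), (13, 14)]

# After correction: four tokens, all offsets shifted
print(compute_token_spans(["Zu", "viel", "Abfluss", "."], "Zu viel Abfluss."))
# [(0, 2), (3, 7), (8, 15), (15, 16)]
```

The correction changes both the number of tokens and every offset after the first edit, which is why the tok version, the spans and the alignment all need to be refreshed once raw has changed.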

Also, see the notes on this pad.

ybracke commented 8 months ago