wysiib / linter-languagetool

Integration of LanguageTool into the Atom text editor.
MIT License

[WIP] Linter spell api #23

Open hesstobi opened 6 years ago

hesstobi commented 6 years ago

The linter does not work well with markup languages like LaTeX. The Linter-Spell package therefore defines a grammar API which sets a Boolean value for each grammar scope, indicating whether it should be checked by the linter. This API can easily be added to this package. With it, the linter knows, for each grammar that provides the grammar API, which scopes should be checked.

There are two different ways to handle this information:

  1. The document could be scanned for all scopes which should not be checked. These ranges can then be removed from the document, the cleaned document is checked by the LanguageTool server, and afterwards the ranges of the issues have to be translated back to the original ranges in the document.
  2. The document could be checked by the LanguageTool server as it is. Afterwards, each issue is discarded if its range has a scope which should be ignored.

The first way is quite hard to implement, because there is no API to get all the scopes of a document in Atom; you have to go through the whole document. Nevertheless, it is the cleaner way, because the LanguageTool server gets a version of the text close to how it is displayed in the final document. The second way is easy to implement and is done in this PR, but there are some problems for certain patterns (see the test files), and I guess there are even more. That's why I marked this as WIP and ask you to also test this on different documents. In general, this PR will fix #7.
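The second way can be sketched as a post-filter over LanguageTool's matches: drop every match whose range overlaps a range covered by an ignored scope. The helper names and data shapes below are illustrative, not the PR's actual code:

```javascript
// A match carries a character range [start, end) in the checked text;
// ignoredRanges lists the ranges covered by scopes that should not be
// checked (e.g. LaTeX commands).
function overlaps(a, b) {
  return a.start < b.end && b.start < a.end;
}

// Keep only matches that do not touch any ignored range.
function filterMatches(matches, ignoredRanges) {
  return matches.filter(
    (m) => !ignoredRanges.some((r) => overlaps(m, r))
  );
}

// Example: a match inside a LaTeX command is dropped, one in plain text kept.
const matches = [
  { start: 3, end: 8, message: 'in plain text' },
  { start: 20, end: 25, message: 'inside a LaTeX command' },
];
const ignored = [{ start: 18, end: 30 }];
console.log(filterMatches(matches, ignored).length); // 1
```

This filter is simple, but as noted above it misfires for certain patterns, because LanguageTool still sees the markup while checking.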

The LT API accepts JSON as the data parameter to describe markup. For example:

```json
{"annotation": [
  {"text": "A "},
  {"markup": "<b>"},
  {"text": "test"},
  {"markup": "</b>"}
]}
```

This array can be built by utilising Atom's public API:

```js
const editorGrammar = editor.getGrammar();
const tokenizedLines = editorGrammar.tokenizeLines(editor.getText());
```

Unfortunately, this does not work for Tree-sitter grammars.
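For classic TextMate grammars, the annotation array could be built from the token stream roughly like this. This is a sketch: `shouldCheck` stands in for the decision the Linter-Spell grammar API provides, and the token shape follows what `tokenizeLines` returns:

```javascript
// Convert tokenized lines into the LanguageTool "annotation" array.
// Each token is assumed to look like { value, scopes }; shouldCheck
// decides (in the real package, via the Linter-Spell grammar API)
// whether a token's scopes are prose or markup.
function buildAnnotation(tokenizedLines, shouldCheck) {
  const annotation = [];
  for (const line of tokenizedLines) {
    for (const token of line) {
      if (shouldCheck(token.scopes)) {
        annotation.push({ text: token.value });
      } else {
        annotation.push({ markup: token.value });
      }
    }
    annotation.push({ text: '\n' }); // restore the line break
  }
  return annotation;
}

// Example: treat everything scoped as a LaTeX command as markup.
const lines = [[
  { value: 'Siehe ', scopes: ['text.tex.latex'] },
  { value: '\\ref{label}', scopes: ['text.tex.latex', 'support.function'] },
  { value: '.', scopes: ['text.tex.latex'] },
]];
const isProse = (scopes) => !scopes.includes('support.function');
console.log(JSON.stringify(buildAnnotation(lines, isProse)));
```

Adjacent entries of the same kind could additionally be merged to keep the request payload small; LanguageTool accepts either form.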

There is also a breaking change for the package: the grammarScopes option is no longer used, since the checked scopes are now defined by the API. The default scopes integrated into the package are 'source.asciidoc, source.gfm, text.git-commit, text.plain, text.plain.null-grammar'; for other scopes, a linter-spell package has to be installed and activated, e.g. Linter-Spell-Latex.

Additionally, the Linter-Spell API provides a method to get the language which is set in the file by a language-specific pattern. I added an option controlling whether the linter should obey this information. This PR therefore replaces #12.

The Linter-Spell API also provides user-defined dictionaries, but I am not sure whether this should be implemented within this linter. Hopefully there will be an LT API for that in the future.

TODO:

wysiib commented 6 years ago

Very nice WIP. Simply attaching ourselves to the Linter-Spell API is probably the best way to proceed. I wonder whether Linter-Spell implements the first or the second way of checking documents? I assume they should have run into the same issues.

wysiib commented 6 years ago

Is the first way really that much harder to implement? Of course, we would need to traverse the whole document once, replacing text and collecting individual offsets (i.e., a map stating that from position x onward you need to add y to the position; basically a list of ranges mapped to offsets).
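That offset bookkeeping could look roughly like this (hypothetical helper names; the sketch assumes the markup ranges in the flat text are already known and sorted):

```javascript
// Remove the given markup ranges [start, end) from `text` and record,
// for each removal, the offset to add to a position in the cleaned
// text to recover the position in the original text.
function stripMarkup(text, markupRanges) {
  let cleaned = '';
  const offsets = []; // { from: position in cleaned text, add: offset }
  let removed = 0;
  let cursor = 0;
  for (const { start, end } of markupRanges) {
    cleaned += text.slice(cursor, start);
    removed += end - start;
    offsets.push({ from: cleaned.length, add: removed });
    cursor = end;
  }
  cleaned += text.slice(cursor);
  return { cleaned, offsets };
}

// Translate a position in the cleaned text back to the original text.
function toOriginal(pos, offsets) {
  let add = 0;
  for (const o of offsets) {
    if (o.from <= pos) add = o.add;
  }
  return pos + add;
}

const text = 'In Abbildung \\ref{label} ist ein Ergebnis.';
const { cleaned, offsets } = stripMarkup(text, [{ start: 13, end: 25 }]);
// cleaned === 'In Abbildung ist ein Ergebnis.'
```

A match that LanguageTool reports at position 13 of the cleaned text ("ist") would then be mapped back to position 25 of the original text via `toOriginal`.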

With the second approach, n-grams in LanguageTool cannot be used properly anymore.

hesstobi commented 6 years ago

Yes, you are right, the first way is not that complicated. linter-spell implements roughly the first way: it divides the document into a set of ranges which should be checked and checks them individually.

Considering the following LaTeX examples, I would suggest a combined approach of removing unchecked scopes and dividing the document into individual ranges.

  1. Simple LaTeX commands within text

    Das ist besser als \zB vor drei Jahren.
    In Abbildung \ref{label} ist ein interessantes Ergebnis dargestellt.
  2. Lists

    In den Fällen
    \begin{itemize}
    \item Fall 1 und
    \item Fall 2
    \end{itemize}
    gibt es eine Lösung.
  3. LaTeX environments with included text to check

    Ein Satz vor der Umgebung.
    \begin{figure}
    \centering
    \input{bild}
    \caption{Eine zu überprüfende Bildunterschrift}\label{label}
    \end{figure}
    Ein Satz nach der Umgebung.

Removing the unchecked scopes in examples 1 and 2 would result in checks with no errors, whereas in example 3 it is necessary to divide the ranges.

The check could therefore be carried out as follows:

  1. Divide the text into a set of checked ranges, including the level of the embedded grammar.
  2. Combine all ranges with an equal embedded-grammar level within ranges of the previous level.
  3. Check each combined range.
  4. Correct the error ranges.
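The combining in step 2 could be sketched as follows. This is a simplification with illustrative names: fragments of equal level are merged when only deeper-nested fragments lie between them, which reproduces the grouping in the example below:

```javascript
// Combine fragments of equal level that are separated only by fragments
// of a deeper (more nested) level. A shallower fragment closes all
// deeper open groups, so the two level-2 captions stay separate.
function combineRanges(fragments) {
  const groups = []; // one open group per nesting level
  const result = [];
  for (const frag of fragments) {
    groups.length = frag.level + 1; // close deeper groups
    if (groups[frag.level]) {
      groups[frag.level].text += frag.text; // merge into the open group
    } else {
      const group = { text: frag.text };
      groups[frag.level] = group;
      result.push(group);
    }
  }
  return result;
}

// Same level structure as the markdown/LaTeX example below:
const frags = [
  { text: 'A', level: 0 }, // markdown prose
  { text: 'B', level: 1 }, // LaTeX prose before the environment
  { text: 'C', level: 2 }, // first caption
  { text: 'D', level: 1 },
  { text: 'E', level: 1 },
  { text: 'F', level: 2 }, // second caption
  { text: 'G', level: 0 }, // markdown prose again
];
console.log(combineRanges(frags).map((g) => g.text)); // [ 'AG', 'BDE', 'C', 'F' ]
```

The real implementation would also have to respect the enclosing ranges of the previous level and carry the per-fragment offsets along, so that step 4 can correct the error ranges.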

Do you agree with the approach? Do you have comments? The following example is still WIP but should be sufficient to illustrate the main concept of the approach.

Example (WIP)

Consider this markdown example, in which the linter has to handle a LaTeX block embedded in markdown:

````markdown
The linter should handle this markdown example correct.
``` latex
Ein Satz vor der Umgebung.
 \begin{figure}
   \centering
   \input{bild}
   \caption{Eine zu überprüfende Bildunterschrift \cite{cite}}\label{label}
 \end{figure}
Ein Satz nach der Umgebung (siehe \ref{label}).
\begin{figure}
    \caption{Noch eine Bildunterschirft}
\end{figure}
```
Which is not an easy task.
````

### Result of step 1
``` json
[
    {
        "text": "The linter should handle this markdown example correct.\n",
        "level": 0,
        "range": [[0, 0], [0, 56]]
    },
    {
        "text": "Ein Satz vor der Umgebung.\n ",
        "level": 1,
        "range": [[2, 0], [3, 0]]
    },
    {
        "text": "Eine zu überprüfende Bildunterschrift ",
        "level": 2,
        "range": [[6, 12], [6, 49]]
    },
    {
        "text": "\nEin Satz nach der Umgebung (siehe ",
        "level": 1,
        "range": [[7, 13], [8, 33]]
    },
    {
        "text": ").\n",
        "level": 1,
        "range": [[8, 45], [8, 48]]
    },
    {
        "text": "Noch eine Bildunterschirft",
        "level": 2,
        "range": [[10, 14], [10, 38]]
    },
    {
        "text": "Which is not an easy task.",
        "level": 0,
        "range": [[13, 0], [13, 26]]
    }
]
```

### Result of step 2

```json
[
    {
        "text": "The linter should handle this markdown example correct.\nWhich is not an easy task."
    },
    {
        "text": "Ein Satz vor der Umgebung.\n \nEin Satz nach der Umgebung (siehe ).\n"
    },
    {
        "text": "Eine zu überprüfende Bildunterschrift"
    },
    {
        "text": "Noch eine Bildunterschirft"
    }
]
```

Offsets needed to map the linter results back to the correct ranges should be included within this array.

hesstobi commented 5 years ago

With the help of the hint in this comment, I changed the implementation to use the data parameter of the LT API. This has the advantage that the resulting matches have the right offsets. There are still things to do (see the description).