wikimedia / sentencex

A sentence segmentation library with wide language support optimized for speed and utility.
https://wikimedia.github.io/sentencex/
MIT License

Don't split on ellipses #11

Open waldyrious opened 11 months ago

waldyrious commented 11 months ago

I believe it's not safe to always split sentences on ellipses. For example, the following sentence (initially mentioned at https://github.com/DavidAnson/markdownlint/pull/719#issuecomment-1447501641):

Pausing... for... thought... should not [trigger splitting].

...currently splits as

'Pausing...', 'for...', 'thought...', 'should not [trigger splitting]'

but should remain as a single sentence.

/cc @DavidAnson @aepfli

aepfli commented 11 months ago

Hey, I extracted this whole rule into a different repo - https://github.com/aepfli/markdownlint-rule-max-one-sentence-per-line

Within the test, it clearly states that, in theory, you could add ... to the ignore_words. We do not have any fancy logic in there to detect sentences; we only detect certain characters (or combinations of characters) and split on those. That is also the reason why I renamed my linting rule to max-one-sentence-per-line. This way, I removed the need to combine split sentences, etc. My updated version of the test is at https://github.com/aepfli/markdownlint-rule-max-one-sentence-per-line/blob/main/test/sentences-per-line.md
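
For illustration, here is a minimal Python sketch of the character-based splitting with an ignore list described above. It is neither the markdownlint rule's code nor sentencex's implementation: the function name split_sentences and the way ignore_words is matched are assumptions made for this example only.

```python
import re


# Hypothetical helper for illustration; not taken from sentencex or the lint rule.
def split_sentences(text, ignore_words=()):
    """Split after '.', '!' or '?' followed by whitespace, unless the text
    up to that point ends with one of ignore_words."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].rstrip()
        if any(candidate.endswith(word) for word in ignore_words):
            continue  # keep going: this terminator is treated as mid-sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Pausing... for... thought... should not [trigger splitting]."
naive = split_sentences(text)                            # splits at every ellipsis
guarded = split_sentences(text, ignore_words=("...",))   # keeps one sentence
print(naive)
print(guarded)
```

The trade-off in this sketch is that, with ... in the ignore list, a sentence that really does end in an ellipsis is no longer split from the one that follows it.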