Open Soreine opened 4 years ago
- Signature identification
- Various formats for headers
  - On Fri, Nov 19th…
  - On 10/9/2018
- Headers that wrap across lines
- From:, To:, Date: style headers
- Reply chains indicated by > or multiple >>>
- Some lines look like signatures but aren’t
- Corrupted email headers
- Common for plain text emails to split reply headers
- Multi-language support if required
- Header formats change over time
Due to this, we suggest not coding your own signature parsing algorithm. It is non-trivial.
Biased source: SigParser, a paid service for email parsing
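To make one of the listed cases concrete, here is a minimal sketch of detecting reply chains marked with `>` or `>>>`. It is my own illustration, not code from SigParser or Talon; the names `quote_depth` and `strip_quoted_lines` are hypothetical.

```python
import re

# Leading run of '>' markers, written either as ">>>" or as "> > >".
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s?)+)")


def quote_depth(line: str) -> int:
    """Return how many '>' markers prefix the line (0 for unquoted text)."""
    match = QUOTE_PREFIX.match(line)
    return match.group(1).count(">") if match else 0


def strip_quoted_lines(body: str) -> str:
    """Drop every line that starts with at least one '>' marker."""
    kept = [line for line in body.splitlines() if quote_depth(line) == 0]
    return "\n".join(kept).strip()
```

Even this simple case has a catch: a line the author really wrote that happens to start with `>` is dropped too, which is the kind of false positive the list above warns about.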
I have found https://github.com/mailgun/talon (in Python) which is interesting for its quotation detection for Text and HTML, and its basic text signature detection (forget about the signature detection with machine learning). They also have a lot of real-world fixtures, which is invaluable.
There is a JS port of it, https://github.com/quentez/talonjs/, made by people from Front, who I believe are great engineers. The repo is not documented, but it is recent and maintained.
There is also another port https://github.com/lever/planer which is older and seems less complete.
Both planer and talonjs require a DOM implementation to work (xmldom or jsdom, for example). talonjs also uses cheerio to clean up the input document a bit.
For information, below is the algorithm used by Talon for HTML messages:
# Extract actual message from provided html message body
# using tags and plain text algorithm.
#
# Cut out the 'blockquote', 'gmail_quote' tags.
# Cut out Microsoft (Outlook, Windows mail) quotations.
#
# Then use plain text algorithm to cut out splitter or
# leftover quotation.
# This works by adding checkpoint text to all html tags,
# then converting html to text,
# then extracting quotations from text,
# then checking deleted checkpoints,
# then deleting necessary tags.
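As a rough illustration of only the first step in that comment (cutting `blockquote` and `gmail_quote` elements), here is a sketch using Python's standard `html.parser`. It is not Talon's implementation: it skips the Microsoft quotation handling and the checkpoint mechanism entirely, and it returns plain text rather than cleaned HTML. The class name `QuoteStripper` and the function `strip_html_quotes` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class QuoteStripper(HTMLParser):
    """Collect text while skipping <blockquote> and .gmail_quote subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.skip_stack = []  # tags opened inside a quoted region

    @staticmethod
    def _is_quote(tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag == "blockquote" or "gmail_quote" in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and not self.skip_stack:
                self.text_parts.append("\n")
            return
        if self.skip_stack or self._is_quote(tag, attrs):
            self.skip_stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.skip_stack:
            while self.skip_stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.skip_stack:
            self.text_parts.append(data)


def strip_html_quotes(html: str) -> str:
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts).strip()
```

Whatever survives this pass would still need the plain text algorithm described above, since many clients quote the original message without any of these tags.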
Things we could take from Mailspring:
Things we could take from TalonJS:
- "On date, somebody wrote:" lines
We should improve the existing logic to detect the replied messages. We can use blockquotes as indicators, or common strings like "On Friday, 27 November 2015, Your Tempo <contact@yourtempo.co> wrote". Here are some useful regexes for such messages in several languages.
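To give an idea of what such detection could look like, below is a small sketch of my own. The patterns are hand-written assumptions for four languages, not the regexes from the source mentioned above nor the ones Talon uses, and the names `REPLY_HEADER` and `strip_reply` are hypothetical. The header may wrap across lines and the trailing colon is optional, as in the example string.

```python
import re

# Hand-picked, illustrative patterns; real mail clients produce many more variants.
REPLY_HEADER_PATTERNS = [
    r"On\s.{1,200}?\swrote",               # English
    r"Le\s.{1,200}?\sa\s+écrit",           # French
    r"El\s.{1,200}?\sescribió",            # Spanish
    r"Am\s.{1,200}?\sschrieb\s.{1,200}?",  # German: the verb precedes the sender
]

# DOTALL lets the header wrap across lines; the 200-character cap keeps
# the match from swallowing whole paragraphs.
REPLY_HEADER = re.compile(
    r"^[ \t>]*(?:" + "|".join(REPLY_HEADER_PATTERNS) + r")[ \t\xa0]*:?[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def strip_reply(body: str) -> str:
    """Return the text before the first reply header, or the body unchanged."""
    match = REPLY_HEADER.search(body)
    return body[:match.start()].rstrip() if match else body
```

For instance, `strip_reply("Sure.\n\nOn Friday, 27 November 2015, Your Tempo <contact@yourtempo.co> wrote:\n> Are you free?")` keeps only `"Sure."`. A paragraph that happens to start with "On" and end a line with "wrote" would be cut as well, so this is best combined with other indicators such as blockquotes.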