yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[Platform] Email alerts formatting and missing details #6889

Open chirag-yb opened 3 years ago

chirag-yb commented 3 years ago

The platform sends three different kinds of alert emails, and each one uses a different format. There is no consistency between them.


**Subject** - Yugabyte Platform Alert - <[USERNAME]
**Body** - 
Create Backup failed for aws-test.

Task Info: ybdemo_keyspace:cassandrakeyvalue

**Subject** - Yugabyte Platform Alert - <[USERNAME]
**Body** -
Common failure for universe 'sp-test':
Health check script got error: Traceback (most recent call last):
 File "bin/cluster_health.py", line 19, in <module>
   error is here
NameError: name 'error' is not defined code (1) [ 636 ms ].

**Subject** - ERROR - <[USERNAME]> tls-test-5
**Body** -

Universe name:tls-test-5 2020-12-01 08:54:37

Universe version:2.5.1.0-b84



Requirements
- All alerts should use the same formatting (subject and body).
- All alerts should follow the same guidelines:

1. Include the platform IP and hostname.
2. Include a link to the task page where relevant (e.g. a failed backup).

jitendra-12113 commented 3 years ago

Please review: https://github.com/yugabyte/yugabyte-db/pull/7580

streddy-yb commented 3 years ago

@ymahajan - Can you help elaborate on the requirements for this issue? Thanks.

ymahajan commented 3 years ago

Here is the format we should use -

Subject - ${severity} Alert: ${source} ${condition}
For example: Critical Alert: Universe puppy-food disk space above 80%

Body [HTML] - {platform_hostname/IP}, {source} / YB version. Include a link to the task page where relevant (e.g. a failed backup).
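
To make the proposed format concrete, here is a minimal Python sketch of how one formatter could render both the subject and the HTML body. It is an illustration only, not the platform's implementation: the `Alert` class, its field names, `render_subject`, `render_body`, and the example host and task URL are all hypothetical.

```python
from dataclasses import dataclass
from html import escape
from string import Template
from typing import Optional

# Subject format taken from the comment above.
SUBJECT_TEMPLATE = Template("${severity} Alert: ${source} ${condition}")


@dataclass
class Alert:
    """Hypothetical container for the fields the proposed format needs."""
    severity: str          # e.g. "Critical"
    source: str            # e.g. "Universe puppy-food"
    condition: str         # e.g. "disk space above 80%"
    platform_host: str     # platform hostname or IP
    yb_version: str        # YB version of the source universe
    task_url: Optional[str] = None  # only set when a task page exists


def render_subject(alert: Alert) -> str:
    """Fill the shared subject template for any alert type."""
    return SUBJECT_TEMPLATE.substitute(
        severity=alert.severity,
        source=alert.source,
        condition=alert.condition,
    )


def render_body(alert: Alert) -> str:
    """Build a small HTML body; the task link is added only when present."""
    parts = [
        f"<p>Platform: {escape(alert.platform_host)}</p>",
        f"<p>Source: {escape(alert.source)} / YB version {escape(alert.yb_version)}</p>",
    ]
    if alert.task_url:
        parts.append(f'<p><a href="{escape(alert.task_url)}">Task details</a></p>')
    return "\n".join(parts)


# Hypothetical values, used only to show the rendered result.
example = Alert(
    severity="Critical",
    source="Universe puppy-food",
    condition="disk space above 80%",
    platform_host="platform.example.com",
    yb_version="2.5.1.0-b84",
)
print(render_subject(example))
print(render_body(example))
```

Keeping the task link optional lets a backup failure carry a link to its task page, while a health check alert, which has no task, omits it without needing a separate template.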

jitendra-12113 commented 3 years ago

@ymahajan So the above format should apply to all kinds of alerts, right?

  1. Backup failure alert (include backup task links)
  2. Health check alert
  3. Alert report with various checks for different nodes

cc: @SergeyPotachev, @kkg-yb

SergeyPotachev commented 3 years ago

@jitendra-12113 As we are working on improvements to the alerting mechanism, we need to postpone this issue. For now we don't have enough information to fill in such a template (severity, current metric value, etc.). cc @streddy-yb @kkg-yb

SergeyPotachev commented 3 years ago

cc @ymahajan