zncdatadev / trino-operator

Operator for Trino, the distributed SQL query engine for big data
Apache License 2.0
3 stars 5 forks source link

[Feature]: Logging architecture and aggregation architecture #93

Closed tutunannan closed 1 month ago

tutunannan commented 1 month ago

Summary 💡

As a user of the platform, I have poor visibility into what is happening in different components of the platform and how these components influence each other, especially operators and the product clusters they manage.

Logs are used to make this information available. Currently, every pod logs to stdout with a certain default log format. For log configuration we support setting a custom log configuration file in some products.

However, logs are not persisted, meaning that a crashed pod is difficult to investigate. They are also not aggregated, so that investigating issues involving multiple pods is difficult. Log configuration is difficult and sometimes not even possible.

Examples 🌈

Motivation 🔦

Aggregate and persist - The logging on the platform was designed to aggregate logs from all parts of the platform to make it easy to correlate events from different parts. For this, logs should share the same structure, and should be viewable in a central location. Logs should also be persisted in a central location, so if a component crashes, the logs are still there to identify the reason.

Easy to read on the fly - At the same time, logs should still be accessible in an easy to read format on the containers, to allow for easy on the fly inspection of each part of the platform. The logging configuration also supports setting different thresholds for the logs readable on the container and the aggregated logs. This way you can get a detailed view of the operations of a component while viewing it on the container, but aggregate logs at a coarser granularity when aggregating across the whole platform.

Consistent configuration - Finally, logging should be always configured the same way, no matter which product and which underlying technology is used to produce the logs. Logging for each product is configured in the ProductCluster resource. It is still supported to supply custom logging configuration files, these are then product specific.