y-scope / clp

Compressed Log Processor (CLP) is a free log management tool capable of compressing logs and searching the compressed logs without decompression.
https://yscope.com
Apache License 2.0

IR stream reader API redesign #539

Open LinZhihao-723 opened 2 weeks ago

LinZhihao-723 commented 2 weeks ago

Request

As the IR format has evolved, an IR stream (ignoring the preamble and end-of-stream byte) is no longer just a sequence of serialized unstructured log events. In addition to log events, we’ve introduced other concepts that may change the stream's state without producing a log event. For clarity, we’ll refer to these “concepts,” including log events, as IR units. For example, to support loggers that change time zones, we’ve added an IR unit that indicates a UTC offset change. These new IR units may appear between log-event IR units. Moreover, their position in the stream is unpredictable, so we cannot say, for instance, that they will appear after every three log events. The IR units we have in the latest IR format are:

Note that although our current IR streams can be stateful, that state has always been updated with each log event. For instance, the four-byte-integer encoding IR stream stores the timestamp in each log event as a timestamp delta; thus, an IR stream reader needs to keep track of the absolute timestamp of the last log event so that it can calculate the absolute timestamp of the next log event as last_log_event_abs_timestamp + next_log_event_timestamp_delta. This stream state is updated after deserializing each log event. However, as mentioned before, a UTC-offset-change IR unit may appear after any number of log events, and it then affects every log event deserialized afterward. (Although we discuss reading/deserializing IR streams above, the process is similar for writing/serializing.)
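To make the two kinds of state concrete, here is a minimal Python sketch of the bookkeeping described above. It is only an illustration, not CLP's actual API: the class names, the `decode` function, the `reference_timestamp` parameter, and the units chosen for timestamps and offsets are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple, Union


# Hypothetical in-memory stand-ins for already-parsed IR units.
@dataclass
class LogEventUnit:
    timestamp_delta: int  # delta from the previous log event (unit assumed)
    message: str


@dataclass
class UtcOffsetChangeUnit:
    utc_offset: int  # new UTC offset (unit assumed to be seconds)


IrUnit = Union[LogEventUnit, UtcOffsetChangeUnit]


def decode(units: Iterable[IrUnit], reference_timestamp: int) -> List[Tuple[int, int, str]]:
    """Return (absolute timestamp, UTC offset in effect, message) per log event."""
    last_log_event_abs_timestamp = reference_timestamp
    utc_offset = 0
    events = []
    for unit in units:
        if isinstance(unit, UtcOffsetChangeUnit):
            # State changes here even though no log event is produced.
            utc_offset = unit.utc_offset
        else:
            # State that is updated once per log event.
            last_log_event_abs_timestamp += unit.timestamp_delta
            events.append((last_log_event_abs_timestamp, utc_offset, unit.message))
    return events


units = [
    LogEventUnit(5, "a"),
    UtcOffsetChangeUnit(3600),
    LogEventUnit(10, "b"),
    LogEventUnit(2, "c"),
]
decoded = decode(units, reference_timestamp=1000)
# decoded == [(1005, 0, "a"), (1015, 3600, "b"), (1017, 3600, "c")]
```

The timestamp is advanced by every log event, while the UTC offset changes only when a UTC-offset-change unit appears, and then applies to all later log events.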

The current IR stream reader APIs make it easy to read log events, but have several limitations for IR formats that include additional IR units like UTC offset changes. Currently, when the caller calls the API to read a log event, the reader will read all IR units up to and including the next log event. For instance, if there are one or more UTC offset changes before the next log event, each would be read—updating the reader’s state—and then the log event would be read and returned. The limitations of this design are as follows:

Thus, we propose redesigning the reader’s APIs to solve these issues.
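For reference, the current read behavior described above (each call consumes every IR unit up to and including the next log event) can be sketched as follows. This is a simplified model rather than the real reader: `CurrentStyleReader`, `read_next_log_event`, and the `(kind, payload)` tuples are hypothetical names used only for illustration.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation of an already-parsed IR unit: (kind, payload).
IrUnit = Tuple[str, object]


class CurrentStyleReader:
    """Toy model of a reader whose only read call returns log events."""

    def __init__(self, units: Iterable[IrUnit]) -> None:
        self._units = iter(units)
        self.utc_offset = 0

    def read_next_log_event(self) -> Optional[object]:
        # Consume units until a log event is found; other units only
        # update internal state and are never returned to the caller.
        for kind, payload in self._units:
            if kind == "utc_offset_change":
                self.utc_offset = payload
            elif kind == "log_event":
                return payload
        return None  # stream exhausted


reader = CurrentStyleReader([
    ("log_event", "a"),
    ("utc_offset_change", 3600),
    ("utc_offset_change", 7200),
    ("log_event", "b"),
])
first = reader.read_next_log_event()   # "a"
offset_after_first = reader.utc_offset  # 0
second = reader.read_next_log_event()  # "b"
offset_after_second = reader.utc_offset  # 7200
```

In this sketch, both UTC offset changes are applied inside the second call, so the caller only sees the offset in effect once that call returns.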

Possible implementation

To read IR streams, we propose a class structure that consists of a deserializer class and optional user-defined IR unit handlers. Intuitively, the deserializer will be responsible for deserializing IR units from the stream. Users of the deserializer can pass in IR unit handlers for the IR units they are interested in. When the deserializer deserializes one of these IR units, it will call the relevant IR unit handler, allowing the user to perform any additional handling for the IR unit.
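A rough Python sketch of this structure is shown below. It is not the proposed CLP interface: the `Deserializer` class, the `deserialize_next_ir_unit` method, the handler dictionary, and the `(kind, payload)` tuples standing in for serialized IR units are all assumptions made to illustrate the idea of a deserializer that dispatches to optional user-defined handlers.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical representation of an already-parsed IR unit: (kind, payload).
IrUnit = Tuple[str, object]
Handler = Callable[[object], None]


class Deserializer:
    """Deserializes one IR unit per call and invokes the matching handler."""

    def __init__(
        self,
        units: Iterable[IrUnit],
        handlers: Optional[Dict[str, Handler]] = None,
    ) -> None:
        self._units = iter(units)
        # Handlers are optional; units without a handler are simply skipped over.
        self._handlers = dict(handlers or {})

    def deserialize_next_ir_unit(self) -> Optional[str]:
        """Return the kind of the unit just deserialized, or None at end of stream."""
        unit = next(self._units, None)
        if unit is None:
            return None
        kind, payload = unit
        handler = self._handlers.get(kind)
        if handler is not None:
            handler(payload)
        return kind


offsets_seen: List[object] = []
deserializer = Deserializer(
    [("log_event", "a"), ("utc_offset_change", 3600), ("log_event", "b")],
    handlers={"utc_offset_change": offsets_seen.append},
)

kinds: List[str] = []
while True:
    kind = deserializer.deserialize_next_ir_unit()
    if kind is None:
        break
    kinds.append(kind)
# kinds == ["log_event", "utc_offset_change", "log_event"]
# offsets_seen == [3600]
```

Here the user registers a handler only for UTC offset changes, so log events are deserialized without any extra handling, while each UTC offset change is passed to the user's callback as soon as it is deserialized.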