y-scope / clp

Compressed Log Processor (CLP) is a free log management tool capable of compressing logs and searching the compressed logs without decompression.
https://yscope.com
Apache License 2.0
871 stars 70 forks source link

feat: Record log-order at compression time for clp-s. #584

Open gibber9809 opened 1 week ago

gibber9809 commented 1 week ago

Description

This PR adds support for recording log order at compression time and leveraging that information at decompression time to achieve log-order decompression. The --ordered flag now performs log-order decompression and timestamp-ordered decompression is no longer supported.

Technically recording log order could be made optional, but per side-discussion it will save trouble down the line to force recording log order now.

Log order is recorded in a new "Internal" subtree in the MPT. Specifically, we add a new "Internal" node type and create a subtree off of the root node which can contain fields internal to the clp-s implementation. These fields are ignored during decompression (i.e. they do not get marshalled), and search (they can not be resolved to). The log event index is recorded in an integer field called "log_event_idx" in this subtree.

This PR also cleans up some code surrounding its changes. In particular the code to insert nodes into the MPT has been improved, and many instances of std::string const& have been replaced with std::string_view const throughout the codebase.

Validation performed

Summary by CodeRabbit

Release Notes

These changes collectively enhance the application's functionality, performance, and user experience.

coderabbitai[bot] commented 1 week ago

Walkthrough

The pull request introduces various enhancements across multiple files, focusing on improving error handling, internal state management, and performance optimizations. Key changes include the addition of new member variables to track log event indices and message sizes, updates to method signatures for improved efficiency, and refinements in error messages for command-line argument parsing. Additionally, several methods have been modified to enhance functionality related to schema reading and writing, while maintaining existing logic and control flow.

Changes

File Path Change Summary
components/core/src/clp_s/ArchiveReader.cpp Added member variable m_log_event_idx_column_id. Updated initialize_schema_reader to mark log event index columns. Modified close to reset m_log_event_idx_column_id.
components/core/src/clp_s/ArchiveReader.hpp Added member variable int32_t m_log_event_idx_column_id{-1};. Added method bool has_log_order();.
components/core/src/clp_s/ArchiveWriter.cpp Introduced member variables m_next_log_event_id and m_encoded_message_size. Updated close to reset m_next_log_event_id. Adjusted append_message logic to increment m_next_log_event_id.
components/core/src/clp_s/ArchiveWriter.hpp Updated add_node method signature to accept std::string_view. Added method get_next_log_event_id. Added member variable m_next_log_event_id.
components/core/src/clp_s/CommandLineArguments.cpp Enhanced error messages for ordered and ordered-chunk-size. Added new command-line option --disable-log-order.
components/core/src/clp_s/JsonConstructor.cpp Updated store and construct_in_order methods to use indices instead of timestamps.
components/core/src/clp_s/JsonConstructor.hpp Modified comment in construct_in_order method.
components/core/src/clp_s/JsonParser.cpp Added method add_metadata_field. Enhanced parse method to include log event index. Improved error handling.
components/core/src/clp_s/JsonParser.hpp Added method add_metadata_field with std::string_view parameter.
components/core/src/clp_s/JsonSerializer.hpp Updated method signatures to use std::string_view instead of std::string.
components/core/src/clp_s/ReaderUtils.cpp Modified scope of key variable in read_schema_tree.
components/core/src/clp_s/SchemaReader.cpp Added method get_next_log_event_idx. Updated get_next_message_with_timestamp to get_next_message_with_metadata.
components/core/src/clp_s/SchemaReader.hpp Updated method signatures and added new methods related to log event indexing.
components/core/src/clp_s/SchemaTree.cpp Updated add_node method signature to use std::string_view. Added method get_metadata_field_id.
components/core/src/clp_s/SchemaTree.hpp Added new enum value Metadata. Updated method signatures and added new methods for internal nodes.
components/core/src/clp_s/archive_constants.hpp Introduced new constants in clp_s::constants and added new namespace results_cache::search.
components/core/src/clp_s/search/ColumnDescriptor.hpp Updated DescriptorToken constructor to accept std::string_view.
components/core/src/clp_s/search/Output.cpp Added populate_internal_columns method. Updated filter and write methods to include log event index.
components/core/src/clp_s/search/Output.hpp Added member variable m_metadata_columns and method populate_internal_columns.
components/core/src/clp_s/search/OutputHandler.cpp Updated write method in NetworkOutputHandler and ResultsCacheOutputHandler to include log_event_idx.
components/core/src/clp_s/search/OutputHandler.hpp Updated write method signatures across multiple handlers to include log_event_idx.
components/core/src/clp_s/search/Projection.cpp Replaced tree->get_root_node_id() with tree->get_object_subtree_node_id() in resolve_column.
components/core/src/clp_s/search/SchemaMatch.cpp Updated populate_column_mapping method to use get_object_subtree_node_id().
components/core/src/clp_s/search/SearchUtils.cpp Added case for NodeType::Metadata in node_to_literal_type. Modified control flow in double_as_int.
components/core/src/clp_s/CommandLineArguments.hpp Added method get_record_log_order() and member variable m_disable_log_order.
components/core/src/clp_s/clp-s.cpp Added record_log_order in JsonParserOption initialized using command_line_arguments.get_record_log_order().

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share - [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai) - [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai) - [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai) - [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)
🪧 Tips ### Chat There are 3 ways to chat with [CodeRabbit](https://coderabbit.ai): - Review comments: Directly reply to a review comment made by CodeRabbit. Example: - `I pushed a fix in commit , please review it.` - `Generate unit testing code for this file.` - `Open a follow-up GitHub issue for this discussion.` - Files and specific lines of code (under the "Files changed" tab): Tag `@coderabbitai` in a new review comment at the desired location with your query. Examples: - `@coderabbitai generate unit testing code for this file.` - `@coderabbitai modularize this function.` - PR comments: Tag `@coderabbitai` in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples: - `@coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.` - `@coderabbitai read src/utils.ts and generate unit testing code.` - `@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.` - `@coderabbitai help me debug CodeRabbit configuration file.` Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. ### CodeRabbit Commands (Invoked using PR comments) - `@coderabbitai pause` to pause the reviews on a PR. - `@coderabbitai resume` to resume the paused reviews. - `@coderabbitai review` to trigger an incremental review. This is useful when automatic reviews are disabled for the repository. - `@coderabbitai full review` to do a full review from scratch and review all the files again. - `@coderabbitai summary` to regenerate the summary of the PR. - `@coderabbitai resolve` resolve all the CodeRabbit review comments. - `@coderabbitai configuration` to show the current CodeRabbit configuration for the repository. - `@coderabbitai help` to get help. ### Other keywords and placeholders - Add `@coderabbitai ignore` anywhere in the PR description to prevent this PR from being reviewed. - Add `@coderabbitai summary` to generate the high-level summary at a specific location in the PR description. - Add `@coderabbitai` anywhere in the PR title to generate the title automatically. ### CodeRabbit Configuration File (`.coderabbit.yaml`) - You can programmatically configure CodeRabbit by adding a `.coderabbit.yaml` file to the root of your repository. - Please see the [configuration documentation](https://docs.coderabbit.ai/guides/configure-coderabbit) for more information. - If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: `# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json` ### Documentation and Community - Visit our [Documentation](https://docs.coderabbit.ai) for detailed information on how to use CodeRabbit. - Join our [Discord Community](http://discord.gg/coderabbit) to get help, request features, and share feedback. - Follow us on [X/Twitter](https://twitter.com/coderabbitai) for updates and announcements.
kirkrodrigues commented 6 days ago

with only a few outliers

Can you report data for the outliers as well? (Or a table with all results?). We tend to come back to PRs more and more nowadays, so they serve as good docs.

gibber9809 commented 6 days ago

with only a few outliers

Can you report data for the outliers as well? (Or a table with all results?). We tend to come back to PRs more and more nowadays, so they serve as good docs.

Sure. Ran it on all public datasets, and actually it seems to be a little worse on public datasets than the general case.

dataset new compression ratio old compression ratio decrease (%)
cockroach 21.002 21.559 2.65%
mongod 142.058 238.050 67.57%
elasticsearch 125.275 158.742 26.71%
spark-event-logs 56.297 58.119 3.24%
postgresql 36.528 40.576 11.08%
haiqi96 commented 5 days ago

As the PR currently does, I would prefer to let clp-s jsonl's metadata us orig_file_id instead of archive_id. so it is easier to use a common query logic for both json and ir stream file. (so no change is required)

If we decided to rename it from orig_file_id to archive_id, I would prefer to let IR and JsonL both use "target_id", or "source_id".

gibber9809 commented 12 hours ago

Nice work! One thing I'm curious is whystd::string_view offers better performance thanconst std::string & since we don't use string literals.

A lot of the changes to string_view aren't immediately a performance improvement, but they open us up to using string_view in more places in the codebase going forward instead of creating copies in std::string. E.g. there are many places in the current implementation of JsonParser where we turn a string_view into an std::string just so we can pass it somewhere else as an std::string const&. For the places I changed around marshalling I mostly started using std::string_view because as a result of other changes some code paths were passing std::string_view to some function and others were passing std::string to the same function, so accepting std::string_view as argument allows those functions to handle both (whereas only accepting std::string const& would force an unnecessary copy ont the std::string_view path).

For SchemaNode and add_node specifically I touch on it in my other comment.

wraymo commented 9 hours ago

If we disable log order, do you think we should throw errors for ordered decompression?

gibber9809 commented 8 hours ago

If we disable log order, do you think we should throw errors for ordered decompression?

Hmm, maybe just log a warning. I'm probably going to use this flag in a script to migrate older archive versions to newer ones and making this error out and fail would make me have to write a lot more error handling.