Open gibber9809 opened 1 week ago
The pull request introduces various enhancements across multiple files, focusing on improving error handling, internal state management, and performance optimizations. Key changes include the addition of new member variables to track log event indices and message sizes, updates to method signatures for improved efficiency, and refinements in error messages for command-line argument parsing. Additionally, several methods have been modified to enhance functionality related to schema reading and writing, while maintaining existing logic and control flow.
File Path | Change Summary |
---|---|
components/core/src/clp_s/ArchiveReader.cpp |
Added member variable m_log_event_idx_column_id . Updated initialize_schema_reader to mark log event index columns. Modified close to reset m_log_event_idx_column_id . |
components/core/src/clp_s/ArchiveReader.hpp |
Added member variable int32_t m_log_event_idx_column_id{-1}; . Added method bool has_log_order(); . |
components/core/src/clp_s/ArchiveWriter.cpp |
Introduced member variables m_next_log_event_id and m_encoded_message_size . Updated close to reset m_next_log_event_id . Adjusted append_message logic to increment m_next_log_event_id . |
components/core/src/clp_s/ArchiveWriter.hpp |
Updated add_node method signature to accept std::string_view . Added method get_next_log_event_id . Added member variable m_next_log_event_id . |
components/core/src/clp_s/CommandLineArguments.cpp |
Enhanced error messages for ordered and ordered-chunk-size . Added new command-line option --disable-log-order . |
components/core/src/clp_s/JsonConstructor.cpp |
Updated store and construct_in_order methods to use indices instead of timestamps. |
components/core/src/clp_s/JsonConstructor.hpp |
Modified comment in construct_in_order method. |
components/core/src/clp_s/JsonParser.cpp |
Added method add_metadata_field . Enhanced parse method to include log event index. Improved error handling. |
components/core/src/clp_s/JsonParser.hpp |
Added method add_metadata_field with std::string_view parameter. |
components/core/src/clp_s/JsonSerializer.hpp |
Updated method signatures to use std::string_view instead of std::string . |
components/core/src/clp_s/ReaderUtils.cpp |
Modified scope of key variable in read_schema_tree . |
components/core/src/clp_s/SchemaReader.cpp |
Added method get_next_log_event_idx . Updated get_next_message_with_timestamp to get_next_message_with_metadata . |
components/core/src/clp_s/SchemaReader.hpp |
Updated method signatures and added new methods related to log event indexing. |
components/core/src/clp_s/SchemaTree.cpp |
Updated add_node method signature to use std::string_view . Added method get_metadata_field_id . |
components/core/src/clp_s/SchemaTree.hpp |
Added new enum value Metadata . Updated method signatures and added new methods for internal nodes. |
components/core/src/clp_s/archive_constants.hpp |
Introduced new constants in clp_s::constants and added new namespace results_cache::search . |
components/core/src/clp_s/search/ColumnDescriptor.hpp |
Updated DescriptorToken constructor to accept std::string_view . |
components/core/src/clp_s/search/Output.cpp |
Added populate_internal_columns method. Updated filter and write methods to include log event index. |
components/core/src/clp_s/search/Output.hpp |
Added member variable m_metadata_columns and method populate_internal_columns . |
components/core/src/clp_s/search/OutputHandler.cpp |
Updated write method in NetworkOutputHandler and ResultsCacheOutputHandler to include log_event_idx . |
components/core/src/clp_s/search/OutputHandler.hpp |
Updated write method signatures across multiple handlers to include log_event_idx . |
components/core/src/clp_s/search/Projection.cpp |
Replaced tree->get_root_node_id() with tree->get_object_subtree_node_id() in resolve_column . |
components/core/src/clp_s/search/SchemaMatch.cpp |
Updated populate_column_mapping method to use get_object_subtree_node_id() . |
components/core/src/clp_s/search/SearchUtils.cpp |
Added case for NodeType::Metadata in node_to_literal_type . Modified control flow in double_as_int . |
components/core/src/clp_s/CommandLineArguments.hpp |
Added method get_record_log_order() and member variable m_disable_log_order . |
components/core/src/clp_s/clp-s.cpp |
Added record_log_order in JsonParserOption initialized using command_line_arguments.get_record_log_order() . |
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
with only a few outliers
Can you report data for the outliers as well? (Or a table with all results?). We tend to come back to PRs more and more nowadays, so they serve as good docs.
with only a few outliers
Can you report data for the outliers as well? (Or a table with all results?). We tend to come back to PRs more and more nowadays, so they serve as good docs.
Sure. Ran it on all public datasets, and actually it seems to be a little worse on public datasets than the general case.
dataset | new compression ratio | old compression ratio | decrease (%) |
---|---|---|---|
cockroach | 21.002 | 21.559 | 2.65% |
mongod | 142.058 | 238.050 | 67.57% |
elasticsearch | 125.275 | 158.742 | 26.71% |
spark-event-logs | 56.297 | 58.119 | 3.24% |
postgresql | 36.528 | 40.576 | 11.08% |
As the PR currently does, I would prefer to let clp-s jsonl's metadata us orig_file_id instead of archive_id. so it is easier to use a common query logic for both json and ir stream file. (so no change is required)
If we decided to rename it from orig_file_id to archive_id, I would prefer to let IR and JsonL both use "target_id", or "source_id".
Nice work! One thing I'm curious is why
std::string_view
offers better performance thanconst std::string &
since we don't use string literals.
A lot of the changes to string_view aren't immediately a performance improvement, but they open us up to using string_view in more places in the codebase going forward instead of creating copies in std::string. E.g. there are many places in the current implementation of JsonParser where we turn a string_view into an std::string just so we can pass it somewhere else as an std::string const&. For the places I changed around marshalling I mostly started using std::string_view because as a result of other changes some code paths were passing std::string_view to some function and others were passing std::string to the same function, so accepting std::string_view as argument allows those functions to handle both (whereas only accepting std::string const& would force an unnecessary copy ont the std::string_view path).
For SchemaNode
and add_node
specifically I touch on it in my other comment.
If we disable log order, do you think we should throw errors for ordered decompression?
If we disable log order, do you think we should throw errors for ordered decompression?
Hmm, maybe just log a warning. I'm probably going to use this flag in a script to migrate older archive versions to newer ones and making this error out and fail would make me have to write a lot more error handling.
Description
This PR adds support for recording log order at compression time and leveraging that information at decompression time to achieve log-order decompression. The
--ordered
flag now performs log-order decompression and timestamp-ordered decompression is no longer supported.Technically recording log order could be made optional, but per side-discussion it will save trouble down the line to force recording log order now.
Log order is recorded in a new "Internal" subtree in the MPT. Specifically, we add a new "Internal" node type and create a subtree off of the root node which can contain fields internal to the clp-s implementation. These fields are ignored during decompression (i.e. they do not get marshalled), and search (they can not be resolved to). The log event index is recorded in an integer field called "log_event_idx" in this subtree.
This PR also cleans up some code surrounding its changes. In particular the code to insert nodes into the MPT has been improved, and many instances of
std::string const&
have been replaced withstd::string_view const
throughout the codebase.Validation performed
Summary by CodeRabbit
Release Notes
New Features
ArchiveReader
andArchiveWriter
.SchemaReader
.JsonParser
andSchemaTree
for better data structure handling.Output
class to improve filtering capabilities.Improvements
Output
class to better manage internal columns.std::string_view
for method parameters.Bug Fixes
These changes collectively enhance the application's functionality, performance, and user experience.