yorkie-team / yorkie

Yorkie is a document store for collaborative applications.
https://yorkie.dev
Apache License 2.0

Enhance GetDocuments API by adding bulk retrieval #931

Closed kokodak closed 1 month ago

kokodak commented 1 month ago

What this PR does / why we need it:

This PR implements a bulk retrieval operation for the GetDocuments API to enhance performance.

The specific tasks accomplished include:

While the query to retrieve DocInfos has been reduced from N times to once when calling the GetDocuments API, there still remains an issue where packs.BuildDocumentForServerSeq() is called N times.

However, this logic seems to be related to CRDT or logical clock functionalities, which I do not fully understand yet, so I could not work on it. Therefore, I did not remove the TODO comment regarding the N+1 issue.
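To illustrate the change from N lookups to one, here is a minimal sketch of a bulk lookup by key. The `memDB` type, its fields, and the lowercase `findDocInfosByKeys` helper are hypothetical stand-ins for this illustration. The real `FindDocInfosByKeys` signature may differ, and skipping missing keys is an assumption, not confirmed behavior.

```go
package main

import "fmt"

// DocInfo is a simplified, hypothetical stand-in for the server's document metadata.
type DocInfo struct {
	Key string
	ID  string
}

// memDB is a hypothetical in-memory store indexed by document key.
type memDB struct {
	docsByKey map[string]*DocInfo
}

// findDocInfosByKeys returns the DocInfos for all given keys in a single call,
// preserving the order of the requested keys.
// Assumption: keys without a matching document are skipped.
func (db *memDB) findDocInfosByKeys(keys []string) []*DocInfo {
	infos := make([]*DocInfo, 0, len(keys))
	for _, key := range keys {
		if info, ok := db.docsByKey[key]; ok {
			infos = append(infos, info)
		}
	}
	return infos
}

// newSampleDB builds a small store with two documents for the example.
func newSampleDB() *memDB {
	return &memDB{docsByKey: map[string]*DocInfo{
		"doc-a": {Key: "doc-a", ID: "1"},
		"doc-b": {Key: "doc-b", ID: "2"},
	}}
}

func main() {
	db := newSampleDB()
	// One call replaces one lookup per key on the caller's side.
	for _, info := range db.findDocInfosByKeys([]string{"doc-a", "missing", "doc-b"}) {
		fmt.Println(info.Key, info.ID)
	}
}
```

With a map the saving is only in the number of calls. Against a real database the same method would presumably issue a single query for all keys instead of one round trip per key.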

Which issue(s) this PR fixes:

Fixes #921


coderabbitai[bot] commented 1 month ago

Walkthrough

The new method FindDocInfosByKeys was introduced to the DB struct in database.go, enabling the retrieval of multiple documents based on given keys. A corresponding test function, RunFindDocInfosByKeysTest, was also added to verify the functionality. These enhancements aim to improve the performance of the GetDocuments API by facilitating efficient bulk data queries.

Changes

| File | Change Summary |
|---|---|
| server/backend/database/memory/database.go | Added FindDocInfosByKeys method to retrieve documents based on given keys. |
| server/backend/database/testcases/testcases.go | Added RunFindDocInfosByKeysTest to test the FindDocInfosByKeys method by creating documents with specified keys and verifying the retrieval process. |
| server/documents/documents.go | Revised GetDocumentSummary and GetDocumentSummaries to use the new FindDocInfosByKeys method for improved bulk retrieval efficiency. |
| server/rpc/admin_server.go | Updated GetDocuments function to include a new parameter for the include_snapshot flag to enhance data retrieval options. |
| api/yorkie/v1/admin.proto | Added include_snapshot field to GetDocumentsRequest for optional snapshot inclusion in responses. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Implement DB Query for GetDocuments API to improve performance (#921) | ✅ | |

In the realm of keys and docs they spin,
Where queries dance and tests begin,
Performance soared, the code refined,
In FindDocInfosByKeys, success we find.
πŸ‡πŸš€


sejongk commented 1 month ago

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
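As a rough sketch of the `$or` idea, the per-document lookups could be combined into one filter with one clause per document. The field names `doc_id` and `server_seq`, the `docRef` type, and the query shape are assumptions made for illustration. The filter is built from plain maps so the example runs without a database driver.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// docRef pairs a document ID with the server sequence to build it at.
// Hypothetical type for this sketch.
type docRef struct {
	DocID     string
	ServerSeq int64
}

// buildSnapshotFilter combines one clause per document under a single $or,
// so candidate snapshots for all documents can be requested in one query.
// Assumption: snapshots are stored with "doc_id" and "server_seq" fields.
func buildSnapshotFilter(refs []docRef) map[string]any {
	clauses := make([]map[string]any, 0, len(refs))
	for _, ref := range refs {
		clauses = append(clauses, map[string]any{
			"doc_id":     ref.DocID,
			"server_seq": map[string]any{"$lte": ref.ServerSeq},
		})
	}
	return map[string]any{"$or": clauses}
}

func main() {
	filter := buildSnapshotFilter([]docRef{{"doc-1", 10}, {"doc-2", 25}})
	out, err := json.MarshalIndent(filter, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A filter like this matches every snapshot at or below the requested sequence for each document, so the caller would still need to keep only the closest one per document. An empty input would also need an early return, since an empty `$or` array is not a valid query.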

kokodak commented 1 month ago

> Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?

@sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

sejongk commented 1 month ago

> @sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

Sure. If you have any suggestions about this, please let me know.

hackerwins commented 1 month ago

> > Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
>
> @sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

@kokodak @sejongk

DocumentSummaries, which is returned in the GetDocuments API response, contains both time-related metadata about each document and its content (the snapshot). Unlike the metadata, retrieving the snapshot requires loading the document into memory, which can be relatively resource-intensive.

The Document List Page in CodePair, which uses this API, needs only the time-related metadata and not the snapshot.

https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0

Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?
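A minimal sketch of how such an option might behave, assuming that a false flag skips the expensive snapshot-building step entirely. The `summarizer` type, `listSummaries`, and `buildSnapshot` are hypothetical names for this illustration, not the server's actual functions.

```go
package main

import "fmt"

// DocumentSummary is a simplified stand-in for the API's summary type.
type DocumentSummary struct {
	Key      string
	Snapshot string // left empty when the snapshot was not requested
}

// summarizer counts how often the expensive build step runs.
type summarizer struct {
	buildCalls int
}

// buildSnapshot stands in for loading a document into memory and serializing it.
func (s *summarizer) buildSnapshot(key string) string {
	s.buildCalls++
	return fmt.Sprintf(`{"doc":%q}`, key)
}

// listSummaries returns one summary per key and builds snapshots
// only when includeSnapshot is true.
func (s *summarizer) listSummaries(keys []string, includeSnapshot bool) []DocumentSummary {
	summaries := make([]DocumentSummary, 0, len(keys))
	for _, key := range keys {
		summary := DocumentSummary{Key: key}
		if includeSnapshot {
			summary.Snapshot = s.buildSnapshot(key)
		}
		summaries = append(summaries, summary)
	}
	return summaries
}

func main() {
	s := &summarizer{}
	list := s.listSummaries([]string{"doc-a", "doc-b"}, false)
	// With the flag off, no snapshot is built for any document.
	fmt.Println(len(list), s.buildCalls)
}
```

A list page that only needs metadata would pass false and skip the build step for every document, while callers that need content can still opt in.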

sejongk commented 1 month ago

> > > Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
> >
> > @sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?
>
> @kokodak @sejongk
>
> DocumentSummaries, which is returned in the GetDocuments API response, contains both time-related metadata about each document and its content (the snapshot). Unlike the metadata, retrieving the snapshot requires loading the document into memory, which can be relatively resource-intensive.
>
> The Document List Page in CodePair, which uses this API, needs only the time-related metadata and not the snapshot.
>
> https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0
>
> Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

Thanks for your suggestion. I believe this suggested method is somewhat related to https://github.com/yorkie-team/yorkie/pull/597.

kokodak commented 1 month ago

I have reviewed all the comments provided.

Currently, I have completed the implementation of bulk query methods for DB.FindClosestSnapshotInfo() and DB.FindChangesBetweenServerSeqs(), which are used in BuildDocumentForServerSeq().

However, I am facing some issues and need help with the following:

  1. Although the bulk query operations are implemented, I am having difficulty writing test cases for them, because coming up with good test scenarios is challenging. Could I get some help with this?

  2. I generally understand the context of @hackerwins's comment, but I am a bit unclear about the exact meaning of "snapshot", since the term is used in several places in the code. If the request value for include snapshot is false, does it mean that DB.FindClosestSnapshotInfo() should be called with includeSnapshot set to false, or does it mean that packs.BuildDocumentForServerSeq() should not be executed at all? (I am inclined to believe it's the latter.)

    • 2-a. If the latter is correct, should we still keep the bulk query code mentioned in point 1, to handle cases where include snapshot is true?

  3. I also agree that passing only the minimal information needed to render the screen is a good idea. However, if it turns out that snapshots will never be used in the GetDocuments API, we might consider configuring the code to exclude snapshots without adding an option to the API request. What are your thoughts on this? Should we still include the option in the request for flexibility?

    • 3-a. If we decide to include the option, we will need to coordinate with the front-end regarding the changes in the API structure. How should we approach this discussion?

kokodak commented 1 month ago

> DocumentSummaries, which is returned in the GetDocuments API response, contains both time-related metadata about each document and its content (the snapshot). Unlike the metadata, retrieving the snapshot requires loading the document into memory, which can be relatively resource-intensive.
>
> The Document List Page in CodePair, which uses this API, needs only the time-related metadata and not the snapshot.
>
> https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0
>
> Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

Following the discussion with @hackerwins and @sejongk about the suggestion quoted above, we have decided to add an option to the API request that controls whether snapshots are included.

As a result, the GetDocuments API request specification has changed, which can be reviewed in this commit.

Consequently, setting the include_snapshot field to false in the CodePair code should improve the performance of its GetDocuments API calls.