yorkie-team / yorkie

Yorkie is a document store for collaborative applications.
https://yorkie.dev
Apache License 2.0
786 stars 146 forks source link

Optimize `crdt.Array` Data Structure for Improved Performance #980

Open hackerwins opened 2 months ago

hackerwins commented 2 months ago

Description:

The current implementation of crdt.Array is based on RGATreeList, which uses both SplayTree and LLRBTree for internal element retrieval. While this structure works, there's potential for optimization, especially considering the random access nature of crdt.Array operations.

The current implementation of crdt.Array relies on a data structure called RGATreeList, which incorporates both SplayTree and LLRBTree for managing internal elements. While this approach is functional, there's room for improvement, particularly given the random access nature of crdt.Array operations. The SplayTree is utilized to locate nodes based on their indices during Document.Update calls, while the LLRBTree is employed to find nodes according to their Element.CreatedAt values in various operations.

However, the use of SplayTree may not be ideal for crdt.Array. Initially designed to efficiently handle text editing workloads, the SplayTree's characteristic of moving recently accessed nodes to the root could potentially create unnecessary overhead in the context of crdt.Array's random access patterns. This suggests that a different data structure might be more appropriate for optimizing performance.

To address this, we propose investigating and implementing an alternative data structure that is better suited to the access patterns of crdt.Array. The goal of this change is to enhance the performance of local editing operations, potentially leading to improved efficiency and responsiveness in the system overall.

Considerations for Implementation

  1. Benchmark current vs. proposed structures in various scenarios
  2. Assess memory usage of the new structure
  3. Consider scalability for future requirements(Array.Move, ...)

Why:

Implementing a more fitting data structure for crdt.Array will likely improve its performance during local editing, especially given the random access nature of the operations. This optimization will lead to smoother user experiences and better resource management within the application.

hackerwins commented 2 months ago

Related to https://github.com/yorkie-team/yorkie/pull/989

binary-ho commented 2 months ago

Can i try this?

m4ushold commented 2 months ago

Intuitively, using BBSTs (Balanced Binary Search Trees) such as Treap, AVL, or LLRB trees, which have an upper bound of $\log{n}$ for time complexity, seems preferable. However, since crdt.Array does not require weight-based access, data structures like Y-fast tries or Van Emde Boas trees could be potential alternatives. Nevertheless, given that searches are based on relative positions between nodes rather than key-based access, it’s uncertain whether these structures can effectively support this. Further research and discussion seem necessary. If anyone has a better idea, please let me know :)

m4ushold commented 2 months ago

I have reviewed the Y-fast Trie and the Van Emde Boas Tree. Both of these data structures are designed for integer-based keys, which may not be suitable unless there is a way to efficiently update the key values over a range. Therefore, it seems more practical to experiment with several balanced binary search trees (BBSTs) instead.