yorkie-team / yorkie

Yorkie is a document store for collaborative applications.
https://yorkie.dev
Apache License 2.0
771 stars 143 forks

Improve performance of Splay Tree in crdt.Text data structure to prevent skewness #941

Open raararaara opened 1 month ago

raararaara commented 1 month ago

What would you like to be added:

Currently, splay trees are used in the crdt.Text data structure. While splay trees are efficient when operations are performed sequentially within a contiguous range, they have a drawback: sequentially inserted elements are arranged linearly in the tree, causing it to degenerate into a skewed tree. In a tree with N vertices, performing M operations is guaranteed to take O((N+M) log N) time overall. However, once the tree becomes skewed, individual operations may initially take O(N) time before the tree rebalances and performance recovers. This skewness can therefore degrade the performance of the crdt.Text data structure, so it is essential to explore methods to prevent it.

Why is this needed: To maintain and enhance the performance and efficiency of the crdt.Text data structure in the face of potential skewness issues in the Splay trees.
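The skewed case described above is easy to reproduce. The sketch below is a minimal, hypothetical splay tree (illustrative only, not Yorkie's actual implementation): inserting N keys in sorted order splays each new node to the root, so the previous contents end up as one long spine and the height grows to N.

```typescript
// Minimal splay tree sketch (illustrative only, not Yorkie's implementation).
class SplayNode {
  left: SplayNode | null = null;
  right: SplayNode | null = null;
  parent: SplayNode | null = null;
  constructor(public key: number) {}
}

class SplayTree {
  root: SplayNode | null = null;

  private rotate(x: SplayNode): void {
    const p = x.parent!;
    const g = p.parent;
    if (p.left === x) {
      p.left = x.right;
      if (x.right) x.right.parent = p;
      x.right = p;
    } else {
      p.right = x.left;
      if (x.left) x.left.parent = p;
      x.left = p;
    }
    p.parent = x;
    x.parent = g;
    if (g) {
      if (g.left === p) g.left = x;
      else g.right = x;
    } else {
      this.root = x;
    }
  }

  splay(x: SplayNode): void {
    while (x.parent) {
      const p = x.parent;
      const g = p.parent;
      // zig-zig: rotate the parent first; zig-zag: rotate x twice.
      if (g) this.rotate((g.left === p) === (p.left === x) ? p : x);
      this.rotate(x);
    }
  }

  insert(key: number): SplayNode {
    let cur = this.root;
    let parent: SplayNode | null = null;
    while (cur) {
      parent = cur;
      cur = key < cur.key ? cur.left : cur.right;
    }
    const node = new SplayNode(key);
    node.parent = parent;
    if (!parent) this.root = node;
    else if (key < parent.key) parent.left = node;
    else parent.right = node;
    this.splay(node); // new node becomes root; the old tree hangs below it
    return node;
  }

  height(node: SplayNode | null = this.root): number {
    return node ? 1 + Math.max(this.height(node.left), this.height(node.right)) : 0;
  }
}

const tree = new SplayTree();
for (let i = 1; i <= 1000; i++) tree.insert(i); // sequential (sorted) inserts
console.log(tree.height()); // → 1000: the tree has degenerated into a chain
```

Any access to a node near the bottom of that chain then walks O(N) nodes before the splay rebalances the path.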

raararaara commented 1 month ago

Regarding the above, I would like to share what @justiceHui previously thought and experimented with.

How to prevent the structure of the splay tree from becoming a skew tree

To keep the splay tree from degenerating into a skewed tree, a good approach is to randomly select about 10 vertices and splay them every time elements have been accessed about 100 times. Implementing this is not difficult with the existing implementation: randomly generate an integer smaller than the size of the subtree you want to balance, and splay the node at that position. When you want to preserve the current root, you can define an operation splay(x, p) that raises x to be a child of p, similar to how a range is collected into a single subtree. Since the splay tree is itself a binary search tree, the dynamic optimality conjecture suggests that the total cost of a sequence of operations on a splay tree is not greater (up to a constant factor) than on any other binary search tree.
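A minimal sketch of this periodic random splay, under the assumption that we can reach any vertex directly (here via a side array of node references; the real implementation would locate the i-th node through subtree sizes). Fixed indices stand in for random choices so the effect is reproducible:

```typescript
// Minimal splay tree sketch (illustrative only, not Yorkie's implementation).
class SplayNode {
  left: SplayNode | null = null;
  right: SplayNode | null = null;
  parent: SplayNode | null = null;
  constructor(public key: number) {}
}

class SplayTree {
  root: SplayNode | null = null;

  private rotate(x: SplayNode): void {
    const p = x.parent!;
    const g = p.parent;
    if (p.left === x) {
      p.left = x.right;
      if (x.right) x.right.parent = p;
      x.right = p;
    } else {
      p.right = x.left;
      if (x.left) x.left.parent = p;
      x.left = p;
    }
    p.parent = x;
    x.parent = g;
    if (g) {
      if (g.left === p) g.left = x;
      else g.right = x;
    } else {
      this.root = x;
    }
  }

  splay(x: SplayNode): void {
    while (x.parent) {
      const p = x.parent;
      const g = p.parent;
      if (g) this.rotate((g.left === p) === (p.left === x) ? p : x);
      this.rotate(x);
    }
  }

  insert(key: number): SplayNode {
    let cur = this.root;
    let parent: SplayNode | null = null;
    while (cur) {
      parent = cur;
      cur = key < cur.key ? cur.left : cur.right;
    }
    const node = new SplayNode(key);
    node.parent = parent;
    if (!parent) this.root = node;
    else if (key < parent.key) parent.left = node;
    else parent.right = node;
    this.splay(node);
    return node;
  }

  height(node: SplayNode | null = this.root): number {
    return node ? 1 + Math.max(this.height(node.left), this.height(node.right)) : 0;
  }
}

// Build a fully skewed tree, keeping a reference to every node.
const nodes: SplayNode[] = [];
const tree = new SplayTree();
for (let i = 1; i <= 1000; i++) nodes.push(tree.insert(i));
const before = tree.height(); // 1000

// Every ~100 accesses, splay ~10 "random" vertices
// (fixed picks here for reproducibility; real code would use Math.random()).
for (const idx of [499, 249, 749, 124, 374, 624, 874, 62, 312, 562]) {
  tree.splay(nodes[idx]);
}
const after = tree.height(); // noticeably smaller than before
```

Each splay of a deep vertex roughly halves the depths along its access path, which is why a handful of extra splays is enough to break up a long spine.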

He shared the results of testing two scenarios using code written in C++: one in which five operations {insert_after, insert_before, order_of_node, erase, erase_range} are performed randomly, and one in which only insert_after is performed.

The source code is from here, and the data used in the experiment is from here.

As a result of randomly selecting and splaying one vertex every 500 operations, the total number of rotations (sum.rot) across 260,000 operations increased by 7%, while the maximum number of rotations performed in an individual operation (mx.rot) decreased by 66%.

m4ushold commented 1 month ago

I tried the above code with a random seed, but the performance worsened on average.

I experimented with incorporating the idea from this paper into @justiceHui's approach by adding a probabilistic splay of the parent vertex. In my opinion, the paper's deterministic method was not very effective. I used Scheme 2 from the paper because it yielded the best results and was easy to implement.

I compared @justiceHui's method, the paper's method, and a hybrid of the two. Because results varied between runs, I ran the experiments multiple times with different seeds and averaged the outcomes.

The source code is available here, and the data used in the experiment is the same as mentioned above.

The results showed that randomly selecting and splaying one vertex every 600 operations, combined with splaying the parent vertex with probability 1/2, increased the total number of rotations (sum.rot) by 6% over 260,000 operations, while the maximum number of rotations performed in an individual operation (mx.rot) decreased by 60%.
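The hybrid hook can be sketched as follows. `PERIOD`, `splayRandomNode`, and `splayParentOf` are hypothetical stand-ins for the real tree operations; the 1/2 probability and 600-operation period come from the experiment above.

```typescript
// Hybrid hook sketch: periodic random splay + probabilistic parent splay.
const PERIOD = 600; // from the experiment: one random splay every 600 operations
let opCount = 0;

function afterAccess(
  splayRandomNode: () => void, // hypothetical: splays a randomly chosen vertex
  splayParentOf: () => void,   // hypothetical: splays the accessed node's parent
): void {
  opCount++;
  if (opCount % PERIOD === 0) splayRandomNode(); // @justiceHui's periodic splay
  if (Math.random() < 0.5) splayParentOf();      // paper's probabilistic parent splay
}
```

The two mechanisms are independent, which is what makes combining them straightforward to test against each method in isolation.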

m4ushold commented 1 month ago

I discovered a bug in my code and, after fixing it and re-running the experiments, found that method3 did not significantly outperform method1; in fact, it sometimes performed worse. In my opinion, method1 (@justiceHui's method) is the simplest and best approach. Can I implement this?

justiceHui commented 1 month ago

Through testing, we have confirmed that adding one splay every 500 splays is a good approach for 260,000 operations.

However, since we have not experimented with fewer or more operations, it would be beneficial to conduct a few more experiments before starting the implementation. For example, we can test various functions, such as adding one additional splay for every $\log n$, $\log^2 n$, or $\sqrt n$ splays, where $n$ is the total number of splay operations.

If one function proves to be significantly better, we can try adjusting the parameters dynamically based on the cumulative number of splay operations. For instance, if we decide to add one splay operation every $\sqrt{n}$ splay operations, we can increase the value of RANDOM_WAIT by 1 each time the cumulative number of splay operations exceeds ${(\texttt{RANDOM WAIT}+1)}^2$.
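The adaptive schedule described above can be sketched as a small counter (the names `SplaySchedule`, `randomWait`, and `totalSplays` are my own; the real hook would live next to the tree's splay routine). It keeps `randomWait` close to $\sqrt{n}$ by bumping it whenever the cumulative splay count $n$ exceeds $(\texttt{randomWait}+1)^2$:

```typescript
// Adaptive schedule sketch: one extra splay every ~sqrt(n) splay operations.
class SplaySchedule {
  totalSplays = 0; // cumulative number of splay operations (n)
  randomWait = 1;  // current gap between extra random splays (RANDOM_WAIT)
  sinceLast = 0;   // splays since the last extra splay

  // Call once per splay; returns true when an extra random splay is due.
  onSplay(): boolean {
    this.totalSplays++;
    this.sinceLast++;
    // Keep randomWait ≈ sqrt(n): bump it once n exceeds (randomWait + 1)^2.
    while (this.totalSplays > (this.randomWait + 1) ** 2) this.randomWait++;
    if (this.sinceLast >= this.randomWait) {
      this.sinceLast = 0;
      return true;
    }
    return false;
  }
}
```

With this scheme the extra splays are frequent while the tree is small and become progressively rarer, so their amortized cost stays bounded.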

m4ushold commented 1 month ago

I conducted tests following your suggestions, generating data of various sizes by slicing or concatenating the edit-trace data.

The experiment results indicated that for sizes below $10^6$, the function proportional to $\sqrt{n}$ was optimal. However, beyond that threshold, the max_height and max_rotation of all methods converged to those of the basic splay tree; only the sum_rotation increased, suggesting that this method might not be necessary.

I suspect that having only one experimental dataset and not being able to create it properly might have resulted in inaccurate conclusions. To obtain more accurate results, do you know of any ways to generate larger datasets or any existing datasets that can be used for this purpose?

m4ushold commented 1 month ago

Experiment with a New Splay Method

In this experiment, we tested a new approach: splaying the leaf node that is furthest from the root, instead of a random node. The rationale is that splaying the deepest node should reduce skewness more effectively than selecting a random node. Although finding the deepest node takes longer than selecting a random one, the amortized time complexity remains $O(\log n)$, so we anticipated no significant performance degradation. Let's call this method max_height_splay.

Following the format of the previous experiments, we performed the max_height_splay operation periodically, where $n$ denotes the number of operations. Specifically, we performed one max_height_splay for every $f(n)$ operations, testing each $f(n) \in \{ \sqrt{n}, \log n, \log^2 n, \sqrt[3]{n} \}$.

The experimental results showed significant improvements in max rotation and max height: for trees with fewer than $10^5$ nodes and $f(n)=\log^2{n}$, the maximum height improved by 90%. The experiment code and results are available here.

However, for node sizes greater than $10^7$, the max height improved by only about 3%. As the number of operations grows, the splay tree balances itself through its ordinary accesses, so the additional splaying yields a smaller improvement.
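A sketch of how the deepest node could be located, assuming an illustrative `TreeNode` shape rather than Yorkie's actual types. A level-order traversal returns one of the deepest leaves in O(n); the periodic hook would then splay that node:

```typescript
// max_height_splay sketch: find a deepest leaf via level-order traversal.
interface TreeNode {
  left: TreeNode | null;
  right: TreeNode | null;
}

// Returns the last node visited level by level (one of the deepest leaves)
// together with its depth, counting the root as depth 1.
function deepestNode(root: TreeNode): { node: TreeNode; depth: number } {
  let queue: TreeNode[] = [root];
  let depth = 0;
  let last = root;
  while (queue.length > 0) {
    depth++;
    const next: TreeNode[] = [];
    for (const n of queue) {
      last = n;
      if (n.left) next.push(n.left);
      if (n.right) next.push(n.right);
    }
    queue = next;
  }
  return { node: last, depth };
}

// Example: a left chain of 4 nodes has its deepest leaf at depth 4.
const leaf: TreeNode = { left: null, right: null };
let root: TreeNode = leaf;
for (let i = 0; i < 3; i++) root = { left: root, right: null };
console.log(deepestNode(root).depth); // → 4
// The periodic hook would then call something like tree.splay(deepestNode(root).node).
```

The O(n) scan is the cost this method pays over random selection; it is only incurred once every $f(n)$ operations, which is why the amortized bound survives.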

Questions

  1. Lack of Diverse Data
    I only experimented with edit-trace data and couldn't test under various conditions. Does anyone know how to obtain diverse datasets for further testing?

  2. Loss of Splay Tree Advantages
In a text editing environment, which often behaves like a stack, reducing skewness might negate the advantages of a splay tree (though I'm not sure). This could mean a trade-off between faster performance in most cases with occasional slowdowns, and more consistent performance that is slightly slower overall.

m4ushold commented 1 month ago

If you don't have any other ideas, I'd like to implement and benchmark it. Is that okay?

raararaara commented 1 month ago

@m4ushold

Thank you for sharing your experiment. I have read through the case for the max_height_splay method that you shared and have thought about the questions you raised.

About max_height_splay

In general input situations (where inserts and erases can occur at any location), the method you proposed seems valid, as demonstrated by your experiment.

However, I have considered the following scenario: if typing only occurs at the end of the document, the strategy you introduced might not be significantly effective. Nonetheless, it still appears to be suitable for most situations.

Lack of Diverse Data

Although I cannot provide "edit-trace" data similar to what you used previously, I propose an alternative idea.

You can observe the shape and maximum depth of the splay tree during actual editing using the "CodeMirror" example from the yorkie-js-sdk. Additionally, with some modifications to the source code, you should be able to obtain the number of operations resulting from edits. You can follow the guide in the CONTRIBUTING document of the js-sdk to run CodeMirror.

The source code related to CodeMirror can be found in index.html and devtool/text.js. For the data structure part, you can refer to text and rgaTreeSplit.

In my case, I defined additional splay operations in the rgaTreeSplit.edit() part to see how the tree structure changes. Please refer to it as an example only.

public edit(
  range: RGATreeSplitPosRange,
  editedAt: TimeTicket,
  value?: T,
  maxCreatedAtMapByActor?: Map<string, TimeTicket>,
): [
  RGATreeSplitPos,
  Map<string, TimeTicket>,
  Array<GCPair>,
  Array<ValueChange<T>>,
] {
  // 01. split nodes with from and to
  const [toLeft, toRight] = this.findNodeWithSplit(range[1], editedAt);
  const [fromLeft, fromRight] = this.findNodeWithSplit(range[0], editedAt);

  // 02. delete between from and to
  const nodesToDelete = this.findBetween(fromRight, toRight);
  const [changes, maxCreatedAtMap, removedNodes] = this.deleteNodes(
    nodesToDelete,
    editedAt,
    maxCreatedAtMapByActor,
  );

  const caretID = toRight ? toRight.getID() : toLeft.getID();
  let caretPos = RGATreeSplitPos.of(caretID, 0);

  // 03. insert a new node
  if (value) {
    const idx = this.posToIndex(fromLeft.createPosRange()[1], true);

    const inserted = this.insertAfter(
      fromLeft,
      RGATreeSplitNode.create(RGATreeSplitNodeID.of(editedAt, 0), value),
    );

    if (changes.length && changes[changes.length - 1].from === idx) {
      changes[changes.length - 1].value = value;
    } else {
      changes.push({
        actor: editedAt.getActorID(),
        from: idx,
        to: idx,
        value,
      });
    }

    caretPos = RGATreeSplitPos.of(
      inserted.getID(),
      inserted.getContentLength(),
    );
  }

  // 💻 Add random splay logic according to the Lamport timestamp
  if (editedAt.getLamport().toNumber() % 20 === 0) {
    console.log('adjust occurs');
    for (let i = 0; i < 3; i++) {
      this.findAndSplay(Math.floor(Math.random() * this.length) + 1);
    }
  }

  // 04. add removed node
  const pairs: Array<GCPair> = [];
  for (const [, removedNode] of removedNodes) {
    pairs.push({ parent: this, child: removedNode });
  }

  return [caretPos, maxCreatedAtMap, pairs, changes];
}

Loss of Splay Tree Advantages

I look forward to hearing ideas from others.

Lastly, I'm excited to see the progress you're making with your implementation and bench!