yugabyte / yugabyte-db

[YSQL] Co-partitioned tables #11219

Open patriknw opened 2 years ago

patriknw commented 2 years ago

Jira Link: DB-915

Description

We are interested in the "Co-partitioning data" feature, which is described in https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/ysql-tablegroups.md

Colocating the index and the parent table with a common partition key, so that they are handled by the same tablet, could have several advantages.

I talk about a secondary index here, but the same probably applies to two tables.

The problem

A secondary index has a big performance cost. I know that a write to a secondary index becomes a distributed transaction, but it would be great to reduce that cost for cases where there is a "common" partition key between the index and its parent table.

Testing with yb-sample-app on a 9-node cluster (n2-standard-16, 2 local SSDs each) on GCP, from 5 clients with 100 write threads each:

Our use case

We have a journal for event-sourced entities. Currently, the table and the index look like this:

CREATE TABLE IF NOT EXISTS event_journal(
  slice INT NOT NULL,
  entity_type VARCHAR(255) NOT NULL,
  persistence_id VARCHAR(255) NOT NULL,
  seq_nr BIGINT NOT NULL,
  db_timestamp timestamp with time zone NOT NULL,

  event_payload BYTEA NOT NULL,

  -- more columns

  PRIMARY KEY(persistence_id HASH, seq_nr ASC)
);

CREATE INDEX IF NOT EXISTS event_journal_slice_idx ON event_journal(slice ASC, entity_type ASC, db_timestamp ASC)
  SPLIT AT VALUES ((127), (255), (383), (511), (639), (767), (895));

The purpose of the index is to retrieve all events in time order without knowing the full primary key. We partition the event processors with the slice column. We have a fixed number of slices, 1024. Each event processor handles a range of slices, for example 0 to 31. Those ranges may change over time to be able to adjust the number of event processors.

The query to read the events looks like:

SELECT slice, persistence_id, seq_nr, db_timestamp, event_payload
  FROM event_journal
  WHERE
    entity_type = ?
    AND slice BETWEEN ? AND ?
    AND db_timestamp >= ?
  ORDER BY db_timestamp, seq_nr
  LIMIT ?

To share a "common" partition key we could change the primary key and index to:

PRIMARY KEY((slice, entity_type) HASH, persistence_id ASC, seq_nr ASC)

Then we can't use range queries on slice, but it should be possible to replace them with

SELECT ... WHERE slice IN (0, 1, 2, 3, ...)
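Written out, the shared partition key and the rewritten query could look roughly like the sketch below (untested; the index definition and the IN list are my assumptions, and the point of the request is that the table and the index could then be co-partitioned onto the same tablets):

-- Sketch: table and index sharing the (slice, entity_type) hash partition key
CREATE TABLE IF NOT EXISTS event_journal(
  slice INT NOT NULL,
  entity_type VARCHAR(255) NOT NULL,
  persistence_id VARCHAR(255) NOT NULL,
  seq_nr BIGINT NOT NULL,
  db_timestamp timestamp with time zone NOT NULL,
  event_payload BYTEA NOT NULL,
  -- more columns
  PRIMARY KEY((slice, entity_type) HASH, persistence_id ASC, seq_nr ASC)
);

CREATE INDEX IF NOT EXISTS event_journal_slice_idx
  ON event_journal((slice, entity_type) HASH, db_timestamp ASC);

-- Query for one event processor covering, for example, slices 0 to 31
SELECT slice, persistence_id, seq_nr, db_timestamp, event_payload
  FROM event_journal
  WHERE entity_type = ?
    AND slice IN (0, 1, 2, 3 /* ... up to 31 */)
    AND db_timestamp >= ?
  ORDER BY db_timestamp, seq_nr
  LIMIT ?;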

ddorian commented 2 years ago

Colocating the index and the parent table with a common partition key, so that they are handled by the same tablet, could have several advantages.

I don't believe this (co-partitioning) is being worked on.

What is the seq_nr column?

How is db_timestamp generated?

patriknw commented 2 years ago

What is the seq_nr column?

For the event-sourced entities we store the events with a strictly monotonically increasing sequence number for each entity id. That sequence number is managed by the application layer. Each entity is a single writer.

The persistence_id is unique among all entities. An entity has a unique persistence_id, and for each event it stores, it increments the seq_nr. Append only (only inserts).
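As a concrete illustration of that append-only pattern, a write of the next event could look like this (hypothetical values; seq_nr is assigned by the application layer, and now() is only a placeholder for however db_timestamp is generated):

-- Hypothetical append of the next event for one entity; additional
-- NOT NULL columns of the real table are elided here.
INSERT INTO event_journal(slice, entity_type, persistence_id, seq_nr,
                          db_timestamp, event_payload)
VALUES (873, 'MyEntity', 'MyEntity|entity-42', 42, now(), '\x0102'::bytea);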

How is db_timestamp generated?

See https://github.com/yugabyte/yugabyte-db/issues/10996

How is slice generated?

The slice is a hash of the persistence_id modulo 1024.
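For illustration only, a slice in the 0..1023 range could be derived like this (the real hash is computed in the application layer; hashtext() is just a stand-in for that hash function):

-- Sketch: map a persistence_id to a slice in the 0..1023 range.
SELECT hashtext('MyEntity|entity-42') & 1023 AS slice;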

In addition to the query mentioned above, there is also one to select the events for a given persistence_id:

WHERE persistence_id = ? AND seq_nr >= ? ORDER BY seq_nr

We could include entity_type and slice in that query as well if needed, since those are derived from the persistence_id.
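Written out against the event_journal table above, that query could look like this (the exact column list is an assumption):

SELECT persistence_id, seq_nr, db_timestamp, event_payload
  FROM event_journal
  WHERE persistence_id = ? AND seq_nr >= ?
  ORDER BY seq_nr;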

patriknw commented 2 years ago

As another experiment to understand whether colocating the index and the parent table would be more efficient, I modified yb-sample-app to use more than one table, selecting the table based on the threadIdx.

I created 24 tablegroups, tables, and indexes, like this:

CREATE TABLEGROUP group0 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_0 (k text PRIMARY KEY, v text) TABLEGROUP group0;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_0_index ON sqlsecondaryindex_0(v) TABLEGROUP group0;

CREATE TABLEGROUP group1 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_1 (k text PRIMARY KEY, v text) TABLEGROUP group1;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_1_index ON sqlsecondaryindex_1(v) TABLEGROUP group1;

CREATE TABLEGROUP group2 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_2 (k text PRIMARY KEY, v text) TABLEGROUP group2;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_2_index ON sqlsecondaryindex_2(v) TABLEGROUP group2;

...

The SqlSecondaryIndex workload would then produce 39k writes/s (total), with YugabyteDB CPU around 93%.

For comparison, the following setup without tablegroups results in 32k writes/s (total), with YugabyteDB CPU around 95%:

CREATE TABLE IF NOT EXISTS sqlsecondaryindex (k text PRIMARY KEY, v text) SPLIT INTO 24 TABLETS;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_index ON sqlsecondaryindex(v) SPLIT INTO 24 TABLETS;

(I also tried that with 12 tablets each, with similar results.)

My conclusion from this is that colocation improves efficiency, but not as much as I had hoped. Maybe tablegroups can be further optimized by not handling a write as a full distributed transaction when the table and index are in the same tablet?

ddorian commented 2 years ago

Just to be sure, you were directing writes to different nodes, correct? And you were measuring the CPU of different nodes? Assuming they all went to ~93% on average.

patriknw commented 2 years ago

you were directing writes to different nodes, correct?

I had all 9 tservers in the --nodes option of yb-sample-app.

And you were measuring the CPU of different nodes?

Yes, I looked at a few of them, and the tablet servers view in the console showed write ops/sec on all 9 nodes.