Open patriknw opened 2 years ago
Colocating the index and the parent table with a common partition key so that they would be handled by the same tablet could have the following advantages:
I don't believe this is being worked on (co-partition).
What is seq_nr column?
How is db_timestamp generated?
What is seq_nr column?
For the event sourced entities we store the events with a strict monotonic increasing sequence number for each entity id. That sequence number is managed by the application layer. Each entity is single writer.
The persistence_id is unique among all entities. An entity has a unique persistence_id and for each event it stores it increments the seq_nr. Append only (only inserts).
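To make the write model concrete, here is a minimal sketch (illustrative only, not the actual application code) of an append-only journal where each entity is a single writer and the application layer assigns a strictly monotonically increasing seq_nr per persistence_id:

```python
class Journal:
    """Illustrative append-only event journal (sketch, not real code)."""

    def __init__(self):
        self.events = []       # append-only event log (inserts only)
        self.last_seq_nr = {}  # highest seq_nr per persistence_id

    def append(self, persistence_id, payload):
        # The application layer manages the sequence number: each new
        # event gets the previous seq_nr + 1 for that entity, starting at 1.
        seq_nr = self.last_seq_nr.get(persistence_id, 0) + 1
        self.last_seq_nr[persistence_id] = seq_nr
        self.events.append((persistence_id, seq_nr, payload))
        return seq_nr
```

Because each entity is a single writer, no coordination is needed to keep seq_nr strictly monotonic within a persistence_id.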
How is db_timestamp generated?
See https://github.com/yugabyte/yugabyte-db/issues/10996
How is slice generated?
Slice is a hash modulo 1024 of the persistence_id.
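A sketch of that derivation, assuming a stable hash (the source only says "hash modulo 1024 of the persistence_id"; the concrete hash function here is an assumption for illustration):

```python
import hashlib

NUMBER_OF_SLICES = 1024

def slice_for(persistence_id: str) -> int:
    # Use a stable hash (MD5 as an example) so the slice is the same
    # across processes, unlike Python's randomized built-in hash().
    digest = hashlib.md5(persistence_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUMBER_OF_SLICES
```

Since the slice is a pure function of persistence_id, it can always be recomputed and included in queries that already filter on persistence_id.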
In addition to the query mentioned above there is also one to select events for a given persistence_id.
WHERE persistence_id = ? AND seq_nr >= ? ORDER BY seq_nr
We could also include entity_type and slice in that query if needed, since those are derived from the persistence_id.
As another experiment to understand whether colocating the index and parent table would be more efficient, I modified the yb-sample-app to use more than one table, selecting the table based on the threadIdx.
Created 24 tablegroups, tables and indexes, like:
CREATE TABLEGROUP group0 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_0 (k text PRIMARY KEY, v text) TABLEGROUP group0;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_0_index ON sqlsecondaryindex_0(v) TABLEGROUP group0;
CREATE TABLEGROUP group1 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_1 (k text PRIMARY KEY, v text) TABLEGROUP group1;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_1_index ON sqlsecondaryindex_1(v) TABLEGROUP group1;
CREATE TABLEGROUP group2 owner postgres;
CREATE TABLE IF NOT EXISTS sqlsecondaryindex_2 (k text PRIMARY KEY, v text) TABLEGROUP group2;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_2_index ON sqlsecondaryindex_2(v) TABLEGROUP group2;
...
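The repeated DDL above (elided with "...") follows a simple pattern per group; a small sketch that generates all 24 statements, using the same names as the statements shown:

```python
def tablegroup_ddl(i: int) -> str:
    # Generate the tablegroup, table, and index DDL for group i,
    # matching the hand-written statements above.
    return "\n".join([
        f"CREATE TABLEGROUP group{i} owner postgres;",
        f"CREATE TABLE IF NOT EXISTS sqlsecondaryindex_{i} "
        f"(k text PRIMARY KEY, v text) TABLEGROUP group{i};",
        f"CREATE INDEX IF NOT EXISTS sqlsecondaryindex_{i}_index "
        f"ON sqlsecondaryindex_{i}(v) TABLEGROUP group{i};",
    ])

# All 24 tablegroups, tables, and indexes.
ddl = "\n".join(tablegroup_ddl(i) for i in range(24))
```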
SqlSecondaryIndex would then produce 39k writes/s (total), with Yugabyte CPU around 93%.
For comparison, without tablegroups the result is 32k writes/s (total), with Yugabyte CPU around 95%:
CREATE TABLE IF NOT EXISTS sqlsecondaryindex (k text PRIMARY KEY, v text) SPLIT INTO 24 TABLETS;
CREATE INDEX IF NOT EXISTS sqlsecondaryindex_index ON sqlsecondaryindex(v) SPLIT INTO 24 TABLETS;
(also tried that with 12 tablets each, with similar result)
My conclusion from this is that colocation improves efficiency, but not as much as I had hoped. Maybe tablegroups can be further optimized by not handling a write as a full distributed transaction when the table and index are in the same tablet?
Just to be sure, you were directing writes to different nodes, correct? And you were measuring CPU of different nodes? Assuming they all went ~93% on average
you were directing writes to different nodes, correct?
I had all 9 tservers in the --nodes of the yb-sample-app.
And you were measuring CPU of different nodes?
Yes, looked at a few of them, and the tablet servers view in the console showed that Write ops/sec were on all 9 nodes.
Jira Link: DB-915
Description
We are interested in the "Co-partitioning data" feature, which is described in https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/ysql-tablegroups.md
Colocating the index and the parent table with a common partition key so that they would be handled by the same tablet could have the following advantages:
I talk about a secondary index here, but the same probably applies to two tables.
The problem
A secondary index has a big performance cost. I understand that a secondary index write becomes a distributed transaction, but it would be great to reduce that cost for cases where there is a "common" partition key between the index and its parent table.
Testing with yb-sample-app on a 9-node cluster (n2-standard-16, 2 local SSDs each) on GCP, from 5 clients with 100 write threads each:
Our use case
We have a journal for event sourced entities. Currently, the table and the index look like this:
The purpose of the index is to retrieve all events in time order without knowing the full primary key. We partition the event processors with the slice column. We have a fixed number of slices, 1024. Each event processor handles a range of slices, for example 0 to 31. Those ranges may change over time to be able to adjust the number of event processors.
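A sketch of how the 1024 slices could be split into contiguous ranges, one per event processor (e.g. 32 processors give ranges 0-31, 32-63, ...); the exact range-assignment scheme here is an assumption for illustration:

```python
NUMBER_OF_SLICES = 1024

def slice_ranges(number_of_processors: int):
    # Split the fixed slice space into equal contiguous ranges,
    # one (min_slice, max_slice) pair per event processor.
    size = NUMBER_OF_SLICES // number_of_processors
    return [(i * size, (i + 1) * size - 1)
            for i in range(number_of_processors)]
```

Changing the number of processors just recomputes the ranges over the same fixed slice space, which is what allows rescaling the event processors over time.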
The query to read the events looks like:
To share a "common" partition key we could change the primary key and index to:
Then we can't use range queries, but it should be possible to replace that with