Open eugene-vodyanko opened 11 months ago
Table spec:
create table MESSAGES
(
id NUMBER not null,
creation_date DATE not null,
state NUMBER(1),
state_time TIMESTAMP(6),
...
, CONSTRAINT MESSAGES_PK PRIMARY KEY (sender, creation_date, id)
)
ORGANIZATION INDEX;
Query with ORDER BY DESC
select * from
(
select /*+ FIRST_ROWS(1)*/ id, creation_date, sender, state, length(content) from messages
where sender = 10
and creation_date between to_Date('2020-11-27', 'yyyy-mm-dd') and to_Date('2022-11-27', 'yyyy-mm-dd')
order by creation_date desc
)
where rownum <= 10
Plan have no sorting
------------------------------------------------------
| Id | Operation | Name
------------------------------------------------------
| 0 | SELECT STATEMENT |
| * 1 | COUNT STOPKEY |
| 2 | VIEW |
| * 3 | INDEX RANGE SCAN DESCENDING | MESSAGES_PK
Result ~ 80 ms. Env. Solaris/SPARC, HDD.
Disk space usage:
In general, Postgres metrics are similar to Oracles when using Index Only Scan queries, however:
Calculations with plans and so on staff omitted.
As expected, Cassandra shows itself quite favorably on the same load profile (~30-40 ms). Moreover, both for direct and reverse sorting order.
CREATE TABLE IF NOT EXISTS isdb.messages (
id int,
creation_date timestamp,
...
"_source" text,
"_ver" int,
PRIMARY KEY ((sender_id), creation_date , id, "_source")
)
WITH CLUSTERING ORDER BY (creation_date DESC, id DESC, "_source" ASC)
;
[2024-02-17 00:03:04] Connected
> select id, creation_date, sender_id, state_id from isdb.messages
where sender_id = 10
and creation_date >= '54874-05-09 04:06:40.000' and creation_date <= '54874-05-11 16:23:20.000'
order by creation_date desc, id desc
limit 10
[2024-02-17 00:03:04] 10 rows retrieved starting from 1 in 26 ms (execution: 5 ms, fetching: 21 ms)
> select id, creation_date, sender_id, state_id from isdb.messages
where sender_id = 10
and creation_date >= '54874-05-09 04:06:40.000' and creation_date <= '54874-05-11 16:23:20.000'
order by creation_date asc, id asc
limit 10
[2024-02-17 00:06:29] 10 rows retrieved starting from 1 in 27 ms (execution: 4 ms, fetching: 23 ms)
Cassandra also provides efficient compression up to 4 times in relation to the "raw" data.
P.S. The tests were performed on the same data set, but when uploading data to Cassandra, the timestamp columns were loaded incorrectly due to timestamp conversion problems. This should not be confusing, it does not affect performance.
Hi there.
First of all, thank you very much for YDB. I take off my hat.
Essence of the problem
In my test, I have the table with messages (20M rows) in YDB:
I want to request the history of recent messages for a specific sender (or senders):
The request takes an average of
850 milliseconds
to complete:The fragment from
Stat
tab in UI:The query plan contains an explicit sorting step:
If we change the sorting order from DESC to ASC:
We will get a much more interesting result (~25ms):
And the plan without a sorting step:
However, if we want to get data on the list of senders sorted by date, we will again get lower performance than we could have expected:
Expected behavior
Efficient sampling without physical sorting and without reading unnecessary data.
INDEX RANGE SCAN DESCENDING
scan mode for Index Organized Tables.In addition, in this DBs you can explicitly set the order in the index (PK), of the form (key_1 ASC, key_2 DESC). If I understood the YDB documentation correctly, this is not supported for either PK or secondary indexes.
Maybe, you have recommendations or instructions?
Other DBMS behavior: ClickHouse
I understand that this is an antipattern for ClickHouse... but.
ORDER BY ASC (~10ms)
ORDER BY DESC (~10ms)
And a more correct comparison with the keyword FINAL (33ms vs. 850ms):
Test Environment
P.S. Later, if you are interested, I can give similar calculations for Oracle and (maybe) Cassandra/ScyllaDB. P.P.S As workaround, I tried using a numeric column approach instead of a date that stores an "inverted" representation in the style of
toUnixTimestamp('2050-01-01 00:00:00') - toUnixTimestamp(creation_date)
. However, in this case, expressions become more complicated and filtering becomes less optimalDateTime::FromSeconds(CAST(2524582800 - delta as Uint32)) between Date("2020-11-27") and Date("2022-11-27")
. It just looks terrible. I am sure that my usage scenario is very common and there is a way to do it effectively.