@m-iancu: Is the default current_timestamp here:
https://gist.github.com/slava-vishnyakov/cfe543ffc50a751d99723ab5e622878c#file-create_data-py-L29
potentially related to @m-iancu's comment on the timestamp to timestamptz conversion interplay here?
https://github.com/yugabyte/yugabyte-db/issues/5536#issuecomment-684197935
@kmuthukk Thank you for looking into this. For the sake of the experiment, I removed all timestamptz fields (created_at, updated_at); this did not meaningfully affect performance (about 5% better).
create table users (
id uuid primary key default uuid_generate_v1(),
is_admin boolean,
name text
-- ,
-- created_at timestamptz default current_timestamp
);
create table subscriptions (
id uuid primary key default uuid_generate_v1(),
user_id uuid not null references users(id),
channel_id uuid not null references channels(id),
unread_count int not null default 0 check ( unread_count >= 0 )
-- ,
-- created_at timestamptz,
-- updated_at timestamptz
);
@slava-vishnyakov I'll reproduce it, as you provided the whole case (🙏🏼). What I see for the moment is that you are using orator, which uses the psycopg2 driver, which doesn't prepare statements, so each insert has to be parsed and compiled by the query planner. This adds unnecessary work, and it is already not good in PostgreSQL when you run repeated statements where only the values change. But it is worse in YugabyteDB because the catalog (dictionary) lookups involve RPCs from the tserver to the master, where the catalog is stored. Does this simulate real application behavior? I've documented the best way to ingest data into any PostgreSQL-compatible DB, in order to explain my changes in a PR for a benchmarking tool which observed the same difference between PostgreSQL and distributed databases: https://www.kaggle.com/franckpachot/maibornwolff-database-performance-comparison
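Roughly, the SQL-level workaround with plain psycopg2 looks like this (just a sketch, not your worker code; the connection URL and the pairs variable are placeholders), though as noted further down the EXECUTE call itself is still parsed:
# Minimal sketch: SQL-level PREPARE/EXECUTE with plain psycopg2
# (psycopg2 itself does not prepare statements). The connection URL
# and the `pairs` iterable are illustrative, not from the gist.
import psycopg2
conn = psycopg2.connect("postgresql://yugabyte@10.1.0.2:5433/test")
conn.autocommit = True
cur = conn.cursor()
# Parsed and planned once per connection...
cur.execute("prepare insert_subscription (uuid, uuid) as "
            "insert into subscriptions (user_id, channel_id) values ($1, $2)")
# ...then only the values are sent for each row.
for user_id, channel_id in pairs:
    cur.execute("execute insert_subscription (%s, %s)", (user_id, channel_id))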
@FranckPachot Thank you for your insights!
I tried switching to prepared statements, but it didn't really change that much.
The worker code with prepared statements is in this gist.
Just in case, I also tried removing default uuid_generate_v1()
from all of the primary keys and generating UUIDs in Python code manually, but that didn't change things much either.
Thank you!
Ok, calls to 'prepare' are not as efficient as when the client prepares the statement, but we cannot do better with psycopg2. Another thing is that inserts in Postgres are very fast at the beginning but require more work to be done later, and may slow down when checkpoints and vacuum occur. So looking at performance in the first minutes may be misleading. But, before that, can you confirm that what you want to achieve here is to simulate many short transactions that you cannot batch, and that you update the same table? Another thing: you said about 1G RAM is used within the 16GB. Where do you see that? It should use more, as a percentage of the available RAM, and this can be set with the --default_memory_limit_to_ram_ratio flags. For example, when started with yugabyted, it sets 0.6 for yb-tserver and 0.4 for yb-master.
Well, I'm trying to get a feel for how YugaByte stacks up against PostgreSQL, to possibly replace it. However, looking at a 300:1 difference without any scaling, it doesn't look good. If YugaByte showed a 2:1 slowdown, or maybe even 4:1, but scaled with tservers, that might make sense; at 300:1 it doesn't make any sense.
Yes, we COULD batch this, but I mean, subscriptions usually don't happen in batches :) Hello, dear YouTube users, please queue and subscribe to at least 1000 channels at once :) (No, we're not developing YouTube, just a joke)
Re: memory usage
I'm seeing this in htop: tserver uses 0.7% memory, master uses 0.6% memory. I don't use the --default_memory_limit_to_ram_ratio flag; the call is:
/root/yugabyte-2.9.0.0/bin/yb-master \
--master_addresses 10.1.0.2:7100 \
--rpc_bind_addresses=10.1.0.2:7100 \
--fs_data_dirs=/root/yugabyte_data/rf1/master1 \
--webserver_interface=0.0.0.0 \
--webserver_port=7000 \
--replication_factor=1
/root/yugabyte-2.9.0.0/bin/yb-tserver \
--tserver_master_addrs 10.1.0.2:7100 \
--rpc_bind_addresses 10.1.0.2:9100 \
--start_pgsql_proxy \
--pgsql_proxy_bind_address 10.1.0.2:5433 \
--fs_data_dirs=/root/yugabyte_data/rf1/tablet1 \
--webserver_port=9001 \
--redis_proxy_bind_address=10.1.0.2:6379 \
--redis_proxy_webserver_port=11000 \
--cql_proxy_bind_address=10.1.0.2:9042 \
--cql_proxy_webserver_port=12000
The worst part for me is that neither adding masters nor adding tservers improves performance (and CPU is barely used, whereas PostgreSQL maxes out the CPU). This feels like a locking bottleneck, as if everything is waiting for some shared resource. But, of course, it's just a guess.
Ok, I've done detailed logging and basically all of the workers eventually get stuck at increment_count and then only one keeps working. I color-coded three workers for better visibility:
db.update("prepare increment_count as update kv set v=v + $1 where k = 'count' returning v")
....
print_color('BEFORE execute increment_count')
total_subs = db.select("execute increment_count (%s)", [len(channels)])[0][0]
print_color('AFTER execute increment_count')
So, it seems that for whatever reason the call to update gets stuck. Sometimes it happens faster, sometimes a bit later:
But eventually all processes except one get locked up.
Ok, so running update kv set v=v+%s where k = 'count' returning v; either raw or as a prepared statement, YugabyteDB definitely locks all processes except one, which seems like a bug to me (the same does not occur in Postgres).
I moved this count to Postgres (so I have one connection to YugabyteDB and one to Postgres specifically for that count). Not a great solution, but ok for testing purposes.
I'm finally able to get CPUs to about 80% load with peak inserts at about 1700/s (which is still 23 times slower than Postgres at 40k/sec) using 10 processes (as opposed to 100 on Postgres, but there is no increase in speed beyond 10 processes for YugabyteDB)
Adding one more tserver, ~50 processes (no increase beyond 25, really), I get about 2700 req/s (after waiting until yugabyte stops shuffling data around):
If I switch to connecting to a random node instead of always connecting to the same one, I get about the same results, ~2500-2700 inserts/s, with occasional drops to 1900:
After adding one more tserver (rf=1, masters=1, tservers=3), with connections going to all three, 100 processes/connections, just like Postgres.
Master/tserver1 is very busy (top one), tserver2 is more relaxed, tserver3 is quite relaxed (bottom one). Still ~2800 inserts/sec, no improvement at all compared to tservers=2.
Just for the sake of the experiment, I've created an rf=3, masters=3, tablets=3 cluster on the same machines (removing any previous data dirs). I've got about 1700 inserts per second using 100 connections... and a "database is shutting down" error?!
tserver1 segfaulted after about 22,000 inserts to subscriptions:
I restarted it, waited for it to stop shuffling, afterwards I've got a sustained rate of ~2000 inserts/second, but at least no more segfaults.
So,
1) at best I was able to do 2800 inserts per second using 2 servers, which is still 14:1 to Postgres (i.e. vs 40,000 inserts/s using 1 Postgres server) or 28:1 considering that I use 2 servers;
2) Adding tservers beyond 2 servers with rf=1 didn't help at all;
3) I've got locked processes doing update kv set v=v+%s where k = 'count' returning v; (either as a prepared statement or not);
4) Random Segfault with rf=3 on tserver1 (never happened afterwards);
5) Prepared statements didn't affect performance in any measurable way;
6) Connecting to either one server or a random server did not affect performance in any measurable way (both with rf1/m1/t1 and rf3/m3/t3);
Anything else I could try here to get closer to Postgres result?
In Postgres this rate may slow down when a checkpoint occurs, and will definitely slow down when you VACUUM (this is not automatic before PG13 for insert-only tables). And VACUUM is necessary to avoid hanging on xid wrap-around. I'm not saying that YugabyteDB is faster (monoliths are always faster when there's no need to scale out), but benchmarks on PostgreSQL should run for many hours to get the true picture.
How do you add the nodes? There's a default number of tablets per node, so it depends whether you add nodes before or after the CREATE TABLE (you can also define the number of splits in CREATE TABLE, as in the sketch below).
You can check the number of tablets in http://yb-master:7000/tables to be sure there are enough to be balanced on all tservers
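For example, roughly (a sketch assuming your YSQL version supports the SPLIT INTO clause; adjust the tablet count to your cluster, the connection URL is a placeholder):
# Hypothetical sketch: pre-split the table at creation time so tablets are spread
# over all tservers from the start. Assumes YSQL supports SPLIT INTO n TABLETS.
import psycopg2
conn = psycopg2.connect("postgresql://yugabyte@10.1.0.2:5433/test")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        create table subscriptions (
            id uuid primary key default uuid_generate_v1(),
            user_id uuid not null references users(id),
            channel_id uuid not null references channels(id),
            unread_count int not null default 0 check (unread_count >= 0)
        ) split into 24 tablets""")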
I suggest having one counter per server:
# init:
insert into kv (k, v) select to_char(generate_series,'9999'),0 from generate_series(0,1000);
# update:
total_subs = db.select("update kv set v=v+%s where k = to_char(%s,'9999') returning v;", [len(channels),i])[0][0]
# count:
total_subs = db.select("select sum(v) from kv")[0][0]
If the goal is to monitor the number of rows inserted, this is also provided by:
select calls,rows,min_time,max_time,mean_time,stddev_time,query from pg_stat_statements where query like 'insert%';
We need logs to troubleshoot this. Note that it is much easier to use https://docs.yugabyte.com/latest/reference/configuration/yugabyted/ to start the cluster with the right defaults.
Prepared statements should improve this, but unfortunately that's not possible with the psycopg2 driver. Calling prepare and execute is not as efficient because the execute call still has to be parsed. There are many other PostgreSQL drivers for Python but I'm not sure which are supported by orator. Do you need an ORM? Here you run native SQL statements.
First we need to be sure that what is done there (row-by-row commits) is what you expect. I understand subscriptions cannot be inserted in batch. But here each insert into subscriptions happens in a separate transaction (because the default for orator is autocommit, which is probably never a good idea for a SQL application; this default is meant to avoid long-running transactions when there are user interactions). The update kv is in another transaction. That sounds weird to me as it can be inconsistent if one worker fails.
Anyway, thank you for all your tests, we have everything to reproduce and I'll look at that next week with colleagues, to see where the time is spent for this kind of workload. Please confirm that you want each insert committed rather than one transaction covering the whole for channel_id in channels loop (see the sketch below).
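For comparison, the one-transaction-per-user variant would look roughly like this (a sketch with plain psycopg2, not your orator code; user_id and channels stand for whatever the worker already holds, and the connection URL is a placeholder):
# Sketch only: one transaction per user covering the whole
# "for channel_id in channels" loop, instead of autocommitting each insert.
import psycopg2
conn = psycopg2.connect("postgresql://yugabyte@10.1.0.2:5433/test")
conn.autocommit = False  # group the inserts explicitly
with conn.cursor() as cur:
    for channel_id in channels:
        cur.execute("insert into subscriptions (user_id, channel_id) values (%s, %s)",
                    (user_id, channel_id))
conn.commit()  # one commit for this user's whole set of subscriptions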
Thank you!
The problem was that I've got fully stuck workers even with 1 master 1 tablet server. Yes, it's done for counting only (since people advised me not to do count(*) for that), but it doesn't seem like it retries (or is retry taking more than a few minutes? because I saw workers blocked indefinitely).
Please, confirm that you want each insert committed rather than one transaction covering the whole for channel_id in channels
Yes, the use case is kind of similar to YouTube where a user subscribes to a channel. They do it one by one, obviously.
I've got locked processes doing update kv set v=v+%s where k = 'count' returning v; (either as prepared statement or not);
Can you explain the use case here? Are you counting the number of subscribers? If yes, it would be better to do incremental aggregation (blog post example).
prepared statements should improve this, but unfortunately not possible with psycopg2 driver
I think in psycopg3 it's available.
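For example, something roughly like this should be possible with psycopg 3 (assuming its prepare=True support; I haven't checked whether orator can use it, and the URL and pairs variable are placeholders):
# Sketch with psycopg 3 (not the orator/psycopg2 code from the gist).
# prepare=True asks the driver to use a server-side prepared statement,
# so repeated executions skip re-parsing.
import psycopg
with psycopg.connect("postgresql://yugabyte@10.1.0.2:5433/test", autocommit=True) as conn:
    with conn.cursor() as cur:
        for user_id, channel_id in pairs:
            cur.execute("insert into subscriptions (user_id, channel_id) values (%s, %s)",
                        (user_id, channel_id), prepare=True)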
(or is retry taking more than a few minutes?
Probably not, you can check the state of the connection with:
select
round(extract(epoch from clock_timestamp()-backend_start)) "backend(s)"
,round(extract(epoch from clock_timestamp()-xact_start)) "xact"
,round(extract(epoch from clock_timestamp()-query_start)) "query"
,round(extract(epoch from clock_timestamp()-state_change)) "state"
,state,wait_event_type,wait_event,query
from pg_stat_activity where datname='test'
order by state_change;
to see if they are active or idle, and for how long
@slava-vishnyakov In order to be sure what we reproduce is the same, can you generate a perf flamegraph:
# record perf samples during 60 seconds (run on the yugabyte server while the program is running)
sudo perf record --call-graph fp -F99 -e cpu-cycles -a sleep 60
# get Brendan Gregg flamegraph tool
git clone https://github.com/brendangregg/FlameGraph.git
# get my color map for YugabyteDB functions
wget -c https://raw.githubusercontent.com/FranckPachot/ybio/main/util_flamegraph/palette.map
# generate flamegraph on the recorded perf.data
sudo perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl --colors green --width=1200 --hash --cp | tee perf.svg
# this generates perf.svg that can be opened on any browser
@ddorian I mean, YugaByteDB positions itself as a drop-in replacement for Postgres, but I can't do something as trivial as update kv set v=v+1 where k = 'count' returning v;? I need one number, not per-day data, etc... The use case here is to track how many inserts all processes have done, to be able to directly compare it to Postgres performance.
@FranckPachot Which version should I perf? The one that locks up in update, or the one that works?
Which version should I perf
The one that works, please, to see what can be improved
I can't do something as trivial as update kv set v=v+1 where k = 'count' returning v;?
This will be a point of contention in any consistent database: scaling with multiple processes writing to multiple shards, and then finally all writing to the same row. All databases will serialize the updates to the same row, either with pessimistic locking (wait for concurrent transactions to commit) or optimistic locking (retry until there is no concurrent transaction). It is just worse with optimistic locking because the probability of conflict is high there. We will implement pessimistic locking for some use cases where it makes sense, but incrementing a single counter will always limit the scalability. See what I proposed above: have multiple counters, like one per worker, and sum them when you want the "one number".
If you want to get an idea of Yugabyte vs. PostgreSQL performance on row-by-row insert transactions, here is an example:
PostgreSQL:
postgres=# create table demo(i int primary key, t timestamp default clock_timestamp());
CREATE TABLE
postgres=# \timing on
Timing is on.
postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 5654.371 ms (00:05.654)
postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;
rows/s | count | duration
--------------------+-------+-----------------
1773.1782322512395 | 10000 | 00:00:05.639591
YugabyteDB:
yugabyte=# create table demo(i int primary key, t timestamp default clock_timestamp());
CREATE TABLE
yugabyte=# \timing on
Timing is on.
yugabyte=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 46716.501 ms (00:46.717)
yugabyte=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;
rows/s | count | duration
------------------+-------+-----------------
214.575311256509 | 10000 | 00:00:46.603684
So, per session, a 1:8 ratio is what I get for the simplest insert (no foreign key, no indexes). This gives an idea. However, this is per session and should scale if there is no single point of serialization (like updating the same row). Then, is this performance ratio a problem? We need to take into account the level of protection of the data and the service.
A PostgreSQL commit is just an fsync() of the WAL on local disk. All other writes are in memory. This is very fast on modern storage. However, in case of full storage failure you lose transactions (Recovery Point Objective > 0) and it can take hours to restore (Recovery Time Objective). Even with a standby database, RTO is in minutes and RPO is still not zero. In addition to that, the rate of inserts cannot be sustained without checkpoints and vacuum.
On the other hand, YugabyteDB is designed for RPO=0 (no data loss on failover) and an RTO of a few seconds (the time to elect new tablet leaders). Even when you run on one node with RF=1, the distributed design is still there: commits are not just a local fsync() but remote procedure calls, and writes to disk.
The target for your workload can be set by the following test on empty tables:
truncate subscriptions; truncate channels; truncate users;
insert into channels(name) select n::text from generate_series(1,100) n;
insert into users(name) select n::text from generate_series(1,100) n;
do $$ declare
c cursor for select * from channels;
u cursor for select * from users;
begin
for c in c loop
for u in u loop
insert into subscriptions(user_id,channel_id,created_at) values(u.id,c.id,clock_timestamp());
commit;
end loop;
end loop;
end; $$;
select count(*)/extract(epoch from max(created_at)-min(created_at)) "rows/s",count(*),max(created_at)-min(created_at) "duration" from subscriptions;
This is the most efficient way to do your single-row insert transactions. See if that is ok for you (for a highly available database) and then let's see how far you are from that with your code. If you don't need scale-out, resilience, and HA, then PostgreSQL can be 10x faster for sure. Another possibility is to concentrate multiple subscriptions at the application level and commit them in batch. For example, buffer 1000 subscriptions (only when the rate is high, so that the added latency is low), insert them as a multi-value insert, and return successful completion to all users when all are committed. For sure this is more code but it can reach another level of throughput.
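Roughly, that buffering idea could look like this with psycopg2.extras.execute_values (a sketch only; the buffer size, the connection URL, and the incoming_subscriptions() source are placeholders, not your worker code):
# Sketch: buffer (user_id, channel_id) pairs and flush them as one multi-value
# insert in one transaction, acknowledging the users only after the commit.
import psycopg2
from psycopg2.extras import execute_values
conn = psycopg2.connect("postgresql://yugabyte@10.1.0.2:5433/test")

def flush(buffer):
    # one statement, one commit for the whole batch
    if not buffer:
        return
    with conn.cursor() as cur:
        execute_values(cur, "insert into subscriptions (user_id, channel_id) values %s", buffer)
    conn.commit()
    buffer.clear()

buffer = []
for user_id, channel_id in incoming_subscriptions():  # hypothetical event source
    buffer.append((user_id, channel_id))
    if len(buffer) >= 1000:
        flush(buffer)
flush(buffer)  # flush the remaining tail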
@FranckPachot here's the perf results (3 masters, 3 tablets): perf.svg.zip
GitHub doesn't allow SVGs, so I had to compress it.
have multiple counters, like one per worker, and sum it when you want the "one number".
Thank you for the explanation. I already removed the counting part from Yugabyte, that's ok, I understand the limitation better now.
I did approach this problem to figure out whether Yugabyte is kind of like a "free lunch": replace Postgres with it and you can scale it indefinitely (kind of like when you do sharding/partitioning). But now that I understand it better, I absolutely agree that HA/fast failover/data protection are important. It's just that reading the Yugabyte docs I got the somewhat wrong idea that it's basically an easy replacement for Postgres performance-wise and easily scales with more servers. Ok, it seems like Yugabyte is more suitable for the HA case.
However, it's still strange for me that adding more tablet servers (I do need more tablet servers for scaling, right?) didn't improve my performance at all after 2 servers. That's considering that I even moved the update counter to Postgres.
It seems like this: you get ~10x slowdown by using Yugabyte vs Postgres, but you should be able to gain that perf back by adding tablet(+master?) servers... However it's not the case for me. So right now it seems like Yugabyte is resilient, but slow with no way to improve it beyond classic approaches like manual sharding. But that can be done with Postgres too (Postgres slaves provide data reliability and fast failover with a bit of manual work though) and you can also do sharding, but at the same time you get way more performance from PG.
I'm really hoping there's some obvious improvement that can be done here, because it feels like not being able to scale by adding tablets wasn't the plan for Yugabyte.
Thank you for your time!
Thanks for the flamegraph, this confirms that there is no CPU time spent beyond what we can expect. This should scale with more sessions. I don't understand why you seem to be limited in scalability. Are all your sessions active when it runs? This is visible with:
select
round(extract(epoch from clock_timestamp()-backend_start)) "backend(s)"
,round(extract(epoch from clock_timestamp()-xact_start)) "xact"
,round(extract(epoch from clock_timestamp()-query_start)) "query"
,round(extract(epoch from clock_timestamp()-state_change)) "state"
,state,wait_event_type,wait_event,query
from pg_stat_activity
order by state_change;
Yes, I see 100 sessions doing inserts in "active".
Occasionally, I see 2-3 in "idle" state, but they become active soon enough.
Ok, so they are all active, great. Maybe 100 sessions is too many for 16 vCPUs. One difference between PostgreSQL and YugabyteDB is that YB is multi-threaded: one client session will use more than one vCPU. I'll run the same in a 16 vCPU VM and tell you what I see.
On a 16 vCPU 30GB RAM (AWS c5.4xlarge):
--placement_cloud=aws
--placement_region=eu-west-2
--placement_zone=eu-west-2c
--max_log_size=256
--replication_factor=1
--use_client_to_server_encryption=true
--callhome_collection_level=medium
--enable_ysql=true
--use_cassandra_authentication=false
--ysql_enable_auth=false
--undefok=enable_ysql
--allow_insecure_connections=false
--txn_table_wait_min_ts_count=1
--use_node_to_node_encryption=true
and --default_memory_limit_to_ram_ratio=0.10 for the yb-master and --default_memory_limit_to_ram_ratio=0.85 for the yb-tserver
So it seems the optimal response time / throughput is about 1 worker per vCPU. Adding more just saturates the CPU.
Thank you for the benchmark! My biggest problem right now though isn't throughput per se, but rather what I can do to scale it beyond 2 servers. Are you able to get past 2200/3300 inserts/second by adding something? Tablet servers? Master servers? Changing the replication factor?
I've run the same on 3 c5.4xlarge in 3 different Availability Zones RF=3, with 25 workers connected to each and still have 3200 per node. The subscriptions table has 24 tablets with leaders balanced across all 3 nodes and followers in the others:
I've provisioned it from our platform, but settings are what I've put above
So the scaling is ~1 worker per vCPU on each node, and this workload scales out linearly with more nodes. I still have the update on a single row but it doesn't seem to be a problem when the CPU is not saturated. Most of the work is insert and commit.
Oh, sorry, I realize that you display the total throughput. So no, I was too optimistic here. Still 3200 tx/second, but now with RF3 replication across AZs.
Added 6 more nodes and ran 25 workers on each node again:
This scales from 3200 to 6000 tx/second by doubling the number of nodes
The throughput dropped (and latency increased) briefly when I added the new nodes (because tablets are re-balanced).
So in summary, on this example: yes, you need more resources to achieve the same throughput as PostgreSQL. But you get data protection, and the application is still available with no data loss even if one datacenter burns down.
And the performance is more predictable (no checkpoints, no vacuum).
Now that all nodes are added, I've re-created the whole test. The table now has 48 tablets. I've doubled the workers: the CPU is saturated, average latency is 10ms because there's some CPU wait (the run queue is longer than the available threads). The rate is very regular at 7400 tx/second. Note that the "Ops/sec" in the screenshot is the same as what you see in the tablet server web console. These are operations between the YSQL API and the DocDB storage. Here each insert into subscriptions, with primary and foreign key checking, and commit, is 2 read ops and 3 write ops.
Thank you for your test! I think I understand most of what I need to understand here now.
Thank you!
Hi, ummm, may I know the conclusion? Because when I google for "Yugabyte performance" this pops up as the top search result.
I'm sorry, this was 2 years ago and I don't remember all the details. In the end, since performance and not reliability was the concern, we went with regular Postgres, which was more than 10x faster. Again, this was our conclusion in our use case, it doesn't mean that it's slower for everyone in every scenario.
@codenoid in two years a lot of optimizations have been implemented to reach PostgreSQL parity (thanks to such detailed issues). If you encounter performance problems, can you open another issue and tag me?
I am in the same boat as @codenoid. All benchmarks on the web are old and they all show that YB is much slower than PG.
@FranckPachot assuming your statement is true, having such benchmarks and blog posts would help a lot in promoting YugabyteDB. There are lots of end users like me who would be drawn to YugabyteDB by such resources.
The test with YugabyteDB 2.9.0 performs about 300 times slower than Postgres :( both with rf=1 (1 master, 1 tserver), with rf=3 (3 masters, 3 tservers), and with rf=1, one master and 3 tablet servers.
I have a simple test schema: basically a user with a UUID primary key can subscribe to any number of channels, which also have UUID primary keys.
The script create_data.py creates Python processes, each with its own connection, and creates a user and a channel, then proceeds to subscribe the user to a random number of previously created channels (ranging from 1 to a few thousand); channel_ids are kept in memory for performance. (I attached all the test Python files in this gist. The schema is created here.)
I run either python create_data.py YUGABYTE_URL 100 (run on Yugabyte, with 100 parallel connections using 100 Python processes) or python create_data.py POSTGRES_URL 100 (same, on Postgres). (Connection URLs are kept in a .env file and loaded with dotenv.) PostgreSQL achieves about 40,000 subscriptions per second, while YugaByte can't get any higher than ~150 subscriptions/second.
Disabling indexes does not help. Adding tablets does not help.
Restarting is done with python create_data.py YUGABYTE_URL --drop, then re-init with python create_data.py YUGABYTE_URL --init. The worst part is that tablet servers barely show any load:
No other processes are running. I run YugaByte 2.9.0 as shown in either rf1-master.sh, rf1-tablet.sh (for replication factor 1) or using rf3-master{1,2,3}.sh rf3-tablet{1,2,3}.sh (obviously each one on their own server).
The servers have 16 vCPUs and 30G memory (about 1G used). All servers have SSDs. Ubuntu 20.04.
The load (create_data.py) is created from the 4th machine.
All are connected using private IPs and the UI correctly shows 3 masters, 3 tservers (or 1 master, 1 tserver / 3 tservers in the rf=1 case). The .WARNING logs don't show anything interesting.
With Python when I go from 1 connection to 100 connections, I get a proportional increase up until about 100 processes, where I get a sustained rate of ~40,000 subscriptions per second...
Postgres:
YugaByte 2.9.0:
With YugaByte I get about 110 subscriptions/sec with 1 connection, about 150/sec with 3 connections, and then that's it; it doesn't matter if I have 100 connections, or even if I connect to random nodes. It also does not matter whether the connection stays open or is reopened for each "batch" of ten users.
No transactions are performed.
The same situation occurs with rf=1, tablet_servers=3; there's really no benefit in running 2 more servers. The worst part is that they are all basically idle:
The weird part is that inserting 10 random users takes about the same time on Postgres and YugaByteDB: 1-5 seconds on YugaByteDB, 0.4-2.5 seconds on Postgres... But the concurrency on Postgres is much higher.
Is this expected? Am I missing something here?
Thank you!