
[YSQL] Import speed varying between 1 - 70 MB/s #10904

Closed. crabhi closed this issue 2 years ago.

crabhi commented 2 years ago

I'm evaluating Yugabyte and I have some issues with importing data from Postgres. I get very different import speeds every time. I'm not bothered by the speed itself at the moment, but rather that I don't understand why it changes so wildly.

Steps to reproduce

  1. Create a new self-hosted cluster (6x AWS t3a.large, 100 GB disk each, 3x yb-master, 6x yb-tserver)
  2. Create a table. This is the table I experimented with (name redacted). It has about 19 GB of data.
CREATE TABLE public.xyz (
  id SERIAL PRIMARY KEY,
  text text NOT NULL,
  text_hash character varying(40) NOT NULL
);
  3. Run import. Previously, I experimented with a variant of pg_dump --data-only | ssh $REMOTE ysqlsh. The current script uses basically psql -c "\copy (SELECT ... ORDER BY id) TO STDOUT" | ssh $REMOTE ysqlsh -c "\copy $TABLE FROM STDIN"
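
Roughly, the pipeline is equivalent to this sketch (table and host names are placeholders for the redacted ones):

# export from Postgres sorted by id, stream over ssh, load into YugabyteDB
psql "$SOURCE_DB" -c "\copy (SELECT * FROM xyz ORDER BY id) TO STDOUT" \
  | ssh "$REMOTE" 'ysqlsh yugabyte -c "\copy xyz FROM STDIN"'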

What I checked

What I observed

I'm stuck

I don't know where to look for a possible bottleneck or how to debug further. I can't reliably switch between the "slow import" and "fast import" states either. :-)

crabhi commented 2 years ago

Just for illustration - this is what the difference looks like during the import (see the "fast" and "slow" screenshots).

crabhi commented 2 years ago

In the logs, I found repeated occurrences of:

W1217 14:56:11.327088 1605522 long_operation_tracker.cc:113] UpdateReplica running for 1.000s in thread 1621627:
    @     0x7f912d40711f  (unknown)
    @     0x7f912dd84b39  __lll_lock_wait
    @     0x7f912dd7f6e2  __GI___pthread_mutex_lock
    @     0x7f9136f03e18  rocksdb::port::Mutex::Lock()
    @     0x7f9136f78d4b  rocksdb::InstrumentedMutex::Lock()
    @     0x7f9136e60cef  rocksdb::DBImpl::WriteImpl()
    @     0x7f9136e62963  rocksdb::DBImpl::Write()
    @     0x7f913c75c470  yb::tablet::Tablet::WriteToRocksDB()
    @     0x7f913c761eb4  yb::tablet::Tablet::ApplyKeyValueRowOperations()
    @     0x7f913c7623fe  yb::tablet::Tablet::ApplyOperation()
    @     0x7f913c7627a3  yb::tablet::Tablet::ApplyRowOperations()
    @     0x7f913c7fd9b8  yb::tablet::WriteOperation::DoReplicated()
    @     0x7f913c7ee07b  yb::tablet::Operation::Replicated()
    @     0x7f913c7f3960  yb::tablet::OperationDriver::ApplyTask()
    @     0x7f913c7f40ae  yb::tablet::OperationDriver::ReplicationFinished()
    @     0x7f913c3fc377  yb::consensus::ReplicaState::NotifyReplicationFinishedUnlocked()

(yb-tserver.WARNING)

And

W1217 14:17:54.329200 1675414 async_rpc_tasks.cc:260] 178b4f6bda244cafbbae88fecf4a12df Flush Tablets RPC (task=0x2dfbd50, state=kRunning): TS 178b4f6bda244cafbbae88fecf4a12df: Flush Tablets RPC failed for tablet : Timed out (yb/rpc/outbound_call.cc:485): FlushTablets RPC (request call id 14962) to 127.0.1.1:9100 timed out after 30.000s

(yb-master.WARNING)

But I'm not quite sure what to look for. :thinking: I'm OK with sharing the logs if you'd like to see them. There is nothing too sensitive in the database but I don't feel comfortable posting them publicly.

ddorian commented 2 years ago

(6x AWS t3a.large, 100 GB disk each, 3x yb-master, 6x yb-tserver)

In this case, it's better to have 3x t3a.xlarge instead of 6x t3a.large. You want to scale vertically (at least) until you reach the recommended hardware size: https://docs.yugabyte.com/latest/deploy/checklist/#hardware-requirements

t3a.large

I suggest not using burstable VPS instances for hosting a database.

The current script uses basically psql -c "\copy (SELECT ... ORDER BY id) TO STDOUT" | ssh $REMOTE ysqlsh -c "\copy $TABLE FROM STDIN"

Try to separate the data-generation from data-insertion. Meaning, generate static files like .csv, and then only benchmark the "loading/inserting" part.
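
For example, something along these lines (file, table, and host names are just placeholders), so the export cost doesn't get mixed into the measurement:

# 1) one-time export from Postgres into a static file
psql "$SOURCE_DB" -c "\copy (SELECT * FROM xyz ORDER BY id) TO '/tmp/xyz.tsv'"

# 2) benchmark only the load into YugabyteDB
time ysqlsh -h "$YB_HOST" yugabyte -c "\copy xyz FROM '/tmp/xyz.tsv'"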

I don't know where to look for a possible bottleneck or how to debug further. I can't reliably switch between the "slow import" and "fast import" states either. :-)

  1. Don't use burstable CPUs.

  2. How big are the files? Use smaller files (see the sketch after this list).

  3. Is this a normal operation that you need to do in production? In OLTP scenarios you usually have many connections each writing a few rows (which we optimize for), rather than a few connections doing very large transactions (which we don't optimize for, and which are somewhat heavy because of synchronous replication and the shared design: the insert ends up as a distributed transaction across all nodes, including the indexes).

  4. How to replicate? Can you use any of our benchmarking tools: https://docs.yugabyte.com/latest/benchmark/

  5. Use something around our recommended hardware: https://docs.yugabyte.com/latest/deploy/checklist/#hardware-requirements
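
As a sketch of point 2 (chunk size, file names, and the degree of parallelism are arbitrary), splitting the exported file into smaller pieces and loading a few of them at a time:

# split the exported file into ~500k-line chunks
split -l 500000 /tmp/xyz.tsv /tmp/xyz_chunk_

# load the chunks, e.g. 4 at a time
ls /tmp/xyz_chunk_* | xargs -P 4 -I{} \
  ysqlsh -h "$YB_HOST" yugabyte -c "\copy xyz FROM '{}'"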

crabhi commented 2 years ago

Thank you for your response. I'll try with different hardware. What surprised me were the performance differences which I couldn't explain, not slow performance per se. That makes me worry we could hit unexpected slowdowns down the road.

ddorian commented 2 years ago

I see quite low TPS and disk throughput. CPU is not at max either, and memory is not maxed out.

You probably need to increase concurrency and lower the size of the files.

But note that you won't be able to reproduce results reliably with burstable CPUs.

What surprised me were the performance differences which I couldn't explain, not slow performance per se.

You need to remove as many uncontrolled variables as possible. The easiest way to do that is to follow one of the benchmark suites that we use (assuming any of them is similar to your requirements).

crabhi commented 2 years ago

Hello in the new year. :-)

I tried again with 3x c5a.2xlarge. Using a single client, I observe data load throughput of around 2 - 8 MiB/s (4 000 rows/s). That's somewhat better than the previous case, but still not great. The CPU is at 30 % utilization and the network usage between the instances is less than 100 Mbit/s (c5a.2xlarge should be capable of 2.5 Gbps sustained, 10 Gbps burst).

Interestingly, if I break the data into chunks and import them in parallel (I tried 4x and 8x), I see almost no improvement in speed. However, the workload saturates the CPUs of all instances with yb-tserver processes.

I see around 400 - 450 IOPS (with 3000 IOPS EBS disks). Average queue size is less than one.

image

ddorian commented 2 years ago

Can you paste the output of \d+ public.xyz, just to double-check the sharding? Also, how many tablets does that table have? Can you paste a screenshot of http://yb-master-ip:7000/tablet-servers?

ddorian commented 2 years ago

Try to separate the data-generation from data-insertion. Meaning, generate static files like .csv, and then only benchmark the "loading/inserting" part.

Did you do this?

Interestingly, if I break the data and import the chunks in parallel (I tried 4x and 8x), I see almost no improvement in the speed. However, the workload will saturate CPUs of all instances by yb-tserver processes.

Are you sending all the requests to one server, or spreading them across the 3 servers?

crabhi commented 2 years ago
yugabyte=# \d+ public.xyz
                                                                    Table "public.xyz"
  Column   |         Type          | Collation | Nullable |                        Default                        | Storage  | Stats target | Description 
-----------+-----------------------+-----------+----------+-------------------------------------------------------+----------+--------------+-------------
 id        | integer               |           | not null | nextval('xyz_id_seq'::regclass)                       | plain    |              | 
 text      | text                  |           | not null |                                                       | extended |              | 
 text_hash | character varying(40) |           | not null |                                                       | extended |              | 
Indexes:
    "xyz_pkey" PRIMARY KEY, lsm (id HASH)
Referenced by:
    TABLE "some_other" CONSTRAINT "xpos_caption_fk_id_bf82c7ed_fk_sentione_" FOREIGN KEY (caption_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_context_fk_id_8cf9123f_fk_sentione_" FOREIGN KEY (context_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_context_title_fk_id_04941e12_fk_sentione_" FOREIGN KEY (context_title_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_description_fk_id_0f41143d_fk_sentione_" FOREIGN KEY (description_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_meta_title_fk_id_73b2dbd1_fk_sentione_" FOREIGN KEY (meta_title_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_parent_text_fk_id_26e85a17_fk_sentione_" FOREIGN KEY (parent_text_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "some_other" CONSTRAINT "xpos_title_fk_id_97d55e0d_fk_sentione_" FOREIGN KEY (title_fk_id) REFERENCES xyz(id) DEFERRABLE INITIALLY DEFERRED

Try to separate the data-generation from data-insertion. Meaning, generate static files like .csv, and then only benchmark the "loading/inserting" part.

Did you do this?

No, but I'm fairly sure this is not the bottleneck. While I was exporting the data to Yugabyte, I tried the same export to /dev/null, and it went at 70 MB/s. I'll try it anyway in a minute.

Are you sending all the requests to one server, or spreading them across the 3 servers?

Just one server, but the CPU is saturated on all of them. Only one has the postgres: ... COPY processes, but they consume only a small part of the CPU. I'll try to distribute the load.
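
For the distribution attempt, I'll try something like this sketch (host names are placeholders for my three tserver nodes; it starts one background load per chunk, so the chunk count effectively caps the parallelism):

# round-robin the chunks across the tserver nodes
HOSTS=(yb-ts0 yb-ts1 yb-ts2)
i=0
for f in export_xyz_chunk_*; do
  ysqlsh -h "${HOSTS[i % 3]}" yugabyte -c "\copy xyz FROM '$f'" &
  i=$((i + 1))
done
wait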

crabhi commented 2 years ago

This is the screenshot you asked for. There are more tables but only two of them have some data for now.

image

crabhi commented 2 years ago

By the way, thank you for looking into this @ddorian. I appreciate it very much. :smiley:

ddorian commented 2 years ago

What about using https://docs.yugabyte.com/latest/api/ysql/the-sql-language/statements/cmd_copy/#import-a-large-table-using-smaller-transactions ?

crabhi commented 2 years ago

Interesting. We're back to the behaviour I observed initially - the varying speed. However, the reduced transaction size seems to help somewhat.

$ pv -c < export_xyz_0000000001.txt | ssh admin@3.71.31.219 '/srv/yugabyte-2.8.0.0/bin/ysqlsh yugabyte -c "\copy xyz FROM STDIN WITH (ROWS_PER_TRANSACTION 1000)"'
1.61GiB 0:00:20 [80.8MiB/s] [=============================================================================================>] 100%            
ERROR:  duplicate key value violates unique constraint "xyz_pkey"

# TRUNCATE from another session

$ pv -c < export_xyz_0000000001.txt | ssh admin@3.71.31.219 '/srv/yugabyte-2.8.0.0/bin/ysqlsh yugabyte -c "\copy xyz FROM STDIN WITH (ROWS_PER_TRANSACTION 1000)"'
 654MiB 0:01:13 [10.2MiB/s] [====================================>                                                          ] 39% ETA 0:01:50

Now, it occurred to me that maybe Yugabyte, when it encounters an error, still consumes the rest of the input data and only then reports the error. :bulb: That would explain the wildly jumping speeds between tries, and especially the slower import after TRUNCATE, into a clean table. If that's true, then every fast import I observed would eventually have failed, and because I was running several experiments without letting them finish, I didn't see the error. :thinking:

ddorian commented 2 years ago

Now, it occurred to me that maybe Yugabyte, when it encounters an error, still consumes the rest of the input data and only then reports the error.

I don't think so. You can easily verify this by importing a big file with a duplicate of its 2nd line, so that it triggers a duplicate key error on the unique index early on.
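
For example, something like this (file names are placeholders) to repeat the second line near the start of the file and then load it:

# write a copy of the file with its 2nd line repeated right after itself
{ head -n 2 export_xyz.txt; sed -n '2p' export_xyz.txt; tail -n +3 export_xyz.txt; } > export_dup.txt

# if errors were reported as they happen, this should fail almost immediately
ysqlsh yugabyte -c "\copy xyz FROM 'export_dup.txt' WITH (ROWS_PER_TRANSACTION 1000)"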

crabhi commented 2 years ago

Yes, I think this happens when I start the import without truncating the table first. The duplicate occurs early. In fact, this is the way I can reproduce the issue.

  1. Run the import, observe 10 MB/s import speed.
  2. Cancel it any time after the first transaction is committed.
  3. Run the import again, observe 80 MB/s, and see ERROR: duplicate key value only after the whole file has been sent (mine is 1.6 GB).
ddorian commented 2 years ago

Run the import again, observe 80 MB/s, and see ERROR: duplicate key value only after the whole file has been sent (mine is 1.6 GB).

Maybe ssh is transferring the whole file and then it starts getting imported? You can easily verify this by using a local file on the server.

crabhi commented 2 years ago

I tried with the file directly on the Yugabyte server (yb-m0), and indeed I can observe this behaviour there too. I also took pv out of the pipeline in case it was buffering something.

admin@yb-m0:~$ time < export_xyz.txt ysqlsh yugabyte -c "\copy xyz FROM STDIN WITH (ROWS_PER_TRANSACTION 1000)"
ERROR:  duplicate key value violates unique constraint "xyz_pkey"

real    0m2.883s
user    0m1.102s
sys     0m1.232s

admin@yb-m0:~$ cat export_xyz.txt export_xyz.txt > export-double.txt
admin@yb-m0:~$ time < export-double.txt ysqlsh yugabyte -c "\copy xyz FROM STDIN WITH (ROWS_PER_TRANSACTION 1000)"
ERROR:  duplicate key value violates unique constraint "xyz_pkey"

real    0m6.621s
user    0m2.305s
sys     0m2.670s

A file of double the size takes about double the time to report the error. The files are sorted by id, and the smallest id already present in the database is 1.

crabhi commented 2 years ago

I see what's causing it: it's ysqlsh (specifically its \copy command)! PostgreSQL's server-side COPY reports the error immediately.

admin@yb-m0:~$ time ysqlsh yugabyte -c "COPY xyz FROM '/home/admin/export_xyz.txt' WITH (ROWS_PER_TRANSACTION 1000)"
ERROR:  duplicate key value violates unique constraint "xyz_pkey"

real    0m0.183s
user    0m0.000s
sys     0m0.007s

Testing with psql, it behaves the same way, so it's behaviour inherited from upstream PostgreSQL: the client-side \copy apparently keeps streaming the rest of the input and only surfaces the server error once it has sent everything. I think this issue can be closed now, thank you.