Master issue to track improvements to make it easier and faster to get large amounts of data into YugabyteDB.
## Phase 1

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| ✅ | Faster non-transactional writes during bulk load | #7809 | Allow faster writes during the COPY command via the session variable `yb_force_non_transactional_writes`. |
| ✅ | Disable transactional writes during bulk data loading for indexes | #11266 | Add the `yb_disable_transactional_writes` session variable to improve bulk-load latency for index tables, e.g. when the COPY command goes through the insert write path (not delete or update). |
| ✅ | Implement async flush for the COPY command | #11628 | Currently, we synchronously wait for a flush response every time we flush. Making this asynchronous reduces the time spent waiting and improves COPY performance. |
| ✅ | Speed up YSQL inserts by skipping lookup of keys being inserted | #11269 | During bulk load (for example, inserts via the COPY command), skip the lookup of the key being inserted to speed up inserts. This is similar to the upsert mode supported for YCQL. |
| ✅ | Optimize memory allocation/deallocation in bulk insert/COPY using protobuf arenas | #11720 | When running a bulk insert / COPY command, about 15 percent of CPU time in the PostgreSQL backend is spent on memory allocation/deallocation. |
| ✅ | Performance improvement by eliminating serialization to the WAL format | #11409 | When writing data to the RocksDB layer, the additional step of serializing to the WAL format is unnecessary and leads to wasted work. |
| ✅ | Tuning parameters for faster COPY performance | #12293 | Tune parameters for faster COPY performance. |
| ✅ | Pack columns in the DocDB storage format for better performance | #3520 | Packing columns into a single RocksDB entry per row, instead of one entry per column as we do currently, improves YSQL performance. |
| ⬜️ | Parallelize the COPY command | #11453 | Distribute the COPY operation internally across multiple workers. |
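The completed items above are mostly surfaced to users as session variables and COPY options. A minimal sketch of how a bulk-load session might combine them (variable names are the ones referenced in the table; exact names, defaults, and availability may differ by YugabyteDB version, and `/path/to/data.csv` is a placeholder):

```sql
-- Skip the distributed-transaction write path for this session (#7809 / #11266).
SET yb_disable_transactional_writes = true;

-- Load in batches so a failure does not roll back the entire file;
-- ROWS_PER_TRANSACTION is a YSQL extension to COPY.
COPY my_table FROM '/path/to/data.csv'
  WITH (FORMAT csv, ROWS_PER_TRANSACTION 10000);

-- Restore normal transactional semantics afterwards.
SET yb_disable_transactional_writes = false;
```

Note that disabling transactional writes trades atomicity for speed, so it is only appropriate for initial loads into tables that are not concurrently serving traffic.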
## Phase 2

| Status | Feature | GitHub Issue | Comments |
| --- | --- | --- | --- |
| ⬜️ | Streaming ingest to YugabyteDB without using JDBC | | Around 1 billion records are inserted through the streaming interface every day; transferring this volume over the JDBC interface would be inefficient. One option is to implement the Spark RDD write interface. |
Jira Link: DB-4641