yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.65k stars 1.04k forks source link

[YSQL] Insert with function scan requires extra flushes with plpgsql function over sql function #19520

Open timothy-e opened 8 months ago

timothy-e commented 8 months ago

Jira Link: DB-8313

Description

Schema:

CREATE TABLE truths (b BOOL);
CREATE OR REPLACE FUNCTION ret_true() RETURNS boolean AS $$
    BEGIN
        RETURN true;
    END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION ret_true_sql(varchar) RETURNS boolean
    AS 'select true;'
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;

Using the plpgsql function, we have 20k writes and 20k flushes.

EXPLAIN (ANALYZE, DIST) INSERT INTO truths SELECT ret_true() FROM generate_series(1, 20000);
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Insert on truths  (cost=0.00..270.00 rows=1000 width=1) (actual time=2418.653..2418.653 rows=0 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..260.00 rows=1000 width=1) (actual time=0.724..2318.286 rows=20000 loops=1)
 Planning Time: 0.038 ms
 Execution Time: 2418.838 ms
 Storage Read Requests: 0
 Storage Write Requests: 20000.000
 Catalog Read Requests: 3
 Catalog Read Execution Time: 2.803 ms
 Catalog Write Requests: 0.000
 Storage Flush Requests: 19999
 Storage Flush Execution Time: 2214.296 ms
 Storage Execution Time: 2217.099 ms
 Peak Memory Usage: 1287 kB
(13 rows)

Using the SQL function, we have 20k write requests and 4 flushes:

EXPLAIN (ANALYZE, DIST) INSERT INTO truths
 SELECT ret_true_sql('a') FROM generate_series(1, 20000);
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Insert on truths  (cost=0.00..10.00 rows=1000 width=1) (actual time=78.687..78.687 rows=0 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=1) (actual time=1.342..2.405 rows=20000 loops=1)
 Planning Time: 0.182 ms
 Execution Time: 78.804 ms
 Storage Read Requests: 0
 Storage Write Requests: 20000.000
 Catalog Read Requests: 5
 Catalog Read Execution Time: 6.311 ms
 Catalog Write Requests: 0.000
 Storage Flush Requests: 4
 Storage Flush Execution Time: 0.009 ms
 Storage Execution Time: 6.320 ms
 Peak Memory Usage: 1281 kB
(13 rows)

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

kmuthukk commented 8 months ago

This is not about PL/PGSQL or SQL.. but whether the function is "pure" or not.

If I mark the PL/PGSQL function IMMUTABLE, similar to the SQL function:

CREATE OR REPLACE FUNCTION ret_true() RETURNS boolean AS $$
       BEGIN
        RETURN true;
        END;
$$ LANGUAGE plpgsql IMMUTABLE;
                    ^^^^^^^^^

then we get the more optimal behavior of a single flush.

kmuthukk commented 8 months ago

Even with IMMUTABLE, this test case shows high number of flushes:


DROP TABLE IF EXISTS load_temp;

create table load_temp (value text);
insert into load_temp select 'abcdef'
 from generate_series(1,1000) a;

CREATE OR REPLACE FUNCTION is_date(input_string VARCHAR, format_string VARCHAR)
RETURNS BOOLEAN AS $$
BEGIN
    perform TO_DATE(input_string, format_string);
    RETURN TRUE;
EXCEPTION WHEN OTHERS THEN
    RETURN FALSE;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

DROP TABLE IF EXISTS test_load;
EXPLAIN (ANALYZE, DIST) CREATE  TABLE test_load AS
    SELECT
        public.is_date(value,'YYYYMMDD') col1
    FROM   load_temp;

shows:

 Storage Flush Requests: 1001

but if we change:

public.is_date(value,'YYYYMMDD') col1

to say:

public.is_date('a','YYYYMMDD') col1

then the number of flushes drop to:

 Storage Flush Requests: 2