yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

Using PL/pgSQL proc to insert five rows is slower than doing the same inserts by executing prepared statements as top level calls #2557

Open bllewell opened 5 years ago

bllewell commented 5 years ago

Jira Link: DB-2535

Overall Summary as of 13-Jan-2020

9-Oct-2019: issue created

I wrote a Python program to implement an experiment intended to illustrate the expected performance benefit of encapsulating a multi-statement transaction's several round trips in a single stored procedure call. The transaction inserts several rows into a single table "t(k int, v varchar)". The table design reflected typical practice by using an auto-generated (surrogate) primary key, "k", and some typical constraints on both that column and the payload column "v". The results in YugabyteDB came out contrary to expectation. Briefly, the stored proc approach (hereinafter "proc") was slower than the top-level SQL approach (hereinafter "SQL") until the transaction reached 20 statements. For example, with 2 stmts per txn, "SQL" was almost twice as fast as "proc". (Running the same test using PostgreSQL showed that "proc" was never slower than "SQL".)

19-Oct-2019: Retest following initial analysis

The initial analysis showed that YugabyteDB's lack of so-called "negative caching" might produce a performance drag sufficient to explain what my tests had shown. The workaround is trivial: simply use an explicit typecast for every actual value bound to a SQL statement at "execute" time, casting it to the datatype of the corresponding table column. I implemented this as the only change to my Python program. Now, within the limits of my measurement precision, the stored proc approach was never slower than the top-level SQL approach. However, the benefit, even with 20 stmts per txn, was a tiny 1.2x.
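To make the workaround concrete: with psycopg2 (the driver my program uses), it amounts to adding an explicit cast to each bound placeholder—for example, for the procedure call. This is only a sketch (the exact one-line change appears verbatim further down this thread); "cur" and "vs" stand for the cursor and the bound array:

# Before: the argument type of the bound array is left for the server to
# resolve, which (as the analysis below shows) can trigger repeated
# catalog lookups when no exact match is found.
cur.execute("call insert_rows(%s)", (vs,))

# After (the workaround): cast the bound value to the declared parameter
# type so the exact-match lookup succeeds immediately.
cur.execute("call insert_rows(%s::varchar[])", (vs,))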

3-Dec-2019: Retest with simplified testcase design

Informal discussion (kmuthukk, bllewell) brought out the fact that using a sequence ("generated always as identity") to populate the primary key column was a bad choice in a distributed database—and that the intended-to-be-typical ordinary constraints that I'd used served only to introduce noise into a performance experiment. Therefore a new test was proposed with all constraints except the unavoidable primary key constraint removed, and with a new, distributed-SQL-friendly method to populate the PK—use "k uuid" and populate it either:

— 1. client-side using Python's "uuid.uuid4()" method

— 2. server-side by defining "k uuid" with the "default gen_random_uuid()" clause.

Of course, this implies two different insert statements:

-- Testcase 1
insert into t(k, v) values($1::uuid, $2::varchar)

and

-- Testcase 2
insert into t(v) values($1::varchar)

Notice the proper use of the explicit typecasts.

Notice, too, that the difference between "Testcase 1" and "Testcase 2" conflates two orthogonal degrees of freedom: (a) client-side vs server-side generation of the PK value; and (b) the number of columns in the "insert" list.
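For reference, here is a minimal sketch (not the actual test harness; connection details are illustrative) of how the two statements might be bound from a psycopg2 client, with the Testcase 1 key generated client-side by uuid.uuid4():

import uuid

import psycopg2

# Illustrative connection only: 5433 is the default local YSQL port.
conn = psycopg2.connect(host="localhost", port=5433, dbname="yugabyte", user="yugabyte")
cur = conn.cursor()

# Testcase 1: client-side PK generation, explicit cast on every bound value.
cur.execute(
    "insert into t(k, v) values(%s::uuid, %s::varchar)",
    (str(uuid.uuid4()), "some payload"),
)

# Testcase 2: server-side PK generation via the column default
# (this assumes the table variant declared with "default gen_random_uuid()").
cur.execute("insert into t(v) values(%s::varchar)", ("some payload",))

conn.commit()

(psycopg2 uses %s placeholders on the client side where the prepared-statement text above uses $1 and $2.)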

I made the results easier to comprehend by computing the speed ratios ("proc" to "SQL") where, e.g. a ratio of 1.27 means that the "proc" approach is 1.27x faster than the "SQL" approach.
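(The ratio is simply the "SQL" time per batch divided by the "proc" time per batch.) A tiny worked example, using two batch times that appear in the MacBook results for "plpgsql_vs_sql_1.py" further below (100 stmts per txn):

time_sql_per_batch = 3.54    # seconds, "sql", 100 stmts per txn
time_proc_per_batch = 2.80   # seconds, "proc", 100 stmts per txn

speed_ratio = time_sql_per_batch / time_proc_per_batch
print(round(speed_ratio, 2))  # 1.26 — "proc" is ~1.26x faster than "sql"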

Find the tables below by searching for:

“plpgsql_vs_sql_1.py” (client-side generated uuid)

and:

“plpgsql_vs_sql_2.py” (server-side generated uuid)

For YugabyteDB, the ratios for "Testcase 1" show a new shock: for example, for 2 stmts per txn, the speed of "proc" is 0.7x the speed of "SQL". But, for "Testcase 2", we see a similar pattern to what my 10-Oct-2019 account describes. Notice the dramatic contrast for the results for vanilla PostgreSQL: there, "proc" is always noticeably faster than "SQL".

13-Jan-2020: Proposed next step

I propose a new testcase that uses a fixed method for PK generation (we must use the client-side method) and that then runs the test to measure the proc-to-SQL speed ratios with an ever-increasing number of "payload" columns (say from zero through nine), like this:

insert into t(k) values($1::uuid)

then

insert into t(k, v1) values($1::uuid, $2::varchar)

then

insert into t(k, v1, v2) values($1::uuid, $2::varchar, $3::varchar)

through

insert into t(k, v1, v2, v3, v4, v5, v6, v7, v8, v9)
values(
  $1::uuid,
  $2::varchar,
  $3::varchar,
  $4::varchar,
  $5::varchar,
  $6::varchar,
  $7::varchar,
  $8::varchar,
  $9::varchar,
  $10::varchar)

Background

Ever since stored procedures came to relational databases (late 1980s and early 1990s) one of the most touted benefits has been the performance improvement brought to multi-statement change-making transactions. The obvious contributing effect is round trip elimination. (Here, "round trip" covers both the basic client-server transmission time and the code paths in both the client and the server for encoding/decoding the request to/from the wire protocol format and for network socket management.) A less obvious effect, when the server is handling many concurrent processes, is that each network return from server to client means voluntarily descheduling your process from the server CPU—which, in turn, means that you lose instruction cache warmth and so on. See, for example, THIS VIDEO.

Simple test

I did this test using YB Version 2.0.0, downloaded from the Internet, running on macOS Mojave Version 10.14.6, with a single-node RF=1 cluster created with a bare "yb-ctl create".

This test uses ysqlsh and a "toy" table. See the attached issue_2557.zip. Unzip it and run 0.sql at the ysqlsh prompt.

The table is created thus:

create table t(
  k int
  generated always as identity
  constraint t_k_chk check(k > 0)
  constraint t_pk primary key,

  v varchar(100)
  constraint t_v_nn not null);

Wall-clock times are measured with the \timing on metacommand to compare the speed of the top-level SQL approach with that of the PL/pgSQL approach.

Top-level SQL approach

First, the statement is prepared (without timing this):

prepare stmt(varchar) as insert into t(v) values($1)

Then, this transaction is executed:

start transaction;
  execute stmt('one');
  execute stmt('two');
  execute stmt('three');
  execute stmt('four');
  execute stmt('five');
commit;

(Notice that PostgreSQL (unlike, say, Oracle Database) has no bulk-binding features. This is true both for top-level SQL from client drivers and for the SQL issued from PL/pgSQL procedures.)

And then another two five-insert txns, with different binds, are executed and timed.

PL/pgSQL approach

The procedure is created thus:

create procedure insert_rows(vs in varchar[])
  language plpgsql
as $$
declare
  this_v varchar(30) not null := '?';
begin
  rollback;
  foreach this_v in array vs loop
    insert into t(v) values(this_v);
  end loop;
  commit;
end
$$;

And it is invoked thus:

call insert_rows(
  array[
    'one',
    'two',
    'three',
    'four',
    'five'
  ]
);

And then it's invoked another two times with different actual arguments.

(Notice that PostgreSQL through at least Version 11.2 doesn't support the "prepare-execute" approach for the "call" statement. However, creating the header:

procedure insert_rows(a in varchar[])
  language plpgsql
as

is functionally equivalent to "preparing" because what follows between the $$ quotes is the identical literal text that you'd use following the keyword DO to execute it as an anonymous block. And the procedure's formal parameters, and how they are referenced in the procedure's defining block statement, are exactly equivalent to the bind-to-placeholders mechanism that "prepare-execute" supports.)

Typical results

The times are in milliseconds. For the top-level SQL test, I'm listing the sums (calculated by hand) over the individual times for all seven statements.

top-level SQL...  9.28  |  5.48  |  5.51
PL/pgSQL........ 19.50  |  6.61  |  6.93

The fact that the time for the first attempt is greater than for the subsequent attempts is easily explained for the top-level SQL test because we can see the times for each individual statement. The very first "execute" takes about 5x longer than all the subsequent ones, presumably because the real SQL compilation and plan calculation happen just for the first execution.

We don't have access to such granular timings for the PL/pgSQL test. But the 11.2 PostgreSQL documentation at Section 43.11. PL/pgSQL Under the Hood — look for "43.11.2. Plan Caching" in the page — gives a reasonable explanation. The procedure is compiled on first use, and the compiled form is then re-used for the rest of the session's duration. And static SQL statements are effectively prepared and planned on first use and, again, are then re-used for the rest of the session's duration.

Something is clearly wrong with the implementation of PL/pgSQL's various mechanisms in YugabyteDB's adoption of PostgreSQL's query layer code. After warm-up, the PL/pgSQL approach is about 1.2x slower than the top-level SQL approach for a five-insert txn. For reference, here are the typical corresponding figures when the test is run using psql and a vanilla PostgreSQL 11.2 database:

top-level SQL...  1.46  |  0.78  |  0.64
PL/pgSQL........  1.10  |  0.29  |  0.20

Here, after warm-up, the PL/pgSQL approach is about 3x faster than the top-level SQL approach.

Exhaustive test implemented using Python with the psycopg2 module for PostgreSQL

The attached issue_2557.zip also includes single_table.py and a collection of the results it produced in all_results.txt. The program is a mechanical harness for the ysqlsh tests that I described above. It uses the same table t and the same procedure insert_rows(vs in varchar[]).

It allows you to specify the number of "insert" statements per transaction. It defines a batch as a specifiable number of inserts and therefore calculates the number of transactions per batch that will achieve that number. Each batch is timed. It allows you to specify the number of batches to do. And it calculates the average time per batch and the standard deviation. Increasing the number of batches gives more reliable average time and standard deviation values. When a run is complete, it inserts these values, together with the values that parameterized the program run, into a results table.
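The batching arithmetic can be sketched roughly as follows (a simplified illustration under assumed names, not the attached single_table.py):

import statistics
import time

def run_batches(conn, nr_rows_per_batch, nr_stmts_pr_txn, nr_batches):
    """Time nr_batches batches; each batch inserts nr_rows_per_batch rows,
    grouped into transactions of nr_stmts_pr_txn statements each."""
    cur = conn.cursor()
    # nr_stmts_pr_txn is always a factor of nr_rows_per_batch (see the
    # allowed argument values below), so this division is exact.
    nr_txns_per_batch = nr_rows_per_batch // nr_stmts_pr_txn

    batch_times = []
    for _ in range(nr_batches):
        start = time.time()
        for _ in range(nr_txns_per_batch):
            for _ in range(nr_stmts_pr_txn):
                cur.execute("insert into t(v) values(%s)", ("some value",))
            conn.commit()  # ends the transaction psycopg2 opened implicitly
        batch_times.append(time.time() - start)

    return statistics.mean(batch_times), statistics.stdev(batch_times)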

The program also adds a variant on the stored procedure approach by implementing the identical code as a DO block. The DO block is slower than the stored procedure, presumably because it has to be compiled anew for every invocation. (You can't use the "prepare-execute" approach with a DO block.) For that reason, I didn't include any results from the DO block approach in all_results.txt.

The program has seven command-line arguments with the following meanings. (A sketch of how such an interface might be declared follows the list.)

  --mode {normal,create_results_table}

Choosing create_results_table drops and re-creates the results table and does nothing else. With this choice, only the --db argument matters.

  --db {yb,pg,cloud}

Choosing yb uses YugabyteDB on port 5433 on localhost. Choosing pg uses PostgreSQL on port 5432 on localhost. And choosing cloud uses YugabyteDB on port 5433 on <some IP address>. The IP address is set as a text literal in the Python source. Simply edit it to point to the YugabyteDB cluster you want to use. In my tests, I used a three-node cluster with RF=3 installed in a single availability zone (AWS US West 2a — Oregon). And I ran my client Python program in a developer VM in the same availability zone.

  --nr_stmts_pr_txn {1,2,5,10,20,50,100}

This is the number of "insert" statements per transaction. The available choices are all factors of the choices for the number of rows to be inserted per batch.

  --method {sql,do_block,proc}

This determines which approach to use: top-level SQL, an anonymous DO block, or a stored procedure.

  --v_unq {n,y}

This determines whether to create a secondary index as a unique index on the column t.v.

  --nr_rows_per_batch {500,1000,10000}

This is the number of rows to be inserted per batch.

  --nr_batches

This is the number of batches to do in one program run.
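For reference, a command-line interface with these options could be declared with argparse roughly as follows (a sketch based only on the descriptions above, not the attached source; the defaults are assumptions except where the text states them):

import argparse

parser = argparse.ArgumentParser(
    description="Compare top-level SQL vs PL/pgSQL proc for multi-insert transactions")
parser.add_argument("--mode", choices=["normal", "create_results_table"], default="normal")
parser.add_argument("--db", choices=["yb", "pg", "cloud"], default="yb")
parser.add_argument("--nr_stmts_pr_txn", type=int, choices=[1, 2, 5, 10, 20, 50, 100], default=5)
parser.add_argument("--method", choices=["sql", "do_block", "proc"], default="sql")
parser.add_argument("--v_unq", choices=["n", "y"], default="n")  # default: no secondary index
parser.add_argument("--nr_rows_per_batch", type=int, choices=[500, 1000, 10000], default=10000)
parser.add_argument("--nr_batches", type=int, default=10)

args = parser.parse_args()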

The attached issue_2557.zip also includes run_single_table_cloud.sh. This runs single_table.py for the cloud database for each of the allowed values for nr_stmts_pr_txn and for each of the top-level SQL and stored procedure approaches. It takes the default values for v_unq (don't create a secondary index), nr_rows_per_batch (10,000), and nr_batches (10). Here are the results (copied from all_results.txt). They are produced by running show_results.sql — also included in issue_2557.zip.

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             1 | sql    |   33.59        |    0.010
             1 | proc   |   62.21        |    0.102
             2 | sql    |   25.67        |    0.021
             2 | proc   |   41.43        |    0.063
             5 | sql    |   21.66        |    0.041
             5 | proc   |   26.81        |    0.016
            10 | sql    |   20.16        |    0.000
            10 | proc   |   22.04        |    0.013
            20 | sql    |   19.28        |    0.028
            20 | proc   |   19.39        |    0.031
            50 | sql    |   18.36        |    0.007
            50 | proc   |   17.61        |    0.035
           100 | sql    |   17.77        |    0.076
           100 | proc   |   16.79        |    0.013

Here are the corresponding figures for the identical test done running both the Python program and YugabyteDB (single node, RF=1) on my MacBook (see above for details):

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             1 | sql    |   14.57        |    0.122
             1 | proc   |   30.34        |    0.013
             2 | sql    |   11.75        |    0.011
             2 | proc   |   19.21        |    0.099
             5 | sql    |   10.07        |    0.021
             5 | proc   |   12.53        |    0.062
            10 | sql    |    9.57        |    0.067
            10 | proc   |    9.92        |    0.013
            20 | sql    |    8.56        |    0.036
            20 | proc   |    9.15        |    0.009
            50 | sql    |    8.54        |    0.132
            50 | proc   |    8.20        |    0.054
           100 | sql    |    8.31        |    0.052
           100 | proc   |    7.92        |    0.029

The pattern is the same for the local test and the AWS Cloud test. But the local times are all about one half of the Cloud times.

As a sanity check, I ran a subset of the tests running the Python program on my MacBook (in Sunnyvale CA) and connecting to YugabyteDB in the AWS cloud (in Oregon). Of course, it's a completely unrealistic setup, but it makes the client-server round trip times dominate the measurements—so we do see the expected benefit of stored procedure encapsulation here.

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             2 | sql    |  152.22        |    1.504
             2 | proc   |   42.37        |    0.030
           100 | sql    |   78.06        |    0.535
           100 | proc   |    2.44        |    0.035

And finally, for completeness, here are the results from running the Python program on my MacBook against a PostgreSQL database also on my MacBook:

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             1 | sql    |    2.98        |    0.011
             1 | proc   |    2.04        |    0.008
             2 | sql    |    1.95        |    0.001
             2 | proc   |    1.14        |    0.010
             5 | sql    |    1.30        |    0.008
             5 | proc   |    0.54        |    0.002
            10 | sql    |    1.03        |    0.002
            10 | proc   |    0.34        |    0.002
            20 | sql    |    0.90        |    0.064
            20 | proc   |    0.27        |    0.002
            50 | sql    |    0.97        |    0.043
            50 | proc   |    0.19        |    0.002
           100 | sql    |    0.78        |    0.004
           100 | proc   |    0.17        |    0.001

Here we see that the stored procedure approach is always faster than the top-level SQL approach—exactly as expected.

Summary

We see that, using YugabyteDB with sensible placements of client and database, the stored procedure approach is noticeably slower than the top-level SQL approach for small numbers of statements per transaction. Only when we reach 20 statements per transaction do we start to break even.

bllewell commented 5 years ago

issue_2557.zip

kmuthukk commented 5 years ago

Really nice test case @bllewell !

When running "the individual SQL statements" from the client (and not using plpgsql):

% python ~/notes/plpgsql_perf.py --db=yb --nr_stmts_pr_txn=5 --nr_batches=10  --method=sql

Noticing this in top. See how yb-master is only 0.7% and not in the critical path (as expected).

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9087 centos    20   0 1208500 125532  31464 S  97.3  1.7   3:37.59 yb-tserver
14039 centos    20   0 1107276  36880  14772 S  48.2  0.5   0:11.21 postgres
...
 9084 centos    20   0  537180  68164  24624 S   0.7  0.9   0:20.05 yb-master

But when running the plpgsql version that bundles the statements in one call to the server, the performance is slower.

% python ~/notes/plpgsql_perf.py --db=yb --nr_stmts_pr_txn=5 --nr_batches=10  --method=proc

Noticing this in top. See how yb-master is now taking about 12% CPU. So it appears that something isn't getting cached as one would expect.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9087 centos    20   0 1230768 131848  35128 S  75.0  1.8   6:10.03 yb-tserver
14422 centos    20   0 1109524  37992  15732 S  50.0  0.5   0:09.92 postgres
 9084 centos    20   0  537180  68772  24624 S  12.5  1.0   0:23.25 yb-master

Profiling the yb-master thread (using something like sudo perf top -p 9084 -ag) shows this:

+   92.51%     0.18%  libyrpc.so                      [.] yb::rpc::(anonymous namespace)::Worker::Execute
+   90.42%     0.02%  libtserver.so                   [.] yb::tserver::TabletServiceImpl::Read
+   84.61%     0.22%  libyb_docdb.so                  [.] yb::docdb::QLRocksDBStorage::GetIterator
+   82.80%     0.05%  libyb_docdb.so                  [.] yb::docdb::PgsqlReadOperation::Execute
+   81.82%     0.12%  libyb_docdb.so                  [.] yb::docdb::DocRowwiseIterator::DoInit<yb::docdb::DocPgsqlScanSpec>
+   80.72%     0.01%  libyrpc.so                      [.] yb::rpc::ServicePoolImpl::Handle
+   77.41%     0.10%  librocksdb.so                   [.] rocksdb::MergingIterator::Seek
+   75.17%     0.04%  librocksdb.so                   [.] rocksdb::BlockBasedTable::NewDataBlockIterator
+   74.78%     0.33%  librocksdb.so                   [.] rocksdb::ReadBlockContents
+   71.79%     0.12%  librocksdb.so                   [.] rocksdb::DBIter::Seek
+   62.58%     0.02%  libyb_docdb.so                  [.] yb::docdb::DocRowwiseIterator::Init
+   59.35%     0.05%  libyb_docdb.so                  [.] yb::docdb::PerformRocksDBSeek
+   54.82%     0.02%  libtserver.so                   [.] yb::tserver::TabletServiceImpl::CompleteRead
+   53.49%     0.04%  libyb_docdb.so                  [.] yb::docdb::BoundedRocksDbIterator::Seek
+   53.21%    52.68%  libsnappy.so.1.3.0              [.] snappy::RawUncompress
+   53.12%     0.22%  librocksdb.so                   [.] rocksdb::UncompressBlockContents
+   52.23%     0.05%  librocksdb.so                   [.] rocksdb::(anonymous namespace)::TwoLevelIterator::Seek
+   52.07%     0.16%  libsnappy.so.1.3.0              [.] snappy::RawUncompress
+   34.21%     0.05%  librocksdb.so                   [.] rocksdb::MultiLevelIterator::Seek
+   25.14%     0.03%  libtserver.so                   [.] yb::tserver::TabletServiceImpl::DoRead
+   15.60%     0.00%  libyb_util.so                   [.] yb::Thread::SuperviseThread
+   15.60%     0.00%  libpthread-2.23.so              [.] start_thread
+   14.40%     0.00%  libyrpc.so                      [.] yb::rpc::InboundCall::InboundCallTask::Run
+   14.26%     0.00%  libtserver_service_proto.so     [.] yb::tserver::TabletServerServiceIf::Handle
+   13.24%     0.00%  libtablet.so                    [.] yb::tablet::AbstractTablet::HandlePgsqlReadRequest
+   13.24%     0.00%  libtablet.so                    [.] yb::tablet::Tablet::HandlePgsqlReadRequest
+   11.95%    11.95%  librocksdb.so                   [.] rocksdb::crc32c::ExtendImpl<&rocksdb::crc32c::Fast_CRC32>
+   11.46%     0.04%  libyb_docdb.so                  [.] yb::docdb::IntentAwareIterator::Seek
+   10.69%     0.00%  librocksdb.so                   [.] rocksdb::block_based_table::ReadBlockFromFile
+    6.87%     0.35%  librocksdb.so                   [.] rocksdb::RandomAccessFileReader::Read
kmuthukk commented 5 years ago

In particular, the RPC that the yb-master service seems to be getting is:

ip-10-9-123-137.us-west-2.compute.internal:~
22:56 $ curl -s http://127.0.0.1:7000/rpcz | more
{
    "inbound_connections": [
       ....
       ....
        {
            "remote_ip": "127.0.0.1:45651",
            "state": "OPEN",
            "processed_call_count": 35648,
            "calls_in_flight": [
                {
                    "header": {
                        "call_id": 180001,
                        "remote_method": {
                            "service_name": "yb.master.MasterService",
                            "method_name": "GetTableLocations"
                        },
                        "timeout_millis": 2500
                    },
                    "micros_elapsed": 1111
                }
            ]
        },
kmuthukk commented 5 years ago

Profiling the postgres worker process when running the test in proc (plpgsql) mode shows these stacks:

-   53.40%     0.00%  postgres  postgres                     [.] parse_analyze
     parse_analyze
     transformTopLevelStmt
   - transformStmt
      - 52.98% ParseFuncOrColumn
         - 52.77% func_get_detail
            - 52.70% LookupTypeName
                 TypenameGetTypid
                 GetSysCacheOid
               + SearchCatCache

and

  - SearchCatCache
      - 52.42% SearchCatCacheMiss
         - 47.30% ybc_systable_getnext
            - 46.95% ybc_getnext_heaptuple
               - 41.12% YBCPgExecSelect
                  - yb::pggate::PgApiImpl::ExecSelect
                     - yb::pggate::PgSelect::Exec
                        - 39.65% yb::pggate::PgDocOp::Execute
                           - 38.95% yb::pggate::PgDocReadOp::SendRequestUnlocked
                              - 31.30% yb::pggate::PgSession::PgApplyAsync
                                 - 31.02% yb::client::YBSession::Apply
                                    - 29.96% yb::client::internal::Batcher::Add
                                       - 29.33% yb::client::internal::MetaCache::LookupTabletByKey
                                          - 27.86% yb::rpc::StartRpc<yb::client::internal::LookupByKeyRpc, yb::client::internal::MetaCache*, yb::client::YBTable const*&, std::string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long, std::ratio
                                             - 27.16% yb::client::internal::LookupRpc::SendRpc
                                                - 25.96% yb::client::internal::LookupByKeyRpc::DoSendRpc
                                                   - 25.33% yb::master::MasterServiceProxy::GetTableLocationsAsync
                                                      - yb::rpc::Proxy::AsyncRequest
                                                         - 24.70% yb::rpc::Proxy::DoAsyncRequest
                                                            - 20.42% yb::rpc::Proxy::QueueCall
                                                               - 19.86% yb::rpc::Messenger::QueueOutboundCall
                                                                  - 19.65% yb::rpc::Reactor::QueueOutboundCall
                                                                     + 19.37% yb::rpc::Reactor::ScheduleReactorTask
                                                              2.67% yb::rpc::OutboundCall::SetRequestParam
                                                              0.91% yb::rpc::OutboundCall::OutboundCall
m-iancu commented 5 years ago

At a high level, one of the issues is that (function call) parsing relies on negative caching in the case where the function argument types don't match exactly.

Specifically, the (function call) parser works as follows:

  1. see if a function with exact name and arg types exists
  2. else see if this is a type coercion (e.g. integer ('123'))
  3. else, search for the best function candidate including (implicit) type casts.

Assuming case 1 doesn't match, case 2 will search for a type with that name, which (if such a type does not exist) will end up querying the master each time. Only then will it get to step 3 and search for the appropriate (implicit) type casts.

As a current workaround, one can explicitly cast to the right type in the function call. For instance:

-- use implicit argument type (will end up requiring type casting and therefore do a master request).
yugabyte=# call insert_rows( array['one'] );
Time: 6.099 ms
-- explicitly cast to the type declared in the function definition (should not need master request).
yugabyte=# call insert_rows( array['one']::varchar[] );
CALL
Time: 3.070 ms

cc @kmuthukk @bllewell

kmuthukk commented 5 years ago

Thx a lot @m-iancu for the quick insight! Will give it a try.

kmuthukk commented 5 years ago

@m-iancu -- that was it:

Changed:

        sess.execute("call insert_rows(%s)", (vs,))

to:

        sess.execute("call insert_rows(%s::varchar[])", (vs,))

The times now dropped from about 20 to about 15 seconds per batch for this program.

$ python ~/notes/plpgsql_perf.py --db=yb --nr_stmts_pr_txn=5 --nr_batches=10  --method=proc
Database         | Stmts pr txn | Method    | v_Unq? | Rows per batch | Batches
Local YugabyteDB |            5 | Procedure | No     | 10000          | 10

   batch#      time (seconds)
        1               15.21
        2               15.15
        3               15.52
...

and, yb-master is no longer in the critical path (as should be the case).

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9087 centos    20   0 1286304 234080  51704 S 105.3  3.2  18:25.64 yb-tserver
27531 centos    20   0 1109524  40052  15732 R  47.8  0.6   0:05.45 postgres
27530 centos    20   0  123288  11816   4400 S   1.7  0.2   0:00.18 python
 9084 centos    20   0  539228  70520  24792 S   0.3  1.0   2:05.19 yb-master
bllewell commented 5 years ago

I implemented the explicit type cast—exactly as proposed—in my own copy of the code kit that I attached when I opened this issue. This is the new sess.execute() invocation:

sess.execute("call insert_rows(%s::varchar[])", (vs,))

In other words, I made exactly and only this one tiny spelling change. Then I re-ran these two tests:

On my MacBook, to a local RF=1, single-node YB cluster (now version 2.0.1)

See above for the pre-workaround results for this test env (search for "Here are the corresponding figures for the identical test done running both the Python program and YugabyteDB (single node, RF=1) on my MacBook")

run_single_table_yb.sh

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             1 | sql    |   16.43        |    0.174
             1 | proc   |   16.49        |    0.459
             2 | sql    |   12.45        |    0.026
             2 | proc   |   11.45        |    0.078
             5 | sql    |   10.33        |    0.053
             5 | proc   |    9.30        |    0.006
            10 | sql    |   10.19        |    0.013
            10 | proc   |    8.57        |    0.028
            20 | sql    |    9.96        |    0.072
            20 | proc   |    8.32        |    0.073
            50 | sql    |    9.94        |    0.220
            50 | proc   |    8.04        |    0.033
           100 | sql    |   10.00        |    0.984
           100 | proc   |    7.79        |    0.002

The times for the "1 stmt per txn" result pair are identical within the stdev. Notice that, for this case, "method=sql" incurs 3 client/postgres-server round trips ("start txn", "insert", and "commit") and "method=proc" incurs just one, as it always does. In all other cases, "method=proc" is faster than "method=sql". However, the outcome is probably better stated thus:

Encapsulating the database behind a "hard shell" stored proc API, to bring correctness (via age-old modular program design reasoning) and security (because the client cannot access the tables directly), incurs no performance penalty.

It seems strange that the "method=sql" times are systematically slower for this run than those for the run reported above because the code is unchanged. (The differences are greater than the stdev values.) Is this significant?

Using a 3-node RF=3 YB cluster on AWS (now version 2.0.2)

Specifically: AWS "us-west-2 / us-west-2a", "idem-2b", "idem-2c" — (i.e. Oregon). (Log on to Yugaware there and look for the cluster called "bllewell".) My Python program ran on "Rao's development machine" in the same data center.

See above for the pre-workaround results for this test env (search for "The attached issue_2557.zip also includes run_single_table_cloud.sh.")

run_single_table_cloud.sh

 Stmts per txn | Method | Time per batch |  StDev   
---------------+--------+----------------+----------
             1 | sql    |   59.76        |    0.114
             1 | proc   |   69.31        |    0.017
             2 | sql    |   57.41        |    0.322
             2 | proc   |   56.03        |    0.121
             5 | sql    |   43.51        |    0.106
             5 | proc   |   42.04        |    0.023
            10 | sql    |   34.17        |    0.074
            10 | proc   |   45.04        |    0.049
            20 | sql    |   32.63        |    0.015
            20 | proc   |   43.22        |    0.068
            50 | sql    |   36.86        |    0.046
            50 | proc   |   44.47        |    0.098
           100 | sql    |   42.73        |    0.050
           100 | proc   |   43.87        |    0.010

How can the "method=sql" times be more than ~2x longer than their counterparts in the earlier corresponding run? This seems to cast doubt on the value of the measurements. Notice that, in this test, the "method=proc" times are sometimes noticeably longer than their "method=sql" partners. Look especially at the result pairs where "stmts per txn" is 20 and 50.

bllewell commented 5 years ago

Adding a new testcase

Summary: when the client generates the PK, the speed ratio, proc-to-sql, shows that encapsulating several SQL statements in a single procedure call actually slows down the transaction. The ratio is 0.6 for a one-statement txn. It increases slowly with the number of statements per txn and reaches about 1.0 for a ten-statement txn. This is contrary to all expectation.

Detail

The original testcase used “generated always as identity” to populate the “k int” primary key column—which uses a sequence under the covers. But sequences perform poorly in a distributed SQL database. So the new testcase uses “k uuid” as the primary key. Two different methods to populate it are compared: (1) getting the value client-side using Python’s uuid.uuid4() method; and (2) getting the value server-side using “default gen_random_uuid()” for the primary key column in the “create table” statement. (This requires that the PostgreSQL extension “pgcrypto” is installed.) A sketch of the two table designs follows the parameter list below. To make the code more transparent, and especially to allow the use of a GUI diff tool, I used two very similar Python files: “plpgsql_vs_sql_1.py” for the client-side PK approach and “plpgsql_vs_sql_2.py” for the server-side PK approach. Both were derived from “single_table.py” (described above). The other parameterizations (in addition to the choice of database and “--mode {normal, create_results_table}”) were retained:

  --nr_stmts_pr_txn {1, 2, 5, 10, 20, 50, 100}
  --method {sql, proc, do_block}
  --nr_rows_per_batch {5, 500, 1000, 10000} ... fixed at 10000
  --nr_batches NR_BATCHES ... fixed at 10
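Here, for concreteness, is a minimal sketch of the two table designs just described (assumed DDL, not copied from the attached programs; connection details are illustrative):

import psycopg2

conn = psycopg2.connect(host="localhost", port=5433, dbname="yugabyte", user="yugabyte")
conn.autocommit = True
cur = conn.cursor()

# Testcase 1 ("plpgsql_vs_sql_1.py"): the client generates the PK with
# Python's uuid.uuid4(), so the column needs no default.
cur.execute("""
    create table t(
      k uuid,
      v varchar,
      constraint t_pk primary key(k))
""")

# Testcase 2 ("plpgsql_vs_sql_2.py") instead declares the PK with a
# server-side default; gen_random_uuid() comes from the pgcrypto extension:
#
#   create extension if not exists pgcrypto;
#   create table t(
#     k uuid default gen_random_uuid(),
#     v varchar,
#     constraint t_pk primary key(k));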

This time, the results were recorded in a “results” table with these columns:

run_timestamp         timestamp
database              varchar(20)
pk_method             varchar(20)
nr_stmts_pr_txn       integer
method                varchar(20)
nr_rows_per_batch     integer
nr_batches            integer
avg_seconds_per_batch numeric
stdev                 numeric

Here are the results for YB version 2.0.6 using a single-node RF=1 cluster on my MacBook (macOS Mojave Version 10.14.6) for “plpgsql_vs_sql_1.py” (client-side generated uuid):

  Method  | Stmts per txn | Time per batch |  StDev
----------+---------------+----------------+---------
 sql      |             1 |    8.40        |    0.095
 sql      |             2 |    6.00        |    0.046
 sql      |             5 |    4.52        |    0.008
 sql      |            10 |    4.06        |    0.048
 sql      |            20 |    3.80        |    0.001
 sql      |            50 |    3.60        |    0.001
 sql      |           100 |    3.54        |    0.007

 proc     |             1 |   14.05        |    0.017
 proc     |             2 |    8.78        |    0.082
 proc     |             5 |    5.11        |    0.000
 proc     |            10 |    4.01        |    0.006
 proc     |            20 |    3.34        |    0.011
 proc     |            50 |    2.98        |    0.004
 proc     |           100 |    2.80        |    0.008

(I’m not showing the “do_block” results here because, for well-known reasons, this approach will always be slower than the “proc” approach.) Notice that the time per batch for a one-statement txn for “method=proc” is about 2x slower than for “method=sql”. This is crazy.

Here are the results for “plpgsql_vs_sql_2.py” (server-side generated uuid):

  Method  | Stmts per txn | Time per batch |  StDev
----------+---------------+----------------+----------
 sql      |             1 |    8.00        |    0.127
 sql      |             2 |    5.81        |    0.009
 sql      |             5 |    4.53        |    0.056
 sql      |            10 |    3.91        |    0.002
 sql      |            20 |    3.57        |    0.015
 sql      |            50 |    3.40        |    0.011
 sql      |           100 |    3.43        |    0.005

 proc     |             1 |    7.42        |    0.058
 proc     |             2 |    5.19        |    0.008
 proc     |             5 |    4.18        |    0.082
 proc     |            10 |    3.22        |    0.044
 proc     |            20 |    2.90        |    0.001
 proc     |            50 |    2.67        |    0.005
 proc     |           100 |    2.67        |    0.004

Now the time per batch for a one-statement txn for “method=proc” is a little faster than for “method=sql”.

I wrote a little program to compute the speed ratios, proc-to-sql, for each of the two approaches. (It uses a table function with a SQL body to implement a parameterized view.) It’s attached as “show_results.sql”, together with the results and the two Python programs, in “issue_2557_part_2.zip”. Here’s what it shows:

For “plpgsql_vs_sql_1.py” (client-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    0.60
             2 |    0.68
             5 |    0.89
            10 |    1.01
            20 |    1.14
            50 |    1.21
           100 |    1.26

For “plpgsql_vs_sql_2.py” (server-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    1.08
             2 |    1.12
             5 |    1.09
            10 |    1.21
            20 |    1.23
            50 |    1.27
           100 |    1.28

At least, here, the ratios are always greater than one.

Here, as a sanity check, are the same ratios using vanilla PostgreSQL:

For “plpgsql_vs_sql_1.py” (client-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    1.68
             2 |    1.65
             5 |    2.09
            10 |    2.60
            20 |    2.69
            50 |    3.11
           100 |    3.63

For “plpgsql_vs_sql_2.py” (server-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    1.53
             2 |    1.82
             5 |    2.53
            10 |    3.14
            20 |    4.17
            50 |    4.74
           100 |    5.18

We don’t see the crazy effect here.

Finally, for completeness, here are the same ratios using a YB Version 2.0.6 RF=3, 3-node cluster in AWS:

For “plpgsql_vs_sql_1.py” (client-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    0.67
             2 |    0.77
             5 |    0.92
            10 |    1.00
            20 |    1.08
            50 |    1.15
           100 |    1.15

For “plpgsql_vs_sql_2.py” (server-side generated uuid):

 Stmts per txn | Speed Ratio 
---------------+-------------
             1 |    1.07
             2 |    1.10
             5 |    1.13
            10 |    1.14
            20 |    1.16
            50 |    1.15
           100 |    1.21

These have the same general character as the results recorded on my MacBook.

Final note

During one test run using AWS, I got a wrong-results error. It didn’t reproduce when I re-ran that test. See the file “yb__stmts_pr_txn_005__method_do_block_bad_unqviolation.txt” in the “output/awsv2” directory. It shows this:

ERROR:  duplicate key value violates unique constraint "t_pk"

I suppose that it could reflect the fact that the call to gen_random_uuid() returned a previously-seen value. But the probability of that happening is supposed to be vanishingly small. And my program called gen_random_uuid() only 10,000 times per batch.
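For a rough sense of just how unlikely a collision is, a back-of-envelope birthday-bound calculation (assuming gen_random_uuid() produces version-4 UUIDs with 122 random bits) gives:

# Probability of at least one collision among n uniformly random values
# drawn from a space of size d is approximately n*(n-1)/(2*d).
n = 10_000       # gen_random_uuid() calls per batch in this test
d = 2 ** 122     # random values available to a version-4 UUID
print(n * (n - 1) / (2 * d))   # on the order of 1e-29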

bllewell commented 5 years ago

issue_2557_part_2.zip

x46085 commented 4 years ago

Hello! I just wanted to ask if this issue was actively being worked on for a release this year. We are in the middle of selecting our platform's primary database and because we use a lot of stored procedures, this issue has us reconsidering re-implementing on a different database. Thanks!

kmuthukk commented 4 years ago

hi @x46085 -- the workaround suggested by @m-iancu -- to explicitly type cast the actual arguments (see for example https://github.com/YugaByte/yugabyte-db/issues/2557#issuecomment-544036209) -- should largely address the issue. Does that suffice for now? Or is this something you would like to see us handle transparently (by implementing the negative caching)?

x46085 commented 4 years ago

Thanks for the quick response. We will give it a try and report back!