tylarb opened this issue 4 years ago
Here is a healthy run, with counts as expected:
select count(*) as warehouses from warehouse;
warehouses
------------
100
(1 row)
select count ( distinct d_w_id ) as warehouses, count(*) as districts from district;
warehouses | districts
------------+-----------
100 | 1000
(1 row)
select count ( distinct c_w_id ) as warehouses, count(*) as customers from customer;
warehouses | customers
------------+-----------
100 | 3000000
(1 row)
select count ( distinct s_w_id ) as warehouses, count(*) as stocks from stock;
warehouses | stocks
------------+---------
100 | 4920832
(1 row)
select count ( distinct h_w_id ) as warehouses, count(*) as history from history;
warehouses | history
------------+---------
100 | 3000000
(1 row)
select count ( distinct o_w_id ) as warehouses, count(*) as orders from oorder;
warehouses | orders
------------+---------
100 | 3000000
(1 row)
select count ( distinct no_w_id ) as warehouses, count(*) as new_orders from new_order;
warehouses | new_orders
------------+------------
100 | 900000
(1 row)
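The "expected" counts above follow from the TPC-C initial-population cardinalities scaled by the warehouse count. A minimal sketch of that arithmetic (the per-warehouse numbers come from the TPC-C spec, not from this issue; stock is omitted because the runs here report differing stock counts):

```python
# Per-warehouse initial-population cardinalities from the TPC-C spec
# (assumption: the loader targets these; stock is omitted because the
# runs in this issue report varying stock counts).
PER_WAREHOUSE = {
    "district": 10,
    "customer": 30_000,
    "history": 30_000,
    "oorder": 30_000,
    "new_order": 9_000,
}

def expected_counts(warehouses: int) -> dict:
    """Scale the per-warehouse cardinalities to the full cluster."""
    return {table: n * warehouses for table, n in PER_WAREHOUSE.items()}

# For the 100-warehouse run above:
for table, count in expected_counts(100).items():
    print(table, count)
# district 1000
# customer 3000000
# history 3000000
# oorder 3000000
# new_order 900000
```

These match the healthy run's counts exactly.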
Here's an unhealthy run, loaded over a connection from a node in a different region:
{$some different smaller node}$ time ./tpccbenchmark --create=true --load=true --nodes=172.151.59.4,172.151.49.178,172.151.55.80 --warehouses=100 --loaderthreads 48
yugabyte=# select count(*) as warehouses from warehouse;
warehouses
------------
100
(1 row)
yugabyte=# select count ( distinct d_w_id ) as warehouses, count(*) as districts from district;
warehouses | districts
------------+-----------
100 | 1000
(1 row)
yugabyte=# select count ( distinct c_w_id ) as warehouses, count(*) as customers from customer;
warehouses | customers
------------+-----------
100 | 3000000
(1 row)
yugabyte=# select count ( distinct s_w_id ) as warehouses, count(*) as stocks from stock;
warehouses | stocks
------------+---------
100 | 7281152
(1 row)
yugabyte=# select count ( distinct h_w_id ) as warehouses, count(*) as history from history;
warehouses | history
------------+---------
100 | 3000000
(1 row)
yugabyte=# select count ( distinct o_w_id ) as warehouses, count(*) as orders from oorder;
warehouses | orders
------------+---------
100 | 2978220
(1 row)
yugabyte=# select count ( distinct no_w_id ) as warehouses, count(*) as new_orders from new_order;
warehouses | new_orders
------------+------------
100 | 892920
(1 row)
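For this 48-thread remote run, orders and new_orders come in slightly below the totals from the healthy run. A quick check of the shortfall (expected totals taken from the healthy run above):

```python
# Expected totals from the healthy local run; observed from the remote 48-thread run.
expected = {"oorder": 3_000_000, "new_order": 900_000}
observed = {"oorder": 2_978_220, "new_order": 892_920}

for table, want in expected.items():
    shortfall = 1 - observed[table] / want
    print(f"{table}: {shortfall:.2%} under expected")
# oorder: 0.73% under expected
# new_order: 0.79% under expected
```

Both tables are under by less than 1% here, which is within the 1-2% variability described later in the report.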
Here's the same data, this time from a TPCC run with 8 loader threads (other values the same). Performance was abysmal: over 2 hours to completion, compared to about 17 minutes running locally, and plenty of errors, so I expect the two to be related.
yugabyte=# select count(*) as warehouses from warehouse;
warehouses
------------
100
(1 row)
yugabyte=# select count ( distinct c_w_id ) as warehouses, count(*) as customers from customer;
warehouses | customers
------------+-----------
100 | 3000000
(1 row)
yugabyte=# select count ( distinct s_w_id ) as warehouses, count(*) as stocks from stock;
warehouses | stocks
------------+----------
100 | 10000000
(1 row)
yugabyte=# select count ( distinct h_w_id ) as warehouses, count(*) as history from history;
warehouses | history
------------+---------
100 | 3000000
(1 row)
yugabyte=# select count ( distinct o_w_id ) as warehouses, count(*) as orders from oorder;
warehouses | orders
------------+---------
100 | 2860457
(1 row)
yugabyte=# select count ( distinct no_w_id ) as warehouses, count(*) as new_orders from new_order;
warehouses | new_orders
------------+------------
100 | 856106
(1 row)
This time, orders and new_orders are well under the counts from the local run, roughly 5% low.
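The roughly-5% figure can be checked directly against the expected totals:

```python
# Observed counts from the 8-thread remote run vs. the expected totals
# (expected totals taken from the healthy local run above).
expected = {"oorder": 3_000_000, "new_order": 900_000}
observed = {"oorder": 2_860_457, "new_order": 856_106}

for table, want in expected.items():
    shortfall = 1 - observed[table] / want
    print(f"{table}: {shortfall:.1%} under expected")
# oorder: 4.7% under expected
# new_order: 4.9% under expected
```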
When I run a TPCC load from a node that is undersized and in a different zone than the cluster, less data is loaded than expected. Given 100 warehouses, I expect orders = 30k times the number of warehouses, new_orders = 9k times the number of warehouses, and order_lines around 300k times the number of warehouses.
These numbers, given random inserts, will have some variability, in the realm of 1-2%, decreasing as load/threads go up.
Data representing this will be uploaded shortly.
Here is the cluster setup and TPCC load module:
Cluster:
Node which tpcc is running from:
The issue happens even with small numbers of threads. Tested and failing with 4, 8, and 48 threads.
100 warehouses
Command for example:
The TPCC benchmark reports success, but with the above errors.