yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Copy command fails with serialization error. #18210

Open zlareb1-yb opened 1 year ago

zlareb1-yb commented 1 year ago

Jira Link: DB-7230

Description

The testysqltspcmvandbr test is failing at the COPY FROM step with the error below:

INFO:root:COPY colocated_employees_c FROM '/tmp/colocated_employees.csv'
2023-07-11 07:54:08,722 database_operations.py:1148 INFO testysqltspcmvandbr-aws-rf3 COPY colocated_employees_c FROM '/tmp/colocated_employees.csv'
INFO:root:Closing <connection object at 0x7fe7a6404890; dsn: 'user=yugabyte dbname=colocated_db host=10.9.121.12 port=5433 options='-c statement_timeout=300000'', closed: 0> connection.
2023-07-11 07:54:15,890 database_operations.py:43 INFO testysqltspcmvandbr-aws-rf3 Closing <connection object at 0x7fe7a6404890; dsn: 'user=yugabyte dbname=colocated_db host=10.9.121.12 port=5433 options='-c statement_timeout=300000'', closed: 0> connection.
INFO:root:Connection closed to <connection object at 0x7fe7a6404890; dsn: 'user=yugabyte dbname=colocated_db host=10.9.121.12 port=5433 options='-c statement_timeout=300000'', closed: 1>
2023-07-11 07:54:15,891 database_operations.py:48 INFO testysqltspcmvandbr-aws-rf3 Connection closed to <connection object at 0x7fe7a6404890; dsn: 'user=yugabyte dbname=colocated_db host=10.9.121.12 port=5433 options='-c statement_timeout=300000'', closed: 1>
ERROR:root:ITEST FAILED testysqltspcmvandbr-aws-rf3 : SerializationFailure('Operation failed. Try again: [Operation failed. Try again (yb/tablet/tablet.cc:1312): Transaction metadata missing: 13217c64-5a0d-47db-a796-9fe424b60416, looks like it was just aborted (pgsql error 40001)]\n')
2023-07-11 07:54:15,891 test_base.py:179 ERROR testysqltspcmvandbr-aws-rf3 ITEST FAILED testysqltspcmvandbr-aws-rf3 : SerializationFailure('Operation failed. Try again: [Operation failed. Try again (yb/tablet/tablet.cc:1312): Transaction metadata missing: 13217c64-5a0d-47db-a796-9fe424b60416, looks like it was just aborted (pgsql error 40001)]\n')
INFO:root:Traceback (most recent call last):
  File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 175, in execute_steps
    step.call()
  File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 54, in call
    ret = self.function()
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/tablet_splitting/test_ysql_ts_pcmv_and_br.py", line 1060, in do_custom_work
    self.test_tablet_splitting(
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/tablet_splitting/test_ysql_ts_pcmv_and_br.py", line 535, in test_tablet_splitting
    self.copy_data_from_source(session, table, data_source_path)
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/tablet_splitting/test_ysql_ts_pcmv_and_br.py", line 429, in copy_data_from_source
    ysql_copy_from(
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/system_test_utils/database_operations.py", line 1149, in ysql_copy_from
    ysql_execute_query(session, query)
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/system_test_utils/database_operations.py", line 1754, in ysql_execute_query
    raise exp
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/system_test_utils/database_operations.py", line 1741, in ysql_execute_query
    session.execute(query)
psycopg2.errors.SerializationFailure: Operation failed. Try again: [Operation failed. Try again (yb/tablet/tablet.cc:1312): Transaction metadata missing: 13217c64-5a0d-47db-a796-9fe424b60416, looks like it was just aborted (pgsql error 40001)]
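
The error above carries SQLSTATE 40001 and says "Try again", so one possible test-side mitigation is to retry the statement. The following is only a minimal sketch, not the harness's actual code: `run_with_retry`, its parameters, and the backoff values are hypothetical. It relies on the `pgcode` attribute that psycopg2 exceptions expose, so it does not import psycopg2 itself.

```python
import time

SERIALIZATION_FAILURE = "40001"  # SQLSTATE reported in the error above


def run_with_retry(operation, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call operation(); retry only when the error carries SQLSTATE 40001.

    Hypothetical helper: psycopg2 exceptions expose the SQLSTATE as `pgcode`,
    so any exception with a matching `pgcode` is treated as retryable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            retryable = getattr(exc, "pgcode", None) == SERIALIZATION_FAILURE
            if not retryable or attempt == max_attempts:
                raise
            # Exponential backoff between attempts (delay values are arbitrary).
            sleep(base_delay_s * 2 ** (attempt - 1))
```

A call would look like `run_with_retry(lambda: session.execute(query))`, where the callable is also expected to roll back the failed transaction before the next attempt. Note that COPY may commit rows in batches, so blindly re-running the whole COPY could re-insert rows that were already committed; whether a retry is safe here is an assumption that would need to be verified.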

Test steps:

testysqltspcmvandbr-aws-rf3: Start
    (     0.499s) User Login : Success
    (     0.141s) Refresh YB Version : Success
    (    90.379s) Setup Provider : Success
    (     0.040s) Updating Health Check Interval to 60000 sec : Success
    (   361.172s) Create universe sagr-ee139cd11d-20230711-050257 : Success
    (    39.296s) Create YSQL Colocated DB : Success
    (    11.627s) Create YSQL Non Colocated DB : Success
    (     1.425s) Create YSQL Colocated: True Tables with Sharding: RANGE on Colocated: True DB : Success
    (  1937.398s) Create Secondary Index. : Success
    (    34.392s) Create Unique Index. : Success
    (    20.365s) Create Partial Index. : Success
    (     0.390s) Validate created indexes : Success
    (    43.789s) Create Materialized View: colocated_mv : Success
    (     0.451s) Verify Materialized View is populated : Success
    (    98.925s) Insert 5000 records : Success
    (     2.072s) Verify Table is populated : Success
    (    45.284s) Refresh materialized view : Success
    (     0.243s) Verify materialized View is populated : Success
    (     1.058s) Create YSQL Colocated: False Tables with Sharding: HASH on Colocated: True DB : Success
    (  2194.501s) Create Secondary Index. : Success
    (    12.463s) Create Unique Index. : Success
    (    10.533s) Create Partial Index. : Success
    (     0.705s) Validate Tablets count increased from 3 : Success
    (     0.570s) Validate created indexes : Success
    (    43.720s) Create Materialized View: non_colocated_mv_opt_out : Success
    (     0.411s) Verify Materialized View is populated : Success
    (   105.450s) Insert 5000 records : Success
    (     0.970s) Verify Table is populated : Success
    (    45.388s) Refresh materialized view : Success
    (     0.231s) Verify materialized View is populated : Success
    (     1.286s) Create YSQL Colocated: False Tables with Sharding: HASH on Colocated: False DB : Success
    (  2021.733s) Create Secondary Index. : Success
    (    14.608s) Create Unique Index. : Success
    (     8.588s) Create Partial Index. : Success
    (     0.813s) Validate Tablets count increased from 3 : Success
    (     0.428s) Validate created indexes : Success
    (    27.022s) Create Materialized View: non_colocated_mv : Success
    (     0.496s) Verify Materialized View is populated : Success
    (   101.245s) Insert 5000 records : Success
    (     0.978s) Verify Table is populated : Success
    (    26.781s) Refresh materialized view : Success
    (     0.403s) Verify materialized View is populated : Success
    (     1.968s) Create YSQL Colocated: False Tables with Sharding: RANGE on Colocated: False DB : Success
    (  2021.456s) Create Secondary Index. : Success
    (    14.669s) Create Unique Index. : Success
    (    10.618s) Create Partial Index. : Success
    (     0.898s) Validate Tablets count increased from 3 : Success
    (     0.482s) Validate created indexes : Success
    (    28.085s) Create Materialized View: non_colocated_mv_r : Success
    (     0.480s) Verify Materialized View is populated : Success
    (   106.954s) Insert 5000 records : Success
    (     0.866s) Verify Table is populated : Success
    (    27.531s) Refresh materialized view : Success
    (     0.305s) Verify materialized View is populated : Success
    (    31.890s) Create YSQL Colocated: True Tables with Sharding: RANGE on Colocated: True DB : Success
    (     8.470s) Copy Data from /tmp/colocated_employees.csv : >>> Integration Test Failed <<< 
Operation failed. Try again: [Operation failed. Try again (yb/tablet/tablet.cc:1312): Transaction metadata missing: 7cc586a5-5d7a-4fb4-b2d4-029f5d7bac23, looks like it was just aborted (pgsql error 40001)]

    (    33.090s) Saved server log files and keys at /share/jenkins/workspace/itest-system-developer/logs/2.16.6.0_testysqltspcmvandbr-aws-rf3_20230711_075643 : Success
    (    74.979s) Destroy universe : Success
    (     0.234s) Check and stop workloads : Success
testysqltspcmvandbr-aws-rf3: End

Build - 2.16.6.0-b34

GFLAGS

MASTER_FLAGS = {
    "enable_automatic_tablet_splitting": "true",
    "tablet_split_high_phase_shard_count_per_node": 200,
    "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2 MB
    "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
    "tablet_split_low_phase_shard_count_per_node": 98,
    "enable_stream_compression": "true",
    "stream_compression_algo": "1",  # Gzip compression
}

TSERVER_FLAGS = {
    "enable_automatic_tablet_splitting": "true",
    "ysql_num_shards_per_tserver": 1,
    "enable_stream_compression": "true",
    "stream_compression_algo": "1",  # Gzip compression
}

Schema Used

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TYPE complex AS (re float8, im float8);
CREATE TYPE e_details AS ENUM ('Email', 'Sms', 'Phone');
CREATE DOMAIN postal_code AS TEXT CHECK (VALUE ~ '^\d{5}$' OR VALUE ~ '^\d{5}-\d{4}$');

Range Sharded table-

CREATE TABLE tab (
  id TEXT, 
  uuid_col uuid DEFAULT uuid_generate_v4(), 
  name text COLLATE "en_US.utf8", 
  c complex, 
  info json, 
  contact JSONB, 
  arr smallint[], 
  cash money, 
  i inet, 
  m macaddr, 
  i2 serial, 
  i3 bigserial, 
  val smallint, 
  details e_details, 
  age int, 
  collated_data text collate "POSIX", 
  date DATE, 
  n NUMERIC (3, 2), 
  r real, 
  c1 CHAR(1), 
  created_at timestamptz, 
  uuid0 uuid DEFAULT uuid_nil(), 
  uuid1 uuid DEFAULT uuid_generate_v1(), 
  p1 POINT, 
  t1 TIME, 
  ts1 TIMESTAMP, 
  i4 INTERVAL, 
  p2 path, 
  p3 polygon, 
  b box, 
  c2 circle, 
  l line, 
  l1 lseg, 
  a2 text[][], 
  zip postal_code, 
  PRIMARY KEY (id, name, uuid_col)
);

cc: @renjith-yb @kripasreenivasan @Arjun-yb


zlareb1-yb commented 1 year ago

Jenkins job for reference - https://jenkins.dev.yugabyte.com/job/itest-system-developer/7232/console

rthallamko3 commented 11 months ago

@zlareb1-yb, can you check if this fails frequently?

zlareb1-yb commented 11 months ago

@agsh-yb can you please confirm if this test is failing frequently?

agsh-yb commented 11 months ago

@rthallamko3, occasionally this test fails due to a different issue, which might be related to automation and can potentially be resolved by adjusting the batch size of copy transactions. However, this particular issue seems to occur quite frequently.

cc: @zlareb1-yb

kripasreenivasan commented 10 months ago

Test name: testysqltspcmvandbr-aws-rf3

agsh-yb commented 10 months ago

@rthallamko3 I think adjusting the batch size of copy transactions worked well here! I reduced the batch size of the transactions used during the copy, and the issue was not observed in three consecutive runs. This was tried on version 2.16.9.0-b9. cc: @kripasreenivasan
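
For reference, the batch size of a copy is set with the `WITH (ROWS_PER_TRANSACTION n)` option form that appears elsewhere in this thread. A minimal sketch of how a test helper might build such a statement follows; `build_copy_statement` and the value 1000 in the usage note are hypothetical and are not taken from the test code.

```python
def build_copy_statement(table, csv_path, rows_per_transaction=None):
    """Return a COPY ... FROM statement, optionally with ROWS_PER_TRANSACTION.

    Hypothetical helper: leaving rows_per_transaction as None omits the option,
    so the server-side default batch size applies.
    """
    # Allow only simple identifiers, since the name is interpolated into SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    # Escape single quotes so the path stays a valid SQL string literal.
    escaped_path = csv_path.replace("'", "''")
    statement = f"COPY {table} FROM '{escaped_path}'"
    if rows_per_transaction is not None:
        if not isinstance(rows_per_transaction, int) or rows_per_transaction < 0:
            raise ValueError("rows_per_transaction must be a non-negative int")
        statement += f" WITH (ROWS_PER_TRANSACTION {rows_per_transaction})"
    return statement
```

For example, `build_copy_statement("colocated_employees_c", "/tmp/colocated_employees.csv", 1000)` yields the same statement shape as the runs in this thread, with a smaller batch size.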

rthallamko3 commented 10 months ago

Per @agsh-yb:

Executed the above test with ROWS_PER_TRANSACTION set to 20k, to 10k, and left at the default.

1.  With ROWS_PER_TRANSACTION = 20k and with 10k:
COPY colocated_employees_c FROM '/tmp/colocated_employees.csv' WITH (ROWS_PER_TRANSACTION 20000)
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
2.  With the default value:
COPY colocated_employees_c FROM '/tmp/colocated_employees.csv'
[Invalid argument (yb/tablet/preparer.cc:301): Operation replicate msg size (512364645) exceeds limit of leader side single op size (254017536)]

agsh-yb commented 10 months ago

CSV file: https://drive.google.com/file/d/1QLLVxS8A6sWE4t4FEGLWwImdzrbdTRrs/view?usp=sharing

agsh-yb commented 10 months ago

rajapriya371 commented 9 months ago

Observing this issue on versions 2.19.3.0-b140, 2.21.0.0-b270, and 2.18.5.0-b63.