yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Failure while performing concurrent DML & DDL operations on same table #14468

Open yugabyte-ci opened 2 years ago

yugabyte-ci commented 2 years ago

Jira Link: DB-3867

Edit: The issue can be reproduced both with and without a tablegroup, so it is not specific to tablegroups.

Test for Concurrent DDL (alter) + DML operation on same table in a tablegroup.

// Concurrent DDL (alter) + DML operation on same table in a tablegroup.
TEST_F(TablegroupConcurrencyTest, YB_DISABLE_TEST_IN_TSAN(AlterAndUpdateOnSameTable)) {
  PGConn conn1 = ASSERT_RESULT(ConnectToDB(database_name));
  PGConn conn2 = ASSERT_RESULT(ConnectToDB(database_name));
  ASSERT_OK(conn1.ExecuteFormat("CREATE TABLEGROUP $0", tablegroup_name));

  for (int i = 0; i < num_iterations; ++i) {
    // Create t1 and insert 50 rows.
    ASSERT_OK(conn1.ExecuteFormat("CREATE TABLE t1 (i int, j int) TABLEGROUP $0", tablegroup_name));
    InsertDataIntoTable(&conn1, "t1" /* table_name */);

    std::atomic<bool> done = false;
    int counter = 0;

    // Update rows in t1 on a separate thread.
    std::thread update_thread([&conn2, &done, &counter] {
      while (!done) {
        Status s = conn2.ExecuteFormat("UPDATE t1 SET j = j * $0 WHERE i > 0", counter);
        while (!s.ok()) {
          s = conn2.ExecuteFormat("UPDATE t1 SET j = j * $0 WHERE i > 0", counter);
        }
        ++counter;
      }
    });

    // Add a column in t1 on the main thread.
    Status s = conn1.Execute("ALTER TABLE t1 ADD COLUMN k INT DEFAULT 0");
    while (!s.ok()) {
      s = conn1.Execute("ALTER TABLE t1 ADD COLUMN k INT DEFAULT 0");
    }
    done = true;
    update_thread.join();

    // Verify that t1 has 3 columns.
    string query =
        "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 't1'";
    PGResultPtr res = ASSERT_RESULT(conn1.Fetch(query));
    ASSERT_EQ(PQntuples(res.get()), 3);

    // Verify t1 has 50 rows.
    auto curr_rows = ASSERT_RESULT(conn1.FetchValue<int64_t>("SELECT COUNT(*) FROM t1"));
    ASSERT_EQ(curr_rows, 50);

    // Reset the table for next iteration.
    ASSERT_OK(conn1.ExecuteFormat("DROP TABLE t1"));
  }
}

When run 30 times, this test failed once due to a timeout, with the following message:

FATAL:  Could not reconnect to database
[ts-1] 2022-10-13 14:14:30.328 IST [37784] HINT:  Database might have been dropped by another user
[ts-1] I1013 14:14:30.329075 1830858752 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 58)
W1013 14:14:30.346480 1844539392 libpq_utils.cc:84] SQLSTATE is not defined for result with error message: , PQresultStatus: PGRES_FATAL_ERROR
[ts-2] I1013 14:14:30.444984 1816375296 tablet_server.cc:645] Invalidating the entire cache since catalog version incremented
[ts-3] I1013 14:14:30.453585 1849044992 tablet_server.cc:645] Invalidating the entire cache since catalog version incremented
[ts-1] I1013 14:14:30.457280 1877372928 tablet_server.cc:645] Invalidating the entire cache since catalog version incremented
W1013 14:14:35.346482 1844539392 libpq_utils.cc:84] SQLSTATE is not defined for result with error message: , PQresultStatus: PGRES_FATAL_ERROR
W1013 14:14:40.346500 1844539392 libpq_utils.cc:84] SQLSTATE is not defined for result with error message: , PQresultStatus: PGRES_FATAL_ERROR
W1013 14:14:45.346519 1844539392 libpq_utils.cc:84] SQLSTATE is not defined for result with error message: , PQresultStatus: PGRES_FATAL_ERROR
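
The test above retries both the UPDATE and the ALTER in unbounded `while (!s.ok())` loops. One possible reading of the warnings repeating every 5 seconds is that such a loop keeps retrying on a connection that never recovers, so the run only ends at the test timeout. The sketch below shows a bounded retry that gives up after a fixed number of attempts. `RetryWithLimit` is a hypothetical helper, not part of the YugabyteDB test framework, and the attempt count and delay are arbitrary assumptions.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (an assumption, not a YugabyteDB API): calls `attempt`
// until it returns true or `max_attempts` calls have been made.
// Returns true on success and false if every attempt failed.
bool RetryWithLimit(const std::function<bool()>& attempt,
                    int max_attempts,
                    std::chrono::milliseconds delay) {
  for (int i = 0; i < max_attempts; ++i) {
    if (attempt()) {
      return true;
    }
    // Sleep between attempts, but not after the final failed one.
    if (i + 1 < max_attempts) {
      std::this_thread::sleep_for(delay);
    }
  }
  return false;
}
```

In the test, the callable could wrap something like `conn1.Execute(...).ok()`, with the result checked by an assertion so that a lost connection fails the test quickly with a clear message instead of hanging until the timeout.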

siddharth2411 commented 2 years ago

This test also fails with regular/non-colocated tables.

m-iancu commented 1 year ago

Looks like the error message from that FATAL is from the YSQL metadata-cache-refresh code: https://github.com/YugaByte/yugabyte-db/blob/master/src/postgres/src/backend/utils/cache/relcache.c#L1952-L1956

WangPingGang commented 1 year ago

Is there any progress on this problem? It seems the unit test PgLibPqTest.PagingReadRestart hits the same problem.