yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] SELECT fails during index drop/create #15374

Closed: pilshchikov closed this issue 11 months ago

pilshchikov commented 1 year ago

Jira Link: DB-4507

Description

Steps:

  1. Start the CassandraDataLoad workload (a table with 15 value columns and 10 indexes; columns k, v1, v2, v3...vN):
    java -jar yb-sample-apps-1.8.29.jar --workload CassandraDataLoad  --num_writes -1 --num_reads -1 --num_unique_keys 10000000000 --num_value_columns 15 --num_indexes 10 --uuid f0c812fd-9ee1-4a83-8300-c2e4b9aed81d --retry_primary_key true --create_table_name test_indexes_9a2909 --num_threads_write 4 --num_threads_read 10 --batch_size 10 --uuid_marker f37d6998-5f2a-4703-856a-4bbc8d4e809d --nodes 172.151.23.223:9042,172.151.31.105:9042,172.151.22.92:9042
  2. Start dropping and creating indexes.
  3. In between, check the workload logs for failed reads/writes.
  4. After some time, this exception appears:
    Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.151.23.223:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.151.23.223:9042] Timed out waiting for server response), /172.151.22.92:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.151.22.92:9042] Timed out waiting for server response), /172.151.31.105:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [/172.151.31.105:9042] Timed out waiting for server response))
    at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:283)
    at com.datastax.driver.core.RequestHandler.access$1200(RequestHandler.java:61)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:375)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.retry(RequestHandler.java:557)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.processRetryDecision(RequestHandler.java:539)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:981)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1635)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)
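Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the stress-test code (test_indexes.py is not shown in this issue): the helper names, the idx_<column> naming scheme, and the choice of v1..v10 as the 10 indexed columns are all hypothetical. It only builds the YCQL DROP INDEX / CREATE INDEX statements for each churn cycle and counts log lines that mention the two driver exceptions seen in step 4; actually executing the statements against the cluster is left out.

```python
import random
from typing import Iterable, List

# Exception names taken from the workload stack trace in step 4.
FAILURE_MARKERS = ("NoHostAvailableException", "OperationTimedOutException")


def drop_and_create_statements(table: str, index_name: str, column: str) -> List[str]:
    """YCQL statements for one drop/create cycle of a secondary index (hypothetical helper)."""
    return [
        f"DROP INDEX IF EXISTS {index_name};",
        f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} ({column});",
    ]


def churn_plan(table: str, columns: List[str], cycles: int, seed: int = 0) -> List[str]:
    """Pick a random indexed column per cycle and emit its drop/create pair."""
    rng = random.Random(seed)  # seeded so the same plan can be replayed
    plan: List[str] = []
    for _ in range(cycles):
        column = rng.choice(columns)
        # Hypothetical naming scheme: one index per column, named idx_<column>.
        plan.extend(drop_and_create_statements(table, f"idx_{column}", column))
    return plan


def count_failures(log_lines: Iterable[str]) -> int:
    """Step 3: count log lines mentioning a driver timeout or NoHostAvailable error."""
    return sum(1 for line in log_lines if any(m in line for m in FAILURE_MARKERS))


if __name__ == "__main__":
    # Assumption: the 10 indexes sit on value columns v1..v10.
    indexed_columns = [f"v{i}" for i in range(1, 11)]
    for stmt in churn_plan("test_indexes_9a2909", indexed_columns, cycles=2):
        print(stmt)
```

A line such as the "Caused by: ... NoHostAvailableException" line above mentions both markers but is counted once, since the sketch counts matching lines rather than individual exceptions.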

All logs and steps/metrics: http://stress.dev.yugabyte.com/stress_test/d669b530-6697-472d-98f7-b0ec30b35790
Workload link: https://github.com/yugabyte/yb-stress-test/blob/0b3f9c6c1311fc4aed08b887143be75011a8db4e/tools/sample-app/src/main/java/com/yugabyte/sample/apps/CassandraDataLoad.java#L161
Jar to download: https://github.com/yugabyte/yb-stress-test/releases/download/sa_1.8.29/yb-sample-apps-1.8.29.jar

The workload reads with WHERE k/v1/v2/v3...vN = [key], picking the filter column at random; writes work the same way.
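The read path described above can be pictured as one query template per column, with a column picked at random for each read. This is a rough sketch, not the CassandraDataLoad source linked above: the function names are hypothetical, SELECT * is an assumption about the projected columns, and the column list k, v1..v15 is inferred from --num_value_columns 15. The ? is the standard CQL bind marker that a prepared statement would fill with the key.

```python
import random
from typing import List


def read_columns(num_value_columns: int) -> List[str]:
    """Key column k plus value columns v1..vN, as described in step 1."""
    return ["k"] + [f"v{i}" for i in range(1, num_value_columns + 1)]


def read_query(table: str, column: str) -> str:
    """YCQL point read filtered on one column; '?' is the CQL bind marker."""
    return f"SELECT * FROM {table} WHERE {column} = ?;"


def random_read_query(table: str, columns: List[str], rng: random.Random) -> str:
    """Pick the filter column at random for each read (writes pick similarly)."""
    return read_query(table, rng.choice(columns))


if __name__ == "__main__":
    rng = random.Random(42)
    cols = read_columns(15)
    for _ in range(3):
        print(random_read_query("test_indexes_9a2909", cols, rng))
```

Because the filter column changes on every read, some reads are likely to target a column whose index is being dropped or recreated at that moment.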

Env (master/tserver gflags):

enable_automatic_tablet_splitting=false
client_read_write_timeout_ms=1800000
yb_client_admin_operation_timeout_sec=1800

version: 2.17.1.0-b345

rthallamko3 commented 1 year ago

@pilshchikov, do you know if this repros on master? @yusong-yan made some fixes in master in January; I wanted to know whether those resolve the issue.

rthallamko3 commented 1 year ago

It seems the latest failure in the report is a failure to drop the index.

Exception in teardown

test_ycql_backfill_indexes_default Failed
Failed to execute test test_ycql_backfill_indexes_default!
Traceback (most recent call last):
  File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/runner_tests.py", line 1037, in run_tests
    test_func(class_instance)
  File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/indexes/test_indexes.py", line 516, in test_ycql_backfill_indexes_default
    self.scenario(
  File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/indexes/test_indexes.py", line 642, in scenario
    actions(index_name, column, table_name, known_exceptions)
  File "/home/ubuntu/workspace/stress_tests/run_test_with_universe/src/suites/indexes/test_indexes.py", line 338, in drop_and_create_index_action_ycql
    assert not drop_indx, f"Failed to drop index: {"".join(drop_indx)}"
AssertionError: Failed to drop index: <stdin>:1:NoHostAvailable: 

rthallamko3 commented 1 year ago

@pilshchikov, can we check the resource utilization in these runs? If the tserver runs out of resources, the failure above might happen.

pilshchikov commented 1 year ago

@rthallamko3 I need to check that; I will update by the day after tomorrow.

pilshchikov commented 11 months ago

Closing because this issue no longer reproduces, but we have started hitting a new one constantly: https://github.com/yugabyte/yugabyte-db/issues/19628