ydb-platform / nbs

Network Block & File Store
Apache License 2.0
58 stars 22 forks source link

[Disk Manager] fix idempotency of DeleteImage task #2438

Open gy2411 opened 6 days ago

gy2411 commented 6 days ago

The TestImageServiceCreateImageFromImage nemesis test failed in the pr https://github.com/ydb-platform/nbs/pull/2294

The root cause of the failure are conflicst of requests of tasks with the same id:

disk_manager_stdout_9io07ym4:2024-11-01T18:48:28.881Z   DEBUG   persistence/ydb_logger.go:70    done    {"SYSLOG_IDENTIFIER": "disk-manager", "COMPONENT": "TASK_RUNNER", "REQUEST_ID":
 "TestImageServiceCreateImageFromImage_5", "SYSLOG_IDENTIFIER": "disk-manager", "COMPONENT": "TASK", "TASK_ID": "ab93fb55-4ac7-4bef-bdd1-f96179cd074c", "COMPONENT": "TASK_SCHEDULER", 
"COMPONENT": "TASK_SCHEDULER", "COMPONENT": "YDB", "YDB_LOG_NAME": "ydb.table.do", "latency": "23.805079ms", "idempotent": true, "label": "", "attempts": 1, "error": "non-retryable er
ror occurred on attempt No.1 (idempotent=true): Non retriable error, Silent=false: cannot create task with request=\n+TestImageServiceCreateImageFromImage_image1\u0012\u0006zone-a\u00
18\u0001, because task with same id=52d6ef27-045c-4f9e-9cb9-67a3041bcb1d,\n\t\t\t\t but different request=\n+TestImageServiceCreateImageFromImage_image1\u0012\u0006zone-a\u0018\u0001 
\ufffd\ufffd\ufffd\u0014 already exists

The scenario was the following:

gy2411 commented 6 days ago

Moreover, the errors of parsing protobufs happened in requests to ydb:

disk_manager_stdout_cx60ds6q:2024-11-01T19:21:25.304Z   INFO    storage/storage_ydb_impl.go:1528        finished updating task with id ab93fb55-4ac7-4bef-bdd1-f96179cd074c     {"SYSLO
G_IDENTIFIER": "disk-manager", "COMPONENT": "TASK_RUNNER", "REQUEST_ID": "TestImageServiceCreateImageFromImage_5", "SYSLOG_IDENTIFIER": "disk-manager"}
disk_manager_stdout_cx60ds6q:2024-11-01T19:21:25.311Z   WARN    persistence/ydb_logger.go:74    failed  {"SYSLOG_IDENTIFIER": "disk-manager", "COMPONENT": "TASK_RUNNER", "REQUEST_ID":
 "TestImageServiceCreateImageFromImage_5", "SYSLOG_IDENTIFIER": "disk-manager", "COMPONENT": "YDB", "YDB_LOG_NAME": "ydb.table.do", "latency": "5.862879049s", "idempotent": true, "lab
el": "", "error": "Retriable error, IgnoreRetryLimit=false: failed: query \n\t\t--!syntax_v1\n\t\tpragma TablePathPrefix = \"/Root/disk_manager/recipe/tasks\";\n\t\tdeclare $values as
 List<Struct<\n\t\tid: Utf8,\n\t\tidempotency_key: Utf8,\n\t\taccount_id: Utf8,\n\t\ttask_type: Utf8,\n\t\tregular: Bool,\n\t\tdescription: Utf8,\n\t\tcreated_at: Timestamp,\n\t\tcrea
ted_by: Utf8,\n\t\tmodified_at: Timestamp,\n\t\tgeneration_id: Uint64,\n\t\tstatus: Int64,\n\t\terror_code: Int64,\n\t\terror_message: Utf8,\n\t\terror_silent: Bool,\n\t\terror_detail
s: String,\n\t\tretriable_error_count: Uint64,\n\t\trequest: String,\n\t\tstate: String,\n\t\tmetadata: String,\n\t\tdependencies: String,\n\t\tchanged_state_at: Timestamp,\n\t\tended
_at: Timestamp,\n\t\tlast_host: Utf8,\n\t\tlast_runner: Utf8,\n\t\tzone_id: Utf8,\n\t\testimated_time: Timestamp,\n\t\tdependants: String,\n\t\tpanic_count: Uint64>>;\n\n\t\tupsert in
to tasks\n\t\tselect *\n\t\tfrom AS_TABLE($values)\n\t: connError{node_id:1,address:'dp774es3d0144t8q93pk.auto.internal:9458'}: transport/Internal (code = 13, source error = \"rpc err
or: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8\", address: \"dp774es3d0144t8q93pk.auto.internal:9458\", traceID: \"5430dda3-d03e-4a54-941
6-5645a835ff46\") at `github.com/ydb-platform/ydb-go-sdk/v3/internal/conn.(*conn).Invoke(conn.go:347)` at `github.com/ydb-platform/ydb-go-sdk/v3/internal/balancer.(*Balancer).wrapCall
(balancer.go:351)` at `github.com/ydb-platform/ydb-go-sdk/v3/internal/table.(*session).executeDataQuery(session.go:795)` at `github.com/ydb-platform/ydb-go-sdk/v3/internal/table.(*ses
sion).Execute(session.go:748)` at `github.com/ydb-platform/ydb-go-sdk/v3/internal/table.(*transaction).Execute(transaction.go:80)` at `github.com/ydb-platform/ydb-go-sdk/v3/internal/t
able.do.func1(retry.go:53)`", "retryable": true, "code": 13, "deleteSession": true, "version": "3.54.3"}

The reason of this is error_message field of task state. Here we write request in the message, but request in not a valid utf-8 string.

gy2411 commented 6 days ago

Solution proposal:

1) DeleteImage task should finish succesfully immediately if it can not find the image in the database here.

2) Write request as base64 in error messages. This will also help to print request in logs in the format that is convenient for parsing.