wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

Triton Server - support of Unified Conformer model fails #2525

Closed: pjhool closed this issue 1 month ago

pjhool commented 1 month ago

Describe the bug
The Unified Conformer model is not supported on the Triton Inference Server.

To Reproduce
Steps to reproduce the behavior:

  1. Copy the trained Unified Conformer model, config.yaml, and train.yaml to the onnx_model directory.
  2. Start the Triton Inference Server and observe the error:

    == Triton Inference Server ==

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0513 07:13:40.798583 120 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fada8000000' with size 1024000000
I0513 07:13:40.798851 120 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 1024000000
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 34:7: Expected integer, got: initial_state
E0513 07:13:40.801181 120 model_repository_manager.cc:1004] Poll failed for model directory 'encoder': failed to read text proto from /ws/model_repo/encoder/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 86:3: Expected integer, got: }
E0513 07:13:40.801258 120 model_repository_manager.cc:1004] Poll failed for model directory 'feature_extractor': failed to read text proto from /ws/model_repo/feature_extractor/config.pbtxt
E0513 07:13:40.801448 120 model_repository_manager.cc:487] Invalid argument: ensemble streaming_wenet contains models that are not available: encoder, feature_extractor
I0513 07:13:40.801467 120 model_lifecycle.cc:459] loading: decoder:1
I0513 07:13:40.801486 120 model_lifecycle.cc:459] loading: wenet:1
I0513 07:13:40.802576 120 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0513 07:13:40.802594 120 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0513 07:13:40.802597 120 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0513 07:13:40.802599 120 onnxruntime.cc:2505] backend configuration: {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0513 07:13:40.810061 120 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: decoder (version 1)
I0513 07:13:40.810469 120 onnxruntime.cc:666] skipping model configuration auto-complete for 'decoder': inputs and outputs already specified
I0513 07:13:40.810836 120 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: decoder_0_0 (GPU device 0)
2024-05-13 07:13:40.994426930 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-13 07:13:40.994447420 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0513 07:13:41.575200 120 python_be.cc:1534] Forcing CPU only input tensors.
I0513 07:13:43.630285 120 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: decoder_0_1 (GPU device 0)
2024-05-13 07:13:43.719222391 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-13 07:13:43.719241018 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0513 07:13:43.732326 120 python_be.cc:1858] TRITONBACKEND_ModelInstanceInitialize: wenet_0_0 (CPU device 0)
I0513 07:13:43.732525 120 model_lifecycle.cc:694] successfully loaded 'decoder' version 1
Using device cpu
Successfully load model !
Successfully load vocabulary !
Using rescoring: True
Successfully load all parameters!
Finish Init
I0513 07:13:44.496564 120 python_be.cc:1858] TRITONBACKEND_ModelInstanceInitialize: wenet_0_1 (CPU device 0)
Using device cpu
Successfully load model !
Successfully load vocabulary !
Using rescoring: True
Successfully load all parameters!
Finish Init
I0513 07:13:45.234932 120 model_lifecycle.cc:694] successfully loaded 'wenet' version 1
I0513 07:13:45.235027 120 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0513 07:13:45.235090 120 server.cc:590]
+-------------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                             | Config                                                                                                                                         |
+-------------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+

I0513 07:13:45.235133 120 server.cc:633]
+---------+---------+--------+
| Model   | Version | Status |
+---------+---------+--------+
| decoder | 1       | READY  |
| wenet   | 1       | READY  |
+---------+---------+--------+

I0513 07:13:45.249745 120 metrics.cc:864] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0513 07:13:45.249956 120 metrics.cc:757] Collecting CPU metrics
I0513 07:13:45.250076 120 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                               |
| server_version                   | 2.30.0                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0]         | /ws/model_repo                                                                                       |
| model_control_mode               | MODE_NONE                                                                                            |
| strict_model_config              | 0                                                                                                    |
| rate_limit                       | OFF                                                                                                  |
| pinned_memory_pool_byte_size     | 1024000000                                                                                           |
| cuda_memory_pool_byte_size{0}    | 1024000000                                                                                           |
| response_cache_byte_size         | 0                                                                                                    |
| min_supported_compute_capability | 6.0                                                                                                  |
| strict_readiness                 | 1                                                                                                    |
| exit_timeout                     | 30                                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------+

I0513 07:13:45.250096 120 server.cc:264] Waiting for in-flight requests to complete.
I0513 07:13:45.250100 120 server.cc:280] Timeout 30: Found 0 model versions that have in-flight inferences
I0513 07:13:45.250163 120 server.cc:295] All models are stopped, unloading models
I0513 07:13:45.250170 120 server.cc:302] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0513 07:13:45.250249 120 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0513 07:13:45.256153 120 onnxruntime.cc:2640] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0513 07:13:45.264788 120 onnxruntime.cc:2586] TRITONBACKEND_ModelFinalize: delete model state
I0513 07:13:45.264861 120 model_lifecycle.cc:579] successfully unloaded 'decoder' version 1
I0513 07:13:46.250263 120 server.cc:302] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
Cleaning up...
remove wenet model
I0513 07:13:47.250381 120 server.cc:302] Timeout 28: Found 1 live models and 0 in-flight non-inference requests
Cleaning up...
remove wenet model
I0513 07:13:48.049864 120 model_lifecycle.cc:579] successfully unloaded 'wenet' version 1
I0513 07:13:48.250502 120 server.cc:302] Timeout 27: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
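For orientation, the model repository implied by the log (a sketch inferred from the messages above, not an exact listing) looks roughly like this; encoder and feature_extractor fail to parse, which in turn breaks the streaming_wenet ensemble:

model_repo/
├── decoder/             # onnxruntime backend, loads fine
├── encoder/             # onnxruntime backend, config.pbtxt fails to parse
├── feature_extractor/   # python backend, config.pbtxt fails to parse
├── streaming_wenet/     # ensemble wiring the other models together
└── wenet/               # python backend (decoding with rescoring), loads fine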

Expected behavior
All trained models should load.

Screenshot: triton-unified conformer (attached)

Slyne commented 1 month ago

Could you attach your config.pbtxt from encoder and feature_extractor? It looks like the configuration files are not quite right.

pjhool commented 1 month ago

Attached the config.pbtxt files for encoder and feature_extractor.

  1. encoder config.pbtxt (attachment: config.pbtxt.txt)

name: "encoder" backend: "onnxruntime" default_model_filename: "encoder.onnx"

max_batch_size: 512

sequence_batching {
  max_sequence_idle_microseconds: 5000000
  oldest {
    max_candidate_sequences: 1024
    max_queue_delay_microseconds: 5000
  }
  control_input [ ]
  state [
    {
      input_name: "offset"
      output_name: "r_offset"
      data_type: TYPE_INT64
      dims: [ 1 ]
      initial_state: {
        data_type: TYPE_INT64
        dims: [ 1 ]
        zero_data: true
        name: "initial state"
      }
    },
    {
      input_name: "att_cache"
      output_name: "r_att_cache"
      data_type: TYPE_FP16
      dims: [ #num_layers, #num_head, #cache_size, #att_cache_output_size ]
      initial_state: {
        data_type: TYPE_FP16
        dims: [ #num_layers, #num_head, #cache_size, #att_cache_output_size ]
        zero_data: true
        name: "initial state"
      }
    },
    {
      input_name: "cnn_cache"
      output_name: "r_cnn_cache"
      data_type: TYPE_FP16
      dims: [ #num_layers, 256, #cnn_module_cache ]
      initial_state: {
        data_type: TYPE_FP16
        dims: [ #num_layers, 256, #cnn_module_cache ]
        zero_data: true
        name: "initial state"
      }
    },
    {
      input_name: "cache_mask"
      output_name: "r_cache_mask"
      data_type: TYPE_FP16
      dims: [ 1, #cache_size ]
      initial_state: {
        data_type: TYPE_FP16
        dims: [ 1, #cache_size ]
        zero_data: true
        name: "initial state"
      }
    }
  ]
}
input [
  {
    name: "chunk_xs"
    data_type: TYPE_FP16
    dims: [ #decoding_window, 80 ]
  },
  {
    name: "chunk_lens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "log_probs"
    data_type: TYPE_FP16
    dims: [ -1, 10 ]  # [-1, beam_size]
  },
  {
    name: "log_probs_idx"
    data_type: TYPE_INT64
    dims: [ -1, 10 ]  # [-1, beam_size]
  },
  {
    name: "chunk_out"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
  },
  {
    name: "chunk_out_lens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
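Note: # begins a comment in protobuf text format, so a line such as dims: [ #num_layers, #num_head, #cache_size, #att_cache_output_size ] parses as dims: [ followed by a comment; the parser then reaches initial_state where it expects an integer, which is exactly the "34:7: Expected integer, got: initial_state" error in the server log. Once the placeholders are substituted, the att_cache state entry would look something like this (the dimension values are illustrative assumptions, not read from this model's train.yaml):

state [
  {
    input_name: "att_cache"
    output_name: "r_att_cache"
    data_type: TYPE_FP16
    # Illustrative only: 12 layers, 4 heads, 64 cached frames,
    # key/value cache width 128 per head.
    dims: [ 12, 4, 64, 128 ]
    initial_state: {
      data_type: TYPE_FP16
      dims: [ 12, 4, 64, 128 ]
      zero_data: true
      name: "initial state"
    }
  }
]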

Attachment: config_template.pbtxt.txt

Attachment: config_template2.pbtxt.txt

  2. feature_extractor config.pbtxt (attachment: config.pbtxt.txt)

name: "feature_extractor" backend: "python" max_batch_size: 512

parameters [
  {
    key: "frame_length_ms",
    value: { string_value: "#frame_length" }
  },
  {
    key: "frame_shift_ms"
    value: { string_value: "#frame_shift" }
  },
  {
    key: "sample_rate"
    value: { string_value: "#sample_rate" }
  },
  {
    key: "chunk_size_s",
    value: { string_value: "#chunk_size_in_seconds" }
  }
]
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  oldest {
    max_candidate_sequences: 512
    preferred_batch_size: [ 32, 64, 128, 256 ]
  }
  control_input [
    {
      name: "START",
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "CORRID",
      control [
        {
          kind: CONTROL_SEQUENCE_CORRID
          data_type: TYPE_UINT64
        }
      ]
    },
    {
      name: "END",
      control [
        {
          kind: CONTROL_SEQUENCE_END
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
input [
  {
    name: "wav"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "wav_lens"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "speech"
    data_type: TYPE_#DTYPE  # FP32
    dims: [ #decoding_window, #num_mel_bins ]
  },
  {
    name: "speech_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

model.py (attachment: model.py.txt)

Attachment: config_template.pbtxt.txt

Slyne commented 1 month ago

It looks like this step was not completed successfully:

python3 wenet/bin/export_onnx_gpu.py \
    --config=$model_dir/train.yaml \
    --checkpoint=$model_dir/final.pt \
    --cmvn_file=$model_dir/global_cmvn \
    --ctc_weight=0.5 \
    --output_onnx_dir=$onnx_model_dir \
    --fp16
cp $model_dir/words.txt $model_dir/train.yaml $onnx_model_dir/

Could you check this again and ensure that all the fields with #xxx are filled?
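For example, here is a minimal sketch of filling the placeholders with sed. Every value below is a hypothetical stand-in; the real numbers come from your train.yaml (e.g. encoder_conf.num_blocks, encoder_conf.attention_heads) and from your chunk/left-context settings at export time:

# All values are illustrative assumptions; substitute the real ones
# from train.yaml and your decoding configuration.
num_layers=12
num_head=4
cache_size=64
att_cache_output_size=128
cnn_module_cache=7
decoding_window=67

sed -e "s|#num_layers|${num_layers}|g" \
    -e "s|#num_head|${num_head}|g" \
    -e "s|#cache_size|${cache_size}|g" \
    -e "s|#att_cache_output_size|${att_cache_output_size}|g" \
    -e "s|#cnn_module_cache|${cnn_module_cache}|g" \
    -e "s|#decoding_window|${decoding_window}|g" \
    model_repo/encoder/config_template.pbtxt > model_repo/encoder/config.pbtxt

The feature_extractor template needs the same treatment for its #frame_length, #frame_shift, #sample_rate, #chunk_size_in_seconds, #DTYPE, and #num_mel_bins placeholders.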

pjhool commented 1 month ago

I ran the same command as above:

python3 -m wenet.bin.export_onnx_gpu \
    --config $EXP/train.yaml \
    --checkpoint $EXP/final_10.pt \
    --cmvn_file=$EXP/global_cmvn \
    --ctc_weight=0.5 \
    --output_onnx_dir $onnx_dir \
    --fp16

Slyne commented 1 month ago

> I ran the same command as above:
> python3 -m wenet.bin.export_onnx_gpu --config $EXP/train.yaml --checkpoint $EXP/final_10.pt --cmvn_file=$EXP/global_cmvn --ctc_weight=0.5 --output_onnx_dir $onnx_dir --fp16

Please also add --streaming if you are exporting a streaming model.
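That is, the export command would presumably become:

python3 -m wenet.bin.export_onnx_gpu \
    --config $EXP/train.yaml \
    --checkpoint $EXP/final_10.pt \
    --cmvn_file=$EXP/global_cmvn \
    --ctc_weight=0.5 \
    --output_onnx_dir $onnx_dir \
    --fp16 \
    --streaming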

pjhool commented 1 month ago

It works!!