Contact Details [Optional]
No response
System Information
ZENML_LOCAL_VERSION: 0.61.0 ZENML_SERVER_VERSION: 0.61.0 ZENML_SERVER_DATABASE: sqlite ZENML_SERVER_DEPLOYMENT_TYPE: other ZENML_CONFIG_DIR: /home/raymund/.config/zenml ZENML_LOCAL_STORE_DIR: /home/raymund/.config/zenml/local_stores ZENML_SERVER_URL: http://192.168.0.100:8080 ZENML_ACTIVE_REPOSITORY_ROOT: /home/raymund/Documents/ftune-hq PYTHON_VERSION: 3.10.14 ENVIRONMENT: native SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '24.04'} ACTIVE_WORKSPACE: default ACTIVE_STACK: s3store_stack ACTIVE_USER: lab TELEMETRY_STATUS: enabled ANALYTICS_CLIENT_ID: 5efcfb28-f5b0-44d7-be5f-7e06cf8909fc ANALYTICS_USER_ID: 518c5278-9025-4d9d-87ec-a760547355c2 ANALYTICS_SERVER_ID: 37313583-62d5-4a91-8a12-209bbf468827 INTEGRATIONS: ['airflow', 'bitbucket', 'kaniko', 'pigeon', 'pytorch', 's3'] PACKAGES: {'certifi': '2024.7.4', 'fsspec': '2024.6.1', 's3fs': '2024.6.1', 'regex': '2024.5.15', 'tzdata': '2024.1', 'pytz': '2024.1', 'setuptools': '65.5.0', 'pip': '24.1.2', 'packaging': '24.1', 'attrs': '23.2.0', 'pyarrow': '16.1.0', 'rich': '13.7.1', 'nvidia-nvjitlink-cu12': '12.5.82', 'nvidia-cuda-nvrtc-cu12': '12.1.105', 'nvidia-cuda-cupti-cu12': '12.1.105', 'nvidia-nvtx-cu12': '12.1.105', 'nvidia-cuda-runtime-cu12': '12.1.105', 'nvidia-cublas-cu12': '12.1.3.1', 'nvidia-cusparse-cu12': '12.1.0.106', 'nvidia-cusolver-cu12': '11.4.5.107', 'nvidia-cufft-cu12': '11.0.2.54', 'nvidia-curand-cu12': '10.3.2.106', 'ipython': '8.26.0', 'nvidia-cudnn-cu12': '8.9.2.26', 'ipywidgets': '8.1.3', 'click': '8.1.3', 'configparser': '7.0.0', 'docker': '6.1.3', 'multidict': '6.0.5', 'pyyaml': '6.0.1', 'psutil': '6.0.0', 'traitlets': '5.14.3', 'decorator': '5.1.1', 'smmap': '5.0.1', 'tqdm': '4.66.4', 'transformers': '4.42.3', 'typing-extensions': '4.12.2', 'pexpect': '4.9.0', 'widgetsnbextension': '4.0.11', 'gitdb': '4.0.11', 'async-timeout': '4.0.3', 'bcrypt': '4.0.1', 'filelock': '3.15.4', 'aiohttp': '3.9.5', 'idna': '3.7', 'xxhash': '3.4.1', 
'charset-normalizer': '3.3.2', 'networkx': '3.3', 'gitpython': '3.1.43', 'jinja2': '3.1.4', 'prompt-toolkit': '3.0.47', 'jupyterlab-widgets': '3.0.11', 'greenlet': '3.0.3', 'markdown-it-py': '3.0.0', 'requests': '2.31.0', 'nvidia-nccl-cu12': '2.20.5', 'datasets': '2.19.1', 'pydantic-core': '2.18.4', 'pygments': '2.18.0', 'aiobotocore': '2.13.1', 'python-dateutil': '2.9.0.post0', 'pydantic': '2.7.4', 'pyparsing': '2.4.7', 'asttokens': '2.4.1', 'torch': '2.3.1', 'triton': '2.3.1', 'pandas': '2.2.2', 'urllib3': '2.2.2', 'pydantic-settings': '2.2.1', 'cloudpickle': '2.2.1', 'markupsafe': '2.1.5', 'sqlalchemy': '2.0.31', 'executing': '2.0.1', 'boto3': '1.34.131', 'botocore': '1.34.131', 'numpy': '1.26.4', 'wrapt': '1.16.0', 'six': '1.16.0', 'sympy': '1.12.1', 'yarl': '1.9.4', 'distro': '1.9.0', 'alembic': '1.8.1', 'websocket-client': '1.8.0', 'passlib': '1.7.4', 'frozenlist': '1.4.1', 'argparse': '1.4.0', 'mako': '1.3.5', 'aiosignal': '1.3.1', 'mpmath': '1.3.0', 'exceptiongroup': '1.2.1', 'pymysql': '1.1.1', 'python-dotenv': '1.0.1', 'jmespath': '1.0.1', 'multiprocess': '0.70.16', 'zenml': '0.61.0', 'bitsandbytes': '0.43.1', 'sqlalchemy-utils': '0.41.2', 'accelerate': '0.32.1', 'huggingface-hub': '0.23.4', 'jedi': '0.19.1', 'httplib2': '0.19.1', 'tokenizers': '0.19.1', 'validators': '0.18.2', 'peft': '0.11.1', 'aioitertools': '0.11.0', 's3transfer': '0.10.2', 'parso': '0.8.4', 'aws-profile-manager': '0.7.3', 'ptyprocess': '0.7.0', 'annotated-types': '0.7.0', 'stack-data': '0.6.3', 'pyarrow-hotfix': '0.6', 'safetensors': '0.4.3', 'dill': '0.3.8', 'secure': '0.3.0', 'click-params': '0.3.0', 'wcwidth': '0.2.13', 'pure-eval': '0.2.2', 'comm': '0.2.2', 'matplotlib-inline': '0.1.7', 'mdurl': '0.1.2', 'sqlmodel': '0.0.18'}
CURRENT STACK
Name: s3store_stack ID: f5a870ff-2c25-4e63-aa5b-4b90acaa9c64 User: lab / 518c5278-9025-4d9d-87ec-a760547355c2 Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
ORCHESTRATOR: default
Name: default ID: 5b8cf94c-0231-4860-9b7b-71d701cf866d Type: orchestrator Flavor: local Configuration: {} Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
ARTIFACT_STORE: s3_store
Name: s3_store ID: bee9c341-3cc6-48dc-b764-21f63b1924ea Type: artifact_store Flavor: s3 Configuration: {'authentication_secret': 's3_secret', 'path': 's3://labyrinth', 'key': '****', 'secret': '****', 'token': '****', 'client_kwargs': {'endpoint_url': 'http://192.168.0.100:9000', 'region_name': 'taipei'}, 'config_kwargs': None, 's3_additional_kwargs': None} User: lab / 518c5278-9025-4d9d-87ec-a760547355c2 Workspace: default / ec65075d-7856-48f9-a5ea-10eb3f063bac
What happened?
I configured an S3 artifact store in my stack and ran python run.py. After about 5 minutes it failed with:
ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request
According to Stack Overflow, this error can be caused by token expiry. I then read the ZenML code and the underlying s3fs code; s3fs appears to ship with a 5-minute default connection timeout:
connect_timeout = 5
My MinIO trace confirms the window is indeed 5 minutes.
I would like to raise that timeout in the ZenML config, but I don't know how or where to set it.
Since the error occurred after a long-running fine-tuning step, I wonder whether S3ArtifactStore should reconnect automatically instead of surfacing the 400 error. Hence this bug report.
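The stack output above shows the s3 flavor exposes a config_kwargs field (currently None). If that field is forwarded to botocore's Config, as it is in s3fs, a longer timeout might be settable on the registered component. This is an unverified sketch: the connect_timeout/read_timeout keys are botocore's (in seconds), and their pass-through by the flavor is an assumption on my part:

```shell
# Assumption: config_kwargs is forwarded to botocore.config.Config by the
# s3 artifact-store flavor; connect_timeout/read_timeout are in seconds.
zenml artifact-store update s3_store \
  --config_kwargs='{"connect_timeout": 600, "read_timeout": 600}'
```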
Reproduction steps
zenml artifact-store register s3_store -f s3 --path='s3://labyrinth' --authentication_secret=s3_secret --client_kwargs='{"endpoint_url": "http://192.168.0.100:9000", "region_name": "taipei"}'
zenml stack register s3store_stack -a s3_store -o default --set
python run.py
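Until the store handles this itself, a workaround at the call site would be to rebuild the filesystem handle and retry when the stale connection surfaces as a 400. Below is a minimal sketch of that auto-reconnect idea; TransientClientError and FakeS3Client are hypothetical stand-ins for the real botocore/s3fs objects, not ZenML's actual API:

```python
import time

class TransientClientError(Exception):
    """Stand-in for botocore's ClientError carrying a 400 Bad Request."""

class FakeS3Client:
    """Stand-in for an s3fs/botocore client; 'expired' mimics a stale session."""
    def __init__(self, expired):
        self.expired = expired

    def head_object(self, key):
        if self.expired:
            raise TransientClientError("400 Bad Request")
        return {"ContentLength": 123}

def call_with_reconnect(make_client, op, retries=2, backoff=0.0):
    """Run op(client); on a transient error, build a fresh client and retry."""
    client = make_client()
    for attempt in range(retries + 1):
        try:
            return op(client)
        except TransientClientError:
            if attempt == retries:
                raise  # out of retries: surface the original error
            time.sleep(backoff)
            client = make_client()  # drop the stale handle, reconnect

# The first client simulates the expired session; the rebuilt one succeeds.
clients = iter([FakeS3Client(expired=True), FakeS3Client(expired=False)])
result = call_with_reconnect(lambda: next(clients),
                             lambda c: c.head_object("model.pt"))
print(result)  # → {'ContentLength': 123}
```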
Relevant log output
Code of Conduct