zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[BUG]: "Member must have length less than or equal to 63" when creating a job in SageMaker #1517

Closed by Frank995 1 year ago

Frank995 commented 1 year ago

Contact Details [Optional]

pfrank995@gmail.com

System Information

ZENML_LOCAL_VERSION: 0.37.0
ZENML_SERVER_VERSION: 0.37.0
ZENML_SERVER_DATABASE: mysql
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/francesco/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/francesco/.config/zenml/local_stores
ZENML_SERVER_URL: https://agp2rc7652.eu-west-1.awsapprunner.com
ZENML_ACTIVE_REPOSITORY_ROOT: /home/francesco/repos/shipamax-vulcan-review/data-science/src/modelling_tools/zenml
PYTHON_VERSION: 3.10.10
ENVIRONMENT: native
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': 
'20.04'}
ACTIVE_WORKSPACE: default
ACTIVE_STACK: francesco_stack
ACTIVE_USER: francesco
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: fac1c365-39ab-48cb-8147-4bfcb59c3bd6
ANALYTICS_USER_ID: 5135c97c-454d-4ea4-b04c-67c9b9464cef
ANALYTICS_SERVER_ID: face5c38-e106-44f8-aa3b-88ffcab8b10c
INTEGRATIONS: ['aws', 'kaniko', 'lightgbm', 'pillow', 'plotly', 'pytorch', 's3', 'scipy', 'sklearn']
PACKAGES: {'pdfminer.six': '20221105', 'regex': '2023.3.22', 'tifffile': '2023.3.21', 'certifi': '2022.12.7', 
's3fs': '2022.11.0', 'fsspec': '2022.11.0', 'pytz': '2022.7.1', 'tzdata': '2022.7', 'setuptools': '65.6.3', 
'cryptography': '38.0.4', 'pyzmq': '25.0.2', 'black': '23.1.0', 'pip': '23.0.1', 'packaging': '23.0', 'attrs': 
'22.2.0', 'contextlib2': '21.6.0', 'argon2-cffi': '21.3.0', 'argon2-cffi-bindings': '21.2.0', 
'azure-storage-blob': '12.9.0', 'rich': '12.6.0', 'pillow': '9.4.0', 'more-itertools': '9.1.0', 'phonenumbers': 
'8.13.7', 'ipython': '8.11.0', 'tenacity': '8.2.2', 'click': '8.1.3', 'python-slugify': '8.0.1', 'ipywidgets': 
'7.7.4', 'azure-servicebus': '7.6.0', 'nbconvert': '7.2.10', 'coverage': '7.2.2', 'jupyter-client': '7.2.0', 
'ipykernel': '6.22.0', 'notebook': '6.5.3', 'tornado': '6.2', 'ftfy': '6.1.1', 'multidict': '6.0.4', 'docker': 
'6.0.1', 'bleach': '6.0.0', 'plotly': '5.14.1', 'psutil': '5.9.4', 'traitlets': '5.9.0', 'nbformat': '5.8.0', 
'pyyaml': '5.4.1', 'jupyter-core': '5.3.0', 'decorator': '5.1.1', 'configobj': '5.0.8', 'mailchecker': '5.0.7', 
'dash-table': '5.0.0', 'smmap': '5.0.0', 'tqdm': '4.65.0', 'fonttools': '4.39.2', 'transformers': '4.27.2', 
'jsonschema': '4.17.3', 'importlib-metadata': '4.13.0', 'beautifulsoup4': '4.12.0', 'antlr4-python3-runtime': 
'4.9.3', 'lxml': '4.9.2', 'rsa': '4.9', 'pexpect': '4.8.0', 'pytest': '4.6.11', 'opencv-python-headless': 
'4.6.0.66', 'typing-extensions': '4.5.0', 'isort': '4.3.21', 'azure-keyvault-secrets': '4.3.0', 'tzlocal': '4.3',
'cachetools': '4.2.4', 'altair': '4.2.2', 'gitdb': '4.0.10', 'async-timeout': '4.0.2', 'bcrypt': '4.0.1', 
'singledispatch': '4.0.0', 'pytest-cov': '4.0.0', 'protobuf': '3.20.1', 'zipp': '3.15.0', 'ply': '3.11', 
'filelock': '3.10.2', 'aiohttp': '3.8.4', 'matplotlib': '3.7.1', 'h5py': '3.7.0', 'widgetsnbextension': '3.6.3', 
'anyio': '3.6.2', 'markdown': '3.4.3', 'idna': '3.4', 'nltk': '3.4', 'lightgbm': '3.3.5', 'oauthlib': '3.2.2', 
'flufl.lock': '3.2', 'pytest-mock': '3.2.0', 'azure-ai-formrecognizer': '3.2.0b3', 'gitpython': '3.1.18', 
'jinja2': '3.1.2', 'atpublic': '3.1.1', 'platformdirs': '3.1.1', 'threadpoolctl': '3.1.0', 'prompt-toolkit': 
'3.0.38', 'chardet': '3.0.4', 'intervaltree': '3.0.2', 'zc.lockfile': '3.0.post1', 'networkx': '3.0', 'watchdog':
'3.0.0', 'sagemaker': '2.117.0', 'boto': '2.49.0', 'requests': '2.28.1', 'imageio': '2.26.1', 'pycparser': 
'2.21', 'fastjsonschema': '2.16.3', 'pygments': '2.14.0', 'tensorboard': '2.12.0', 'aws-xray-sdk': '2.11.0', 
'psycopg2-binary': '2.9.5', 'dash': '2.9.3', 'python-dateutil': '2.8.1', 'portalocker': '2.7.0', 'pyjwt': 
'2.6.0', 'pyparsing': '2.4.7', 'pylint': '2.4.4', 'aiobotocore': '2.4.2', 'soupsieve': '2.4', 'sortedcontainers':
'2.4.0', 'astroid': '2.3.3', 'pygtrie': '2.3.2', 'omegaconf': '2.3.0', 'werkzeug': '2.2.3', 'flask': '2.2.3', 
'asttokens': '2.2.1', 'cloudpickle': '2.2.1', 'termcolor': '2.2.0', 'dpath': '2.1.4', 'markupsafe': '2.1.2', 
'itsdangerous': '2.1.2', 'base58': '2.1.1', 'charset-normalizer': '2.1.1', 'pycocotools': '2.0.6', 'mistune': 
'2.0.5', 'greenlet': '2.0.2', 'portion': '2.0.2', 'tomli': '2.0.1', 'dash-html-components': '2.0.0', 
'dash-core-components': '2.0.0', 'googleapis-common-protos': '1.56.4', 'grpcio': '1.51.3', 'google-auth': 
'1.35.0', 'botocore': '1.27.59', 'urllib3': '1.26.15', 'pypdf2': '1.26.0', 'boto3': '1.24.59', 'numpy': '1.24.0',
'jupyter-server': '1.23.6', 'azure-core': '1.22.1', 'pymupdf': '1.18.17', 'funcy': '1.18', 'msal': '1.17.0', 
'google-api-core': '1.17.0', 'six': '1.16.0', 'cffi': '1.15.1', 'mypy-boto3': '1.14.40.0', 'boto3-stubs': 
'1.14.40.0', 'dvc': '1.11.16', 'wrapt': '1.11.2', 'py': '1.11.0', 'torch': '1.11.0', 'backoff': '1.10.0', 
'scipy': '1.9.3', 'pydantic': '1.9.2', 'shapely': '1.8.5', 'yarl': '1.8.2', 'tensorboard-plugin-wit': '1.8.1', 
'alembic': '1.8.1', 'distro': '1.8.0', 'send2trash': '1.8.0', 'azure-identity': '1.8.0', 'ppft': '1.7.6.6', 
'passlib': '1.7.4', 'pysocks': '1.7.1', 'debugpy': '1.6.6', 'uamqp': '1.6.4', 'monotonic': '1.6', 'shtab': 
'1.5.8', 'nest-asyncio': '1.5.6', 'jsonpath-ng': '1.5.3', 'websocket-client': '1.5.1', 'blinker': '1.5', 
'pandocfilters': '1.5.0', 'sqlalchemy': '1.4.41', 'kiwisolver': '1.4.4', 'appdirs': '1.4.4', 'typed-ast': 
'1.4.3', 'lazy-object-proxy': '1.4.3', 'pydot': '1.4.2', 'dash-bootstrap-components': '1.4.1', 'atomicwrites': 
'1.4.1', 'pdf2image': '1.4.1', 'pywavelets': '1.4.1', 'analytics-python': '1.4.post1', 'absl-py': '1.4.0', 
'pyahocorasick': '1.4.0', 'unidecode': '1.3.6', 'frozenlist': '1.3.3', 'hydra-core': '1.3.2', 
'requests-oauthlib': '1.3.1', 'aiosignal': '1.3.1', 'sniffio': '1.3.0', 'pytest-datadir': '1.3.0', 
'text-unidecode': '1.3', 'mako': '1.2.4', 'tinycss2': '1.2.1', 'executing': '1.2.0', 'joblib': '1.2.0', 'xlrd': 
'1.2.0', 'azure-common': '1.1.28', 'pandas': '1.1.5', 'jupyterlab-widgets': '1.1.3', 'shortuuid': '1.0.11', 
'contourpy': '1.0.7', 'scikit-learn': '1.0.2', 'pymysql': '1.0.2', 'smdebug-rulesconfig': '1.0.1', 
'google-cloud-vision': '1.0.0', 'mypy-extensions': '1.0.0', 'requests-aws4auth': '1.0.0', 'streamlit': '0.77.0', 
'multiprocess': '0.70.14', 'wheel': '0.38.4', 'sqlalchemy-utils': '0.38.3', 'zenml': '0.37.0', 'python-benedict':
'0.30.0', 'cython': '0.29.33', 'dulwich': '0.21.3', 'python-dotenv': '0.21.0', 'pyrsistent': '0.19.3', 
'httplib2': '0.19.1', 'future': '0.18.3', 'validators': '0.18.2', 'jedi': '0.18.2', 'scikit-image': '0.18.1', 
'ruamel.yaml': '0.17.21', 'terminado': '0.17.1', 'prometheus-client': '0.16.0', 'huggingface-hub': '0.13.3', 
'tokenizers': '0.13.2', 'pluggy': '0.13.1', 'voluptuous': '0.13.1', 'xmltodict': '0.13.0', 'python-levenshtein': 
'0.12.1', 'torchvision': '0.12.0', 'toolz': '0.12.0', 'pathspec': '0.11.1', 'cycler': '0.11.0', 'torchaudio': 
'0.11.0', 'aioitertools': '0.11.0', 'toml': '0.10.2', 'imbalanced-learn': '0.10.1', 'python-terraform': '0.10.1',
'python-fsutil': '0.10.0', 'jmespath': '0.10.0', 'rtree': '0.9.5', 'commonmark': '0.9.1', 'tabulate': '0.9.0', 
'dictdiffer': '0.9.0', 'parso': '0.8.3', 'astor': '0.8.1', 'dgl': '0.8.0.post1', 'pydeck': '0.8.0', 
'pickleshare': '0.7.5', 'schema': '0.7.5', 'nbclient': '0.7.2', 'pytorch-crf': '0.7.2', 'defusedxml': '0.7.1', 
'ptyprocess': '0.7.0', 'tensorboard-data-server': '0.7.0', 'msrest': '0.6.21', 'stack-data': '0.6.2', 'isodate': 
'0.6.1', 'string-grouper': '0.6.1', 'mccabe': '0.6.1', 'multipledispatch': '0.6.0', 'grandalf': '0.6', 
's3transfer': '0.6.0', 'detectron2': '0.6', 'pyocr': '0.5.3', 'nbclassic': '0.5.3', 'nanotime': '0.5.2', 
'webencodings': '0.5.1', 'pytorch-nlp': '0.5.0', 'pyasn1': '0.4.8', 'colorama': '0.4.6', 'google-auth-oauthlib': 
'0.4.6', 'flatten-dict': '0.4.2', 'entrypoints': '0.4', 'dill': '0.3.6', 'pox': '0.3.2', 
'sparse-dot-topn-for-blocks': '0.3.1.post3', 'msal-extensions': '0.3.1', 'dash-cytoscape': '0.3.0', 
'click-params': '0.3.0', 'pathos': '0.3.0', 'pyasn1-modules': '0.2.8', 'ruamel.yaml.clib': '0.2.7', 'wcwidth': 
'0.2.6', 'jupyterlab-pygments': '0.2.2', 'notebook-shim': '0.2.2', 'pure-eval': '0.2.2', 'ipython-genutils': 
'0.2.0', 'backcall': '0.2.0', 'xmljson': '0.2.0', 'google-pasta': '0.2.0', 'iopath': '0.1.9', 'yacs': '0.1.8', 
'matplotlib-inline': '0.1.6', 'fvcore': '0.1.5.post20221221', 'wordninja': '0.1.5', 'protobuf3-to-dict': '0.1.5',
'comm': '0.1.3', 'pytz-deprecation-shim': '0.1.0.post0', 'sqlmodel': '0.0.8', 'topn': '0.0.7', 
'sqlalchemy2-stubs': '0.0.2a32', 'imblearn': '0.0'}
The attribute instance_type of class SagemakerStepOperatorConfig will be deprecated soon.
The stack francesco_stack contains components that require building Docker images. Older versions of ZenML always built these images locally, but since version 0.32.0 this behavior can be configured using the image_builder stack component. This stack will temporarily default to a local image builder that mirrors the previous behavior, but this will be removed in future versions of ZenML. Please add an image builder to this stack:
zenml image-builder register <NAME> ...
zenml stack update 1958ff2b-7f48-4d4a-944e-f51095dbeeed -i <NAME>

CURRENT STACK

Name: francesco_stack
ID: 1958ff2b-7f48-4d4a-944e-f51095dbeeed
Shared: No
User: francesco / 5135c97c-454d-4ea4-b04c-67c9b9464cef
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

ORCHESTRATOR: default

Name: default
ID: 987988bd-3b0d-4a12-b62d-38a479080d9a
Type: orchestrator
Flavor: local
Configuration: {}
Shared: No
User: francesco / 5135c97c-454d-4ea4-b04c-67c9b9464cef
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

ARTIFACT_STORE: francesco_store

Name: francesco_store
ID: 95497a65-abfd-4f93-a480-df52805ebc08
Type: artifact_store
Flavor: s3
Configuration: {'authentication_secret': None, 'path': 's3://zenml-store-francesco', 'key': '********', 'secret':
'********', 'token': '********', 'client_kwargs': None, 'config_kwargs': None, 's3_additional_kwargs': None}
Shared: No
User: francesco / 5135c97c-454d-4ea4-b04c-67c9b9464cef
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

CONTAINER_REGISTRY: ecr_registry

Name: ecr_registry
ID: a6c5fbfc-d79b-44b0-ab2b-5da060bd2ee0
Type: container_registry
Flavor: aws
Configuration: {'authentication_secret': None, 'uri': '470832953632.dkr.ecr.eu-west-1.amazonaws.com'}
Shared: Yes
User: default / b45d3a21-bf75-4f6f-9aac-88aeb712cf75
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

SECRETS_MANAGER: francesco-secret-store

Name: francesco-secret-store
ID: bf8a71a1-b33c-4a01-bc14-a4c91242629c
Type: secrets_manager
Flavor: aws
Configuration: {'scope': <SecretsManagerScope.COMPONENT: 'component'>, 'namespace': None, 'region_name': 
'eu-west-1'}
Shared: No
User: francesco / 5135c97c-454d-4ea4-b04c-67c9b9464cef
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

STEP_OPERATOR: sagemaker_trainer_cpu

Name: sagemaker_trainer_cpu
ID: 9aa6203f-4a27-47c9-9cd8-3d1c2cef7082
Type: step_operator
Flavor: sagemaker
Configuration: {'instance_type': 'ml.c5.18xlarge', 'experiment_name': None, 'input_data_s3_uri': None, 
'estimator_args': {}, 'role': 'notebook-role-datascience-zenml', 'bucket': None}
Shared: Yes
User: francesco / 5135c97c-454d-4ea4-b04c-67c9b9464cef
Workspace: default / 375b424b-a9f0-4806-aaac-20c3d6932740

What happened?

When running a pipeline with a SageMaker step operator, creating the training job fails if the pipeline name is medium-sized, because the generated training job name exceeds the 63-character limit:

An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'generic-training-pipeline-2023-04-12-15-36-41-124330-model-trainer' at 'trainingJobName' failed to satisfy constraint: Member must have length less than or equal to 63
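To make the failure concrete, the rejected name can be measured directly. This is only an illustrative check, not ZenML or SageMaker code. The constant and the helper name `exceeds_limit` are made up for this sketch, and the limit of 63 is taken from the error message itself.

```python
# Limit quoted in the ValidationException; the constant name is an assumption.
MAX_TRAINING_JOB_NAME_LENGTH = 63

# Job name copied verbatim from the error message.
job_name = "generic-training-pipeline-2023-04-12-15-36-41-124330-model-trainer"


def exceeds_limit(name: str, limit: int = MAX_TRAINING_JOB_NAME_LENGTH) -> bool:
    """Return True if the name is longer than the allowed limit."""
    return len(name) > limit


# The name is 66 characters long, so it is 3 characters over the limit.
print(len(job_name), exceeds_limit(job_name))
```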

Reproduction steps

  1. Create a pipeline with only one step
  2. Give it a medium-sized name, such as 'generic-training-pipeline'
  3. Define a SageMaker step operator
  4. Use the step decorator to run the step on SageMaker ...
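The sketch below estimates how long a pipeline name can be before the job name crosses the limit. It assumes the job name has the layout `<pipeline name>-<timestamp>-<step name>`. That layout and the timestamp format are inferred from the failing name in the error message and are not taken from ZenML's source. The function names are hypothetical.

```python
from datetime import datetime

MAX_JOB_NAME_LENGTH = 63  # limit quoted in the ValidationException
# Assumed timestamp layout, inferred from the failing job name.
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S-%f"


def build_job_name(pipeline_name: str, step_name: str, now: datetime) -> str:
    """Hypothetical reconstruction of the name layout seen in the error."""
    return f"{pipeline_name}-{now.strftime(TIMESTAMP_FORMAT)}-{step_name}"


def max_pipeline_name_length(step_name: str) -> int:
    """Longest pipeline name that still fits, under the assumed layout."""
    timestamp_length = len(datetime(2000, 1, 1).strftime(TIMESTAMP_FORMAT))
    # Two hyphens join the three parts of the name.
    return MAX_JOB_NAME_LENGTH - timestamp_length - len(step_name) - 2


now = datetime(2023, 4, 12, 15, 36, 41, 124330)
name = build_job_name("generic-training-pipeline", "model-trainer", now)
print(name, len(name))
print(max_pipeline_name_length("model-trainer"))
```

Under these assumptions, a step named 'model-trainer' leaves room for a pipeline name of at most 22 characters, while 'generic-training-pipeline' has 25.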

Relevant log output

An exception occurred: ClientError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'generic-training-pipeline-2023-04-13-06-05-26-177259-model-trainer' at 'trainingJobName' failed to satisfy constraint: Member must have length less than or equal to 63
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/session.py", line 611, in submit
    self.sagemaker_client.create_training_job(**request)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/session.py", line 4344, in _intercept_create_request
    return create(request)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/session.py", line 613, in train
    self._intercept_create_request(train_request, submit, self.train.__name__)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/estimator.py", line 2042, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/estimator.py", line 1125, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 272, in wrapper
    return run_func(*args, **kwargs)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/integrations/aws/step_operators/sagemaker_step_operator.py", line 207, in launch
    estimator.fit(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/step_launcher.py", line 430, in _run_step_with_step_operator
    step_operator.launch(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/step_launcher.py", line 375, in _run_step
    self._run_step_with_step_operator(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/step_launcher.py", line 198, in launch
    self._run_step(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/base_orchestrator.py", line 186, in run_step
    launcher.launch()
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/local/local_orchestrator.py", line 82, in prepare_or_run_pipeline
    self.run_step(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/orchestrators/base_orchestrator.py", line 166, in run
    result = self.prepare_or_run_pipeline(
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/stack/stack.py", line 864, in deploy_pipeline
    return self.orchestrator.run(deployment=deployment, stack=self)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/site-packages/zenml/pipelines/base_pipeline.py", line 599, in run
    stack.deploy_pipeline(deployment=deployment_model)
  File "/home/francesco/repos/shipamax-vulcan-review/data-science/src/modelling_tools/zenml/run_standard_training_pipeline.py", line 46, in <module>
    pipeline.run(config_path="standard_training_config.yaml")
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/francesco/miniconda3/envs/zenml/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'generic-training-pipeline-2023-04-13-06-05-26-177259-model-trainer' at 'trainingJobName' failed to satisfy constraint: Member must have length less than or equal to 63


strickvl commented 1 year ago

Thanks for this report. @christianversloot's #1505, which fixes this (and the previous issue #1502), was merged into develop very recently. It'll be in the next release. Thanks for reporting it, though!