zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[BUG]: The tags for the AWS Sagemaker orchestrator are passed in the wrong format #1800

Closed mnschmit closed 1 year ago

mnschmit commented 1 year ago

Contact Details [Optional]

martin.schmitt@celebrate.company

System Information

ZENML_LOCAL_VERSION: 0.44.1
ZENML_SERVER_VERSION: 0.44.1
ZENML_SERVER_DATABASE: sqlite
ZENML_SERVER_DEPLOYMENT_TYPE: other
ZENML_CONFIG_DIR: /home/martin/.config/zenml
ZENML_LOCAL_STORE_DIR: /home/martin/.config/zenml/local_stores
ZENML_ACTIVE_REPOSITORY_ROOT: /home/martin/repositories/token-classification
PYTHON_VERSION: 3.9.13
ENVIRONMENT: native
SYSTEM_INFO: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '22.04'}
ACTIVE_WORKSPACE: default
ACTIVE_STACK: default_cloud
ACTIVE_USER: celebrate-zenml
TELEMETRY_STATUS: enabled
ANALYTICS_CLIENT_ID: e395e99f-5f7c-4a71-8978-a2bb3923d78f
ANALYTICS_USER_ID: c4d9dc1e-0523-41da-86cb-9c280b6b26e9
ANALYTICS_SERVER_ID: 92b436e1-7e0f-42c3-8a9f-bc8f879a06f5
INTEGRATIONS: ['aws', 'github', 'huggingface', 'kaniko', 'pillow', 's3', 'scipy', 'wandb']
PACKAGES: {'babel': '2.12.1', 'deprecated': '1.2.14', 'gitpython': '3.1.34', 'jinja2': '3.1.2', 'mako': '1.2.4', 'markupsafe': '2.1.3', 
'pillow': '10.0.0', 'pygithub': '2.0.1rc0', 'pyjwt': '2.8.0', 'pymysql': '1.0.3', 'pynacl': '1.5.0', 'pyyaml': '6.0.1', 'pygments': 
'2.16.1', 'sqlalchemy': '1.4.41', 'sqlalchemy-utils': '0.38.3', 'send2trash': '1.8.2', 'accelerate': '0.22.0', 'aiobotocore': '2.5.2', 
'aiohttp': '3.8.5', 'aioitertools': '0.11.0', 'aiosignal': '1.3.1', 'alembic': '1.8.1', 'analytics-python': '1.4.post1', 'anyio': 
'4.0.0', 'appdirs': '1.4.4', 'argilla': '1.15.0', 'argon2-cffi': '23.1.0', 'argon2-cffi-bindings': '21.2.0', 'arrow': '1.2.3', 
'astroid': '2.15.6', 'asttokens': '2.4.0', 'async-lru': '2.0.4', 'async-timeout': '4.0.3', 'attrs': '22.2.0', 'azure-common': '1.1.28', 
'azure-core': '1.29.3', 'azure-mgmt-core': '1.4.0', 'azure-mgmt-resource': '23.1.0b2', 'backcall': '0.2.0', 'backoff': '2.2.1', 
'bcrypt': '4.0.1', 'beautifulsoup4': '4.12.2', 'black': '23.7.0', 'bleach': '6.0.0', 'boto3': '1.26.76', 'botocore': '1.29.161', 
'cachetools': '5.3.1', 'certifi': '2023.7.22', 'cffi': '1.15.1', 'charset-normalizer': '3.2.0', 'click': '8.1.3', 'click-params': 
'0.3.0', 'cloudpickle': '2.2.1', 'cmake': '3.27.4.1', 'comm': '0.1.4', 'commonmark': '0.9.1', 'contextlib2': '21.6.0', 'cryptography': 
'41.0.3', 'datasets': '2.14.5', 'debugpy': '1.6.7.post1', 'decorator': '5.1.1', 'defusedxml': '0.7.1', 'dill': '0.3.7', 'distro': 
'1.8.0', 'docker': '6.1.3', 'docker-pycreds': '0.4.0', 'evaluate': '0.4.0', 'exceptiongroup': '1.1.3', 'executing': '1.2.0', 
'fastjsonschema': '2.18.0', 'filelock': '3.12.3', 'flake8': '6.0.0', 'fqdn': '1.5.1', 'frozenlist': '1.4.0', 'fsspec': '2023.4.0', 
'gitdb': '4.0.10', 'google-auth': '2.22.0', 'google-pasta': '0.2.0', 'greenlet': '3.0.0rc1', 'h11': '0.14.0', 'httpcore': '0.16.3', 
'httplib2': '0.19.1', 'httpx': '0.23.3', 'huggingface-hub': '0.16.4', 'idna': '3.4', 'importlib-metadata': '4.13.0', 'iniconfig': 
'2.0.0', 'ipykernel': '6.25.2', 'ipython': '8.15.0', 'ipython-genutils': '0.2.0', 'ipywidgets': '7.8.0', 'isodate': '0.6.1', 
'isoduration': '20.11.0', 'isort': '5.12.0', 'jedi': '0.19.0', 'jmespath': '1.0.1', 'joblib': '1.3.2', 'json5': '0.9.14', 'jsonpointer':
'2.4', 'jsonschema': '4.19.0', 'jsonschema-specifications': '2023.7.1', 'jupyter-client': '8.3.1', 'jupyter-core': '5.3.1', 
'jupyter-events': '0.7.0', 'jupyter-lsp': '2.2.0', 'jupyter-server': '2.7.3', 'jupyter-server-terminals': '0.4.4', 'jupyterlab': 
'4.1.0a1', 'jupyterlab-pygments': '0.2.2', 'jupyterlab-server': '2.24.0', 'jupyterlab-widgets': '2.0.0b1', 'kubernetes': '28.1.0a1', 
'lazy-object-proxy': '1.9.0', 'lit': '17.0.0rc4', 'matplotlib-inline': '0.1.6', 'mccabe': '0.7.0', 'mistune': '3.0.1', 'monotonic': 
'1.6', 'mpmath': '1.3.0', 'multidict': '6.0.4', 'multiprocess': '0.70.15', 'mypy': '1.4.1', 'mypy-extensions': '1.0.0', 'nbclient': 
'0.8.0', 'nbconvert': '7.8.0', 'nbformat': '5.9.2', 'nest-asyncio': '1.5.7', 'networkx': '3.1', 'notebook': '7.0.3', 'notebook-shim': 
'0.2.3', 'numpy': '1.23.5', 'nvidia-cublas-cu11': '11.10.3.66', 'nvidia-cuda-cupti-cu11': '11.7.101', 'nvidia-cuda-nvrtc-cu11': 
'11.7.99', 'nvidia-cuda-runtime-cu11': '11.7.99', 'nvidia-cudnn-cu11': '8.5.0.96', 'nvidia-cufft-cu11': '10.9.0.58', 
'nvidia-curand-cu11': '10.2.10.91', 'nvidia-cusolver-cu11': '11.4.0.1', 'nvidia-cusparse-cu11': '11.7.4.91', 'nvidia-nccl-cu11': 
'2.14.3', 'nvidia-nvtx-cu11': '11.7.91', 'oauthlib': '3.2.2', 'overrides': '7.4.0', 'packaging': '23.1', 'pandas': '1.5.3', 
'pandocfilters': '1.5.0', 'parso': '0.8.3', 'passlib': '1.7.4', 'pathos': '0.3.1', 'pathspec': '0.11.1', 'pathtools': '0.1.2', 
'pexpect': '4.8.0', 'pickleshare': '0.7.5', 'pip': '22.0.2', 'platformdirs': '3.10.0', 'pluggy': '1.2.0', 'pox': '0.3.3', 'ppft': 
'1.7.6.7', 'prometheus-client': '0.17.1', 'prompt-toolkit': '3.0.39', 'protobuf': '3.20.3', 'protobuf3-to-dict': '0.1.5', 'psutil': 
'5.9.5', 'ptyprocess': '0.7.0', 'pure-eval': '0.2.2', 'pyarrow': '13.0.0', 'pyasn1': '0.5.0', 'pyasn1-modules': '0.3.0', 'pycodestyle': 
'2.10.0', 'pycparser': '2.21', 'pydantic': '1.10.12', 'pyflakes': '3.0.1', 'pylint': '3.0.0a6', 'pyparsing': '2.4.7', 'pytest': '7.4.0',
'python-dateutil': '2.8.2', 'python-json-logger': '2.0.7', 'python-terraform': '0.10.1', 'pytz': '2023.3.post1', 'pyzmq': '25.1.1', 
'referencing': '0.30.2', 'regex': '2023.8.8', 'requests': '2.31.0', 'requests-oauthlib': '1.3.1', 'responses': '0.18.0', 
'rfc3339-validator': '0.1.4', 'rfc3986': '1.5.0', 'rfc3986-validator': '0.1.1', 'rich': '12.6.0', 'rpds-py': '0.10.2', 'rsa': '4.9', 
's3fs': '2023.4.0', 's3transfer': '0.6.2', 'sagemaker': '2.117.0', 'schema': '0.7.5', 'scikit-learn': '1.3.0', 'scipy': '1.11.2', 
'sentencepiece': '0.1.99', 'sentry-sdk': '1.30.0', 'seqeval': '1.2.2', 'setproctitle': '1.3.2', 'setuptools': '68.2.0', 'six': '1.16.0',
'smdebug-rulesconfig': '1.0.1', 'smmap': '5.0.0', 'sniffio': '1.3.0', 'soupsieve': '2.5', 'sqlalchemy2-stubs': '0.0.2a35', 'sqlmodel': 
'0.0.8', 'stack-data': '0.6.2', 'sympy': '1.12', 'terminado': '0.17.1', 'threadpoolctl': '3.2.0', 'tinycss2': '1.2.1', 'tokenizers': 
'0.13.4rc3', 'tomli': '2.0.1', 'tomlkit': '0.11.8', 'tornado': '6.3.3', 'tqdm': '4.66.1', 'traitlets': '5.9.0', 'transformers': 
'4.29.2', 'triton': '2.0.0', 'typer': '0.7.0', 'typing-extensions': '4.7.1', 'uri-template': '1.3.0', 'urllib3': '1.26.16', 
'validators': '0.18.2', 'wandb': '0.15.10', 'wcwidth': '0.2.6', 'webcolors': '1.13', 'webencodings': '0.5.1', 'websocket-client': 
'1.6.2', 'wheel': '0.41.2', 'widgetsnbextension': '3.6.5', 'wrapt': '1.14.1', 'xxhash': '3.3.0', 'yarl': '1.9.2', 'zenml': '0.44.1', 
'zipp': '3.16.2'}

CURRENT STACK

Name: default_cloud
ID: d0040a28-a86f-47fa-85ec-e65506534639
Shared: No
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

ORCHESTRATOR: sagemaker-service-platform

Name: sagemaker-service-platform
ID: e6c3649c-7dfa-42d3-8b56-3aa172ada938
Type: orchestrator
Flavor: sagemaker
Configuration: {'instance_type': 'ml.t3.medium', 'processor_role': '', 'volume_size_in_gb': 30, 'max_runtime_in_seconds': 86400, 
'processor_tags': {'squad': 'servicePlatform', 'application': 'machine-learning', 'domain': 'OrderToProduction', 'environment': 'prod'},
'processor_args': {}, 'input_data_s3_mode': 'File', 'input_data_s3_uri': {}, 'output_data_s3_mode': 'EndOfJob', 'output_data_s3_uri': 
{}, 'synchronous': False, 'execution_role': 'arn:aws:iam::************:role/zenml-sagemaker-role', 'bucket': ''}
Shared: No
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

ARTIFACT_STORE: s3

Name: s3
ID: c89b611a-d711-46f6-8c93-6b369f75e4c1
Type: artifact_store
Flavor: s3
Configuration: {'authentication_secret': None, 'path': 's3://celebrate-zenml-artifact-store', 'key': '********', 'secret': '********', 
'token': '********', 'client_kwargs': None, 'config_kwargs': None, 's3_additional_kwargs': None}
Shared: No
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

CONTAINER_REGISTRY: ecr-default

Name: ecr-default
ID: d3ac2845-4fe5-4505-8b17-2c385563098b
Type: container_registry
Flavor: default
Configuration: {'authentication_secret': None, 'uri': '************.dkr.ecr.************.amazonaws.com'}
Shared: No
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

EXPERIMENT_TRACKER: wandb-celebrate-ai

Name: wandb-celebrate-ai
ID: 6e70d0ac-a559-4bf1-ad24-0ddbd33f945c
Type: experiment_tracker
Flavor: wandb
Configuration: {'run_name': None, 'tags': [], 'settings': {}, 'api_key': '********', 'entity': 'celebrate-ai', 'project_name': None}
Shared: No
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

IMAGE_BUILDER: local_image_builder

Name: local_image_builder
ID: 98652c71-f895-4357-b5f7-226b6974b3e1
Type: image_builder
Flavor: local
Configuration: {}
Shared: Yes
User: celebrate-zenml / c4d9dc1e-0523-41da-86cb-9c280b6b26e9
Workspace: default / a7b32c3d-ec03-46f6-a382-69a0690c9dc7

What happened?

I specified tags in my Sagemaker orchestrator, i.e., key-value pairs we can use in AWS to describe resources, e.g., for billing purposes. Once I did that, pipeline runs were failing with the following error:

ClientError: An error occurred (ValidationException) when calling the CreatePipeline operation: Unable to parse pipeline definition. 
Unknown Argument member 'squad'.

(squad is one of my custom tags; see config of Sagemaker orchestrator above)

I figured out that the underlying sagemaker.processing.Processor expects its tags parameter (list[dict[str,str]]) in the following format: tags=[{"Key": "Project", "Value": "MyProject"}, {"Key": "Environment", "Value": "Development"}], i.e., with a dedicated dict per tag with the fixed keys Key (uppercase!) and Value. The current ZenML implementation instead passes a list with a single dictionary containing all the tags, like this: [{'key1': 'value1', 'key2': 'value2'}].

A quick fix for this would be to change this line to

[{"Key": key, "Value": value} for key, value in step_settings.processor_tags.items()]

That fixed it immediately for me. However, I am opening this issue instead of a PR because I am unsure whether the tags should already be stored in the right format before this line is executed, i.e., somewhere else, rather than only being converted at the moment they are submitted to create a pipeline with the Sagemaker orchestrator. I am not sure, but a similar issue may exist with the Sagemaker step operator. So it might be preferable to store the tags in the ZenML server in the format that the Sagemaker API expects.
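
For illustration, the conversion described above can be written as a small standalone helper. This is only a minimal sketch: the function name to_aws_tags is hypothetical and not part of ZenML or the SageMaker SDK, and the sample tags are taken from the orchestrator configuration above.

```python
from typing import Dict, List


def to_aws_tags(tags: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a flat {name: value} mapping into a list of Key/Value dicts.

    Hypothetical helper (not a ZenML or SageMaker function); it produces
    one dict per tag with the fixed keys "Key" and "Value".
    """
    return [{"Key": key, "Value": value} for key, value in tags.items()]


# Tags as they appear in the orchestrator configuration (flat dict).
processor_tags = {"squad": "servicePlatform", "environment": "prod"}

# One dict per tag, in the format described in this issue.
aws_tags = to_aws_tags(processor_tags)
print(aws_tags)
```

An empty mapping yields an empty list, so the helper can be applied unconditionally.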

Reproduction steps

  1. Register a Sagemaker orchestrator
  2. Provide tags in its configuration
  3. Start a pipeline using this orchestrator

Relevant log output

No response


fa9r commented 1 year ago

@mnschmit Thanks for surfacing this issue and the detailed analysis. I'm looking into this now and will try the fix you suggested. Will keep you posted.

mnschmit commented 1 year ago

Great, thank you, @fa9r. It would be great if there was a direct fix for all AWS-related tagging in any (possibly future) components.

fa9r commented 1 year ago

Nice, your proposed fix worked like a charm. I included it in #1799 and will make sure it gets merged before the next ZenML release.

IMO our current tag representation (simple dict) is much more intuitive than the representation that sagemaker expects. The tags are also not used in any other place, so I think it is fine to only adjust the format right before submitting the pipeline. The sagemaker step operator doesn't even support tags atm, so nothing to be done there. Thus, your fix was exactly what needed to be changed, thanks again for the hint :)
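
A rough sketch of that design choice, keeping the simple dict in the stored settings and converting only at the point of submission. ExampleSettings and build_processor_kwargs are hypothetical names used for illustration here; they are not the actual ZenML classes or functions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ExampleSettings:
    """Hypothetical stand-in for orchestrator settings (not a ZenML class)."""

    instance_type: str = "ml.t3.medium"
    # Stored in the simple, user-facing dict form.
    processor_tags: Dict[str, str] = field(default_factory=dict)


def build_processor_kwargs(settings: ExampleSettings) -> Dict[str, Any]:
    """Assemble keyword arguments right before submission (illustrative only)."""
    kwargs: Dict[str, Any] = {"instance_type": settings.instance_type}
    if settings.processor_tags:
        # The Key/Value list form is produced only at this boundary.
        kwargs["tags"] = [
            {"Key": key, "Value": value}
            for key, value in settings.processor_tags.items()
        ]
    return kwargs


settings = ExampleSettings(processor_tags={"domain": "OrderToProduction"})
print(build_processor_kwargs(settings))
```

With this split, the stored configuration stays unchanged and only the code that talks to SageMaker needs to know about the Key/Value format.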

mnschmit commented 1 year ago

Cool, thanks for the update :+1: :slightly_smiling_face: