zenml-io / zenml

ZenML 🙏: The bridge between ML and Ops. https://zenml.io.
Apache License 2.0

[FEATURE] Implement `zenml clean` #34

Closed hamzamaiot closed 2 years ago

harshasridhar commented 3 years ago

Hi, I'd like to work on this issue. Please help me out with the details.

htahir1 commented 3 years ago

Thank you @harshasridhar, the contribution is greatly appreciated!

Here are a few pointers:

When the user runs `zenml clean`, the following needs to happen:

For each pipeline in the pipeline_store specified in the zenml_config, you need to delete the metadata_store and the artifact_store. Here is how:

Finally, the pipeline_store needs to be deleted.
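
The order of operations described above could be sketched roughly as follows. This is only an illustration, not ZenML's actual API: `PipelineRecord`, `remove_path`, and `clean` are hypothetical names, and the sketch assumes that every store lives at a local filesystem path.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable


@dataclass
class PipelineRecord:
    """Hypothetical stand-in for a stored pipeline; not a ZenML class."""

    name: str
    metadata_store: Path  # assumption: a local file or directory
    artifact_store: Path  # assumption: a local directory


def remove_path(path: Path) -> None:
    """Delete a file or a directory tree; do nothing if it is already gone."""
    if path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()


def clean(pipelines: Iterable[PipelineRecord], pipeline_store: Path) -> None:
    """Delete each pipeline's stores first, then the pipeline store itself."""
    for pipeline in pipelines:
        remove_path(pipeline.metadata_store)
        remove_path(pipeline.artifact_store)
    # The pipeline store goes last, since it is what lists the pipelines.
    remove_path(pipeline_store)
```

Deleting the pipeline store last matters in this sketch because the store is the only place the pipelines are enumerated from; removing it first would leave orphaned metadata and artifacts behind.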

The docs go into some detail on each of the concepts above: https://docs.zenml.io . I hope that's good as a starting point, but it might require more discussion. Please feel free to join the Slack to chat directly. Thanks again for your effort!

SKRohit commented 3 years ago

@htahir1 I am looking into this issue. Here is what I understood, along with a few doubts:

  1. Every BasePipeline object has metadata_store and artifact_store attributes, so would deleting those for each pipeline be enough?
  2. Every BasePipeline object also has a datasource attribute, which is a BaseDatasource object with its own metadata_store and artifact_store. Should we consider those for deletion as well? In my opinion, they should be deleted separately, since the artifact_store and metadata_store of datasources and pipelines could be different. Let me know your thoughts.
  3. Should zenml clean also delete datasources which are not related to any pipeline?

htahir1 commented 3 years ago

Thanks for the well-thought-out comments @SKRohit. Here are my answers:

  1. Yes, I believe that would be enough.
  2. Every datasource is connected to at least one data pipeline, so if you delete all pipelines you will, in essence, delete all datasources.
  3. See above: if you delete all pipelines, then there should be no datasources left, as each datasource produces a data pipeline per commit.

In general, we are preparing a big internal change for next month that will rewrite a lot of this logic and make things easier. For now, please implement the simplest possible logic that goes through the pipelines and deletes their artifact and metadata stores. Please try to keep the functions decoupled, as they might still be useful after the refactor! Thanks!
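
One way to keep the pieces decoupled is to give each deletion step its own small function and have them depend only on a narrow interface. The sketch below assumes that stores expose a `delete()` method, which is an assumption and not something confirmed in this thread; `Deletable`, `PipelineLike`, and all function names are hypothetical. Only the `metadata_store` and `artifact_store` attribute names come from the discussion above.

```python
from typing import Iterable, Protocol


class Deletable(Protocol):
    """Anything with a delete() method; hypothetical, not a ZenML interface."""

    def delete(self) -> None: ...


class PipelineLike(Protocol):
    """Only the two attributes discussed in this thread are required."""

    metadata_store: Deletable
    artifact_store: Deletable


def delete_metadata_store(pipeline: PipelineLike) -> None:
    pipeline.metadata_store.delete()


def delete_artifact_store(pipeline: PipelineLike) -> None:
    pipeline.artifact_store.delete()


def clean_pipeline(pipeline: PipelineLike) -> None:
    """Remove everything that belongs to a single pipeline."""
    delete_metadata_store(pipeline)
    delete_artifact_store(pipeline)


def clean_all(pipelines: Iterable[PipelineLike], pipeline_store: Deletable) -> None:
    """Clean every pipeline, then delete the pipeline store itself."""
    for pipeline in pipelines:
        clean_pipeline(pipeline)
    pipeline_store.delete()
```

Because each function relies only on the `delete()` assumption, the store-specific logic could be swapped out after a refactor without touching `clean_pipeline` or `clean_all`.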

htahir1 commented 2 years ago

#540 is addressing this now in a simpler way.

strickvl commented 2 years ago

This has now been implemented in #540, so I'm going to close this.