Add Support for Exports Read/Writing From S3 Buckets

1nv8rzim commented 2 months ago

Purpose

One of the things I have been working towards is a stateless deploy of yeti in order to deploy several replicas of the celery runner and api containers. The biggest blocker for this is that a shared volumes between several containers is an anti-pattern and not supported on the cluster I am deploying to.

This PR introduces the ability to replaces this shared volume with an S3 Complaint Bucket

Changes

When system.export_path is prefixed with s3:// it will attempt to use S3 as the storage mediums for exports
- Example: Setting system.export_path to s3://bucket_name would use bucket_name for storage of export task results
Celery Runner can now upload results export tasks to an s3 compliant bucket
API is able to read results of export task from an s3 compliant bucket
Credentials and configuration for s3 buckets access should be injected by environment variables

1nv8rzim commented 2 months ago

This works; I am drafting a more generalized approach here so that this same thing can be extended past s3 in the future

tomchop commented 2 months ago

This is a great feature, thanks a lot!

This works; I am drafting a more generalized approach https://github.com/1nv8rzim/yeti/pull/2 so that this same thing can be extended past s3 in the future

That was going to be my main comment on this - it would be great if the core of this was provider agnostic. Please let us know once the generalized approach is ready for review!

Another thing that came to mind is documentation for this (especially the env variable part) Would you mind adding a snippet .md file to https://github.com/yeti-platform/yeti-platform.github.io/tree/main/content/docs describing how to do set this up? I'll make sure it ends up in the right directory structure so you don't need to mind any of the hugo stuff.

Thanks again!

1nv8rzim commented 2 months ago

Another thing that came to mind is documentation for this (especially the env variable part) Would you mind adding a snippet .md file to yeti-platform/yeti-platform.github.io@main/content/docs describing how to do set this up? I'll make sure it ends up in the right directory structure so you don't need to mind any of the hugo stuff.

Would something like this work? https://github.com/yeti-platform/yeti-platform.github.io/pull/8

That was going to be my main comment on this - it would be great if the core of this was provider agnostic. Please let us know once the generalized approach is ready for review!

Will do! I have tested the I PR I linked locally, need to test in our full deploy though

tomchop commented 2 months ago

Yes, the doc PR looks great! I might swoop in and change some typos / the layout of the page but it looks good.

Let me know once both are ready to review! Thanks again :)

1nv8rzim commented 2 months ago

@tomchop Merged in the code for the Generic Storage Clients. Tested and confirmed everything works. Should be ready for a full review

1nv8rzim commented 2 months ago

#1141

Based on the order that ^^ and the current PR are merged; we should think about moving over the the new template directory to use the persistient storage clients.

1nv8rzim commented 2 months ago

There's a systemic typo: Persist**i**ent (should be Persistent). But, maybe we could just rename this to "FileStorage"?

Fixed in 5a3ea15

tomchop commented 1 month ago

Remove templates from database, move to filesystem #1141

Based on the order that ^^ and the current PR are merged; we should think about moving over the the new template directory to use the persistient storage clients.

Thinking back on this again - I think I'm gonna merge https://github.com/yeti-platform/yeti/pull/1141/files as it is, we need to think a bit more as to whether we want to be able to load templates from outside the infrastructure where Yeti lives given that templates can lead to RCE (the whole reason we're moving them out of the db into the filesystem)

1nv8rzim commented 1 month ago

Remove templates from database, move to filesystem #1141

Based on the order that ^^ and the current PR are merged; we should think about moving over the the new template directory to use the persistient storage clients.

Thinking back on this again - I think I'm gonna merge #1141 (files) as it is, we need to think a bit more as to whether we want to be able to load templates from outside the infrastructure where Yeti lives given that templates can lead to RCE (the whole reason we're moving them out of the db into the filesystem)

Right now we are going through the process internally of making the api and tasks containers to being stateless so that we can run them in replicasets. This is actually the core reason behind this contribution.

~~The move of templates from the database to filesystem breaks this statelessness. Having an option to make this change stateless is something we would be very interested in.~~

~~Likewise, what is the attack vector that specifically makes storage of templates in arangodb vulnerable to RCE?~~

As I understood the entire process of having user editable export templates allows RCE through the templates themselves which would be agnostic to wherever they are stored. This would just be an accepted risk because we would assume authenticated should not preform RCE.

~~Similarly to all of this, if we can trust the file system with storing these templates, shouldn't using a remote storage solution like s3 bucket be just as secure?~~

Didn't notice it before but the PR removes the ability to edit these templates from the UI. This would work for us since it pretty much makes the templates static and we can just add them directly in at build time and they can be handled as part of the deploy.

tomchop commented 1 month ago

Looks like some tests are failing (bad imports), can you please fix this and the ruff errors?

yeti-platform / yeti

Add Support for Exports Read/Writing From S3 Buckets #1137

Purpose

Changes