wdm0006 / DummyRDD

A pure python mock of pyspark's RDD
http://wdm0006.github.io/DummyRDD/
BSD 3-Clause "New" or "Revised" License

Need documentation on S3 use #26

Open codebynumbers opened 7 years ago

codebynumbers commented 7 years ago

It's not clear from the docs that TinyS3 is required, and it's not obvious how to set the AWS keys without digging through the code. It would also be nice if it supported profiles the way boto does, but that seems to be a limitation of TinyS3.

wdm0006 commented 7 years ago

Thanks for the input; you're definitely right about the documentation. As for boto vs. tinys3, what extra would boto allow that still matches how you would interact with Spark? I've not used profiles in either Spark or boto, so I'm not sure what that would look like.

codebynumbers commented 7 years ago

The nice thing about boto is that you don't have to specify your credentials in code at all: it reads them from files on disk (~/.aws/credentials). You can also set up multiple credential sets in a single file as profile sections and then specify which profile name to use in code. This is helpful if you have multiple accounts/IAM roles and need to switch between them quickly.
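
For illustration, here's roughly what that lookup looks like with boto3 (the profile names here are made up):

```python
import boto3

# Default lookup: credentials come from ~/.aws/credentials, environment
# variables, or an attached IAM role -- nothing hard-coded.
s3 = boto3.client('s3')

# Named profile: pick one of several credential sets defined as
# [work] / [personal] sections in ~/.aws/credentials.
session = boto3.Session(profile_name='work')
s3_work = session.client('s3')
```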

When we use Spark, we usually configure it with IAM roles that control which S3 files it has access to, i.e., we aren't embedding credentials in config files at all. That said, I think the biggest hurdle was the documentation, more than the profiles.
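
For reference, the Spark side of that is roughly the sketch below; the exact S3A settings are an assumption and depend on the deployment, but the point is that no keys appear anywhere:

```python
from pyspark import SparkConf, SparkContext

# No credentials in code or config: tell the S3A filesystem to pull
# them from the EC2 instance's IAM role instead.
conf = (SparkConf()
        .set('spark.hadoop.fs.s3a.aws.credentials.provider',
             'com.amazonaws.auth.InstanceProfileCredentialsProvider'))
sc = SparkContext(conf=conf)

# Hypothetical bucket; access is governed entirely by the IAM role.
rdd = sc.textFile('s3a://some-bucket/some-prefix/part-*')
```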

wdm0006 commented 7 years ago

Ah, I see what you mean; that does make sense. Let's tackle the documentation issue first, though. I'd like to basically copy the pyspark docs for the implemented methods, since the idea is for them to work the same way. Separately, I think some examples would be helpful for things like pulling files from S3 or accessing RDD data directly for debugging, along the lines of the sketch below. Do you think that would have been enough in your situation?
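
As a strawman, the S3 example might look something like this. The imports follow the README, the bucket/key are placeholders, and I'm assuming for the sake of the sketch that the keys can be supplied via environment variables; nailing down the actual mechanism is exactly what the docs need to cover:

```python
import os
from dummy_spark import SparkContext, SparkConf

# Assumption: tinys3 needs the keys exposed somewhere, and environment
# variables are a guess -- the real mechanism is what this issue asks
# to have documented.
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_KEY'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET'

sc = SparkContext(master='', conf=SparkConf())

# Same call shape as pyspark's textFile (bucket/key are hypothetical).
rdd = sc.textFile('s3://some-bucket/some/key.txt')

# DummyRDD keeps the data as a plain Python list, so you can collect()
# and poke at it directly while debugging.
print(rdd.collect()[:10])
```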

codebynumbers commented 7 years ago

Yeah that would have been perfect.
