+1 to using EMR rather than building and hosting a solution ourselves
If we’re talking about warehousing, is it worth considering Redshift? https://aws.amazon.com/redshift/?nc2=h_l3_db
And confuse the heck out of astronomers? :+1:
It seems like it would be able to load data from S3 the same way a Spark job would, except the data would stay persistent. I'd like to get the Spark job Ed wrote running first (it needs adapting to load from per-day CSVs, but other than that it should deploy easily), but we should definitely look at Redshift after that.
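For reference, the per-day loading could look roughly like this in PySpark; the bucket name and key layout are made up, and the real job may differ:

```python
# Rough sketch: load per-day classification CSV dumps from S3 into Spark.
# The bucket name and key layout are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classifications").getOrCreate()

# One file per day, e.g. s3://zooniverse-dumps/classifications/2015-10-25.csv;
# a glob pulls every daily file into a single DataFrame.
classifications = spark.read.csv(
    "s3://zooniverse-dumps/classifications/*.csv",
    header=True,
)
classifications.printSchema()
```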
I figured I should start writing my thoughts down somewhere.
So we have this Spark job Ed wrote. The main issue with it is that we need to get the data into it somehow, since the current querying approach just loads up the entire classifications table.
I could look into splitting up that query, but what I'd propose instead is that we set up a sidetiq job in Panoptes to write yesterday's classifications to S3 every night, one file per day (sketched below). Some days will be bigger than others, but it's a nice and easy way to split the data up. The files could be Avro, but I don't know how good the Ruby library for that is, and I think the current JSON-inside-a-CSV format is probably good enough.
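Something like this, roughly; the real thing would be a Ruby sidetiq worker inside Panoptes, and the bucket, key layout, and `fetch_rows` helper here are all made up for illustration:

```python
# Illustrative sketch of the nightly export; the real job would be a Ruby
# sidetiq worker inside Panoptes. Bucket, key layout, and fetch_rows are
# hypothetical.
import csv
import io
from datetime import date, timedelta

import boto3

def export_yesterdays_classifications(fetch_rows):
    # fetch_rows(day) is a stand-in for however Panoptes queries one day's
    # classifications; it should yield dicts with "id" and "annotations".
    day = date.today() - timedelta(days=1)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "annotations"])  # annotations stay JSON-inside-CSV
    for row in fetch_rows(day):
        writer.writerow([row["id"], row["annotations"]])
    boto3.client("s3").put_object(
        Bucket="zooniverse-dumps",                     # hypothetical bucket
        Key=f"classifications/{day.isoformat()}.csv",  # one file per day
        Body=buf.getvalue().encode("utf-8"),
    )
```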
I like this idea a little more because it also means that we keep Panoptes' database schema inside Panoptes, instead of having a Spark job that knows about it.
As for running the Spark job to export to CSV, I think we should just boot up an Amazon EMR (Elastic MapReduce) cluster. Compared to setting up Spark ourselves, that gives us easy on-demand scaling (the cluster launches per run and terminates when done) and means we don't have to maintain it. EMR can easily read from S3 and write back there.
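Launching such a transient cluster from boto3 would look roughly like this; the release label, instance types/counts, and the S3 path to the job script are all assumptions:

```python
# Rough sketch: launch a transient EMR cluster that runs the Spark export
# step and terminates itself when done. Release label, instance types, and
# the job script path are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="classification-export",
    ReleaseLabel="emr-4.1.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
             "InstanceCount": 2},
        ],
        # No long-lived cluster: shut down once the steps have run.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-export",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://zooniverse-dumps/jobs/export.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```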
/cc @camallen