yahoo / TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Apache License 2.0

How to run TensorFlowOnSpark on AWS EMR? #545

Open dgoldenberg-audiomack opened 3 years ago

dgoldenberg-audiomack commented 3 years ago

Could you provide an example/recipe for how to run a TensorFlow application on AWS EMR using TensorFlowOnSpark to achieve scale and distributability?

For example, let's say I want to run this on AWS EMR using a cluster that EMR configures: https://www.tensorflow.org/recommenders/examples/basic_retrieval. It's a basic app from the TensorFlow Recommenders framework.

How would one run this example on TensorFlowOnSpark in EMR?
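For context, a typical TFoS job on EMR is launched with spark-submit on YARN. The sketch below is an assumption about how this would look, not an official recipe: it assumes TensorFlow and tensorflowonspark are installed on every node (e.g. via an EMR bootstrap action), and retrieval_spark.py is a hypothetical driver script that wraps the Recommenders training code with TFCluster.run (such a script is not part of TFoS itself).

```shell
# Hedged sketch: launching a TensorFlowOnSpark job on an EMR (YARN) cluster.
# Assumes TF + tensorflowonspark are installed on all nodes via a bootstrap
# action, and retrieval_spark.py is a hypothetical TFoS driver script.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 8G \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  retrieval_spark.py --cluster_size 4 --epochs 3
```

Disabling dynamic allocation and capping YARN app attempts at 1 matters here: TFoS pins one TF node per executor, so executors must not come and go mid-job.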

Also, looking at examples in TensorFlowOnSpark, specifically: https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples/mnist/keras:

Train via InputMode.TENSORFLOW In this mode, each worker will load the entire MNIST dataset into memory (automatically downloading the dataset if needed).

Train via InputMode.SPARK In this mode, Spark will distribute the MNIST dataset (as CSV) across the workers, so each of the workers will see only a portion of the dataset per epoch. Also note that InputMode.SPARK currently only supports a single input RDD, so the validation/test data is not used.

Which mode would one want to use to achieve scalability and performance on larger datasets? I would think the SPARK mode would be more efficient, but I'm confused by the subsequent comment, "currently only supports a single input RDD, so the validation/test data is not used." Does this mean I wouldn't be able to use the validation and test datasets? If so, how could I assess the accuracy of the model?

Thanks.

leewyang commented 3 years ago

Have you seen the wiki docs which explain the architecture details of TensorFlowOnSpark along with the FAQ?

Anyhow, at the most general level, TFoS basically just helps stand up a distributed TensorFlow cluster on top of Spark executors... so to achieve scalability and performance on larger datasets, you really should understand and experiment with distributed TF first. InputMode.SPARK is mostly a "compatibility" shim that lets you use Spark RDDs with TF (with some limitations).
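To make the two modes concrete, here is a minimal driver sketch adapted from the mnist examples in this repo. It only runs inside a Spark cluster; main_fun, args, and data_rdd are placeholders you would supply:

```python
# Hedged sketch of the TFoS driver pattern (adapted from the mnist examples).
# main_fun(args, ctx) is your TF training function, executed on each Spark
# executor as one node of the distributed TF cluster. Placeholder names:
# args, data_rdd.
from pyspark import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(appName="tfos_sketch")

def main_fun(args, ctx):
    # In InputMode.TENSORFLOW, this function reads data itself
    # (e.g. tf.data from files); in InputMode.SPARK, it pulls batches
    # streamed from the RDD partitions instead.
    ...

# InputMode.TENSORFLOW: each worker loads its own data.
# InputMode.SPARK: Spark feeds partitions of data_rdd to the workers.
cluster = TFCluster.run(sc, main_fun, None, 4, num_ps=0,
                        input_mode=TFCluster.InputMode.SPARK,
                        master_node='chief')
cluster.train(data_rdd, 1)  # only a single input RDD is supported here
cluster.shutdown()
```

This is why InputMode.SPARK can't feed a separate validation RDD: cluster.train takes one RDD, so evaluation has to happen some other way (e.g. a separate inference pass, as noted below in this thread).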

dilta commented 3 years ago

You can use the inference mode of mnist_pipeline.py to score the test data, and calculate the accuracy based on the results.
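For instance, if the inference step writes out text lines pairing each true label with the predicted class, accuracy is just the fraction of matching pairs. The line format below is illustrative, not the actual mnist_pipeline.py output; adapt the parsing to whatever your inference step writes:

```python
# Hedged sketch: computing accuracy from saved inference results.
# The "label=... prediction=..." line format is illustrative only.
def accuracy_from_lines(lines):
    correct = total = 0
    for line in lines:
        # e.g. "label=7 prediction=7"
        fields = dict(kv.split("=") for kv in line.split())
        total += 1
        if fields["label"] == fields["prediction"]:
            correct += 1
    return correct / total if total else 0.0

preds = ["label=7 prediction=7", "label=2 prediction=1", "label=1 prediction=1"]
print(accuracy_from_lines(preds))  # → 0.6666666666666666
```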