wwoo / tf_face

Recognising faces using Vision API, TensorFlow & Google Cloud Machine Learning

Training a model using Cloud ML to serve using TensorFlow Serving #1

Open martiankuo1 opened 7 years ago

martiankuo1 commented 7 years ago

I followed the steps described in "tf_face" using my own set of training data and proceeded to "Training a model using Cloud ML to serve using TensorFlow Serving". I issued the following command from the tutorial:

gcloud beta ml jobs submit training my9thmljob --package-path=pubfig_export --module-name=pubfig_export.export_log --region=us-central1 --staging-bucket=gs://cloudml-1001

but got the following error from the Cloud ML job:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 445, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 240, in main
    train_queue = get_input_queue(FLAGS.train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 98, in get_input_queue
    train_images, train_labels = get_image_label_list(train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 74, in get_image_label_list
    for line in open(image_label_file, "r"):
IOError: [Errno 2] No such file or directory: '/tmp/data/train.txt'

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=99426417043&resource=ml_job%2Fjob_id%2Fmy9thmljob&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22my9thmljob%22

The only thing I can understand from the message is "[Errno 2] No such file or directory: '/tmp/data/train.txt'". I double-checked that I had moved the "data" directory to "temp".

Any suggestions? Your kind help will be deeply appreciated.

CH

wwoo commented 7 years ago

Hello,

As a whole, the example needs to be updated since Cloud ML (or ML Engine) has now gone GA. I have that on my to-do list. However, on the specific issue you're seeing:

If you're trying ML Engine online training (not local training), you'll need to set 'copy_from_gcs' to True and make sure you've uploaded a gzipped tarball of your 'data' directory (which is where train.txt should live) to GCS. The code downloads the tarball and unpacks it into /tmp, which is how ML Engine gets a copy of the dataset.
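For anyone following along, here is a minimal sketch of the packing step, assuming the tarball should unpack to a top-level 'data' folder so that extracting it into /tmp yields /tmp/data/train.txt. The function names `pack_data_dir` and `list_members` are hypothetical helpers for illustration and are not part of tf_face.

```python
import os
import tarfile


def pack_data_dir(data_dir, tarball_path):
    """Write data_dir into a gzipped tarball, keeping its top-level folder name."""
    data_dir = os.path.abspath(data_dir)
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname keeps entries relative (e.g. "data/train.txt"), so unpacking
        # into /tmp would produce /tmp/data/train.txt.
        tar.add(data_dir, arcname=os.path.basename(data_dir))
    return tarball_path


def list_members(tarball_path):
    """Return the sorted entry names, to confirm the layout before uploading."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Checking that "data/train.txt" appears in `list_members("out.tar.gz")` before uploading can save a failed job. The upload itself can be done with `gsutil cp out.tar.gz gs://<your-bucket>/<your-path>/`, where the bucket and path are placeholders for your own location.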

As-is, ML Engine will download 'gs://wwoo-train/pubfig/out.tar.gz' - you'll need to change this to your path. If the download works, you should see something like this in your logs:

... 14:20:36.915 Recursively copying from gs://wwoo-train/pubfig/out.tar.gz to /tmp/ ...
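To check a tarball locally before submitting a job, something like the following sketch may help. It only mimics the unpack step on an already-downloaded file and then confirms the file named in the traceback exists. It is not the code in export_log.py, and `unpack_and_check` and its `expected` parameter are hypothetical names.

```python
import os
import tarfile


def unpack_and_check(tarball_path, dest_dir, expected="data/train.txt"):
    """Unpack a gzipped tarball into dest_dir and confirm the expected file exists."""
    # Only use this on tarballs you created yourself; extractall trusts entry paths.
    with tarfile.open(tarball_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    target = os.path.join(dest_dir, expected)
    if not os.path.isfile(target):
        # Same class of failure as the IOError in the traceback, caught early.
        raise IOError("expected %s after unpacking %s" % (target, tarball_path))
    return target
```

Calling `unpack_and_check("out.tar.gz", "/tmp")` on a correctly laid out tarball should return "/tmp/data/train.txt". If it raises, the job would most likely fail with the same "No such file or directory" error.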

It might be worth checking that this works with local training first. Once you've verified that works, try it with ML Engine.

ww

martiankuo1 commented 7 years ago

Thanks.

I did make it work on the local training. I will try what you recommended.

Your kind help is deeply appreciated.

Cheng-Hua Kuo CloudMile
