zlabjp / nghttpx-ingress-lb

nghttpx ingress controller for Kubernetes

Does nghttpx ingress intercept errors? #82

Open ingridgoh opened 6 years ago

ingridgoh commented 6 years ago

Hello,

I currently have TensorFlow Serving deployed in a container, and I've noticed that when there are any prediction errors, the actual error stack is not returned to the client when using the nghttpx ingress. The following are my observations (all aspects of the environment are kept constant except for the use of an intermediate ingress):

1. Client Request --> Load Balancer --> Ingress --> Container (TensorFlow Serving)
   Observation: the error is obscured from the client; a generic error message is received.
   Error received: grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INTERNAL, details="Received RST_STREAM with error code 2")

2. Client Request --> Load Balancer --> Container (TensorFlow Serving)
   Observation: the detailed error stack is returned to the client.
   Error received: grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Matrix size-incompatible: In[0]: [3592,10], In[1]: [3592,10] [[Node: MatMul = MatMul[T=DT_FLOAT, _output_shapes=[[?,10]], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_x_0_0, Variable/read)]]")
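For reference, the "RST_STREAM with error code 2" in the first case comes from the HTTP/2 layer rather than from TensorFlow Serving itself: RST_STREAM error codes are defined in RFC 7540, section 7, and code 2 is INTERNAL_ERROR, which gRPC surfaces as StatusCode.INTERNAL. A small lookup table (not part of any library API, just the RFC values written out) makes the mapping explicit:

```python
# HTTP/2 RST_STREAM / GOAWAY error codes, as defined in RFC 7540, section 7.
HTTP2_ERROR_CODES = {
    0x0: "NO_ERROR",
    0x1: "PROTOCOL_ERROR",
    0x2: "INTERNAL_ERROR",
    0x3: "FLOW_CONTROL_ERROR",
    0x4: "SETTINGS_TIMEOUT",
    0x5: "STREAM_CLOSED",
    0x6: "FRAME_SIZE_ERROR",
    0x7: "REFUSED_STREAM",
    0x8: "CANCEL",
    0x9: "COMPRESSION_ERROR",
    0xa: "CONNECT_ERROR",
    0xb: "ENHANCE_YOUR_CALM",
    0xc: "INADEQUATE_SECURITY",
    0xd: "HTTP_1_1_REQUIRED",
}

# The code reported in the client error above:
print(HTTP2_ERROR_CODES[0x2])  # INTERNAL_ERROR
```

So a proxy that resets the stream with code 2 replaces whatever gRPC status the backend produced, which would explain why the INVALID_ARGUMENT detail disappears.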

Thank you!

tatsuhiro-t commented 6 years ago

Could you provide a way to reproduce this, for example, using https://github.com/tensorflow/serving/tree/master/tensorflow_serving/example ?

ingridgoh commented 6 years ago

The error mentioned was from a failed inference query against a DNN model. However, you do not need to replicate the exact error I received, since any error thrown by the TF-Serving server results in the "Received RST_STREAM with error code 2" error when the ingress is used. You could take https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_client.py as an example and tweak the script so that a random matrix is sent in the request instead of an image (please note that I'm a complete novice at this):

e.g.:

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2

# Deliberately mis-shaped input: the model expects a different matrix shape,
# so the server should reject the request with INVALID_ARGUMENT.
rand_array = np.random.rand(10, 3592)
request = predict_pb2.PredictRequest()
request.model_spec.name = MODEL_NAME  # the name the model was served under
request.model_spec.signature_name = 'predict_images'
request.inputs['inputs'].CopyFrom(
    tf.contrib.util.make_tensor_proto(rand_array, dtype=tf.float32)
)

Here's a simple architectural diagram for my setup: (image attached)

tatsuhiro-t commented 6 years ago

I tried to reproduce the issue with the following patch to tensorflow_serving/example/mnist_client.py:

diff --git a/tensorflow_serving/example/mnist_client.py b/tensorflow_serving/example/mnist_client.py
index 947f7c4..93f1e91 100644
--- a/tensorflow_serving/example/mnist_client.py
+++ b/tensorflow_serving/example/mnist_client.py
@@ -146,8 +146,9 @@ def do_inference(hostport, work_dir, concurrency, num_tests):
     request.model_spec.name = 'mnist'
     request.model_spec.signature_name = 'predict_images'
     image, label = test_data_set.next_batch(1)
+    rand_array = numpy.random.rand(10, 3592)
     request.inputs['images'].CopyFrom(
-        tf.contrib.util.make_tensor_proto(image[0], shape=[1, image[0].size]))
+        tf.contrib.util.make_tensor_proto(rand_array, dtype=tf.float32))
     result_counter.throttle()
     result_future = stub.Predict.future(request, 5.0)  # 5 seconds
     result_future.add_done_callback(

But I got the same error message with or without the proxy:

AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Matrix size-incompatible: In[0]: [10,3592], In[1]: [784,10] [[Node: MatMul = MatMul[T=DT_FLOAT, _output_shapes=[[?,10]], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_x_0_0, Variable/read)]]")
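(As a side note, this INVALID_ARGUMENT detail is exactly what the deliberately mis-shaped input should produce: a matrix product A @ B is only defined when A's inner dimension matches B's row count. A trivial pure-Python check, with `matmul_compatible` as a hypothetical helper, illustrates why the [10, 3592] input cannot multiply the model's [784, 10] weight matrix:)

```python
def matmul_compatible(a_shape, b_shape):
    """A @ B requires A's column count to equal B's row count."""
    return a_shape[1] == b_shape[0]

# Shapes from the error above: the random input vs. the MNIST weight matrix.
print(matmul_compatible((10, 3592), (784, 10)))  # False -> INVALID_ARGUMENT
# A correctly shaped MNIST batch would pass:
print(matmul_compatible((1, 784), (784, 10)))    # True
```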

Which version of the Ingress controller are you using? It is worth trying the latest version.

ingridgoh commented 6 years ago

I was using v0.28.0. I've updated the controller to v0.31.0, but the same behaviour still occurs. The following are the exact errors I have received:

With ingress:

$ python 1_non_mlpkit_our_data.py
Traceback (most recent call last):
  File "1_non_mlpkit_our_data.py", line 94, in <module>
    print stub.Predict(request, 120)
  File "/Users/setup/virtualenv/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
    self._request_serializer, self._response_deserializer)
  File "/Users/setup/virtualenv/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INTERNAL, details="Received RST_STREAM with error code 2")

Without ingress:

$ python 1_non_mlpkit_our_data.py
Traceback (most recent call last):
  File "1_non_mlpkit_our_data.py", line 94, in <module>
    print stub.Predict(request, 120)
  File "/Users/setup/virtualenv/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
    self._request_serializer, self._response_deserializer)
  File "/Users/setup/virtualenv/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Matrix size-incompatible: In[0]: [3592,10], In[1]: [3592,10]
     [[Node: MatMul = MatMul[T=DT_FLOAT, _output_shapes=[[?,10]], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_x_0_0, Variable/read)]]")