yanyan314 / Refinitiv-


Performance Tuning #2

Open yanyan314 opened 4 years ago

yanyan314 commented 4 years ago


Findings

**CloudWatch issue  DAA-6280 - Promote performance Closed**

Description: 
      GraphQL sends metrics to CloudWatch, but we found that the CloudWatch API call is very slow and it slows down GraphQL. AWS also imposes limits on sending metrics to CloudWatch; when we reach that limit, the extremely slow response time from CloudWatch will almost hang GraphQL. Allen.Li changed the CloudWatch API call to an asynchronous call, so that GraphQL no longer waits for the response from the CloudWatch API. This bug was found by Allen Li.

What has been done:
      The code has been merged (see Jira feature/DAA-6280). Testing showed that for a small query, the whole response time dropped from seconds to hundreds of milliseconds.

Concerns:
      Consider the CloudWatch limits: when requests keep coming in and metrics flood into CloudWatch, at some point the asynchronous calls will hit the thread pool or queue limit, exceptions will be thrown, and GraphQL will misbehave. So the once-and-for-all solution should be:
     1. Send the metrics in batches, with asynchronous calls, or send them from a separate process instead of a thread.
     2. When the incoming metrics cannot be handled, abandon them to maintain the service level.
     3. Check whether AWS provides better options than direct API calls.

Status:  Bug fixed, but needs improvement.

reference: ExecutorService is a framework provided by the JDK which simplifies the execution of tasks in asynchronous mode. Generally speaking, ExecutorService automatically provides a pool of threads and an API for assigning tasks to it. https://www.baeldung.com/java-executor-service-tutorial
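The concerns above can be sketched with the JDK's ExecutorService. This is a minimal, hypothetical example (not the project's actual code): metrics are published off the request thread through a bounded queue, and when the queue is full, new metrics are silently dropped so the service level is preserved instead of the request thread blocking.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncMetricsSketch {
    // Bounded queue + DiscardPolicy: when saturated, new metric tasks are
    // dropped (concern 2: abandon metrics rather than degrade the service).
    private final ThreadPoolExecutor metricsPool = new ThreadPoolExecutor(
            2, 4, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(1000),
            new ThreadPoolExecutor.DiscardPolicy());

    /** Submit a metric-publishing task; never blocks the request thread. */
    public void publish(Runnable putMetricCall) {
        metricsPool.execute(putMetricCall);
    }

    /** Flush pending metrics and stop the pool. */
    public void shutdown() {
        metricsPool.shutdown();
        try {
            metricsPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real deployment the `Runnable` would wrap a CloudWatch `putMetricData` call, ideally batching several datapoints per request to stay under the API limits.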

**Jetty thread pool issue**
Description: 
      There is a queued thread pool in the Jetty HTTP server; the default pool size is 200. We ran a series of tests on a local machine, using local mock services as the upstream OCS and running JMeter as the client on the same machine. Since all the traffic stays on one machine, the OCS service and network latency are isolated. One mock OCS service returns data immediately; we call this the "simple query". Another mock OCS service returns data after 30 seconds; we call this the "complex query".
      Problems:
          1. With multiple threads (50~220), simple-query response time increased and throughput dropped. When concurrency reached 220, GraphQL almost stopped responding; throughput fell to 0.47 requests/sec.
          2. With 100 threads sending the simple query, combined with 10~100 threads sending the complex query, response time degraded dramatically at 70 complex-query threads. Throughput fell to 1.2 requests/sec.
      Conclusion: 
          GraphQL itself, isolated from other factors, can only handle 220 concurrent requests, and when concurrency rises above 150, total throughput drops dramatically and response time increases dozens of times.
      Fix:
         Increase the default thread pool size from 200 to 1000. Rerunning the tests, GraphQL can handle 600 concurrent requests at almost the same service level; for small queries in particular, GraphQL can handle almost 800 concurrent requests.
Status:
     Fixed.
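A minimal sketch of the fix, assuming embedded Jetty 9.x (it requires the `jetty-server` dependency; the pool sizes mirror the test above and the exact values are a tuning choice, not the project's actual configuration):

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class JettyPoolConfig {
    public static Server buildServer() {
        // Raise the request thread pool above the 200 default so that
        // slow upstream calls do not exhaust all worker threads.
        QueuedThreadPool threadPool = new QueuedThreadPool();
        threadPool.setMaxThreads(1000);
        threadPool.setMinThreads(16);
        return new Server(threadPool); // connectors and handlers omitted
    }
}
```

Note that a bigger pool only buys headroom; each thread still blocks for the full 30 seconds on a complex query, so the long-term fix is asynchronous upstream I/O.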

**Service implementation OnFailure method issue  DAA-6276 - OCS client can not handle failure when connection failed Closed**
Description: 
      The method OnFailure() in class com.refinitiv.eds.graphql.cdf.service.objcontainer.ObjectContainerServiceImpl.java handles the scenario of failing to get response data from OCS. It tries to read the attribute named REQUEST_START_TIME_ATTRIBUTE from the HttpContext, but for some reason the attribute is null and the method raises a runtime exception. The thread is then returned to the thread pool in a wrong state and causes errors later. This bug was found by Allen Li.
      The fix has been merged with "feature/DAA-6280", by catching the error and letting the thread return to the pool without exceptions.
Status:
      Fixed.
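A hypothetical reconstruction of the fix, not the actual ObjectContainerServiceImpl code (the attribute name and the map-based context stand in for the real HttpContext API): guard against the missing attribute and never let an exception escape back to the thread pool.

```java
import java.util.Map;

public class OnFailureSketch {
    static final String REQUEST_START_TIME_ATTRIBUTE = "request.start.time";

    /** Returns elapsed millis since the request started, or -1 if unknown. */
    public long elapsedMillis(Map<String, Object> httpContext) {
        try {
            Object start = httpContext.get(REQUEST_START_TIME_ATTRIBUTE);
            if (start == null) {
                return -1L; // attribute missing: report "unknown" rather than throw NPE
            }
            return System.currentTimeMillis() - (Long) start;
        } catch (RuntimeException e) {
            return -1L; // swallow, so the worker thread returns to the pool cleanly
        }
    }
}
```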
**Zombie task issue  DAA-6329 - Disconnected from ELB but ECS task are still running Development in Progress**
Description: 
      A bulk call usually means a big query; when GraphQL needs a long time to deal with upstream services, bulk may disconnect from GraphQL due to a timeout. The corresponding worker thread is not aware of this and keeps doing the job: computing data, sending queries to upstream services, or just waiting until it times out or gets an error response from other services. We call such a worker thread a "zombie". A zombie does pointless work and forces the upstream services to do pointless work along with it. We need a mechanism to notify the zombies and let them rest in peace.
Status:
      Unfixed.
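One possible mechanism, sketched with plain JDK primitives (hypothetical, not the project's design): when the server notices the client has disconnected, it cancels the worker's Future with interruption, and the worker checks its interrupt flag between expensive steps instead of running to completion.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class ZombieSketch {
    // On client disconnect the server calls future.cancel(true); the worker
    // cooperates by checking the interrupt flag between expensive steps.
    public static Future<String> submit(ExecutorService pool) {
        return pool.submit(() -> {
            StringBuilder sb = new StringBuilder();
            for (int step = 0; step < 1000; step++) {
                if (Thread.currentThread().isInterrupted()) {
                    throw new CancellationException("client gone, stop upstream work");
                }
                sb.append(step); // stand-in for an upstream OCS call
                Thread.sleep(1); // blocking calls also abort on interrupt
            }
            return sb.toString();
        });
    }
}
```

The hard part in practice is detecting the disconnect (e.g. via a servlet AsyncListener or a write failure) and propagating the cancellation into in-flight upstream HTTP calls.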
**Timeout on ELB**
Description: 
Status:
**CDF query without ObjectId**
Description:
     In some cases, the objectId is not present in a CDF relationship query. This causes a full scan in CDF and long latency for the query. Such a query is not useful, which means an empty objectId is a logic bug in GraphQL.
     This issue has been fixed in DAA-6382.
Status:
     Fixed.
**http request without enabling gzip compression header**
Description:
     OCS supports gzip-compressed transport, but the GraphQL server does not add the header "accept-encoding: gzip" to the http requests it sends. As a result, if the transported JSON content is very large (more than 100 KB in a single OCS http request), GraphQL receives the response header quickly but takes an extra 1-2 seconds to receive the whole body (per the test result on the AWS GraphQL Dev endpoint). After enabling gzip compression, the transported size is 15-20 times smaller than the original JSON size. For example, in one of my test queries the biggest content size was 220 KB; the compressed size was only 10 KB, and the body and header were received almost simultaneously.
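A minimal sketch of adding the header with the JDK 11+ `java.net.http` client (the URL is illustrative and the real OCS client wiring differs; note that this client does not decompress gzip automatically, so the caller must wrap the response body in a GZIPInputStream):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class GzipHeaderSketch {
    /** Build a GET request that asks the server to gzip the JSON body. */
    public static HttpRequest withGzip(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept-Encoding", "gzip")
                .GET()
                .build();
    }
}
```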

      Suggestions:
          The OCS server enables gzip-compressed transport, but not all Elastic clusters do.
          Testing in Postman shows that of the 3 current Elastic clusters, only Filing enables gzip compression; Research and RSA do not. Enabling gzip compression is simple in Elastic: change the ES setting http.compression: true. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-http.html

          Because the current performance issue is focused on the OCS requests rather than the Elastic requests, it is not a big problem if the RSA and Research Elastic cluster teams decide not to enable compression.

      Status:
           Will fix in  DAA-6383 - Gap between OCS calls Pending QA 
yanyan314 commented 4 years ago

Performance issues in bulk testing - Data Access APIs - Enterprise Confluence.pdf