wso2 / kubernetes-apim

Kubernetes and Helm resources for WSO2 API Manager
Apache License 2.0

[v3.1.x] Issues during pod restart process after upgrading helm resources #397

Open Prasadi-R opened 4 years ago

Prasadi-R commented 4 years ago

Description: The following errors were observed in the wso2carbon.log file when restarting API Manager pods after a Helm upgrade. As a result, the API-M pods fail to restart successfully.

```
[2020-06-03 11:58:14,952] ERROR - RegistryContext ||Unable to get instance of the registry context
org.wso2.carbon.registry.core.exceptions.RegistryException: Unable to connect to Data Source
    at org.wso2.carbon.registry.core.config.RegistryConfigurationProcessor.populateRegistryConfig(RegistryConfigurationProcessor.java:165) ~[org.wso2.carbon.registry.core_4.6.0.jar:?]
.....
Caused by: org.h2.jdbc.JdbcSQLNonTransientException: IO Exception: null [90028-199]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:502) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.lang.IllegalStateException: Could not open file nio:/home/wso2carbon/solr/database/WSO2CARBON_DB.mv.db [1.4.199/1]
    at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:883) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.io.IOException: No locks available
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]
```

Several warning messages related to "failed to mount" errors were also observed in the Pod descriptions.

Affected Product Version: Helm Resources For WSO2 API Manager version 3.1.0.2

OS, DB, other environment details and versions:
AWS EKS, Google Cloud (GKE)

chirangaalwis commented 4 years ago

@Prasadi-R as per my observations, the following were noted (for reference, I used the WSO2 API Management deployment pattern 1 deployment to investigate this issue).

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate

As per the example, maxSurge allows one extra Pod instance (running the new revision) beyond the desired count to be spawned and become ready before the old version of the Pod is deleted. The purpose of this strategy is to avoid downtime during the upgrade process.

In the given case, the new Publisher-DevPortal instance will attempt to use the H2 based local Carbon database, which is persisted (to meet Solr indexing requirements) and is still being used by the older Pod. Ideally, the old Pod needs to be deleted so that the relevant file is released from usage and the new Pod can start using it.

This, as per my understanding, is why the highlighted issue occurs.
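To make the file-lock contention described above concrete, here is a minimal Python sketch (my own illustration, not code from the product): two open file descriptors stand in for the old and new Pods contending for the same persisted database file, analogous to H2's lock on `WSO2CARBON_DB.mv.db`.

```python
import fcntl
import tempfile

# Stand-in for the persisted H2 database file (WSO2CARBON_DB.mv.db).
db = tempfile.NamedTemporaryFile(suffix=".mv.db")

# "Old Pod": holds an exclusive lock on the shared file.
old_pod = open(db.name, "w")
fcntl.flock(old_pod, fcntl.LOCK_EX | fcntl.LOCK_NB)

# "New Pod" (spawned by maxSurge while the old Pod is still running):
# its non-blocking lock request is denied while the old lock is held.
new_pod = open(db.name, "w")
try:
    fcntl.flock(new_pod, fcntl.LOCK_EX | fcntl.LOCK_NB)
    outcome = "acquired"
except BlockingIOError:
    outcome = "lock unavailable"
print(outcome)
```

The new instance cannot take the lock until the old holder releases it, which is why the old Pod must be terminated before the replacement can start cleanly.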

chirangaalwis commented 4 years ago

After a series of internal discussions, the following options were deduced as solutions for the issue discussed.

First, it is important to state that this option forces us to use Kubernetes Deployment resources, as the Recreate update strategy is not available for StatefulSet deployments.

We evaluated this option with WSO2 API Manager deployment pattern 1 for version 3.1.0, in a GKE environment. Each API Manager All in One deployment was defined using a Kubernetes Deployment resource and Recreate strategy was used for updating the existing deployment.
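For reference, switching such a Deployment to the Recreate strategy only requires changing the strategy stanza (a minimal sketch; field names follow the standard Kubernetes Deployment spec):

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate   # all old Pods are terminated before new ones are started
```

With Recreate, Kubernetes terminates every old Pod (releasing the H2 file lock) before starting the new revision, at the cost of a brief downtime window during each upgrade.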

We made about six update attempts on existing API Manager Pods, and every update completed successfully. Thus, we can conclude that this option works for the discussed scenario.

Under this option, we may be able to continue using Kubernetes StatefulSet resources to define the deployments, following the approach suggested in the article provided by @ThilinaManamgoda. Hence, this option eases the effort of scaling the Publisher and DevPortal deployments compared to option 1, although scaling can be considered a less common use case for the given profiles, as per my understanding.
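The referenced article is not quoted in this thread, but one standard StatefulSet mechanism that fits this description is the OnDelete update strategy, under which a Pod is only replaced after it is deleted, so the H2 file lock is released before the new Pod starts. A minimal sketch (field names follow the standard Kubernetes StatefulSet spec; whether this matches the article's exact approach is my assumption):

```yaml
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: OnDelete   # the new revision starts only after the old Pod is
                     # deleted, so the persisted H2 file is no longer locked
```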

Though this option is yet to be evaluated, the user would have to bear the overhead of maintaining additional, externalized databases under this approach.
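For context, externalizing the Carbon/registry databases in API Manager 3.x is configured through datasource entries in the deployment.toml; a hedged sketch is below (the host, database names, and credentials are placeholders, not values from this thread):

```toml
[database.shared_db]
type = "mysql"
url = "jdbc:mysql://<db-host>:3306/shared_db?useSSL=false"
username = "wso2carbon"
password = "<password>"

[database.apim_db]
type = "mysql"
url = "jdbc:mysql://<db-host>:3306/apim_db?useSSL=false"
username = "wso2carbon"
password = "<password>"
```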

Considering that externalizing and maintaining the databases adds user overhead, and that scaling is a less common use case for the discussed profiles, the first option can be considered the most appropriate for the given scenario.

chirangaalwis commented 4 years ago

As per further tests, it was noticed that this error persists even when the Recreate strategy is used. We are currently testing the option of moving the last-access-time record to the Governance Registry. We will update this thread once we go through the tests.

chirangaalwis commented 4 years ago

The discussed issue has been reported by a number of users across numerous Kubernetes deployments over the past few weeks.

As per the last few meetings we had on this matter, we decided to take the following steps to avoid this issue.

We decided to use this resource type due to the availability of the Recreate update strategy, which is not available for StatefulSet based deployments.

During the internal upgrade tests running API Manager pattern 1 deployments in the AWS EKS and GKE Kubernetes services (at least 10 upgrade attempts in each service), we encountered this issue far less frequently than with previous approaches.

However, considering the intermittent occurrence of this error, we decided to proceed with https://github.com/wso2/kubernetes-apim/issues/416. Furthermore, it was agreed to conduct an internal evaluation of the product issue causing this and attempt to fix it.

Also, the following were suggested as practices which we could adopt in future releases.

Update as of 2020-09-01: Please refer to the official WSO2 container guide for recommended, tried and tested storage options.

Please feel free to share your thoughts and concerns with regards to this discussed matter.

chirangaalwis commented 4 years ago

As per https://github.com/wso2/kubernetes-apim/issues/397#issuecomment-660429579, we will be moving this issue to a future milestone, since it has not been completely resolved yet.