wso2 / kubernetes-apim

Kubernetes and Helm resources for WSO2 API Manager
Apache License 2.0

[v3.1.x] Issues during pod restart process after upgrading helm resources #397

Open Prasadi-R opened 4 years ago

Prasadi-R commented 4 years ago

Description: The following errors were observed in the wso2carbon.log file when restarting API Manager pods after a Helm upgrade. As a result, the API-M pods fail to restart successfully.

```
[2020-06-03 11:58:14,952] ERROR - RegistryContext ||Unable to get instance of the registry context
org.wso2.carbon.registry.core.exceptions.RegistryException: Unable to connect to Data Source
    at org.wso2.carbon.registry.core.config.RegistryConfigurationProcessor.populateRegistryConfig(RegistryConfigurationProcessor.java:165) ~[org.wso2.carbon.registry.core_4.6.0.jar:?]
.....
Caused by: org.h2.jdbc.JdbcSQLNonTransientException: IO Exception: null [90028-199]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:502) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.lang.IllegalStateException: Could not open file nio:/home/wso2carbon/solr/database/WSO2CARBON_DB.mv.db [1.4.199/1]
    at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:883) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.io.IOException: No locks available
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]
```

Several warning messages related to "failed to mount" errors were also observed in the Pod descriptions.

Affected Product Version: Helm Resources For WSO2 API Manager version 3.1.0.2

OS, DB, other environment details and versions:
AWS EKS, Google Cloud (GKE)

chirangaalwis commented 4 years ago

@Prasadi-R as per my observations, the following were noted (for reference, I used the WSO2 API Management deployment pattern 1 deployment to investigate this issue).

  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate

As per the example, maxSurge allows one extra Pod instance (running the new revision) beyond the desired count to be spawned and become ready before the old version of the Pod is deleted. The purpose of this strategy is to avoid downtime during the upgrade process.

In the given case, the new Publisher-DevPortal instance will attempt to use the H2 based local Carbon database, which is persisted (to meet Solr indexing requirements) and is still being used by the older Pod. Ideally, the old Pod needs to be deleted so that the relevant file is released from usage and the new Pod can start using it.

This, as per my understanding, is why the highlighted issue occurs.
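To make the file-lock contention described above concrete, here is a minimal Python sketch (my own illustration, not code from the product): two open file descriptors stand in for the old and new Pods contending for the same persisted database file, analogous to H2's lock on `WSO2CARBON_DB.mv.db`.

```python
import fcntl
import tempfile

# Stand-in for the persisted H2 database file (WSO2CARBON_DB.mv.db).
db = tempfile.NamedTemporaryFile(suffix=".mv.db")

# "Old Pod": holds an exclusive lock on the shared file.
old_pod = open(db.name, "w")
fcntl.flock(old_pod, fcntl.LOCK_EX | fcntl.LOCK_NB)

# "New Pod" (spawned by maxSurge while the old Pod is still running):
# its non-blocking lock request is denied while the old lock is held.
new_pod = open(db.name, "w")
try:
    fcntl.flock(new_pod, fcntl.LOCK_EX | fcntl.LOCK_NB)
    outcome = "acquired"
except BlockingIOError:
    outcome = "lock unavailable"
print(outcome)
```

The new instance cannot take the lock until the old holder releases it, which is why the old Pod must be terminated before the replacement can start cleanly.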

chirangaalwis commented 4 years ago

After a series of internal discussions, the following options were deduced as solutions for the issue discussed.

First, it is important to state that this option forces us to use Kubernetes Deployment resources, as the Recreate update strategy is not available for StatefulSet deployments.

We evaluated this option with WSO2 API Manager deployment pattern 1 for version 3.1.0, in a GKE environment. Each API Manager All in One deployment was defined using a Kubernetes Deployment resource and Recreate strategy was used for updating the existing deployment.
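For reference, switching such a Deployment to the Recreate strategy only requires changing the strategy stanza (a minimal sketch; field names follow the standard Kubernetes Deployment spec):

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate   # all old Pods are terminated before new ones are started
```

With Recreate, Kubernetes terminates every old Pod (releasing the H2 file lock) before starting the new revision, at the cost of a brief downtime window during each upgrade.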

We made about six update attempts on existing API Manager Pods, and every update completed successfully. Thus, we can conclude that this option works for the discussed scenario.

Under this option, we may be able to continue using Kubernetes StatefulSet resources to define the deployments, following the approach suggested in the article provided by @ThilinaManamgoda. Hence, this option eases the effort of scaling the Publisher and DevPortal deployments compared to option 1, although scaling can be considered a less common use case for the given profiles, as per my understanding.
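The referenced article is not quoted in this thread, but one standard StatefulSet mechanism that fits this description is the OnDelete update strategy, under which a Pod is only replaced after it is deleted, so the H2 file lock is released before the new Pod starts. A minimal sketch (field names follow the standard Kubernetes StatefulSet spec; whether this matches the article's exact approach is my assumption):

```yaml
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: OnDelete   # the new revision starts only after the old Pod is
                     # deleted, so the persisted H2 file is no longer locked
```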

Though this option is yet to be evaluated, the user would have to bear the overhead of maintaining additional, externalized databases under this approach.
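For context, externalizing the Carbon/registry databases in API Manager 3.x is configured through datasource entries in the deployment.toml; a hedged sketch is below (the host, database names, and credentials are placeholders, not values from this thread):

```toml
[database.shared_db]
type = "mysql"
url = "jdbc:mysql://<db-host>:3306/shared_db?useSSL=false"
username = "wso2carbon"
password = "<password>"

[database.apim_db]
type = "mysql"
url = "jdbc:mysql://<db-host>:3306/apim_db?useSSL=false"
username = "wso2carbon"
password = "<password>"
```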

Considering that externalizing and maintaining the databases adds user overhead, and that scaling is a less common use case for the discussed profiles, the first option can be considered the most appropriate for the given scenario.

chirangaalwis commented 4 years ago

As per further tests, it was noticed that this error persists even when the Recreate strategy is used. We are currently testing the option of moving the last-access-time record to the Governance Registry. We will update this thread once we go through the tests.

chirangaalwis commented 4 years ago

The discussed issue has been reported by a number of users across numerous Kubernetes deployments over the past few weeks.

As per the last few meetings we had on this matter, we decided to take the following steps to avoid this issue.

We decided to use this resource type due to the availability of the Recreate update strategy, which is not available for StatefulSet based deployments.

During the internal upgrade tests running API Manager pattern 1 deployments in the AWS EKS and GKE Kubernetes services (at least 10 upgrade attempts in each service), we encountered this issue far less frequently than with previous approaches.

However, considering the intermittent occurrence of this error, we decided to proceed with https://github.com/wso2/kubernetes-apim/issues/416. Furthermore, it was agreed to conduct an internal evaluation of the product issue causing this and attempt to fix it.

Also, the following were suggested as practices which we could adopt in future releases.

Update as of 2020-09-01: Please refer to the official WSO2 container guide for recommended, tried and tested storage options.

Please feel free to share your thoughts and concerns with regards to this discussed matter.

chirangaalwis commented 4 years ago

As per https://github.com/wso2/kubernetes-apim/issues/397#issuecomment-660429579, we will be moving this issue to a future milestone, since it has not been completely resolved yet.