Prasadi-R opened this issue 4 years ago
@Prasadi-R as per my observations, the following were noted (just to note, I used the WSO2 API Manager deployment pattern 1 to investigate this issue).
strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
  type: RollingUpdate
As per the example, the maxSurge setting would allow one extra Pod instance (based on the new changes) beyond the desired count to be spawned and become ready prior to the deletion of the old version of the Pod. The purpose of this strategy is to avoid downtime during the upgrade process.
In the given case, the new Publisher-DevPortal instance will attempt to use the H2-based local Carbon DB, which is persisted (as per Solr indexing requirements) and is still in use by the older Pod version. Ideally, the old Pod needs to be deleted in order for the lock on the relevant file to be released and for the new Pod to start using it.
As per my understanding, this is why the highlighted issue occurs.
After a series of internal discussions, the following options were deduced as solutions for the issue discussed.
First, it is important to state that this option forces us to use Kubernetes Deployment resources, as the Recreate update strategy is not available for StatefulSet deployments.
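For illustration, here is a minimal sketch of the relevant portion of a Kubernetes Deployment using the Recreate strategy (the resource name and image tag below are placeholders, not taken from the actual Helm charts):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wso2am-pattern-1-am            # placeholder name, not from the actual charts
spec:
  replicas: 1
  strategy:
    type: Recreate                     # terminate the old Pod before creating the new one,
                                       # so the H2 local Carbon DB file is released first
  selector:
    matchLabels:
      deployment: wso2am-pattern-1-am
  template:
    metadata:
      labels:
        deployment: wso2am-pattern-1-am
    spec:
      containers:
        - name: wso2am
          image: wso2/wso2am:3.1.0     # placeholder image tag

With Recreate, there is a short window of downtime during the upgrade, which is the trade-off against the rolling update shown earlier.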
We evaluated this option with WSO2 API Manager deployment pattern 1 for version 3.1.0, in a GKE environment. Each API Manager All-in-One deployment was defined using a Kubernetes Deployment resource, and the Recreate strategy was used for updating the existing deployment.
We made about 6 update attempts on existing API Manager Pods and every update worked successfully. Thus, we can conclude that this option works fine for the discussed scenario.
Under this option, we may be able to continue using Kubernetes StatefulSet resources to define the deployments by following the approach suggested in the article provided by @ThilinaManamgoda. Hence, this option will ease the effort of scaling the Publisher and DevPortal deployments compared to option 1, although scaling can be considered a less frequent use case for the given profiles, as per my understanding.
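As a rough sketch only, assuming the StatefulSet approach gives each replica its own persistent volume for the local Carbon DB and Solr index via volumeClaimTemplates (my reading of the idea, not confirmed against the referenced article; all names are placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: wso2am-pattern-1-am                # placeholder name
spec:
  serviceName: wso2am-pattern-1-am         # placeholder headless Service
  replicas: 2
  selector:
    matchLabels:
      deployment: wso2am-pattern-1-am
  template:
    metadata:
      labels:
        deployment: wso2am-pattern-1-am
    spec:
      containers:
        - name: wso2am
          image: wso2/wso2am:3.1.0         # placeholder image tag
          volumeMounts:
            - name: carbon-db              # local H2 Carbon DB / Solr data, one volume per replica
              mountPath: /home/wso2carbon/solr
  volumeClaimTemplates:
    - metadata:
        name: carbon-db
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

Since each replica gets its own volume, two Pods would not contend for the same WSO2CARBON_DB.mv.db file lock.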
Though this option is yet to be evaluated, the user may have to bear the overhead of maintaining additional, externalized databases under this approach.
Considering the user overhead of externalizing and maintaining the databases, and the fact that scaling is a less frequent use case for the discussed profiles, the first option can be considered the most appropriate for the given scenario.
As per further tests, it was noticed that this error persists even when the Recreate strategy is used. We are currently in the process of testing the option of moving the last access time to the Governance Registry. We will update this thread once we go through the tests.
The discussed issue has been reported by a number of users across numerous Kubernetes deployments over the past few weeks.
As per the last few meetings we had on this matter, we decided to take the following steps to avoid this issue.
We decided to use Kubernetes Deployment resources due to the availability of the Recreate update strategy, which is not available for StatefulSet-based deployments.
During the internal upgrade test attempts running API Manager pattern 1 deployments in the AWS EKS and GKE Kubernetes services (at least 10 upgrade attempts in each service), we encountered this issue far less frequently than with the previous approaches.
But considering the intermittent occurrence of this error, we decided to proceed as per https://github.com/wso2/kubernetes-apim/issues/416. Furthermore, it was agreed to conduct an internal evaluation of the product issue causing this and attempt to fix it.
Also, the following were suggested as practices which we could adopt in future releases.
Continue to package the NFS-based storage solution in WSO2 product Helm charts which have persistence/sharing use cases, for the purpose of evaluation. In an ideal scenario, the storage solution to be used should solely be obtained as per user preference (via user input). But, since it is mandatory for some WSO2 product profile deployments with high availability to persist/share runtime artifacts, it was suggested to package the NFS Server Provisioner by default, especially for evaluation purposes.
Recommend switching to a more production-ready persistent storage solution other than the NFS Server Provisioner (for example, see https://github.com/wso2/kubernetes-apim/issues/410#issuecomment-654070688 for CephFS) in the long run. Users can switch to the desired persistent storage solution by installing the relevant Kubernetes Storage Class as a prerequisite and providing it as a user input (see the sketch after this list).
Document the persistent storage solutions with which WSO2 product Kubernetes deployments have been tried and tested.
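To illustrate how a user-supplied Storage Class could be consumed, here is a minimal sketch of a PersistentVolumeClaim (the claim name, storage class name, and size below are illustrative assumptions, not values from the Helm charts):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wso2am-shared-artifacts          # illustrative claim name
spec:
  accessModes:
    - ReadWriteMany                      # shared runtime artifacts need access from multiple nodes
  storageClassName: cephfs               # illustrative; supplied by the user as an input
  resources:
    requests:
      storage: 2Gi

The idea is that only the storageClassName changes between storage solutions, so the chart can accept it as a single user input.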
Update as of 2020-09-01: Please refer to the official WSO2 container guide for recommended, tried and tested storage options.
Please feel free to share your thoughts and concerns with regard to the matter discussed.
As per https://github.com/wso2/kubernetes-apim/issues/397#issuecomment-660429579, we will be moving this issue to a future milestone since it has not been completely resolved yet.
Description: The following errors were observed in the wso2carbon.log file when restarting API Manager pods after a Helm upgrade. This causes the API-M pods to fail to restart successfully.
[2020-06-03 11:58:14,952] ERROR - RegistryContext ||Unable to get instance of the registry context
org.wso2.carbon.registry.core.exceptions.RegistryException: Unable to connect to Data Source
    at org.wso2.carbon.registry.core.config.RegistryConfigurationProcessor.populateRegistryConfig(RegistryConfigurationProcessor.java:165) ~[org.wso2.carbon.registry.core_4.6.0.jar:?]
.....
Caused by: org.h2.jdbc.JdbcSQLNonTransientException: IO Exception: null [90028-199]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:502) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.lang.IllegalStateException: Could not open file nio:/home/wso2carbon/solr/database/WSO2CARBON_DB.mv.db [1.4.199/1]
    at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:883) ~[h2_1.4.199.wso2v1.jar:?]
.....
Caused by: java.io.IOException: No locks available
    at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) ~[?:?]
Several warning messages related to "failed to mount" errors were also observed in the pod descriptions.
Affected Product Version: Helm Resources for WSO2 API Manager version 3.1.0.2
OS, DB, other environment details and versions:
AWS EKS, Google Cloud (GKE)