zowe / zowe-explorer-vscode

Visual Studio Code Extension for Zowe, which lets users interact with z/OS Data Sets, Unix System Services, and Jobs on a remote mainframe instance. Powered by Zowe SDKs.
Eclipse Public License 2.0

Latch issue and unable to cancel users address space #2963

Open savaresejt opened 5 months ago

savaresejt commented 5 months ago

A user reported problems opening and saving data sets. We found latch contention with xxx, and the user had address spaces that we could not cancel, even with force cancel commands in SDSF. Ultimately we had to IPL the system to clear the latch contention.

This is the latch contention on OMVS, from /D OMVS,A=ALL

To Reproduce

We were not able to reproduce the error.

Expected behavior

Even if saving the data sets fails, we would expect system admins to be able to cancel the address spaces that get spun up.

github-actions[bot] commented 5 months ago

Thank you for creating a bug report. We will investigate the bug and evaluate its impact on the product. If you haven't already, please ensure you have provided steps to reproduce the bug and as much context as possible.

savaresejt commented 2 months ago

We ran into this again:

Describing Dataset-is-Catalogued function
  [-] should return true if a dataset exists 32.46s (32.46s|4ms)
   RuntimeException: Command Error:
   z/OSMF REST API Error:
   Rest API failure with HTTP(S) status 500
   rc:       4
   reason:   1
   category: 2
   message:  login: timeout: TsoServerConnection(USER=XXXXX, ASID=0x00ca, QID=0x25840028)
   Error Details:
   HTTP(S) error status "500" received.
   Review request details (resource, base path, credentials, payload) and ensure correctness.
   Protocol:  https
   Host:      XXX
   Port:      XXX
   Base Path:
   Resource:  /zosmf/restfiles/ds?dslevel=SYS3.P.DBA.JCL.OLD
   Request:   GET
   Headers:   [{"Accept-Encoding":"gzip"},{"X-IBM-Max-Items":"0"},{"X-CSRF-ZOSMF-HEADER":true}]
   Payload:   undefined
   at Dataset-is-Catalogued, C:\whatever
   at <ScriptBlock>, C:\whatever
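
For anyone trying to reproduce this outside the test harness, the failing call in the trace above is a plain GET against the z/OSMF data set list resource. A minimal Python sketch of that request follows. The host name, port, basic authentication, and both function names are assumptions for illustration; only the resource path, the `dslevel` parameter, and the two `X-` headers come from the trace.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_ds_list_request(host, port, dslevel, user, password):
    """Build the GET request shown in the trace (hypothetical helper)."""
    query = urllib.parse.urlencode({"dslevel": dslevel})
    url = f"https://{host}:{port}/zosmf/restfiles/ds?{query}"
    # Assumption: basic authentication; the trace does not show how the
    # original test authenticates.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "X-CSRF-ZOSMF-HEADER": "true",
        "X-IBM-Max-Items": "0",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def is_cataloged(request, timeout_seconds=30):
    """Send the request; True if at least one data set matched dslevel.

    urlopen raises urllib.error.HTTPError on a 500 response like the one
    in the trace, so callers can catch and log it.
    """
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        body = json.load(response)
    return len(body.get("items", [])) > 0
```

The client-side timeout only bounds how long the caller waits. It does nothing for a TSO address space that is already stuck on the host.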

Recovery instructions, starting from the output of D GRS,C,L

  1. Find the latch number that is holding up OMVS/BPXOINIT. If you're not sure which one, connect through TSO OMVS or SSH and see which latch you are stuck behind.
  2. In SDSF, put a JT next to the address space that is locking up OMVS/BPXOINIT.
  3. Issue FORCE U=,A=,TCB= with the user ID, ASID, and TCB address for that address space.

OUTPUT FROM D GRS,C,L

RESPONSE=SYSELMD                                                       
 ISG343I 08.08.35 GRS STATUS 871                                       
 LATCH SET NAME:  SYS.BPX.AP00.PRTB1.PPRA.LSN                          
 CREATOR JOBNAME: OMVS      CREATOR ASID: 0010                         
   LATCH NUMBER:  1                                                    
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME 
     USER1      0027  EXCLUSIVE  OWN       00AE41B0   Y   07:04:13.452 
     BPXOINIT   0041  EXCLUSIVE  WAIT      00AFAAE8   Y   07:04:13.450 
     USER1      0027  EXCLUSIVE  WAIT      00AE4800   Y   07:04:13.363 
     USER1      00E9  EXCLUSIVE  WAIT      00AE4800   Y   06:57:38.356 
     USER1      00AB  EXCLUSIVE  WAIT      00AE4800   Y   06:57:36.971 
     SSHD3      00F0  EXCLUSIVE  WAIT      00AD9DC8   Y   01:42:27.413 
     USER3      00F5  EXCLUSIVE  WAIT      00AFB2F8   Y   00:30:39.263 
     PORTMAP    0098  EXCLUSIVE  WAIT      00AF9040   Y   00:03:46.903 
   LATCH NUMBER:  47                                                   
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME 
     USER1      00AA  EXCLUSIVE  OWN       00AE41B0   Y   16:42:25.036 
     USER1      00AA  EXCLUSIVE  WAIT      00AE4800   Y   16:27:14.359 
   LATCH NUMBER:  125                                                  
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME 
     USER1      0099  EXCLUSIVE  OWN       00AE41B0   Y   15:40:55.436 
     USER1      0099  EXCLUSIVE  WAIT      00AE4800   Y   15:25:55.379 
   LATCH NUMBER:  260
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME
     USER3      00F5  EXCLUSIVE  OWN       00AFB2F8   Y   00:30:39.263
     RSED3      0051  EXCLUSIVE  WAIT      00AC80A0   Y   00:30:36.665
     RSED3      0051  EXCLUSIVE  WAIT      00AC1E88   Y   00:27:39.311
   LATCH NUMBER:  1505
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME
     USER2      00D5  EXCLUSIVE  OWN       00AE41B0   Y   -over 24 hrs
     USER2      00D5  EXCLUSIVE  WAIT      00AE4800   Y   -over 24 hrs
   LATCH NUMBER:  1654
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME
     SSHD3      00F0  EXCLUSIVE  OWN       00AD9DC8   Y   01:42:27.413
     SSHD4      0066  EXCLUSIVE  WAIT      00AFB2F8   Y   01:40:26.298
     SSHD5      0121  EXCLUSIVE  WAIT      00AFB2F8   Y   00:17:57.800
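
Step 1 of the recovery instructions can be scripted once the output is captured as text. The sketch below is a rough Python parser, not an official tool. It assumes the column layout shown above (REQUESTOR, ASID, EXC/SHR, OWN/WAIT, WORKUNIT, TCB, ELAPSED TIME), and the function names are made up for illustration.

```python
import re
from collections import defaultdict

LATCH_RE = re.compile(r"^\s*LATCH NUMBER:\s+(\d+)\s*$")
# One requestor row; the header row fails to match because "ASID" is not hex.
ROW_RE = re.compile(
    r"^\s*(?P<requestor>\S+)\s+(?P<asid>[0-9A-Fa-f]{4})\s+(?P<mode>\S+)\s+"
    r"(?P<state>OWN|WAIT)\s+(?P<workunit>[0-9A-Fa-f]{8})\s+(?P<tcb>[YN])\s+"
    r"(?P<elapsed>.+?)\s*$"
)


def parse_latches(text):
    """Group requestor rows under the latch number that precedes them."""
    latches = defaultdict(list)
    current = None
    for line in text.splitlines():
        match = LATCH_RE.match(line)
        if match:
            current = int(match.group(1))
            continue
        match = ROW_RE.match(line)
        if match and current is not None:
            latches[current].append(match.groupdict())
    return dict(latches)


def owners_blocking(latches, waiter):
    """Return (latch number, owner row) for each latch `waiter` waits on."""
    blocking = []
    for number, rows in latches.items():
        waiting = any(
            r["requestor"] == waiter and r["state"] == "WAIT" for r in rows
        )
        if waiting:
            blocking.extend((number, r) for r in rows if r["state"] == "OWN")
    return blocking


SAMPLE = """\
   LATCH NUMBER:  1
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME
     USER1      0027  EXCLUSIVE  OWN       00AE41B0   Y   07:04:13.452
     BPXOINIT   0041  EXCLUSIVE  WAIT      00AFAAE8   Y   07:04:13.450
   LATCH NUMBER:  47
     REQUESTOR  ASID  EXC/SHR    OWN/WAIT  WORKUNIT  TCB  ELAPSED TIME
     USER1      00AA  EXCLUSIVE  OWN       00AE41B0   Y   16:42:25.036
     USER1      00AA  EXCLUSIVE  WAIT      00AE4800   Y   16:27:14.359
"""

if __name__ == "__main__":
    for number, row in owners_blocking(parse_latches(SAMPLE), "BPXOINIT"):
        print(number, row["requestor"], row["asid"], row["workunit"])
```

Run against the sample rows, this reports latch 1, owned by USER1 in ASID 0027 with work unit 00AE41B0, as the one BPXOINIT is waiting on.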

These commands got us recovered:

FORCE U=USER1,A=00AA,TCB=AE4800 
FORCE U=USER1,A=0099,TCB=AE4800
FORCE U=USER1,A=0099,TCB=AE4800
FORCE U=USER1,A=0099,TCB=AE4800
FORCE U=USER2,A=00d5,TCB=AE4800

We would still like to figure out the root cause here. This could cause real problems if we move Zowe up to production and keep hitting system latch contention.

I tried the following test afterwards and was unable to recreate the issue

savaresejt commented 2 months ago

This might be a bug with z/OSMF. Just spoke with @phaumer in the Z Open Editor issue here.