zowe / zss

Zowe System Services Server for enabling low-level microservices
Eclipse Public License 2.0

Starting new Zowe results in multiple S0C4 (U4088) abends then terminates #736

Open bobbydixon opened 1 month ago

bobbydixon commented 1 month ago

Describe the bug Client installed Zowe V2.18 to test. After completing the installation steps, they started the two Zowe STCs, ZWESISTC and ZWESLSTC. ZWESLSTC issued multiple S0C4 dumps and terminated.

Steps to Reproduce

  1. Install Zowe V2.18
  2. Start the two Zowe STCs, ZWESISTC & ZWESLSTC

Expected behavior Expect the two Zowe STCs to start, initialize, and stay up until stopped.

Additional context This seems similar to GitHub issue https://github.com/zowe/zss/issues/600, which was closed due to inactivity.

I had Howard, one of our LE experts, review the dump, and he found:

As noted previously the u4088-75 is happening because the XPLINK backchain pointer at 33150830 is not in storage.   I don't see any sign of a prior freemain/release but I do see many prior program checks and RCVY systrace entries.  The customer is specifying TERMTHDACT(MSG...), which is not optimal for this.  Please ask them to change their TERMTHDACT RTO from MSG to UADUMP so we can get a u4039 dump of the first failure.

They have:

PARMLIB(CEEPRMML) OVR TERMTHDACT(MSG,CESE,00000096)

Change it to TERMTHDACT(UADUMP,CESE,00000096)

They can add it to their existing _CEE_RUNOPTS statement.
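A minimal sketch of where that could go, assuming the instance sets LE options through zowe.yaml (the zowe.environments block is one common place for environment variables; adjust if your _CEE_RUNOPTS is defined elsewhere, e.g. in the STC JCL):

zowe:
  environments:
    _CEE_RUNOPTS: "TERMTHDACT(UADUMP,CESE,00000096)"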

The dump shows many DSAs using addresses within the page at 33150000, but that page is not in storage. Systrace also shows many recovery retries pointing to SDWAs that are also no longer in storage. TERMTHDACT is now set to UADUMP.

Updates noted; I am continuing with the u4088 dump. The u4088 is being issued out of CEEVXPAL because the calculated previous DSA location is not within the current downstack segment. In CEEVXPAL we have register 4 set to the value passed to CEEVXPAL, 33150830. We do some calculations to see if this value is greater than SMCB_DSBOS, the highest stack address, or less than the stack floor value, 3325080. It is less than the stack floor, so we take the u4088-75 abend. The value in register 4 is passed to CEEVXPAL by the caller. We save the caller's registers in the SMCB. Register 7 is set to x'B2780DD0', which is x'418' bytes into jsonToJS1, so this code needs to be reviewed to see why it is passing this value.
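In other words, the validation in CEEVXPAL amounts to a bounds check along these lines (an illustrative C sketch only, not LE's actual code; the bounds values and the raiseU4088 helper are hypothetical, while the register 4 value is the one from the dump):

/* Illustrative sketch only -- not LE's actual implementation.     */
/* CEEVXPAL rejects a caller-supplied previous-DSA address that    */
/* falls outside the current XPLINK downstack segment.             */
#include <stdio.h>

typedef unsigned long Addr;

/* hypothetical stand-in for LE's abend processing */
static void raiseU4088(int reason) {
    printf("U4088-%d: previous DSA outside current stack segment\n", reason);
}

static void validatePrevDsa(Addr prevDsa,
                            Addr smcbDsbos,    /* highest stack address */
                            Addr stackFloor) { /* lowest stack address  */
    if (prevDsa > smcbDsbos || prevDsa < stackFloor) {
        raiseU4088(75); /* the u4088-75 seen in the dump */
    }
}

int main(void) {
    /* R4 from the dump was x'33150830'; the bounds here are made up */
    /* but chosen so the address falls below the floor, as reported. */
    validatePrevDsa(0x33150830UL, 0x33200000UL, 0x33180000UL);
    return 0;
}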

I also reviewed the prior RCVY/pgm chk calls in systrace. They all involve CEEVHPSO, LE's HP linkage stack overflow processor. These tell me the stack segments are overflowing and new segments are needed, which happens when the stack size is too small. I see this happening about 242 times between 14:45:27.404950835 and 14:45:48.968235664 before the u4088, which looks excessive. The Stack RTO is set to:

OVERRIDE OVR STACK(0000012272,0000004080,ANY ,FREE, 0000012272,0000004080)

This may be too small. I would recommend setting the RPTSTG(ON) option and then reviewing the report to get a better idea of what to use for the stack setting.
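For example (RPTSTG is a standard LE runtime option; the storage report is written when the enclave terminates, so recycle ZWESLSTC after it has run a representative workload):

_CEE_RUNOPTS=RPTSTG(ON)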

But jsonToJS1 needs to be reviewed in addition to this.

It is also probable that the stack overflow is exhausting stack storage, driving jsonToJS1 to request stack storage that is now outside of the current segment. The traceback also hints at a possible loop at jsonToJS1 +48C.

Traceback:
DSA  Entry                  E Offset   Statement  Load Mod  Program Unit  Service  Status
  1  jsonToJS1              +00000000             PATHNAM                          Call
  2  jsonToJS1              +0000048C             PATHNAM                          Call
  3  jsonToJS1              +0000048C             PATHNAM                          Call
  4  jsonToJS1              +0000048C             PATHNAM                          Call
  5  ejsJsonToJS_internal   +00000022             PATHNAM                          Call
  6  evaluationVisitor      +00000196             PATHNAM                          Call
  7  visitJSON              +00000124             PATHNAM                          Call
  8  visitJSON              +00000166             PATHNAM                          Call
  9  visitJSON              +00000166             PATHNAM                          Call
 10  evaluateJsonTemplates  +00000090             PATHNAM                          Call
 11  cfgLoadConfiguration2  +00000166             PATHNAM                          Call
 12  cfgLoadConfiguration   +00000022             PATHNAM                          Call
 13  CEEVHPFR               +00000002             CEEPLPKA                         Call
 14  main                   +000005E4             *PATHNAM                         Call
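The repeated jsonToJS1 +0000048C frames above are what recursion looks like in an LE traceback: each level of JSON nesting costs one more DSA on the XPLINK downstack. A hypothetical sketch of the pattern (not the actual ZSS source; types and names here are made up):

/* Hypothetical sketch of a recursive JSON walk -- not the actual  */
/* jsonToJS1 from ZSS. Each nested object or array consumes one    */
/* more stack frame (DSA), so deeply nested input plus a small     */
/* STACK setting can exhaust the downstack.                        */
#include <stddef.h>

typedef enum { JSON_SCALAR, JSON_OBJECT, JSON_ARRAY } JsonKind;

typedef struct JsonNode {
    JsonKind kind;
    struct JsonNode *children; /* first child, if object/array */
    struct JsonNode *next;     /* next sibling                  */
} JsonNode;

/* returns the nesting depth; recursion depth equals JSON depth */
static size_t jsonDepth(const JsonNode *node) {
    size_t maxChild = 0;
    if (node == NULL) {
        return 0;
    }
    for (const JsonNode *c = node->children; c != NULL; c = c->next) {
        size_t d = jsonDepth(c); /* one new DSA per nesting level */
        if (d > maxChild) {
            maxChild = d;
        }
    }
    return maxChild + 1;
}

int main(void) {
    JsonNode leaf = { JSON_SCALAR, NULL, NULL };
    JsonNode root = { JSON_OBJECT, &leaf, NULL };
    return (int)jsonDepth(&root); /* depth 2 */
}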

After discussing this issue in our Zowe meeting, Andrew suggested starting ZSS in 64-bit mode by adding 64bit: true to the zowe.yaml file as follows, then recycling the Zowe STCs:

zss:
  enabled: true
  port: 7557
  crossMemoryServerName: ZWESIS_STD
  tls: true
  agent:
    64bit: true
    jwt:
      fallback: true

1000TurquoisePogs commented 1 month ago

Hi, I'm discussing this with others, but I was able to reproduce this and resolve it by varying the CEE runopts STACK setting:

_CEE_RUNOPTS=STACK(24576,16384,ANYWHERE,KEEP,524288,13107) works.
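For reference, the six values break down as follows (per the LE STACK runtime option syntax; the last pair sizes the XPLINK downward stack, which is the one overflowing here):

  - 24576, 16384: initial size and increment of the upward-growing stack
  - ANYWHERE: stack storage may live above or below the 16M line
  - KEEP: freed stack increments stay allocated for reuse
  - 524288, 13107: initial size and increment of the XPLINK downstack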

This may be covering up a bug, so it's still being investigated, but it can serve as a workaround for the moment.

bobbydixon commented 1 month ago

Hi, thanks for the update.

Are you saying that adding "64bit: true" to zss in the zowe.yaml file is a workaround (for now), or adding "_CEE_RUNOPTS=STACK(24576,16384,ANYWHERE,KEEP,524288,13107)" to the CEE runopts?

Kind regards, Bobby

ifakhrutdinov commented 1 month ago

@bobbydixon, yes, those should be valid workarounds.

ifakhrutdinov commented 4 days ago

@bobbydixon, we've fixed a bug in our code which may have contributed to this issue, but we've also found an issue in the compiler, and that's probably the root cause. We've opened a case with IBM. We'll let you know if there are any additional workarounds you could use.