zowe / zss

Zowe System Services Server for enabling low-level microservices
Eclipse Public License 2.0

Issues starting ZSS in Zowe V2 #600

Closed Dingmans closed 6 months ago

Dingmans commented 1 year ago

Greetings,

I've installed Zowe V2.8 and most components come up fine. But ZSS terminates more or less instantly and goes into a restart loop. It seems ZSS ends before anything is written to the zssServer log, as the log has been empty so far.

What I can see in joblog:

ZWESVUSR INFO (zwe-internal-start-component) starting component zss ...
ZWESVUSR INFO ZWEL0004I component zss(33555312) terminated, status = 0
ZWESVUSR INFO ZWEL0005I next attempt to restart component zss in 1 seconds
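A restart loop like this can be confirmed by counting how often the launcher reports the component as terminated. The following is a minimal sketch, assuming the joblog has been saved as a plain-text file; the function name `count_terminations` and the file name in the usage note are hypothetical, and only the `ZWEL0004I` message format is taken from the lines above.

```shell
# Count ZWEL0004I "terminated" messages for one component in a saved joblog.
# $1 = path to the joblog text file (hypothetical, e.g. a downloaded copy)
# $2 = component name as it appears in the message (e.g. zss)
count_terminations() {
  # -F treats the pattern as a fixed string, so the "(" needs no escaping;
  # -c prints the number of matching lines (0 if there are none).
  grep -cF "ZWEL0004I component $2(" "$1"
}
```

For example, `count_terminations joblog.txt zss` prints the number of times zss was reported as terminated; a count that keeps growing between checks suggests the loop is still running.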

What I can see on the syslog:

IEF450I ZWE2SZ STEP1 - ABEND=S000 U4088 REASON=00000075 980
        TIME=11.13.06

IEA995I SYMPTOM DUMP OUTPUT 978
  USER COMPLETION CODE=4088 REASON CODE=00000075
 TIME=11.13.04  SEQ=27066  CPU=0000  ASID=0108
 PSW AT TIME OF ERROR  078D1400   8915F896  ILC 2  INTC 0D
   NO ACTIVE MODULE FOUND
   NAME=UNKNOWN
   DATA AT PSW  0915F890 - 00181610  0A0DA7F4  001C1811
   AR/GR 0: 00000000/84000000   1: 00000000/84000FF8
         2: 00000000/00000075   3: 00000000/890E5000
         4: 01FF000A/1D162870   5: 00000000/1C783FD4
         6: 00000000/1C90FA90   7: 00000000/0915F85A
         8: 00000000/80000000   9: 00000000/1D1A3FD0
         A: 00000000/00000001   B: 00000000/8915F7B8
         C: 00000000/1C9113B8   D: 00000000/1D10F358
         E: 00000000/892D520C   F: 00000000/00000075
 END OF SYMPTOM DUMP
BPXP018I THREAD 1E45080000000000, IN PROCESS 67109370, ENDED 979
WITHOUT BEING UNDUBBED WITH COMPLETION CODE 84000FF8
, AND REASON CODE 00000075.

+CEE3798I ATTEMPTING TO TAKE A DUMP FOR ABEND U4088 TO DATA SET:
XEAT.XSDDNEMP.D132.T1113043.P40001FA
IGD101I SMS ALLOCATED TO DDNAME (SYS00002) 973
        DSN (XEAT.XSDDNEMP.D132.T1113043.P40001FA        )
        STORCLAS (SCIBSTD) MGMTCLAS (HSMMIGL2) DATACLAS (SQ00T2)
        VOL SER NOS= `TM1T13`

I'm not sure what U4088 REASON=00000075 means, but it seems storage/pointer related.

X'75' (117)
During a stack overflow on the Down stack, the stack pointer (register 4) was not within the current Down stack segment.
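The pairing of X'75' with 117 is plain hexadecimal-to-decimal conversion of the reason code from the IEF450I message. A small sketch of that arithmetic follows; the helper name `reason_to_decimal` is hypothetical.

```shell
# Convert a hexadecimal reason code (as printed in IEF450I) to decimal.
reason_to_decimal() {
  # printf reads the 0x-prefixed argument as a hex constant; leading zeros are harmless.
  printf '%d\n' "0x$1"
}

reason_to_decimal 00000075   # prints 117
```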

I will provide the CEE dump, as well as our zowe.yaml and a joblog. I used binary FTP to fetch the CEE dump from z; the DCB information is at the bottom in case you need to pre-allocate a target dataset.

joblog-ceedmp-yaml.zip

Any help would be appreciated.

CEEDUMP DCB:

Organization  . . . : PS
Record format . . . : FBS
Record length . . . : 4160
Block size  . . . . : 24960
1st extent tracks . : 2002
Secondary tracks  . : 2402

Many thanks, David

JoeNemo commented 1 year ago

Is this a small test system? Does this system have much other USS/OMVS workload on it? Sorry that there aren't a lot of specifics to 'guess' on here. But maybe there is a memory resource limit issue? Can we get a dump of the environment variables?

Dingmans commented 1 year ago

Hello!

Yep, no problem - it is a small system for sure, though there should be room to adjust if there is a resource shortage of some kind.

The workload under USS is certainly increasing, though the OMVS system wide limits are still holding up I'd say.

If anything our LE heap/stack values might be set a bit low (I believe lower than default values at least).

I've attached our environment vars. envvar.txt

Many thanks!

ifakhrutdinov commented 1 year ago

Hello @Dingmans , could you set zowe.launcher.shareAs to no in zowe.yaml? See https://docs.zowe.org/stable/appendix/zowe-yaml-configuration/#launcher-and-launch-scripts

You'd need to add the following under zowe:

  launcher:
    shareAs: no

I suspect that the heap is running out and this option will force the launcher to start every component in a separate address space making more room for the heap of zss. If this helps, the HEAP LE options should be adjusted, so that all the components start properly in a single address space.

Dingmans commented 1 year ago

Hello @ifakhrutdinov!

I tried as you suggested, and observed the same issue with ZSS as described initially. Our LE HEAP is defaulting on 64-bit, and is increased somewhat for 31-bit.

HEAP64=((1M,1M,KEEP,32K,32K,KEEP,4K,4K,FREE),OVR)
HEAP=(64K,32K,ANYWHERE,FREE,8K,4K)

Br, David

Dingmans commented 1 year ago

In the STC joblog I see a few debug statements, for example:

ZWESVUSR DEBUG (zwe-internal-start-prepare,configure_components) export _CEE_RUNOPTS="XPLINK(ON),HEAPPOOLS(ON)"

I think it is weird that we get HEAPPOOLS(ON), as I read somewhere it should be HEAPPOOLS(OFF), which is what is specified in the proc from the Zowe samplib, and I haven't changed that.

I can't find anywhere where I have specified HEAPPOOLS(ON) in my own env profile, the system wide profile for all users under USS or in any configuration file for Zowe as far as I can find/remember. In CEEPRM HEAPPOOL defaults to OFF.

I don't know if HEAPPOOLS is related to this issue, but the problem is that any attempt on my side to add stack/heap adjustments to _CEE_RUNOPTS via the STDENV DD does not seem to be picked up.

Any idea where those export statements are coming from?

//David

ifakhrutdinov commented 1 year ago

@Dingmans could you attach the latest log?

Dingmans commented 1 year ago

Here it is: ZWE2SV2.txt

If you notice the JVMJ9VM015W messages, we probably triggered those by adding a MEMLIM parameter to the PROC in this particular startup - to see if it would pick that up at least. We set it to 500M, which was too small. Otherwise it looks much the same as any other initialization.

ifakhrutdinov commented 1 year ago

@Dingmans , thanks, it still has _BPX_SHAREAS=YES. Could you also link your config file?
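Whether a change to `zowe.launcher.shareAs` actually reached the components can be checked by listing the distinct `_BPX_SHAREAS` values that appear in a saved copy of the log. This is only a sketch: the function name `shareas_values` and the file name in the usage note are hypothetical, and the pattern assumes the setting shows up in the log as `_BPX_SHAREAS=<value>`, as in the comment above.

```shell
# List the distinct _BPX_SHAREAS settings found in a saved log file.
# $1 = path to the log text file (hypothetical, e.g. a downloaded copy)
shareas_values() {
  # sed prints only the matched NAME=VALUE token from each line that contains one;
  # sort -u collapses repeated values into a single line each.
  sed -n 's/.*\(_BPX_SHAREAS=[A-Za-z]*\).*/\1/p' "$1" | sort -u
}
```

Running `shareas_values ZWE2SV2.txt` on a log taken with the original setting would be expected to show only `_BPX_SHAREAS=YES`; more than one distinct value would suggest the setting differs between components.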

Dingmans commented 1 year ago

Yes, sorry - I disabled shareas=no after the initial test as it didn't help with zss, and it seemed to bring some noticeable overhead.

Here is another joblog with shareAs=no: ZWE2SV3.txt

ifakhrutdinov commented 1 year ago

In the V2 log I can see multiple out-of-memory exceptions in Java, and they're absent in the V3 log. So some memory related issues are definitely present.

I don't see why zss terminated in the V3 log. Could you attach it here too? I'm wondering if it's the same ABEND this time. I think it should be /SYSTEM/var/zowe/lab/logs/zssServer-2023-06-14-10-06.log.

Dingmans commented 1 year ago

I might be causing some confusion here. The Java issues were probably triggered by me adding a MEMLIM parameter to the procedure, which is what you see in the V2 log. I removed that MEMLIM parameter since it was just causing more issues, so I think that is why you are seeing a difference with regard to Java.

(Why I added that was at suggestion from colleague but I never had any faith in it as a solution so to speak, shouldn't have sent that specific log).

ZWE2SZ abends more or less the same second it is started, so I suspect that it never reaches a point where it can write anything.

A colleague of mine who can read dumps checked the CEE dump produced by these abends, and he points to the stack being the issue. So I think a good next step would be to try to increase the stack.

Zowe documentation also recommends a HEAP64 size of at least (4M,4M.... and I think we currently default to (1M,1M.. But perhaps this is what you hope to circumvent with the "shareAs: no" parameter.

Which is why I need to find what is overriding any changes made to _CEE_RUNOPTS that is specified in ZWESLSTC on the STDENV DD.

I think it should be fine to specify a new stack & heap parameters there?

ifakhrutdinov commented 1 year ago

I don't think the values from STDENV will be inherited by the components started by the launcher process. STDENV will be used by the launcher only.

You can try adding custom runtime options to the zssServer.sh script directly and see if that affects it.
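As a rough illustration of that suggestion, the sketch below appends options to whatever `_CEE_RUNOPTS` already holds instead of replacing it. Everything here is an assumption rather than the project's documented method: the helper name `append_runopts` is hypothetical, the right place for it inside `zssServer.sh` is not stated in this thread, and the `STACK` sizes are placeholders rather than recommended values. `RPTOPTS(ON)` is the Language Environment option that reports the run-time options in effect, which may help show whether the additions were picked up.

```shell
# Append one LE run-time option to _CEE_RUNOPTS, keeping any existing options.
append_runopts() {
  if [ -n "${_CEE_RUNOPTS:-}" ]; then
    _CEE_RUNOPTS="${_CEE_RUNOPTS},$1"
  else
    _CEE_RUNOPTS="$1"
  fi
  export _CEE_RUNOPTS
}

# Ask LE to report the options actually in effect.
append_runopts "RPTOPTS(ON)"
# Placeholder sizes only; the last two suboptions size the XPLINK downward stack.
append_runopts "STACK(256K,128K,ANYWHERE,KEEP,1M,256K)"
echo "$_CEE_RUNOPTS"
```

Appending rather than overwriting keeps options such as `XPLINK(ON)` that the start scripts already export.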

Regarding HEAP64, as far as I know zss is still a 31-bit application, so it shouldn't affect it.

I'm going to have a closer look at the CEE dump...

ifakhrutdinov commented 1 year ago

Ok, looks like zss reaches some code:

Registers and PSW:
GPR0..... 00000000_84000000  GPR1..... 00000000_84000FF8  GPR2..... 00000000_00000075  GPR3..... 00000000_890E5000
GPR4..... 00000000_1D162870  GPR5..... 00000000_1C783FD4  GPR6..... 00000000_1C90FA90  GPR7..... 00000000_0915F85A
GPR8..... 00000000_80000000  GPR9..... 00000000_1D1A3FD0  GPR10.... 00000000_00000001  GPR11.... 00000000_8915F7B8
GPR12.... 00000000_1C9113B8  GPR13.... 00000000_1D10F358  GPR14.... 00000000_892D520C  GPR15.... 00000000_00000075
PSW..... 078D1400 8915F896

Traceback:
  DSA      Entry       E  Offset  Statement   Load Mod             Program Unit                   Service  Status

  1        jsonToJS1   -1361FDAA              *PATHNAM                                                     Call
  2        jsonToJS1   +0000048C              *PATHNAM                                                     Call
  3        jsonToJS1   +0000048C              *PATHNAM                                                     Call
  4        jsonToJS1   +0000048C              *PATHNAM                                                     Call
  5        ejsJsonToJS_internal
                       +00000022              *PATHNAM                                                     Call
  6        evaluationVisitor
                       +00000196              *PATHNAM                                                     Call
  7        visitJSON   +00000124              *PATHNAM                                                     Call
  8        visitJSON   +00000166              *PATHNAM                                                     Call
  9        visitJSON   +00000166              *PATHNAM                                                     Call
  10       evaluateJsonTemplates
                       +00000096              *PATHNAM                                                     Call
  11       cfgLoadConfiguration2
                       +00000166              *PATHNAM                                                     Call
  12       cfgLoadConfiguration
                       +00000022              *PATHNAM                                                     Call
  13                   +00000002              CEEPLPKA                                                     Call
  14       main        +000005D2              *PATHNAM                                                     Call
  15                   +00000002              CEEPLPKA                                                     Call
  16                   +00001270              CEEPLPKA                                                     Call
  17       EDCZHINV    +000000B4              CELHV003             EDCZHINV                       HLE77D0  Call
  18       CEEBBEXT    +000001C6              CEEPLPKA             CEEBBEXT                       HLE77D0  Call

Just to make sure we have all the pieces, could you attach the zss log? I don't think it's been attached in this ticket before.

1000TurquoisePogs commented 1 year ago

It was requested in the call where this was discussed that perhaps debugging should be turned on. If so, the simplest way to do that is to add this to the ZWESLSTC JCL in the bottom section:

ZLDEBUG=ON

Dingmans commented 1 year ago

I would add the zss log, but it's just empty, so there hasn't been any point so far.

I started zowe with ZLDEBUG=ON. Attaching the new joblog: ZWE2SV4.txt

ifakhrutdinov commented 1 year ago

@Dingmans Thank you.

@JoeNemo the latest log has some JSON&configmgr-related debug messages. Can you please have a look?

JoeNemo commented 1 year ago

I see some bad characters in the log, so I am still a little suspicious of character set issues, but I'm not sure whether that is causing the bug. The messages about JSON Schema validation are "normal"; that is, the validation goes through all of the JSON without finding anything bad.

It's going to be a very long diagnosis without reproducing this bug or seeing the memory for the embedded expression which fails to evaluate. I think we should follow what we were discussing about getting a trace on the code in embeddedjs.c.

Dingmans commented 1 year ago

Greetings,

Finally made some progress by setting the components.zss.agent.64bit to true. Now zss is started and the zss log is written. It looks like it came up fine.

ZWES1014I ZIS status - 'Ok' (name='ZWESIS_LAB ', cmsRC='0', description='Ok', clientVersion='2')

I guess this would only be a problem if we want to run some plugin that requires zss to be run in 31-bit amode.

Not sure what to make of the root cause still, somehow amode 31 is an issue?

ifakhrutdinov commented 1 year ago

@Dingmans great news!

This is either related to below-the-bar storage, which, if constrained, affects 31-bit zss since it uses that storage for its stack and heap, or there is just a bug, i.e. something in the init code doesn't play nice.

> I guess this would only be a problem if we want to run some plugin that requires zss to be run in 31-bit amode.

Yes, if a plug-in isn't built for 64-bit, it won't work.

Please keep this issue open for now if possible; we're looking at this.

Dingmans commented 1 year ago

@ifakhrutdinov

Yep, I'm interested in any findings also, so no problem keeping this open. Just ping if you need any more info from my side.

Regards, David

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but can be reopened if needed. Thank you for your contributions.

github-actions[bot] commented 6 months ago

This issue has been automatically closed due to lack of activity. If this issue is still valid and important to you, it can be reopened.