xenon-middleware / xenon-cli

Perform file and job operations with the Xenon library from the command line
http://nlesc.github.io/Xenon/
Apache License 2.0

Slurm adaptor got invalid key/value pair in output #72

Closed: arnikz closed this issue 4 years ago

arnikz commented 4 years ago

The GridEngine cluster at UMCU has recently been upgraded to use Slurm (v19), which will replace GE soon-ish. So I tested the sv-callers workflow, but all Slurm jobs failed (I also tried without the --max-memory arg; see the release notes).

xenon -vvv scheduler slurm --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 5 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log
slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
Error submitting jobscript (exit code 1):
13:18:55.487 [main] DEBUG n.e.x.a.s.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
13:18:55.498 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
13:18:55.501 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job
13:18:55.506 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Created Job local-0
13:18:55.507 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
13:18:55.508 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.s.RemoteCommandRunner - CommandRunner took 44 ms, executable = scontrol, arguments = [show, config], exitcode = 0, stdout:
Configuration data as of 2020-02-24T13:18:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = htp-batch-01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-02-17T18:10:10
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = spider
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 8000
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = /data/tmpdir-epilogue.sh
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = NoOverMemoryKill
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerCPU            = 8000
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 26161
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 1-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = htp-batch-01(10.0.0.14)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 19.05.5
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm_state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at htp-batch-01 is UP

stderr:
arnikz commented 4 years ago

Perhaps it's a good time to update the Docker images.

arnikz commented 4 years ago

@sverhoeven: could you give an estimate of how much time is required to fix this? Thanks.

sverhoeven commented 4 years ago

The ScriptingParser used in the SlurmScheduler class does not know about section headers (lines ending with a :). I think it would take at least a day to write a robust parser and a couple of hours to create new Xenon and Xenon-* releases.

jmaassen commented 4 years ago

I've just added a single statement to ignore all lines without an = sign in them.
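
For illustration, a minimal sketch of that kind of filtering is shown below. The class and method names are hypothetical and are not Xenon's actual ScriptingParser API; the point is only that lines without an = sign (such as the "Cgroup Support Configuration:" header in the output above) are skipped instead of triggering an error.

// Minimal sketch only; names are illustrative, not Xenon's real API.
import java.util.HashMap;
import java.util.Map;

class KeyValueOutputParser {
    // Parse "key = value" lines, silently skipping anything without an '=' sign,
    // e.g. blank lines or section headers like "Cgroup Support Configuration:".
    static Map<String, String> parse(String output) {
        Map<String, String> result = new HashMap<>();
        for (String line : output.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key/value pair: ignore instead of throwing
            }
            result.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return result;
    }
}

With a guard like that, the scontrol show config dump above parses cleanly even though it mixes section headers with key/value pairs.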

jmaassen commented 4 years ago

I'll add some tests and release a version with this fix.

jmaassen commented 4 years ago

It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

We would need a slurm 19 container to do proper testing?

arnikz commented 4 years ago

> It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

Thanks.

> We would need a slurm 19 container to do proper testing?

Yes.

arnikz commented 4 years ago

Hi, I've tested my workflow with xenon-cli 3.0.5beta1 + the new slurm-19 image, but the jobs are still failing. Please heeelp!

sverhoeven commented 4 years ago

The conda xenon-cli 3.0.5beta1 package was just made for non-Linux users (#73).

It does not include the fix in the https://github.com/xenon-middleware/xenon/tree/jobstatus-bug branch; it is a build against the Xenon v3.0.4 release.

jmaassen commented 4 years ago

Hmmm... my (new) unit test does parse the output correctly.

I think there may be some version mixup with xenon somewhere. I'll see if I can find the problem.

Update: ah, it seems the fix is only in the jobstatus-bug branch ;-)

jmaassen commented 4 years ago

I'll clean up the branch and test it with the other (non-slurm) scripting adaptors. I can then merge it into master and make a new release.

sverhoeven commented 4 years ago

I created a draft PR https://github.com/xenon-middleware/xenon/pull/670 for the jobstatus-bug branch, to see the test failures more easily.

jmaassen commented 4 years ago

Hmmm... most of the tests pass, except for one integration test. Apparently the sbatch argument "--workdir" has changed to "--chdir" at some point. Will fix.
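
For reference, a sketch of what that change amounts to in the generated batch script header is shown below; the helper class and the boolean switch are hypothetical, not Xenon's actual code.

// Illustrative only: the helper and the version switch are made up for this sketch.
// Older sbatch accepted --workdir; newer Slurm releases use --chdir for the same purpose.
class SbatchOptions {
    static String workingDirectoryLine(String workingDirectory, boolean slurmUsesChdir) {
        String flag = slurmUsesChdir ? "--chdir" : "--workdir";
        return "#SBATCH " + flag + "=" + workingDirectory;
    }
}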

jmaassen commented 4 years ago

Fixed in the 3.1.0 release.

sverhoeven commented 4 years ago

CLI v3.0.5 has been released on conda with Xenon 3.1.0. Please test.

arnikz commented 4 years ago

All works fine with the latest release on Slurm 19. Thanks!