AND constraint checks not consistent for same sequence and job definition

FreddieMatherSmartDCSIT commented 10 months ago

During a performance test, a job sequence with an AND constraint was reported as a failed constraint check even though the full sequence that should pass the check had been sent. This occurred when running with:

1 AER container and 2 AEOSVDC containers
1 AER container and 4 AEOSVDC containers

The job definition puml file used is

@startuml
partition "AND_constraint_1" {
    group "AND_constraint_1"
        :A;
        :B;
        :C;
        :D;
        :E;
        :F;
        :G;
        :H;
        :I;
        :J;
        :K;
        :L;
        :M;
        :N;
        :O;
        :P;
        :Q;
        :R;
        :S;
        :T;
        :U;
        :V;
        :W;
        :X;
        :Y;
        :Z;
        :AA;
        :AB;
        :AC;
        :AD;
        :AE;
        fork
            :AF;
            detach
        fork again
            :AG;
            detach
        end fork
    end group 
}
@enduml

An example of the (Verifier) logs for a job that fails is given in the attached file (all the events are there, but the PV checks the constraint before the last event, AG, is processed): example_job_fail.txt

An example (from the same run) of the (Verifier) logs for a job that succeeds is given in the attached file (all the events are there, and the PV correctly checks the constraint after the last event, AG, is processed): example_job_success.txt

The event sequence that is used as a template to send the files is given below: defect_AND_job.json

The protocol verifier configuration used is given below:

{
  "SpecUpdateRate": "PT30S",
  "IncomingDirectory": "./incoming",
  "ProcessedDirectory": "./processed",
  "ProcessingDirectory": "./processing",
  "EventThrottleRate": "PT0S",
  "ReceptionDeletionTime": "PT10M",
  "MaxEventsPerReceptionJob": "0",
  "ConcurrentReceptionLimit": "1",
  "SchemaValidate": "true",
  "SchemaValidateFrequency": "1",
  "FileControlWaitTime": "PT1S",
  "MaxOutOfSequenceEvents": "1000",
  "MaximumJobTime": "PT10M",
  "JobCompletePeriod": "PT24H",
  "JobDefinitionDirectory": "config/job_definitions",
  "DefaultJobExpiryDuration": "P99W",
  "DefaultStaleAuditEventDuration": "PT24H",
  "DefaultBlockedAuditEventDuration": "PT24H",
  "JobStoreLocation": "./JobIdStore",
  "JobStoreAgeLimit": "P7D",
  "InvariantStoreLoadRate": "PT2M",
  "MaxIntraSequenceEventTimeoutPeriod": "PT1S",
  "WaitPeriodForAllJobsCompletedCheck": "P1D",
  "WaitPeriodForJobDeletion": "PT30M",
  "WaitPeriodForInvariantDeletion": "P1D",
  "TimeoutPeriodForRetreivingStoredInvariants": "PT10S",
  "TimeoutPeriodForHangingJob": "PT10M"
}
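The timing values in this configuration are ISO 8601 durations. As a reading aid, here is a minimal Python sketch that converts the forms used above (weeks, days, hours, minutes, seconds) into `timedelta` values. The `parse_duration` helper is hypothetical and only covers these forms; it says nothing about how the Protocol Verifier itself parses its configuration.

```python
import re
from datetime import timedelta

# Covers only the duration forms that appear in the configuration above
# (for example PT30S, PT10M, PT24H, P7D, P99W). Not a full ISO 8601 parser.
_DURATION = re.compile(
    r"^P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Hypothetical helper: turn a duration string into a timedelta."""
    match = _DURATION.match(text)
    parts = {}
    if match:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


# Two of the values most relevant to this issue.
intra_sequence_timeout = parse_duration("PT1S")   # MaxIntraSequenceEventTimeoutPeriod
hanging_job_timeout = parse_duration("PT10M")     # TimeoutPeriodForHangingJob
```

With these values, the intra-sequence timeout is one second and the hanging-job timeout is ten minutes.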

Steps to reproduce:

1. Send 500 events/s to the PV (we actually use the following profile but I know plus2json sends at a flat rate)
2. Use the above job definition and the valid event sequence above (plus2json will generate only one valid event sequence from this job definition)
3. Look in the logs for svdc job failures and associated job ids and confirm that some of the jobs are failing even though all events are processed
4. Use the following number of containers
   - 1 AER container and 2 AEOSVDC containers
   - 1 AER container and 4 AEOSVDC containers

ColinCarterUK commented 10 months ago

The issue has been identified. It is a bug in the way that job completion is determined.

Job completion was based on the idea of branch extent: an integer value that represents the current number of parallel branches a job has. Each start event and each tine of an AND fork (after the first one) increments the count. Each tine of a merge fork (after the first one) and each end event decrements it. When the branch extent gets to zero the job is complete.

However, the error case occurs when a job reaches all of its current end points before another tine of an AND fork has even started. The AND fork completion test is done after the job is determined to be complete. The job has completed but the AND fork is incomplete, hence the error.

The fix is to add an additional test when determining job completion that ensures any AND fork that has had at least one tine touched has all tines present. This will stop the premature conclusion of a job. A consequence of the fix is that the current test that each AND fork is complete is no longer used, because the AND forks have to be complete for the job to finish.

Note: an AND fork could be situated on a branch of the job definition after an XOR. This means that the AND fork might never be touched. This is dealt with by the test that if one tine of the AND fork is touched then they all must be; it doesn't simply test that all AND forks have completed.

A fix has been developed and testing is underway.
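To illustrate the race described above, here is a minimal Python sketch of the branch extent counter, applied to the job definition from this issue (start event A, then an AND fork whose tines AF and AG each end their branch). The `JobState` class, its method names, and the fork label are hypothetical and exist only for illustration; the sketch omits merge forks and is not the Protocol Verifier's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Toy model of one job; all names here are illustrative assumptions."""
    and_forks: dict   # fork label -> set of all tine event names
    end_events: set   # event names that terminate a branch
    branch_extent: int = 0
    seen: set = field(default_factory=set)

    def process(self, event, is_start=False):
        if is_start:
            self.branch_extent += 1
        for tines in self.and_forks.values():
            # Every tine after the first one opens an extra parallel branch.
            if event in tines and self.seen & tines:
                self.branch_extent += 1
        if event in self.end_events:
            self.branch_extent -= 1
        self.seen.add(event)

    def complete_old(self):
        # Original rule: the job is complete when the branch extent is zero.
        return self.branch_extent == 0

    def complete_fixed(self):
        # Fixed rule: any AND fork with at least one tine touched must have
        # all tines present. An untouched fork (for example on an untaken
        # XOR branch) does not block completion.
        forks_ok = all(
            tines <= self.seen or not (tines & self.seen)
            for tines in self.and_forks.values()
        )
        return self.branch_extent == 0 and forks_ok


def demo():
    job = JobState(and_forks={"fork_after_AE": {"AF", "AG"}},
                   end_events={"AF", "AG"})
    job.process("A", is_start=True)
    job.process("AE")   # events B..AD behave the same way and are omitted
    job.process("AF")   # first tine arrives and ends its branch
    before_ag = (job.complete_old(), job.complete_fixed())
    job.process("AG")   # second tine arrives later
    after_ag = (job.complete_old(), job.complete_fixed())
    return before_ag, after_ag
```

In this model, checking between AF and AG gives `(True, False)`: the old rule declares the job complete too early, while the fixed rule waits. After AG both rules agree that the job is complete.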

ColinCarterUK commented 10 months ago

Any job in which one or more tines of an AND fork are omitted will now time out. The job knows events are missing, so it waits for them until the JobHangingTimer times out.
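The resulting behaviour can be sketched as a three-way classification for a job whose other branches have all ended. This is a toy Python model: the `job_outcome` function and its parameters are hypothetical, and the ten-minute value is an assumption taken from `TimeoutPeriodForHangingJob` in the configuration above, not a confirmed link to the JobHangingTimer. How and when the real timer is started is not modelled.

```python
from datetime import datetime, timedelta

# Assumption: reuse the TimeoutPeriodForHangingJob value (PT10M) from the
# configuration earlier in this issue.
HANGING_TIMEOUT = timedelta(minutes=10)


def job_outcome(fork_tines, seen, timer_started_at, now, timeout=HANGING_TIMEOUT):
    """Classify a job whose branches have otherwise all ended (toy model)."""
    touched = fork_tines & seen
    missing = fork_tines - seen
    if not touched or not missing:
        # Either the fork was never reached or every tine has arrived.
        return "complete"
    if now - timer_started_at >= timeout:
        # A tine was omitted and the wait has expired.
        return "timed out"
    return "waiting"


# Example: AF has arrived, AG never does.
t0 = datetime(2024, 1, 1, 12, 0, 0)
early = job_outcome({"AF", "AG"}, {"AF"}, t0, t0 + timedelta(minutes=1))
late = job_outcome({"AF", "AG"}, {"AF"}, t0, t0 + timedelta(minutes=10))
```

Here `early` is `"waiting"` and `late` is `"timed out"`, which matches the described change: a job with an omitted tine no longer completes with a failed constraint check, it hangs until the timer expires.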

cortlandstarrett commented 9 months ago

Fixed in v1.1.3. Another bug was identified while fixing this bug. Instance forks had the same weakness as AND forks. This bug was fixed, too.

xtuml / munin

AND constraint checks not consistent for same sequence and job definition #152