xlab-uiuc / acto

Push-Button End-to-End Testing of Kubernetes Operators and Controllers
Apache License 2.0

The Runner Design #236

Closed tianyin closed 4 months ago

tianyin commented 11 months ago
@Spedoske Can you edit this issue to discuss the Runner design? It seems to me that you are rewriting the Runner in https://github.com/xlab-uiuc/acto/pull/235 and using Ray to run multiple runners on multiple machines (e.g., on CloudLab). Are you trying to support different runner backends? I feel the previous thread runner is also important, as it allows us to run multiple runners on one machine (as discussed in the paper).

Code Design Description

Runner

Class Initialization

The Runner class is initialized with the following parameters:

During initialization, the class sets up a few variables, including preload_images and preload_images_store, and starts a new thread to asynchronously set up the cluster and indicate its availability.

Trial Execution

The run method is responsible for executing a trial and collecting snapshots. It takes the following parameters:

Within the run method, the class waits until the cluster is available (cluster_ok_event) before executing the trial. For each system input in the trial, it attempts to collect a snapshot using the snapshot_collector function. If any exception occurs during snapshot collection, the exception is caught, and the error is stored along with the snapshot (if available). The trial.send_snapshot method is called to send the snapshot and error to the trial.

After processing all system inputs, the cluster availability event is cleared, and a new thread is started to asynchronously reset the cluster and set it as available again.
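
The wait/collect/reset cycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `SketchRunner`, the `setup_cluster` callback, and `RecordingTrial` are hypothetical names, and the sketch assumes that a failed collection yields no snapshot, so `None` is sent together with the error.

```python
import threading
from typing import Any, Callable, Iterable, List, Optional, Tuple


class SketchRunner:
    """Hypothetical, simplified stand-in for the Runner described above."""

    def __init__(self, setup_cluster: Callable[[], None]):
        # Assumed callback; the real class sets up a Kubernetes cluster here.
        self._setup_cluster = setup_cluster
        self.cluster_ok_event = threading.Event()
        # Prepare the cluster in the background and signal when it is ready.
        threading.Thread(target=self._prepare_cluster, daemon=True).start()

    def _prepare_cluster(self) -> None:
        self._setup_cluster()
        self.cluster_ok_event.set()

    def run(self, trial: Any, snapshot_collector: Callable[[dict], Any]) -> Any:
        # Block until the background thread has marked the cluster as available.
        self.cluster_ok_event.wait()
        for system_input in trial:
            snapshot: Optional[Any] = None
            error: Optional[Exception] = None
            try:
                snapshot = snapshot_collector(system_input)
            except Exception as exc:  # keep the error and hand it to the trial
                error = exc
            trial.send_snapshot(snapshot, error)
        # Mark the cluster as unavailable, then reset it asynchronously.
        self.cluster_ok_event.clear()
        threading.Thread(target=self._prepare_cluster, daemon=True).start()
        return trial


class RecordingTrial:
    """Hypothetical trial that only records what the runner sends back."""

    def __init__(self, inputs: Iterable[dict]):
        self._inputs = list(inputs)
        self.received: List[Tuple[Any, Optional[Exception]]] = []

    def __iter__(self):
        return iter(self._inputs)

    def send_snapshot(self, snapshot: Any, error: Optional[Exception]) -> None:
        self.received.append((snapshot, error))
```

Because the reset runs in its own thread, `run` can return as soon as the trial finishes; the next call to `run` simply blocks on `cluster_ok_event` until the fresh cluster is ready.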

Cluster Setup and Teardown

The Runner class provides several private methods for setting up and tearing down the Kubernetes cluster.

TrialInputIterator

The TrialInputIterator class is responsible for generating and iterating over system inputs (test cases) for a trial. It takes the following parameters during initialization:

The class maintains a history of applied system inputs (self.history) and a queue of pending tests (self.queuing_tests). It provides an __iter__ method that generates a tuple of (system_input, signature) where system_input is a dictionary representing the next system input to be applied, and signature is a dictionary representing the signature of the associated test case.

The __iter__ method continues generating system inputs until there are no more tests in the queue (self.queuing_tests) and no more test cases in the next_testcase iterator. For each system input, it applies the test case, appends the result to the history, and yields the system input. It handles mutation and setup of the input fields based on the test case's preconditions and mutators.

The class also provides a flush method, which empties the queue of pending tests, and a revert method, which reverts the last applied test case: it prevents the tests still in the queue from being applied and re-applies the last valid test.
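
To make the queue and history bookkeeping concrete, here is a heavily simplified sketch. It assumes a test case can be modelled as a `(signature, mutator)` pair, where the mutator maps one input dictionary to the next; that shape, the `seed_input` parameter, and the class name `SketchInputIterator` are assumptions for illustration only. Preconditions and field setup are left out.

```python
import copy
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Mutator = Callable[[dict], dict]
TestCase = Tuple[dict, Mutator]  # (signature, mutator): hypothetical shape


class SketchInputIterator:
    """Hypothetical, simplified stand-in for TrialInputIterator."""

    def __init__(self, seed_input: dict, next_testcase: Iterable[TestCase]):
        self.next_testcase = iter(next_testcase)
        self.queuing_tests: Deque[TestCase] = deque()
        # History of (system_input, signature); the seed has an empty signature.
        self.history: List[Tuple[dict, dict]] = [(copy.deepcopy(seed_input), {})]

    def __iter__(self) -> Iterator[Tuple[dict, dict]]:
        while True:
            if not self.queuing_tests:
                try:
                    self.queuing_tests.append(next(self.next_testcase))
                except StopIteration:
                    return  # queue is empty and no test cases remain
            signature, mutator = self.queuing_tests.popleft()
            last_input = self.history[-1][0]
            system_input = mutator(copy.deepcopy(last_input))
            self.history.append((system_input, signature))
            yield system_input, signature

    def flush(self) -> None:
        """Drop every test that is still waiting in the queue."""
        self.queuing_tests.clear()

    def revert(self) -> None:
        """Discard the last applied test and queue the last valid input again."""
        self.flush()
        if len(self.history) < 2:
            return  # only the seed input is left, nothing to revert
        self.history.pop()
        last_valid_input, signature = self.history[-1]
        self.queuing_tests.append(
            (signature, lambda _current, v=last_valid_input: copy.deepcopy(v))
        )
```

In this sketch, calling `revert` between two `next` calls makes the iterator yield the previous valid input once more before it moves on to new test cases.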

Trial

The Trial class manages the execution of a trial and checks the snapshots produced during it.

Class Initialization

The Trial class is initialized with the following parameters:

During initialization, the class sets up various variables, including next_input, checker_set, snapshots, run_results, generation, num_mutation, error, state, and waiting_for_snapshot.

Iterator Functionality

The Trial class implements the iterator protocol by defining the __iter__ method, so instances of the class can be iterated over directly. The __iter__ method returns a generator that yields system inputs from the next_input iterator until the maximum number of mutations (num_mutation) is reached or an error occurs.

Within the generator, the class checks the current state of the trial and retrieves the next system input from next_input. If the trial is in a terminated or runtime exception state, the generator terminates. If no more system inputs are available, the generator terminates as well.

Before yielding the system input, the class sets the waiting_for_snapshot flag to indicate that a snapshot is expected. The yielded system input will be used to collect a snapshot in the trial execution.

Snapshot Collection

The send_snapshot method is responsible for receiving a snapshot and a runtime error (if any) produced during the execution of a system input. It takes the following parameters:

The method first asserts that the trial is waiting for a snapshot and then sets the waiting_for_snapshot flag to False. If a runtime error is provided, the trial state is set to 'runtime_exception', and the error is stored. An error message is also logged using the logging.error method.

If a snapshot is provided, the method appends the snapshot to the snapshots list, increments the generation count, and sets the waiting_for_snapshot flag to False. The check_snapshot method is then called to perform the snapshot checking.

Snapshot Checking

The check_snapshot method is responsible for performing the checking of snapshots using the checker_set. It verifies the correctness of the current snapshot based on the previous snapshot.

The method first asserts that the trial state is not 'runtime_exception' or 'terminated'. It then retrieves the previous snapshot and the current snapshot from the snapshots list. Using the checker_set, the method checks the snapshots and stores the result in the run_results list.
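
The interplay between the iterator, send_snapshot, and check_snapshot might look roughly like the sketch below. Several details are assumptions made for illustration: the class name `SketchTrial`, the initial state name `"normal"`, modelling `checker_set` as a plain callable of `(previous, current)`, passing `None` as the previous snapshot for the first check, and returning early when a runtime error is reported so that no check runs in the `runtime_exception` state.

```python
import logging
from typing import Any, Callable, Iterable, Iterator, List, Optional


class SketchTrial:
    """Hypothetical, simplified stand-in for the Trial class."""

    def __init__(
        self,
        next_input: Iterable[dict],
        checker_set: Callable[[Optional[Any], Any], Any],
        num_mutation: int = 10,
    ):
        self.next_input = iter(next_input)
        self.checker_set = checker_set
        self.num_mutation = num_mutation
        self.snapshots: List[Any] = []
        self.run_results: List[Any] = []
        self.generation = 0
        self.error: Optional[Exception] = None
        self.state = "normal"  # assumed name for the healthy state
        self.waiting_for_snapshot = False

    def __iter__(self) -> Iterator[dict]:
        while self.generation < self.num_mutation:
            if self.state in ("terminated", "runtime_exception"):
                return
            try:
                system_input = next(self.next_input)
            except StopIteration:
                return  # no more system inputs
            # The consumer must answer each input with send_snapshot().
            self.waiting_for_snapshot = True
            yield system_input

    def send_snapshot(self, snapshot: Any, runtime_error: Optional[Exception]) -> None:
        assert self.waiting_for_snapshot
        self.waiting_for_snapshot = False
        if runtime_error is not None:
            self.state = "runtime_exception"
            self.error = runtime_error
            logging.error("Runtime error during trial: %s", runtime_error)
            return  # assumption: nothing is checked once an error is reported
        if snapshot is not None:
            self.snapshots.append(snapshot)
            self.generation += 1
            self.check_snapshot()

    def check_snapshot(self) -> None:
        assert self.state not in ("runtime_exception", "terminated")
        # Assumption: the first snapshot has no predecessor, so None is passed.
        previous = self.snapshots[-2] if len(self.snapshots) >= 2 else None
        current = self.snapshots[-1]
        self.run_results.append(self.checker_set(previous, current))
```

A consumer such as the Runner would drive it with `for system_input in trial: trial.send_snapshot(collect(system_input), None)`, where `collect` is whatever snapshot collector is in use.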

acto.ray

acto.ray provides a mock implementation of the Ray library when Ray is not enabled in the Acto configuration.

The module has two parts: one for mocking the ray.remote and ray.get functionalities, and another for mocking the ActorPool class.

Mocking ray.remote and ray.get

The first part of the code checks if Ray is enabled in the Acto configuration (actoConfig.ray.enabled). If Ray is enabled, it imports the actual Ray module and assigns the ray.remote and ray.get functions to the remote and get variables, respectively.

If Ray is not enabled, it defines a custom remote function and a get function as replacements for ray.remote and ray.get. These functions provide a mock implementation that mimics the behavior of the actual Ray library.

The remote function takes a runner as an argument and modifies the runner object's __init__ method. It adds an attribute remote to the runner object and sets it to the runner itself. It also sets the remote attribute of the run method to the run method itself. This allows the runner object to be used as a remote function call.

The get function simply returns the input as-is.
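
One way such a mock could be written is sketched below; it is an illustration under assumptions, not the code from the commit. Since Python bound methods do not accept new attributes, this sketch wraps `run` in a plain function inside the patched `__init__` and attaches `remote` to that wrapper. `EchoRunner` is a hypothetical example class.

```python
import functools


def remote(cls):
    """Sketch of a local stand-in for ray.remote applied to a class."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        bound_run = self.run

        # Bound methods reject new attributes, so wrap run in a plain function.
        def run(*run_args, **run_kwargs):
            return bound_run(*run_args, **run_kwargs)

        run.remote = run  # runner.run.remote(...) just calls run(...)
        self.run = run

    cls.__init__ = patched_init
    cls.remote = cls  # Cls.remote(...) constructs a local instance
    return cls


def get(value):
    """Sketch of a local stand-in for ray.get: results are already values."""
    return value


@remote
class EchoRunner:
    """Hypothetical runner used only to demonstrate the call pattern."""

    def __init__(self, name):
        self.name = name

    def run(self, trial):
        return f"{self.name}:{trial}"


runner = EchoRunner.remote("r0")
result = get(runner.run.remote("trial-1"))
```

The point of the design is that calling code keeps the same shape, `Cls.remote(...)`, `obj.run.remote(...)`, and `get(...)`, whether or not Ray is enabled.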

Mocking ActorPool

The second part of the code mocks the ActorPool class, which is used for managing a pool of actors.

If Ray is enabled, it imports the actual ActorPool class from the Ray library.

If Ray is not enabled, it defines a mock ActorPool class. This class provides a similar interface to the actual ActorPool class but uses a ThreadPoolExecutor instead of Ray's actor-based execution.
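
A thread-backed pool along these lines could look like the following sketch. The method names `submit`, `has_next`, `get_next`, and `map` mirror Ray's ActorPool interface, but the class name `SketchActorPool` and the internals (an idle-actor queue plus one worker thread per actor) are assumptions for illustration.

```python
import queue
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Deque, Iterable, Iterator


class SketchActorPool:
    """Hypothetical thread-based stand-in for Ray's ActorPool."""

    def __init__(self, actors: Iterable[Any]):
        actors = list(actors)
        if not actors:
            raise ValueError("at least one actor is required")
        self._idle: "queue.Queue[Any]" = queue.Queue()
        for actor in actors:
            self._idle.put(actor)
        # One worker thread per actor, so a running task always finds an actor.
        self._executor = ThreadPoolExecutor(max_workers=len(actors))
        self._pending: Deque[Future] = deque()

    def submit(self, fn: Callable[[Any, Any], Any], value: Any) -> None:
        def task() -> Any:
            actor = self._idle.get()  # take an idle actor
            try:
                return fn(actor, value)
            finally:
                self._idle.put(actor)  # hand the actor back to the pool

        self._pending.append(self._executor.submit(task))

    def has_next(self) -> bool:
        return bool(self._pending)

    def get_next(self) -> Any:
        # Results are returned in submission order.
        return self._pending.popleft().result()

    def map(self, fn: Callable[[Any, Any], Any], values: Iterable[Any]) -> Iterator[Any]:
        for value in values:
            self.submit(fn, value)
        while self.has_next():
            yield self.get_next()
```

With the mocked `remote` and `get` in place, a call such as `pool.map(lambda actor, trial: actor.run.remote(trial), trials)` would read the same against either backend.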

The ActorPool class has the following attributes and methods:

Spedoske commented 11 months ago

There is a mocked thread-based implementation for ray. See commit https://github.com/xlab-uiuc/acto/pull/235/commits/4bea2fe2ac6a077f470a3beda336620a0832942b

Spedoske commented 11 months ago

Updated @tianyin @tylergu

tianyin commented 11 months ago

This is a very detailed writeup. Thanks @Spedoske !

@tylergu -- could you take a read of it and merge the writing to some design docs in the repo and then close it?

tylergu commented 11 months ago

@Spedoske Thanks for the fantastic writeup and all the efforts! From my understanding this would be quite a big change in the code base. The architecture looks correct overall, but there are some details which I want to discuss. Let's have a sync to discuss some details, such as the abstraction of the input generation and the remote RPC (I think we should have an interface to implement for running the tests, and the default is to run it locally, and Ray can be one of the implementations).

tylergu commented 11 months ago

@tianyin We will iterate on this and close it once we agree on the design and migrate the design doc into the repo

tianyin commented 11 months ago

Sounds great! Thanks for doing it @tylergu @Spedoske

tianyin commented 4 months ago

Close as it is not actively pursued for now.