yamatokataoka / reinforcement-learning-replications

Reinforcement Learning Replications is a set of PyTorch implementations of reinforcement learning algorithms.
MIT License

Refactor collect_one_epoch_experience function in OnPolicyAlgorithm #64

Closed. yamatokataoka closed this issue 2 years ago

yamatokataoka commented 2 years ago

Refer to https://github.com/yamatokataoka/reinforcement-learning-replications/issues/59

yamatokataoka commented 2 years ago

All final shapes right before training

4000 is the number of steps per epoch; 4 is the observation dimension.

observations.shape: torch.Size([4000, 4])
actions.shape: torch.Size([4000])
advantages.shape: torch.Size([4000])
discounted_returns.shape: torch.Size([4000])
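
The missing action dimension is expected for CartPole: the policy is categorical, so each sampled action is a scalar index. A minimal sketch of why the stacked actions end up 1-D (the two-action logits here are assumed for illustration, not taken from the repo):

import torch
from torch.distributions import Categorical

# CartPole has 2 discrete actions; sampling one index per step gives a 1-D batch.
logits = torch.zeros(4000, 2)  # assumed: 4000 steps per epoch, 2 actions
actions = Categorical(logits=logits).sample()
print(actions.shape)  # torch.Size([4000]), matching actions.shape above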
yamatokataoka commented 2 years ago

All experiences separated by episode; collecting experience with lists and NumPy.

def collect_one_epoch_experience(self, steps_per_epoch: int) -> OneEpochExperience:
    one_epoch_experience: OneEpochExperience = {
        "observations": [],
        "actions": [],
        "rewards": [],
        "observations_with_last_observations": [],
        "episode_returns": [],
        "episode_lengths": [],
    }

    # Variables on an episode
    episode_observations: List[np.ndarray] = []
    episode_actions: List[np.ndarray] = []
    episode_rewards: List[float] = []
    episode_observations_with_last_observations: List[np.ndarray] = []
    episode_return: float = 0.0
    episode_length: int = 0

    observation: np.ndarray = self.env.reset()

    for current_step in range(steps_per_epoch):
        episode_observations.append(observation)
        episode_observations_with_last_observations.append(observation)

        observation_tensor: Tensor = torch.from_numpy(observation).float()

        with torch.no_grad():
            policy_dist: Distribution = self.policy(observation_tensor)

        action: Tensor = policy_dist.sample()
        action_ndarray = action.detach().numpy()
        episode_actions.append(action_ndarray)

        reward: float
        episode_done: bool
        observation, reward, episode_done, _ = self.env.step(action_ndarray)

        episode_return += reward
        episode_rewards.append(reward)

        episode_length += 1
        self.current_total_steps += 1
        epoch_ended: bool = current_step == steps_per_epoch - 1

        if episode_done or epoch_ended:
            if epoch_ended and not episode_done:
                logger.debug(
                    "The trajectory cut off at {} steps on the current episode".format(
                        episode_length
                    )
                )

            episode_observations_with_last_observations.append(observation)

            one_epoch_experience["observations"].append(
                episode_observations
            )
            one_epoch_experience["actions"].append(episode_actions)
            one_epoch_experience["rewards"].append(episode_rewards)
            one_epoch_experience["observations_with_last_observations"].append(
                episode_observations_with_last_observations
            )

            one_epoch_experience["episode_returns"].append(episode_return)
            one_epoch_experience["episode_lengths"].append(episode_length)

            if episode_done:
                self.current_total_episodes += 1

            observation = self.env.reset()
            episode_return, episode_length = 0.0, 0
            episode_observations, episode_actions, episode_rewards, episode_observations_with_last_observations = [], [], [], []

    return one_epoch_experience

Preprocessing before training

def train(self, one_epoch_experience: OneEpochExperience) -> None:
    observations_list: List[List[np.ndarray]] = one_epoch_experience["observations"]
    actions_list: List[List[np.ndarray]] = one_epoch_experience["actions"]
    rewards_list: List[List[float]] = one_epoch_experience["rewards"]
    observations_with_last_observations_list: List[List[np.ndarray]] = one_epoch_experience["observations_with_last_observations"]

    # Calculate rewards-to-go over each episode, to be targets for the value function
    discounted_returns: Tensor = torch.from_numpy(np.concatenate([
        discount_cumulative_sum(one_episode_rewards, self.gamma)
        for one_episode_rewards in rewards_list
    ])).float()

    # Calculate advantages
    observations = torch.from_numpy(np.concatenate(observations_list)).float()
    actions = torch.from_numpy(np.concatenate(actions_list)).float()

    values_tensor_list: List[Tensor] = []
    with torch.no_grad():
        for observations_with_last_observation in observations_with_last_observations_list:
            observations_with_last_observation = torch.from_numpy(np.stack(observations_with_last_observation)).float()
            values_tensor_list.append(self.value_function(observations_with_last_observation).flatten())

    advantages: Tensor = torch.from_numpy(np.concatenate([
        gae(one_episode_rewards, self.gamma, one_episode_values.numpy(), self.gae_lambda)
        for one_episode_rewards, one_episode_values in zip(rewards_list, values_tensor_list)
    ])).float()
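
As a toy illustration of the rewards-to-go step above, here is a self-contained sketch (not the repo's code): discount_cumulative_sum is re-implemented with the scipy.signal.lfilter trick that shows up in the cProfile output of the current implementation further down.

import numpy as np
import scipy.signal

def discount_cumulative_sum(rewards: np.ndarray, discount: float) -> np.ndarray:
    # Reverse-filter trick: y[t] = rewards[t] + discount * y[t + 1]
    return scipy.signal.lfilter([1], [1, -discount], rewards[::-1])[::-1]

gamma = 0.99
rewards_list = [[1.0, 1.0, 1.0], [1.0, 1.0]]  # two toy episodes
discounted_returns = np.concatenate([
    discount_cumulative_sum(np.asarray(rewards), gamma) for rewards in rewards_list
])
print(discounted_returns)  # per-episode rewards-to-go: [2.9701, 1.99, 1.0] and [1.99, 1.0]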
yamatokataoka commented 2 years ago

All experiences not separated by episode; collecting experience with lists and Tensors.

def collect_one_epoch_experience(self, steps_per_epoch: int) -> OneEpochExperience:
    observations: List[Tensor] = []
    actions: List[Tensor] = []
    rewards: List[float] = []
    observations_with_last_observations: List[Tensor] = []
    episode_returns: List[float] = []
    episode_lengths: List[int] = []

    # Variables on an episode
    episode_return: float = 0.0
    episode_length: int = 0

    observation: np.ndarray = self.env.reset()

    for current_step in range(steps_per_epoch):
        observation_tensor: Tensor = torch.from_numpy(observation).float()
        observations.append(observation_tensor)
        observations_with_last_observations.append(observation_tensor)

        with torch.no_grad():
            policy_dist: Distribution = self.policy(observation_tensor)

        action: Tensor = policy_dist.sample()
        actions.append(action)

        reward: float
        episode_done: bool
        observation, reward, episode_done, _ = self.env.step(action.detach().numpy())

        episode_return += reward
        rewards.append(reward)

        episode_length += 1
        self.current_total_steps += 1
        epoch_ended: bool = current_step == steps_per_epoch - 1

        if episode_done or epoch_ended:
            if epoch_ended and not episode_done:
                logger.debug(
                    "The trajectory cut off at {} steps on the current episode".format(
                        episode_length
                    )
                )

            observations_with_last_observations.append(torch.from_numpy(observation).float())

            episode_returns.append(episode_return)
            episode_lengths.append(episode_length)

            if episode_done:
                self.current_total_episodes += 1

            observation = self.env.reset()
            episode_return, episode_length = 0.0, 0

    one_epoch_experience: OneEpochExperience = {
        "observations": torch.stack(observations),
        "actions": torch.stack(actions),
        "rewards": torch.tensor(rewards),
        "observations_with_last_observations": torch.stack(observations_with_last_observations),
        "episode_returns": episode_returns,
        "episode_lengths": episode_lengths,
    }

    return one_epoch_experience

Preprocessing before training

def train(self, one_epoch_experience: OneEpochExperience) -> None:
    observations: Tensor = one_epoch_experience["observations"]
    actions: Tensor = one_epoch_experience["actions"]
    rewards: Tensor = one_epoch_experience["rewards"]
    observations_with_last_observations: Tensor = one_epoch_experience["observations_with_last_observations"]
    episode_lengths: List[int] = one_epoch_experience["episode_lengths"]

    # Calculate rewards-to-go over each episode, to be targets for the value function
    current_index = 0
    discounted_returns = torch.empty(rewards.shape)
    for episode_length in episode_lengths:
        one_episode_rewards = rewards.narrow(0, current_index, episode_length)
        discounted_returns[current_index:current_index + episode_length] = discount_cumulative_sum(one_episode_rewards, self.gamma)
        current_index += episode_length

    with torch.no_grad():
        values = self.value_function(observations_with_last_observations).flatten()

    current_index = 0
    advantages: Tensor = torch.empty(rewards.shape)
    for n_episode, episode_length in enumerate(episode_lengths):
        one_episode_rewards = rewards.narrow(0, current_index, episode_length)
        one_episode_values = values.narrow(0, current_index + n_episode, episode_length + 1)
        advantages[current_index:current_index + episode_length] = gae(one_episode_rewards, self.gamma, one_episode_values, self.gae_lambda)
        current_index += episode_length
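
The `current_index + n_episode` offset above is the subtle part: `rewards` holds exactly `episode_length` entries per episode, but `values` holds one extra entry per episode because the last observation of every episode was appended for bootstrapping, so each episode owns `episode_length + 1` value estimates. A toy sketch of that slicing (not the repo's code):

import torch

episode_lengths = [3, 2]
# 3 + 1 value estimates for episode 0, 2 + 1 for episode 1
values = torch.arange(7, dtype=torch.float32)

current_index = 0
for n_episode, episode_length in enumerate(episode_lengths):
    one_episode_values = values.narrow(0, current_index + n_episode, episode_length + 1)
    print(one_episode_values.tolist())  # [0.0, 1.0, 2.0, 3.0] then [4.0, 5.0, 6.0]
    current_index += episode_length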
yamatokataoka commented 2 years ago

Implemented cumulative sum with decay factor in PyTorch

https://discuss.pytorch.org/t/cumulative-sum-with-decay-factor/69788

def discount_cumulative_sum(vector: Tensor, discount: float) -> Tensor:
    """
    Compute discounted cumulative sums of vector.

    :param vector: (Tensor) A target vector
    e.g. [x0,
          x1,
          x2]
    :param discount: (float) The discount factor for the cumulative return
    :return: (Tensor) Discounted cumulative sums of the vector
    e.g. [x0 + discount * x1 + discount^2 * x2,
          x1 + discount * x2,
          x2]
    """
    discounted_sequence = discount ** torch.arange(vector.size(0))
    discounted_cumulative_sums = torch.cumsum((discounted_sequence * vector).flip(0), dim=0).flip(0) / discounted_sequence
    return discounted_cumulative_sums
def gae(
    rewards: Tensor, gamma: float, values: Tensor, gae_lambda: float
) -> Tensor:
    """
    Compute Generalized Advantage Estimation (GAE)

    :param rewards: (Tensor) Rewards for all states
    :param gamma: (float) The discount factor for the cumulative return
    :param values: (Tensor) Values for all states
    :param gae_lambda: (float) A smoothing parameter for reducing the variance
    :return gaes: (Tensor) GAEs for all states
    """
    deltas: Tensor = rewards + gamma * values[1:] - values[:-1]
    gaes: Tensor = discount_cumulative_sum(deltas, gamma * gae_lambda)

    return gaes
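
A quick sanity check of the two helpers above (assuming discount_cumulative_sum and gae are in scope; expected outputs worked out by hand):

import torch

rewards = torch.tensor([1.0, 1.0, 1.0])
print(discount_cumulative_sum(rewards, 0.9))
# approximately tensor([2.7100, 1.9000, 1.0000]), i.e. [1 + 0.9 + 0.81, 1 + 0.9, 1]

# One value per state plus one for the appended last observation.
values = torch.tensor([0.5, 0.4, 0.3, 0.2])
print(gae(rewards, 0.9, values, 0.95))
# approximately tensor([2.2472, 1.6224, 0.8800])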
yamatokataoka commented 2 years ago

Implementations comparison

epochs: 200, steps per epoch: 4000, algorithm: VPG, env: CartPole, seed: 0, compute: Colab CPU

Each implementation was run three times. All runs share Epoch: 199, Total steps: 8e+05, and Max Episode Return: 200.

| Implementation | Run | Total episodes | Average Episode Return | Episode Return STD | Min Episode Return | Average Episode Length | Time | Policy Loss | Average Entropy | Log Prob STD | Value Function Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| current | 1 | 1.1e+04 | 186 | 25.4 | 107 | 182 | 487 | -0.00292 | 0.555 | 0.447 | 245 |
| current | 2 | 9.78e+03 | 193 | 20.5 | 130 | 190 | 486 | -0.0115 | 0.55 | 0.463 | 144 |
| current | 3 | 1.13e+04 | 194 | 20.8 | 106 | 190 | 479 | -0.00688 | 0.549 | 0.45 | 256 |
| experience per episode with numpy | 1 | 1.09e+04 | 190 | 24 | 121 | 190 | 404 | -0.00548 | 0.554 | 0.451 | 286 |
| experience per episode with numpy | 2 | 1.39e+04 | 190 | 19.8 | 132 | 190 | 401 | -0.0114 | 0.561 | 0.444 | 230 |
| experience per episode with numpy | 3 | 1.31e+04 | 190 | 23.9 | 111 | 190 | 406 | -0.00812 | 0.555 | 0.45 | 251 |
| experience per episode with Tensor | 1 | 1.03e+04 | 190 | 19.5 | 132 | 190 | 452 | -0.00352 | 0.549 | 0.454 | 225 |
| experience per episode with Tensor | 2 | 1.06e+04 | 182 | 38 | 67 | 182 | 450 | -0.0197 | 0.548 | 0.462 | 251 |
| experience per episode with Tensor | 3 | 1.15e+04 | 190 | 26.8 | 82 | 190 | 462 | -0.00983 | 0.551 | 0.459 | 260 |
| experience with Tensor | 1 | 1.29e+04 | 182 | 29.3 | 115 | 182 | 402 | -0.0218 | 0.556 | 0.445 | 189 |
| experience with Tensor | 2 | 1.21e+04 | 190 | 23.4 | 120 | 190 | 401 | -0.000805 | 0.549 | 0.451 | 197 |
| experience with Tensor | 3 | 1.25e+04 | 182 | 34.4 | 47 | 182 | 405 | -0.0313 | 0.552 | 0.455 | 239 |
yamatokataoka commented 2 years ago

All experiences separated by episode; collecting experience with lists and Tensors.

def collect_one_epoch_experience(self, steps_per_epoch: int) -> OneEpochExperience:
    one_epoch_experience: OneEpochExperience = {
        "observations": [],
        "actions": [],
        "rewards": [],
        "observations_with_last_observations": [],
        "episode_returns": [],
        "episode_lengths": [],
    }

    # Variables on an episode
    episode_observations: List[Tensor] = []
    episode_actions: List[Tensor] = []
    episode_rewards: List[float] = []
    episode_observations_with_last_observations: List[Tensor] = []
    episode_return: float = 0.0
    episode_length: int = 0

    observation: np.ndarray = self.env.reset()

    for current_step in range(steps_per_epoch):
        observation_tensor: Tensor = torch.from_numpy(observation).float()
        episode_observations.append(observation_tensor)
        episode_observations_with_last_observations.append(observation_tensor)

        with torch.no_grad():
            policy_dist: Distribution = self.policy(observation_tensor)

        action: Tensor = policy_dist.sample()
        episode_actions.append(action)

        reward: float
        episode_done: bool
        observation, reward, episode_done, _ = self.env.step(action.detach().numpy())

        episode_return += reward
        episode_rewards.append(reward)

        episode_length += 1
        self.current_total_steps += 1
        epoch_ended: bool = current_step == steps_per_epoch - 1

        if episode_done or epoch_ended:
            if epoch_ended and not episode_done:
                logger.debug(
                    "The trajectory cut off at {} steps on the current episode".format(
                        episode_length
                    )
                )

            episode_observations_with_last_observations.append(torch.from_numpy(observation).float())

            one_epoch_experience["observations"].append(
                episode_observations
            )
            one_epoch_experience["actions"].append(episode_actions)
            one_epoch_experience["rewards"].append(episode_rewards)
            one_epoch_experience["observations_with_last_observations"].append(
                episode_observations_with_last_observations
            )

            one_epoch_experience["episode_returns"].append(episode_return)
            one_epoch_experience["episode_lengths"].append(episode_length)

            if episode_done:
                self.current_total_episodes += 1

            observation = self.env.reset()
            episode_return, episode_length = 0.0, 0
            episode_observations, episode_actions, episode_rewards, episode_observations_with_last_observations = [], [], [], []

    return one_epoch_experience

Preprocessing before training

def train(self, one_epoch_experience: OneEpochExperience) -> None:
    observations_list: List[List[Tensor]] = one_epoch_experience["observations"]
    actions_list: List[List[Tensor]] = one_epoch_experience["actions"]
    rewards_list: List[List[float]] = one_epoch_experience["rewards"]
    observations_with_last_observations_list: List[List[Tensor]] = one_epoch_experience["observations_with_last_observations"]

    # Calculate rewards-to-go over each episode, to be targets for the value function
    discounted_returns: Tensor = torch.cat([
        discount_cumulative_sum(torch.tensor(one_episode_rewards), self.gamma)
        for one_episode_rewards in rewards_list
    ]).float()

    # Calculate advantages
    observations = torch.cat([torch.stack(episode_observations) for episode_observations in observations_list]).float()
    actions = torch.cat([torch.stack(episode_actions) for episode_actions in actions_list]).float()

    values_tensor_list: List[Tensor] = []
    with torch.no_grad():
        for observations_with_last_observation in observations_with_last_observations_list:
            observations_with_last_observation = torch.stack(observations_with_last_observation).float()
            values_tensor_list.append(self.value_function(observations_with_last_observation).flatten())

    advantages: Tensor = torch.cat([
        gae(torch.tensor(one_episode_rewards), self.gamma, one_episode_values, self.gae_lambda)
        for one_episode_rewards, one_episode_values in zip(rewards_list, values_tensor_list)
    ]).float()
yamatokataoka commented 2 years ago

Todos

yamatokataoka commented 2 years ago

cProfile of the current implementation

test inst-nodeps: /content/tox/.tmp/package/1/rl-replicas-0.0.3.tar.gz
test installed: absl-py==1.0.0,attrs==21.4.0,cachetools==4.2.4,certifi==2021.10.8,charset-normalizer==2.0.12,cloudpickle==2.0.0,dataclasses==0.8,google-auth==2.6.0,google-auth-oauthlib==0.4.6,grpcio==1.44.0,gym==0.21.0,idna==3.3,importlib-metadata==4.8.3,iniconfig==1.1.1,Markdown==3.3.6,numpy==1.19.5,oauthlib==3.2.0,packaging==21.3,pluggy==1.0.0,protobuf==3.19.4,py==1.11.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pyparsing==3.0.7,pytest==7.0.1,requests==2.27.1,requests-oauthlib==1.3.1,rl-replicas @ file:///content/tox/.tmp/package/1/rl-replicas-0.0.3.tar.gz,rsa==4.8,scipy==1.5.4,six==1.16.0,tensorboard==2.8.0,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tomli==1.2.3,torch==1.10.2,typing_extensions==4.1.1,urllib3==1.26.8,Werkzeug==2.0.3,zipp==3.6.0
test run-test-pre: PYTHONHASHSEED='2722520008'
test run-test: commands[0] | python -m cProfile -s cumtime tests/integration_tests/test_vpg.py
         48351144 function calls (45091207 primitive calls) in 146.650 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1854/1    0.074    0.000  146.654  146.654 {built-in method builtins.exec}
        1    0.000    0.000  146.654  146.654 test_vpg.py:1(<module>)
        1    0.000    0.000  144.559  144.559 test_vpg.py:19(test_vpg_with_cartpole)
        1    0.645    0.645  144.551  144.551 on_policy_algorithm.py:75(learn)
       50    4.189    0.084  118.579    2.372 on_policy_algorithm.py:169(collect_one_epoch_experience)
3637350/404150    5.969    0.000   82.178    0.000 module.py:1096(_call_impl)
   200050    0.856    0.000   50.446    0.000 categorical_policy.py:19(forward)
   404150    0.706    0.000   50.138    0.000 mlp.py:33(forward)
   404150    3.789    0.000   48.324    0.000 container.py:139(forward)
   204100    0.498    0.000   30.005    0.000 value_function.py:19(forward)
  1212450    2.677    0.000   28.252    0.000 linear.py:102(forward)
   200050    3.029    0.000   27.628    0.000 categorical.py:49(__init__)
   200000    0.596    0.000   26.580    0.000 time_limit.py:14(step)
   200000    5.154    0.000   25.984    0.000 cartpole.py:103(step)
       50    0.161    0.003   25.119    0.502 vpg.py:53(train)
  1212450    1.321    0.000   24.676    0.000 functional.py:1832(linear)
  1212450   22.969    0.000   22.969    0.000 {built-in method torch._C._nn.linear}
   200000    0.979    0.000   18.281    0.000 arrayprint.py:1365(_array_repr_implementation)
   200000    0.926    0.000   16.721    0.000 arrayprint.py:516(array2string)
   200050    2.856    0.000   14.809    0.000 distribution.py:34(__init__)
   200015    0.797    0.000   13.942    0.000 arrayprint.py:461(wrapper)
   200000    1.547    0.000   12.873    0.000 arrayprint.py:478(_array2string)
     4050    0.035    0.000   12.048    0.003 vpg.py:115(compute_value_loss)
   808300    0.498    0.000   10.882    0.000 activation.py:348(forward)
   808300   10.383    0.000   10.383    0.000 {built-in method tanh}
     4050    0.030    0.000    9.943    0.002 _tensor.py:251(backward)
   200000    1.117    0.000    9.914    0.000 arrayprint.py:409(_get_format_function)
     4050    0.061    0.000    9.913    0.002 __init__.py:69(backward)
     4050    9.731    0.002    9.731    0.002 {method 'run_backward' of 'torch._C._EngineBase' objects}
   200050    9.351    0.000    9.351    0.000 {method 'logsumexp' of 'torch._C._TensorBase' objects}
   400100    2.449    0.000    8.883    0.000 constraints.py:201(check)
   200000    1.706    0.000    8.855    0.000 categorical.py:108(sample)
   200000    0.242    0.000    7.987    0.000 arrayprint.py:365(<lambda>)
   200000    1.753    0.000    7.745    0.000 arrayprint.py:1124(__init__)
   419371    0.410    0.000    5.264    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
   400100    1.042    0.000    4.155    0.000 utils.py:106(__get__)
   200050    0.298    0.000    4.006    0.000 <__array_function__ internals>:2(amax)
   406436    1.572    0.000    3.966    0.000 fromnumeric.py:70(_wrapreduction)
   800250    3.965    0.000    3.965    0.000 {method 'all' of 'torch._C._TensorBase' objects}
   200050    0.449    0.000    3.406    0.000 fromnumeric.py:2589(amax)
   800100    3.321    0.000    3.321    0.000 {method 'reshape' of 'torch._C._TensorBase' objects}
     4050    0.093    0.000    2.226    0.001 optimizer.py:83(wrapper)
   200050    0.149    0.000    2.135    0.000 categorical.py:92(probs)
   1411/9    0.012    0.000    2.098    0.233 <frozen importlib._bootstrap>:966(_find_and_load)
   1411/9    0.009    0.000    2.097    0.233 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
   1360/8    0.009    0.000    2.092    0.262 <frozen importlib._bootstrap>:651(_load_unlocked)
   1234/8    0.005    0.000    2.092    0.262 <frozen importlib._bootstrap_external>:672(exec_module)
   1857/9    0.002    0.000    2.092    0.232 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   406739    2.064    0.000    2.064    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      115    0.003    0.000    2.056    0.018 __init__.py:1(<module>)
   200050    0.237    0.000    1.986    0.000 utils.py:65(logits_to_probs)
     4050    0.043    0.000    1.903    0.000 grad_mode.py:25(decorate_context)
  3233348    1.859    0.000    1.860    0.000 module.py:1164(__getattr__)
   400100    1.852    0.000    1.852    0.000 constraints.py:300(check)
  3637350    1.782    0.000    1.782    0.000 {built-in method torch._C._get_tracing_state}
     4050    0.164    0.000    1.764    0.000 adam.py:81(step)
   200050    0.293    0.000    1.749    0.000 functional.py:1650(softmax)
   432276    1.674    0.000    1.674    0.000 {built-in method numpy.array}
   200000    1.639    0.000    1.639    0.000 {built-in method multinomial}
   200000    0.723    0.000    1.517    0.000 arrayprint.py:60(_make_options_dict)
     4050    0.438    0.000    1.502    0.000 _functional.py:54(adam)
   200050    0.173    0.000    1.475    0.000 <__array_function__ internals>:2(amin)
   200050    1.417    0.000    1.417    0.000 {method 'softmax' of 'torch._C._TensorBase' objects}
   687/94    0.002    0.000    1.406    0.015 {built-in method builtins.__import__}
        9    0.000    0.000    1.369    0.152 __init__.py:3(<module>)
    12672    0.089    0.000    1.296    0.000 utils.py:10(discount_cumulative_sum)
     6336    0.110    0.000    1.264    0.000 utils.py:38(gae)
    12672    0.067    0.000    1.207    0.000 signaltools.py:1719(lfilter)
5573/1912    0.007    0.000    1.159    0.001 <frozen importlib._bootstrap>:997(_handle_fromlist)
   400750    1.124    0.000    1.124    0.000 {method 'detach' of 'torch._C._TensorBase' objects}
   200050    0.298    0.000    1.121    0.000 fromnumeric.py:2714(amin)
   658222    0.415    0.000    1.107    0.000 {built-in method builtins.getattr}
   200000    0.343    0.000    1.065    0.000 arrayprint.py:709(_formatArray)
    12672    0.983    0.000    0.983    0.000 {built-in method scipy.signal.sigtools._linear_filter}
   200000    0.768    0.000    0.913    0.000 discrete.py:22(contains)
   200150    0.889    0.000    0.889    0.000 {built-in method from_numpy}
        3    0.002    0.001    0.887    0.296 __init__.py:10(<module>)
   204162    0.340    0.000    0.768    0.000 grad_mode.py:128(__exit__)
   204162    0.351    0.000    0.759    0.000 grad_mode.py:124(__enter__)
   404150    0.461    0.000    0.758    0.000 container.py:131(__iter__)
   408324    0.477    0.000    0.743    0.000 grad_mode.py:213(__init__)
   204191    0.679    0.000    0.737    0.000 grad_mode.py:119(__init__)
   200000    0.454    0.000    0.722    0.000 arrayprint.py:718(recurser)
   200000    0.656    0.000    0.656    0.000 arrayprint.py:358(_get_formatdict)
        1    0.000    0.000    0.647    0.647 ddpg.py:1(<module>)
        1    0.000    0.000    0.644    0.644 off_policy_algorithm.py:1(<module>)
      100    0.637    0.006    0.637    0.006 {built-in method stack}
1510786/1510580    0.584    0.000    0.587    0.000 {built-in method builtins.isinstance}
   200000    0.581    0.000    0.581    0.000 arrayprint.py:1304(dtype_is_implied)
   200032    0.501    0.000    0.501    0.000 {built-in method builtins.locals}
1360/1342    0.004    0.000    0.491    0.000 <frozen importlib._bootstrap>:564(module_from_spec)
yamatokataoka commented 2 years ago

cProfile of the new implementation: experiences separated by episode, collecting experience with lists and NumPy.

test inst-nodeps: /content/tox/.tmp/package/1/rl-replicas-0.0.3.tar.gz
test installed: absl-py==1.0.0,attrs==21.4.0,cachetools==4.2.4,certifi==2021.10.8,charset-normalizer==2.0.12,cloudpickle==2.0.0,dataclasses==0.8,google-auth==2.6.0,google-auth-oauthlib==0.4.6,grpcio==1.44.0,gym==0.21.0,idna==3.3,importlib-metadata==4.8.3,iniconfig==1.1.1,Markdown==3.3.6,numpy==1.19.5,oauthlib==3.2.0,packaging==21.3,pluggy==1.0.0,protobuf==3.19.4,py==1.11.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pyparsing==3.0.7,pytest==7.0.1,requests==2.27.1,requests-oauthlib==1.3.1,rl-replicas @ file:///content/tox/.tmp/package/1/rl-replicas-0.0.3.tar.gz,rsa==4.8,scipy==1.5.4,six==1.16.0,tensorboard==2.8.0,tensorboard-data-server==0.6.1,tensorboard-plugin-wit==1.8.1,tomli==1.2.3,torch==1.10.2,typing_extensions==4.1.1,urllib3==1.26.8,Werkzeug==2.0.3,zipp==3.6.0
test run-test-pre: PYTHONHASHSEED='2987822608'
test run-test: commands[0] | python -m cProfile -s cumtime tests/integration_tests/test_vpg.py
         38901310 function calls (37186072 primitive calls) in 125.180 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1854/1    0.071    0.000  125.184  125.184 {built-in method builtins.exec}
        1    0.001    0.001  125.184  125.184 test_vpg.py:1(<module>)
        1    0.008    0.008  122.998  122.998 test_vpg.py:19(test_vpg_with_cartpole)
        1    0.463    0.463  122.982  122.982 on_policy_algorithm.py:75(learn)
       50    3.676    0.074   96.028    1.921 on_policy_algorithm.py:169(collect_one_epoch_experience)
1892601/210289    3.317    0.000   64.557    0.000 module.py:1096(_call_impl)
   200050    0.878    0.000   51.720    0.000 categorical_policy.py:19(forward)
   210289    0.404    0.000   34.361    0.000 mlp.py:33(forward)
   210289    2.514    0.000   33.300    0.000 container.py:139(forward)
   200050    2.908    0.000   27.452    0.000 categorical.py:49(__init__)
       50    0.214    0.004   26.157    0.523 vpg.py:58(train)
   200000    0.602    0.000   25.967    0.000 time_limit.py:14(step)
   200000    4.956    0.000   25.365    0.000 cartpole.py:103(step)
   630867    1.522    0.000   18.773    0.000 linear.py:102(forward)
   200000    0.946    0.000   17.855    0.000 arrayprint.py:1365(_array_repr_implementation)
   630867    0.736    0.000   16.720    0.000 functional.py:1832(linear)
   200000    0.944    0.000   16.376    0.000 arrayprint.py:516(array2string)
   630867   15.751    0.000   15.751    0.000 {built-in method torch._C._nn.linear}
   200050    2.794    0.000   14.458    0.000 distribution.py:34(__init__)
   200015    0.788    0.000   13.585    0.000 arrayprint.py:461(wrapper)
   200000    1.526    0.000   12.521    0.000 arrayprint.py:478(_array2string)
    10239    0.036    0.000   11.820    0.001 value_function.py:19(forward)
     4050    0.035    0.000   11.761    0.003 vpg.py:140(compute_value_loss)
   200050    9.654    0.000    9.654    0.000 {method 'logsumexp' of 'torch._C._TensorBase' objects}
   200000    1.109    0.000    9.581    0.000 arrayprint.py:409(_get_format_function)
     4050    0.030    0.000    9.405    0.002 _tensor.py:251(backward)
     4050    0.059    0.000    9.374    0.002 __init__.py:69(backward)
     4050    9.198    0.002    9.198    0.002 {method 'run_backward' of 'torch._C._EngineBase' objects}
   420578    0.297    0.000    9.014    0.000 activation.py:348(forward)
   420578    8.717    0.000    8.717    0.000 {built-in method tanh}
   400100    2.238    0.000    8.685    0.000 constraints.py:201(check)
   200000    1.637    0.000    8.313    0.000 categorical.py:108(sample)
   200000    0.246    0.000    7.645    0.000 arrayprint.py:365(<lambda>)
   200000    1.711    0.000    7.399    0.000 arrayprint.py:1124(__init__)
425319/419130    0.758    0.000    5.359    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
   400100    0.940    0.000    3.960    0.000 utils.py:106(__get__)
   800250    3.944    0.000    3.944    0.000 {method 'all' of 'torch._C._TensorBase' objects}
   200050    0.296    0.000    3.704    0.000 <__array_function__ internals>:2(amax)
   400100    1.179    0.000    3.520    0.000 fromnumeric.py:70(_wrapreduction)
   800100    3.126    0.000    3.126    0.000 {method 'reshape' of 'torch._C._TensorBase' objects}
   200050    0.443    0.000    3.107    0.000 fromnumeric.py:2589(amax)
     4050    0.095    0.000    2.317    0.001 optimizer.py:83(wrapper)
   1411/9    0.011    0.000    2.187    0.243 <frozen importlib._bootstrap>:966(_find_and_load)
   1411/9    0.013    0.000    2.187    0.243 <frozen importlib._bootstrap>:936(_find_and_load_unlocked)
   1360/8    0.013    0.000    2.182    0.273 <frozen importlib._bootstrap>:651(_load_unlocked)
   1234/8    0.004    0.000    2.182    0.273 <frozen importlib._bootstrap_external>:672(exec_module)
   1857/9    0.002    0.000    2.181    0.242 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
   200050    0.140    0.000    2.074    0.000 categorical.py:92(probs)
   400403    2.016    0.000    2.016    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     4050    0.043    0.000    1.985    0.000 grad_mode.py:25(decorate_context)
   200050    0.259    0.000    1.935    0.000 utils.py:65(logits_to_probs)
   400100    1.891    0.000    1.891    0.000 constraints.py:300(check)
     4050    0.169    0.000    1.838    0.000 adam.py:81(step)
      115    0.003    0.000    1.817    0.016 __init__.py:1(<module>)
   625352    1.677    0.000    1.677    0.000 {built-in method numpy.array}
   200050    0.274    0.000    1.676    0.000 functional.py:1650(softmax)
   200000    1.626    0.000    1.626    0.000 {built-in method multinomial}
     4050    0.463    0.000    1.574    0.000 _functional.py:54(adam)
   200000    0.703    0.000    1.515    0.000 arrayprint.py:60(_make_options_dict)
   200050    0.180    0.000    1.483    0.000 <__array_function__ internals>:2(amin)
        9    0.000    0.000    1.431    0.159 __init__.py:3(<module>)
   687/94    0.002    0.000    1.389    0.015 {built-in method builtins.__import__}
   200050    1.358    0.000    1.358    0.000 {method 'softmax' of 'torch._C._TensorBase' objects}
5574/1913    0.006    0.000    1.163    0.001 <frozen importlib._bootstrap>:997(_handle_fromlist)
  1682460    1.126    0.000    1.126    0.000 module.py:1164(__getattr__)
   200050    0.262    0.000    1.117    0.000 fromnumeric.py:2714(amin)
  1892601    1.073    0.000    1.073    0.000 {built-in method torch._C._get_tracing_state}
   200000    0.350    0.000    1.066    0.000 arrayprint.py:709(_formatArray)
   651886    0.391    0.000    1.046    0.000 {built-in method builtins.getattr}
        3    0.002    0.001    1.011    0.337 __init__.py:10(<module>)
   200000    0.771    0.000    0.907    0.000 discrete.py:22(contains)
   206389    0.865    0.000    0.865    0.000 {built-in method from_numpy}
   204162    0.418    0.000    0.826    0.000 grad_mode.py:124(__enter__)
   204191    0.691    0.000    0.757    0.000 grad_mode.py:119(__init__)
   408324    0.493    0.000    0.734    0.000 grad_mode.py:213(__init__)
   204162    0.322    0.000    0.733    0.000 grad_mode.py:128(__exit__)
   200700    0.726    0.000    0.726    0.000 {method 'detach' of 'torch._C._TensorBase' objects}
   200000    0.441    0.000    0.716    0.000 arrayprint.py:718(recurser)
  1083524    0.698    0.000    0.698    0.000 {method 'append' of 'list' objects}
   200000    0.681    0.000    0.681    0.000 arrayprint.py:358(_get_formatdict)
        1    0.000    0.000    0.678    0.678 ddpg.py:1(<module>)
        1    0.000    0.000    0.676    0.676 off_policy_algorithm.py:1(<module>)
1360/1342    0.004    0.000    0.569    0.000 <frozen importlib._bootstrap>:564(module_from_spec)
   200000    0.534    0.000    0.534    0.000 arrayprint.py:1304(dtype_is_implied)
   200032    0.525    0.000    0.525    0.000 {built-in method builtins.locals}
1504450/1504244    0.494    0.000    0.497    0.000 {built-in method builtins.isinstance}
       11    0.000    0.000    0.492    0.045 utils.py:1(<module>)
     4050    0.056    0.000    0.464    0.000 functional.py:3084(mse_loss)
     6189    0.007    0.000    0.463    0.000 <__array_function__ internals>:2(stack)
        1    0.000    0.000    0.458    0.458 __init__.py:288(<module>)
   206389    0.458    0.000    0.458    0.000 {method 'float' of 'torch._C._TensorBase' objects}
yamatokataoka commented 2 years ago

Overall, comparing the two implementations above with the same parameters, the total time drops by about 20 seconds (from 146 to 125) in the new one.

# current
     1    0.000    0.000  146.654  146.654 test_vpg.py:1(<module>)
---
# new
     1    0.001    0.001  125.184  125.184 test_vpg.py:1(<module>)

While the number of calls to the policy's forward function and its execution time stay the same, both figures for the value function drop: from 204100 calls to 10239, and from about 30 seconds to about 11.8 seconds of cumulative time.

# current
200050    0.856    0.000   50.446    0.000 categorical_policy.py:19(forward)
204100    0.498    0.000   30.005    0.000 value_function.py:19(forward)
---
# new
200050    0.878    0.000   51.720    0.000 categorical_policy.py:19(forward)
 10239    0.036    0.000   11.820    0.001 value_function.py:19(forward)

So the new implementation is significantly faster because it calls the value function far less often.

Parameters: epochs: 50, steps per epoch: 4000, algorithm: VPG, env: CartPole, seed: 0, compute: Colab CPU
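
To make the source of the saving concrete: the profiles suggest the current path evaluates the value function once per step, while the new path evaluates it once per finished episode on a stacked batch. A rough sketch with a dummy value function (the nn.Linear stand-in and the sizes are assumptions, not the repo's model):

import torch
import torch.nn as nn

value_function = nn.Linear(4, 1)            # stand-in for the MLP value function
episode_observations = torch.randn(200, 4)  # one CartPole-length episode (assumed)

with torch.no_grad():
    # Per-step calls: one forward pass per observation (the old pattern).
    values_per_step = torch.cat(
        [value_function(obs.unsqueeze(0)) for obs in episode_observations]
    ).flatten()

    # Per-episode call: one batched forward pass over the stacked episode.
    values_batched = value_function(episode_observations).flatten()

assert torch.allclose(values_per_step, values_batched)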

yamatokataoka commented 2 years ago

In the performance comparison above, the current implementation is about 50 seconds slower than the others. Based on the profiling analysis, I presume the bottleneck is the large number of value function calls.

yamatokataoka commented 2 years ago

I will choose the implementation that collects experiences separated by episode with NumPy, for the following reasons:

yamatokataoka commented 2 years ago

Todos

yamatokataoka commented 2 years ago

I found that my new implementation doesn't correctly handle the last reward at the end of the epoch.

Fixing this increases the number of value function calls, so the total execution time increased.
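
For reference, a minimal sketch of the usual fix for the cut-off case, as done in many VPG implementations (hypothetical names and a dummy value network; not necessarily what was merged here): when the epoch ends mid-episode, the value estimate of the last observation is appended as a bootstrap "last reward" before computing rewards-to-go and GAE, and 0.0 is appended when the episode actually terminated.

import torch
import torch.nn as nn

value_function = nn.Linear(4, 1)   # dummy stand-in for the value network
episode_rewards = [1.0, 1.0, 1.0]  # toy episode cut off by the epoch end
last_observation = torch.randn(4)  # observation the epoch stopped on
episode_done = False               # epoch ended, episode did not

if episode_done:
    last_value = 0.0               # true terminal state: nothing to bootstrap
else:
    with torch.no_grad():
        last_value = value_function(last_observation).item()

# Rewards-to-go and GAE are then computed over this extended reward sequence.
episode_rewards_with_last_value = episode_rewards + [last_value]
print(episode_rewards_with_last_value)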

Implementations comparison

epochs: 200, steps per epoch: 4000, algorithm: VPG, env: CartPole, seed: 0, compute: Colab CPU

Each implementation was run three times. All runs share Epoch: 199, Total steps: 8e+05, and Max Episode Return: 200.

| Implementation | Run | Total episodes | Average Episode Return | Episode Return STD | Min Episode Return | Average Episode Length | Time | Policy Loss | Average Entropy | Log Prob STD | Value Function Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| current | 1 | 1.1e+04 | 186 | 25.4 | 107 | 182 | 487 | -0.00292 | 0.555 | 0.447 | 245 |
| current | 2 | 9.78e+03 | 193 | 20.5 | 130 | 190 | 486 | -0.0115 | 0.55 | 0.463 | 144 |
| current | 3 | 1.13e+04 | 194 | 20.8 | 106 | 190 | 479 | -0.00688 | 0.549 | 0.45 | 256 |
| experience per episode with numpy (last value) | 1 | 1.39e+04 | 182 | 31.1 | 103 | 182 | 467 | -0.0209 | 0.564 | 0.433 | 286 |
| experience per episode with numpy (last value) | 2 | 1.28e+04 | 190 | 27.1 | 103 | 190 | 477 | -0.0178 | 0.556 | 0.443 | 222 |
| experience per episode with numpy (last value) | 3 | 1.23e+04 | 182 | 38.7 | 56 | 182 | 476 | -0.0177 | 0.558 | 0.442 | 166 |
yamatokataoka commented 2 years ago

Todos

yamatokataoka commented 2 years ago

Performance comparison with the current version

epochs: 200, steps per epoch: 4000, algorithm: VPG, env: CartPole, seed: 0, compute: Colab CPU

Total execution time improves because the number of value function calls made after each trajectory ends is reduced.

Each implementation was run three times. All runs share Epoch: 199, Total steps: 8e+05, and Max Episode Return: 200.

| Implementation | Run | Total episodes | Average Episode Return | Episode Return STD | Min Episode Return | Average Episode Length | Time | Policy Loss | Average Entropy | Log Prob STD | Value Function Loss |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| current | 1 | 1.39e+04 | 182 | 31.1 | 103 | 182 | 467 | -0.0209 | 0.564 | 0.433 | 286 |
| current | 2 | 1.28e+04 | 190 | 27.1 | 103 | 190 | 477 | -0.0178 | 0.556 | 0.443 | 222 |
| current | 3 | 1.23e+04 | 182 | 38.7 | 56 | 182 | 476 | -0.0177 | 0.558 | 0.442 | 166 |
| last reward bootstrapping on each model | 1 | 1.13e+04 | 190 | 34.7 | 37 | 190 | 416 | -0.0228 | 0.554 | 0.46 | 360 |
| last reward bootstrapping on each model | 2 | 1.08e+04 | 182 | 32.6 | 92 | 182 | 416 | -0.0115 | 0.542 | 0.466 | 219 |
| last reward bootstrapping on each model | 3 | 1.12e+04 | 182 | 37.2 | 58 | 182 | 418 | -0.00876 | 0.552 | 0.451 | 277 |
yamatokataoka commented 2 years ago

Todos