
A Bug in HERReplayBuffer. #811

Closed
sunkafei opened this issue Feb 27, 2023 · 8 comments · Fixed by #817

Labels
bug Something isn't working

Comments

@sunkafei
Contributor

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
    import tianshou, gym, torch, numpy, sys
    print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)

Hello, I have some questions about the code of rewrite_transitions in HERReplayBuffer:

    def rewrite_transitions(self, indices: np.ndarray) -> None:
        """Re-write the goal of some sampled transitions' episodes according to HER.
        Currently applies only HER's 'future' strategy. The new goals will be written \
        directly to the internal batch data temporarily and will be restored right \
        before the next sampling or when using some of the buffer's method (e.g. \
        `add`, `save_hdf5`, etc.). This is to make sure that n-step returns \
        calculation etc., performs correctly without additional alteration.
        """
        if indices.size == 0:
            return

        # Sort indices keeping chronological order
        indices[indices < self._index] += self.maxsize
        indices = np.sort(indices)
        indices[indices >= self.maxsize] -= self.maxsize

        # Construct episode trajectories
        indices = [indices]
        for _ in range(self.horizon - 1):
            indices.append(self.next(indices[-1]))
        indices = np.stack(indices)

        # Calculate future timestep to use
        current = indices[0]
        terminal = indices[-1]
        future_offset = np.random.uniform(size=len(indices[0])) * (terminal - current)
        future_offset = future_offset.astype(int)
        future_t = (current + future_offset)
        ...

As ReplayBuffer is implemented as a circular queue, the indices in terminal may be smaller than the corresponding indices in current. As a result, some elements of future_offset may be negative, which makes the states indexed by future_t not the desired future states.
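
To make the wrap-around concrete, here is a minimal standalone numpy sketch (not tianshou code; the buffer size and indices are assumed for illustration):

    import numpy as np

    maxsize = 15  # assumed buffer size

    # Suppose the sampled transition sits at index 10 and its episode wraps
    # around the circular buffer, so its last transition sits at index 4.
    current = np.array([10])
    terminal = np.array([4])

    # The original computation draws an offset proportional to
    # terminal - current, which is -6 here, so the offset can be negative.
    future_offset = (np.random.uniform(size=1) * (terminal - current)).astype(int)
    future_t = current + future_offset
    print(future_offset, future_t)  # e.g. [-4] [6]: index 6 is from an older episode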
@Juno-T

@Juno-T
Contributor

Juno-T commented Feb 27, 2023

Thanks for the question. I think this case was covered by the first line, indices[indices < self._index] += self.maxsize, where the indices that wrap around the circular queue get incremented by self.maxsize, making them chronologically ordered again.

However, if you have some code illustrating the bug, please feel free to share it so that we can investigate further.
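
For illustration, the re-ordering trick described above can be sketched with plain numpy (the maxsize and _index values are assumed):

    import numpy as np

    maxsize, _index = 15, 5  # assumed; _index is where the next write would go

    indices = np.array([12, 2, 7, 0])  # 2 and 0 wrapped past the end of the buffer
    indices[indices < _index] += maxsize    # unwrap:        [12, 17,  7, 15]
    indices = np.sort(indices)              # chronological: [ 7, 12, 15, 17]
    indices[indices >= maxsize] -= maxsize  # map back:      [ 7, 12,  0,  2]
    print(indices)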

@Juno-T
Contributor

Juno-T commented Feb 27, 2023

For reference, there is a test case covering cycled indices for HERReplayBuffer in test_buffer.py.

@sunkafei
Contributor Author

@Juno-T Thanks for your reply. I found that the variable future_t is used in the following two places:

        if self._save_obs_next:
            ep_obs_next = self[unique_ep_indices].obs_next
            future_obs = self[future_t[unique_ep_close_indices]].obs_next
        else:
            future_obs = self[self.next(future_t[unique_ep_close_indices])].obs

Since unique_ep_close_indices represents the last index of every episode, why not replace future_t with terminal and remove the calculation of future_t from the code?

@Juno-T
Contributor

Juno-T commented Feb 28, 2023

According to the paper, they found that instead of rewriting the goal with the terminal state's achieved goal, it is equally good or better to use a random future state's achieved goal (see Fig. 6).

If you look at future_t's calculation, you can see that it contains the indices of a random future state, varying by episode.

I think we should label this issue as question, @Trinkle23897.
Edit: it is indeed a bug.
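
For context, the 'future' strategy can be sketched abstractly (the episode layout below is hypothetical, not tianshou internals):

    import numpy as np

    # Hypothetical episode of achieved goals: relabel step t with the
    # achieved goal of a random step at or after t ("future" strategy).
    achieved_goals = np.arange(10)  # one achieved goal per timestep
    t = 3
    future = np.random.randint(t, len(achieved_goals))  # uniform over [t, 10)
    print(f"relabel step {t} with the achieved goal of step {future}: "
          f"{achieved_goals[future]}")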

@sunkafei
Contributor Author

@Juno-T I think I understand what you mean, but the calculation of future_t may be wrong. Here is an example to reproduce the error.

import numpy as np
from tianshou.data import Batch
from tianshou.data import HERReplayBuffer
from env import MyGoalEnv  # from test.base.env import MyGoalEnv
def test_herreplaybuffer():
    env_size = 99
    bufsize = 15
    env = MyGoalEnv(env_size, array_state=False)
    def compute_reward_fn(ag, g):
        return env.compute_reward_fn(ag, g, {})
    buf = HERReplayBuffer(bufsize, compute_reward_fn=compute_reward_fn, horizon=30, future_k=8)
    buf.future_p = 1
    
    for x, ep_len in enumerate([10, 20]):
        obs, _ = env.reset()
        for i in range(ep_len):
            act = 1
            obs_next, rew, terminated, truncated, info = env.step(act)
            batch = Batch(
                obs=obs,
                act=[act],
                rew=rew,
                terminated=(i == ep_len - 1),
                truncated=(i == ep_len - 1),
                obs_next=obs_next,
                info=info
            )
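            # Skip the first 10 transitions of the second episode, so the
            # stored states of episode 2 are numbered 10..19.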
            if x == 1 and obs["observation"] < 10:
                obs = obs_next
                continue
            buf.add(batch)
            obs = obs_next
    # The above code generates two episodes: states numbered 0 to 9 for the
    # first episode, and states numbered 10 to 19 for the second episode.
    buf._restore_cache()
    sample_indices = np.array([10])  # Suppose the sampled indices is [10]
    buf.rewrite_transitions(sample_indices)
    print(buf.obs)
    # We can see that the desired_goal for 10 may be 6, 7, 8, 9, 10, or 11,
    # but 6-9 are states in episode 1.

if __name__ == '__main__':
    test_herreplaybuffer()

In the above code, the desired_goal may come from other episodes!

@sunkafei
Contributor Author

sunkafei commented Feb 28, 2023

@Juno-T If we change the code to sample_indices = np.array([9]), the desired_goal is 10.

Here is an explanation for my code:

  • The states for the first episode are in [0, 9]; the states for the second episode are in [10, 19]
  • The bufsize is 15, so the initial observations are [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  • After the first episode, the observations are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0]
  • As the second episode starts from 10, the observations after the second episode are [15, 16, 17, 18, 19, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
  • So the 11th observation is 10, which is the first state of the second episode (the circular-array sketch below reproduces this layout).
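
A plain circular array, independent of tianshou, reproduces the same layout:

    import numpy as np

    bufsize = 15
    buf = np.zeros(bufsize, dtype=int)
    idx = 0

    def add(state):
        global idx
        buf[idx] = state
        idx = (idx + 1) % bufsize

    for s in range(10):      # first episode: states 0..9
        add(s)
    for s in range(10, 20):  # second episode: states 10..19, wrapping around
        add(s)

    print(buf)  # [15 16 17 18 19  5  6  7  8  9 10 11 12 13 14]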

@Juno-T
Contributor

Juno-T commented Feb 28, 2023

Thank you for the clarification. I have confirmed the bug locally, and it is indeed a bug in the future_t calculation. The terminal indices can be lower than the current indices. This makes future_offset negative, so future_t takes timesteps from previous episodes.

I think this should fix it:

def rewrite_transitions(self, indices: np.ndarray) -> None:
        ...
        # Calculate future timestep to use
        current = indices[0]
        terminal = indices[-1]
        episodes_len = (terminal - current + self.maxsize) % self.maxsize
        future_offset = np.random.uniform(size=len(indices[0])) * episodes_len
        future_offset = future_offset.astype(int)
        future_t = (current + future_offset) % self.maxsize
        ...
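
A quick numeric check of the proposed arithmetic, reusing the wrapped example from above (maxsize is assumed to be 15):

    import numpy as np

    maxsize = 15  # assumed
    current = np.array([10])
    terminal = np.array([4])  # the episode wraps past the end of the buffer

    episodes_len = (terminal - current + maxsize) % maxsize  # [9], true distance
    future_offset = (np.random.uniform(size=1) * episodes_len).astype(int)  # 0..8
    future_t = (current + future_offset) % maxsize  # in {10..14, 0..3}
    print(episodes_len, future_offset, future_t)  # stays inside the same episode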

Would you like to create a pull request to fix this bug? You can add your test case to test_buffer.py and also verify some edge cases.

@sunkafei
Contributor Author

@Juno-T Thanks for your reply. I will make a pull request when I'm free.

@Trinkle23897 Trinkle23897 added the bug Something isn't working label Feb 28, 2023
sunkafei added a commit to sunkafei/tianshou that referenced this issue Feb 28, 2023
sunkafei added a commit to sunkafei/tianshou that referenced this issue Mar 1, 2023
@Trinkle23897 Trinkle23897 linked a pull request Mar 4, 2023 that will close this issue
BFAnas pushed a commit to BFAnas/tianshou that referenced this issue May 5, 2024