This repo is part of a project to replace ACT-R's utility module with algorithms from classic RL theory. The code is not yet usable and still needs to be implemented; the plan is roughly outlined here.
In ACT-R, action selection is based on production utility. The utility U(p) of a production p is currently updated as:
... U(p) = U(p) + α[R_t - U(p)]
Where R_t = R - t(p), and t(p) is the time elapsed since the firing of p.
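As a point of reference, here is a minimal Python sketch of this update (the function and parameter names are illustrative, not ACT-R's actual code):

```python
def update_utility(u_p, reward, elapsed, alpha=0.2):
    """One ACT-R-style utility update for a production p.

    u_p     -- current utility U(p) of the production
    reward  -- reward value R
    elapsed -- time t(p), in seconds, since the production fired
    alpha   -- learning rate
    """
    r_t = reward - elapsed            # reward discounted linearly with time
    return u_p + alpha * (r_t - u_p)  # U(p) <- U(p) + α[R_t - U(p)]
```

For instance, a reward R = 10 that arrives 4 seconds after the firing enters the update as R_t = 6.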
The standard, utility-based implementation suffers from a few problems. First, it does not estimate cumulative future rewards. Second, the quantity R_t - U(p) might become negative whenever R - t(p) < U(p). For example, consider two productions that compete after high-school graduation, go-to-college and find-a-job: if go-to-college fires and its reward arrives only after four years, the production will likely be "punished" for the delay. In many ways this does not matter, since what matters for utility-based competition are the relative utilities. However, to keep the relative utilities consistent, rewards would have to be assigned exactly the values that offset their timing. This defeats the purpose of RL: an RL agent should learn on its own to estimate the expected future rewards.
Similarly, this system cannot take into account individual differences in temporal discounting, which we know exist. Again, such differences could be rephrased in terms of different values of R, but this seems unrealistic, since different R values would also produce different behaviors in the short term.
In classic RL, actions _a_1, _a_2 ... _a_M are tied to specific states _s_1, _s_2 ... _s_N. Each action has an associated Q value, which represents the estimated rewards expected to be generated in the long term by selecting that particular action. The value Q(s, a) of an action a is updated according to the temporal difference equation:
... Q(s_t, a_t) = Q(s_t, a_t) + α[R_t + γQ(s_t+1, a_t+1) - Q(s_t, a_t)]
Where Q(s_t+1, a_t+1) is the value of the action a_t+1 that applies to the state s_t+1 that follows s_t after the execution of a_t. This value can be the value of the next action actually selected, as implied by this notation (the on-policy SARSA rule), or the maximum value over the possible actions, i.e., max[Q(s_t+1, _a_1), Q(s_t+1, _a_2) ... Q(s_t+1, _a_N)], when using an off-policy rule like Q-Learning.
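To make the two variants concrete, here is a short Python sketch of a single TD update under either rule (a plain dictionary stands in for the Q table; all names here are illustrative):

```python
from collections import defaultdict

def td_update(Q, s, a, reward, s_next, a_next, actions_next,
              alpha=0.1, gamma=0.9, on_policy=True):
    """One temporal-difference update of Q(s, a)."""
    if on_policy:
        # SARSA: bootstrap from the action actually selected in s_next
        bootstrap = Q[(s_next, a_next)]
    else:
        # Q-Learning: bootstrap from the best action available in s_next
        bootstrap = max(Q[(s_next, b)] for b in actions_next)
    Q[(s, a)] += alpha * (reward + gamma * bootstrap - Q[(s, a)])

# Example: unseen (state, action) pairs default to 0.0
Q = defaultdict(float)
td_update(Q, "s0", "a1", reward=1.0, s_next="s1", a_next="a2",
          actions_next=["a1", "a2"])
```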
In ACT-R, productions are implicitly tied to a state by their condition. Thus a production p can be expressed as an <s,a> pair. From this, it is fairly simple to adapt SARSA with the following procedure:
After the i-th production p_i is selected, update the previous production p_i-1 according to the rule:
Q(p_i-1) = Q(p_i-1) + α[R + γQ(p_i) - Q(p_i-1)]
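A sketch of what this step could look like in Python, with the Q table indexed directly by productions (the names are hypothetical; the real module would hook into ACT-R's conflict-resolution cycle and reward events):

```python
from collections import defaultdict

def update_previous_production(Q, p_prev, p_curr, reward, alpha=0.1, gamma=0.9):
    """Called when p_curr (p_i) is selected: update the previously fired p_prev (p_i-1)."""
    # reward is whatever reward was received since p_prev fired (possibly 0)
    Q[p_prev] += alpha * (reward + gamma * Q[p_curr] - Q[p_prev])

# Example with arbitrary production names
Q = defaultdict(float)
update_previous_production(Q, "encode-problem", "retrieve-answer", reward=0.0)
```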
This system by itself cannot handle non-Markov environments. That is not a problem per se, but we can extend it into an n-Q-Learning or n-SARSA version by adding a parameter n that specifies the number of backup steps for the update rule. In this implementation, the update simply propagates back in time to production p_i-n, and consists of n successive updating steps (a sketch of this backward pass follows the list):
1. Q(p_i-1) = Q(p_i-1) + α[R + γQ(p_i) - Q(p_i-1)]
2. Q(p_i-2) = Q(p_i-2) + α[R + γQ(p_i-1) - Q(p_i-2)]
...
n. Q(p_i-n) = Q(p_i-n) + α[R + γQ(p_i-n+1) - Q(p_i-n)]
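A minimal sketch of this backward pass, assuming the module keeps a simple history list of fired productions (most recent last); the names and bookkeeping are illustrative:

```python
from collections import defaultdict

def n_step_update(Q, history, reward, n, alpha=0.1, gamma=0.9):
    """Propagate the reward back through the last n fired productions.

    history -- list of fired productions [..., p_i-2, p_i-1, p_i]
    """
    steps = min(n, len(history) - 1)       # cannot back up past the first production
    for k in range(1, steps + 1):
        p_next = history[-k]               # p_(i-k+1), already updated when k > 1
        p_target = history[-k - 1]         # p_(i-k)
        Q[p_target] += alpha * (reward + gamma * Q[p_next] - Q[p_target])

# Example: back up two steps from the most recent production
Q = defaultdict(float)
n_step_update(Q, ["p1", "p2", "p3", "p4"], reward=10.0, n=2)
```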
The main problem with ACT-R's current scheme is that rewards are discounted linearly with time, which produces negative effective rewards when left unchecked. In the equation above, the recursive use of γQ(p_i) guarantees that reward propagates backwards with exponentially declining returns as the number of steps grows. However, ACT-R needs to handle varying time intervals between rewards. The simplest trick is to scale the parameter γ with a time-dependent exponent, so that events that follow each other almost immediately are discounted by a factor close to 1:
γ = γ^[t(i) - t(i-1)]
in which t(i) is the time at which production p_i fires. The time scale (seconds, minutes) can easily be adjusted by a scalar value. Here is an example of how the temporal discount parameter γ decays with the elapsed time, over a span of 10 minutes (600 seconds):
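A short sketch of that decay (γ = 0.99 and a one-second time unit are illustrative choices, not committed parameter values):

```python
gamma = 0.99
time_scale = 1.0  # 1.0 counts elapsed time in seconds; 60.0 would count in minutes

for elapsed in (0, 1, 5, 10, 30, 60, 120, 300, 600):
    effective = gamma ** (elapsed / time_scale)
    print(f"t = {elapsed:3d} s  ->  effective discount = {effective:.4f}")
```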