The last several years have seen remarkable developments in artificial intelligence (AI). Computers can now play complex video games such as Pac-Man, Space Invaders, and even StarCraft II at a human level. Computers have also beaten human champions of classic games such as chess, shogi, and Go. As in these games, in everyday life we often take multiple actions (e.g. navigating through a city) to achieve a particular goal (getting to a restaurant). Success in these situations depends on a whole sequence of actions. This property – a delayed reward – poses a serious problem: if a reward comes only after taking many actions, how can one know which actions were responsible for the final outcome? In other words, how can one properly distribute credit across the different actions? This problem, called the credit assignment problem, is what makes learning in these situations difficult. How do brains and machines solve it?
Computers that play these games are equipped with specific learning algorithms. First, these algorithms learn the “value” of each state, that is, how good each state is. In the example of finding a restaurant, each location may be a state, and its value would be related to how far the restaurant is from that location. In the case of board games, states can be defined by the positions of the stones or pieces, and the value can be defined as the probability of winning from each state. These algorithms learn the values of different states by trial and error.
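To make this concrete, the learned values can be as simple as a lookup table from states to numbers. Here is a toy illustration for the restaurant example; the states and numbers below are made up for illustration and are not taken from any of the studies discussed.

```python
# Toy value table for the "finding a restaurant" example.
# States are locations; values are higher the closer a location is to the goal.
# All names and numbers are illustrative.
values = {
    "home":        0.1,
    "bus_stop":    0.3,
    "city_center": 0.6,
    "restaurant":  1.0,   # the goal itself
}
print(max(values, key=values.get))   # -> "restaurant", the most valuable state
```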
One strategy for learning values is to first play out an entire sequence of actions and, after experiencing the final outcome, look back and update the values of all the states visited. Although this might sound reasonable, it is not what most modern algorithms do. Another strategy is temporal difference (TD) learning, in which values are updated “on the fly”, even before experiencing the final outcome (Sutton and Barto, 1998). Even when learning is not complete, and the learned values are therefore still tentative, TD learning uses these tentative values to evaluate whether a recent move led to a good or bad state. Values are updated as soon as an increase or decrease in value is detected between consecutive states, so as to minimize the difference in value between the two. This discrepancy between the values at consecutive time points, plus any reward obtained, is called the TD error. In TD learning algorithms, TD errors are thus computed moment by moment, on the fly, and as soon as a TD error is detected, the value is updated.
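In the notation of Sutton and Barto (1998), the TD error at each step compares the value of the state just left with the reward just received plus the discounted value of the state just entered. A minimal sketch of a single TD(0) update is shown below; the learning rate and discount factor are illustrative choices, not values from any particular study.

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.99):
    """Perform one TD(0) update and return the TD error.

    The TD error is the discrepancy between the old prediction, values[state],
    and the newly observed evidence: the immediate reward plus the discounted
    value of the next state. The prediction is then nudged to reduce that
    discrepancy. alpha (learning rate) and gamma (discount factor) are
    illustrative choices.
    """
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error
```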
This moment-by-moment computation of TD errors is what makes TD learning special: it provides a solution to the credit assignment problem mentioned above. After every action, the TD error indicates, at least tentatively, whether the action that was taken was good or bad. Another important point is that TD learning is local: it requires only currently available information (which state you are in now and which you were in a moment before, the learned values of these states, and the reward you just obtained). This dramatically reduces the memory requirement. These are the main reasons why TD learning algorithms and their variants still play a central role in many AI applications.
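To make the locality point concrete, here is a hypothetical learning loop on a toy linear corridor (states 0 to 9, reward on reaching the last state). At every step, only the current transition and the current value table are used; no record of the full trajectory is kept. The environment and parameters are our own illustration.

```python
values = [0.0] * 10              # toy corridor: states 0..9, reward on reaching state 9
alpha, gamma = 0.1, 0.99         # illustrative learning rate and discount factor

for episode in range(500):
    state = 0
    while state < 9:
        next_state = state + 1                      # one step toward the goal
        reward = 1.0 if next_state == 9 else 0.0
        # Only local information is used: the current transition
        # (state, next_state, reward) and the current value estimates.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        state = next_state

print([round(v, 2) for v in values])
# The learned values ramp up toward the goal; the terminal state (index 9)
# keeps its initial value of 0 by convention.
```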
In the brain, it has long been thought that the activity of dopamine neurons approximates TD errors. First, dopamine neurons respond to a surprising reward. When a cue predicts a reward, however, dopamine neurons respond to the cue instead, and their response to the reward is reduced. Furthermore, when a predicted reward is suddenly omitted, dopamine neurons reduce their activity below baseline. Previous studies have shown that all of these responses, locked to cue and reward, can be explained as TD errors (for example, the response to a reward-predictive cue can be explained as a sudden increase in value caused by the cue) (Schultz et al., 1997). However, recent studies have observed that dopamine signals exhibit some patterns that cannot be readily explained by TD errors. For one, as animals approached a remote reward location in a maze, dopamine concentrations gradually ramped up over a timescale of seconds. These studies proposed that dopamine neurons, at least in these cases, signal values instead of TD errors, because values are expected to increase as the animal gets closer to a reward location (Howe et al., 2013; Hamid et al., 2016).
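To see how the classic cue and reward responses above follow from TD errors, consider a toy tabular simulation (our own illustration, not a model fitted in any of the cited studies): a trial is divided into time steps, a cue arrives unpredictably at the first step, and a reward is delivered at the last. After training, the TD error is large at the cue and near zero at the now fully predicted reward, and omitting the reward yields a negative error, mirroring the dip below baseline.

```python
T = 10                      # time steps from cue (t = 0) to reward (after t = T - 1)
alpha, gamma = 0.2, 0.95    # illustrative learning rate and discount factor
V = [0.0] * T               # value of each within-trial time step

for trial in range(300):
    for t in range(T):
        reward = 1.0 if t == T - 1 else 0.0
        next_value = V[t + 1] if t + 1 < T else 0.0   # the trial ends after the reward
        delta = reward + gamma * next_value - V[t]    # TD error at this moment
        V[t] += alpha * delta

# Cue response: the cue is assumed unpredictable, so the pre-cue prediction is ~0
# and the TD error is the jump up to the discounted value of the first step.
cue_error = gamma * V[0] - 0.0
# Reward response: reward plus what follows (nothing) minus the learned prediction.
reward_error = 1.0 + gamma * 0.0 - V[T - 1]
# Omission response: no reward arrives although one was predicted, so the error is negative.
omission_error = 0.0 + gamma * 0.0 - V[T - 1]
print(round(cue_error, 2), round(reward_error, 2), round(omission_error, 2))
```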
Do dopamine neurons signal TD errors or values? Considering that the TD error reflects a change in value over time (i.e. the time derivative of value), the mere presence of a ramping signal does not distinguish whether dopamine signals convey value or TD error (Gershman, 2014). For example, if the value increases more steeply as the animal approaches the reward location (i.e. if the value function is convex), not only the value but also the TD error can ramp up.
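A small numerical example makes this point concrete. Suppose, purely for illustration, that the value grows as the square of the normalized position along the path to the reward (a convex shape) and that the animal moves at a constant speed. Ignoring discounting and intermediate rewards, the TD error reduces to the step-by-step change in value, and it ramps up as well:

```python
T = 10
positions = [t / T for t in range(T + 1)]   # constant-speed approach: 0.0, 0.1, ..., 1.0
value = [p ** 2 for p in positions]         # a convex value function (illustrative choice)

# With gamma close to 1 and no reward until the end, the TD error is approximately
# the step-by-step change in value, i.e. a discrete time derivative.
td_errors = [value[t + 1] - value[t] for t in range(T)]
print([round(d, 3) for d in td_errors])
# Output: [0.01, 0.03, 0.05, ..., 0.19]; both the value and the TD error ramp up.
```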
In this study (Kim et al., 2020), we developed a set of experimental paradigms to dissociate TD errors from values. For instance, we used manipulations such as teleportation in virtual reality with head-fixed mice. Computer screens presented a virtual corridor in which mice were rewarded with a drop of water when they reached a goal location. In test trials, mice were, for example, suddenly teleported to a location closer to the goal. In other trials, we manipulated the speed of the scene movement. Our results demonstrated that dopamine signals showed a derivative-like property: they showed a transient increase at the time of teleportation, and the magnitude of the ramp dynamically tracked the speed of the scene movement, not position per se. Importantly, dopamine neurons did not respond when mice were teleported between two corridors with different wall patterns if the teleport did not change the distance to the goal (and thus did not change the value). In contrast, dopamine neurons responded whenever the teleport did change the value: for example, when mice were teleported between two corridors associated with different reward sizes, or teleported backward, which would reduce the value. These results indicate that dopamine neurons do not signal pure sensory surprise; rather, what matters is a change in value.
Together, this study indicates that dopamine neurons perform a derivative-like computation over values on a moment-by-moment basis. In short, dopamine neurons are constantly checking for errors in predictions (i.e. in values). As discussed above, this moment-by-moment nature of TD errors is what makes TD learning algorithms so powerful, providing a solution to credit assignment problems. This study thus provides evidence that the brain computes moment-by-moment TD errors, a central tenet of TD learning, closing the gap between brains and machines.