Department News

How Does the Brain Orchestrate Learning from Reward? [Uchida Lab]

Animals learn a variety of actions: they can flexibly acquire a new action or modify a learned one depending on the situation, and with training they can also stabilize performance and develop a skill or habit. How can animals learn flexibly without losing the stable performance of a skill? Different brain areas are responsible for each type of learning. For instance, the dorsomedial striatum (DMS) is important for flexible learning, while the dorsolateral striatum (DLS) plays a role in skill and habit. However, the exact mechanism of learning in these areas is not understood.

Dopamine is important for reward-based learning. Dopamine neurons are believed to signal the discrepancy between actual and predicted reward (the “reward prediction error”, or RPE) to other brain areas as a teaching signal, so that those areas can learn from surprising outcomes. Most learning theories assume that dopamine neurons broadcast the same teaching signal throughout the striatum, but how diverse dopamine function must be to support different kinds of learning remains controversial. Do dopamine neurons send the same teaching signal to both DMS and DLS, or does their activity follow different rules to support different types of learning? Our postdoc, Iku Tsutsui-Kimura, addressed this question by combining neural recording and modeling in a well-controlled instrumental conditioning paradigm (Tsutsui-Kimura et al., 2020).
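To illustrate the RPE idea in its simplest form, here is a minimal sketch of error-driven value learning (a Rescorla-Wagner-style update; the learning rate and reward values are arbitrary illustrations, not parameters from the study):

```python
# Minimal sketch of learning from reward prediction errors (RPE).
# All numbers are illustrative, not values from the study.
predicted = 0.0   # the animal's current reward prediction
alpha = 0.2       # learning rate (assumed)

for trial in range(5):
    reward = 1.0                  # actual reward delivered
    rpe = reward - predicted      # RPE: actual minus predicted reward
    predicted += alpha * rpe      # the error, not the reward itself, drives learning
    print(f"trial {trial}: RPE = {rpe:.2f}, prediction = {predicted:.2f}")
```

As the reward becomes predicted, the RPE shrinks toward zero: a fully expected reward is no longer surprising and no longer drives learning.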

In this study, thirsty mice learned to associate an odor cue with an available water port: they smelled an odor, chose a water port based on the cue, stayed at the port, and received a water reward. We systematically examined the activity of dopamine axon projections at various locations in the striatum. Surprisingly, dopamine axon activity across striatal subareas was very similar, signaling reward prediction error (i.e., surprising outcomes). When we carefully examined the temporal dynamics of this activity, we found that dopamine neurons were inhibited in error trials and slightly excited in correct trials right after the choice, before the actual outcome. Through modeling, we found that these dynamics were explained by the temporal difference (TD) error, a specific type of reward prediction error from reinforcement learning theory (Sutton and Barto, 1998). Like TD errors, dopamine activity in both DMS and DLS dynamically tracks moment-by-moment changes in the estimate of the upcoming reward.
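As a rough sketch of how a TD error tracks the reward estimate within a trial, consider a simplified three-step trial (odor, then choice, then outcome). The state values and discount factor below are hypothetical illustrations, not the fitted model from the paper:

```python
# Sketch of temporal-difference (TD) errors across a simplified trial:
# odor -> choice -> outcome. All values are hypothetical.
gamma = 1.0  # discount factor (assumed)

# Learned state values after training (illustrative numbers):
V = {"odor": 0.8, "correct_choice": 1.0, "error_choice": 0.0, "end": 0.0}

def td_error(v_now, v_next, reward=0.0):
    # TD error: reward + discounted next-state value - current-state value
    return reward + gamma * v_next - v_now

# Right after the choice, before any outcome, the TD error already moves:
print(f"correct choice: {td_error(V['odor'], V['correct_choice']):+.2f}")  # slight excitation
print(f"error choice:   {td_error(V['odor'], V['error_choice']):+.2f}")   # inhibition

# A fully predicted reward at the outcome produces no error:
print(f"reward arrives: {td_error(V['correct_choice'], V['end'], reward=1.0):+.2f}")
```

Because the TD error compares successive reward estimates, it responds as soon as the estimate changes, which is why the signal can appear right after the choice, before the outcome itself.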

However, there were subtle but critical differences in dopamine activity across striatal areas. Dopamine axons in most areas are inhibited when the received reward is smaller than expected; this inhibition is believed to suppress the just-taken action in the future by signaling “not good”. In contrast, we found that dopamine axons in DLS tend to be excited even when the reward is smaller than expected. In other words, TD errors in DLS are positively biased. We believe these findings are consistent with the idea that DLS supports skill and habit: DLS may acquire and stably maintain a skill or habit by receiving an “OK” signal from dopamine neurons every time the animal repeats (or practices) the action, even if the outcome is smaller than expected. Hence, a skill or habit can develop with repetition and is not lost after occasional bad outcomes.
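One simple way to picture this bias: the same TD error could be read out with an area-specific offset, so a mildly disappointing reward still looks “OK” to DLS while looking “not good” to DMS. The additive-offset form and all numbers here are our illustrative assumptions, not the study’s fitted model:

```python
# Sketch: the same TD error delivered with an area-specific bias.
# The additive-offset form and the numbers are illustrative assumptions.
def area_signal(td_err, bias=0.0):
    # Each striatal subarea receives the TD error shifted by its own bias.
    return td_err + bias

small_reward_td = -0.3  # reward smaller than expected -> negative TD error

dms = area_signal(small_reward_td, bias=0.0)  # DMS: unbiased
dls = area_signal(small_reward_td, bias=0.5)  # DLS: positively biased

print(f"DMS signal: {dms:+.1f}  ('not good' -> adjust the action)")
print(f"DLS signal: {dls:+.1f}  ('OK' -> maintain the habit)")
```

With this kind of offset, occasional small rewards push DMS to revise its choices while leaving the DLS habit intact, the parallel behavior described below.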

In the natural world, animals continuously learn flexibly without losing their skills. Our work proposes a simple mechanism to achieve this: wide areas of the brain continuously receive moment-by-moment teaching signals, but the criteria of those signals differ slightly. Thus, for the same small reward, one brain area receives an “OK” signal to keep the skill while another receives a “not good” signal to improve the action, in parallel. In this way, different types of learning may be achieved by a similar algorithm with only slightly modified teaching signals. In the future, it will be important to examine exactly how these dynamic and diverse dopamine signals affect neuronal activity in the striatal subareas. Eventually, we wish to address how parallel learning across striatal subareas contributes to decision-making as a whole.

by Mitsuko Watabe-Uchida
