In domains as diverse as mastering video games, controlling robotic limbs, and finetuning ChatGPT, a family of approaches known collectively as “reinforcement learning” (RL) has revolutionized the field of artificial intelligence in recent years. These algorithms seek to maximize the total amount of reward an agent receives, whether that be points in a video game, successful completion of a manipulation task, or the preferences of human raters. Animals face an analogous set of challenges as they attempt to find food, water, and mates while minimizing their exposure to predators and other environmental hazards. It thus stands to reason that dedicated RL circuitry exists in the vertebrate brain; indeed, the study of this circuitry inspired many of the tricks and mechanisms that were later developed to enhance performance in artificial systems.
Over the last several decades, the Uchida lab and others in the field have been exploring these parallels, with a particular focus on the neuromodulator dopamine and its downstream targets in a region of the brain called the striatum. In a newly published paper in the journal Nature, we deepen these connections between natural and artificial intelligence while borrowing inspiration in the opposite direction — using modern advances in RL to reveal the computational principles underlying the anatomy and physiology of the striatum.
We begin with the observation that while traditional RL algorithms consider only the average amount of reward an agent obtains, subject to various sources of uncertainty in the environment, a key innovation for extending these algorithms to more complex domains and improving their performance has been to learn the entire probability distribution of rewards. Indeed, a prior collaboration involving our lab suggested that the activity of dopamine neurons was consistent with this richer learning objective, known as “distributional RL.” However, there was scant evidence that the striatum actually learns these complete probability distributions on the basis of dopamine signals, as suggested by this earlier work. Moreover, it was unclear how the striatum — a heterogeneous structure composed of subsystems that often oppose one another — could do so even in principle.
To find out, we designed a behavioral task in which mice were trained to associate different odors with distinct probability distributions of water reward. Two of these distributions had the same mean but different variances, and so gave rise to different reward predictions under distributional (but not traditional) RL. When we recorded in the striatum, we found many neurons that distinguished the mean-matched distributions in this way. Furthermore, these signatures of distributional RL in the striatum were reduced when we eliminated dopamine, consistent with its hypothesized role as an RL teaching signal.
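To make this logic concrete, here is a minimal simulation sketch in Python. The reward amounts, learning rates, and number of trials are invented for illustration, and the “distributional” learner uses an expectile-style rule (asymmetric learning rates for positive versus negative prediction errors), a common formulation from the distributional RL literature rather than the exact model fit in the paper. A learner that tracks only the mean assigns identical values to the two mean-matched cues, whereas the distributional population pulls them apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean-matched reward distributions (illustrative amounts only):
# cue "A" always delivers 4 units of water; cue "B" delivers 1 or 7 with equal
# probability, so the two cues share the same mean (4) but differ in variance.
def sample_reward(cue):
    return 4.0 if cue == "A" else float(rng.choice([1.0, 7.0]))

alpha = 0.02                      # base learning rate
taus = np.linspace(0.1, 0.9, 5)   # asymmetries of the distributional units

V_mean = {"A": 0.0, "B": 0.0}                                  # classical RL
V_dist = {"A": np.zeros_like(taus), "B": np.zeros_like(taus)}  # distributional RL

for _ in range(20_000):
    cue = str(rng.choice(["A", "B"]))
    r = sample_reward(cue)

    # Classical TD-style update: a single value per cue converges to the mean.
    V_mean[cue] += alpha * (r - V_mean[cue])

    # Expectile-style update: each unit scales positive prediction errors by tau
    # and negative ones by (1 - tau), so units converge to different expectiles.
    delta = r - V_dist[cue]
    lr = np.where(delta > 0, taus, 1.0 - taus)
    V_dist[cue] += alpha * lr * delta

print("classical values:      ", {k: round(v, 2) for k, v in V_mean.items()})
print("distributional, cue A: ", np.round(V_dist["A"], 2))
print("distributional, cue B: ", np.round(V_dist["B"], 2))
# Classical values: roughly 4 for both cues (indistinguishable).
# Distributional values: all roughly 4 for cue A, but fanned out between about
# 1.6 and 6.4 for cue B, so the population as a whole tells the two cues apart.
```

Running this, the classical values for both cues settle near the shared mean, while the distributional units for the variable cue fan out above and below it, which is the kind of signature we looked for in the striatal recordings.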
In the striatum, dopamine is thought to act in opposite ways on the two major classes of neurons, called D1 and D2 after the type of dopamine receptor that they express. However, it has remained unclear how both of these cell types could contribute to RL in the brain. By considering the shape of the distribution beyond the mean, we realized that D1 neurons might specialize in the upper tail of the reward distribution, because their connections tend to strengthen in response to the dopamine increases that follow better-than-expected outcomes. Meanwhile, D2 neurons, whose connections tend to strengthen in response to the dopamine decreases that follow worse-than-expected outcomes, might prefer the lower tail of the reward distribution.
We developed a formal mathematical model of this process and then demonstrated a close match between the predictions of this model and the quantitative structure of our neuronal recordings — both for the population as a whole and for D1 and D2 neurons separately. When we selectively increased or decreased the activity of these D1 or D2 cells, we also observed changes in the animals’ behavior that were precisely predicted by our model. We therefore propose that D1 neurons reflect “optimistic” predictions, above the mean, while D2 neurons reflect “pessimistic” predictions, below the mean.
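For readers who want a slightly more formal sketch, the core intuition can be written in the style of expectile codes from the distributional RL literature; note that this is an illustrative formulation, not a full description of the model fit in the paper. Each value estimate $V_i$ is updated asymmetrically depending on the sign of the reward prediction error:

$$
\Delta V_i =
\begin{cases}
\alpha_i^{+}\,(r - V_i) & \text{if } r > V_i \text{ (dopamine above baseline)} \\
\alpha_i^{-}\,(r - V_i) & \text{if } r \le V_i \text{ (dopamine below baseline)}
\end{cases}
$$

The asymmetry $\tau_i = \alpha_i^{+} / (\alpha_i^{+} + \alpha_i^{-})$ determines where the estimate settles: units with $\tau_i > 0.5$, like D1 neurons that weight dopamine increases more heavily, converge above the mean of the reward distribution, while units with $\tau_i < 0.5$, like D2 neurons, converge below it.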
We are excited by this work for many reasons. First, it provides a normative explanation for the coexistence and apparent opponency of D1 and D2 neurons in the striatum, a ubiquitous empirical finding that has nonetheless puzzled theorists for over a decade.
Second, it helps unify the study of RL in the brain with the study of decision-making under risk. That is, we know animals can make sensible decisions between, say, a smaller, certain reward and a larger, uncertain one. Distributional RL provides a mechanism for learning about these choices — and a candidate means by which these choices might be biased, whether over development (e.g. by chronic drug use, early childhood experience, or genetic predisposition), in psychopathology (e.g. by addiction, bipolar disorder, or depression), or on behavioral timescales (e.g. by acute drug use, stress, or hunger).
Third and finally, it provides one more piece of evidence that the connection between artificial and biological intelligence is a profound one. More specifically, it appears that once again, evolution stumbled upon some of the same principles in mammals that computer scientists only recently discovered to improve learning in silicon-based systems. Only this time, those principles remained unknown to biologists until the application of AI frameworks and methods — forging yet another link between RL in brains and machines.
by Adam Lowet and Nao Uchida