There is widespread interest in many forms of learning among neuroscientists as well as the artificial intelligence community. One type of learning in particular, known as reinforcement learning, has generated great excitement, both from stunning demonstrations of its power in a variety of machine settings (e.g. game playing) and from the enormous progress in uncovering its neural basis in animals (e.g. work from the Uchida lab in MCB). In reinforcement learning, the agent (such as a person or computer playing a video game) learns which actions are good and bad through the delivery of rewards and punishments, respectively. One aspect of learning that has remained underexplored is its rapidity: under many conditions, animals can learn from just a handful of trials. We have studied this in mice and have made exciting discoveries.
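The trial-and-error learning described above can be sketched with a textbook "delta rule," in which the estimated value of a stimulus is nudged toward each observed outcome. This is a minimal, generic illustration of reinforcement learning, not the specific model used in the study; the learning rate and trial counts here are arbitrary choices for demonstration.

```python
# Minimal sketch of reward learning via a delta rule (a generic
# reinforcement-learning textbook model, not the study's algorithm).

def update_value(value, reward, learning_rate=0.5):
    """Move the estimated value of a stimulus toward the observed reward."""
    prediction_error = reward - value
    return value + learning_rate * prediction_error

# An agent repeatedly experiences one rewarded stimulus (reward = 1)
# and one unrewarded stimulus (reward = 0).
v_rewarded, v_unrewarded = 0.0, 0.0
for _ in range(8):  # a small number of trials suffices
    v_rewarded = update_value(v_rewarded, reward=1.0)
    v_unrewarded = update_value(v_unrewarded, reward=0.0)

print(round(v_rewarded, 3), v_unrewarded)  # → 0.996 0.0
```

Note how quickly the two value estimates diverge: with a moderate learning rate, a handful of trials is enough to separate rewarded from unrewarded stimuli, echoing the rapid learning observed behaviorally.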
Rodents are olfactory specialists and can use odors to learn contingencies (for example, whether an odor will lead to a rewarding outcome or a punishment) quickly and reliably. We developed a task in which mice smell an odor and must decide, based on previous experience with that odor, whether it will be followed by a reward. Indeed, the mice readily learned to place multiple odors into rewarded and unrewarded categories. Once they had learned the rule, they could categorize new odors in a matter of minutes (fewer than 10 trials per odor). Learning the reward contingencies of odors must change the activity of neurons in the relevant areas of the brain, but what are these relevant areas? Two prime candidates are the olfactory cortex and the olfactory tubercle. The olfactory cortex, in particular the posterior piriform cortex (pPC), is thought to be an associative area, collating diverse signals and projecting to higher brain regions with cognitive functions. The olfactory tubercle (OT) is part of the ventral striatum, with direct connections to areas of the brain involved in motivated behaviors and reward.
We recorded the electrical activity of single neurons in pPC and OT as mice learned, through trial and error, the reward valence assigned to a panel of previously unexperienced odor stimuli. We examined the activity of single neurons, as well as the collective activity of populations of neurons, to look for signatures of reward selectivity. For example, after learning, all of the odors that predict reward might elicit similar activity in a single neuron, distinct from the activity elicited by unrewarded odors. Such category-dependent single-neuron activity was very prevalent in the OT, but we found no evidence of it in pPC. We then asked whether the activity of entire groups of neurons could be used to predict whether a given odor stimulus was rewarding. Using a widely used method (a categorical classifier called a support vector machine), we discovered that pPC activity largely reflects sensory coding, with very little explicit information about reward. By contrast, the OT acquires a representation of reward rapidly, in a matter of minutes, and this information is expressed within 100 ms of stimulus onset, well before the animal takes any motor action to indicate whether the odor was rewarded or unrewarded. We therefore conclude that coding of the stimulus information required for reward prediction does not occur within olfactory cortex, but rather in circuits involving the olfactory striatum.
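The population-decoding logic can be sketched as follows: a support vector machine is trained on trial-by-trial neural activity to predict the reward category, and above-chance cross-validated accuracy indicates that the population carries a readable reward signal. The data below are synthetic firing rates invented for illustration; the neuron counts, signal strength, and cross-validation scheme are assumptions and do not reproduce the study's actual analysis pipeline.

```python
# Hedged sketch of population decoding with a linear support vector
# machine, using simulated data in place of recorded neurons.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials, n_neurons = 200, 50  # arbitrary illustrative sizes

# Simulate population activity: on rewarded-odor trials, a subset of
# neurons shifts its firing rate (a hypothetical reward signal).
labels = rng.integers(0, 2, size=n_trials)      # 1 = rewarded odor
rates = rng.normal(size=(n_trials, n_neurons))  # baseline "firing rates"
rates[labels == 1, :10] += 1.0                  # add signal to 10 neurons

# Cross-validated linear SVM: accuracy well above chance (0.5) means the
# reward category can be read out from the population.
classifier = SVC(kernel="linear")
accuracy = cross_val_score(classifier, rates, labels, cv=5).mean()
print(f"decoding accuracy: {accuracy:.2f}")
```

Applying the same decoder to activity from different brain areas, or from successive time windows after odor onset, is one way such analyses localize where and when reward information appears.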
Our study is exciting because it establishes a paradigm for studying rapid reinforcement learning in a circuit that sits right at the interface of sensory input and reward areas, opening the way to a detailed molecular, cellular, and circuit dissection of this process.