A paper co-authored by Toyota Professor of Artificial Intelligence Satinder Singh Baveja has been recognized for its lasting impact on the field of reinforcement learning. First published in 1999, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning” has been selected for the 2019 Classic Paper Award by the Artificial Intelligence Journal. The award will be conferred at the International Joint Conference on Artificial Intelligence in August.
This is the sixth year the AI Journal has recognized a classic paper, selecting up to two key influential publications annually.
Baveja’s paper, written while he worked at AT&T Labs with collaborators Richard Sutton and Doina Precup of the University of Massachusetts Amherst, tackled the difficult problem of giving artificial intelligence a way to understand and represent knowledge that abstracts over temporal details. This is a key issue for AI designers who want to create an agent that can learn from its past experience and plan ahead.
Human decision making routinely involves choosing among courses of action over a broad range of time scales. For example, a traveler considering a trip to a distant city has to decide whether or not to go, weighing the benefits of the trip against the expense, and then make a number of other choices at each leg of the journey. They’ll decide whether to fly or to drive, whether to take a taxi or to arrange a ride, and even their smallest actions require foresight and decision. Just calling a taxi may involve finding a telephone, dialing each digit, and the individual muscle contractions needed to lift the receiver to the ear. The question had stood since the 1970s: how can we understand and automate this ability to work flexibly with multiple overlapping time scales?
The researchers addressed the problem within an AI framework called reinforcement learning.
Reinforcement learning (RL) is a model for training artificial intelligences via trial and error. By contrast, supervised learning, the kind used to train image recognition software, for example, provides its agents with labeled examples (and requires people to produce those labels). RL instead works without labels or direct human instruction, learning from a feedback signal of rewards.
A common tool in RL design is the Markov decision process (MDP). At any time, the agent occupies a single state, chooses among the actions available to it, receives a possibly randomized reward, and transitions to a new state, where the cycle repeats.
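The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the states ("home", "city"), actions, and rewards are invented to echo the travel example, and the transition table maps each state-action pair to a distribution over (next state, reward) outcomes.

```python
import random

# A toy MDP (states, actions, and rewards invented for illustration):
# each (state, action) pair maps to [(probability, next_state, reward), ...].
TRANSITIONS = {
    ("home", "drive"): [(0.9, "city", 5.0), (0.1, "home", -1.0)],
    ("home", "fly"):   [(1.0, "city", 8.0)],
    ("city", "stay"):  [(1.0, "city", 0.0)],
}

def step(state, action):
    """Sample one MDP transition: return (next_state, reward)."""
    outcomes = TRANSITIONS[(state, action)]
    r = random.random()
    cumulative = 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if r <= cumulative:
            return next_state, reward
    return outcomes[-1][1], outcomes[-1][2]

state = "home"
state, reward = step(state, "fly")  # always lands in "city" with reward 8.0
```

Running `step` repeatedly and choosing actions to maximize the accumulated reward is, in miniature, the problem an RL agent solves.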
The researchers’ central construct, the “option,” packages a whole course of action, such as calling a taxi, together with a rule for when that course of action ends. They showed that options enable knowledge and actions abstracted over time to be included in the RL framework in a natural and general way. In particular, they showed that options could be used interchangeably with primitive one-step actions in existing planning and learning methods.
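The interchangeability claim can be sketched concretely. In this hedged illustration (the class and function names are invented, not from the paper), an option is just a policy plus a termination condition, and a primitive action is wrapped as an option that always terminates after one step, so a single `execute` routine runs either kind:

```python
class Primitive:
    """A one-step action wrapped in the option interface."""
    def __init__(self, action):
        self.action = action
    def policy(self, state):
        return self.action
    def terminates(self, state):
        return True  # primitives always end after one step

class Option:
    """A temporally extended action: follow `policy` until `terminates`."""
    def __init__(self, policy, terminates):
        self.policy = policy
        self.terminates = terminates

def execute(option, state, env_step):
    """Run an option to termination, accumulating the total reward.

    `env_step(state, action) -> (next_state, reward)` is the environment.
    """
    total = 0.0
    while True:
        state, reward = env_step(state, option.policy(state))
        total += reward
        if option.terminates(state):
            return state, total

# In a toy corridor where moving right costs 1 per step, the same
# `execute` call runs a multi-step option or a single primitive action:
corridor = lambda s, a: (s + 1, -1.0)
go_to_3 = Option(policy=lambda s: "right", terminates=lambda s: s == 3)
# execute(go_to_3, 0, corridor)          -> (3, -3.0)
# execute(Primitive("right"), 0, corridor) -> (1, -1.0)
```

Because both kinds of action share one interface, a planner or learner built for primitive actions can treat a temporally extended option as just another choice.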
In addition, they demonstrated three important uses of options in planning that make AI more robust and responsive: agents can interrupt an option mid-execution when planning reveals a better one, achieving better results than originally planned; learn about the results of different options ahead of time by performing them only partially; and introduce subgoals that can be used to improve the options themselves.
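The first of those ideas, interruption, can be sketched as follows. This is a simplified illustration with invented names and values, not the paper’s algorithm: given estimated values `q[(state, option)]`, the agent re-evaluates at every state and abandons its current option whenever some other option now looks better.

```python
def run_with_interruption(state, current, q, policies, env_step, goal):
    """Execute options greedily, interrupting when a better one appears.

    q[(state, name)]     -- estimated value of starting option `name` there
    policies[name](s)    -- the action the named option takes in state s
    env_step(s, a)       -- environment: returns (next_state, reward)
    """
    while state != goal:
        # Interruption check: would some other option now do better?
        best = max(policies, key=lambda name: q[(state, name)])
        if q[(state, best)] > q[(state, current)]:
            current = best
        state, _ = env_step(state, policies[current](state))
    return state, current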
The paper pulled together a variety of work from across the AI field and recast those ideas in a simpler, more general setting that required fewer changes to existing RL methods.