Publications | Scott M. Jordan

2024

AAAI

From Past to Future: Rethinking Eligibility Traces

Dhawal Gupta, Scott M. Jordan, Shreyas Chaudhari, Bo Liu, Philip S. Thomas, and Bruno Castro da Silva

In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

Abs arXiv HTML

In this paper, we introduce a fresh perspective on the challenges of credit assignment and policy evaluation. First, we delve into the nuances of eligibility traces and explore instances where their updates may result in unexpected credit assignment to preceding states. From this investigation emerges the concept of a novel value function, which we refer to as the bidirectional value function. Unlike traditional state value functions, bidirectional value functions account for both future expected returns (rewards anticipated from the current state onward) and past expected returns (cumulative rewards from the episode’s start to the present). We derive principled update equations to learn this value function and, through experimentation, demonstrate its efficacy in enhancing the process of policy evaluation. In particular, our results indicate that the proposed learning approach can, in certain challenging contexts, perform policy evaluation more rapidly than TD(λ) – a method that learns forward value functions, v^π, directly. Overall, our findings present a new perspective on eligibility traces and potential advantages associated with the novel value function it inspires, especially for policy evaluation.
Goal-Space Planning with Subgoal Models

Chunlok Lo, Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, Gabor Mihucz, Farzane Aminmansour, and Martha White

2024

Abs arXiv

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.
A New View on Planning in Online Reinforcement Learning

Kevin Roice, Parham Mohammad Panahi, Scott M. Jordan, Adam White, and Martha White

2024

Abs arXiv

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.
ICML

Position: Benchmarking is Limited in Reinforcement Learning Research

Scott M. Jordan, Adam White, Bruno Castro da Silva, Martha White, and Philip S. Thomas

In Proceedings of the 41st International Conference on Machine Learning, ICML, 2024

Abs arXiv HTML

Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.
RLC

The Cliff of Overcommitment with Policy Gradient Step Sizes

Scott M. Jordan, Samuel Neumann, James E. Kostas, Adam White, and Philip S. Thomas

Reinforcement Learning Journal, 2024

Abs

Policy gradient methods form the basis for many successful reinforcement learning algorithms, but their success depends heavily on selecting an appropriate step size. While many adaptive step size methods exist, none are both free of hyperparameter tuning and able to converge quickly to an optimal policy. Moreover, it is unclear why these methods are insufficient, so we aim to uncover what needs to be addressed to make an effective adaptive step size for policy gradient methods. Through extensive empirical investigation, the results reveal that when the step size is above optimal, the policy overcommits to sub-optimal actions leading to longer training times. These findings suggest the need for a new kind of policy optimization that can prevent or recover from entropy collapses.

2023

Robust Markov Decision Processes without Model Estimation

Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M. Jordan, and Zhihua Zhang

2023

arXiv
Coagent Networks: Generalized and Scaled

James E. Kostas, Scott M. Jordan, Yash Chandak, Georgios Theocharous, Dhawal Gupta, Martha White, Bruno Castro da Silva, and Philip S. Thomas

2023

arXiv
Rigorous Experimentation For Reinforcement Learning

Scott M. Jordan

University of Massachusetts Amherst, 2023

HTML
NeurIPS

Behavior Alignment via Reward Function Optimization

Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, and Bruno Castro da Silva

In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems, NeurIPS, 2023

Abs arXiv HTML

Designing reward functions for efficiently guiding reinforcement learning (RL) agents toward specific behaviors is a complex task. This is challenging since it requires the identification of reward structures that are not sparse and that avoid inadvertently inducing undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer’s intended goal. Although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it often significantly impairs performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn behavior alignment reward functions. These functions integrate auxiliary rewards reflecting a designer’s heuristics and domain knowledge with the environment’s primary rewards. Our approach automatically determines the most effective way to blend these types of feedback, thereby enhancing robustness against heuristic reward misspecification. Remarkably, it can also adapt an agent’s policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method’s efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality – some of which are beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly-specified auxiliary reward functions.

2022

Scientific Experimentation for Reinforcement Learning

Scott M. Jordan

Opinion Talk - Deep Reinforcement Learning Workshop at NeurIPS, Dec 2022

HTML PDF Slides

2021

ICML

High Confidence Generalization for Reinforcement Learning

James E. Kostas, Yash Chandak, Scott M. Jordan, Georgios Theocharous, and Philip S. Thomas

In ICML, Dec 2021

Abs HTML PDF

We present several classes of reinforcement learning algorithms that safely generalize to Markov decision processes (MDPs) not seen during training. Specifically, we study the setting in which some set of MDPs is accessible for training. The goal is to generalize safely to MDPs that are sampled from the same distribution, but which may not be in the set accessible for training. For various definitions of safety, our algorithms give probabilistic guarantees that agents can safely generalize to MDPs that are sampled from the same distribution but are not necessarily in the training set. These algorithms are a type of Seldonian algorithm (Thomas et al., 2019), which is a class of machine learning algorithms that return models with probabilistic safety guarantees for user-specified definitions of safety.
Impact of changes in tissue optical properties on near-infrared diffuse correlation spectroscopy measures of skeletal muscle blood flow

Miles F Bartlett, Scott M. Jordan, Dennis M Hueber, and Michael D Nelson

Journal of Applied Physiology, Dec 2021

Abs HTML Code

Near-infrared diffuse correlation spectroscopy (DCS) is increasingly used to study relative changes in skeletal muscle blood flow. However, most diffuse correlation spectrometers assume that tissue optical properties, such as absorption (μ_a) and reduced scattering (μ_s′) coefficients, remain constant during physiological provocations, which is untrue for skeletal muscle. Here, we interrogate how changes in tissue μ_a and μ_s′ affect DCS calculations of blood flow index (BFI). We recalculated BFI using raw autocorrelation curves and μ_a/μ_s′ values recorded during a reactive hyperemia protocol in 16 healthy young individuals. First, we show that incorrectly assuming baseline μ_a and μ_s′ substantially affects peak BFI and BFI slope when expressed in absolute terms (cm²/s, P < 0.01), but these differences are abolished when expressed in relative terms (% baseline). Next, to evaluate the impact of physiologic changes in μ_a and μ_s′, we compared peak BFI and BFI slope when μ_a and μ_s′ were held constant throughout the reactive hyperemia protocol versus integrated from a 3-s rolling average. Regardless of approach, group means for peak BFI and BFI slope did not differ. Group means for peak BFI and BFI slope were also similar following ad absurdum analyses, where we simulated supraphysiologic changes in μ_a/μ_s′. In both cases, however, we identified individual cases where peak BFI and BFI slope were indeed affected, with this result being driven by relative changes in μ_a over μ_s′. Overall, these results provide support for past reports in which μ_a/μ_s′ were held constant but also advocate for real-time incorporation of μ_a and μ_s′ moving forward.

NEW & NOTEWORTHY We investigated how changes in tissue optical properties affect near-infrared diffuse correlation spectroscopy (NIR-DCS)-derived indices of skeletal muscle blood flow (BFI) during physiological provocation. Although accounting for changes in tissue optical properties has little impact on BFI on a group level, individual BFI calculations are indeed impacted by changes in tissue optical properties. NIR-DCS calculations of BFI should therefore account for real-time, physiologically induced changes in tissue optical properties whenever possible.

2020

ICML

Evaluating the Performance of Reinforcement Learning Algorithms

Scott M. Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, and Philip S. Thomas

In ICML, Dec 2020

Abs arXiv HTML Code Slides

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. Taking a step towards ensuring that reported results are consistent, we propose a new comprehensive evaluation methodology for reinforcement learning algorithms that produces reliable measurements of performance both on a single environment and when aggregated across environments. We demonstrate this method by evaluating a broad class of reinforcement learning algorithms on standard benchmark tasks.
NeurIPS

Towards Safe Policy Improvement for Non-Stationary MDPs

Yash Chandak, Scott M. Jordan, Georgios Theocharous, Martha White, and Philip S. Thomas

In NeurIPS, Dec 2020

Abs arXiv HTML Code

Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis. Safety is ensured using sequential hypothesis testing of a policy’s forecasted performance, and confidence intervals are obtained using wild bootstrap.

2019

ICML

Learning Action Representations for Reinforcement Learning

Yash Chandak, Georgios Theocharous, James E. Kostas, Scott M. Jordan, and Philip S. Thomas

In ICML, Dec 2019

Abs arXiv HTML

Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.
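
As a rough illustration of the decomposition described in this abstract, the sketch below splits a policy into a component that outputs a point in a low-dimensional representation space and a component that turns that point into an actual action. Everything named here is an assumption made for illustration: the fixed 2-D embeddings, the toy `internal_policy`, and the nearest-neighbor mapping in `to_action` are not taken from the paper, which learns its representations rather than hard-coding them.

```python
import math
import random

# Hypothetical 2-D embeddings for a small discrete action set (illustrative only).
ACTION_EMBEDDINGS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}


def internal_policy(state, noise=0.1, rng=random):
    """Toy internal policy: emits a noisy point in the representation space."""
    x, y = state
    return (x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise))


def to_action(representation):
    """Map a representation to the action whose embedding is nearest to it."""
    return min(
        ACTION_EMBEDDINGS,
        key=lambda action: math.dist(ACTION_EMBEDDINGS[action], representation),
    )


def policy(state, rng=random):
    """Overall policy: internal policy followed by the representation-to-action map."""
    return to_action(internal_policy(state, rng=rng))
```

Because nearby points in the representation space map to the same or similar actions, feedback about one action can inform the agent about its neighbors, which is the kind of generalization the abstract refers to.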
ICTIR

Learning a Better Negative Sampling Policy with Deep Neural Networks for Search

Daniel Cohen, Scott M. Jordan, and W. Bruce Croft

In ICTIR, Dec 2019

PDF
Classical Policy Gradient: Preserving Bellman’s Principle of Optimality

Philip S. Thomas, Scott M. Jordan, Yash Chandak, Chris Nota, and James E. Kostas

CoRR, Dec 2019

Abs arXiv

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman’s principle of optimality, and provide an expression for the gradient of the objective.
Evaluating Reinforcement Learning Algorithms Using Cumulative Distributions of Performance

Scott M. Jordan, Yash Chandak, Mengxue Zhang, Daniel Cohen, and Philip S. Thomas

Fourth Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), Jul 2019

2018

Distributed Evaluations: Ending Neural Point Metrics

Daniel Cohen, Scott M. Jordan, and W. Bruce Croft

In ACM SIGIR - LND4IR Workshop, Jul 2018

Abs arXiv

We propose a new evaluation metric for information retrieval that is based on the cumulative distribution of performance across a set of queries. This metric is more robust to noise than existing metrics, and can be used to compare systems in a statistically principled way.
Using Cumulative Distribution Based Performance Analysis to Benchmark Models

Scott M. Jordan, Daniel Cohen, and Philip S. Thomas

Critiquing and Correcting Trends in Machine Learning NeurIPS Workshop, Dec 2018

Abs PDF

When using only reported empirical results, it has become difficult to identify machine learning methods that provide meaningful advancement. One reason is that results are commonly only reported using well-tuned models, and thus represent an optimistic evaluation of performance. In this work, we propose a new framework for evaluating algorithms that presents both the performance when the system is well-tuned, as well as the difficulty of tuning the algorithm. This is achieved by considering the distribution of performances that result when applying the method with different hyper-parameter settings (e.g., different step sizes and network structures). Using common benchmark tasks in supervised and reinforcement learning, we demonstrate how this evaluation framework can both evaluate an algorithm’s robustness to hyper-parameter selection and identify new areas of improvement.
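
One simple way to look at the distribution of performances across hyper-parameter settings is an empirical cumulative distribution function. The sketch below is a minimal example under stated assumptions: the helper `empirical_cdf` and the score lists are hypothetical and are not the framework or data from the paper.

```python
from bisect import bisect_right


def empirical_cdf(scores):
    """Return a function F where F(x) is the fraction of scores <= x."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one score")

    def cdf(x):
        # bisect_right counts how many sorted scores are <= x.
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical final scores, one per hyper-parameter setting (made-up numbers).
scores_a = [0.2, 0.5, 0.9, 0.95]   # high peak, but sensitive to tuning
scores_b = [0.6, 0.65, 0.7, 0.75]  # lower peak, but consistent

cdf_a = empirical_cdf(scores_a)
cdf_b = empirical_cdf(scores_b)
```

In this made-up example, `cdf_a(0.5)` is 0.5 while `cdf_b(0.5)` is 0.0: half of the settings for method A score at or below 0.5, whereas no setting for method B does, even though A reaches the higher best-case score.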
Learning to Use a Ratchet by Modeling Spatial Relations in Demonstrations

Li Yang Ku, Scott M. Jordan, Julia Badger, Erik Learned-Miller, and Rod Grupen

International Symposium on Experimental Robotics (ISER), Dec 2018

Abs HTML PDF

We introduce a framework where visual features, describing the interaction among a robot hand, a tool, and an assembly fixture, can be learned efficiently using a small number of demonstrations. We illustrate the approach by torquing a bolt with the Robonaut-2 humanoid robot using a handheld ratchet. The difficulties include the uncertainty of the ratchet pose after grasping and the high precision required for mating the socket to the bolt and replacing the tool in the tool holder. Our approach learns the desired relative position between visual features on the ratchet and the bolt. It does this by identifying goal offsets from visual features that are consistently observable over a set of demonstrations. With this approach we show that Robonaut-2 is capable of grasping the ratchet, tightening a bolt, and putting the ratchet back into a tool holder. We measure the accuracy of the socket-bolt mating subtask over multiple demonstrations and show that a small set of demonstrations can decrease the error significantly.