I am a Postdoctoral Fellow at the University of Alberta, advised by Professor Martha White. I completed my Ph.D. in 2022 at the University of Massachusetts, where I was advised by Professor Philip Thomas. I primarily research techniques for solving sequential decision-making problems, with a focus on reinforcement learning. My research interests center around three areas:

  1. designing experimental methodologies that are more reliable and informative than those found in standard machine learning experiments,
  2. understanding the necessary properties for scaling reinforcement learning techniques to solve many tasks,
  3. developing scalable optimization methods for performing on-device learning.


  • Fall 2022 – Started as a Postdoc at the University of Alberta.
  • Fall 2022 – Defended my dissertation.
  • Spring 2022 – I was a visiting student in the Aerospace Controls Laboratory at MIT, advised by Jonathan How.
  • Fall 2021 – I proposed my dissertation on “Rigorous Experimentation for Reinforcement Learning.”
  • Summer 2021 – I interned at Unity Technologies working on reducing the need for hyperparameter tuning for reinforcement learning.




A Generalized Learning Rule for Asynchronous Coagent Networks
James Kostas, Scott Jordan, Yash Chandak, Georgios Theocharous, Dhawal Gupta, Philip Thomas
5th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2022.


Coagent networks for reinforcement learning (RL) (Thomas and Barto, 2011) provide a framework for deriving principled learning rules for stochastic neural networks in the RL setting. Previous work provided generalized coagent learning rules for the asynchronous setting (Kostas et al., 2020) and for the setting in which network parameters are shared (Zini et al., 2020). This work provides a generalized theorem that can be used to obtain learning rules for the combination of those cases; that is, the case where an asynchronous coagent network uses shared parameters. This work also provides a discussion of recent, ongoing, and future work.



Impact of Changes in Tissue Optical Properties on Near-infrared Diffuse Correlation Spectroscopy Measures of Skeletal Muscle Blood Flow
Miles F. Bartlett, Scott Jordan, Dennis M. Hueber, Michael D. Nelson
Journal of Applied Physiology, 2021.

Abstract | paper

Near-infrared diffuse correlation spectroscopy (DCS) is increasingly used to study relative changes in skeletal muscle blood flow. However, most diffuse correlation spectrometers assume that tissue optical properties, such as the absorption (μa) and reduced scattering (μ's) coefficients, remain constant during physiological provocations, which is untrue for skeletal muscle. Here, we interrogate how changes in tissue μa and μ's affect DCS calculations of blood flow index (BFI). We recalculated BFI using raw autocorrelation curves and μa/μ's values recorded during a reactive hyperemia protocol in 16 healthy young individuals. First, we show that incorrectly assuming baseline μa and μ's substantially affects peak BFI and BFI slope when expressed in absolute terms (cm2/s, P < 0.01), but these differences are abolished when expressed in relative terms (% baseline). Next, to evaluate the impact of physiologic changes in μa and μ's, we compared peak BFI and BFI slope when μa and μ's were held constant throughout the reactive hyperemia protocol versus integrated from a 3-s rolling average. Regardless of approach, group means for peak BFI and BFI slope did not differ. Group means for peak BFI and BFI slope were also similar following ad absurdum analyses, where we simulated supraphysiologic changes in μa/μ's. In both cases, however, we identified individual cases where peak BFI and BFI slope were indeed affected, with this result being driven by relative changes in μa over μ's. Overall, these results provide support for past reports in which μa/μ's were held constant but also advocate for real-time incorporation of μa and μ's moving forward.

NEW & NOTEWORTHY: We investigated how changes in tissue optical properties affect near-infrared diffuse correlation spectroscopy (NIR-DCS)-derived indices of skeletal muscle blood flow (BFI) during physiological provocation. Although accounting for changes in tissue optical properties has little impact on BFI on a group level, individual BFI calculations are indeed impacted by changes in tissue optical properties. NIR-DCS calculations of BFI should therefore account for real-time, physiologically induced changes in tissue optical properties whenever possible.


High Confidence Generalization for Reinforcement Learning
James Kostas, Yash Chandak, Scott Jordan, Georgios Theocharous, Philip Thomas
Thirty-eighth International Conference on Machine Learning (ICML), 2021.

Abstract | pdf | Video

We present several classes of reinforcement learning algorithms that safely generalize to Markov decision processes (MDPs) not seen during training. Specifically, we study the setting in which some set of MDPs is accessible for training. For various definitions of safety, our algorithms give probabilistic guarantees that agents can safely generalize to MDPs that are sampled from the same distribution but are not necessarily in the training set. These algorithms are a type of Seldonian algorithm (Thomas et al., 2019), which is a class of machine learning algorithms that return models with probabilistic safety guarantees for user-specified definitions of safety.



Towards Safe Policy Improvement for Non-Stationary MDPs
Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, Philip Thomas
(Spotlight) Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), 2020.

Abstract | Arxiv | Blogpost | Code | Video

Many real-world sequential decision-making problems involve critical systems that present both human-life and financial risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. Addressing safety in the presence of non-stationarity remains an open question in the literature. We present a type of Seldonian algorithm (Thomas et al., 2019), taking the first steps towards ensuring safety, with high confidence, for smoothly varying non-stationary decision problems, through a synthesis of model-free reinforcement learning algorithms with methods from time-series analysis.


Evaluating the Performance of Reinforcement Learning Algorithms
Scott Jordan, Yash Chandak, Daniel Cohen, Mengxue Zhang, Philip Thomas
Thirty-seventh International Conference on Machine Learning (ICML), 2020.

Abstract | Arxiv | Code | Video

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. Taking a step towards ensuring that reported results are consistent, we propose a new comprehensive evaluation methodology for reinforcement learning algorithms that produces reliable measurements of performance both on a single environment and when aggregated across environments. We demonstrate this method by evaluating a broad class of reinforcement learning algorithms on standard benchmark tasks.
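A minimal sketch of the kind of cross-environment aggregation such a methodology involves. Everything here (environment names, score bounds, the bootstrap procedure) is a hypothetical stand-in for illustration, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical results: scores of one algorithm over repeated trials on two environments.
scores = {
    "CartPole": rng.normal(180.0, 20.0, size=30),
    "Acrobot": rng.normal(-90.0, 15.0, size=30),
}
# Assumed known score ranges per environment, used to put scores on a common scale.
bounds = {"CartPole": (0.0, 200.0), "Acrobot": (-500.0, 0.0)}

def normalized_mean(scores, bounds):
    """Normalize each environment's scores to [0, 1], then average across environments."""
    per_env = []
    for env, s in scores.items():
        lo, hi = bounds[env]
        per_env.append(np.clip((s - lo) / (hi - lo), 0.0, 1.0).mean())
    return float(np.mean(per_env))

def bootstrap_ci(scores, bounds, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval on the aggregate score."""
    stats = []
    for _ in range(n_boot):
        resampled = {env: rng.choice(s, size=len(s), replace=True)
                     for env, s in scores.items()}
        stats.append(normalized_mean(resampled, bounds))
    return (float(np.quantile(stats, alpha / 2)),
            float(np.quantile(stats, 1 - alpha / 2)))

point = normalized_mean(scores, bounds)
lo, hi = bootstrap_ci(scores, bounds)
```

Reporting the interval alongside the point estimate, rather than a single tuned number, is the spirit of the methodology: a measurement of performance with its uncertainty made explicit.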



Learning a Better Negative Sampling Policy with Deep Neural Networks for Search
Daniel Cohen, Scott Jordan, W. Bruce Croft
Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, 2019.
Best Full Paper Award.

Abstract | pdf

In information retrieval, sampling methods used to select documents for neural models must often deal with large class imbalances during training. This issue necessitates careful selection of negative instances when training neural models to avoid the risk of overfitting. In most work, heuristic sampling approaches, or policies, are created from domain expertise, such as choosing samples with high BM25 scores or a random process over candidate documents. However, these sampling approaches are designed with the test distribution in mind. In this paper, we demonstrate that the method chosen to sample negative documents during training plays a critical role in both the stability of training and overall performance. Furthermore, we establish that using reinforcement learning to optimize a policy over a set of sampling functions can significantly improve performance over standard training practices with respect to IR metrics and is robust to hyperparameters and random seeds.


Learning Action Representations for Reinforcement Learning
Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, Philip Thomas
Thirty-sixth International Conference on Machine Learning (ICML), 2019.

Abstract | Arxiv | Video

Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.
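A rough illustration of the decomposition described above, with random stand-ins for the learned components. The linear state map and the embedding table are assumptions for the sketch, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, embed_dim, state_dim = 1000, 4, 8

# Learned action embeddings (random stand-ins here): one row per discrete action.
action_embeddings = rng.normal(size=(n_actions, embed_dim))

# Internal policy component: maps a state to a point in the embedding space.
# A real implementation would be a trained network; a linear map stands in here.
W = rng.normal(size=(embed_dim, state_dim))

def select_action(state):
    """Decomposed action selection: state -> low-dim representation -> nearest action."""
    e = W @ state                                          # action representation
    dists = np.linalg.norm(action_embeddings - e, axis=1)  # distance to each action
    return int(np.argmin(dists))                           # map back to an actual action

state = rng.normal(size=state_dim)
a = select_action(state)
```

Because nearby embeddings correspond to actions with similar outcomes, improving the policy in the small embedding space generalizes across the large discrete action set.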


Evaluating Reinforcement Learning Algorithms Using Cumulative Distributions of Performance
Scott Jordan, Yash Chandak, Mengxue Zhang, Daniel Cohen, Philip Thomas
4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2019.


Classical Policy Gradient: Preserving Bellman’s Principle of Optimality
Philip Thomas, Scott Jordan, Yash Chandak, Chris Nota, James Kostas
Technical Report.

Abstract | Arxiv

We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.



Using Cumulative Distribution Based Performance Analysis to Benchmark Models
Scott Jordan, Daniel Cohen, Philip Thomas
In Critiquing and Correcting Trends in ML Workshop at NeurIPS, 2018.

Abstract | pdf

When using only reported empirical results, it has become difficult to identify machine learning methods that provide meaningful advancement. One reason is that results are commonly only reported using well-tuned models, and thus represent an optimistic evaluation of performance. In this work, we propose a new framework for evaluating algorithms that presents both the performance when the system is well-tuned, as well as the difficulty of tuning the algorithm. This is achieved by considering the distribution of performances that result when applying the method with different hyper-parameter settings (e.g., different step sizes and network structures). Using common benchmark tasks in supervised and reinforcement learning, we demonstrate how this evaluation framework can both evaluate an algorithm’s robustness to hyper-parameter selection and identify new areas of improvement.
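A toy sketch of this distribution-based view. The `run_with_hyperparams` function is invented purely for illustration; it mimics a method whose performance peaks at one step size and degrades elsewhere:

```python
import numpy as np

rng = np.random.default_rng(2)

def run_with_hyperparams(step_size):
    """Hypothetical stand-in for training and scoring a model with one setting."""
    # Performance peaks near step_size = 0.1 and degrades away from it.
    return np.exp(-(np.log10(step_size) + 1.0) ** 2) + rng.normal(0.0, 0.05)

# Sample hyperparameters broadly rather than reporting only the best-tuned run.
step_sizes = 10.0 ** rng.uniform(-4, 0, size=200)
perfs = np.sort([run_with_hyperparams(s) for s in step_sizes])

# Empirical CDF: fraction of hyperparameter settings achieving at most each score.
cdf = np.arange(1, len(perfs) + 1) / len(perfs)

# If many settings reach high performance, the method is easy to tune; if only a
# thin upper tail does, strong reported results reflect careful tuning.
median_perf = perfs[np.searchsorted(cdf, 0.5)]
```

Comparing these curves between algorithms shows both peak performance (the upper tail) and tuning difficulty (how quickly the curve rises), which a single best-run number hides.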


Distributed Evaluations: Ending Neural Point Metrics
Scott Jordan, Daniel Cohen, Philip Thomas
SIGIR 2018 Workshop on Learning from Limited or Noisy Data, 2018.

Abstract | Arxiv

With the rise of neural models across the field of information retrieval, numerous publications have incrementally pushed the envelope of performance for a multitude of IR tasks. However, these networks often sample data in random order, are initialized randomly, and their success is determined by a single evaluation score. These issues are aggravated by neural models achieving incremental improvements from previous neural baselines, leading to multiple near state of the art models that are difficult to reproduce and quickly become deprecated. As neural methods are starting to be incorporated into low resource and noisy collections that further exacerbate this issue, we propose evaluating neural models both over multiple random seeds and a set of hyperparameters within ϵ distance of the chosen configuration for a given metric.
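A minimal sketch of the proposed protocol: score one configuration over several random seeds and over hyperparameters within ϵ of it, then report the resulting distribution rather than a single point metric. The `evaluate` function and its learning-rate response are invented stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

def evaluate(lr, seed):
    """Hypothetical stand-in for training a neural model and returning a metric."""
    local = np.random.default_rng(seed)
    return 0.8 - 10.0 * (lr - 0.01) ** 2 + local.normal(0.0, 0.02)

def distributed_evaluation(lr, eps=0.002, n_seeds=10, n_neighbors=10):
    """Evaluate a configuration over many seeds and nearby hyperparameters,
    yielding a distribution of scores instead of one number."""
    results = []
    for seed in range(n_seeds):
        results.append(evaluate(lr, seed))            # chosen configuration, varied seeds
    for _ in range(n_neighbors):
        lr_nearby = lr + rng.uniform(-eps, eps)       # within eps of the configuration
        results.append(evaluate(lr_nearby, int(rng.integers(1_000_000))))
    return np.array(results)

dist = distributed_evaluation(0.01)
summary = (dist.mean(), dist.std(), len(dist))
```

A model whose score collapses under small seed or hyperparameter perturbations would show a wide, low-shifted distribution here, even if its single best run looks state of the art.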


Learning to Use a Ratchet by Modeling Spatial Relations in Demonstrations
Li Yang Ku, Scott Jordan, Julia Badger, Erik Learned-Miller, Rod A. Grupen
International Symposium on Experimental Robotics (ISER), 2018.

Abstract | pdf

We introduce a framework where visual features, describing the interaction among a robot hand, a tool, and an assembly fixture, can be learned efficiently using a small number of demonstrations. We illustrate the approach by torquing a bolt with the Robonaut-2 humanoid robot using a handheld ratchet. The difficulties include the uncertainty of the ratchet pose after grasping and the high precision required for mating the socket to the bolt and replacing the tool in the tool holder. Our approach learns the desired relative position between visual features on the ratchet and the bolt. It does this by identifying goal offsets from visual features that are consistently observable over a set of demonstrations. With this approach we show that Robonaut-2 is capable of grasping the ratchet, tightening a bolt, and putting the ratchet back into a tool holder. We measure the accuracy of the socket-bolt mating subtask over multiple demonstrations and show that a small set of demonstrations can decrease the error significantly.