Johannes Kirschner

johannes.kirschner@sdsc.ethz.ch

I am a Senior Data Scientist at the Swiss Data Science Center (SDSC). I work on developing principled and practical algorithms for sequential decision-making, and deploying state-of-the-art machine learning techniques in challenging real-world applications. See below for some of my research highlights.

Before joining the SDSC, I completed a PostDoc with Csaba Szepesvári at the University of Alberta (supported by an SNF Early Postdoc.Mobility fellowship). I hold a PhD in machine learning that I completed under the supervision of Andreas Krause at ETH Zurich.

If you are a student looking for a thesis project, I have several projects available here and here.

News

Nov 1, 2023	Two papers accepted at NeurIPS! "Regret Minimization via Saddle Point Optimization", with Alireza Bakhtiari, Kushagra Chandak, Volodymyr Tkachuk, Csaba Szepesvari. "Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off", with Zichen Zhang, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, Dale Schuurmans.
Aug 21, 2023	I am starting a new position as a Sr. Data Scientist at the Swiss Data Science Center in Zurich. Ping me if you would like to catch up!
Aug 13, 2023	Our paper on Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications, together with Tor Lattimore and Andreas Krause, got accepted for publication at JMLR.
May 1, 2023	We are restarting the reinforcement learning online seminars. Join us for exciting talks and engaging discussions!
Jan 20, 2023	Two papers accepted. ICLR 2023 (notable-top-5%): "Near-optimal Policy Identification in Active Reinforcement Learning", together with Xiang Li, Viraj Mehta, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, and Ilija Bogunovic. AISTATS 2023: "Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning", together with Volodymyr Tkachuk, Seyed Alireza Bakhtiari, Matej Jusup, Ilija Bogunovic, and Csaba Szepesvari.
Feb 1, 2022	I am serving as Associate Chair at ICML 2022
Aug 1, 2021	Started my PostDoc at the University of Alberta
May 17, 2021	Successfully defended my PhD thesis!
Dec 15, 2020	I received an SNF Early PostDoc.Mobility Fellowship

Research Highlights

Safe Bayesian Optimization for Particle Accelerators

Together with collaborators at PSI and ETH Zurich, I developed safe data-driven tuning algorithms for particle accelerators. Manually adjusting machine parameters is a recurring and time-consuming task that is required on many accelerators and cuts into valuable time for experiments. A main difficulty is that all adjustments need to respect safety constraints to avoid damaging the machines (or triggering automated shutdown procedures). We successfully deployed our methods on two major experimental facilities at PSI, the High Intensity Proton Accelerator (HIPA) and the Swiss Free Electron Laser (SwissFEL).

Frequentist Analysis of Information-Directed Sampling

In my PhD thesis, I pioneered mathematical foundations of information-directed sampling (IDS), an algorithm design principle proposed by Daniel Russo and Benjamin Van Roy. Together with Tor Lattimore and Andreas Krause, I showed that the algorithm applies much more broadly to linear partial monitoring (and is provably near-optimal in all finite-action settings). More recently, I showed that IDS is also asymptotically optimal (together with Claire Vernade, Tor Lattimore and Csaba Szepesvári). This resolves an open problem in the literature. It is also a remarkable result, because IDS was never explicitly designed for this regime.

Publications

2023

Regret Minimization via Saddle Point Optimization

Johannes Kirschner, Alireza Bakhtiari, Kushagra Chandak, Volodymyr Tkachuk, and Csaba Szepesvari

In Proc. Neural Information Processing Systems (NeurIPS) Dec 2023

@inproceedings{kirschner2023regret,
  title = {Regret Minimization via Saddle Point Optimization},
  author = {Kirschner, Johannes and Bakhtiari, Alireza and Chandak, Kushagra and Tkachuk, Volodymyr and Szepesvari, Csaba},
  booktitle = {Proc. Neural Information Processing Systems (NeurIPS)},
  year = {2023},
  month = dec,
}

Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications

Johannes Kirschner, Tor Lattimore, and Andreas Krause

Journal of Machine Learning Research (JMLR) Dec 2023

Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.

@article{kirschner2023linearpm,
  title = {Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications},
  author = {Kirschner, Johannes and Lattimore, Tor and Krause, Andreas},
  journal = {Journal of Machine Learning Research (JMLR)},
  year = {2023},
}

Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off

Zichen Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, and Dale Schuurmans

In Proc. Neural Information Processing Systems (NeurIPS) Dec 2023

A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.

@inproceedings{zhang2022managing,
  title = {Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off},
  booktitle = {Proc. Neural Information Processing Systems (NeurIPS)},
  author = {Zhang, Zichen and Kirschner, Johannes and Zhang, Junxi and Zanini, Francesco and Ayoub, Alex and Dehghan, Masood and Schuurmans, Dale},
  year = {2023},
  month = dec,
}

Near-optimal Policy Identification in Active Reinforcement Learning

Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, and Ilija Bogunovic

Accepted at ICLR (notable-top-5%) May 2023

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the expensive transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.

@article{li2022near,
  title = {Near-optimal Policy Identification in Active Reinforcement Learning},
  author = {Li, Xiang and Mehta, Viraj and Kirschner, Johannes and Char, Ian and Neiswanger, Willie and Schneider, Jeff and Krause, Andreas and Bogunovic, Ilija},
  journal = {Accepted at ICLR (notable-top-5%)},
  year = {2023},
  month = may,
}

Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning

Volodymyr Tkachuk, Seyed Alireza Bakhtiari, Johannes Kirschner, Matej Jusup, Ilija Bogunovic, and Csaba Szepesvari

Accepted at AISTATS Apr 2023

A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient alg

@article{tkachuk2023efficient,
  author = {Tkachuk, Volodymyr and Bakhtiari, Seyed Alireza and Kirschner, Johannes and Jusup, Matej and Bogunovic, Ilija and Szepesvari, Csaba},
  title = {Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning},
  journal = {Accepted at AISTATS},
  year = {2023},
  month = apr,
}

2022

Tuning particle accelerators with safety constraints using Bayesian optimization

Johannes Kirschner, Mojmir Mutný, Andreas Krause, Jaime Portugal, Nicole Hiller, and Jochem Snuverink

Phys. Rev. Accel. Beams Jun 2022

Tuning machine parameters of particle accelerators is a repetitive and time-consuming task that is challenging to automate. While many off-the-shelf optimization algorithms are available, in practice their use is limited because most methods do not account for safety-critical constraints in each iteration, such as loss signals or step-size limitations. One notable exception is safe Bayesian optimization, which is a data-driven tuning approach for global optimization with noisy feedback. We propose and evaluate a step-size limited variant of safe Bayesian optimization on two research facilities of the Paul Scherrer Institut (PSI): a) the Swiss Free Electron Laser (SwissFEL) and b) the High-Intensity Proton Accelerator (HIPA). We report promising experimental results on both machines, tuning up to 16 parameters subject to 224 constraints.

@article{kirschner2022tuning,
  title = {Tuning particle accelerators with safety constraints using Bayesian optimization},
  author = {Kirschner, Johannes and Mutn\'y, Mojmir and Krause, Andreas and Coello de Portugal, Jaime and Hiller, Nicole and Snuverink, Jochem},
  journal = {Phys. Rev. Accel. Beams},
  volume = {25},
  issue = {6},
  pages = {062802},
  numpages = {14},
  year = {2022},
  month = jun,
  publisher = {American Physical Society},
  doi = {10.1103/PhysRevAccelBeams.25.062802},
  url = {https://link.aps.org/doi/10.1103/PhysRevAccelBeams.25.062802},
}

2021

Information-Directed Sampling — Frequentist Analysis and Applications

Johannes Kirschner

Jun 2021

Sequential decision-making is an iterative process between a learning agent and an environment. We study the stochastic setting, where the learner chooses an action in each round and the environment returns a noisy feedback signal. The learner’s objective is to maximize a reward function that depends on the chosen actions. This basic model has many applications, including adaptive experimental design, product recommendations, dynamic pricing and black-box optimization. The combination of statistical uncertainty and the objective to maximize reward creates a tension between exploration and exploitation: The learner has to carefully balance between actions that provide informative feedback and actions estimated to yield a high reward. The fields of bandit algorithms and partial monitoring study methods to resolve the exploration-exploitation trade-off optimally, using various regularity assumptions on the feedback-reward structure. Two of the most widely used methods are optimistic algorithms and Thompson sampling, which have been successfully applied in numerous settings and come with strong theoretical guarantees. More recently, however, an increasing amount of evidence shows that optimism and Thompson sampling are not universal exploration principles. In structured models with correlated feedback, clearly suboptimal actions sometimes provide informative feedback that outweighs their cost. Meanwhile, optimistic approaches and Thompson sampling discard such actions early on, which leads to inefficient exploration. An alternative and less studied design principle is information-directed sampling (IDS), originally proposed in the Bayesian setting. The main contribution in this thesis is a frequentist interpretation of the IDS framework, complemented with frequentist performance guarantees for several settings with linear reward and feedback structure. 
Using the IDS approach, we resolve the long-standing challenge to find an asymptotically instance-optimal algorithm for linear bandits that is simultaneously minimax optimal. We further extend the IDS approach to the more general linear partial monitoring setting, making the method applicable to a vast range of previously studied models for sequential decision-making. Along the way, we develop the theory of information-directed sampling, uncover a connection to primal-dual methods and propose computationally faster approximations. Lastly, we discuss extensions of the IDS framework to contextual decision-making and the kernelized setting and highlight example applications.

@phdthesis{kirschner2021information,
  title = {Information-Directed Sampling --- Frequentist Analysis and Applications},
  author = {Kirschner, Johannes},
  year = {2021},
  school = {ETH Zurich},
}

Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback

Marc Jourdan, Mojmír Mutný, Johannes Kirschner, and Andreas Krause

In Algorithmic Learning Theory Jun 2021

Combinatorial bandits with semi-bandit feedback generalize multi-armed bandits, where the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set. The action set satisfies a given structure such as forming a base of a matroid or a path in a graph. We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting, where the structure of the answer set differs from the one of the action set. Using the recently popularized game framework, we interpret this problem as a sequential zero-sum game and develop a CombGame meta-algorithm whose instances are asymptotically optimal algorithms with finite time guarantees. In addition to comparing two families of learners to instantiate our meta-algorithm, the main contribution of our work is a specific oracle efficient instance for best-arm identification with combinatorial actions. Based on a projection-free online learning algorithm for convex polytopes, it is the first computationally efficient algorithm which is asymptotically optimal and has competitive empirical performance.

@inproceedings{jourdan2021efficient,
  title = {Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback},
  author = {Jourdan, Marc and Mutn\'{y}, Mojm{\'i}r and Kirschner, Johannes and Krause, Andreas},
  booktitle = {Algorithmic Learning Theory},
  pages = {805--849},
  year = {2021},
  organization = {PMLR},
}

Bias-Robust Bayesian Optimization via Dueling Bandits

Johannes Kirschner and Andreas Krause

In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS) Jul 2021

We consider Bayesian optimization in settings where observations can be adversarially biased, for example by an uncontrolled hidden confounder. Our first contribution is a reduction of the confounded setting to the dueling bandit model. Then we propose a novel approach for dueling bandits based on information-directed sampling (IDS). Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees. Our analysis further generalizes a previously proposed semi-parametric linear bandit model to non-linear reward functions, and uncovers interesting links to doubly-robust estimation.

@inproceedings{kirschner2020dueling,
  title = {Bias-Robust Bayesian Optimization via Dueling Bandits},
  author = {Kirschner, Johannes and Krause, Andreas},
  booktitle = {Proc. International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year = {2021},
  month = jul,
}

Asymptotically Optimal Information-Directed Sampling

Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári

In Proc. International Conference on Learning Theory (COLT) Aug 2021

We introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and (nearly) worst-case optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information and uncovers a surprising connection between the recently proposed primal-dual methods and the IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite-time, and can be significantly better in the asymptotic regime.

@inproceedings{kirschner2020asymptotically,
  title = {Asymptotically Optimal Information-Directed Sampling},
  author = {Kirschner, Johannes and Lattimore, Tor and Vernade, Claire and Szepesv{\'a}ri, Csaba},
  booktitle = {Proc. International Conference on Learning Theory (COLT)},
  year = {2021},
  month = aug,
  author+an = {1=highlight},
}

2020

Distributionally Robust Bayesian Optimization

Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause

In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS) Aug 2020

Robustness to distributional shift is one of the key challenges of contemporary machine learning. Attaining such robustness is the goal of distributionally robust optimization, which seeks a solution to an optimization problem that is worst-case robust under a specified distributional shift of an uncontrolled covariate. In this paper, we study such a problem when the distributional shift is measured via the maximum mean discrepancy (MMD). For the setting of zeroth-order, noisy optimization, we present a novel distributionally robust Bayesian optimization algorithm (DRBO). Our algorithm provably obtains sub-linear robust regret in various settings that differ in how the uncertain covariate is observed. We demonstrate the robust performance of our method on both synthetic and real-world benchmarks.

@inproceedings{kirschner2020distributionally,
  title = {{Distributionally Robust Bayesian Optimization}},
  author = {Kirschner, Johannes and Bogunovic, Ilija and Jegelka, Stefanie and Krause, Andreas},
  booktitle = {Proc. International Conference on Artificial Intelligence and Statistics (AISTATS)},
  month = aug,
  year = {2020},
  author+an = {1=highlight},
}

Information Directed Sampling for Linear Partial Monitoring

Johannes Kirschner, Tor Lattimore, and Andreas Krause

In Proc. International Conference on Learning Theory (COLT) Jul 2020

Partial monitoring is a rich framework for sequential decision making under uncertainty that generalizes many well known bandit models, including linear, combinatorial and dueling bandits. We introduce information directed sampling (IDS) for stochastic partial monitoring with a linear reward and observation structure. IDS achieves adaptive worst-case regret rates that depend on precise observability conditions of the game. Moreover, we prove lower bounds that classify the minimax regret of all finite games into four possible regimes. IDS achieves the optimal rate in all cases up to logarithmic factors, without tuning any hyper-parameters. We further extend our results to the contextual and the kernelized setting, which significantly increases the range of possible applications.

@inproceedings{kirschner2020pm,
  author = {Kirschner, Johannes and Lattimore, Tor and Krause, Andreas},
  booktitle = {Proc. International Conference on Learning Theory (COLT)},
  month = jul,
  title = {Information Directed Sampling for Linear Partial Monitoring},
  year = {2020},
  author+an = {1=highlight},
}

Experimental Design for Optimization of Orthogonal Projection Pursuit Models

Mojmir Mutný, Johannes Kirschner, and Andreas Krause

In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI) Feb 2020

Bayesian optimization and kernelized bandit algorithms are widely used techniques for sequential black box function optimization with applications in parameter tuning, control, robotics among many others. To be effective in high dimensional settings, previous approaches make additional assumptions, for example on low-dimensional subspaces or an additive structure. In this work, we go beyond the additivity assumption and use an orthogonal projection pursuit regression model, which strictly generalizes additive models. We present a two-stage algorithm motivated by experimental design to first decorrelate the additive components. Subsequently, the bandit optimization benefits from the statistically efficient additive model. Our method provably decorrelates the fully additive model and achieves optimal sublinear simple regret in terms of the number of function evaluations. To prove the rotation recovery, we derive novel concentration inequalities for linear regression on subspaces. In addition, we specifically address the issue of acquisition function optimization and present two domain dependent efficient algorithms. We validate the algorithm numerically on synthetic as well as real-world optimization problems.

@inproceedings{mutny2020projectionpursuit,
  author = {Mutn\'{y}, Mojmir and Kirschner, Johannes and Krause, Andreas},
  booktitle = {Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI)},
  month = feb,
  title = {Experimental Design for Optimization of Orthogonal Projection Pursuit Models},
  year = {2020},
  author+an = {2=highlight},
}

2019

Bayesian Optimization for Fast and Safe Parameter Tuning of SwissFEL

Johannes Kirschner, Manuel Nonnenmacher, Mojmir Mutný, Nicole Hiller, Andreas Adelmann, Rasmus Ischebeck, and Andreas Krause

In Proc. International Free-Electron Laser Conference (FEL2019) Jun 2019

Parameter tuning is a notoriously time-consuming task in accelerator facilities. As a tool for global optimization with noisy evaluations, Bayesian optimization was recently shown to outperform alternative methods. By learning a model of the underlying function using all available data, the next evaluation can be chosen carefully to find the optimum with as few steps as possible and without violating any safety constraints. However, the per-step computation time increases significantly with the number of parameters and the generality of the approach can lead to slow convergence on functions that are easier to optimize. To overcome these limitations, we divide the global problem into sequential subproblems that can be solved efficiently using safe Bayesian optimization. This allows us to trade off local and global convergence and to adapt to additional structure in the objective function. Further, we provide slice-plots of the function as user feedback during the optimization. We showcase how we use our algorithm to tune up the FEL output of SwissFEL with up to 40 parameters simultaneously, and reach convergence within reasonable tuning times on the order of 30 minutes (< 2000 steps).

@inproceedings{kirschner2019swissfel2,
  author = {Kirschner, Johannes and Nonnenmacher, Manuel and Mutn\'y, Mojmir and Hiller, Nicole and Adelmann, Andreas and Ischebeck, Rasmus and Krause, Andreas},
  booktitle = {Proc. International Free-Electron Laser Conference (FEL2019)},
  month = jun,
  title = {Bayesian Optimization for Fast and Safe Parameter Tuning of SwissFEL},
  year = {2019},
  author+an = {1=highlight},
}

Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces

Johannes Kirschner, Mojmir Mutný, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause

In Proc. International Conference on Machine Learning (ICML) Jun 2019

Bayesian optimization is known to be difficult to scale to high dimensions, because the acquisition step requires solving a non-convex optimization problem in the same search space. In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems that can be solved efficiently. We show that our algorithm converges globally and obtains a fast local rate when the function is strongly convex. Further, if the objective has an invariant subspace, our method automatically adapts to the effective dimension without changing the algorithm. When combined with the SafeOpt algorithm to solve the sub-problems, we obtain the first safe Bayesian optimization algorithm with theoretical guarantees applicable in high-dimensional settings. We evaluate our method on multiple synthetic benchmarks, where we obtain competitive performance. Further, we deploy our algorithm to optimize the beam intensity of the Swiss Free Electron Laser with up to 40 parameters while satisfying safe operation constraints.

@inproceedings{kirschner2019linebo,
  author = {Kirschner, Johannes and Mutn\'y, Mojmir and Hiller, Nicole and Ischebeck, Rasmus and Krause, Andreas},
  booktitle = {Proc. International Conference on Machine Learning (ICML)},
  month = jun,
  title = {Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces},
  year = {2019},
  author+an = {1=highlight},
}

Information-Directed Exploration for Deep Reinforcement Learning

Nikolay Nikolov, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause

In Proc. International Conference on Learning Representations (ICLR) May 2019

Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.

@inproceedings{nikolov2019information,
  author = {Nikolov, Nikolay and Kirschner, Johannes and Berkenkamp, Felix and Krause, Andreas},
  booktitle = {Proc. International Conference on Learning Representations (ICLR)},
  month = may,
  primaryclass = {cs.LG},
  title = {Information-Directed Exploration for Deep Reinforcement Learning},
  year = {2019},
  author+an = {2=highlight},
}

Stochastic Bandits with Context Distributions

Johannes Kirschner and Andreas Krause

In Proc. Neural Information Processing Systems (NeurIPS) Dec 2019

We introduce a stochastic contextual bandit model where at each time step the environment chooses a distribution over a context set and samples the context from this distribution. The learner observes only the context distribution while the exact context realization remains hidden. This allows for a broad range of applications where the context is stochastic or when the learner needs to predict the context. We leverage the UCB algorithm to this setting and show that it achieves an order-optimal high-probability bound on the cumulative regret for linear and kernelized reward functions. Our results strictly generalize previous work in the sense that both our model and the algorithm reduce to the standard setting when the environment chooses only Dirac delta distributions and therefore provides the exact context to the learner. We further analyze a variant where the learner observes the realized context after choosing the action. Finally, we demonstrate the proposed method on synthetic and real-world datasets.

@inproceedings{kirschner2019context,
  author = {Kirschner, Johannes and Krause, Andreas},
  booktitle = {Proc. Neural Information Processing Systems (NeurIPS)},
  month = dec,
  title = {Stochastic Bandits with Context Distributions},
  year = {2019},
  author+an = {1=highlight},
}

2018

Information Directed Sampling and Bandits with Heteroscedastic Noise

Johannes Kirschner and Andreas Krause

In Proc. International Conference on Learning Theory (COLT) Jul 2018

In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and to satisfy a tail bound uniformly on the domain; a restrictive assumption for many applications. In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret analysis framework, that is similar to the Bayesian framework of Russo and Van Roy (2014), and we prove a new high-probability regret bound for general, possibly randomized policies, which depends on a quantity we refer to as regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize the regret-information ratio over all possible action sampling distributions. This further relies on concentration inequalities for online least squares regression in separable Hilbert spaces, which we generalize to the case of heteroscedastic noise. We then formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding novel algorithms for Bayesian optimization. We also prove frequentist regret bounds, which in the homoscedastic case recover known bounds for UCB, but can be much better when the noise is heteroscedastic. Empirically, we demonstrate in a linear setting with heteroscedastic noise, that some of our methods can outperform UCB and Thompson Sampling, while staying competitive when the noise is homoscedastic.

@inproceedings{kirschner2018heteroscedastic,
  author = {Kirschner, Johannes and Krause, Andreas},
  booktitle = {Proc. International Conference on Learning Theory (COLT)},
  month = jul,
  title = {Information Directed Sampling and Bandits with Heteroscedastic Noise},
  year = {2018},
  author+an = {1=highlight},
}