In this work, we address the challenge of zero-shot generalization (ZSG) in Reinforcement Learning (RL), where agents must adapt to entirely novel environments without additional training. We argue that understanding and utilizing contextual cues, such as the gravity level of the environment, is critical for robust generalization, and we propose to integrate the learning of context representations directly with policy learning. Our algorithm demonstrates improved generalization on various simulated domains, outperforming prior context-learning techniques in zero-shot settings. By jointly learning policy and context, our method acquires behavior-specific context representations, enabling adaptation to unseen environments and marking progress towards reinforcement learning systems that generalize across diverse real-world tasks. Our code and experiments are available at https://github.com/tidiane-camaret/contextual_rl_zero_shot
One-shot World Models Using a Transformer Trained on a Synthetic Prior
A World Model is a compressed spatial and temporal representation of a real world environment that allows one to train an agent or execute planning methods. However, world models are typically trained on observations from the real world environment, and they usually do not enable learning policies for other real environments. We propose One-Shot World Model (OSWM), a transformer world model that is learned in an in-context learning fashion from purely synthetic data sampled from a prior distribution. Our prior is composed of multiple randomly initialized neural networks, where a separate network models the dynamics of each state and reward dimension of a desired target environment. We adopt the supervised learning procedure of Prior-Fitted Networks by masking the next state and reward at random context positions and querying OSWM to make probabilistic predictions based on the remaining transition context. At inference time, OSWM is able to quickly adapt to the dynamics of a simple grid world, the CartPole gym environment, and a custom control environment from only 1k transition steps provided as context, and can then successfully train environment-solving agent policies. However, transferring to more complex environments currently remains a challenge. Despite these limitations, we see this work as an important stepping stone in the pursuit of learning world models purely from synthetic data.
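As a rough illustration of the synthetic prior (not the paper's implementation; network sizes, the random behavior policy, and all hyperparameters below are assumptions), the sketch samples transition data from randomly initialized networks, one per next-state and reward dimension, which could then serve as masked training context for a PFN-style transformer.

```python
# Illustrative sketch: sampling synthetic transitions from a prior of randomly
# initialized MLPs, one network per next-state dimension plus one for the reward.
import numpy as np

def random_mlp(in_dim, hidden=32, rng=None):
    """Return a tiny randomly initialized MLP mapping (state, action) -> scalar."""
    rng = rng or np.random.default_rng()
    w1 = rng.normal(0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
    w2 = rng.normal(0, 1.0 / np.sqrt(hidden), (hidden, 1))
    return lambda x: np.tanh(x @ w1) @ w2

def sample_synthetic_episode(state_dim=4, action_dim=1, length=1000, seed=0):
    """Roll out a synthetic 'environment' drawn from the prior and return transitions."""
    rng = np.random.default_rng(seed)
    nets = [random_mlp(state_dim + action_dim, rng=rng) for _ in range(state_dim + 1)]
    s = rng.uniform(-1, 1, state_dim)
    transitions = []
    for _ in range(length):
        a = rng.uniform(-1, 1, action_dim)                 # random behavior policy
        x = np.concatenate([s, a])
        s_next = np.array([nets[i](x).item() for i in range(state_dim)])
        r = nets[-1](x).item()                             # last network models the reward
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

context = sample_synthetic_episode()  # would serve as (partially masked) context for the transformer
```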
CANDID DAC: Leveraging Coupled Action Dimensions with Importance Differences in DAC
Philipp Bordne, M. Asif Hasan, Eddie Bergman, Noor Awad, and André Biedenkapp
In Proceedings of the Third International Conference on Automated Machine Learning (AutoML 2024), Workshop Track, 2024
High-dimensional action spaces remain a challenge for dynamic algorithm configuration (DAC). Interdependencies and varying importance between action dimensions are further known key characteristics of DAC problems. We argue that these Coupled Action Dimensions with Importance Differences (CANDID) represent aspects of the DAC problem that are not yet fully explored. To address this gap, we introduce a new white-box benchmark within the DACBench suite that simulates the properties of CANDID. Further, we propose sequential policies as an effective strategy for managing these properties. Such policies factorize the action space and mitigate exponential growth by learning a policy per action dimension. At the same time, these policies accommodate the interdependence of action dimensions by fostering implicit coordination. We show this in an experimental study of value-based policies on our new benchmark. This study demonstrates that sequential policies significantly outperform independent learning of factorized policies in CANDID action spaces. In addition, they overcome the scalability limitations associated with learning a single policy across all action dimensions. The code used for our experiments is available under https://github.com/PhilippBordne/candidDAC.
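To make the idea of sequential, factorized policies concrete, here is a minimal tabular sketch (our own illustration, not the paper's code; the conditioning scheme and hyperparameters are simplifying assumptions): each action dimension has its own Q-function, and later dimensions condition on the choices already made, which is what enables implicit coordination.

```python
# Minimal sketch of a sequential factorized policy: one Q-table per action
# dimension, each conditioned on the state and on the actions chosen so far.
import numpy as np
from collections import defaultdict

class SequentialQPolicy:
    def __init__(self, n_dims, n_actions_per_dim, eps=0.1, lr=0.1, gamma=0.99):
        self.n_dims, self.n_actions, self.eps, self.lr, self.gamma = n_dims, n_actions_per_dim, eps, lr, gamma
        # One table per dimension; keys are (state, actions chosen for earlier dimensions).
        self.q = [defaultdict(lambda: np.zeros(n_actions_per_dim)) for _ in range(n_dims)]

    def act(self, state, rng):
        chosen = []
        for d in range(self.n_dims):
            key = (state, tuple(chosen))
            a = rng.integers(self.n_actions) if rng.random() < self.eps else int(np.argmax(self.q[d][key]))
            chosen.append(int(a))          # implicit coordination: later dims see earlier choices
        return tuple(chosen)

    def update(self, state, action, reward, next_state, rng):
        # Every per-dimension learner receives the same joint reward; bootstrapping
        # conditions on the prefix of a freshly selected next action.
        next_action = self.act(next_state, rng)
        for d in range(self.n_dims):
            key, next_key = (state, action[:d]), (next_state, next_action[:d])
            target = reward + self.gamma * np.max(self.q[d][next_key])
            self.q[d][key][action[d]] += self.lr * (target - self.q[d][key][action[d]])
```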
Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization
Sai Prasanna*, Karim Farid*, Raghu Rajan, and André Biedenkapp
Reinforcement Learning Journal, 3, pp. 1317–1350, 2024. *Joint first authorship. Note: To also be presented at the Seventeenth European Workshop on Reinforcement Learning (EWRL 2024).
Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system’s dynamics, such as the mass or dimensions of a robot, without making further simplifying assumptions about the observability of the Markovian state. Toward the goal of ZSG to unseen variation in context, we propose the contextual recurrent state-space model (cRSSM), which introduces changes to the world model of the Dreamer (v3) (Hafner et al., 2023). This allows the world model to incorporate context for inferring latent Markovian states from the observations and modeling the latent dynamics. Our experiments show that such systematic incorporation of the context improves the ZSG of the policies trained on the “dreams” of the world model. We further find qualitatively that our approach allows Dreamer to disentangle the latent state from context, allowing it to extrapolate its dreams to the many worlds of unseen contexts. The code for all our experiments is available at https://github.com/sai-prasanna/dreaming_of_many_worlds.
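A condensed sketch of where the context enters such a world model (the single-cell structure, dimensions, and activation choices below are our assumptions rather than the cRSSM architecture itself): the context is fed both into the latent transition and into the observation decoder.

```python
# Sketch of a context-conditioned recurrent latent dynamics cell.
import torch
import torch.nn as nn

class ContextualRSSMCell(nn.Module):
    def __init__(self, latent_dim=32, action_dim=4, context_dim=2, obs_dim=16, hidden=128):
        super().__init__()
        self.gru = nn.GRUCell(latent_dim + action_dim + context_dim, hidden)
        self.prior = nn.Linear(hidden, 2 * latent_dim)              # mean and log-std of p(z_t | h_t, c)
        self.posterior = nn.Linear(hidden + obs_dim, 2 * latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + hidden + context_dim, hidden),
                                     nn.ELU(), nn.Linear(hidden, obs_dim))

    def forward(self, z, a, c, h, obs=None):
        h = self.gru(torch.cat([z, a, c], dim=-1), h)               # deterministic path sees the context
        mean, log_std = self.prior(h).chunk(2, dim=-1)
        if obs is not None:                                         # posterior when an observation is available
            mean, log_std = self.posterior(torch.cat([h, obs], dim=-1)).chunk(2, dim=-1)
        z_next = mean + log_std.exp() * torch.randn_like(mean)
        recon = self.decoder(torch.cat([z_next, h, c], dim=-1))     # decoder is also conditioned on context
        return z_next, h, recon
```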
HPO-RL-Bench: A Zero-Cost Benchmark for HPO in Reinforcement Learning
In Proceedings of the Third International Conference on Automated Machine Learning (AutoML’24), ABCD Track, 2024. 🏅 Awarded runner-up for the best paper award.
Despite the undeniable importance of optimizing the hyperparameters of RL algorithms, existing state-of-the-art Hyperparameter Optimization (HPO) techniques are not frequently utilized by RL researchers. To catalyze HPO research in RL, we present a new large-scale benchmark that includes pre-computed reward curve evaluations of hyperparameter configurations for six established RL algorithms (PPO, DDPG, A2C, SAC, TD3, DQN) on 22 environments (Atari, Mujoco, Control), repeated for multiple seeds. We exhaustively computed the reward curves of all possible combinations of hyperparameters for the considered hyperparameter spaces for each RL algorithm in each environment. As a result, our benchmark permits zero-cost experiments for deploying and comparing new HPO methods. In addition, the benchmark offers a set of integrated HPO methods, enabling plug-and-play tuning of the hyperparameters of new RL algorithms, while pre-computed evaluations allow a zero-cost comparison of a new RL algorithm against the tuned RL baselines in our benchmark.
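The zero-cost usage pattern can be pictured as a simple lookup (illustrative only; the keys, values, and the `evaluate` helper below are made up and do not reflect the benchmark's actual API):

```python
# Illustration of the zero-cost lookup idea: reward curves are precomputed per
# (algorithm, environment, configuration, seed), so an HPO method "evaluates"
# a configuration by a dictionary lookup instead of training an agent.
precomputed = {
    ("PPO", "CartPole-v1", ("gamma=0.99", "lr=3e-4"), 0): [21.0, 55.3, 130.2, 200.0],
    ("PPO", "CartPole-v1", ("gamma=0.99", "lr=1e-3"), 0): [18.4, 40.1, 90.7, 160.5],
}

def evaluate(algorithm, env, config, seed, budget=None):
    """Zero-cost evaluation: return the (possibly truncated) precomputed reward curve."""
    curve = precomputed[(algorithm, env, tuple(sorted(config)), seed)]
    return curve if budget is None else curve[:budget]

# A black-box HPO method would only look at the final return:
final_return = evaluate("PPO", "CartPole-v1", ["lr=3e-4", "gamma=0.99"], seed=0)[-1]
```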
Hierarchical Transformers are Efficient Meta-Reinforcement Learners
We introduce Hierarchical Transformers for Meta-Reinforcement Learning (HTrMRL), a powerful online meta-reinforcement learning approach. HTrMRL aims to address the challenge of enabling reinforcement learning agents to perform effectively in previously unseen tasks. We demonstrate how past episodes serve as a rich source of information, which our model effectively distills and applies to new contexts. Our learned algorithm is capable of outperforming the previous state-of-the-art and provides more efficient meta-training while significantly improving generalization capabilities. Experimental results obtained across various simulated tasks of the Meta-World Benchmark indicate a significant improvement in learning efficiency and adaptability compared to the state-of-the-art. Our approach not only enhances the agent’s ability to generalize from limited data but also paves the way for more robust and versatile AI systems.
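A compressed sketch of a hierarchical episode encoder in this spirit (the concrete architecture, pooling, and dimensions are our assumptions rather than the HTrMRL design): a lower transformer summarizes the transitions within each past episode, and an upper transformer aggregates the episode summaries into a task representation.

```python
# Sketch of a two-level transformer over past experience.
import torch
import torch.nn as nn

class HierarchicalEpisodeEncoder(nn.Module):
    def __init__(self, transition_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(transition_dim, d_model)
        self.episode_level = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.task_level = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)

    def forward(self, episodes):
        # episodes: (n_episodes, episode_len, transition_dim) built from (s, a, r, s') tuples
        summaries = self.episode_level(self.embed(episodes)).mean(dim=1)   # one vector per episode
        task_repr = self.task_level(summaries.unsqueeze(0)).mean(dim=1)    # aggregate across episodes
        return task_repr                                                   # conditions the meta-RL policy

encoder = HierarchicalEpisodeEncoder(transition_dim=10)
task_repr = encoder(torch.randn(5, 50, 10))   # 5 past episodes of 50 transitions each
```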
2023
MDP Playground: An Analysis and Debug Testbed for Reinforcement Learning
Raghu Rajan, Jessica Lizeth Borja Diaz, Suresh Guttikonda, Fabio Ferreira, André Biedenkapp, Jan Ole Hartz, and Frank Hutter
Journal of Artificial Intelligence Research (JAIR), 77, pp. 821–890, 2023
We present MDP Playground, a testbed for Reinforcement Learning (RL) agents with dimensions of hardness that can be controlled independently to challenge agents in different ways and obtain varying degrees of hardness in toy and complex RL environments. We consider and allow control over a wide variety of dimensions, including delayed rewards, sequence lengths, reward density, stochasticity, image representations, irrelevant features, time unit, action range and more. We define a parameterised collection of fast-to-run toy environments in OpenAI Gym by varying these dimensions and propose to use these to understand agents better. We then show how to design experiments using MDP Playground to gain insights on the toy environments. We also provide wrappers that can inject many of these dimensions into any Gym environment. We experiment with these wrappers on Atari and Mujoco to allow for understanding the effects of these dimensions on environments that are more complex than the toy environments. We also compare the effect of the dimensions on the toy and complex environments. Finally, we show how to use MDP Playground to debug agents, to study the interaction of multiple dimensions and describe further use-cases.
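For a flavor of what such a dimension-injecting wrapper looks like, here is a toy delayed-reward wrapper (our own minimal example using the Gymnasium API, not MDP Playground's wrapper; rewards still buffered at episode end are simply dropped in this simplification):

```python
# Toy sketch of the "reward delay" dimension as a Gym-style wrapper.
import gymnasium as gym
from collections import deque

class DelayRewardWrapper(gym.Wrapper):
    """Hold back rewards for `delay` steps before handing them to the agent."""
    def __init__(self, env, delay=3):
        super().__init__(env)
        self.delay = delay
        self.buffer = deque()

    def reset(self, **kwargs):
        self.buffer.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.buffer.append(reward)
        # Only release a reward once it is `delay` steps old.
        delayed = self.buffer.popleft() if len(self.buffer) > self.delay else 0.0
        return obs, delayed, terminated, truncated, info

env = DelayRewardWrapper(gym.make("CartPole-v1"), delay=3)
```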
Contextualize Me – The Case for Context in Reinforcement Learning
While Reinforcement Learning (RL) has made great strides towards solving increasingly complicated problems, many algorithms are still brittle to even slight environmental changes. Contextual Reinforcement Learning (cRL) provides a framework to model such changes in a principled manner, thereby enabling flexible, precise and interpretable task specification and generation. Our goal is to show how the framework of cRL contributes to improving zero-shot generalization in RL through meaningful benchmarks and structured reasoning about generalization tasks. We confirm the insight that optimal behavior in cRL requires context information, as in other related areas of partial observability. To empirically validate this in the cRL framework, we provide various context-extended versions of common RL environments. They are part of the first benchmark library, CARL, designed for generalization based on cRL extensions of popular benchmarks, which we propose as a testbed to further study general agents. We show that in the contextual setting, even simple RL environments become challenging - and that naive solutions are not enough to generalize across complex context spaces.
Gray-Box Gaussian Processes for Automated Reinforcement Learning
Despite having achieved spectacular milestones in an array of important real-world applications, most Reinforcement Learning (RL) methods are very brittle concerning their hyperparameters. Notwithstanding the crucial importance of setting the hyperparameters in training state-of-the-art agents, the task of hyperparameter optimization (HPO) in RL is understudied. In this paper, we propose a novel gray-box Bayesian Optimization technique for HPO in RL, that enriches Gaussian Processes with reward curve estimations based on generalized logistic functions. In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), dozens of environments (Atari, Mujoco), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.
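The core gray-box ingredient can be pictured as follows (a simplified sketch under our own assumptions about the parameterization; the paper's integration into the Gaussian process is more involved): fit a generalized logistic curve to a partial reward curve and extrapolate its upper asymptote.

```python
# Sketch: extrapolate the final reward of a configuration from a partial curve.
import numpy as np
from scipy.optimize import curve_fit

def generalized_logistic(t, A, K, B, M, nu):
    """Richards / generalized logistic curve: lower asymptote A, upper asymptote K."""
    return A + (K - A) / (1.0 + np.exp(-B * (t - M))) ** (1.0 / nu)

def extrapolate_final_reward(steps, rewards):
    p0 = [min(rewards), max(rewards), 0.1, np.median(steps), 1.0]
    params, _ = curve_fit(generalized_logistic, steps, rewards, p0=p0, maxfev=10_000)
    return params[1]  # estimated upper asymptote, i.e. the predicted final reward

rng = np.random.default_rng(0)
steps = np.arange(1, 21)
rewards = 200 / (1 + np.exp(-0.5 * (steps - 8))) + rng.normal(0, 5, 20)
print(extrapolate_final_reward(steps, rewards))
```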
The performance of an algorithm often critically depends on its parameter configuration. While a variety of automated algorithm configuration methods have been proposed to relieve users from the tedious and error-prone task of manually tuning parameters, there is still a lot of untapped potential as the learned configuration is static, i.e., parameter settings remain fixed throughout the run. However, it has been shown that some algorithm parameters are best adjusted dynamically during execution, e.g., to adapt to the current part of the optimization landscape. Thus far, this is most commonly achieved through hand-crafted heuristics. A promising recent alternative is to automatically learn such dynamic parameter adaptation policies from data. In this article, we give the first comprehensive account of this new field of automated dynamic algorithm configuration (DAC), present a series of recent advances, and provide a solid foundation for future research in this field. Specifically, we (i) situate DAC in the broader historical context of AI research; (ii) formalize DAC as a computational problem; (iii) identify the methods used in prior art to tackle this problem; (iv) conduct empirical case studies for using DAC in evolutionary optimization, AI planning, and machine learning.
Dynamic Algorithm Configuration by Reinforcement Learning
André Biedenkapp
PhD thesis, University of Freiburg, Department of Computer Science, Machine Learning Chair, 2022. Note: Passed with Summa Cum Laude (best possible grade).
The performance of algorithms, be it in the domain of machine learning, hard combinatorial problem solving or AI in general, depends on their many parameters. Tuning an algorithm manually, however, is error-prone and very time-consuming. Many, if not most, algorithms are iterative in nature. Thus, they traverse a potentially diverse solution space, which might require different parameter settings at different stages to behave optimally. Further, algorithms are often used for solving a diverse set of problem instances, which by themselves might require different parameters. Taking all of this into account is infeasible for a human designer. Automated methods have therefore been proposed to mitigate human errors and minimize manual efforts. While such meta-algorithmic methods have shown large successes, there is still a lot of untapped potential as prior approaches typically only consider configurations that do not change during an algorithm’s run or do not adapt to the problem instance.
In this dissertation, we present the first framework that is capable of dynamically configuring algorithms, in other words, capable of adapting configurations to the problem instance at hand during an algorithm’s solving process. To this end, we model the dynamic algorithm configuration (DAC) problem as a contextual Markov decision process. This enables us to learn dynamic configuration policies in a data-driven way by means of reinforcement learning.
We empirically demonstrate the effectiveness of our framework on a diverse set of problem settings consisting of artificial benchmarks, evolutionary algorithms, AI planning systems, as well as deep learning. We show that DAC outperforms previous meta-algorithmic approaches. Building on these successes, we formulate the first standardized interface for dynamic configuration and an extensive benchmark to facilitate reproducibility and lower the barrier of entry for new researchers into this novel research field. Lastly, our work on DAC feeds back into the reinforcement learning paradigm. Through the lens of DAC, we identify shortcomings in current state-of-the-art approaches and demonstrate how to solve these. In particular, intending to learn general policies for DAC, our work pushes the boundaries of generalization in reinforcement learning. We demonstrate how to efficiently incorporate domain knowledge when training general agents and propose to move from a reactive way of doing reinforcement learning to a proactive way by learning when to make new decisions.
It is well established that Reinforcement Learning (RL) is very brittle and sensitive to the choice of hyperparameters. This prevents RL methods from being usable out of the box. The field of automated RL (AutoRL) aims at automatically configuring the RL pipeline, to both make RL usable by a broader audience, as well as reveal its full potential. Still, there has been little progress towards this goal, as new AutoRL methods are often evaluated with incompatible experimental protocols. Furthermore, the typically high cost of experimentation prevents a thorough and meaningful comparison of different AutoRL methods or established hyperparameter optimization (HPO) methods from the automated Machine Learning (AutoML) community. To alleviate these issues, we propose the first tabular AutoRL Benchmark for studying the hyperparameters of RL algorithms. We consider the hyperparameter search spaces of five well-established RL methods (PPO, DDPG, A2C, SAC, TD3) across 22 environments for which we compute and provide the reward curves. This enables HPO methods to simply query our benchmark as a lookup table, instead of actually training agents. Thus, our benchmark offers a testbed for very fast, fair, and reproducible experimental protocols for comparing future black-box, gray-box, and online HPO methods for RL.
Gray-Box Gaussian Processes for Automated Reinforcement Learning
Despite having achieved spectacular milestones in an array of important real-world applications, most Reinforcement Learning (RL) methods are very brittle concerning their hyperparameters. Notwithstanding the crucial importance of setting the hyperparameters in training state-of-the-art agents, the task of hyperparameter optimization (HPO) in RL is understudied. In this paper, we propose a novel gray-box Bayesian Optimization technique for HPO in RL, that enriches Gaussian Processes with reward curve estimations based on generalized logistic functions. This allows us to reason about the performance of learning algorithms, transferring information both across configurations and across epochs of the learning algorithm. In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: Mujoco, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.
DeepCAVE: An Interactive Analysis Tool for Automated Machine Learning
Automated Machine Learning (AutoML) is used more than ever before to support users in determining efficient hyperparameters, neural architectures, or even full machine learning pipelines. However, users tend to mistrust the optimization process and its results due to a lack of transparency, making manual tuning still widespread. We introduce DeepCAVE, an interactive framework to analyze and monitor state-of-the-art optimization procedures for AutoML easily and ad hoc. By aiming for full and accessible transparency, DeepCAVE builds a bridge between users and AutoML and contributes to establishing trust. Our framework’s modular and easy-to-extend nature provides users with automatically generated text, tables, and graphic visualizations. We show the value of DeepCAVE in an exemplary use-case of outlier detection, in which our framework makes it easy to identify problems, compare multiple runs and interpret optimization processes. The package is freely available on GitHub.
Learning Domain-Independent Policies for Open List Selection
Since its proposal over a decade ago, LAMA has been considered one of the best-performing satisficing classical planners. Its key component is heuristic search with multiple open lists, each using a different heuristic function to order states. Even with a very simple, ad-hoc policy for open list selection, LAMA achieves state-of-the-art results. In this paper, we propose to use dynamic algorithm configuration to learn such policies in a principled and data-driven manner. On the learning side, we show how to train a reinforcement learning agent over several heterogeneous environments, aiming at zero-shot generalization to new related domains. On the planning side, our experimental results show that the trained policies often reach the performance of LAMA, and sometimes even perform better. Furthermore, our analysis of different policies shows that prioritizing states reached via preferred operators is crucial, explaining the strong performance of LAMA.
Theory-inspired Parameter Control Benchmarks for Dynamic Algorithm Configuration
In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’22), pp. 766–775, 2022. *Joint first authorship. 🏅 Won the best paper award in the GECH track.
It has long been observed that the performance of evolutionary algorithms and other randomized search heuristics can benefit from a non-static choice of the parameters that steer their optimization behavior. Mechanisms that identify suitable configurations on the fly ("parameter control") or via a dedicated training process ("dynamic algorithm configuration") are therefore an important component of modern evolutionary computation frameworks. Several approaches to address the dynamic parameter setting problem exist, but we barely understand which ones to prefer for which applications. As in classical benchmarking, problem collections with a known ground truth can offer very meaningful insights in this context. Unfortunately, settings with well-understood control policies are very rare. One of the few exceptions for which we know which parameter settings minimize the expected runtime is the LeadingOnes problem. We extend this benchmark by analyzing optimal control policies that can select the parameters only from a given portfolio of possible values. This also allows us to compute optimal parameter portfolios of a given size. We demonstrate the usefulness of our benchmarks by analyzing the behavior of the DDQN reinforcement learning approach for dynamic algorithm configuration.
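For intuition, the portfolio-restricted optimal policy and its expected runtime can be computed in a few lines (our reconstruction under standard assumptions for this setting: an RLS variant that flips exactly r uniformly chosen distinct bits, and each fitness level being reached with probability 1/2 in the usual analysis):

```python
# Sketch: on LeadingOnes, a solution with fitness i improves iff bit i+1 is flipped
# and the first i bits are untouched, so q(r, i) = C(n-i-1, r-1) / C(n, r).
# The portfolio-optimal policy picks, per fitness level, the r maximizing q.
from math import comb

def improvement_prob(r, i, n):
    if r - 1 > n - i - 1:
        return 0.0
    return comb(n - i - 1, r - 1) / comb(n, r)

def optimal_portfolio_policy(portfolio, n):
    """Best radius from the portfolio per fitness level, plus the expected runtime."""
    policy, expected_runtime = {}, 0.0
    for i in range(n):
        best_r = max(portfolio, key=lambda r: improvement_prob(r, i, n))
        policy[i] = best_r
        expected_runtime += 0.5 / improvement_prob(best_r, i, n)
    return policy, expected_runtime

policy, runtime = optimal_portfolio_policy(portfolio=[1, 2, 4, 8], n=50)
print(policy[0], policy[40], runtime)   # large flips early, single-bit flips late
```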
Automated Reinforcement Learning (AutoRL): A Survey and Open Problems
The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems, while also limiting its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also includes additional challenges unique to RL, that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey, we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail and pose open problems which would be of interest to researchers going forward.
Contextualize Me – The Case for Context in Reinforcement Learning
While Reinforcement Learning (RL) has made great strides towards solving increasingly complicated problems, many algorithms are still brittle to even slight changes in environments. Contextual Reinforcement Learning (cRL) provides a theoretical framework to model such changes in a principled manner, thereby enabling flexible, precise and interpretable task specification and generation. Thus, cRL is an important formalization for studying generalization in RL. In this work, we reason about solving cRL in theory and practice. We show that theoretically optimal behavior in contextual Markov Decision Processes requires explicit context information. In addition, we empirically explore context-based task generation, utilizing context information in training and propose cGate, our state-modulating policy architecture. To this end, we introduce the first benchmark library designed for generalization based on cRL extensions of popular benchmarks, CARL. In short: Context matters!
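A minimal sketch of a state-modulating, context-gated policy head in this spirit (layer sizes and the sigmoid gate are our simplifications, not the exact cGate architecture): a learned context embedding gates the state features elementwise before the action head.

```python
# Sketch of a context-gated policy head.
import torch
import torch.nn as nn

class ContextGatedPolicy(nn.Module):
    def __init__(self, state_dim, context_dim, action_dim, hidden=64):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.context_gate = nn.Sequential(nn.Linear(context_dim, hidden), nn.Sigmoid())
        self.head = nn.Linear(hidden, action_dim)

    def forward(self, state, context):
        # Elementwise modulation of state features by the context embedding.
        return self.head(self.state_enc(state) * self.context_gate(context))

policy = ContextGatedPolicy(state_dim=4, context_dim=2, action_dim=2)
logits = policy(torch.randn(8, 4), torch.randn(8, 2))
```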
SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3.
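A minimal usage sketch (written against the SMAC3 2.x style API; exact class and method names may differ between versions, so treat this as an approximation rather than authoritative documentation): minimize a toy objective over a single float hyperparameter.

```python
# Sketch of a basic SMAC3 run (API assumed to follow the 2.x facades).
from ConfigSpace import ConfigurationSpace, Float
from smac import HyperparameterOptimizationFacade, Scenario

def objective(config, seed: int = 0) -> float:
    x = config["x"]
    return (x - 2.0) ** 2          # SMAC minimizes the returned cost

cs = ConfigurationSpace()
cs.add_hyperparameter(Float("x", (-5.0, 5.0)))

scenario = Scenario(cs, n_trials=50)
smac = HyperparameterOptimizationFacade(scenario, objective)
incumbent = smac.optimize()
print(incumbent)
```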
2021
CARL: A Benchmark for Contextual and Adaptive Reinforcement Learning
While Reinforcement Learning has made great strides towards solving ever more complicated tasks, many algorithms are still brittle to even slight changes in their environment. This is a limiting factor for real-world applications of RL. Although the research community continuously aims at improving both robustness and generalization of RL algorithms, it unfortunately still lacks an open-source set of well-defined benchmark problems based on a consistent theoretical framework, which allows comparing different approaches in a fair, reliable and reproducible way. To fill this gap, we propose CARL, a collection of well-known RL environments extended to contextual RL problems to study generalization. We show the urgent need for such benchmarks by demonstrating that even simple toy environments become challenging for commonly used approaches if different contextual instances of this task have to be considered. Furthermore, CARL allows us to provide first evidence that disentangling representation learning of the states from the policy learning with the context facilitates better generalization. By providing variations of diverse benchmarks from classic control, physical simulations, games and a real-world application of RNA design, CARL will allow the community to derive many more such insights on a solid empirical foundation.
DACBench: A Benchmark Library for Dynamic Algorithm Configuration
Dynamic Algorithm Configuration (DAC) aims to dynamically control a target algorithm’s hyperparameters in order to improve its performance. Several theoretical and empirical results have demonstrated the benefits of dynamically controlling hyperparameters in domains like evolutionary computation, AI Planning or deep learning. Replicating these results, as well as studying new methods for DAC, however, is difficult since existing benchmarks are often specialized and do not share common interfaces. To facilitate benchmarking and thus research on DAC, we propose DACBench, a benchmark library that seeks to collect and standardize existing DAC benchmarks from different AI domains, as well as provide a template for new ones. For the design of DACBench, we focused on important desiderata, such as (i) flexibility, (ii) reproducibility, (iii) extensibility and (iv) automatic documentation and visualization. To show the potential, broad applicability and challenges of DAC, we explore how a set of six initial benchmarks compare in several dimensions of difficulty.
Learning Heuristic Selection with Dynamic Algorithm Configuration
In Proceedings of the Thirty-First International Conference on Automated Planning and Scheduling (ICAPS 2021), pp. 597–605, 2021. *Joint first authorship.
A key challenge in satisficing planning is to use multiple heuristics within one heuristic search. An aggregation of multiple heuristic estimates, for example by taking the maximum, has the disadvantage that bad estimates of a single heuristic can negatively affect the whole search. Since the performance of a heuristic varies from instance to instance, approaches such as algorithm selection can be successfully applied. In addition, alternating between multiple heuristics during the search makes it possible to use all heuristics equally and improve performance. However, all these approaches ignore the internal search dynamics of a planning system, which can help to select the most useful heuristics for the current expansion step. We show that dynamic algorithm configuration can be used for dynamic heuristic selection which takes into account the internal search dynamics of a planning system. Furthermore, we prove that this approach generalizes over existing approaches and that it can exponentially improve the performance of the heuristic search. To learn dynamic heuristic selection, we propose an approach based on reinforcement learning and show empirically that domain-wise learned policies, which take the internal search dynamics of a planning system into account, can exceed existing approaches.
Reinforcement learning is a powerful approach to learn behaviour through interactions with an environment. However, behaviours are usually learned in a purely reactive fashion, where an appropriate action is selected based on an observation. In this form, it is challenging to learn when it is necessary to execute new decisions. This makes learning inefficient, especially in environments that need various degrees of fine and coarse control. To address this, we propose a proactive setting in which the agent not only selects an action in a state but also for how long to commit to that action. Our TempoRL approach introduces skip connections between states and learns a skip-policy for repeating the same action along these skips. We demonstrate the effectiveness of TempoRL on a variety of traditional and deep RL environments, showing that our approach is capable of learning successful policies up to an order of magnitude faster than vanilla Q-learning.
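A tabular caricature of the idea (heavily simplified relative to TempoRL; the `env.step(state, action)` interface and all hyperparameters are placeholders): one Q-function chooses the action, a second chooses how many steps to repeat it, and the repetition is credited with the discounted reward collected along the skip.

```python
# Simplified tabular sketch of action repetition with a learned skip length.
import numpy as np
from collections import defaultdict

class SkipQAgent:
    def __init__(self, n_actions, max_skip=8, eps=0.1, lr=0.1, gamma=0.99):
        self.q = defaultdict(lambda: np.zeros(n_actions))
        self.q_skip = defaultdict(lambda: np.zeros(max_skip))   # indexed by (state, action)
        self.n_actions, self.max_skip, self.eps, self.lr, self.gamma = n_actions, max_skip, eps, lr, gamma

    def act(self, s, rng):
        a = rng.integers(self.n_actions) if rng.random() < self.eps else int(np.argmax(self.q[s]))
        j = rng.integers(self.max_skip) if rng.random() < self.eps else int(np.argmax(self.q_skip[(s, a)]))
        return a, j + 1                                          # repeat the action j+1 times

    def play_step(self, env, s, rng):
        a, skip = self.act(s, rng)
        discounted, discount, s_t = 0.0, 1.0, s
        for _ in range(skip):
            s_next, r, done = env.step(s_t, a)                   # placeholder env interface
            # Flat one-step update at every intermediate state along the skip.
            self.q[s_t][a] += self.lr * (r + self.gamma * np.max(self.q[s_next]) - self.q[s_t][a])
            discounted += discount * r
            discount *= self.gamma
            s_t = s_next
            if done:
                break
        # Credit the chosen skip length with the return collected along the skip.
        target = discounted + discount * np.max(self.q[s_t])
        self.q_skip[(s, a)][skip - 1] += self.lr * (target - self.q_skip[(s, a)][skip - 1])
        return s_t, done
```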
Self-Paced Context Evaluations for Contextual Reinforcement Learning
Reinforcement learning (RL) has made a lot of advances for solving a single problem in a given environment; but learning policies that generalize to unseen variations of a problem remains challenging. To improve sample efficiency for learning on such instances of a problem domain, we present Self-Paced Context Evaluation (SPaCE). Based on self-paced learning, SPaCE automatically generates instance curricula online with little computational overhead. To this end, SPaCE leverages information contained in state values during training to accelerate and improve training performance as well as generalization capabilities to new instances from the same problem domain. Nevertheless, SPaCE is independent of the problem domain at hand and can be applied on top of any RL agent with state-value function approximation. We demonstrate SPaCE’s ability to speed up learning of different value-based RL agents on two environments, showing better generalization capabilities and up to 10x faster learning compared to naive approaches such as round robin, as well as to SPDRL as the closest state-of-the-art approach.
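To sketch the flavor of such a curriculum (a simplification of SPaCE's actual instance-evaluation criterion; `inst.start_state` and `agent.value` below are hypothetical names): rank candidate instances with the agent's current value function and train on a growing subset.

```python
# Simplified value-based self-paced curriculum selection.
import numpy as np

def select_curriculum(instances, value_fn, fraction):
    """Return the `fraction` of instances whose start states the agent currently values highest."""
    scores = np.array([value_fn(inst.start_state) for inst in instances])
    k = max(1, int(fraction * len(instances)))
    order = np.argsort(-scores)                    # highest value first = currently "easiest"
    return [instances[i] for i in order[:k]]

# Training loop skeleton: grow the curriculum as training progresses.
# for epoch in range(n_epochs):
#     subset = select_curriculum(all_instances, agent.value, fraction=min(1.0, 0.1 + epoch * 0.05))
#     train(agent, subset)
```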
Bag of Baselines for Multi-objective Joint Neural Architecture Search and Hyperparameter Optimization
Sergio Izquierdo, Julia Guerrero-Viu, Sven Hauns, Guilherme Miotto, Simon Schrodi, André Biedenkapp, Thomas Elsken, Difan Deng, Marius Lindauer, and Frank Hutter
In Workshop on Automated Machine Learning (AutoML@ICML’21), 2021
Neural architecture search (NAS) and hyperparameter optimization (HPO) make deep learning accessible to non-experts by automatically finding the architecture of the deep neural network to use and tuning the hyperparameters of the used training pipeline. While both NAS and HPO have been studied extensively in recent years, NAS methods typically assume fixed hyperparameters and vice versa - there exists little work on joint NAS + HPO. Furthermore, NAS has recently often been framed as a multi-objective optimization problem, in order to take, e.g., resource requirements into account. In this paper, we propose a set of methods that extend current approaches to jointly optimize neural architectures and hyperparameters with respect to multiple objectives. We hope that these methods will serve as simple baselines for future research on multi-objective joint NAS + HPO. To facilitate this, all our code is available online.
MDP Playground: A Design and Debug Testbed for Reinforcement Learning
Raghu Rajan, Jessica Lizeth Borja Diaz, Suresh Guttikonda, Fabio Ferreira, André Biedenkapp, Jan Ole Hartz, and Frank Hutter
We present MDP Playground, an efficient testbed for Reinforcement Learning (RL) agents with orthogonal dimensions that can be controlled independently to challenge agents in different ways and obtain varying degrees of hardness in generated environments. We consider and allow control over a wide variety of dimensions, including delayed rewards, rewardable sequences, density of rewards, stochasticity, image representations, irrelevant features, time unit, action range and more. We define a parameterised collection of fast-to-run toy environments in OpenAI Gym by varying these dimensions and propose to use these for the initial design and development of agents. We also provide wrappers that inject these dimensions into complex environments from Atari and Mujoco to allow for evaluating agent robustness. We further provide various example use-cases and instructions on how to use MDP Playground to design and debug agents. We believe that MDP Playground is a valuable testbed for researchers designing new, adaptive and intelligent RL agents and those wanting to unit test their agents.
Sample-Efficient Automated Deep Reinforcement Learning
Jörg K H Franke, Gregor Köhler, André Biedenkapp, and Frank Hutter
International Conference on Learning Representations (ICLR) 2021, 2021
Despite significant progress in challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains challenging due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, potentially requiring different hyperparameter settings at various stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of the successes in RL to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize the hyperparameters and also the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm in the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.
On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning
Baohe Zhang, Raghu Rajan, Luis Pineda, Nathan Lambert, André Biedenkapp, Kurtland Chua, Frank Hutter, and Roberto Calandra
In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS)’21, 130, pp. 4015–4023, 2021
Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, yielding significantly improved performance compared to human experts. In addition, we show that tuning several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters which are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as the plan horizon or the learning rate, and their influence on the stability of training and resulting rewards.
In-Loop Meta-Learning with Gradient-Alignment Reward
At the heart of the standard deep learning training loop is a greedy gradient step minimizing a given loss. We propose to add a second step to maximize training generalization. To do this, we optimize the loss of the next training step. While computing the gradient for this is generally very expensive and many interesting applications consider non-differentiable parameters (e.g. due to hard samples), we present a cheap-to-compute and memory-saving reward, the gradient-alignment reward (GAR), that can guide the optimization. We use this reward to optimize multiple distributions during model training. First, we present the application of GAR to choosing the data distribution as a mixture of multiple dataset splits in a small-scale setting. Second, we show that it can successfully guide learning augmentation strategies competitive with state-of-the-art augmentation strategies on CIFAR-10 and CIFAR-100.
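A cheap gradient-alignment signal of this kind can be sketched as follows (our illustration; here alignment is measured as the cosine similarity between the parameter gradients of two batches, which is one plausible instantiation rather than the paper's exact definition):

```python
# Sketch: reward a training decision by how well the gradient it induces aligns
# with the gradient of a second (e.g. next-step or held-out) batch.
import torch

def gradient_alignment_reward(model, loss_fn, batch_a, batch_b):
    """Cosine similarity between the parameter gradients of two batches."""
    def flat_grad(batch):
        model.zero_grad()
        x, y = batch
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])
    g_a, g_b = flat_grad(batch_a), flat_grad(batch_b)
    return torch.nn.functional.cosine_similarity(g_a, g_b, dim=0).item()
```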
2020
Squirrel: A Switching Hyperparameter Optimizer (Description of the entry by AutoML.org & IOHprofiler to the NeurIPS 2020 BBO challenge)
In this short note, we describe our submission to the NeurIPS 2020 BBO challenge. Motivated by the fact that different optimizers work well on different problems, our approach switches between different optimizers. Since the team names on the competition’s leaderboard were randomly generated “alliteration nicknames”, consisting of an adjective and an animal with the same initial letter, we called our approach the Switching Squirrel, or Squirrel for short.
Learning Heuristic Selection with Dynamic Algorithm Configuration
A key challenge in satisficing planning is to use multiple heuristics within one heuristic search. An aggregation of multiple heuristic estimates, for example by taking the maximum, has the disadvantage that bad estimates of a single heuristic can negatively affect the whole search. Since the performance of a heuristic varies from instance to instance, approaches such as algorithm selection can be successfully applied. In addition, alternating between multiple heuristics during the search makes it possible to use all heuristics equally and improve performance. However, all these approaches ignore the internal search dynamics of a planning system, which can help to select the most helpful heuristics for the current expansion step. We show that dynamic algorithm configuration can be used for dynamic heuristic selection which takes into account the internal search dynamics of a planning system. Furthermore, we prove that this approach generalizes over existing approaches and that it can exponentially improve the performance of the heuristic search. To learn dynamic heuristic selection, we propose an approach based on reinforcement learning and show empirically that domain-wise learned policies, which take the internal search dynamics of a planning system into account, can exceed existing approaches in terms of coverage.
In Proceedings of the Sixteenth International Conference on Parallel Problem Solving from Nature (PPSN’20), 12269, pp. 691–706, 2020. *Joint first authorship.
An algorithm’s parameter setting often affects its ability to solve a given problem, e.g., population-size, mutation-rate or crossover-rate of an evolutionary algorithm. Furthermore, some parameters have to be adjusted dynamically, such as lowering the mutation-strength over time. While hand-crafted heuristics offer a way to fine-tune and dynamically configure these parameters, their design is tedious, time-consuming and typically involves analyzing the algorithm’s behavior on simple problems that may not be representative for those that arise in practice. In this paper, we show that formulating dynamic algorithm configuration as a reinforcement learning problem allows us to automatically learn policies that can dynamically configure the mutation step-size parameter of Covariance Matrix Adaptation Evolution Strategy (CMA-ES). We evaluate our approach on a wide range of black-box optimization problems, and show that (i) learning step-size policies has the potential to improve the performance of CMA-ES; (ii) learned step-size policies can outperform the default Cumulative Step-Size Adaptation of CMA-ES; and transferring the policies to (iii) different function classes and to (iv) higher dimensions is also possible.
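The control loop can be pictured with the `cma` package as follows (a sketch only: the learned policy is replaced by a placeholder rule, and the state features are made up for illustration):

```python
# Sketch: overwrite CMA-ES's step size after every generation.
import cma
import numpy as np

def sphere(x):
    return float(np.sum(np.asarray(x) ** 2))

es = cma.CMAEvolutionStrategy(x0=[1.0] * 10, sigma0=0.5)
for generation in range(50):
    solutions = es.ask()
    fitnesses = [sphere(x) for x in solutions]
    es.tell(solutions, fitnesses)
    features = np.array([generation, es.sigma, min(fitnesses)])   # crude state description
    new_sigma = es.sigma * 0.95                                   # placeholder for the learned policy
    es.sigma = new_sigma                                          # dynamic configuration of the step size
```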
Reinforcement Learning is a powerful approach to learning behaviour through interactions with an environment. However, behaviours are learned in a purely reactive fashion, where an appropriate action is selected based on an observation. In this form, it is challenging to learn when it is necessary to make new decisions. This makes learning inefficient, especially in environments with very fine-grained time steps. Instead, we propose a more proactive setting in which not only an action is chosen in a state but also for how long to commit to that action. We demonstrate the effectiveness of our proposed approach on a set of small grid worlds, showing that our approach is capable of learning successful policies much faster than vanilla Q-learning.
Towards Self-Paced Context Evaluations for Contextual Reinforcement Learning
Reinforcement Learning has performed very well on games and lab-based tasks. However, learning policies across a distribution of instances of the same task still remains challenging. Recent approaches assume either little variation between instances or an unlimited amount of training examples from a given distribution. Both properties are not always feasible in real-world applications. Thus, we need methods that enable agents to generalize from a limited set of example instances or experiences. We present an approach, based on self-paced learning, that exploits the information contained in state values during training to accelerate and improve training performance as well as generalization capabilities, independent of the problem domain at hand. The proposed Self-Paced Context Evaluation (SPaCE) provides a way to automatically generate instance curricula online with little computational overhead.
Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework
The performance of many algorithms in the fields of hard combinatorial problem solving, machine learning or AI in general depends on parameter tuning. Automated methods have been proposed to alleviate users from the tedious and error-prone task of manually searching for performance-optimized configurations across a set of problem instances. However, there is still a lot of untapped potential through adjusting an algorithm’s parameters online since different parameter values can be optimal at different stages of the algorithm. Prior work showed that reinforcement learning is an effective approach to learn policies for online adjustments of algorithm parameters in a data-driven way. We extend that approach by formulating the resulting dynamic algorithm configuration as a contextual MDP, such that RL not only learns a policy for a single instance, but across a set of instances. To lay the foundation for studying dynamic algorithm configuration with RL in a controlled setting, we propose white-box benchmarks covering major aspects that make dynamic algorithm configuration a hard problem in practice and study the performance of various types of configuration strategies for them. On these white-box benchmarks, we show that (i) RL is a robust candidate for learning configuration policies, outperforming standard parameter optimization approaches, such as classical algorithm configuration; (ii) based on function approximation, RL agents can learn to generalize to new types of instances; and (iii) self-paced learning can substantially improve the performance by selecting a useful sequence of training instances automatically.
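To convey what such a white-box benchmark looks like, here is a self-contained toy in the style of a sigmoid-tracking benchmark (written from scratch for illustration; it is not the benchmark code itself): the environment exposes the target algorithm's "state", and each action sets a parameter value for the next step.

```python
# Toy white-box DAC environment: track a randomly shifted sigmoid with a
# discrete parameter choice at every step; reward is the negative tracking error.
import numpy as np

class ToySigmoidDACEnv:
    def __init__(self, n_steps=10, n_actions=4, seed=0):
        self.n_steps, self.n_actions = n_steps, n_actions
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.shift, self.slope = 0, self.rng.uniform(0, 10), self.rng.uniform(0.5, 2.0)
        return np.array([self.t, self.shift, self.slope])

    def step(self, action):
        target = 1.0 / (1.0 + np.exp(-self.slope * (self.t - self.shift)))
        reward = -abs(action / (self.n_actions - 1) - target)   # closer parameter value, higher reward
        self.t += 1
        done = self.t >= self.n_steps
        return np.array([self.t, self.shift, self.slope]), reward, done, {}

env, rng = ToySigmoidDACEnv(), np.random.default_rng(1)
obs, done, ret = env.reset(), False, 0.0
while not done:
    action = rng.integers(env.n_actions)                        # placeholder for a DAC policy
    obs, reward, done, _ = env.step(action)
    ret += reward
```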
2019
Towards White-box Benchmarks for Algorithm Control
The performance of many algorithms in the fields of hard combinatorial problem solving, machine learning or AI in general depends on tuned hyperparameter configurations. Automated methods have been proposed to alleviate users from the tedious and error-prone task of manually searching for performance-optimized configurations across a set of problem instances. However, there is still a lot of untapped potential through adjusting an algorithm’s hyperparameters online, since different hyperparameters are potentially optimal at different stages of the algorithm. We formulate the problem of adjusting an algorithm’s hyperparameters for a given instance on the fly as a contextual MDP, making reinforcement learning (RL) the prime candidate to solve the resulting algorithm control problem in a data-driven way. Furthermore, inspired by applications of algorithm configuration, we introduce new white-box benchmarks suitable to study algorithm control. We show that on short sequences, algorithm configuration is a valid choice, but that with increasing sequence length a black-box view on the problem quickly becomes infeasible and RL performs better.
BOAH: A Tool Suite for Multi-Fidelity Bayesian Optimization & Analysis of Hyperparameters
Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Joshua Marben, Philipp Müller, and Frank Hutter
Hyperparameter optimization and neural architecture search can become prohibitively expensive for regular black-box Bayesian optimization because the training and evaluation of a single model can easily take several hours. To overcome this, we introduce a comprehensive tool suite for effective multi-fidelity Bayesian optimization and the analysis of its runs. The suite, written in Python, provides a simple way to specify complex design spaces, a robust and efficient combination of Bayesian optimization and HyperBand, and a comprehensive analysis of the optimization process and its outcomes.
Towards Assessing the Impact of Bayesian Optimization’s Own Hyperparameters
Bayesian Optimization (BO) is a common approach for hyperparameter optimization (HPO) in automated machine learning. Although it is well-accepted that HPO is crucial to obtain well-performing machine learning models, tuning BO’s own hyperparameters is often neglected. In this paper, we empirically study the impact of optimizing BO’s own hyperparameters and the transferability of the found settings using a wide range of benchmarks, including artificial functions, HPO and HPO combined with neural architecture search. In particular, we show (i) that tuning can improve the any-time performance of different BO approaches, that optimized BO settings also perform well (ii) on similar problems and (iii) partially even on problems from other problem families, and (iv) which BO hyperparameters are most important.
2018
CAVE: Configuration Assessment, Visualization and Evaluation
To achieve peak performance of an algorithm (in particular for problems in AI), algorithm configuration is often necessary to determine a well-performing parameter configuration. So far, most studies in algorithm configuration focused on proposing better algorithm configuration procedures or on improving a particular algorithm’s performance. In contrast, we use all the collected empirical performance data gathered during algorithm configuration runs to generate extensive insights into an algorithm, given problem instances and the used configurator. To this end, we provide a tool, called CAVE, that automatically generates comprehensive reports and insightful figures from all available empirical data. CAVE aims to help algorithm and configurator developers to better understand their experimental setup in an automated fashion. We showcase its use by thoroughly analyzing the well-studied SAT solver Spear on a benchmark of software verification instances and by empirically verifying two long-standing assumptions in algorithm configuration and parameter importance: (i) Parameter importance changes depending on the instance set at hand and (ii) Local and global parameter importance analysis do not necessarily agree with each other.
2017
Efficient Parameter Importance Analysis via Ablation with Surrogates
To achieve peak performance, it is often necessary to adjust the parameters of a given algorithm to the class of problem instances to be solved; this is known to be the case for popular solvers for a broad range of AI problems, including AI planning, propositional satisfiability (SAT) and answer set programming (ASP). To avoid tedious and often highly sub-optimal manual tuning of such parameters by means of ad-hoc methods, general-purpose algorithm configuration procedures can be used to automatically find performance-optimizing parameter settings. While impressive performance gains are often achieved in this manner, additional, potentially costly parameter importance analysis is required to gain insights into what parameter changes are most responsible for those improvements. Here, we show how the running time cost of ablation analysis, a well-known general-purpose approach for assessing parameter importance, can be reduced substantially by using regression models of algorithm performance constructed from data collected during the configuration process. In our experiments, we demonstrate speed-up factors between 33 and 14 727 for ablation analysis on various configuration scenarios from AI planning, SAT, ASP and mixed integer programming (MIP).
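The core trick can be sketched as follows (an illustration, not the original implementation: configurations are assumed to be purely numerical, encoded as vectors with columns ordered by sorted parameter name, and the surrogate is a random forest, a common choice for empirical performance models): the greedy ablation path is walked on surrogate predictions instead of real target-algorithm runs.

```python
# Sketch: surrogate-based greedy ablation from the default to the incumbent configuration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def surrogate_ablation(default, incumbent, X_configs, y_costs):
    """Greedily flip parameters from default to incumbent, ordered by predicted cost reduction.
    X_configs: numeric config matrix with columns ordered by sorted(default) keys; y_costs: observed costs."""
    model = RandomForestRegressor(n_estimators=100).fit(X_configs, y_costs)
    current, remaining, path = dict(default), set(default), []
    while remaining:
        candidates = []
        for p in remaining:
            trial = dict(current)
            trial[p] = incumbent[p]
            pred = model.predict(np.array([[trial[k] for k in sorted(default)]]))[0]
            candidates.append((pred, p, trial))
        pred, best_p, current = min(candidates)        # flip with lowest predicted cost next
        remaining.remove(best_p)
        path.append((best_p, pred))
    return path   # parameter flip order with predicted costs; no real algorithm runs needed
```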