Introduction to multi-objective reinforcement learning

Santeri Heiskanen

2024-12-11

Multi-objective reinforcement learning in a nutshell

  • Single-objective reinforcement learning (SORL): The optimal behaviour is defined via a scalar reward.
  • Multi-objective reinforcement learning (MORL): Consider multiple conflicting objectives simultaneously.
  • Policy \(\pi\) is Pareto optimal if no other policy dominates it, i.e. there is no policy \(\pi'\) that is:
    1. at least as good as \(\pi\) in every objective, and
    2. strictly better than \(\pi\) in at least one objective.
  • Goal: Find a set of Pareto-optimal policies.
Figure 1: Conceptual Pareto front for two objectives. Each point in the plot represents the value of a policy \(\pi\) for a given preference \(\mathbf{w}\).

Why consider multiple objectives?

  1. Move the responsibility of choosing the desired trade-off between the objectives from the engineer to the end-user.
  2. The desired trade-off can be chosen after the optimization is done and the results can be inspected.
  3. There may be uncertainty in user preferences between objectives.

Example: Multi-objective Halfcheetah

  • Two objectives: Energy consumption and running speed.
  • All behaviours are generated by a single policy that is conditioned on the preferences between the objectives.
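  • A minimal sketch of this conditioning, assuming a continuous-control setting like Halfcheetah (the class name, layer sizes, and PyTorch usage here are illustrative, not the architecture of any particular paper): the policy receives the preference vector \(\mathbf{w}\) concatenated with the observation, so a single set of weights covers all trade-offs.

    import torch
    import torch.nn as nn

    class PreferenceConditionedPolicy(nn.Module):
        """A single policy covering all trade-offs: the preference vector w
        is fed to the network alongside the observation (illustrative sizes)."""

        def __init__(self, obs_dim: int, n_objectives: int, act_dim: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),  # continuous actions in [-1, 1]
            )

        def forward(self, obs: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
            # Conditioning is plain concatenation of observation and preference.
            return self.net(torch.cat([obs, w], dim=-1))

    # e.g. policy(obs, torch.tensor([0.7, 0.3])) emphasizes the first objective.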

Multi-objective MDPs and value-functions

  • Formally, a multi-objective sequential decision-making problem is represented as a multi-objective Markov decision process (MOMDP): \(\langle S, A, T, \gamma, \mu, \mathbf{R}\rangle\).

  • The major difference from a traditional MDP is the inclusion of a vector-valued reward function \(\mathbf{R}: S \times A \times S \rightarrow \mathbb{R}^d\), where \(d \geq 2\) is the number of objectives.

  • Vector-valued value-function is defined as in the single-objective case: \[\mathbf{V}^{\pi} = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \mathbf{r}_{t}\,\vert\,\pi, \mu\right]\]

  • Consider two policies \(\pi\) and \(\pi'\) with two objectives \(i\) and \(j\):

    • It is possible that \(V_i^\pi > V_i^{\pi'}\) while \(V_j^\pi < V_j^{\pi'}\), i.e. neither policy is better in both objectives.
    • → The value-function defines only a partial ordering over the value-space.
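  • Because the ordering is only partial, comparing two vector-valued returns requires an explicit dominance test. A minimal sketch in plain NumPy (the function name is ours):

    import numpy as np

    def dominates(v: np.ndarray, v_prime: np.ndarray) -> bool:
        """True if value vector v Pareto-dominates v_prime: at least as good
        in every objective and strictly better in at least one."""
        return bool(np.all(v >= v_prime) and np.any(v > v_prime))

    v_pi       = np.array([10.0, 4.0])   # better in objective i
    v_pi_prime = np.array([7.0, 9.0])    # better in objective j

    print(dominates(v_pi, v_pi_prime))   # False
    print(dominates(v_pi_prime, v_pi))   # False -> the policies are incomparable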

Utility functions

  • A utility function (or scalarization function) is a mapping that encodes the user's preferences over the objectives: \[ u: \mathbb{R}^d \rightarrow \mathbb{R} \]
  • In practice, the user preferences are described as a weight vector: \[ \mathbf{w} = [w_1, w_2, \dots, w_d],\; \sum_{i=1}^d w_i = 1\, \land\, \forall i:\; 0 < w_i < 1 \]
  • The vector-valued value-function can be converted back to a scalar-valued function using the utility function: \[ V_u^\pi = u(\mathbf{V}^\pi) \]
  • The scalarized value-function defines a total ordering of the value-space for a given preference.

Types of utility functions

  • Intuitively, the scalarized value should increase if the value of one objective increases while the values of the other objectives remain unchanged.
    • e.g. \(u([0.5, 0.5]) > u([0.5, 0.1])\).
  • Monotonically increasing utility functions fulfill this requirement.
    • This is a minimal assumption about the form of the utility function.
    • Crucially, this set contains non-linear utility functions, such as the Tchebycheff utility function (Perny and Weng 2010).
  • Linear utility functions are commonly used in practice:
    • \(V_{\mathbf{w}}^\pi = \mathbf{w}^T \mathbf{V}^\pi\)
    • In addition to having certain important theoretical properties, they are also easy to interpret.
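  • A sketch of both utility families discussed above; the Tchebycheff form below is the common weighted-distance-to-a-reference-point variant, and the utopia point z_star is an assumption of the example:

    import numpy as np

    def linear_utility(v: np.ndarray, w: np.ndarray) -> float:
        """u(V) = w^T V."""
        return float(w @ v)

    def tchebycheff_utility(v: np.ndarray, w: np.ndarray, z_star: np.ndarray) -> float:
        """Non-linear utility: negative weighted Chebyshev distance to a
        reference point z_star (monotone when z_star dominates all achievable values)."""
        return float(-np.max(w * np.abs(z_star - v)))

    w, z_star = np.array([0.5, 0.5]), np.array([1.0, 1.0])
    print(linear_utility(np.array([0.5, 0.5]), w))                 # 0.5
    print(tchebycheff_utility(np.array([0.5, 0.5]), w, z_star))    # -0.25
    print(tchebycheff_utility(np.array([0.5, 0.1]), w, z_star))    # -0.45 < -0.25, as required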

What to optimize for?

  • In SORL, the goal is to optimize the expected cumulative return.
  • In MORL, there are two optimality criteria:
    • Scalarized expected return (SER) \[ V_u^\pi = u\Biggl(\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \mathbf{r}_t\,\bigg\vert\,\pi, s_0\right] \Biggr) \]

    • Expected scalarized return (ESR) \[ V_u^\pi = \mathbb{E}\left[u\left(\sum_{t=0}^\infty \gamma^t \mathbf{r}_t\right)\,\bigg\vert\,\pi, s_0\right] \]

  • If the utility function is linear, both criteria produce the same policies.
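  • The difference is easiest to see from Monte Carlo estimates: SER applies \(u\) to the average vector return, while ESR averages \(u\) applied to each return. A small numeric sketch (the sampled returns and the min-based utility are made up for illustration):

    import numpy as np

    # Two sampled (discounted) vector returns from the same policy.
    returns = np.array([[10.0, 0.0],
                        [0.0, 10.0]])

    u_nonlinear = lambda v: float(np.min(v))              # non-linear utility
    u_linear    = lambda v: float(0.5 * v[0] + 0.5 * v[1])

    ser = u_nonlinear(returns.mean(axis=0))               # u(E[G]) = u([5, 5]) = 5
    esr = np.mean([u_nonlinear(g) for g in returns])      # E[u(G)] = mean(0, 0) = 0
    print(ser, esr)                                       # 5.0 0.0 -> the criteria disagree

    print(u_linear(returns.mean(axis=0)),
          np.mean([u_linear(g) for g in returns]))        # 5.0 5.0 -> linear u: they coincide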

Bellman equation with ESR

  • Recall the Bellman equation: \[ \mathbf{V}^{\pi}(s) = \underbrace{\mathbf{R}_\tau}_{\text{Immediate reward}} + \gamma \underbrace{\sum_{s'} P\big(s'\,\vert\,s, \pi(s)\big)\mathbf{V}^\pi(s')}_{\text{Expected return from state $s'$ onwards}} \]

  • Consider what happens under the ESR optimality criterion with a non-linear utility function.

  • \[ \mathbb{E}\left[u\bigg(\mathbf{R}_\tau + \sum_{t=\tau}^\infty \gamma^t \mathbf{r}_t\bigg)\,\bigg\vert\,\pi, s_\tau \right] \neq u(\mathbf{R}_\tau) + \mathbb{E}\left[ u\bigg(\sum_{t=\tau}^\infty \gamma^t \mathbf{r}_t\bigg)\,\bigg\vert\, \pi, s_\tau\right] \]

  • Implication: most existing methods for MDPs, which rely on this recursive decomposition, cannot be used with the ESR optimality criterion and non-linear utility functions.

  • One needs to take into account the previously accumulated rewards (e.g. by augmenting the state with the accrued reward).
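  • The failure of this decomposition can also be checked numerically; a small sketch with a made-up vector reward and the min-utility as the non-linear \(u\):

    import numpy as np

    rng = np.random.default_rng(0)
    u = lambda v: float(np.min(v))                  # non-linear utility

    R_tau  = np.array([2.0, 0.0])                   # immediate (vector) reward
    future = rng.normal(size=(1000, 2)) + 1.0       # sampled future discounted returns

    lhs = np.mean([u(R_tau + g) for g in future])   # E[u(R_tau + G)]
    rhs = u(R_tau) + np.mean([u(g) for g in future])
    print(lhs, rhs)                                 # the two sides clearly differ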

How to solve a MORL problem

  • Previous slides: The agent optimizes the scalarized expected return/expected scalarized return. However, our goal is to find a set of solutions.

  • (Roijers and Whiteson 2017) identified two approaches for solving a MORL task.

  • Outer loop methods:

    Repeatedly run a single-objective algorithm with different user preferences (see the scalarization-wrapper sketch after this list).

    solutions = set()
    for w in W:                  # W: set of candidate preference weight vectors
        pi_w = SORL(w)           # solve the single-objective problem scalarized with w
        solutions.add(pi_w)
  • Inner loop methods: Modify the underlying algorithm to directly generate a solution set.
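  • A sketch of the scalarization step that makes the outer loop work: wrap the multi-objective environment so that any off-the-shelf single-objective algorithm sees the scalar reward \(\mathbf{w}^T\mathbf{r}\). A gymnasium-style API with vector rewards is assumed; the wrapper and the commented training call are hypothetical placeholders, not a specific library API.

    import numpy as np

    class LinearScalarizationWrapper:
        """Collapses the vector reward to w^T r so that a standard
        single-objective RL algorithm can be reused unchanged."""

        def __init__(self, env, w):
            self.env, self.w = env, np.asarray(w, dtype=np.float64)

        def reset(self, **kwargs):
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, vec_reward, terminated, truncated, info = self.env.step(action)
            info["vector_reward"] = vec_reward          # keep the raw vector around
            return obs, float(self.w @ vec_reward), terminated, truncated, info

    # Outer loop: one single-objective run per candidate preference vector, e.g.
    # solutions = {tuple(w): train_sorl(LinearScalarizationWrapper(make_env(), w))
    #              for w in candidate_weights}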

Are deterministic stationary policies enough?

  • The one-state MOMDP in Figure 3 contains 3 deterministic stationary policies.
  • These deterministic policies have values: \(\mathbf{V}^{\pi_1} = \big(3 / (1 - \gamma), 0\big)\), \(\mathbf{V}^{\pi_2} = \big(0, 3 / (1 - \gamma)\big)\), \(\mathbf{V}^{\pi_3} = \big(1/(1 - \gamma), 1/(1-\gamma)\big)\)
  • Create a mixture policy \(\pi_m\) that selects action \(a_1\) with probability \(p\), and action \(a_2\) otherwise (Vamplew et al. 2009).
    • \(\mathbf{V}^{\pi_m} = \big(3p / (1 - \gamma), 3(1-p)/(1 - \gamma)\big)\)
Figure 3: One state MOMDP with three actions. Example follows (Roijers et al. 2013), adapted from (White 1982)
  • To show that a non-stationary policy can dominate a stationary policy, consider a policy that alternates between actions \(a_1\) and \(a_2\), starting from \(a_1\) (White 1982).
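  • The per-step rewards implied by the values above are \(\mathbf{r}(a_1) = (3, 0)\), \(\mathbf{r}(a_2) = (0, 3)\) and \(\mathbf{r}(a_3) = (1, 1)\). A quick computation (with \(\gamma = 0.9\) as an example) shows that both the mixture policy and the alternating non-stationary policy reach value vectors that no deterministic stationary policy can:

    gamma = 0.9

    # Deterministic stationary policies.
    v_pi1 = (3 / (1 - gamma), 0.0)                     # always a1 -> (30, 0)
    v_pi3 = (1 / (1 - gamma), 1 / (1 - gamma))         # always a3 -> (10, 10)

    # Mixture policy: a1 with probability p, a2 otherwise.
    p = 0.5
    v_mix = (3 * p / (1 - gamma), 3 * (1 - p) / (1 - gamma))    # (15, 15): dominates pi_3

    # Non-stationary policy alternating a1, a2, a1, ... (White 1982):
    # objective 1 collects reward 3 on even steps, objective 2 on odd steps.
    v_alt = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))    # (~15.8, ~14.2): also dominates pi_3

    print(v_mix, v_alt)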

Are linear utility functions all you need?

  • Can non-convex portions of the Pareto front be recovered with a linear utility function?
  • (Lu, Herman, and Yu 2023): The set of achievable value vectors is convex. → One can discover the whole Pareto front with a linear utility function.
  • Existing algorithms have two bottlenecks:
    1. Preference for deterministic policies.
    2. Numerical instability.
  • These issues can be remedied by augmenting the reward function with a strongly concave (entropy) term: \[ \pi(\cdot;\mathbf{w}) = \underset{\pi'(\cdot;\mathbf{w})}{\mathrm{arg\;max}}\, \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \big(\mathbf{w}^T \mathbf{R}(a_t, s_t) + \alpha \mathcal{H}(\pi'(s_t;\mathbf{w}))\big)\right] \]
  • This is a multi-objective version of Soft-Actor Critic (Haarnoja et al. 2018).
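  • As a rough sketch, the per-step quantity being maximized is the scalarized reward plus an entropy bonus; the helper below is illustrative (discrete action probabilities, a fixed temperature alpha) and is not a full soft actor-critic implementation or the authors' code:

    import numpy as np

    def augmented_step_reward(vec_reward, w, action_probs, alpha=0.2):
        """w^T r + alpha * H(pi(.|s; w)). The entropy bonus keeps the
        preference-conditioned policy stochastic, countering the preference
        for deterministic policies noted above."""
        entropy = -np.sum(action_probs * np.log(action_probs + 1e-12))
        return float(w @ vec_reward + alpha * entropy)

    print(augmented_step_reward(np.array([1.0, 2.0]),
                                np.array([0.5, 0.5]),
                                np.array([0.25, 0.25, 0.5])))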

Material & Resources

  • MORL-baselines: Baseline algorithms.

  • MO-Gymnasium: Multi-objective environments.

  • Overview of theory and applications of multi-objective sequential decision making algorithms: Hayes, Conor F., Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, et al. 2022. “A Practical Guide to Multi-Objective Reinforcement Learning and Planning.” Autonomous Agents and Multi-Agent Systems 36 (1): 26.

  • Algorithm for continuous control tasks: Xu, Jie, Yunsheng Tian, Pingchuan Ma, Daniela Rus, Shinjiro Sueda, and Wojciech Matusik. 2020. “Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control.” In Proceedings of the 37th International Conference on Machine Learning.

  • Sample-efficient MORL algorithm based on GPI: Alegre, Lucas N., Ana L. C. Bazzan, Diederik M. Roijers, Ann Nowé, and Bruno C. da Silva. 2023. “Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization.” In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems.

References

Haarnoja, Tuomas, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” In International Conference on Machine Learning.
Lu, Haoye, Daniel Herman, and Yaoliang Yu. 2023. “Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality.” In The Eleventh International Conference on Learning Representations.
Perny, Patrice, and Paul Weng. 2010. “On Finding Compromise Solutions in Multiobjective Markov Decision Processes.” In ECAI 2010. Vol. 215.
Roijers, Diederik M., Peter Vamplew, Shimon Whiteson, and Richard Dazeley. 2013. “A Survey of Multi-Objective Sequential Decision-Making.” Journal of Artificial Intelligence Research 48.
Roijers, Diederik M., and Shimon Whiteson. 2017. Multi-Objective Decision Making. 1st ed. Synthesis Lectures on Artificial Intelligence and Machine Learning. Springer Cham.
Vamplew, Peter, Richard Dazeley, Ewan Barker, and Andrei Kelarev. 2009. “Constructing Stochastic Mixture Policies for Episodic Multiobjective Reinforcement Learning Tasks.” In AI 2009: Advances in Artificial Intelligence.
White, D. J. 1982. “Multi-Objective Infinite-Horizon Discounted Markov Decision Processes.” Journal of Mathematical Analysis and Applications 89 (2).