2024-12-11
Formally, a multi-objective sequential decision making problem is represented as a multi-objective Markov decision process (MOMDP): \(\langle S, A, T, \gamma, \mu, \mathbf{R}\rangle\).
The major difference from a traditional MDP is the inclusion of a vector-valued reward function \(\mathbf{R}: S \times A \times S \rightarrow \mathbb{R}^d\), where \(d \geq 2\) is the number of objectives.
The vector-valued value function is defined as in the single-objective case: \[\mathbf{V}^{\pi} = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \mathbf{r}_{t}\,\vert\,\pi, \mu\right]\]
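As a sanity check of the definition, the vector value of a fixed policy can be estimated by plain Monte Carlo rollouts, accumulating a discounted reward *vector* instead of a scalar. A minimal sketch, assuming a Gymnasium-style environment that returns a length-\(d\) reward vector and a `policy` callable mapping observations to actions (both are placeholders):

```python
import numpy as np

def mc_vector_value(env, policy, gamma=0.99, n_episodes=100, max_steps=1000):
    """Monte Carlo estimate of the vector value V^pi under the start-state distribution mu."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            obs, r, terminated, truncated, _ = env.step(policy(obs))
            g = g + discount * np.asarray(r, dtype=float)   # accumulate the reward *vector*
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(g)
    return np.mean(returns, axis=0)                          # elementwise mean over episodes
```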
Consider two policies \(\pi\) and \(\pi'\) with two objectives \(i\) and \(j\); which one is preferred depends on how the vector return is aggregated by a utility function \(u\):
Scalarized expected return (SER) \[ V_u^\pi = u\Biggl(\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \mathbf{r}_t\,\bigg\vert\,\pi, s_0\right] \Biggr) \]
Expected scalarized return (ESR) \[ V_u^\pi = \mathbb{E}\left[u\left(\sum_{t=0}^\infty \gamma^t \mathbf{r}_t\right)\,\bigg\vert\,\pi, s_0\right] \]
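The two criteria coincide for linear \(u\) but generally differ for non-linear \(u\), because the expectation and the utility do not commute. A small numeric sketch with a hypothetical two-objective utility \(u(\mathbf{v}) = v_1 \cdot v_2\):

```python
import numpy as np

u = lambda v: v[..., 0] * v[..., 1]        # hypothetical non-linear utility on 2-dim returns

# Suppose the discounted vector return under pi is [10, 0] or [0, 10], each with probability 1/2.
returns = np.array([[10.0, 0.0],
                    [0.0, 10.0]])

ser = u(returns.mean(axis=0))              # SER: u(E[return]) = u([5, 5]) = 25.0
esr = u(returns).mean()                    # ESR: E[u(return)] = mean(0, 0)  = 0.0
print(ser, esr)                            # the two criteria rate this policy very differently
```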
Recall the Bellman equation: \[ \mathbf{V}^{\pi}(s) = \underbrace{\mathbf{R}_\tau}_{\text{Immediate reward}} + \gamma \underbrace{\sum_{s'} P\big(s'\,\vert\,s, \pi(s)\big)\mathbf{V}^\pi(s')}_{\text{Expected return from state $s'$ onwards}} \]
Consider what happens with the ESR optimality criterion and a non-linear utility function.
\[ \mathbb{E}\left[u\bigg(\mathbf{R}_\tau + \sum_{t=\tau}^\infty \gamma^t \mathbf{r}_t\bigg)\,\bigg\vert\,\pi, s_\tau \right] \neq u(\mathbf{R}_\tau) + \mathbb{E}\left[ u\bigg(\sum_{t=\tau}^\infty \gamma^t \mathbf{r}_t\bigg)\,\bigg\vert\, \pi, s_\tau\right] \]
Implication: most existing methods for MDPs rely on this Bellman decomposition and therefore cannot be used with the ESR optimality criterion and non-linear utility functions.
Instead, one needs to take the previously accumulated rewards into account.
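One common way to do this (discussed, e.g., in Hayes et al. 2022) is to augment the agent's observation with the reward accumulated so far, so that the policy can condition on \(\mathbf{R}_\tau\). A rough sketch of such a wrapper, assuming a Gymnasium-style environment with vector rewards; the class and argument names are illustrative, and observation-space bookkeeping is omitted:

```python
import numpy as np
import gymnasium as gym

class AccruedRewardWrapper(gym.Wrapper):
    """Appends the discounted reward accumulated so far to each observation (illustrative sketch)."""

    def __init__(self, env, reward_dim, gamma=0.99):
        super().__init__(env)
        self.reward_dim = reward_dim
        self.gamma = gamma

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.accrued = np.zeros(self.reward_dim)        # R_tau: nothing accumulated yet
        self.discount = 1.0
        return (obs, self.accrued.copy()), info

    def step(self, action):
        obs, r, terminated, truncated, info = self.env.step(action)
        self.accrued = self.accrued + self.discount * np.asarray(r, dtype=float)
        self.discount *= self.gamma
        # The policy now observes (s, R_tau) and can aim at maximizing E[u(R_tau + future rewards)].
        return (obs, self.accrued.copy()), r, terminated, truncated, info
```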
Previous slides: the agent optimizes the scalarized expected return / expected scalarized return for a given utility function. However, our goal is to find a set of solutions.
Roijers and Whiteson (2017) identified two approaches for solving a MORL task.
Outer loop methods: repeatedly run a single-objective algorithm with different user preferences (a sketch follows below).
Inner loop methods: modify the underlying algorithm to directly generate a solution set.
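As referenced above, here is a rough sketch of an outer-loop scheme under a linear scalarization: train one single-objective agent per weight vector and keep the Pareto-optimal results. `train_scalarized_agent` and `evaluate_vector_return` are hypothetical placeholders for an ordinary RL training run and a policy evaluation routine:

```python
import numpy as np

def outer_loop_morl(train_scalarized_agent, evaluate_vector_return, weight_vectors):
    """Outer-loop sketch: one single-objective training run per preference (weight) vector."""
    candidates = []
    for w in weight_vectors:
        policy = train_scalarized_agent(w)          # trains on the scalar reward w . r (placeholder)
        value = evaluate_vector_return(policy)      # estimated vector return of the policy (placeholder)
        candidates.append((w, policy, value))

    def dominated(v, others):
        # v is Pareto-dominated if some other value is >= in all objectives and > in at least one.
        return any(np.all(o >= v) and np.any(o > v) for o in others)

    values = [v for _, _, v in candidates]
    return [c for i, c in enumerate(candidates)
            if not dominated(c[2], values[:i] + values[i + 1:])]
```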
MORL-Baselines: implementations of baseline multi-objective RL algorithms.
MO-Gymnasium: multi-objective environments (vector rewards) with a Gymnasium-style API.
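For example, a minimal random-policy rollout in MO-Gymnasium might look like the following; this assumes its Gymnasium-style API in which `step` returns a reward *vector*, and uses the `deep-sea-treasure-v0` benchmark (a two-objective gridworld: treasure value vs. time penalty):

```python
import numpy as np
import mo_gymnasium as mo_gym

env = mo_gym.make("deep-sea-treasure-v0")

obs, info = env.reset(seed=0)
terminated = truncated = False
episode_return = np.zeros(2)                  # one entry per objective
while not (terminated or truncated):
    action = env.action_space.sample()        # random policy, purely for illustration
    obs, vector_reward, terminated, truncated, info = env.step(action)
    episode_return += np.asarray(vector_reward, dtype=float)

print("vector return of one random episode:", episode_return)
```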
Overview of theory and applications of multi-objective sequential decision making algorithms: Hayes, Conor F., Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, et al. 2022. “A Practical Guide to Multi-Objective Reinforcement Learning and Planning.” Autonomous Agents and Multi-Agent Systems 36 (1): 26.
Algorithm for continuous control tasks: Xu, Jie, Yunsheng Tian, Pingchuan Ma, Daniela Rus, Shinjiro Sueda, and Wojciech Matusik. 2020. “Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control.” In Proceedings of the 37th International Conference on Machine Learning.
Sample-efficient MORL algorithm based on generalized policy improvement (GPI): Alegre, Lucas N., Ana L. C. Bazzan, Diederik M. Roijers, Ann Nowé, and Bruno C. da Silva. 2023. “Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization.” In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems.