Generalizing Pareto optimal policies in Multi-objective Reinforcement learning

Empirical study of hypernetworks

Santeri Heiskanen
Supervisors: Prof. Ville Kyrki1, Dr. Atanu Mazumdar 1, Prof. Joni Kämäräinen2
1Aalto University, 2Tampere University
20-07-2024

Multi-objective reinforcement learning in a nutshell

  • Multi-objective reinforcement learning (MORL) is an extension of single-objective reinforcement learning in which multiple (conflicting) objectives are considered simultaneously.
  • A policy \(\pi\) is Pareto-optimal if no other policy Pareto-dominates it, i.e. no other policy is at least as good in every objective and strictly better in at least one (a small dominance-check sketch is given below the figure caption).
  • The goal of MORL is (often) to find a set of Pareto-optimal policies whose values form the Pareto front.1
  • Why?
    • Many real-world problems are truly multi-objective.
    • The responsibility of making the compromise is moved away from the engineer to the user.
    • The desired solution can be selected after observing the outcomes of the different trade-offs.
Figure: Conceptual Pareto-front for two objectives. Each point in the plot represents a value of a policy \(\pi\) with given preferences \(\mathbf{w}\).
1. The exact goal depends on multiple factors, including the type of the utility function; see Roijers et al. [1] for more information.
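To make the dominance relation concrete, the following minimal numpy sketch (illustrative only, not part of the thesis code) filters a set of policy value vectors down to its non-dominated subset, i.e. an empirical Pareto front, assuming all objectives are maximized. The example values and function names are placeholders.

```python
import numpy as np

def pareto_dominates(v, u):
    """True if value vector v Pareto-dominates u (maximization):
    v is at least as good in every objective and strictly better in one."""
    return np.all(v >= u) and np.any(v > u)

def non_dominated(values):
    """Return the non-dominated subset of a set of value vectors.

    values: array of shape (n_policies, n_objectives).
    """
    values = np.asarray(values, dtype=float)
    keep = []
    for i, v in enumerate(values):
        dominated = any(
            pareto_dominates(u, v) for j, u in enumerate(values) if j != i
        )
        if not dominated:
            keep.append(i)
    return values[keep]

# Example: three candidate policies evaluated on two objectives.
V = np.array([[1.0, 0.2],   # fast but inefficient
              [0.4, 0.9],   # efficient but slow
              [0.3, 0.3]])  # dominated by both of the above
print(non_dominated(V))     # -> [[1.0, 0.2], [0.4, 0.9]]
```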

Problem setting

Formal problem definition
  • The problem is formalized as a Multi-objective Markov decision process (MOMDP).
  • The reward function of a MOMDP is vector-valued, and thus the value function of a policy \(\pi\) is also vector-valued: \[\begin{equation*} \mathbf{V}^{\pi} = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k \mathbf{r}_{k+1}\;|\;\pi, \mu \right] \end{equation*}\]
  • A utility function \(u:\; \mathbb{R}^d \rightarrow \mathbb{R}\), parametrized by the user preferences \(\mathbf{w}\), maps the value function to a scalar: \(\;V_u^{\pi} = u(\mathbf{V}^\pi)\)
  • The goal of the MORL agent is to optimize the scalarized (discounted) cumulative return.
Assumptions
  • The user preferences \(\mathbf{w}\) are not known in advance.
  • The utility function is a linear combination of the objectives weighted by the preferences: \(u(\mathbf{r}, \mathbf{w}) = \mathbf{w}^T \mathbf{r}\) (see the sketch below).
  • The preferences are presented as a non-zero vector \(\mathbf{w}\), where \[\begin{equation*} 0 \leq w_i \leq 1,\; i=1,\dots, n\text{ and } \sum_{i=1}^n w_i = 1 \end{equation*}\]
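As an illustration of the assumed linear utility, the short sketch below scalarizes a vector-valued return with a preference vector sampled uniformly from the simplex. The uniform Dirichlet sampling and the example return values are assumptions made for the example only; the preference distribution actually used during training is discussed later.

```python
import numpy as np

def sample_preference(n_objectives, rng):
    """Sample w uniformly from the simplex: w_i >= 0 and sum(w) = 1."""
    return rng.dirichlet(np.ones(n_objectives))

def linear_utility(vec_return, w):
    """Linear scalarization u(r, w) = w^T r of a vector-valued return."""
    return np.dot(w, vec_return)

rng = np.random.default_rng(0)
w = sample_preference(2, rng)
vec_return = np.array([120.0, -35.0])   # placeholder two-objective return
print(w, linear_utility(vec_return, w))
```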

Research questions

Motivation
  • Current approaches for finding the Pareto front either produce a set of policies naturally (inner-loop methods) or repeatedly solve a single-objective problem with different scalarizations of the objective (outer-loop methods).
  • The inherent simplicity of outer-loop methods is appealing. However, they suffer from inefficiency, since a new policy is trained for each scalarization of the goal.
  • The optimal policies for different user preferences can differ significantly. Previous attempts to generalize learned information to new policies with different preferences over the objectives (e.g. the meta-policy approach by Chen et al. [3]) have suffered from sub-optimal performance.
Research questions
  • Can the learned information from the previous Pareto optimal policies be generalized to new policies with different preferences over the objectives, while retaining the quality of the solution set?
  • Can the information sharing improve the efficiency of the training process?

Solution proposal

  • Hypernetworks have shown excellent performance in many different tasks [4], including traditional RL [6].
  • Hypothesis: the natural information sharing of hypernetworks could improve the generalization capabilities of Pareto-optimal policies.
  • Two different configurations for the critic:
    • A-critic-MO-hyper: the critic network takes only the action \(a\) as its input.
    • SAP-critic-MO-hyper: the critic network takes the state \(s\), the action \(a\) and the preferences \(\mathbf{w}\) as its inputs.
    • In both cases the hypernetwork takes the state \(s\) and the preferences \(\mathbf{w}\) as its input (a simplified sketch of this setup is given after the figure captions).
  • The policies are trained using an existing MORL algorithm, Concave Augmented Pareto Q-learning (CAPQL) [5].
Figure a): Following the work of Sarafian et al. [6], the critic is modeled as a contextual bandit \(Q_{s, \mathbf{w}}(a)\).
Figure b): The hypernetwork consists of multiple residual blocks. Each head of the hypernetwork produces the parameters for a single layer of the critic network.
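The sketch below illustrates the A-critic-MO-hyper idea in PyTorch under simplifying assumptions: the layer sizes, the plain MLP hypernetwork body and the scalar output are placeholders, whereas the thesis architecture uses residual blocks with one head per generated critic layer, following Sarafian et al. [6]. It is meant only to show how a hypernetwork conditioned on \((s, \mathbf{w})\) can generate the weights of a small critic that evaluates the action \(a\).

```python
import torch
import torch.nn as nn

class HyperCritic(nn.Module):
    """Sketch of an A-critic-MO-hyper style critic: a hypernetwork conditioned
    on (state, preference) generates the weights of a small action-only critic."""

    def __init__(self, state_dim, pref_dim, action_dim, hidden=64):
        super().__init__()
        self.action_dim, self.hidden = action_dim, hidden
        # Sizes of the generated critic parameters: one hidden layer + scalar head.
        self.n_params = (action_dim * hidden + hidden) + (hidden * 1 + 1)
        # Hypernetwork body (placeholder MLP; the actual architecture uses
        # residual blocks with one head per generated layer).
        self.hyper = nn.Sequential(
            nn.Linear(state_dim + pref_dim, 256), nn.ReLU(),
            nn.Linear(256, self.n_params),
        )

    def forward(self, state, pref, action):
        theta = self.hyper(torch.cat([state, pref], dim=-1))
        b = state.shape[0]
        i = 0
        # Slice the generated parameter vector into the critic's layers.
        W1 = theta[:, i:i + self.action_dim * self.hidden].view(b, self.hidden, self.action_dim)
        i += self.action_dim * self.hidden
        b1 = theta[:, i:i + self.hidden]
        i += self.hidden
        W2 = theta[:, i:i + self.hidden].view(b, 1, self.hidden)
        i += self.hidden
        b2 = theta[:, i:i + 1]
        # Per-sample forward pass through the generated critic Q_{s,w}(a).
        h = torch.relu(torch.bmm(W1, action.unsqueeze(-1)).squeeze(-1) + b1)
        q = torch.bmm(W2, h.unsqueeze(-1)).squeeze(-1) + b2
        return q  # shape: (batch, 1)

# Usage (placeholder dimensions): q = HyperCritic(17, 2, 6)(states, prefs, actions)
```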

Experiment setup

Baseline methods
  • Concave Augmented Pareto Q-learning (CAPQL)
    • Uses a 2-layer MLP for both the Gaussian policy and the critic network.
  • CAPQL Large
    • CAPQL with a 3-layer MLP critic with ≈ 9M parameters (similar in size to the A-critic-MO-hyper).
  • Prediction-guided multi-objective reinforcement learning (PGMORL)
    • Evolutionary approach that predicts which preferences to train on.
  • Generalized Policy Improvement with Linear Support (GPI-LS)
    • Has theoretical guarantees on which preferences should be trained on. Known to be sample efficient.
Benchmark tasks
  • Multi-objective Half cheetah
    • Two objectives: Energy efficiency & forward speed.
    • Pareto-front consists of 4 different policy families.
  • Multi-objective Hopper
    • Two objectives: Jump height & forward speed.
    • Pareto-front consists of 2 different policy families.
  • Multi-objective Swimmer
    • Two objectives: Energy efficiency & forward speed.
    • Pareto-front consists of 3 different policy families.
  • A larger number of policy families indicates larger variance between the optimal behaviours for different preferences.
  • Exact environment definitions are adapted from Xu et al. [7].
Figure a): The Half cheetah environment.
Figure b): The Hopper environment.
Figure c): The Swimmer environment.

Performance indicators

  • Hypervolume⇑ measures the volume of the objective-space region, bounded by \(\mathbf{z}_{\mathrm{lower}}\) and \(\mathbf{z}_{\mathrm{upper}}\), that is dominated by the solutions of an approximate coverage set. This metric combines the uniformity, spread and convergence of the solution set and is able to indicate improvement in any of them [10]. Formally, the hypervolume is defined as: \[\begin{equation*} I^*_H(\mathcal{S}) = \int_{\mathbf{z}_{\mathrm{lower}}}^{\mathbf{z}_{\mathrm{upper}}} \mathbf{1}_{\mathcal{S}}(\mathbf{z})\,d\mathbf{z} \end{equation*}\] where \(\mathcal{S}\) is the solution set and \(\mathbf{1}_{\mathcal{S}}\) is an indicator function that is 1 when the vector \(\mathbf{z}\) is dominated by some vector \(\mathbf{s}\) in the solution set.
  • Sparsity⇓ measures how sparse the produced solution set is. It is defined as the average squared distance between adjacent points in the solution space (a numpy sketch of this and the EUM computation is given after this list): \[\begin{equation*} Sp(\mathcal{S}) = \frac{1}{|\mathcal{S}| - 1} \sum_{j=1}^m \sum_{i=1}^{|\mathcal{S}|-1} \left( \tilde{\mathcal{S}}_j(i) - \tilde{\mathcal{S}}_j(i+1) \right)^2 \end{equation*}\] where \(\mathcal{S}\) is the approximated solution set and \(\tilde{\mathcal{S}}_j\) is the sorted list of the \(j\)-th objective values. Since one always desires a denser approximation of the Pareto front, a lower sparsity is better.
  • Expected utility metric (EUM)⇑ describes the amount of utility the agent is expected to provide to the user on average, given the used family of utility functions. More formally, EUM is defined as \[\begin{equation*} \mathrm{EUM} = \mathbb{E}_{P_u}\left[ \underset{\mathbf{V}^\pi \in \mathcal{S}}{\max}\; u(\mathbf{V}^\pi) \right] \end{equation*}\] where \(\mathcal{S}\) is the solution set, \(u\) is the utility function and \(P_u\) is the distribution over the utility functions.
  • Inverted generational distance plus (IGD+)⇓ measures the quality of the given approximation of the Pareto front as the average modified distance from each point \(\mathbf{z}_k\) of a reference point set \(Z = \{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_{|Z|}\}\) to the nearest point in the solution set: \[\begin{equation*} IGD^+(\mathcal{S}) = \frac{1}{|Z|}\sum_{k=1}^{|Z|} d^+(\mathbf{z}_k, \mathbf{s}_{k}) \end{equation*}\] where \(d^+(\mathbf{z}, \mathbf{s}) = \sqrt{\sum_{i=1}^m \left(\max\{z_i - s_i, 0\}\right)^2}\) and \(\mathbf{s}_{k}\) is the nearest solution point to \(\mathbf{z}_k\).
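Under the definitions above, the sparsity and EUM indicators can be computed directly from a finite solution set; the sketch below is a plain numpy implementation, with EUM approximated by Monte-Carlo sampling of linear utility weights. The sample size and the example front are arbitrary choices for the illustration.

```python
import numpy as np

def sparsity(front):
    """Average squared distance between adjacent points of the solution set,
    computed per objective on the sorted values (lower is better)."""
    front = np.asarray(front, dtype=float)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(front.shape[1]):
        col = np.sort(front[:, j])
        total += np.sum((col[1:] - col[:-1]) ** 2)
    return total / (len(front) - 1)

def expected_utility(front, n_weights=10_000, rng=None):
    """Monte-Carlo estimate of EUM for linear utilities u(V, w) = w^T V,
    with w sampled uniformly from the simplex (higher is better)."""
    if rng is None:
        rng = np.random.default_rng(0)
    front = np.asarray(front, dtype=float)
    W = rng.dirichlet(np.ones(front.shape[1]), size=n_weights)  # (n_weights, d)
    # For each sampled w, take the best scalarized value in the solution set.
    return np.mean(np.max(W @ front.T, axis=1))

front = np.array([[1.0, 0.2], [0.7, 0.6], [0.4, 0.9]])  # toy solution set
print(sparsity(front), expected_utility(front))
```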

Performance

  • Overall, the hypernetworks do not improve the final performance of the algorithms.
  • Moreover, the increase in the parameter count results in larger run-to-run variance.
  • Passing the state and preferences explicitly to the critic improves the performance especially in the more complicated environments.
  • Interestingly, the hypernetwork-based approaches tend to overfit on the energy efficiency objective in Half cheetah and Swimmer, as can be seen in figure b).
Figure a): The interquartile mean (IQM) scores with 95% bootstrapped confidence intervals calculated over 5 runs for each of the used environments. GPI-LS was trained for 300k timesteps, PGMORL for 9M timesteps and the rest for 1.2M timesteps.
Figure b): Examples of the solution sets produced by the algorithms. Note the clustering on the energy efficiency objective in Half cheetah and Swimmer.

Sample efficiency

  • The hypernetwork-based approaches improve the convergence speed in Hopper and Half cheetah, while none of the methods were able to learn any reasonable policies in Swimmer.
  • The increased run-to-run variance introduced by the larger models is noticeable in all environments.
  • In Hopper, the CAPQL Large and hypernetwork-based approaches perform similarly, indicating that the performance improvements might be due to the increased parameter count rather than the hypernetwork architecture.
Figure: The IQM values of the performance metrics with 95% bootstrapped confidence intervals during training. The values were calculated over 5 runs.

Rollouts with SAP-critic-MO-hyper

\(\mathbf{w} = (1.0,\, 0.0)\)
\(\mathbf{w} = (0.75,\, 0.25)\)
\(\mathbf{w} = (0.5,\, 0.5)\)
\(\mathbf{w} = (0.25,\, 0.75)\)
\(\mathbf{w} = (0.0,\, 1.0)\)

Closer look at the training process

Figure a): The progress of the Pareto front during three different experiments: left: a failed experiment with SAP-critic-MO-hyper, middle: a successful experiment with SAP-critic-MO-hyper, right: a successful experiment with CAPQL.
Figure b): The preference distribution (top),
and the scalarized reward distribution (bottom) of the hypernetwork based approach as a function of timesteps during the leftmost experiment of figure a).
  • Clearly, the SAP-critic-MO-hyper overfits on the energy efficiency objective early on. Even in the successful experiment, the first 1 million steps are spent considering only policies that optimize the energy efficiency.
  • This indicates that the rewards from the energy efficiency objective must be significantly larger than those from the speed objective, which is verified by the scalarized reward distribution shown in figure b).
  • Curiously, CAPQL is also affected by the unbalanced objectives, since its solution set contains nearly optimal energy-saving policies early on. However, it does not overfit on the easier objective.

Overcoming the overfitting

  • To overcome the overfitting, a warm-up period was added. During this period, a skewed preference distribution was used, in which preferences favoring the forward speed objective had a higher chance of being selected (a hypothetical sketch of such a sampler is given after the figure captions).
  • After the warm-up period, the Gaussian distribution from the original CAPQL was used for sampling the preferences.
  • The warm-up phase increases the initial convergence speed significantly in Half cheetah, while simultaneously decreasing the run-to-run variance. However, there is a significant drop in performance soon after the warm-up phase ends.
  • Moreover, the warm-up distribution improves the coverage of the solution set, yet it introduces a gap in the region of policies that favor energy efficiency.
  • Unfortunately, not even the warm-up phase could improve the performance in Swimmer.
Figure a): Conceptual figure of the skewed preference distribution. Preferences belonging to the red area were given a 1.75 times higher chance of being selected than preferences outside this area.
Figure b): The performance of the hypernetwork approach with and without the warm-up phase. The vertical lines mark the end of the warm-up phase. The bottom figures show an example of the produced solution sets.
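The skewed warm-up distribution can be realized in several ways; the following sketch is a hypothetical weighted-sampling variant (not the thesis implementation): candidate preferences are drawn uniformly from the simplex, and those whose forward-speed weight exceeds a threshold are given a 1.75 times higher selection probability. The threshold, candidate count and speed index are assumptions made for the example.

```python
import numpy as np

def sample_warmup_preference(rng, n_candidates=512, speed_index=1,
                             speed_threshold=0.5, boost=1.75):
    """Sample a 2-objective preference vector, boosting the selection
    probability of candidates that favor the forward-speed objective."""
    # Candidate preferences drawn uniformly from the simplex.
    cand = rng.dirichlet(np.ones(2), size=n_candidates)
    # Candidates inside the "favor speed" region get a `boost`-times higher weight.
    weights = np.where(cand[:, speed_index] >= speed_threshold, boost, 1.0)
    probs = weights / weights.sum()
    return cand[rng.choice(n_candidates, p=probs)]

rng = np.random.default_rng(0)
w = sample_warmup_preference(rng)   # used during the warm-up phase
# After the warm-up phase, the base CAPQL preference distribution
# (not shown here) would be used instead.
print(w)
```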

Summary & Future research directions

Code | Thesis
Summary
  • Hypernetworks seem to be a promising option for improving the convergence speed of MORL algorithms in more complicated tasks.
  • However, hypernetworks do not seem to significantly improve the final performance over the base algorithm in all cases.
  • Hypernetworks are known to be fragile, and this was the case here as well.
  • The method used to select the preferences has a significant impact on the performance.
Future directions
  • This work used a hypernetwork only for the critic. However, the policy is also conditioned on the preferences, so using a hypernetwork for both the policy and the critic could be an option worth looking into.
  • Currently, GPI-LS offers unmatched sample efficiency. Applying hypernetworks to GPI-LS could provide insight into how much hypernetworks can improve the convergence speed in the best case.
  • Developing a more general algorithm for selecting the preferences during training.

References

[1] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,” Journal of Artificial Intelligence Research, vol. 48, 2013.

[2] C. F. Hayes et al., “A practical guide to multi-objective reinforcement learning and planning,” Autonomous Agents and Multi-Agent Systems, vol. 36, no. 1, 2022.

[3] X. Chen, A. Ghadirzadeh, M. Björkman, and P. Jensfelt, “Meta-learning for multi-objective reinforcement learning,” in 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2019.

[4] V. K. Chauhan, J. Zhou, P. Lu, S. Molaei, and D. A. Clifton, “A brief review of hypernetworks in deep learning,” Artificial Intelligence Review, vol. 57, no. 6, 2024.

[5] H. Lu, D. Herman, and Y. Yu, “Multi-objective reinforcement learning: Convexity, stationarity and Pareto optimality,” in The eleventh international conference on learning representations, 2023.

[6] E. Sarafian, S. Keynan, and S. Kraus, “Recomposing the reinforcement learning building blocks with hypernetworks,” in Proceedings of the 38th international conference on machine learning, 2021.

[7] J. Xu, Y. Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction-guided multi-objective reinforcement learning for continuous robot control,” in Proceedings of the 37th international conference on machine learning, 2020.

[8] L. N. Alegre, A. L. C. Bazzan, D. M. Roijers, A. Nowé, and B. C. da Silva, “Sample-efficient multi-objective learning via generalized policy improvement prioritization,” in AAMAS ’23: Proceedings of the 2023 international conference on autonomous agents and multiagent systems, 2023.

[9] L. M. Zintgraf, T. V. Kanters, D. M. Roijers, F. Oliehoek, and P. Beau, “Quality assessment of MORL algorithms: A utility-based approach,” in Benelearn 2015: Proceedings of the 24th annual machine learning conference of Belgium and the Netherlands, 2015.

[10] E. Zitzler, J. Knowles, and L. Thiele, “Quality assessment of Pareto set approximations,” in Multiobjective optimization: Interactive and evolutionary approaches, Springer, 2008.

[11] H. Ishibuchi, H. Masuda, Y. Tanigaki, and Y. Nojima, “Modified distance calculation in generational distance and inverted generational distance,” in Evolutionary multi-criterion optimization, 2015.