WQMIX

Overview

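WQMIX was first proposed in Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. The authors study the particular function space represented by the monotonic structure of the Q-functions in QMIX, and the tradeoffs in representation quality across different joint actions that projecting into this space induces. They show that the projection in QMIX can fail to recover the optimal policy, which stems primarily from the equal weighting placed on each joint action. WQMIX rectifies this by introducing a weighting into the projection, so that more importance is placed on the better joint actions.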

WQMIX proposes two weighting schemes and proves that they recover the correct maximal action for any joint action Q-values.

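WQMIX introduces two scalable versions, Centrally-Weighted (CW) QMIX and Optimistically-Weighted (OW) QMIX, and demonstrates improved performance on both predator-prey and challenging multi-agent StarCraft benchmark tasks.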

Quick Facts

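WQMIX is a model-free, value-based multi-agent RL algorithm that follows the centralized training with decentralized execution paradigm. It only supports discrete action spaces.
WQMIX considers a partially observable scenario in which each agent only obtains individual observations.
WQMIX accepts DRQN as the individual value network.
WQMIX represents the joint value function using an architecture consisting of agent networks and a mixing network. The mixing network is a feed-forward neural network that takes the agent network outputs as input and mixes them monotonically, producing joint action values.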
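
The monotonic mixing mentioned above can be illustrated with a minimal sketch. It assumes the simplest possible mixer, a single linear layer whose weights are made non-negative by taking absolute values; the function name `monotonic_mix` and the numbers are hypothetical, and a real mixing network would typically generate its weights from the global state rather than use fixed ones.

```python
from typing import List


def monotonic_mix(agent_qs: List[float], raw_weights: List[float], bias: float) -> float:
    """Mix per-agent Q-values into a joint value.

    Taking abs() of the unconstrained weights makes every partial
    derivative dQ_tot/dQ_a non-negative, which is the monotonicity
    constraint. This single linear layer is an illustrative assumption.
    """
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


raw_weights = [-0.7, 1.3, 0.2]  # hypothetical unconstrained parameters
low = monotonic_mix([1.0, 2.0, 3.0], raw_weights, bias=0.5)
high = monotonic_mix([1.0, 2.5, 3.0], raw_weights, bias=0.5)  # second agent's Q raised

# Raising any single agent's Q-value can never lower the mixed value.
print(low, high)
```

Because the weights are non-negative, each agent can pick its own greedy action independently and the resulting joint action is also greedy for the mixed value, which is what makes decentralized execution possible.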

Key Equations or Key Graphs

The overall WQMIX architecture, including the individual agent networks and the mixing network structure:

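First, WQMIX examines an operator that represents an idealised version of QMIX in a tabular setting. The main purpose of this analysis is to understand the fundamental limitations of QMIX that stem from its training objective and the restricted function class it uses. This function class is the space of all \(Q_{tot}\) that can be represented as monotonic functions of tabular \(Q_{a}(s,u)\):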

\[\mathcal{Q}^{m i x}:=\left\{Q_{t o t} \mid Q_{t o t}(s, \mathbf{u})=f_{s}\left(Q_{1}\left(s, u_{1}\right), \ldots Q_{n}\left(s, u_{n}\right)\right), \frac{\partial f_{s}}{\partial Q_{a}} \geq 0, Q_{a}(s, u) \in \mathbb{R}\right\}\]

At each iteration of the idealised QMIX algorithm, \(Q_{tot}\) is constrained to lie in the above space by solving the following optimisation problem:

\[\underset{q \in \mathcal{Q}^{m i x}}{\operatorname{argmin}} \sum_{\mathbf{u} \in \mathbf{U}}\left(\mathcal{T}^{*} Q_{t o t}(s, \mathbf{u})-q(s, \mathbf{u})\right)^{2}\]

where the Bellman optimality operator is defined by:

\[\mathcal{T}^{*} Q(s, \mathbf{u}):=\mathbb{E}\left[r+\gamma \max _{\mathbf{u}^{\prime}} Q\left(s^{\prime}, \mathbf{u}^{\prime}\right)\right]\]
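
As a small worked illustration of the Bellman optimality operator, the sketch below applies it to a hypothetical deterministic toy problem with two states and two joint actions per state. The transition table, the function name `bellman_optimality_backup`, and the value of gamma are all assumptions made for the example; with deterministic transitions the expectation reduces to a single term.

```python
from typing import Dict, Tuple

Key = Tuple[int, int]  # (state, joint-action index)

# Hypothetical deterministic toy problem: (s, u) -> (reward, next state).
TRANSITIONS: Dict[Key, Tuple[float, int]] = {
    (0, 0): (0.0, 1),
    (0, 1): (1.0, 0),
    (1, 0): (2.0, 1),
    (1, 1): (0.0, 0),
}


def bellman_optimality_backup(q: Dict[Key, float], gamma: float) -> Dict[Key, float]:
    """Return T*Q, i.e. r + gamma * max_u' Q(s', u') for every (s, u)."""
    backed_up = {}
    for (s, u), (reward, s_next) in TRANSITIONS.items():
        # Maximum over all joint actions available in the next state.
        best_next = max(v for (s2, _), v in q.items() if s2 == s_next)
        backed_up[(s, u)] = reward + gamma * best_next
    return backed_up


q0 = {key: 0.0 for key in TRANSITIONS}
q1 = bellman_optimality_backup(q0, gamma=0.9)  # equals the immediate rewards
q2 = bellman_optimality_backup(q1, gamma=0.9)  # one step of bootstrapping
print(q2)
```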

The corresponding projection operator \(\Pi_{\mathrm{Qmix}}\) is then defined as follows:

\[\Pi_{\mathrm{Qmix}} Q:=\underset{q \in \mathcal{Q}^{\text {mix }}}{\operatorname{argmin}} \sum_{\mathbf{u} \in \mathbf{U}}(Q(s, \mathbf{u})-q(s, \mathbf{u}))^{2}\]

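\(T_{*}^{Qmix} := \Pi_{\mathrm{Qmix}} \mathcal{T}^{*}\) denotes the composition of this projection with the Bellman optimality operator. Properties of \(T_{*}^{Qmix}\):

\(T_{*}^{Qmix}\) is not a contraction.
QMIX's argmax is not always correct.
QMIX can underestimate the value of the optimal joint action.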

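The WQMIX paper argues that the equal weighting over joint actions in QMIX's optimisation is what can cause the argmax of the objective-minimising solution to be incorrect. To prioritise estimating \(Q_{tot}(\mathbf{u}^{*})\) well, while still anchoring down the value estimates for other joint actions, a suitable weighting function w can be added to the projection operator of QMIX: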

\[\Pi_{w} Q:=\underset{q \in \mathcal{Q}^{\text {mix }}}{\operatorname{argmin}} \sum_{\mathbf{u} \in \mathbf{U}} w(s, \mathbf{u})(Q(s, \mathbf{u})-q(s, \mathbf{u}))^{2}\]

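The choice of weighting is crucial to ensure that WQMIX can overcome the limitations of QMIX. WQMIX considers two different weightings and proves that these choices of w ensure that the \(Q_{tot}\) returned from the projection has the correct argmax.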
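
The effect of the weighting can be illustrated on a small tabular example. The sketch below is not the paper's procedure: for tractability it assumes an additive subclass of \(\mathcal{Q}^{mix}\), \(q(u_1, u_2) = a[u_1] + b[u_2]\), and solves the weighted least-squares projection by alternating minimisation. The 3x3 target matrix, the value of alpha, and all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def weighted_additive_projection(q: Matrix, w: Matrix, iters: int = 2000) -> Matrix:
    """Minimise sum_ij w_ij * (q_ij - a_i - b_j)^2 by alternating exact updates."""
    n_rows, n_cols = len(q), len(q[0])
    a = [0.0] * n_rows
    b = [0.0] * n_cols
    for _ in range(iters):
        for i in range(n_rows):
            a[i] = sum(w[i][j] * (q[i][j] - b[j]) for j in range(n_cols)) / sum(w[i])
        for j in range(n_cols):
            col_weight = sum(w[i][j] for i in range(n_rows))
            b[j] = sum(w[i][j] * (q[i][j] - a[i]) for i in range(n_rows)) / col_weight
    return [[a[i] + b[j] for j in range(n_cols)] for i in range(n_rows)]


def argmax2d(m: Matrix) -> Tuple[int, int]:
    cells = ((i, j) for i in range(len(m)) for j in range(len(m[0])))
    return max(cells, key=lambda ij: m[ij[0]][ij[1]])


# Hypothetical target: the best joint action (0, 0) is surrounded by large penalties.
q_target = [[8.0, -12.0, -12.0], [-12.0, 0.0, 0.0], [-12.0, 0.0, 0.0]]
best = argmax2d(q_target)

alpha = 0.1  # hypothetical down-weighting factor for suboptimal joint actions
uniform = [[1.0] * 3 for _ in range(3)]
weighted = [[1.0 if (i, j) == best else alpha for j in range(3)] for i in range(3)]

q_uniform = weighted_additive_projection(q_target, uniform)
q_weighted = weighted_additive_projection(q_target, weighted)
print(argmax2d(q_uniform), argmax2d(q_weighted))
```

Under these assumptions the uniformly weighted fit places its maximum on one of the zero-valued cells, whereas down-weighting the suboptimal joint actions moves the maximum of the fit back to the true best joint action.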

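Idealised Central Weighting: this weighting down-weights every suboptimal action. However, it requires computing the maximum over the joint action space, which is often infeasible. In its implementation, WQMIX uses an approximation to this weighting in the deep RL setting.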

\[\begin{split}w(s, \mathbf{u})=\left\{\begin{array}{ll} 1 & \mathbf{u}=\mathbf{u}^{*}=\operatorname{argmax}_{\mathbf{u}} Q(s, \mathbf{u}) \\ \alpha & \text { otherwise } \end{array}\right.\end{split}\]
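
For a single state with a small tabular Q, the idealised central weighting can be computed directly. This is a minimal sketch: the function name `central_weights`, the example values, and the choice to give weight 1 to every joint action tied for the maximum are assumptions.

```python
from typing import Dict, Tuple

JointAction = Tuple[int, ...]


def central_weights(q: Dict[JointAction, float], alpha: float) -> Dict[JointAction, float]:
    """Weight 1 for the maximising joint action(s), alpha for all others."""
    best_value = max(q.values())
    return {u: (1.0 if value == best_value else alpha) for u, value in q.items()}


# Hypothetical Q-values for two agents with two actions each, in one state.
q_state = {(0, 0): 8.0, (0, 1): -12.0, (1, 0): -12.0, (1, 1): 0.0}
cw = central_weights(q_state, alpha=0.1)
print(cw)
```

The `max` over `q.values()` is exactly the step that becomes infeasible when the joint action space is large.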

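Optimistic Weighting: this weighting assigns a higher weight to those joint actions that are underestimated relative to Q, and hence could be the true optimal actions (in an optimistic outlook).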

\[\begin{split}w(s, \mathbf{u})=\left\{\begin{array}{ll} 1 & Q_{t o t}(s, \mathbf{u})<Q(s, \mathbf{u}) \\ \alpha & \text { otherwise } \end{array}\right.\end{split}\]
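
The optimistic weighting only compares the current \(Q_{tot}\) estimate with the target Q for each joint action, so no maximisation over joint actions is needed. The sketch below assumes tabular values for a single state; the function name `optimistic_weights` and the numbers are hypothetical.

```python
from typing import Dict, Tuple

JointAction = Tuple[int, ...]


def optimistic_weights(
    q_tot: Dict[JointAction, float],
    q: Dict[JointAction, float],
    alpha: float,
) -> Dict[JointAction, float]:
    """Weight 1 where Q_tot underestimates Q, alpha everywhere else."""
    return {u: (1.0 if q_tot[u] < q[u] else alpha) for u in q}


# Hypothetical target values and a current monotonic estimate for one state.
q_state = {(0, 0): 8.0, (0, 1): -12.0, (1, 0): -12.0, (1, 1): 0.0}
q_tot_state = {(0, 0): -6.0, (0, 1): -4.0, (1, 0): -4.0, (1, 1): -2.0}
ow = optimistic_weights(q_tot_state, q_state, alpha=0.1)
print(ow)
```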

For a detailed analysis, please refer to the WQMIX paper.

Implementations

The default config is defined as follows:

class ding.policy.wqmix.WQMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)

Overview:

Policy class of the WQMIX algorithm. WQMIX is a reinforcement learning algorithm modified from QMIX;
the paper is available at https://arxiv.org/abs/2006.10800

Interface:

_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn
_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval, _reset_eval, _get_train_sample, default_model

Config:

ID	Symbol	Type	Default Value	Description	Other (Shape)
1	type	str	qmix	RL policy register name, refer to registry POLICY_REGISTRY	this arg is optional, a placeholder
2	cuda	bool	True	Whether to use cuda for network	this arg can be different from modes
3	on_policy	bool	False	Whether the RL algorithm is on-policy or off-policy	
4	priority	bool	False	Whether to use priority (PER)	priority sample, update priority
5	priority_IS_weight	bool	False	Whether to use Importance Sampling weight to correct biased update	IS weight
6	learn.update_per_collect	int	20	How many updates (iterations) to train after one collection by the collector. Only valid in serial training	this arg can vary across envs. A bigger value means more off-policy
7	learn.target_update_theta	float	0.001	Target network update momentum parameter	between [0, 1]
8	learn.discount_factor	float	0.99	Reward's future discount factor, aka. gamma	may be 1 in sparse-reward envs
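
The defaults listed above can be written out as a nested Python dict. This is a sketch only: it assumes the dotted names such as `learn.update_per_collect` map onto a nested `learn` sub-dict, and the helper `soft_update` is a hypothetical illustration of a momentum-style target update, not the library's own routine.

```python
from typing import List

# Default values copied from the config table; the nesting is an assumption.
wqmix_config = dict(
    type='qmix',
    cuda=True,
    on_policy=False,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=20,
        target_update_theta=0.001,
        discount_factor=0.99,
    ),
)


def soft_update(target: List[float], online: List[float], theta: float) -> List[float]:
    """Hypothetical momentum update: move each target parameter a fraction theta toward the online one."""
    return [(1.0 - theta) * t + theta * o for t, o in zip(target, online)]


theta = wqmix_config['learn']['target_update_theta']
new_target = soft_update([0.0, 2.0], [1.0, 2.0], theta)
print(new_target)
```

With a value as small as 0.001, the target parameters under this assumed rule trail the online parameters slowly, which is the usual motivation for keeping the momentum parameter close to 0.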

The network interface used by WQMIX is defined as follows:

class ding.model.template.WQMix(agent_num: int, obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, lstm_type: str = 'gru', dueling: bool = False)

Overview:
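WQMIX (https://arxiv.org/abs/2006.10800) network, which contains two components: 1) Q_tot, the same as the QMIX network, composed of the agent Q network and the mixer network. 2) An unrestricted joint action Q_star, composed of the agent Q network and the mixer_star network. The QMIX paper mentions that all agents share local Q network parameters, so only one Q network is initialized in Q_tot or Q_star.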

Interface:
__init__, forward.

forward(data: dict, single_step: bool = True, q_star: bool = False) → dict

Overview:
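Forward computation graph of the WQMix network. The input dict includes time series observations and related data used to predict the total q_value and each agent's q_value. The q_star parameter determines whether Q_tot or Q_star is computed.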

Arguments:

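data (dict): Input data dict with keys ['obs', 'prev_state', 'action'].

agent_state (torch.Tensor): Time series local observation data of each agent.

global_state (torch.Tensor): Time series global observation data.

prev_state (list): Previous rnn state for q_network or _q_network_star.

action (torch.Tensor or None): If action is None, use the argmax q_value index as action to calculate agent_q_act.

single_step (bool): Whether to forward a single step. If so, a timestep dimension is added before the forward pass and removed after it.

q_star (bool): Whether to run the forward pass through the Q_star network. If True, the Q_star network is used, in which the agent networks have the same architecture as the Q network but do not share parameters, and the mixing network is a feed-forward network with 3 hidden layers of 256 dimensions. If False, the Q network is used, the same as the Q network in the QMIX paper.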

Returns:

ret (dict): Output data dict with keys [total_q, logit, next_state].

total_q (torch.Tensor): Total q_value, which is the result of the mixer network.

agent_q (torch.Tensor): Each agent's q_value.

next_state (list): Next rnn state.

Shapes:

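agent_state (torch.Tensor): \((T, B, A, N)\), where T is the timestep, B is the batch size, A is the number of agents, and N is the observation shape.

global_state (torch.Tensor): \((T, B, M)\), where M is the global observation shape.

prev_state (list): \((T, B, A)\), a list of length B in which each element is a list of length A.

action (torch.Tensor): \((T, B, A)\).

total_q (torch.Tensor): \((T, B)\).

agent_q (torch.Tensor): \((T, B, A, P)\), where P is the action shape.

next_state (list): \((T, B, A)\), a list of length B in which each element is a list of length A.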
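
To make the documented input layout concrete, the sketch below builds a placeholder `data` dict with the listed shapes. It uses plain nested Python lists as stand-ins for `torch.Tensor` objects and does not call the model. The helper names `zeros` and `shape_of`, the dimension values, the nesting of `agent_state` and `global_state` under `obs`, and the use of `None` for the initial RNN state are all assumptions.

```python
from typing import Any, Tuple


def zeros(shape: Tuple[int, ...], fill: Any = 0.0) -> list:
    """Build a nested list of the given shape (a stand-in for a tensor)."""
    if len(shape) == 1:
        return [fill] * shape[0]
    return [zeros(shape[1:], fill) for _ in range(shape[0])]


def shape_of(x: Any) -> Tuple[int, ...]:
    """Recover the shape of a nested list by following first elements."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Hypothetical sizes: timesteps, batch, agents, local obs, global obs.
T, B, A, N, M = 4, 2, 3, 5, 7

data = {
    'obs': {
        'agent_state': zeros((T, B, A, N)),
        'global_state': zeros((T, B, M)),
    },
    # A list of length B, each element a list of length A.
    'prev_state': [[None for _ in range(A)] for _ in range(B)],
    'action': zeros((T, B, A), fill=0),
}

print(shape_of(data['obs']['agent_state']), shape_of(data['action']))
```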

Benchmark

The following are benchmark results for WQMIX as implemented in DI-engine on SMAC (Samvelyan et al. 2019), a set of StarCraft micromanagement problems.

SMAC map	best mean reward	config link	comparison
MMM	1.00	config_link_M	wqmix(Tabish) (1.0)
3s5z	0.72	config_link_3	wqmix(Tabish) (0.94)
5m6m	0.45	config_link_5	wqmix(Tabish) (0.9)

References

Rashid, Tabish, et al. Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2006.10800, 2020.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. International Conference on Machine Learning. PMLR, 2018.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, Thore Graepel. Value-Decomposition Networks For Cooperative Multi-Agent Learning. arXiv preprint arXiv:1706.05296, 2017.
Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. International Conference on Machine Learning. PMLR, 2019.
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.
