QMIX¶

概述¶

QMIX 是由 Rashid et al.(2018) 提出的，用于在多智能体集中式训练中学习基于全局状态信息的联合动作价值函数，并从集中式端到端框架中提取分布式执行策略。 QMIX 使用集中式神经网络来估计联合动作值，作为基于局部观察的每个智能体动作值的复杂非线性组合，这为集中式动作价值函数提供了一种新的表示，并保证了集中式和分散式策略之间的一致性。

QMIX 是 VDN(Sunehag et al. 2017) 的非线性扩展。与 VDN(Value-Decomposition Networks For Cooperative Multi-Agent Learning ) 相比，QMIX 在训练过程中可以通过超网络(hyper-network)输入的全局信息表示更多的额外状态信息（智能体观测范围外），并且可以表示更丰富的动作价值函数类。

核心要点¶

QMIX 使用 集中式训练与分散式执行(centralized training with decentralized execution) 的范式。
QMIX 是一种 无模型(model-free)、基于价值(value-based)、异策略(off-policy)、多智能体(multi-agent) 的强化学习方法。
QMIX 仅支持 离散(discrete) 动作空间。
QMIX 考虑了一种 部分可观察(partially observable) 的情景，其中每个智能体只获得个体观察。
QMIX 接受 DRQN 作为个体价值网络来解决 部分可观察 问题。
QMIX 使用由 智能体网络(agent networks)、混合网络(mixing network)、超网络(hyper-network) 组成的架构来表示联合价值函数。混合网络是一个前馈神经网络，它将智能体网络的输出作为输入并单调地混合它们，产生联合动作值。混合网络的权重由单独的超网络产生。

关键方程或关键图形¶

VDN 和 QMIX 是使用将联合动作价值函数 $Q_{tot}$ 分解为用于分散执行的个体函数 $Q_a$ 的思想的代表性方法。

为了实现集中式训练与分散式执行 (centralized training with decentralized execution CTDE)，我们需要确保在 $Q_{tot}$ 上执行的全局 $argmax$ 与在每个 $Q_a$ 上执行的一组单独的 $argmax$ 操作产生相同的结果：

\[\begin{split}$\arg \max _{\boldsymbol{u}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \boldsymbol{u})=\left(\begin{array}{c}\arg \max _{u_{1}} Q_{1}\left(\tau_{1}, u_{1}\right) \\ \vdots \\ \arg \max _{u_{N}} Q_{N}\left(\tau_{n}, u_{N}\right)\end{array}\right)$\end{split}\]

VDN 将联合动作价值函数分解为个体动作价值函数之和。 $$Q_{\mathrm{tot}}(\boldsymbol{\tau}, \boldsymbol{u})=\sum_{i=1}^{N} Q_{i}\left(\tau_{i}, u_{i}\right)$$

QMIX 扩展了这种加法值分解，将联合动作价值函数表示为一个单调函数。QMIX 基于单调性，即对联合动作值 $Q_{tot}$ 和个体动作值 $Q_a$ 之间关系的约束。

\[\frac{\partial Q_{tot}}{\partial Q_{a}} \geq 0， \forall a \in A\]

QMIX 的整体架构包括个体智能体网络、混合网络和超网络：

QMIX 通过最小化下面的损失函数来训练混合网络：

\[y^{tot} = r + \gamma \max_{\textbf{u}^{’}}Q_{tot}(\tau^{'}, \textbf{u}^{'}, s^{'}; \theta^{-})\]

\[\mathcal{L}(\theta) = \sum_{i=1}^{b} [(y_{i}^{tot} - Q_{tot}(\tau, \textbf{u}, s; \theta))^{2}]\]

混合网络的每个权重都是由独立的超网络产生的，它以全局状态作为输入并输出混合网络一层的权重。更多细节可以在原始论文 Rashid et al.(2018) 中找到。

VDN 和 QMIX 是试图分解 $Q_tot$ 的方法，分别假设可加性和单调性。因此，满足这些条件的联合动作价值函数将被 VDN 和 QMIX 很好地分解。然而，存在一些任务，其联合动作价值函数不满足所述条件。 QTRAN (Son et al. 2019) 提出了一种通过将原始联合动作价值函数转换为容易分解的函数来摆脱这种结构约束的分解方法。 QTRAN (QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning) 保证了比 VDN 或 QMIX 更一般的分解。

实现¶

算法的默认设置如下：

class ding.policy.qmix.QMIXPolicy(cfg: EasyDict, model: Module | None = None, enable_field: List[str] | None = None)[source]

Overview:
QMIX算法的策略类。QMIX是一种多智能体强化学习算法，您可以通过以下链接查看论文https://arxiv.org/abs/1803.11485。

Config:

ID

符号

类型

默认值

描述

其他（形状）

1

type

字符串

qmix

RL policy register name, refer to

registry POLICY_REGISTRY

this arg is optional,

a placeholder

2

cuda

布尔

真

Whether to use cuda for network

this arg can be diff-

erent from modes

3

on_policy

布尔

假

Whether the RL algorithm is on-policy

or off-policy

priority

布尔

假

Whether use priority(PER)

priority sample,

update priority

5

priority_

IS_weight

布尔

假

Whether use Importance Sampling

Weight to correct biased update.

IS weight

6

learn.update_

per_collect

整数

20

How many updates(iterations) to train

after collector’s one collection. Only

valid in serial training

this args can be vary

from envs. Bigger val

means more off-policy

7

learn.target_

update_theta

浮点数

0.001

Target network update momentum

parameter.

between[0,1]

8

learn.discount

_factor

浮点数

0.99

Reward’s future discount factor, aka.

gamma

may be 1 when sparse

reward env

QMIX 使用的网络接口定义如下：

class ding.model.template.QMix(agent_num: int, obs_shape: int, global_obs_shape: int | List[int], action_shape: int, hidden_size_list: list, mixer: bool = True, lstm_type: str = 'gru', activation: Module = ReLU(), dueling: bool = False)[source]

Overview:
与QMIX相关的算法的神经网络和计算图(https://arxiv.org/abs/1803.11485)。QMIX由两部分组成：代理Q网络和混合器（可选）。QMIX论文提到所有代理共享本地Q网络参数，因此这里只初始化一个Q网络。然后根据mixer设置使用求和或混合器网络处理本地Q以获得全局Q。

Interface:
__init__, forward.

forward(data: dict, single_step: bool = True) → dict[source]

Overview:
QMIX 前向计算图，输入字典包括时间序列观察和相关数据，用于预测总 q_value 和每个代理的 q_value。

Arguments:

data (dict): Input data dict with keys [‘obs’, ‘prev_state’, ‘action’].

agent_state (torch.Tensor): 每个代理的时间序列局部观测数据。

global_state (torch.Tensor): 时间序列全局观测数据。

prev_state (list): 用于 q_network 的先前rnn状态。

动作 (torch.Tensor 或 None): 每个代理在函数外部给出的动作。如果动作为 None，则使用 argmax q_value 索引作为动作来计算 agent_q_act。

single_step (bool): 是否单步前进，如果是，则在前进前添加时间步维度并在前进后移除它。

Returns:

ret (dict): 输出数据字典，包含键 [total_q, logit, next_state].

ReturnsKeys:

total_q (torch.Tensor): 总q值，这是混合器网络的结果。

agent_q (torch.Tensor): 每个代理的q值。

next_state (list): q_network 的下一个 RNN 状态。

Shapes:

agent_state (torch.Tensor): $(T, B, A, N)$, 其中 T 是时间步，B 是批量大小，A 是代理数量，N 是观测形状。

全局状态 (torch.Tensor): $(T, B, M)$, 其中 M 是全局观测形状。

prev_state (list): 数学:(B, A), 一个长度为B的列表，每个元素是一个长度为A的列表。

动作 (torch.Tensor): $(T, B, A)$.

total_q (torch.Tensor): $(T, B)$.

agent_q (torch.Tensor): $(T, B, A, P)$, 其中 P 是动作形状。

next_state (list): 数学:(B, A), 一个长度为B的列表，每个元素是一个长度为A的列表。

基准测试¶

Benchmark and comparison of qmix algorithm¶
环境	最佳平均奖励	配置链接	比较
MMM	1	config_link_MMM	Pymarl(1)
3s5z	1	config_link_3s5z	Pymarl(1)
MMM2	0.8	config_link_MMM2	Pymarl(0.7)
5m6m	0.6	config_link_5m6m	Pymarl(0.76)
2c_vs_64zg	1	config_link_2c_vs_64zg	Pymarl(1)

附注：

上述结果是通过在五个不同的随机种子 (0, 1, 2, 3, 4) 上运行相同的配置获得的。

2. 对于像 QMIX 这样的多智能体离散动作空间算法，通常使用 SMAC 环境集进行测试，并通常通过最高平均奖励训练 10M env_step 进行评估。有关 SMAC 的更多详细信息，请参阅 SMAC Env 教程 SMAC Env Tutorial 。

引用¶

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson. Qmix: 深度多智能体强化学习的单调值函数分解. 国际机器学习会议. PMLR, 2018.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, Thore Graepel. 用于协作多智能体学习的价值分解网络. arXiv 预印本 arXiv:1706.05296, 2017.
Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, Yung Yi. QTRAN: 学习通过转换进行因子分解以实现合作多智能体强化学习。国际机器学习会议。PMLR, 2019.
Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. 《星际争霸多智能体挑战》。arXiv预印本 arXiv:1902.04043, 2019.

其他开源实现¶

pymarl

ID	符号	类型	默认值	描述	其他（形状）
1	`type`	字符串	qmix	RL policy register name, refer to registry `POLICY_REGISTRY`	this arg is optional, a placeholder
2	`cuda`	布尔	真	Whether to use cuda for network	this arg can be diff- erent from modes
3	`on_policy`	布尔	假	Whether the RL algorithm is on-policy or off-policy
	`priority`	布尔	假	Whether use priority(PER)	priority sample, update priority
5	`priority_` `IS_weight`	布尔	假	Whether use Importance Sampling Weight to correct biased update.	IS weight
6	`learn.update_` `per_collect`	整数	20	How many updates(iterations) to train after collector’s one collection. Only valid in serial training	this args can be vary from envs. Bigger val means more off-policy
7	`learn.target_` `update_theta`	浮点数	0.001	Target network update momentum parameter.	between[0,1]
8	`learn.discount` `_factor`	浮点数	0.99	Reward’s future discount factor, aka. gamma	may be 1 when sparse reward env