实现一个随机变量分布#

本指南概述了如何为 PyMC 实现一个分布。它专为希望向库中添加新分布的开发人员设计。用户不会意识到所有这些复杂性，而应改用 ~pymc.CustomDist 等辅助方法。

PyMC Distribution 建立在 PyTensor 的 RandomVariable 之上，并实现了 logp、logcdf、icdf 和 moment 方法，以及其他初始化和验证辅助工具。最值得注意的是 shape/dims/observed kwargs、替代参数化以及默认的 transform。

以下是实施新发行版所需的步骤摘要检查表。每个部分将在下面展开：

创建一个新的 RandomVariable Op
实现相应的 Distribution 类
为新的 RandomVariable 添加测试
为 logp / logcdf / icdf 和 moment 方法添加测试
记录新的 Distribution。

本指南不试图解释 Distributions 当前实现的原理，仅在有助于实现新的“标准”分布时提供详细信息。

1. Creating a new `RandomVariable` `Op`#

RandomVariable 负责实现随机采样方法。RandomVariable 还负责参数广播和形状推断。

在创建新的 RandomVariable 之前，请确保它尚未在 NumPy 库 中提供。如果已经存在，应首先将其添加到 PyTensor 库中，然后再导入到 PyMC 库中。

此外，并不总是需要实现一个新的 RandomVariable。例如，如果新的 Distribution 只是现有 Distribution 的一个特殊参数化。OrderedLogistic 和 OrderedProbit 就是 Categorical 分布的特殊参数化。

以下代码片段展示了如何创建一个新的 RandomVariable：

from pytensor.tensor.var import TensorVariable
from pytensor.tensor.random.op import RandomVariable
from typing import List, Tuple

# Create your own `RandomVariable`...
class BlahRV(RandomVariable):
    name: str = "blah"

    # Provide the minimum number of (output) dimensions for this RV
    # (e.g. `0` for a scalar, `1` for a vector, etc.)
    ndim_supp: int = 0

    # Provide the number of (input) dimensions for each parameter of the RV
    # (e.g. if there's only one vector parameter, `[1]`; for two parameters,
    # one a matrix and the other a scalar, `[2, 0]`; etc.)
    ndims_params: List[int] = [0, 0]

    # The NumPy/PyTensor dtype for this RV (e.g. `"int32"`, `"int64"`).
    # The standard in the library is `"int64"` for discrete variables
    # and `"floatX"` for continuous variables
    dtype: str = "floatX"

    # A pretty text and LaTeX representation for the RV
    _print_name: Tuple[str, str] = ("blah", "\\operatorname{blah}")

    # If you want to add a custom signature and default values for the
    # parameters, do it like this. Otherwise this can be left out.
    def __call__(self, loc=0.0, scale=1.0, **kwargs) -> TensorVariable:
        return super().__call__(loc, scale, **kwargs)

    # This is the Python code that produces samples.  Its signature will always
    # start with a NumPy `RandomState` object, then the distribution
    # parameters, and, finally, the size.

    @classmethod
    def rng_fn(
        cls,
        rng: np.random.RandomState,
        loc: np.ndarray,
        scale: np.ndarray,
        size: Tuple[int, ...],
    ) -> np.ndarray:
        return scipy.stats.blah.rvs(loc, scale, random_state=rng, size=size)

# Create the actual `RandomVariable` `Op`...
blah = BlahRV()

一些需要记住的重要事项：

rng_fn 方法内的所有内容都是纯 Python 代码（输入也是如此），并且 不应该 使用其他 PyTensor 符号操作。随机方法应使用 rng，这是一个 NumPy RandomGenerator，以便样本是可重复的。
非默认的 RandomVariable 维度将通过 size kwarg 传递到 rng_fn 中。rng_fn 需要考虑这一点以确保输出正确。size 是 NumPy 和 SciPy 使用的规范，对于单变量分布，它的工作方式类似于 PyMC 的 shape，但对于多变量分布则不同。对于多变量分布，size 排除了 ndim_supp 支持维度，而 TensorVariable 或 ndarray 的 shape 则包括了支持维度。更多上下文请查看维度笔记本。
PyTensor 可以自动推断单变量 RandomVariables (ndim_supp=0) 的输出形状。对于多变量分布 (ndim_supp>=1)，必须在新的 RandomVariable 类中实现 _supp_shape_from_params 方法。该方法返回给定其参数的 RV 的支持维度。在某些情况下，这可以从其参数之一的形状推导出来，在这种情况下，可以使用助手 pytensor.tensor.random.utils.supp_shape_from_ref_param_shape()，如在 DirichletMultinomialRV 中那样。在其他情况下，参数值（而不是它们的形状）可能决定分布的支持形状，如在 ~pymc.distributions.multivarite._LKJCholeskyCovRV 中发生的那样。在更简单的情况下，它们可能是常量。
在新 rng_fn 中使用其他 PyTensor 和 PyMC RandomVariables 的 rng_fn 类方法 是可以的。例如，如果您正在实现一个负半正态 RandomVariable，您的 rng_fn 可以简单地返回 - halfnormal.rng_fn(rng, scale, size)。

注意：除了 size，PyMC API 还提供了 shape、dims 和 observed 作为定义分布维度的替代方法，但这由 Distribution 处理，不需要任何额外的更改。

为了快速测试你的新 RandomVariable Op 是否正常工作，你可以使用必要的参数调用 Op，然后对返回的对象调用 draw：

# blah = pytensor.tensor.random.uniform in this example
# multiple calls with the same seed should return the same values
pm.draw(blah([0, 0], [1, 2], size=(10, 2)), random_seed=1)

# array([[0.83674527, 0.76593773],
#    [0.00958496, 1.85742402],
#    [0.74001876, 0.6515534 ],
#    [0.95134629, 1.23564938],
#    [0.41460156, 0.33241175],
#    [0.66707807, 1.62134924],
#    [0.20748312, 0.45307477],
#    [0.65506507, 0.47713784],
#    [0.61284429, 0.49720329],
#    [0.69325978, 0.96272673]])

2. Inheriting from a PyMC base `Distribution` class#

在实现新的 RandomVariable Op 之后，现在是时候在新的 PyMC Distribution 中使用它了。PyMC 以一种非常函数式编程的方式工作，而 distribution 类主要用于添加 PyMC API 功能并将相关方法组织在一起。实际上，它们负责：

链接（调度）一个 rv_op 类与相应的 moment、logp、logcdf 和 icdf 方法。
定义一个标准变换（用于连续分布），将一个有界变量域（例如，正线）转换为无界域（即，实线），这是许多采样器所偏好的。
验证分布的参数化，并将非符号输入（即数值字面量或NumPy数组）转换为符号变量。
将多个替代参数化转换为 RandomVariable 所定义的标准参数化。

以下是示例的延续方式：

import pytensor.tensor as pt
from pymc.distributions.continuous import PositiveContinuous
from pymc.distributions.dist_math import check_parameters
from pymc.distributions.shape_utils import rv_size_is_none


# Subclassing `PositiveContinuous` will dispatch a default `log` transformation
class Blah(PositiveContinuous):
    # This will be used by the metaclass `DistributionMeta` to dispatch the
    # class `logp` and `logcdf` methods to the `blah` `Op` defined in the last line of the code above.
    rv_op = blah

    # dist() is responsible for returning an instance of the rv_op.
    # We pass the standard parametrizations to super().dist
    @classmethod
    def dist(cls, param1, param2=None, alt_param2=None, **kwargs):
        param1 = pt.as_tensor_variable(param1)
        if param2 is not None and alt_param2 is not None:
            raise ValueError("Only one of param2 and alt_param2 is allowed.")
        if alt_param2 is not None:
            param2 = 1 / alt_param2
        param2 = pt.as_tensor_variable(param2)

        # The first value-only argument should be a list of the parameters that
        # the rv_op needs in order to be instantiated
        return super().dist([param1, param2], **kwargs)

    # moment returns a symbolic expression for the stable moment from which to start sampling
    # the variable, given the implicit `rv`, `size` and `param1` ... `paramN`.
    # This is typically a "representative" point such as the the mean or mode.
    def moment(rv, size, param1, param2):
        moment, _ = pt.broadcast_arrays(param1, param2)
        if not rv_size_is_none(size):
            moment = pt.full(size, moment)
        return moment

    # Logp returns a symbolic expression for the elementwise log-pdf or log-pmf evaluation
    # of the variable given the `value` of the variable and the parameters `param1` ... `paramN`.
    def logp(value, param1, param2):
        logp_expression = value * (param1 + pt.log(param2))

        # A switch is often used to enforce the distribution support domain
        bounded_logp_expression = pt.switch(
            pt.gt(value >= 0),
            logp_expression,
            -np.inf,
        )

        # We use `check_parameters` for parameter validation. After the default expression,
        # multiple comma-separated symbolic conditions can be added.
        # Whenever a bound is invalidated, the returned expression raises an error
        # with the message defined in the optional `msg` keyword argument.
        return check_parameters(
            bounded_logp_expression,
            param2 >= 0,
            msg="param2 >= 0",
        )

    # logcdf works the same way as logp. For bounded variables, it is expected to return
    # `-inf` for values below the domain start and `0` for values above the domain end.
    def logcdf(value, param1, param2):
        ...

    def icdf(value, param1, param2):
        ...

一些注意事项：

一个分布至少应该继承自 Discrete 或 Continuous。对于后者，存在更具体的子类：PositiveContinuous、UnitContinuous、BoundedContinuous、CircularContinuous、SimplexContinuous，它们为变量指定了默认的变换。如果你需要指定一次性的自定义变换，你也可以创建一个 _default_transform 调度函数，就像在 LKJCholeskyCov 中所做的那样。
如果一个分布没有对应的 rng_fn 实现，仍然应该创建一个 RandomVariable 来引发 NotImplementedError。例如，在 Flat 中就是这种情况。在这种情况下，将需要提供一个 moment 方法，因为没有 rng_fn，PyMC 无法回退到使用随机抽取作为 MCMC 的初始点。
如上所述，PyMC 以一种非常函数式的方式工作，并且 logp、logcdf、icdf 和 moment 方法所需的所有信息都应通过 RandomVariable 输入来“传递”。你可以传递那些对于 rng_fn 方法来说并非严格需要的数值参数，但这些参数在这些方法中会被使用。只需记住这可能会影响 RandomVariable 的正确形状推断行为。
logcdf 和 icdf 方法不是必需的，但这是一个很好的加分项！
目前，moment 方法仅支持一个矩，而最有用的可能是“高阶”矩（即 mean > median > mode）… 如果你处理的是离散分布，可能需要截断矩。moment 应返回随机变量的有效点（即，在该点评估时总是具有非零概率）
在创建 moment 方法时，要注意 size != None 的情况，并根据不一定用于计算矩的参数正确地进行广播。例如，pm.Normal.dist(mu=0, sigma=np.arange(1, 6)) 中的 sigma 对于矩的计算是无关的，但仍可能提供形状信息。在这种情况下，moment 应返回 [mu, mu, mu, mu, mu]。

为了快速检查是否一切正常，您可以尝试以下操作：

import pymc as pm
from pymc.distributions.distribution import moment

# pm.blah = pm.Normal in this example
blah = pm.blah.dist(mu=0, sigma=1)

# Test that the returned blah_op is still working fine
pm.draw(blah, random_seed=1)
# array(-1.01397228)

# Test the moment method
moment(blah).eval()
# array(0.)

# Test the logp method
pm.logp(blah, [-0.5, 1.5]).eval()
# array([-1.04393853, -2.04393853])

# Test the logcdf method
pm.logcdf(blah, [-0.5, 1.5]).eval()
# array([-1.17591177, -0.06914345])

3. Adding tests for the new `RandomVariable`#

新的 RandomVariables 的测试主要位于 tests/distributions/test_*.py 中。大多数测试可以通过默认的 BaseTestDistributionRandom 类来实现，该类提供了用于检查的默认测试：

预期的输入通过 check_pymc_params_match_rv_op 传递给 dist 类方法 中的 rv_op。
预期的（精确的）抽取结果正在通过 check_pymc_draws_match_reference 返回。
形状变量推断是正确的，通过 check_rv_size

from pymc.testing import BaseTestDistributionRandom, seeded_scipy_distribution_builder


class TestBlah(BaseTestDistributionRandom):
    pymc_dist = pm.Blah
    # Parameters with which to test the blah pymc Distribution
    pymc_dist_params = {"param1": 0.25, "param2": 2.0}
    # Parameters that are expected to have passed as inputs to the RandomVariable op
    expected_rv_op_params = {"param1": 0.25, "param2": 2.0}
    # If the new `RandomVariable` is simply calling a `numpy`/`scipy` method,
    # we can make use of `seeded_[scipy|numpy]_distribution_builder` which
    # will prepare a seeded reference distribution for us.
    reference_dist_params = {"mu": 0.25, "loc": 2.0}
    reference_dist = seeded_scipy_distribution_builder("blah")
    tests_to_run = [
        "check_pymc_params_match_rv_op",
        "check_pymc_draws_match_reference",
        "check_rv_size",
    ]

应为分布的每个可选参数化添加额外的测试。在这种情况下，只需包含测试 check_pymc_params_match_rv_op 就足够了，因为只有这个有所不同。

确保测试的替代参数值会导致关联的默认参数值不同。例如，如果它只是反向的，用 1.0 进行测试并不是很有信息量，因为转换也会返回 1.0，我们无法（像）确定这是否正确工作。

class TestBlahAltParam2(BaseTestDistributionRandom):

    pymc_dist = pm.Blah
    # param2 is equivalent to 1 / alt_param2
    pymc_dist_params = {"param1": 0.25, "alt_param2": 4.0}
    expected_rv_op_params = {"param1": 0.25, "param2": 2.0}
    tests_to_run = ["check_pymc_params_match_rv_op"]

自定义测试也可以像 TestFlat 那样添加到类中。

关于 `check_rv_size` 测试的说明：#

可以通过添加可选的类属性 sizes_to_check 和 sizes_expected 来为 check_rv_size 测试定义自定义输入大小（以及预期的输出形状）：

sizes_to_check = [None, (1), (2, 3)]
sizes_expected = [(3,), (1, 3), (2, 3, 3)]
tests_to_run = ["check_rv_size"]

这通常是用于多元分布的。你可以在 TestDirichlet 中看到一个例子。

关于 `check_pymcs_draws_match_reference` 测试的注释#

check_pymcs_draws_match_reference 是一个非常简单的测试，用于检查从 RandomVariable 和完全相同的 Python 函数中抽取的样本是否相等，前提是输入和随机种子相同。检查了一个小样本量（size=15）。这并不旨在测试随机数生成器的正确性。后一种测试（如果需要）可以通过 pymc_random 和 pymc_random_discrete 方法进行，这些方法将对 RandomVariable.rng_fn 和参考 Python 函数进行昂贵的统计比较。只有当存在一个好的独立生成器参考（即，不仅仅是 rng_fn 内部所做的 NumPy / SciPy 调用的相同组合）时，这种测试才有意义。

最后，当你的 rng_fn 做的事情不仅仅是调用 NumPy 或 SciPy 方法时，你需要设置一个等效的种子函数来进行精确的抽取比较（而不是依赖于 seeded_[scipy|numpy]_distribution_builder）。你可以在 TestWeibull 中找到一个例子，其 rng_fn 返回 beta * np.random.weibull(alpha, size=size)。

4. Adding tests for the `logp` / `logcdf` / `icdf` methods#

对 logp、logcdf 和 icdf 的测试主要使用 ~testing 中实现的辅助函数 check_logp、check_logcdf、check_icdf 和 check_selfconsistency_discrete_logcdf。

from pymc.testing import Domain, check_logp, check_logcdf, select_by_precision

R = Domain([-np.inf, -2.1, -1, -0.01, 0.0, 0.01, 1, 2.1, np.inf])
Rplus = Domain([0, 0.01, 0.1, 0.9, 0.99, 1, 1.5, 2, 100, np.inf])


def test_blah():
    check_logp(
        pymc_dist=pm.Blah,
        # Domain of the distribution values
        domain=R,
        # Domains of the distribution parameters
        paramdomains={"mu": R, "sigma": Rplus},
        # Reference scipy (or other) logp function
        scipy_logp=lambda value, mu, sigma: sp.norm.logpdf(value, mu, sigma),
        # Number of decimal points expected to match between the pymc and reference functions
        decimal=select_by_precision(float64=6, float32=3),
        # Maximum number of combinations of domain * paramdomains to test
        n_samples=100,
    )

    check_logcdf(
        pymc_dist=pm.Blah,
        domain=R,
        paramdomains={"mu": R, "sigma": Rplus},
        scipy_logcdf=lambda value, mu, sigma: sp.norm.logcdf(value, mu, sigma),
        decimal=select_by_precision(float64=6, float32=1),
        n_samples=-1,
    )

这些方法将对 domain 和 paramdomains 值的组合进行网格评估，并检查 PyMC 方法和参考函数是否匹配。有几个细节值得注意：

默认情况下，Domain 的第一个和最后一个值（边缘）不进行比较（它们用于其他用途）。如果需要测试 Domain 的边缘，可以重复边缘值。这可以通过 Bool 实现：Bool = Domain([0, 0, 1, 1], "int64")
有一些默认域（如 R 和 Rplus）可以用于测试你的新分布，但如果有一个很好的理由（例如，当默认值导致太多极端且不太可能的组合，而这些组合对于验证实现的正确性不太有信息量时），在测试函数内部创建自己的域也是完全可以的。
默认情况下，会测试 100 个 param x paramdomain 组合的随机子集，以控制测试运行时间。在测试你闪亮的新分布时，你可以暂时将 n_samples=-1 设置为强制测试所有组合。这一点很重要，以避免你的 PR 在将来运行时由于随机测试到一些不良参数组合而导致意外失败。
在GitHub上，一些测试会在 pytensor.config.floatX 标志为 "float64" 和 "float32" 的情况下运行两次。然而，参考Python函数将在纯 "float64" 环境中运行，这意味着参考结果和PyMC结果可能会有很大差异（例如，极端参数下可能下溢至 -np.inf）。因此，你应该确保在本地两种情况下都进行测试。一个快速且粗略的方法是在文件顶部，紧接在 import pytensor 之后，临时添加 pytensor.config.floatX = "float32"。记得同时将 n_samples 设置为 -1 以测试所有组合。测试输出将显示导致失败的精确参数值。如果你确信你的实现是正确的，你可以选择通过 select_by_precision 调整小数精度，或调整测试的 Domain 值。在极端情况下，你可以用条件 xfail 标记测试（如果只有一种子方法失败，应将其分开，以便 xfail 尽可能精确）：

def test_blah_logp(self):
    ...


@pytest.mark.xfail(
   condition=(pytensor.config.floatX == "float32"),
   reason="Fails on float32 due to numerical issues",
)
def test_blah_logcdf(self):
    ...

5. Adding tests for the `moment` method#

对 moment 的测试使用了函数 assert_moment_is_expected，该函数检查是否：

Moments 返回 期望 值
时刻具有预期的尺寸和形状
时刻具有有限的 logp

import pytest
from pymc.distributions import Blah
from pymc.testing import assert_moment_is_expected


@pytest.mark.parametrize(
    "param1, param2, size, expected",
    [
        (0, 1, None, 0),
        (0, np.ones(5), None, np.zeros(5)),
        (np.arange(5), 1, None, np.arange(5)),
        (np.arange(5), np.arange(1, 6), (2, 5), np.full((2, 5), np.arange(5))),
    ],
)
def test_blah_moment(param1, param2, size, expected):
    with Model() as model:
        Blah("x", param1=param1, param2=param2, size=size)
    assert_moment_is_expected(model, expected)

以下是一些值得记住的细节：

在必须手动广播参数的情况下，添加测试条件非常重要，如果不这样做，测试条件将失败。一个直接的方法是将使用的参数设为标量，未使用的参数设为向量（一次一个），并设置大小为 None。
换句话说，确保测试不同的大小和广播组合，以涵盖这些情况。

6. Documenting the new `Distribution`#

新的分布应当有一个丰富的文档字符串，遵循与之前实现的分布相同的格式。它通常看起来像这样：

r"""Univariate blah distribution.

 The pdf of this distribution is

 .. math::

    f(x \mid \param1, \param2) = \exp{x * (param1 + \log{param2})}

 .. plot::

     import matplotlib.pyplot as plt
     import numpy as np
     import scipy.stats as st
     import arviz as az
     x = np.linspace(-5, 5, 1000)
     params1 = [0., 0., 0., -2.]
     params2 = [0.4, 1., 2., 0.4]
     for param1, param2 in zip(params1, params2):
         pdf = st.blah.pdf(x, param1, param2)
         plt.plot(x, pdf, label=r'$\param1$ = {}, $\param2$ = {}'.format(param1, param2))
     plt.xlabel('x', fontsize=12)
     plt.ylabel('f(x)', fontsize=12)
     plt.legend(loc=1)
     plt.show()

 ========  ==========================================
 Support   :math:`x \in [0, \infty)`
 ========  ==========================================

 Blah distribution can be parameterized either in terms of param2 or
 alt_param2. The link between the two parametrizations is
 given by

 .. math::

    \param2 = \dfrac{1}{\alt_param2}


 Parameters
 ----------
 param1: float
     Interpretation of param1.
 param2: float
     Interpretation of param2 (param2 > 0).
 alt_param2: float
     Interpretation of alt_param2 (alt_param2 > 0) (alternative to param2).

 Examples
 --------
 .. code-block:: python

     with pm.Model():
         x = pm.Blah('x', param1=0, param2=10)
 """

新的分布应在 docs 模块中的相应 API 页面中引用（例如，pymc/docs/api/distributions.continuous.rst）。如果合适，应在 pymc-examples 中添加一个新的笔记本示例，展示如何使用此分布，以及它与用户可能更熟悉的其他分布的关系（和/或差异）。

实现一个随机变量分布#

1. Creating a new RandomVariable Op#

2. Inheriting from a PyMC base Distribution class#

3. Adding tests for the new RandomVariable#

关于 check_rv_size 测试的说明：#

关于 check_pymcs_draws_match_reference 测试的注释#

4. Adding tests for the logp / logcdf / icdf methods#

5. Adding tests for the moment method#

6. Documenting the new Distribution#