Reliability Statistics and Predictive Calibration#

Note

This notebook uses libraries that are not PyMC dependencies, and therefore they need to be installed specifically to run this notebook. Open the dropdown below for extra guidance.

Extra dependencies install instructions

To run this notebook locally or on Binder, you will need a working PyMC installation with all optional dependencies, plus some extra dependencies. For advice on installing PyMC itself, see Installation.

You can install these dependencies with your preferred package manager; the pip and conda commands below are provided as examples.

$ pip install lifelines

Note that if you want (or need) to install the packages from inside the notebook instead of the command line, you can install the packages by running a variant of the pip command:

import sys

!{sys.executable} -m pip install lifelines

You should not run !pip install, as it might install the package in a different environment and not be available from the Jupyter notebook even if it is installed.

An alternative is to use conda instead:

$ conda install lifelines

When installing scientific python packages via conda, we recommend using conda forge.

import os
import random

from io import StringIO

import arviz as az
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

from lifelines import KaplanMeierFitter, LogNormalFitter, WeibullFitter
from lifelines.utils import survival_table_from_events
from scipy.stats import binom, lognorm, norm, weibull_min
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")
%config InlineBackend.figure_format = 'retina'

Reliability Statistics#

When we want to make inferences about likely failures on a production line, we may have large or small sample datasets depending on the industry, the nature of the goods, or the particular question we are trying to answer. In all cases, however, there are questions of cost and of how many failures can be tolerated.

A reliability study therefore has to account for the period of observation over which failures matter, the cost of failure, and the cost of running a mis-specified study. The requirement for precision in the definition of the question and in the nature of the modelling exercise is paramount.

The key feature of time-to-failure data is the manner in which it is censored and how this censoring affects traditional statistical summaries and estimation techniques. In this notebook we focus on the prediction of failure times, comparing the Bayesian notion of a calibrated prediction interval with some frequentist alternatives. We draw on the work in the book Statistical Methods for Reliability Data; for details see Meeker [2022].

Types of Prediction#

We might want to know about different views of the failure-time distribution, such as:

  • Time to failure of a new item

  • Time until k failures occur in a future sample of m units

While non-parametric and descriptive methods can be used to assess these kinds of questions, we are primarily interested in the case where we have a probability model, e.g. a lognormal distribution of failure times \(F(t: \mathbf{\theta})\), parameterised by an unknown \(\mathbf{\theta}\).

Structure of the Presentation#

We will

  • Discuss non-parametric estimates of the cumulative density function (CDF) for reliability data

  • Show how frequentist or maximum likelihood estimates of the same function can be derived to inform our prediction intervals

  • Show how Bayesian methods can be used to augment the analysis of the same CDF in settings where information is sparse.

Throughout, the focus will be on how the CDF helps us understand the risk of failure in a reliability setting, and in particular how the CDF can be used to derive a well-calibrated prediction interval.

Example Failure Distribution#

In the study of reliability statistics there is a focus on location-scale distributions with long tails. In an ideal world we would know exactly which distribution describes our failure process, and the prediction interval for the next failure could be defined exactly.

mu, sigma = 6, 0.3


def plot_ln_pi(mu, sigma, xy=(700, 75), title="Exact Prediction Interval for Known Lognormal"):
    failure_dist = lognorm(s=sigma, scale=np.exp(mu))
    samples = failure_dist.rvs(size=1000, random_state=100)
    fig, axs = plt.subplots(1, 3, figsize=(20, 8))
    axs = axs.flatten()
    axs[0].hist(samples, ec="black", color="slateblue", bins=30)
    axs[0].set_title(f"Failure Time Distribution: LN({mu}, {sigma})")
    count, bins_count = np.histogram(samples, bins=30)
    pdf = count / sum(count)
    cdf = np.cumsum(pdf)
    axs[1].plot(bins_count[1:], cdf, label="CDF", color="slateblue")
    axs[2].plot(bins_count[1:], 1 - cdf, label="Survival", color="slateblue")
    axs[2].legend()
    axs[1].legend()
    axs[1].set_title("Cumulative Density Function")
    axs[2].set_title("Survival Curve")

    lb = failure_dist.ppf(0.01)
    ub = failure_dist.ppf(0.99)
    axs[0].annotate(
        f"99% Prediction \nInterval: [{np.round(lb, 3)}, {np.round(ub, 3)}]",
        xy=(xy[0], xy[1] - 25),
        fontweight="bold",
    )
    axs[0].fill_betweenx(y=range(200), x1=lb, x2=ub, alpha=0.2, label="p99 PI", color="cyan")
    axs[1].fill_betweenx(y=range(2), x1=lb, x2=ub, alpha=0.2, label="p99 PI", color="cyan")
    axs[2].fill_betweenx(y=range(2), x1=lb, x2=ub, alpha=0.2, label="p99 PI", color="cyan")
    lb = failure_dist.ppf(0.025)
    ub = failure_dist.ppf(0.975)
    axs[0].annotate(
        f"95% Prediction \nInterval: [{np.round(lb, 3)}, {np.round(ub, 3)}]",
        xy=(xy[0], xy[1]),
        fontweight="bold",
    )
    axs[0].fill_betweenx(y=range(200), x1=lb, x2=ub, alpha=0.2, label="p95 PI", color="magenta")
    axs[1].fill_betweenx(y=range(2), x1=lb, x2=ub, alpha=0.2, label="p95 PI", color="magenta")
    axs[2].fill_betweenx(y=range(2), x1=lb, x2=ub, alpha=0.2, label="p95 PI", color="magenta")
    axs[0].legend()
    axs[1].legend()
    axs[2].legend()
    plt.suptitle(title, fontsize=20)


plot_ln_pi(mu, sigma)
../_images/c95e8d11d0f6644ec4cb69424b5b3445a79e8bd95fd6c0e1621beb65579231ec.png

Estimating the Failure Distribution from Data#

In the real world we rarely have such exact knowledge. Instead we start with altogether less clear-cut data. We will first examine failure data about heat exchangers across three plants and pool the information to quantify the lifetime of heat exchangers over the three plants.

The data is deliberately small so that we can focus on the descriptive statistics involved in assessing time-to-failure data. In particular we will estimate the empirical cumulative density function and the survival function. We will then generalise this style of analysis to a larger dataset afterwards.

Heat Exchange Data#

A note on censored data: see below how the failure data flags whether or not an observation has been censored, i.e. whether or not we have observed the full course of the heat exchanger's lifetime. This is a crucial feature of failure-time data. Overly simplistic statistical summaries will be biased in their estimation of the prevalence of failure, because our study has not seen the full lifecycle of every item. The most common form of censoring is so-called "right censored" data, where we have not observed the "failure" event for some subset of the observations: our histories are incomplete because data collection was stopped too early.

Left censoring (where we do not observe an item from the beginning of its history) and interval censoring (both left and right censoring) can also occur, but are less common.
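To make the encoding concrete, here is a minimal, hypothetical sketch (toy values, not the heat-exchanger data that follows) of how right-censoring is typically flagged and why ignoring it biases simple summaries:

import pandas as pd

from lifelines import KaplanMeierFitter

toy = pd.DataFrame(
    {
        "t": [120, 340, 500, 500, 500],  # hours observed for each item
        "failed": [1, 1, 0, 0, 0],  # 1 = failure observed, 0 = right-censored at 500 hours
    }
)
## The naive mean treats the three censored items as if they failed at 500 hours,
## understating their lifetimes: they survived *at least* 500 hours.
print("Naive mean of observed times:", toy["t"].mean())

## The Kaplan-Meier estimator accounts for the censored observations
kmf = KaplanMeierFitter().fit(toy["t"], toy["failed"])
print(kmf.survival_function_)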

heat_exchange_df = pd.read_csv(
    StringIO(
        """Years Lower,Years Upper,Censoring Indicator,Count,Plant
0,1,Left,1,1
1,2,Interval,2,1
2,3,Interval,2,1
3, ,Right,95,1
0,1,Left,2,2
1,2,Interval,3,2
2, ,Right,95,2
0,1,Left,1,3
1, ,Right,99,3

"""
    )
)

heat_exchange_df["year_interval"] = (
    heat_exchange_df["Years Lower"].astype(str) + "," + heat_exchange_df["Years Upper"].astype(str)
)
heat_exchange_df["failed"] = np.where(
    heat_exchange_df["Censoring Indicator"] != "Right", heat_exchange_df["Count"], 0
)
heat_exchange_df["censored"] = np.where(
    heat_exchange_df["Censoring Indicator"] == "Right", heat_exchange_df["Count"], 0
)
heat_exchange_df["risk_set"] = [100, 99, 97, 0, 100, 98, 0, 100, 0]
heat_exchange_df
Years Lower Years Upper Censoring Indicator Count Plant year_interval failed censored risk_set
0 0 1 Left 1 1 0,1 1 0 100
1 1 2 Interval 2 1 1,2 2 0 99
2 2 3 Interval 2 1 2,3 2 0 97
3 3 Right 95 1 3, 0 95 0
4 0 1 Left 2 2 0,1 2 0 100
5 1 2 Interval 3 2 1,2 3 0 98
6 2 Right 95 2 2, 0 95 0
7 0 1 Left 1 3 0,1 1 0 100
8 1 Right 99 3 1, 0 99 0
actuarial_table = heat_exchange_df.groupby(["Years Upper"])[["failed", "risk_set"]].sum()
actuarial_table = actuarial_table.tail(3)


def greenwood_variance(df):
    ### Used to estimate the variance in the CDF
    n = len(df)
    ps = [df.iloc[i]["p_hat"] / (df.iloc[i]["risk_set"] * df.iloc[i]["1-p_hat"]) for i in range(n)]
    s = [(df.iloc[i]["S_hat"] ** 2) * np.sum(ps[0 : i + 1]) for i in range(n)]
    return s


def logit_transform_interval(df):
    ### Used for robustness in the estimation of the Confidence intervals in the CDF
    df["logit_CI_95_lb"] = df["F_hat"] / (
        df["F_hat"]
        + df["S_hat"] * np.exp((1.960 * df["Standard_Error"]) / (df["F_hat"] * df["S_hat"]))
    )
    df["logit_CI_95_ub"] = df["F_hat"] / (
        df["F_hat"]
        + df["S_hat"] / np.exp((1.960 * df["Standard_Error"]) / (df["F_hat"] * df["S_hat"]))
    )
    df["logit_CI_95_lb"] = np.where(df["logit_CI_95_lb"] < 0, 0, df["logit_CI_95_lb"])
    df["logit_CI_95_ub"] = np.where(df["logit_CI_95_ub"] > 1, 1, df["logit_CI_95_ub"])
    return df


def make_actuarial_table(actuarial_table):
    ### Actuarial lifetables are used to describe the nature of the risk over time and estimate
    actuarial_table["p_hat"] = actuarial_table["failed"] / actuarial_table["risk_set"]
    actuarial_table["1-p_hat"] = 1 - actuarial_table["p_hat"]
    actuarial_table["S_hat"] = actuarial_table["1-p_hat"].cumprod()
    actuarial_table["CH_hat"] = -np.log(actuarial_table["S_hat"])
    ### The Estimate of the CDF function
    actuarial_table["F_hat"] = 1 - actuarial_table["S_hat"]
    actuarial_table["V_hat"] = greenwood_variance(actuarial_table)
    actuarial_table["Standard_Error"] = np.sqrt(actuarial_table["V_hat"])
    actuarial_table["CI_95_lb"] = (
        actuarial_table["F_hat"] - actuarial_table["Standard_Error"] * 1.960
    )
    actuarial_table["CI_95_lb"] = np.where(
        actuarial_table["CI_95_lb"] < 0, 0, actuarial_table["CI_95_lb"]
    )
    actuarial_table["CI_95_ub"] = (
        actuarial_table["F_hat"] + actuarial_table["Standard_Error"] * 1.960
    )
    actuarial_table["CI_95_ub"] = np.where(
        actuarial_table["CI_95_ub"] > 1, 1, actuarial_table["CI_95_ub"]
    )
    actuarial_table["ploting_position"] = actuarial_table["F_hat"].rolling(1).median()
    actuarial_table = logit_transform_interval(actuarial_table)
    return actuarial_table


actuarial_table_heat = make_actuarial_table(actuarial_table)
actuarial_table_heat = actuarial_table_heat.reset_index()
actuarial_table_heat.rename({"Years Upper": "t"}, axis=1, inplace=True)
actuarial_table_heat["t"] = actuarial_table_heat["t"].astype(int)
actuarial_table_heat
t failed risk_set p_hat 1-p_hat S_hat CH_hat F_hat V_hat Standard_Error CI_95_lb CI_95_ub ploting_position logit_CI_95_lb logit_CI_95_ub
0 1 4 300 0.013333 0.986667 0.986667 0.013423 0.013333 0.000044 0.006622 0.000354 0.026313 0.013333 0.005013 0.034977
1 2 5 197 0.025381 0.974619 0.961624 0.039131 0.038376 0.000164 0.012802 0.013283 0.063468 0.038376 0.019818 0.073016
2 3 2 97 0.020619 0.979381 0.941797 0.059965 0.058203 0.000350 0.018701 0.021550 0.094856 0.058203 0.030694 0.107629

It is worth taking some time to walk through this example in detail, since it establishes estimates of some key quantities in time-to-failure modelling.

First note how we treat time as a series of discrete intervals. The data format is a discrete period format, since it records cumulative failures over time. We will see below another format for failure data - the item-period format - which records each individual item over all the periods together with its corresponding status. In this format the crucial quantities are the number of failed items and the risk_set in each period; everything else is derived from these.
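As a minimal sketch of the difference between the two layouts (the period-format counts here are the aggregate heat-exchanger numbers derived below; the item-period rows are hypothetical):

import pandas as pd

## Period format: one row per interval with aggregate counts of failures and the risk set
period_format = pd.DataFrame({"t": [1, 2, 3], "failed": [4, 5, 2], "risk_set": [300, 197, 97]})

## Item-period format: one row per item per period observed, with a 0/1 failure flag.
## Here item 1 fails in period 2; item 2 survives (is censored) after period 3.
item_period_format = pd.DataFrame(
    {"id": [1, 1, 2, 2, 2], "t": [1, 2, 1, 2, 3], "failed": [0, 1, 0, 0, 0]}
)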

First, over three consecutive years and across the three plants, we establish the number of heat exchangers in service and the number that subsequently failed. This gives an estimate of the probability of failure in that year, p_hat, and its complement 1-p_hat. These are then combined cumulatively across the years to estimate the survival curve S_hat, which can be further transformed to recover estimates of the cumulative hazard CH_hat and the cumulative density function F_hat.

Next we want a quick and dirty way to quantify the extent of our uncertainty in the estimate of the CDF. For this we use Greenwood's formula to estimate the variance V_hat of our estimate F_hat. This gives us the standard errors and the two kinds of confidence interval recommended in the literature.
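Written out explicitly (writing \(d_{j}\) for the number failing and \(n_{j}\) for the risk set in period \(j\); the notation is ours), the quantities computed above are:

\[ \hat{p}_{j} = \dfrac{d_{j}}{n_{j}}, \quad \hat{S}(t_{k}) = \prod_{j \leq k} (1 - \hat{p}_{j}), \quad \hat{F}(t_{k}) = 1 - \hat{S}(t_{k}), \quad \widehat{CH}(t_{k}) = -\log \hat{S}(t_{k}) \]

and Greenwood's formula for the variance of the CDF estimate is

\[ \hat{V}[\hat{F}(t_{k})] = \hat{S}(t_{k})^{2} \sum_{j \leq k} \dfrac{\hat{p}_{j}}{n_{j}(1 - \hat{p}_{j})} \]

from which the standard errors and the normal-approximation intervals \(\hat{F}(t_{k}) \pm 1.96\sqrt{\hat{V}}\) in the table follow.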

We apply the same techniques to a larger dataset and plot some of these quantities below.

Shock Absorbers Data: A Frequentist Reliability Analysis Study#

The shock absorbers data is in period format, but it records a steadily diminishing risk set over time: at each time point a single item is either censored or fails, i.e. it is removed from the test either successfully (passed) or through failure. This is a special case of period-format data.

shockabsorbers_df = pd.read_csv(
    StringIO(
        """Kilometers,Failure Mode,Censoring Indicator
6700,Mode1,Failed
6950,Censored,Censored
7820,Censored,Censored
8790,Censored,Censored
9120,Mode2,Failed
9660,Censored,Censored
9820,Censored,Censored
11310,Censored,Censored
11690,Censored,Censored
11850,Censored,Censored
11880,Censored,Censored
12140,Censored,Censored
12200,Mode1,Failed
12870,Censored,Censored
13150,Mode2,Failed
13330,Censored,Censored
13470,Censored,Censored
14040,Censored,Censored
14300,Mode1,Failed
17520,Mode1,Failed
17540,Censored,Censored
17890,Censored,Censored
18450,Censored,Censored
18960,Censored,Censored
18980,Censored,Censored
19410,Censored,Censored
20100,Mode2,Failed
20100,Censored,Censored
20150,Censored,Censored
20320,Censored,Censored
20900,Mode2,Failed
22700,Mode1,Failed
23490,Censored,Censored
26510,Mode1,Failed
27410,Censored,Censored
27490,Mode1,Failed
27890,Censored,Censored
28100,Censored,Censored
"""
    )
)

shockabsorbers_df["failed"] = np.where(shockabsorbers_df["Censoring Indicator"] == "Failed", 1, 0)
shockabsorbers_df["t"] = shockabsorbers_df["Kilometers"]
shockabsorbers_events = survival_table_from_events(
    shockabsorbers_df["t"], shockabsorbers_df["failed"]
).reset_index()
shockabsorbers_events.rename(
    {"event_at": "t", "observed": "failed", "at_risk": "risk_set"}, axis=1, inplace=True
)
actuarial_table_shock = make_actuarial_table(shockabsorbers_events)
actuarial_table_shock
t removed failed censored entrance risk_set p_hat 1-p_hat S_hat CH_hat F_hat V_hat Standard_Error CI_95_lb CI_95_ub ploting_position logit_CI_95_lb logit_CI_95_ub
0 0.0 0 0 0 38 38 0.000000 1.000000 1.000000 -0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN
1 6700.0 1 1 0 0 38 0.026316 0.973684 0.973684 0.026668 0.026316 0.000674 0.025967 0.000000 0.077212 0.026316 0.003694 0.164570
2 6950.0 1 0 1 0 37 0.000000 1.000000 0.973684 0.026668 0.026316 0.000674 0.025967 0.000000 0.077212 0.026316 0.003694 0.164570
3 7820.0 1 0 1 0 36 0.000000 1.000000 0.973684 0.026668 0.026316 0.000674 0.025967 0.000000 0.077212 0.026316 0.003694 0.164570
4 8790.0 1 0 1 0 35 0.000000 1.000000 0.973684 0.026668 0.026316 0.000674 0.025967 0.000000 0.077212 0.026316 0.003694 0.164570
5 9120.0 1 1 0 0 34 0.029412 0.970588 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
6 9660.0 1 0 1 0 33 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
7 9820.0 1 0 1 0 32 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
8 11310.0 1 0 1 0 31 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
9 11690.0 1 0 1 0 30 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
10 11850.0 1 0 1 0 29 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
11 11880.0 1 0 1 0 28 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
12 12140.0 1 0 1 0 27 0.000000 1.000000 0.945046 0.056521 0.054954 0.001431 0.037831 0.000000 0.129103 0.054954 0.013755 0.195137
13 12200.0 1 1 0 0 26 0.038462 0.961538 0.908698 0.095742 0.091302 0.002594 0.050927 0.000000 0.191119 0.091302 0.029285 0.250730
14 12870.0 1 0 1 0 25 0.000000 1.000000 0.908698 0.095742 0.091302 0.002594 0.050927 0.000000 0.191119 0.091302 0.029285 0.250730
15 13150.0 1 1 0 0 24 0.041667 0.958333 0.870836 0.138302 0.129164 0.003756 0.061285 0.009046 0.249282 0.129164 0.048510 0.301435
16 13330.0 1 0 1 0 23 0.000000 1.000000 0.870836 0.138302 0.129164 0.003756 0.061285 0.009046 0.249282 0.129164 0.048510 0.301435
17 13470.0 1 0 1 0 22 0.000000 1.000000 0.870836 0.138302 0.129164 0.003756 0.061285 0.009046 0.249282 0.129164 0.048510 0.301435
18 14040.0 1 0 1 0 21 0.000000 1.000000 0.870836 0.138302 0.129164 0.003756 0.061285 0.009046 0.249282 0.129164 0.048510 0.301435
19 14300.0 1 1 0 0 20 0.050000 0.950000 0.827294 0.189595 0.172706 0.005191 0.072047 0.031495 0.313917 0.172706 0.072098 0.359338
20 17520.0 1 1 0 0 19 0.052632 0.947368 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
21 17540.0 1 0 1 0 18 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
22 17890.0 1 0 1 0 17 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
23 18450.0 1 0 1 0 16 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
24 18960.0 1 0 1 0 15 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
25 18980.0 1 0 1 0 14 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
26 19410.0 1 0 1 0 13 0.000000 1.000000 0.783752 0.243662 0.216248 0.006455 0.080342 0.058778 0.373717 0.216248 0.098254 0.411308
27 20100.0 2 1 1 0 12 0.083333 0.916667 0.718440 0.330673 0.281560 0.009334 0.096613 0.092199 0.470922 0.281560 0.133212 0.499846
28 20150.0 1 0 1 0 10 0.000000 1.000000 0.718440 0.330673 0.281560 0.009334 0.096613 0.092199 0.470922 0.281560 0.133212 0.499846
29 20320.0 1 0 1 0 9 0.000000 1.000000 0.718440 0.330673 0.281560 0.009334 0.096613 0.092199 0.470922 0.281560 0.133212 0.499846
30 20900.0 1 1 0 0 8 0.125000 0.875000 0.628635 0.464205 0.371365 0.014203 0.119177 0.137778 0.604953 0.371365 0.178442 0.616380
31 22700.0 1 1 0 0 7 0.142857 0.857143 0.538830 0.618356 0.461170 0.017348 0.131711 0.203016 0.719324 0.461170 0.232453 0.707495
32 23490.0 1 0 1 0 6 0.000000 1.000000 0.538830 0.618356 0.461170 0.017348 0.131711 0.203016 0.719324 0.461170 0.232453 0.707495
33 26510.0 1 1 0 0 5 0.200000 0.800000 0.431064 0.841499 0.568936 0.020393 0.142806 0.289037 0.848835 0.568936 0.296551 0.805150
34 27410.0 1 0 1 0 4 0.000000 1.000000 0.431064 0.841499 0.568936 0.020393 0.142806 0.289037 0.848835 0.568936 0.296551 0.805150
35 27490.0 1 1 0 0 3 0.333333 0.666667 0.287376 1.246964 0.712624 0.022828 0.151089 0.416490 1.000000 0.712624 0.368683 0.913267
36 27890.0 1 0 1 0 2 0.000000 1.000000 0.287376 1.246964 0.712624 0.022828 0.151089 0.416490 1.000000 0.712624 0.368683 0.913267
37 28100.0 1 0 1 0 1 0.000000 1.000000 0.287376 1.246964 0.712624 0.022828 0.151089 0.416490 1.000000 0.712624 0.368683 0.913267

Maximum Likelihood Fits for Failure Data#

In addition to the descriptive summaries of our data, we can use the lifetable data to estimate a univariate model fit to our failure-time distribution. If the fit is good, this gives us a cleaner view of the predictive distribution implied by a particular prediction interval. Here we will use functions from the lifelines package to estimate the maximum likelihood fit on right-censored data.

lnf = LogNormalFitter().fit(actuarial_table_shock["t"] + 1e-25, actuarial_table_shock["failed"])
lnf.print_summary()
model lifelines.LogNormalFitter
number of observations 38
number of events observed 11
log-likelihood -124.20
hypothesis mu_ != 0, sigma_ != 1
coef se(coef) coef lower 95% coef upper 95% cmp to z p -log2(p)
mu_ 10.13 0.14 9.85 10.41 0.00 71.08 <0.005 inf
sigma_ 0.53 0.11 0.31 0.74 1.00 -4.26 <0.005 15.58

AIC 252.41

Tempting though it is to take this model and run with it, we need to be cautious when data is limited. For example, in the heat exchange data we had three years of data and only 11 failures in total. An overly simple model can lead us to the wrong conclusions. For now we will focus on the shock absorbers data - its non-parametric description and a simple univariate fit to this data.

def plot_cdfs(actuarial_table, dist_fits=True, ax=None, title="", xy=(3000, 0.5), item_period=None):
    if item_period is None:
        lnf = LogNormalFitter().fit(actuarial_table["t"] + 1e-25, actuarial_table["failed"])
        wbf = WeibullFitter().fit(actuarial_table["t"] + 1e-25, actuarial_table["failed"])
    else:
        lnf = LogNormalFitter().fit(item_period["t"] + 1e-25, item_period["failed"])
        wbf = WeibullFitter().fit(item_period["t"] + 1e-25, item_period["failed"])
    if ax is None:
        fig, ax = plt.subplots(figsize=(20, 10))
    ax.plot(
        actuarial_table["t"],
        actuarial_table["F_hat"],
        "-o",
        color="black",
        label="Non-Parametric Estimate of CDF",
    )
    ax.plot(
        actuarial_table["t"],
        actuarial_table["CI_95_lb"],
        color="darkorchid",
        linestyle="--",
        label="Non-Parametric 95% CI based on Normal Approx",
    )
    ax.plot(actuarial_table["t"], actuarial_table["CI_95_ub"], color="darkorchid", linestyle="--")
    ax.fill_between(
        actuarial_table["t"].values,
        actuarial_table["CI_95_lb"].values,
        actuarial_table["CI_95_ub"].values,
        color="darkorchid",
        alpha=0.2,
    )
    ax.plot(
        actuarial_table["t"],
        actuarial_table["logit_CI_95_lb"],
        color="royalblue",
        linestyle="--",
        label="Non-Parametric 95% CI based on Logit Approx",
    )
    ax.plot(
        actuarial_table["t"], actuarial_table["logit_CI_95_ub"], color="royalblue", linestyle="--"
    )
    ax.fill_between(
        actuarial_table["t"].values,
        actuarial_table["logit_CI_95_lb"].values,
        actuarial_table["logit_CI_95_ub"].values,
        color="royalblue",
        alpha=0.2,
    )
    if dist_fits:
        lnf.plot_cumulative_density(ax=ax, color="crimson", alpha=0.8)
        wbf.plot_cumulative_density(ax=ax, color="cyan", alpha=0.8)
        ax.annotate(
            f"Lognormal Fit: mu = {np.round(lnf.mu_, 3)}, sigma = {np.round(lnf.sigma_, 3)} \nWeibull Fit: lambda = {np.round(wbf.lambda_, 3)}, rho = {np.round(wbf.rho_, 3)}",
            xy=(xy[0], xy[1]),
            fontsize=12,
            weight="bold",
        )
    ax.set_title(
        f"Estimates of the Cumulative Density Function \n derived from our {title} Failure Data",
        fontsize=20,
    )
    ax.set_ylabel("Fraction Failing")
    ax.set_xlabel("Time Scale")
    ax.legend()
    return ax
plot_cdfs(actuarial_table_shock, title="Shock Absorber");
../_images/448f4295ed9f1d2c7e04bb5c578f093f46e0dc3386c6162767050082f92dfa1d.png

This suggests a good fit to the data and, as you might expect, the proportion failing increases as the shock absorbers age. But how do we quantify a prediction given an estimated model?

The Plug-in Procedure for Computing Approximate Statistical Prediction Intervals#

Since we have estimated a lognormal fit for the CDF of the shock absorbers data, we can plot its approximate prediction interval. The lower bound of the interval is likely of particular interest here: as a manufacturer we may want to understand warranty claims and the risk of having to issue refunds if the lower bound is too low.
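Concretely, the plug-in interval is just the quantiles of the assumed lognormal distribution evaluated at the MLE parameters reported above (mu ≈ 10.128, sigma ≈ 0.526); a minimal sketch (the helper name mle_fit_ln is ours):

import numpy as np

from scipy.stats import lognorm

## Plug the MLE parameters into the assumed lognormal failure-time distribution
mle_fit_ln = lognorm(s=0.526, scale=np.exp(10.128))

## Approximate 95% plug-in prediction interval for the next failure time (in kilometres)
lb, ub = mle_fit_ln.ppf([0.025, 0.975])
print(f"95% plug-in prediction interval: [{np.round(lb, 3)}, {np.round(ub, 3)}]")

The plot_ln_pi call below shades exactly these intervals on the fitted distribution.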

plot_ln_pi(
    10.128,
    0.526,
    xy=(40000, 120),
    title="Plug-in Estimate of Shock Absorber Failure Prediction Interval",
)
../_images/1ddddc2ce12894e22ff1bc74b396959ffd6b767618149d1a97527c8ae26327b6.png

Bootstrap Calibration and Coverage Estimation#

We now want to estimate the coverage implied by this prediction interval, and to do so we will bootstrap estimates of the lower and upper bounds of the 95% confidence interval and ultimately assess their coverage conditional on the MLE fit. We will use the fractional weighted (Bayesian) bootstrap. We report two methods of estimating the coverage statistic: the first is based on drawing a random value from the implied bootstrapped distribution and assessing whether it falls between the 95% MLE lower and upper bounds.

The second method of assessing coverage is to bootstrap estimates of the 95% lower and upper bounds, and then evaluate how much of the MLE-fitted distribution is covered by these bootstrapped values.

def bayes_boot(df, lb, ub, seed=100):
    w = np.random.dirichlet(np.ones(len(df)), 1)[0]
    lnf = LogNormalFitter().fit(df["t"] + 1e-25, df["failed"], weights=w)
    rv = lognorm(s=lnf.sigma_, scale=np.exp(lnf.mu_))
    ## Sample random choice from implied bootstrapped distribution
    choices = rv.rvs(1000)
    future = random.choice(choices)
    ## Check if choice is contained within the MLE 95% PI
    contained = (future >= lb) & (future <= ub)
    ## Record 95% interval of bootstrapped dist
    lb = rv.ppf(0.025)
    ub = rv.ppf(0.975)
    return lb, ub, contained, future, lnf.sigma_, lnf.mu_
CIs = [bayes_boot(actuarial_table_shock, 8928, 70188, seed=i) for i in range(1000)]
draws = pd.DataFrame(
    CIs, columns=["Lower Bound PI", "Upper Bound PI", "Contained", "future", "Sigma", "Mu"]
)
draws
Lower Bound PI Upper Bound PI Contained future Sigma Mu
0 9539.612509 52724.144855 True 12721.873679 0.436136 10.018018
1 11190.287884 70193.514908 True 37002.708820 0.468429 10.240906
2 12262.816075 49883.853578 True 47419.786326 0.357947 10.115890
3 10542.933491 119175.518677 False 82191.111953 0.618670 10.475782
4 9208.478350 62352.752208 True 25393.072528 0.487938 10.084221
... ... ... ... ... ... ...
995 9591.084003 85887.280659 True 53836.013839 0.559245 10.264690
996 9724.522689 45044.965713 True 19481.035376 0.391081 9.948911
997 8385.673724 104222.628035 True 16417.930907 0.642870 10.294282
998 11034.708453 56492.385175 True 42085.798578 0.416605 10.125331
999 9408.619310 51697.495907 True 19253.676200 0.434647 10.001273

1000 rows × 6 columns

We can use these bootstrapped statistics to further calculate quantities of the predictive distribution. In our case we could use the parametric CDF of our simple parametric model, but we will adopt the empirical CDF here to illustrate how this technique can be used when we have more complicated models.

def ecdf(sample):
    # convert sample to a numpy array, if it isn't already
    sample = np.atleast_1d(sample)

    # find the unique values and their corresponding counts
    quantiles, counts = np.unique(sample, return_counts=True)

    # take the cumulative sum of the counts and divide by the sample size to
    # get the cumulative probabilities between 0 and 1
    cumprob = np.cumsum(counts).astype(np.double) / sample.size

    return quantiles, cumprob


fig, axs = plt.subplots(1, 2, figsize=(20, 6))
axs = axs.flatten()
ax = axs[0]
ax1 = axs[1]
hist_data = []
for i in range(1000):
    samples = lognorm(s=draws.iloc[i]["Sigma"], scale=np.exp(draws.iloc[i]["Mu"])).rvs(1000)
    qe, pe = ecdf(samples)
    ax.plot(qe, pe, color="skyblue", alpha=0.2)
    lkup = dict(zip(pe, qe))
    hist_data.append([lkup[0.05]])
hist_data = pd.DataFrame(hist_data, columns=["p05"])
samples = lognorm(s=draws["Sigma"].mean(), scale=np.exp(draws["Mu"].mean())).rvs(1000)
qe, pe = ecdf(samples)
ax.plot(qe, pe, color="red", label="Expected CDF for Shock Absorbers Data")
ax.set_xlim(0, 30_000)
ax.set_title("Bootstrapped CDF functions for the Shock Absorbers Data", fontsize=20)
ax1.hist(hist_data["p05"], color="slateblue", ec="black", alpha=0.4, bins=30)
ax1.set_title("Estimate of Uncertainty in the 5% Failure Time", fontsize=20)
ax1.axvline(
    hist_data["p05"].quantile(0.025), color="cyan", label="Lower Bound CI for 5% failure time"
)
ax1.axvline(
    hist_data["p05"].quantile(0.975), color="cyan", label="Upper Bound CI for 5% failure time"
)
ax1.legend()
ax.legend();
../_images/e8f7812cd02ee84f6a9d09bd29d04c3329ca40245b1ef47a6dfaa3b87db5bbb8.png

Next we will plot the bootstrapped data and the two estimates of coverage we achieve conditional on the MLE fit. In other words, when we want to assess the coverage of a prediction interval based on our MLE fit, we can also bootstrap an estimate of this quantity.

mosaic = """AABB
            CCCC"""
fig, axs = plt.subplot_mosaic(mosaic=mosaic, figsize=(20, 12))
mle_rv = lognorm(s=0.53, scale=np.exp(10.128))
axs = [axs[k] for k in axs.keys()]
axs[0].scatter(
    draws["Mu"],
    draws["Lower Bound PI"],
    c=draws["Contained"],
    cmap=cm.cool,
    alpha=0.3,
    label="Fits in MLE 95% CI",
)
axs[1].scatter(
    draws["Sigma"],
    draws["Lower Bound PI"],
    c=draws["Contained"],
    cmap=cm.cool,
    alpha=0.3,
    label="Fits in MLE 95% CI",
)
axs[0].set_title("Bootstrapped Mu against Bootstrapped 95% Lower Bound")
prop = draws["Contained"].sum() / len(draws)
axs[0].annotate(
    f"Estimated Prediction \nEmpirical Coverage \nBased on Sampling : {np.round(prop, 3)}",
    xy=(10.4, 12000),
    fontweight="bold",
)
axs[1].set_title("Bootstrapped Sigma against Bootstrapped 95% Lower Bound")
axs[0].legend()
axs[0].set_xlabel("Mu")
axs[1].set_xlabel("Sigma")
axs[0].set_ylabel("Estimated Lower 95% PI")
axs[1].legend()
axs[2].hist(
    mle_rv.cdf(draws["Lower Bound PI"]),
    bins=50,
    label="Bootstrap 95% LB",
    ec="k",
    color="royalblue",
    alpha=0.2,
)
axs[2].hist(
    mle_rv.cdf(draws["Upper Bound PI"]),
    bins=50,
    label="Bootstrap 95% UB",
    ec="k",
    color="darkorchid",
    alpha=0.2,
)
axs[2].hist(
    np.abs(mle_rv.cdf(draws["Lower Bound PI"]) - mle_rv.cdf(draws["Upper Bound PI"])),
    alpha=0.2,
    bins=50,
    color="slateblue",
    ec="black",
    label="Bootstrap Abs Diff",
)
axs[2].axvline(
    np.abs(mle_rv.cdf(draws["Lower Bound PI"]) - mle_rv.cdf(draws["Upper Bound PI"])).mean(),
    label="Expected Coverage",
)
axs[2].set_title("Difference in LB and UB | MLE(mu, sigma)")
axs[2].legend()

plug_in = np.abs(
    np.mean(mle_rv.cdf(draws["Lower Bound PI"])) - np.mean(mle_rv.cdf(draws["Upper Bound PI"]))
)
lb = np.round(draws["Lower Bound PI"].mean(), 3)
ub = np.round(draws["Upper Bound PI"].mean(), 3)
axs[2].annotate(
    f"Estimated Prediction Interval \n Coverage Based on Plug in Method : {np.round(plug_in, 3)} \n with [{lb, ub}]",
    xy=(0.6, 80),
    fontweight="bold",
);
../_images/b4c67a126872ecfcb0599fa9836abe209c9d2b1f29c40c881d354306ecab7679.png

These simulations should be repeated a far larger number of times than we have done here. It should also be clear how we could vary the size of the MLE interval to achieve a desired coverage level.

Bearing Cage Data: A Bayesian Reliability Analysis Study#

Next we will look at a dataset where the parametric fit is somewhat less clean. The most obvious feature of this data is the small number of recorded failures. The data is recorded in period format and shows the extent of the risk set in each period.

We want to use this example to show how the treatment of the bearing cage data can be augmented in a setting where the frequentist techniques that served us well for the shock absorbers data become strained. In particular we will show how the issues that arise can be resolved with a Bayesian approach.

bearing_cage_df = pd.read_csv(
    StringIO(
        """Hours,Censoring Indicator,Count
50,Censored,288
150,Censored,148
230,Failed,1
250,Censored,124
334,Failed,1
350,Censored,111
423,Failed,1
450,Censored,106
550,Censored,99
650,Censored,110
750,Censored,114
850,Censored,119
950,Censored,127
990,Failed,1
1009,Failed,1
1050,Censored,123
1150,Censored,93
1250,Censored,47
1350,Censored,41
1450,Censored,27
1510,Failed,1
1550,Censored,11
1650,Censored,6
1850,Censored,1
2050,Censored,2"""
    )
)


bearing_cage_df["t"] = bearing_cage_df["Hours"]
bearing_cage_df["failed"] = np.where(bearing_cage_df["Censoring Indicator"] == "Failed", 1, 0)
bearing_cage_df["censored"] = np.where(
    bearing_cage_df["Censoring Indicator"] == "Censored", bearing_cage_df["Count"], 0
)
bearing_cage_events = survival_table_from_events(
    bearing_cage_df["t"], bearing_cage_df["failed"], weights=bearing_cage_df["Count"]
).reset_index()
bearing_cage_events.rename(
    {"event_at": "t", "observed": "failed", "at_risk": "risk_set"}, axis=1, inplace=True
)
actuarial_table_bearings = make_actuarial_table(bearing_cage_events)

actuarial_table_bearings
t removed failed censored entrance risk_set p_hat 1-p_hat S_hat CH_hat F_hat V_hat Standard_Error CI_95_lb CI_95_ub ploting_position logit_CI_95_lb logit_CI_95_ub
0 0.0 0 0 0 1703 1703 0.000000 1.000000 1.000000 -0.000000 0.000000 0.000000e+00 0.000000 0.0 0.000000 0.000000 NaN NaN
1 50.0 288 0 288 0 1703 0.000000 1.000000 1.000000 -0.000000 0.000000 0.000000e+00 0.000000 0.0 0.000000 0.000000 NaN NaN
2 150.0 148 0 148 0 1415 0.000000 1.000000 1.000000 -0.000000 0.000000 0.000000e+00 0.000000 0.0 0.000000 0.000000 NaN NaN
3 230.0 1 1 0 0 1267 0.000789 0.999211 0.999211 0.000790 0.000789 6.224491e-07 0.000789 0.0 0.002336 0.000789 0.000111 0.005581
4 250.0 124 0 124 0 1266 0.000000 1.000000 0.999211 0.000790 0.000789 6.224491e-07 0.000789 0.0 0.002336 0.000789 0.000111 0.005581
5 334.0 1 1 0 0 1142 0.000876 0.999124 0.998336 0.001666 0.001664 1.386254e-06 0.001177 0.0 0.003972 0.001664 0.000415 0.006641
6 350.0 111 0 111 0 1141 0.000000 1.000000 0.998336 0.001666 0.001664 1.386254e-06 0.001177 0.0 0.003972 0.001664 0.000415 0.006641
7 423.0 1 1 0 0 1030 0.000971 0.999029 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
8 450.0 106 0 106 0 1029 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
9 550.0 99 0 99 0 923 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
10 650.0 110 0 110 0 824 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
11 750.0 114 0 114 0 714 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
12 850.0 119 0 119 0 600 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
13 950.0 127 0 127 0 481 0.000000 1.000000 0.997367 0.002637 0.002633 2.322113e-06 0.001524 0.0 0.005620 0.002633 0.000846 0.008165
14 990.0 1 1 0 0 354 0.002825 0.997175 0.994549 0.005466 0.005451 1.022444e-05 0.003198 0.0 0.011718 0.005451 0.001722 0.017117
15 1009.0 1 1 0 0 353 0.002833 0.997167 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
16 1050.0 123 0 123 0 352 0.000000 1.000000 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
17 1150.0 93 0 93 0 229 0.000000 1.000000 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
18 1250.0 47 0 47 0 136 0.000000 1.000000 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
19 1350.0 41 0 41 0 89 0.000000 1.000000 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
20 1450.0 27 0 27 0 48 0.000000 1.000000 0.991732 0.008303 0.008268 1.808196e-05 0.004252 0.0 0.016603 0.008268 0.003008 0.022519
21 1510.0 1 1 0 0 21 0.047619 0.952381 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 0.146173 0.055494 0.010308 0.248927
22 1550.0 11 0 11 0 20 0.000000 1.000000 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 0.146173 0.055494 0.010308 0.248927
23 1650.0 6 0 6 0 9 0.000000 1.000000 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 0.146173 0.055494 0.010308 0.248927
24 1850.0 1 0 1 0 3 0.000000 1.000000 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 0.146173 0.055494 0.010308 0.248927
25 2050.0 2 0 2 0 2 0.000000 1.000000 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 0.146173 0.055494 0.010308 0.248927

To estimate a univariate or non-parametric CDF we need to disaggregate the period-format data into an item-period format.

Item-Period Data Format#

item_period = bearing_cage_df["Hours"].to_list() * bearing_cage_df["Count"].sum()
ids = [[i] * 25 for i in range(bearing_cage_df["Count"].sum())]
ids = [int(i) for l in ids for i in l]
item_period_bearing_cage = pd.DataFrame(item_period, columns=["t"])
item_period_bearing_cage["id"] = ids
item_period_bearing_cage["failed"] = np.zeros(len(item_period_bearing_cage))

## Censor appropriate number of ids
unique_ids = item_period_bearing_cage["id"].unique()
censored = bearing_cage_df[bearing_cage_df["Censoring Indicator"] == "Censored"]
i = 0
stack = []
for hour, count, idx in zip(censored["Hours"], censored["Count"], censored["Count"].cumsum()):
    temp = item_period_bearing_cage[
        item_period_bearing_cage["id"].isin(unique_ids[i:idx])
        & (item_period_bearing_cage["t"] == hour)
    ]
    stack.append(temp)
    i = idx

censored_clean = pd.concat(stack)

### Add  appropriate number of failings
stack = []
unique_times = censored_clean["t"].unique()
for id, fail_time in zip(
    [9999, 9998, 9997, 9996, 9995, 9994],
    bearing_cage_df[bearing_cage_df["failed"] == 1]["t"].values,
):
    temp = pd.DataFrame(unique_times[unique_times < fail_time], columns=["t"])
    temp["id"] = id
    temp["failed"] = 0
    temp = pd.concat([temp, pd.DataFrame({"t": [fail_time], "id": [id], "failed": [1]}, index=[0])])
    stack.append(temp)

failed_clean = pd.concat(stack).sort_values(["id", "t"])
censored_clean
item_period_bearing_cage = pd.concat([failed_clean, censored_clean])
## Transpose for more concise visual
item_period_bearing_cage.head(30).T
0 1 2 3 4 5 6 7 8 9 ... 4 5 6 7 8 9 0 0 1 2
t 50.0 150.0 250.0 350.0 450.0 550.0 650.0 750.0 850.0 950.0 ... 450.0 550.0 650.0 750.0 850.0 950.0 1009.0 50.0 150.0 250.0
id 9994.0 9994.0 9994.0 9994.0 9994.0 9994.0 9994.0 9994.0 9994.0 9994.0 ... 9995.0 9995.0 9995.0 9995.0 9995.0 9995.0 9995.0 9996.0 9996.0 9996.0
failed 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

3行 × 30列

assert item_period_bearing_cage["id"].nunique() == 1703
assert item_period_bearing_cage["failed"].sum() == 6
assert item_period_bearing_cage[item_period_bearing_cage["t"] >= 1850]["id"].nunique() == 3

When we plot the empirical CDF we see that the y-axis reaches a maximum height of only around 0.05. A naive MLE fit will go dramatically wrong when extrapolating beyond the observed range of the data.

ax = plot_cdfs(
    actuarial_table_bearings,
    title="Bearings",
    dist_fits=False,
    xy=(20, 0.7),
    item_period=item_period_bearing_cage,
)
../_images/ab32293eceda7603411cd5544e1cd9814373ac410566edb06ce4d11d2c3c8089.png

Probability Plots: Comparing CDFs in a Restricted Linear Range#

In this section we will use the technique of linearising the MLE fits so that a visual "goodness of fit" check can be performed. These kinds of plots rely on a transformation that can be applied to location-scale distributions to turn their CDF into a linear space.

For both the lognormal and Weibull fits we can represent their CDF in a linear space as a relationship between the log of t and the appropriate \(CDF^{-1}\), as written out explicitly below.
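Concretely, the two transformations used in the plot below are (with the Weibull parameterised by scale \(\lambda\) and shape \(\rho\), as in lifelines):

For the lognormal, \(F(t) = \Phi\Big(\dfrac{\log(t) - \mu}{\sigma}\Big)\), so \(\Phi^{-1}(F(t)) = \dfrac{\log(t) - \mu}{\sigma}\) is linear in \(\log(t)\).

For the Weibull, \(F(t) = 1 - \exp(-(t/\lambda)^{\rho})\), so \(\log(-\log(1 - F(t))) = \rho(\log(t) - \log(\lambda))\) is again linear in \(\log(t)\); this is the sev_ppf transformation used in the code below.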

def sev_ppf(p):
    return np.log(-np.log(1 - p))


mosaic = """AABB
            CCCC"""
fig, axs = plt.subplot_mosaic(mosaic=mosaic, figsize=(20, 15))
axs = [axs[k] for k in axs.keys()]
ax = axs[0]
ax2 = axs[1]
ax3 = axs[2]
ax.plot(
    np.log(actuarial_table_bearings["t"]),
    norm.ppf(actuarial_table_bearings["logit_CI_95_ub"]),
    "-o",
    label="Non-Parametric CI UB",
    color="slateblue",
)
ax.scatter(
    np.log(actuarial_table_bearings["t"]),
    norm.ppf(actuarial_table_bearings["ploting_position"]),
    label="Non-Parametric CDF",
    color="black",
)
ax.plot(
    np.log(actuarial_table_bearings["t"]),
    norm.ppf(actuarial_table_bearings["logit_CI_95_lb"]),
    "-o",
    label="Non-Parametric CI LB",
    color="slateblue",
)
for mu in np.linspace(10, 12, 3):
    for sigma in np.linspace(1.6, 1.9, 3):
        rv = lognorm(s=sigma, scale=np.exp(mu))
        ax.plot(
            np.log(actuarial_table_bearings["t"]),
            norm.ppf(rv.cdf(actuarial_table_bearings["t"])),
            "--",
            label=f"LN({np.round(mu, 3)}, {np.round(sigma, 3)})",
            color="grey",
        )

lnf = LogNormalFitter().fit(item_period_bearing_cage["t"], item_period_bearing_cage["failed"])
rv = lognorm(s=lnf.sigma_, scale=np.exp(lnf.mu_))
ax.plot(
    np.log(actuarial_table_bearings["t"]),
    norm.ppf(rv.cdf(actuarial_table_bearings["t"])),
    "--",
    label=f"MLE LN({np.round(lnf.mu_, 3)}, {np.round(lnf.sigma_, 3)})",
    color="RED",
)

for r in np.linspace(1, 2, 3):
    for s in np.linspace(12000, 25000, 2):
        rv = weibull_min(c=r, scale=s)
        ax2.plot(
            np.log(actuarial_table_bearings["t"]),
            sev_ppf(rv.cdf(actuarial_table_bearings["t"])),
            "--",
            label=f"Wb({np.round(s, 3)}, {np.round(r, 3)})",
            color="lightblue",
        )


wbf = WeibullFitter().fit(item_period_bearing_cage["t"], item_period_bearing_cage["failed"])
rv = weibull_min(c=wbf.rho_, scale=wbf.lambda_)
ax2.plot(
    np.log(actuarial_table_bearings["t"]),
    sev_ppf(rv.cdf(actuarial_table_bearings["t"])),
    "--",
    label=f"MLE Wb({np.round(wbf.lambda_, 3)}, {np.round(wbf.rho_, 3)})",
    color="red",
)

ax2.plot(
    np.log(actuarial_table_bearings["t"]),
    sev_ppf(actuarial_table_bearings["logit_CI_95_ub"]),
    "-o",
    label="Non-Parametric CI UB",
    color="slateblue",
)
ax2.scatter(
    np.log(actuarial_table_bearings["t"]),
    sev_ppf(actuarial_table_bearings["ploting_position"]),
    label="Non-Parametric CDF",
    color="black",
)
ax2.plot(
    np.log(actuarial_table_bearings["t"]),
    sev_ppf(actuarial_table_bearings["logit_CI_95_lb"]),
    "-o",
    label="Non-Parametric CI LB",
    color="slateblue",
)

ax3.plot(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["logit_CI_95_ub"],
    "-o",
    label="Non-Parametric CI UB",
    color="slateblue",
)
ax3.scatter(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["F_hat"],
    label="Non-Parametric CDF",
    color="black",
)
ax3.plot(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["logit_CI_95_lb"],
    "-o",
    label="Non-Parametric CI LB",
    color="slateblue",
)
lnf.plot_cumulative_density(ax=ax3, color="cyan")
wbf.plot_cumulative_density(ax=ax3, color="darkorchid")


ax2.set_title("Linearizing the Weibull CDF", fontsize=20)
ax.set_title("Linearizing the Lognormal CDF", fontsize=20)
ax3.set_title("MLE CDF Fits", fontsize=20)
ax.legend()
ax.set_xlabel("Time")
ax2.set_xlabel("Time")
xticks = np.round(np.linspace(0, actuarial_table_bearings["t"].max(), 10), 1)
yticks = np.round(np.linspace(0, actuarial_table_bearings["F_hat"].max(), 10), 4)
ax.set_xticklabels(xticks)
ax.set_yticklabels(yticks)
ax2.set_xticklabels(xticks)
ax2.set_yticklabels([])
ax2.legend()
ax.set_ylabel("Fraction Failing");
/home/oriol/bin/miniconda3/envs/releases/lib/python3.10/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
/home/oriol/bin/miniconda3/envs/releases/lib/python3.10/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
/tmp/ipykernel_29435/2628457585.py:2: RuntimeWarning: divide by zero encountered in log
  return np.log(-np.log(1 - p))
/home/oriol/bin/miniconda3/envs/releases/lib/python3.10/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
/tmp/ipykernel_29435/2628457585.py:2: RuntimeWarning: divide by zero encountered in log
  return np.log(-np.log(1 - p))
/tmp/ipykernel_29435/2628457585.py:128: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels(xticks)
/tmp/ipykernel_29435/2628457585.py:129: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels(yticks)
/tmp/ipykernel_29435/2628457585.py:130: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax2.set_xticklabels(xticks)
../_images/fae0adf490c0ee48a8b2f080f75445eab43f732dd20c2f537fea18b6336b1c5e.png

We can see here that neither of the MLE fits covers the range of the observed data well.

Bayesian Modelling of Reliability Data#

We have now seen how to model and visualise parametric fits to sparse reliability data using a frequentist or maximum likelihood framework. We now want to show how the same style of inference can be achieved in the Bayesian paradigm.

As in the MLE paradigm, we need to model the censored likelihood. For most of the log-location distributions we have seen above, the likelihood is expressed as a function of a combination of the distribution's pdf \(\phi\) and cdf \(\Phi\), applied as appropriate depending on whether or not the data point was fully observed in the time window or censored.

\[ L(\mu, \sigma) = \prod_{i = 1}^{n} \Bigg(\dfrac{1}{\sigma t_{i}} \phi\Bigg[ \dfrac{\log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{\delta_{i}} \cdot \Bigg(1 - \Phi \Bigg[ \dfrac{\log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{1-\delta_{i}}\]

where \(\delta_{i}\) is an indicator of whether the observation is a failure or a right-censored observation. More complicated types of censoring can be included with similar modifications of the CDF, depending on the nature of the censored observations.
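As an illustration only (a minimal sketch with hypothetical toy values, not the model fitted below), this censored lognormal log-likelihood can be evaluated directly with scipy, where norm.logpdf handles the fully observed failures and norm.logsf supplies the log of the survival term for the right-censored units:

import numpy as np

from scipy.stats import norm


def lognormal_censored_loglik(mu, sigma, t, delta):
    ## delta[i] = 1 if a failure was observed at t[i], 0 if right-censored at t[i]
    z = (np.log(t) - mu) / sigma
    observed = delta * (norm.logpdf(z) - np.log(sigma * t))  # log pdf terms for the failures
    censored = (1 - delta) * norm.logsf(z)  # log survival terms for the censored units
    return np.sum(observed + censored)


## toy check: two observed failures and one unit still running at 20,000 hours
print(lognormal_censored_loglik(10.0, 0.5, np.array([9000.0, 15000.0, 20000.0]), np.array([1, 1, 0])))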

Direct PyMC Implementation of Weibull Survival Analysis#

We fit two versions of this model with different prior specifications: one takes a vague uniform prior over the data, and one specifies priors closer to the MLE fits. We will show below the implications of the prior specification for how well each model is calibrated against the observed data.

def weibull_lccdf(y, alpha, beta):
    """Log complementary cdf of Weibull distribution."""
    return -((y / beta) ** alpha)


item_period_max = item_period_bearing_cage.groupby("id")[["t", "failed"]].max()
y = item_period_max["t"].values
censored = ~item_period_max["failed"].values.astype(bool)


priors = {"beta": [100, 15_000], "alpha": [4, 1, 0.02, 8]}
priors_informative = {"beta": [10_000, 500], "alpha": [2, 0.5, 0.02, 3]}


def make_model(p, info=False):
    with pm.Model() as model:
        if info:
            beta = pm.Normal("beta", p["beta"][0], p["beta"][1])
        else:
            beta = pm.Uniform("beta", p["beta"][0], p["beta"][1])
        alpha = pm.TruncatedNormal(
            "alpha", p["alpha"][0], p["alpha"][1], lower=p["alpha"][2], upper=p["alpha"][3]
        )

        y_obs = pm.Weibull("y_obs", alpha=alpha, beta=beta, observed=y[~censored])
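        ## right-censored units contribute their log survival probability (the complementary log CDF above) via a Potential term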
        y_cens = pm.Potential("y_cens", weibull_lccdf(y[censored], alpha, beta))
        idata = pm.sample_prior_predictive()
        idata.extend(pm.sample(random_seed=100, target_accept=0.95))
        idata.extend(pm.sample_posterior_predictive(idata))

    return idata, model


idata, model = make_model(priors)
idata_informative, model = make_model(priors_informative, info=True)
/tmp/ipykernel_29435/305144286.py:28: UserWarning: The effect of Potentials on other parameters is ignored during prior predictive sampling. This is likely to lead to invalid or biased predictive samples.
  idata = pm.sample_prior_predictive()
Sampling: [alpha, beta, y_obs]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, alpha]
100.00% [8000/8000 00:06<00:00 Sampling 4 chains, 1 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
/tmp/ipykernel_29435/305144286.py:30: UserWarning: The effect of Potentials on other parameters is ignored during posterior predictive sampling. This is likely to lead to invalid or biased predictive samples.
  idata.extend(pm.sample_posterior_predictive(idata))
Sampling: [y_obs]
100.00% [4000/4000 00:00<00:00]
/tmp/ipykernel_29435/305144286.py:28: UserWarning: The effect of Potentials on other parameters is ignored during prior predictive sampling. This is likely to lead to invalid or biased predictive samples.
  idata = pm.sample_prior_predictive()
Sampling: [alpha, beta, y_obs]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, alpha]
100.00% [8000/8000 00:06<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 6 seconds.
/tmp/ipykernel_29435/305144286.py:30: UserWarning: The effect of Potentials on other parameters is ignored during posterior predictive sampling. This is likely to lead to invalid or biased predictive samples.
  idata.extend(pm.sample_posterior_predictive(idata))
Sampling: [y_obs]
100.00% [4000/4000 00:00<00:00]
idata
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (chain: 4, draw: 1000)
      Coordinates:
        * chain    (chain) int64 0 1 2 3
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
      Data variables:
          beta     (chain, draw) float64 7.427e+03 8.514e+03 ... 1.013e+04 5.044e+03
          alpha    (chain, draw) float64 2.232 2.385 2.49 2.55 ... 2.437 2.503 2.898
      Attributes:
          created_at:                 2023-01-15T20:06:41.169437
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0
          sampling_time:              6.423727989196777
          tuning_steps:               1000

    • <xarray.Dataset>
      Dimensions:      (chain: 4, draw: 1000, y_obs_dim_2: 6)
      Coordinates:
        * chain        (chain) int64 0 1 2 3
        * draw         (draw) int64 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
        * y_obs_dim_2  (y_obs_dim_2) int64 0 1 2 3 4 5
      Data variables:
          y_obs        (chain, draw, y_obs_dim_2) float64 6.406e+03 ... 5.59e+03
      Attributes:
          created_at:                 2023-01-15T20:06:42.789072
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0

    • <xarray.Dataset>
      Dimensions:                (chain: 4, draw: 1000)
      Coordinates:
        * chain                  (chain) int64 0 1 2 3
        * draw                   (draw) int64 0 1 2 3 4 5 ... 994 995 996 997 998 999
      Data variables: (12/17)
          perf_counter_diff      (chain, draw) float64 0.0008238 ... 0.002601
          n_steps                (chain, draw) float64 3.0 3.0 7.0 ... 11.0 7.0 9.0
          acceptance_rate        (chain, draw) float64 0.9404 1.0 ... 0.9876 0.9424
          smallest_eigval        (chain, draw) float64 nan nan nan nan ... nan nan nan
          largest_eigval         (chain, draw) float64 nan nan nan nan ... nan nan nan
          lp                     (chain, draw) float64 -80.96 -79.7 ... -81.05 -79.99
          ...                     ...
          step_size_bar          (chain, draw) float64 0.1791 0.1791 ... 0.1591 0.1591
          reached_max_treedepth  (chain, draw) bool False False False ... False False
          diverging              (chain, draw) bool False False False ... False False
          process_time_diff      (chain, draw) float64 0.0008239 ... 0.002601
          index_in_trajectory    (chain, draw) int64 -2 2 -3 -2 2 -2 ... -2 -1 -3 1 -6
          perf_counter_start     (chain, draw) float64 1.153e+04 ... 1.153e+04
      Attributes:
          created_at:                 2023-01-15T20:06:41.180463
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0
          sampling_time:              6.423727989196777
          tuning_steps:               1000

    • <xarray.Dataset>
      Dimensions:  (chain: 1, draw: 500)
      Coordinates:
        * chain    (chain) int64 0
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
      Data variables:
          alpha    (chain, draw) float64 3.896 5.607 4.329 5.377 ... 3.828 4.875 3.723
          beta     (chain, draw) float64 6.915e+03 1.293e+04 ... 6.599e+03 1.244e+04
      Attributes:
          created_at:                 2023-01-15T20:06:29.497052
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0

    • <xarray.Dataset>
      Dimensions:      (chain: 1, draw: 500, y_obs_dim_0: 6)
      Coordinates:
        * chain        (chain) int64 0
        * draw         (draw) int64 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499
        * y_obs_dim_0  (y_obs_dim_0) int64 0 1 2 3 4 5
      Data variables:
          y_obs        (chain, draw, y_obs_dim_0) float64 6.422e+03 ... 1.476e+04
      Attributes:
          created_at:                 2023-01-15T20:06:29.500007
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0

    • <xarray.Dataset>
      Dimensions:      (y_obs_dim_0: 6)
      Coordinates:
        * y_obs_dim_0  (y_obs_dim_0) int64 0 1 2 3 4 5
      Data variables:
          y_obs        (y_obs_dim_0) float64 1.51e+03 1.009e+03 990.0 ... 334.0 230.0
      Attributes:
          created_at:                 2023-01-15T20:06:29.501343
          arviz_version:              0.13.0
          inference_library:          pymc
          inference_library_version:  5.0.0

az.plot_trace(idata, kind="rank_vlines");
../_images/fc98ba995b87b11fcdf350a392c2bc4eb1ddb249e0d2e62d4f5d4f8dad523163.png
az.plot_trace(idata_informative, kind="rank_vlines");
../_images/2aa50007631faef558f640376c83724d18d3c1da661f1f7d8aa3569d80684755.png
az.summary(idata)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta 8149.835 2916.378 3479.929 13787.447 111.293 78.729 612.0 371.0 1.01
alpha 2.626 0.558 1.779 3.631 0.030 0.025 616.0 353.0 1.01
az.summary(idata_informative)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
beta 10031.200 491.768 9110.601 10975.705 10.891 7.742 2045.0 1940.0 1.0
alpha 2.181 0.166 1.875 2.487 0.004 0.003 2240.0 1758.0 1.0
joint_draws = az.extract(idata, num_samples=1000)[["alpha", "beta"]]
alphas = joint_draws["alpha"].values
betas = joint_draws["beta"].values

joint_draws_informative = az.extract(idata_informative, num_samples=1000)[["alpha", "beta"]]
alphas_informative = joint_draws_informative["alpha"].values
betas_informative = joint_draws_informative["beta"].values

mosaic = """AAAA
            BBCC"""
fig, axs = plt.subplot_mosaic(mosaic=mosaic, figsize=(20, 13))
axs = [axs[k] for k in axs.keys()]
ax = axs[0]
ax1 = axs[2]
ax2 = axs[1]
hist_data = []
for i in range(1000):
    draws = pm.draw(pm.Weibull.dist(alpha=alphas[i], beta=betas[i]), 1000)
    qe, pe = ecdf(draws)
    lkup = dict(zip(pe, qe))
    hist_data.append([lkup[0.1], lkup[0.05]])
    ax.plot(qe, pe, color="slateblue", alpha=0.1)
hist_data_info = []
for i in range(1000):
    draws = pm.draw(pm.Weibull.dist(alpha=alphas_informative[i], beta=betas_informative[i]), 1000)
    qe, pe = ecdf(draws)
    lkup = dict(zip(pe, qe))
    hist_data_info.append([lkup[0.1], lkup[0.05]])
    ax.plot(qe, pe, color="pink", alpha=0.1)
hist_data = pd.DataFrame(hist_data, columns=["p10", "p05"])
hist_data_info = pd.DataFrame(hist_data_info, columns=["p10", "p05"])
draws = pm.draw(pm.Weibull.dist(alpha=np.mean(alphas), beta=np.mean(betas)), 1000)
qe, pe = ecdf(draws)
ax.plot(qe, pe, color="purple", label="Expected CDF Uninformative Prior")
draws = pm.draw(
    pm.Weibull.dist(alpha=np.mean(alphas_informative), beta=np.mean(betas_informative)), 1000
)
qe, pe = ecdf(draws)
ax.plot(qe, pe, color="magenta", label="Expected CDF Informative Prior")
ax.plot(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["logit_CI_95_ub"],
    "--",
    label="Non-Parametric CI UB",
    color="black",
)
ax.plot(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["F_hat"],
    "-o",
    label="Non-Parametric CDF",
    color="black",
    alpha=1,
)
ax.plot(
    actuarial_table_bearings["t"],
    actuarial_table_bearings["logit_CI_95_lb"],
    "--",
    label="Non-Parametric CI LB",
    color="black",
)
ax.set_xlim(0, 2500)
ax.set_title(
    "Bayesian Estimation of Uncertainty in the Posterior Predictive CDF \n Informative and Non-Informative Priors",
    fontsize=20,
)
ax.set_ylabel("Fraction Failing")
ax.set_xlabel("Time")
ax1.hist(
    hist_data["p10"], bins=30, ec="black", color="skyblue", alpha=0.4, label="Uninformative Prior"
)
ax1.hist(
    hist_data_info["p10"],
    bins=30,
    ec="black",
    color="slateblue",
    alpha=0.4,
    label="Informative Prior",
)
ax1.set_title("Distribution of 10% failure Time", fontsize=20)
ax1.legend()
ax2.hist(
    hist_data["p05"], bins=30, ec="black", color="cyan", alpha=0.4, label="Uninformative Prior"
)
ax2.hist(
    hist_data_info["p05"], bins=30, ec="black", color="pink", alpha=0.4, label="Informative Prior"
)
ax2.legend()
ax2.set_title("Distribution of 5% failure Time", fontsize=20)
wbf = WeibullFitter().fit(item_period_bearing_cage["t"] + 1e-25, item_period_bearing_cage["failed"])
wbf.plot_cumulative_density(ax=ax, color="cyan", label="Weibull MLE Fit")
ax.legend()
ax.set_ylim(0, 0.25);
../_images/9e66bf26e93ddc4219eaa7228b64b7f91bfc2570c23f596050b4ca578fafcf4f.png

We can see how the Bayesian uncertainty estimates driven by our deliberately vague priors encompass more uncertainty than our MLE fit, and how the uninformative prior implies a wider predictive distribution for the 5% and 10% failure times. The Bayesian model with the uninformative prior seems to do a better job of capturing the uncertainty in the non-parametric estimate of our CDF, but without more information it is hard to say which of the models is more appropriate.

In a data-sparse setting like this, the substantive estimates of the percentage failing over time implied by each model fit are especially critical. They are a meaningful check we can run with subject-matter experts, discussing the plausibility of the expected value and range for, say, the 10% failure time of the product.

Predicting the Number of Failures in an Interval#

Because our observed failure data is extremely sparse, we have to be very careful about extrapolating beyond the observed range of time, but we can ask about the number of failures predictable in the lower tail of our CDF. This provides another lens on the data that can support discussion with subject-matter experts.

The Plug-in Estimate#

Imagine we want to know how many bearings will fail between 150 and 600 hours of service. We can calculate this based on the estimated CDF and the number of new bearings that will be in service. We first calculate:

\[ \hat{\rho} = \dfrac{\hat{F}(t_1) - \hat{F}(t_0)}{1 - \hat{F}(t_0)} \]

to establish the probability of a failure occurring in the interval; the point prediction for the number of failures in the interval is then given by risk_set * \(\hat{\rho}\).

mle_fit = weibull_min(c=2, scale=10_000)
rho = (mle_fit.cdf(600) - mle_fit.cdf(150)) / (1 - mle_fit.cdf(150))  ## conditional probability of failing in the interval
print("Rho:", rho)
print("N at Risk:", 1700)
print("Expected Number Failing in between 150 and 600 hours:", 1700 * rho)
print("Lower Bound 95% PI :", binom(1700, rho).ppf(0.05))
print("Upper Bound 95% PI:", binom(1700, rho).ppf(0.95))
Rho: 0.0033685024546080927
N at Risk: 1700
Expected Number Failing in between 150 and 600 hours: 5.7264541728337575
Lower Bound 95% PI : 2.0
Upper Bound 95% PI: 10.0

Applying the Same Procedure on the Bayesian Posterior#

We will use the posterior predictive distribution of the uninformative model. Here we show how to derive the uncertainty in the estimates of the 95% prediction interval for the number of failures in a time interval. As we saw above, the MLE alternative to this procedure is to generate a predictive distribution from bootstrap sampling. The bootstrap procedure tends to agree with the plug-in procedure using the MLE estimates, and it lacks the flexibility of being able to specify prior information.

import xarray as xr

from xarray_einstats.stats import XrContinuousRV, XrDiscreteRV


def PI_failures(joint_draws, lp, up, n_at_risk):
    fit = XrContinuousRV(weibull_min, joint_draws["alpha"], scale=joint_draws["beta"])
    rho = (fit.cdf(up) - fit.cdf(lp)) / (1 - fit.cdf(lp))  ## conditional probability of failure in the interval
    lub = XrDiscreteRV(binom, n_at_risk, rho).ppf([0.05, 0.95])
    lb, ub = lub.sel(quantile=0.05, drop=True), lub.sel(quantile=0.95, drop=True)
    point_prediction = n_at_risk * rho
    return xr.Dataset(
        {"rho": rho, "n_at_risk": n_at_risk, "lb": lb, "ub": ub, "expected": point_prediction}
    )


output_ds = PI_failures(joint_draws, 150, 600, 1700)
output_ds
<xarray.Dataset>
Dimensions:    (sample: 1000)
Coordinates:
  * sample     (sample) object MultiIndex
  * chain      (sample) int64 1 0 0 0 2 1 1 2 1 3 0 2 ... 1 3 2 3 0 1 3 3 0 1 2
  * draw       (sample) int64 776 334 531 469 19 675 ... 437 470 242 662 316 63
Data variables:
    rho        (sample) float64 0.0007735 0.004392 ... 0.001465 0.001462
    n_at_risk  int64 1700
    lb         (sample) float64 0.0 3.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0
    ub         (sample) float64 3.0 12.0 5.0 5.0 4.0 7.0 ... 5.0 7.0 6.0 5.0 5.0
    expected   (sample) float64 1.315 7.466 2.24 2.518 ... 2.835 2.491 2.486
def cost_func(failures, power):
    ### Imagined cost function for failing item e.g. refunds required
    return np.power(failures, power)


mosaic = """AAAA
            BBCC"""
fig, axs = plt.subplot_mosaic(mosaic=mosaic, figsize=(20, 12))
axs = [axs[k] for k in axs.keys()]
ax = axs[0]
ax1 = axs[1]
ax2 = axs[2]

ax.scatter(
    joint_draws["alpha"],
    output_ds["expected"],
    c=joint_draws["beta"],
    cmap=cm.cool,
    alpha=0.3,
    label="Coloured by function of Beta values",
)
ax.legend()
ax.set_ylabel("Expected Failures")
ax.set_xlabel("Alpha")
ax.set_title(
    "Posterior Predictive Expected Failure Count between 150-600 hours \nas a function of Weibull(alpha, beta)",
    fontsize=20,
)

ax1.hist(
    output_ds["lb"],
    ec="black",
    color="slateblue",
    label="95% PI Lower Bound on Failure Count",
    alpha=0.3,
)
ax1.axvline(output_ds["lb"].mean(), label="Expected 95% PI Lower Bound on Failure Count")
ax1.axvline(output_ds["ub"].mean(), label="Expected 95% PI Upper Bound on Failure Count")
ax1.hist(
    output_ds["ub"],
    ec="black",
    color="cyan",
    label="95% PI Upper Bound on Failure Count",
    bins=20,
    alpha=0.3,
)
ax1.hist(
    output_ds["expected"], ec="black", color="pink", label="Expected Count of Failures", bins=20
)
ax1.set_title("Uncertainty in the Posterior Prediction Interval of Failure Counts", fontsize=20)
ax1.legend()

ax2.set_title("Expected Costs Distribution(s)  \nbased on implied Failure counts", fontsize=20)
ax2.hist(
    cost_func(output_ds["expected"], 2.3),
    label="Cost(failures,2.3)",
    color="royalblue",
    alpha=0.3,
    ec="black",
    bins=20,
)
ax2.hist(
    cost_func(output_ds["expected"], 2),
    label="Cost(failures,2)",
    color="red",
    alpha=0.5,
    ec="black",
    bins=20,
)
ax2.set_xlabel("$ cost")
# ax2.set_xlim(-60, 0)
ax2.legend();
../_images/ebbd2377f7dc797537191c69d7a91d5953c21d871afac7714c845a4bd222169a.png

The choice of model matters hugely in this kind of setting. The decision about which failure profile is appropriate has to be informed by subject-matter experts, because inference from such sparse data is always risky. If failures carry real costs, an understanding of the uncertainty is crucial, and domain experts are generally better placed to judge whether 2 or 7 failures can reasonably be expected within 600 hours of service.

Conclusion#

We have seen how to analyse and model reliability from both a frequentist and a Bayesian perspective, and how to compare the results against non-parametric estimates. We have shown how prediction intervals can be derived for a number of key statistics by both bootstrapping and Bayesian methods, and we have seen ways to calibrate these prediction intervals through re-sampling methods and informative prior specification. These views are complementary, and the choice of an appropriate technique should be driven by the factors of the question of interest, not by ideological commitment.

In particular we have seen how the MLE fits to our bearings data provide a sound initial guess for establishing priors in the Bayesian analysis. We have also seen how subject-matter expertise can be elicited by deriving key quantities from the implied models and submitting these implications to scrutiny. The choice of Bayesian prediction interval is calibrated against our prior expectations, and where we have no prior expectations we can supply vague or non-informative priors. The implications of these priors can again be checked and analysed against an appropriate cost function.

Authors#

References#

[1]

W.Q. Meeker, L.A. Escobar, and F.G. Pascual. Statistical Methods for Reliability Data. Wiley, 2022.

Watermark#

%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray_einstats
Last updated: Sun Jan 15 2023

Python implementation: CPython
Python version       : 3.10.8
IPython version      : 8.7.0

pytensor       : 2.8.10
xarray_einstats: 0.5.0.dev0

pandas    : 1.5.2
pymc      : 5.0.0
matplotlib: 3.6.2
xarray    : 2022.12.0
numpy     : 1.24.0
arviz     : 0.13.0

Watermark: 2.3.1

License notice#

All the notebooks in this example gallery are provided under the MIT License, which allows modification and redistribution for any use provided the copyright and license notices are preserved.

Citing PyMC examples#

To cite this notebook, use the DOI provided by Zenodo for the pymc-examples repository.

Important

Many notebooks are adapted from other sources: blogs, books... In such cases you should cite the original source as well.

Also remember to cite the relevant libraries used by your code.

Here is a citation template in BibTeX:

@incollection{citekey,
  author    = "<notebook authors, see above>",
  title     = "<notebook title>",
  editor    = "PyMC Team",
  booktitle = "PyMC examples",
  doi       = "10.5281/zenodo.5654871"
}

which once rendered could look like: