Least squares fitting of models to data

This is a quick introduction to statsmodels for physical scientists (e.g. physicists, astronomers) or engineers.

Why is this needed?

Because most of statsmodels was written by statisticians, who use different terminology and sometimes different methods, making it hard to know which classes and functions are relevant and what their inputs and outputs mean.

[1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

Linear models

Assume you have measurements y at positions x, as well as measurement errors y_err.

How can you use statsmodels to fit a straight-line model to this data?

For an extensive discussion, see Hogg et al. (2010), "Data analysis recipes: Fitting a model to data" … we will use the example data they give in their Table 1.

So the model is f(x) = a * x + b, and in Figure 1 they print the result we want to reproduce … the best-fit parameters and parameter errors for the "standard weighted least-squares fit" for this data are:

* a = 2.24 +- 0.11
* b = 34 +- 18

[2]:
data = """
  x   y y_err
201 592    61
244 401    25
 47 583    38
287 402    15
203 495    21
 58 173    15
210 479    27
202 504    14
198 510    30
158 416    16
165 393    14
201 442    25
157 317    52
131 311    16
166 400    34
160 337    31
186 423    42
125 334    26
218 533    16
146 344    22
"""
from io import StringIO

data = pd.read_csv(StringIO(data), sep=r"\s+").astype(float)

# Note: for the results we compare with the paper here, they drop the first four points
data.head()
[2]:
x y y_err
0 201.0 592.0 61.0
1 244.0 401.0 25.0
2 47.0 583.0 38.0
3 287.0 402.0 15.0
4 203.0 495.0 21.0

To fit a straight line, use the weighted least squares class WLS … the parameters are called:

* exog = sm.add_constant(x)
* endog = y
* weights = 1 / y_err ** 2

Note that exog must be a 2-dimensional array, with x as one column and an extra column of ones. Adding this column of ones means you want to fit the model y = a * x + b; leaving it off means you want to fit the model y = a * x.
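
As a minimal illustration (on a toy array, not the example data), sm.add_constant prepends a column of ones to a 1-D array, producing the 2-D exog that WLS expects:

import numpy as np
import statsmodels.api as sm

# Toy 1-D x; add_constant prepends a column of ones by default.
x_demo = np.array([1.0, 2.0, 3.0])
print(sm.add_constant(x_demo))
# [[1. 1.]
#  [1. 2.]
#  [1. 3.]]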

And you have to use the option cov_type='fixed scale' to tell statsmodels that you really have measurement errors with an absolute scale. If you do not do this, statsmodels will treat the weights as relative weights between the data points and internally re-scale them so that the best-fit model has chi**2 / ndf = 1 (a short sketch demonstrating this rescaling follows the fit results below).

[3]:
exog = sm.add_constant(data["x"])
endog = data["y"]
weights = 1.0 / (data["y_err"] ** 2)
wls = sm.WLS(endog, exog, weights)
results = wls.fit(cov_type="fixed scale")
print(results.summary())
                            WLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.400
Model:                            WLS   Adj. R-squared:                  0.367
Method:                 Least Squares   F-statistic:                     193.5
Date:                Wed, 16 Oct 2024   Prob (F-statistic):           4.52e-11
Time:                        18:27:16   Log-Likelihood:                -119.06
No. Observations:                  20   AIC:                             242.1
Df Residuals:                      18   BIC:                             244.1
Df Model:                           1
Covariance Type:          fixed scale
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        213.2735     14.394     14.817      0.000     185.062     241.485
x              1.0767      0.077     13.910      0.000       0.925       1.228
==============================================================================
Omnibus:                        0.943   Durbin-Watson:                   2.901
Prob(Omnibus):                  0.624   Jarque-Bera (JB):                0.181
Skew:                          -0.205   Prob(JB):                        0.914
Kurtosis:                       3.220   Cond. No.                         575.
==============================================================================

Notes:
[1] Standard Errors are based on fixed scale
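
To see this rescaling in action, here is a short sketch (reusing the wls model and the fixed-scale results from the cell above) that also fits without cov_type='fixed scale', so the weights are treated as relative and the parameter errors are rescaled:

# Default fit: weights are treated as relative between data points,
# and the parameter errors are internally rescaled so that
# chi**2 / ndf = 1.
results_rel = wls.fit()
print("fixed scale errors:   ", results.bse.values)
print("relative scale errors:", results_rel.bse.values)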

Check against scipy.optimize.curve_fit

[4]:
# You can use `scipy.optimize.curve_fit` to get the best-fit parameters and parameter errors.
from scipy.optimize import curve_fit


def f(x, a, b):
    return a * x + b


xdata = data["x"]
ydata = data["y"]
p0 = [0, 0]  # initial parameter estimate
sigma = data["y_err"]
popt, pcov = curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print("a = {0:10.3f} +- {1:10.3f}".format(popt[0], perr[0]))
print("b = {0:10.3f} +- {1:10.3f}".format(popt[1], perr[1]))
a =      1.077 +-      0.077
b =    213.273 +-     14.394
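
As noted in the data cell above, the paper drops the first four points. Here is a sketch of the same WLS fit on the remaining 16 points, which should reproduce the a = 2.24 +- 0.11 and b = 34 +- 18 quoted from Hogg et al. (2010) above:

# Refit on points 5-20 only, as done for the "standard weighted
# least-squares fit" quoted in the paper.
data_good = data.iloc[4:]
exog_good = sm.add_constant(data_good["x"])
endog_good = data_good["y"]
weights_good = 1.0 / (data_good["y_err"] ** 2)
results_good = sm.WLS(endog_good, exog_good, weights_good).fit(cov_type="fixed scale")
print(results_good.params)
print(results_good.bse)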

Check with a custom cost function

[5]:
# You can also use `scipy.optimize.minimize` and write your own cost function.
# This does not give you the parameter errors though ... you'd have
# to estimate the Hessian matrix separately (a sketch follows below) ...
from scipy.optimize import minimize


def chi2(pars):
    """Cost function."""
    y_model = pars[0] * data["x"] + pars[1]
    chi = (data["y"] - y_model) / data["y_err"]
    return np.sum(chi ** 2)


result = minimize(fun=chi2, x0=[0, 0])
popt = result.x
print("a = {0:10.3f}".format(popt[0]))
print("b = {0:10.3f}".format(popt[1]))
a =      1.077
b =    213.274
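
As mentioned in the comment above, minimize does not report parameter errors directly. One rough sketch, assuming the default BFGS method (whose hess_inv attribute approximates the inverse Hessian at the minimum): since chi**2 = -2 * ln(likelihood) for Gaussian errors, the covariance matrix is 2 * inv(Hessian) of the chi**2 cost function.

# Rough error estimate from the approximate inverse Hessian that BFGS
# accumulates; cov = 2 * inv(H) because chi**2 = -2 ln L.
# Caveat: BFGS's hess_inv is only an approximation; a finite-difference
# Hessian would be more reliable.
result = minimize(fun=chi2, x0=[0, 0], method="BFGS")
cov = 2.0 * result.hess_inv
perr = np.sqrt(np.diag(cov))
print("a = {0:10.3f} +- {1:10.3f}".format(result.x[0], perr[0]))
print("b = {0:10.3f} +- {1:10.3f}".format(result.x[1], perr[1]))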

Non-linear models

[6]:
# TODO: we could use the examples from here:
# http://probfit.readthedocs.org/en/latest/api.html#probfit.costfunc.Chi2Regression

Last update: Oct 16, 2024