Least squares fitting of models to data¶
This is a quick introduction to statsmodels for physical scientists (e.g. physicists, astronomers) or engineers.
Why is this needed?
Because most of statsmodels was written by statisticians, who use a different terminology and sometimes different methods, making it hard to know which classes and functions are relevant and what their inputs and outputs mean.
[1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
Linear models¶
Assume you have data points with measurements y at positions x, as well as measurement errors y_err.
How can you use statsmodels to fit a straight-line model to this data?
For an extensive discussion see Hogg et al. (2010), "Data analysis recipes: Fitting a model to data" ... we will use the example data given in their Table 1.
So the model is f(x) = a * x + b, and in Figure 1 they print the result we want to reproduce ... the best-fit parameters and parameter errors for a "standard weighted least-squares fit" to this data set are:
* a = 2.24 +- 0.11
* b = 34 +- 18
[2]:
data = """
x y y_err
201 592 61
244 401 25
47 583 38
287 402 15
203 495 21
58 173 15
210 479 27
202 504 14
198 510 30
158 416 16
165 393 14
201 442 25
157 317 52
131 311 16
166 400 34
160 337 31
186 423 42
125 334 26
218 533 16
146 344 22
"""
from io import StringIO

data = pd.read_csv(StringIO(data), sep=r"\s+").astype(float)
# Note: for the results we compare with the paper here, they drop the first four points
data.head()
[2]:
|   | x     | y     | y_err |
|---|-------|-------|-------|
| 0 | 201.0 | 592.0 | 61.0  |
| 1 | 244.0 | 401.0 | 25.0  |
| 2 | 47.0  | 583.0 | 38.0  |
| 3 | 287.0 | 402.0 | 15.0  |
| 4 | 203.0 | 495.0 | 21.0  |
To fit a straight line, use the weighted least-squares class WLS ... the parameters are called:
* exog = sm.add_constant(x)
* endog = y
* weights = 1 / y_err ** 2
Note that exog must be a 2-dimensional array with x as one column and an extra column of ones. Adding the column of ones means you want to fit the model y = a * x + b; leaving it off means you want to fit the model y = a * x.
You also have to use the option cov_type='fixed scale' to tell statsmodels that you really have measurement errors with an absolute scale. If you do not, statsmodels will treat the weights as relative weights between the data points and internally re-scale them so that the best-fit model has chi**2 / ndf = 1.
[3]:
exog = sm.add_constant(data["x"])
endog = data["y"]
weights = 1.0 / (data["y_err"] ** 2)
wls = sm.WLS(endog, exog, weights)
results = wls.fit(cov_type="fixed scale")
print(results.summary())
WLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.400
Model: WLS Adj. R-squared: 0.367
Method: Least Squares F-statistic: 193.5
Date: Wed, 16 Oct 2024 Prob (F-statistic): 4.52e-11
Time: 18:27:16 Log-Likelihood: -119.06
No. Observations: 20 AIC: 242.1
Df Residuals: 18 BIC: 244.1
Df Model: 1
Covariance Type: fixed scale
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 213.2735 14.394 14.817 0.000 185.062 241.485
x 1.0767 0.077 13.910 0.000 0.925 1.228
==============================================================================
Omnibus: 0.943 Durbin-Watson: 2.901
Prob(Omnibus): 0.624 Jarque-Bera (JB): 0.181
Skew: -0.205 Prob(JB): 0.914
Kurtosis: 3.220 Cond. No. 575.
==============================================================================
Notes:
[1] Standard Errors are based on fixed scale
Check with scipy.optimize.curve_fit¶
[4]:
# You can use `scipy.optimize.curve_fit` to get the best-fit parameters and parameter errors.
from scipy.optimize import curve_fit
def f(x, a, b):
return a * x + b
xdata = data["x"]
ydata = data["y"]
p0 = [0, 0] # initial parameter estimate
sigma = data["y_err"]
popt, pcov = curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print("a = {0:10.3f} +- {1:10.3f}".format(popt[0], perr[0]))
print("b = {0:10.3f} +- {1:10.3f}".format(popt[1], perr[1]))
a = 1.077 +- 0.077
b = 213.273 +- 14.394
Check with a custom cost function¶
[5]:
# You can also use `scipy.optimize.minimize` and write your own cost function.
# This does not give you the parameter errors though ... you'd have
# to estimate the Hessian matrix separately ...
from scipy.optimize import minimize
def chi2(pars):
"""Cost function."""
y_model = pars[0] * data["x"] + pars[1]
chi = (data["y"] - y_model) / data["y_err"]
return np.sum(chi ** 2)
result = minimize(fun=chi2, x0=[0, 0])
popt = result.x
print("a = {0:10.3f}".format(popt[0]))
print("b = {0:10.3f}".format(popt[1]))
a = 1.077
b = 213.274
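As the comment in the cell above notes, minimize does not return parameter errors directly. With the default BFGS method, result.hess_inv holds an approximation of the inverse Hessian of the cost function, and for a chi**2 cost the covariance is roughly 2 * hess_inv. A sketch on synthetic data (the true model y = 3x + 5 is made up, and the BFGS Hessian approximation can be inaccurate, so treat the errors as rough):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y_err = np.full_like(x, 2.0)
y = 3.0 * x + 5.0 + rng.normal(0, y_err)   # synthetic straight-line data

def chi2(pars):
    """Chi-squared cost function for the line model."""
    y_model = pars[0] * x + pars[1]
    return np.sum(((y - y_model) / y_err) ** 2)

result = minimize(fun=chi2, x0=[0.0, 0.0])  # default method is BFGS
pcov = 2.0 * result.hess_inv                # covariance ~ 2 * inverse Hessian of chi**2
perr = np.sqrt(np.diag(pcov))
print(result.x)
print(perr)
```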
Non-linear models¶
[6]:
# TODO: we could use the examples from here:
# http://probfit.readthedocs.org/en/latest/api.html#probfit.costfunc.Chi2Regression
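Until the TODO above is filled in, here is a minimal non-linear example using scipy.optimize.curve_fit on synthetic data (the exponential model and all numbers are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    """Illustrative non-linear model: exponential decay."""
    return a * np.exp(-b * x)

rng = np.random.default_rng(2)
xdata = np.linspace(0, 4, 50)
y_err = np.full_like(xdata, 0.05)
ydata = f(xdata, 2.5, 1.3) + rng.normal(0, y_err)

# As in the linear case, absolute_sigma=True treats y_err as absolute errors
popt, pcov = curve_fit(f, xdata, ydata, p0=[1.0, 1.0], sigma=y_err, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print(popt)
print(perr)
```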
Last update:
Oct 16, 2024