Optimizer Schedules

`constant_schedule`(value)	Constructs a constant schedule.
`cosine_decay_schedule`(init_value, decay_steps)	Returns a function which implements cosine learning rate decay.
`cosine_onecycle_schedule`(transition_steps, ...)	Returns a function which implements the onecycle learning rate schedule.
`exponential_decay`(init_value, ...[, ...])	Constructs a schedule with either continuous or discrete exponential decay.
`join_schedules`(schedules, boundaries)	Sequentially applies multiple schedules.
`linear_onecycle_schedule`(transition_steps, ...)	Returns a learning rate with three linear phases.
`linear_schedule`(init_value, end_value, ...)	Schedule with linear transition from `init_value` to `end_value`.
`piecewise_constant_schedule`(init_value[, ...])	Returns a function which implements a piecewise constant schedule.
`piecewise_interpolate_schedule`(...[, ...])	Returns a function which implements a piecewise interpolated schedule.
`polynomial_schedule`(init_value, end_value, ...)	Constructs a schedule with polynomial transition from init to end value.
`sgdr_schedule`(cosine_kwargs)	SGD with warm restarts.
`warmup_constant_schedule`(init_value, ...)	Linear warmup followed by a constant schedule, i.e. no decay.
`warmup_cosine_decay_schedule`(init_value, ...)	Linear warmup followed by cosine decay.
`warmup_exponential_decay_schedule`(...[, ...])	Linear warmup followed by exponential decay.
`Schedule`	alias of `Callable`[[`Array` | `ndarray` | `bool` | `number` | `float` | `int`], `Array` | `ndarray` | `bool` | `number` | `float` | `int`]
`InjectHyperparamsState`(count, hyperparams, ...)	Deprecated class kept for backwards compatibility.
`inject_hyperparams`(inner_factory[, ...])	Wrapper to inject stateful hyperparameters into GradientTransformations.

optax.schedules.Schedule: alias of Callable[[Array | ndarray | bool | number | float | int], Array | ndarray | bool | number | float | int]

Constant schedule

optax.schedules.constant_schedule(value: float | int) → base.Schedule

Constructs a constant schedule.

Parameters:

value – value to be held constant throughout.

Returns:

schedule: A function that maps step counts to values.

Examples

>>> schedule_fn = optax.constant_schedule(5)
>>> schedule_fn(0)
5
>>> schedule_fn(100)
5

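Cosine decay schedules

optax.schedules.cosine_decay_schedule(init_value: float, decay_steps: int, alpha: float = 0.0, exponent: float = 1.0) → base.Schedule

Returns a function which implements cosine learning rate decay.

This schedule smoothly decreases the learning rate over a specified number of steps (decay_steps). The decay follows a cosine function, with an optional exponent to modify the decay curve. A nonzero minimum multiplier (alpha) keeps the learning rate from decaying all the way to zero: the floor is alpha * init_value.

More precisely, the learning rate at iteration \(t\) is given by:

\[\begin{cases} I (1 - \alpha) \left( \frac{1 + \cos(\pi\,\frac{t}{T})}{2} \right)^{p} + I \alpha, & \text{if } t \leq T \\ I \alpha, & \text{if } t > T \end{cases} \]

where \(T\) is the number of decay steps (decay_steps), \(p\) is the exponent and \(I\) is the initial value (init_value).

Parameters:

init_value – the initial value of the learning rate.
decay_steps – positive integer, the number of steps over which to apply the decay.
alpha – the minimum value of the multiplier used to adjust the learning rate. Defaults to 0.0.
exponent – The default decay is 0.5 * (1 + cos(pi * t/T)), where t is the current timestep and T is the decay_steps. The exponent modifies this to be (0.5 * (1 + cos(pi * t/T))) ** exponent. Defaults to 1.0.

Returns:

schedule: A function that maps step counts to values.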
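
To make the decay rule concrete, here is a plain-Python sketch of the cosine decay computation. The exponent is applied to the whole cosine factor, as the exponent parameter describes. It does not call optax and ignores JAX array semantics. The helper name cosine_decay and the sample values are illustrative assumptions.

```python
import math


def cosine_decay(step, init_value, decay_steps, alpha=0.0, exponent=1.0):
    """Plain-Python sketch of cosine decay (illustrative, not optax itself)."""
    # Past decay_steps the value stays at the floor init_value * alpha.
    t = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / decay_steps))
    return init_value * ((1.0 - alpha) * cosine ** exponent + alpha)


# Hypothetical settings: start at 1.0, decay over 100 steps, floor at 10%.
start = cosine_decay(0, init_value=1.0, decay_steps=100, alpha=0.1)
middle = cosine_decay(50, init_value=1.0, decay_steps=100, alpha=0.1)
end = cosine_decay(100, init_value=1.0, decay_steps=100, alpha=0.1)
late = cosine_decay(500, init_value=1.0, decay_steps=100, alpha=0.1)
```

With these settings the value falls from init_value at step 0 to alpha * init_value at step 100 and stays there afterwards.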

References

Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts, 2017

optax.schedules.cosine_onecycle_schedule(transition_steps: int, peak_value: float, pct_start: float = 0.3, div_factor: float = 25.0, final_div_factor: float = 10000.0) → base.Schedule

Returns a function which implements the onecycle learning rate schedule.

This schedule increases the learning rate and then decreases it in a cosine-like manner. The number of steps over which the learning rate increases is determined by the pct_start argument. The maximum value of the learning rate is determined by the peak_value argument, the initial value of the learning rate is determined through the formula init_value = peak_value / div_factor, and the final value is determined by the final_div_factor argument.

Parameters:

transition_steps – number of steps over which annealing takes place.
peak_value – maximum value attained by the schedule at pct_start of the cycle (in number of steps).
pct_start – the fraction of the cycle (in number of steps) spent increasing the learning rate.
div_factor – determines the initial value via init_value = peak_value / div_factor.
final_div_factor – determines the final value via final_value = init_value / final_div_factor.

Returns:

schedule: A function that maps step counts to values.
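
As a quick check of the arithmetic implied by div_factor and final_div_factor, this plain-Python sketch computes the initial value, the step at which the peak is reached, and the final value for a hypothetical 1000-step cycle. The helper name onecycle_endpoints is made up for illustration. Truncating pct_start * transition_steps to an integer is an assumption, not something the description above specifies.

```python
def onecycle_endpoints(transition_steps, peak_value, pct_start=0.3,
                       div_factor=25.0, final_div_factor=1e4):
    """Return (init_value, peak_step, final_value) for a one-cycle schedule."""
    init_value = peak_value / div_factor
    final_value = init_value / final_div_factor
    # Assumption: the warm-up length is truncated to a whole number of steps.
    peak_step = int(pct_start * transition_steps)
    return init_value, peak_step, final_value


init_value, peak_step, final_value = onecycle_endpoints(1000, peak_value=0.1)
```

With the default factors, the schedule starts 25 times below the peak and ends a further 10000 times below that starting value.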

References

Smith et al., Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates, 2017

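Exponential decay schedules

optax.schedules.exponential_decay(init_value: float, transition_steps: int, decay_rate: float, transition_begin: int = 0, staircase: bool = False, end_value: float | None = None) → base.Schedule

Constructs a schedule with either continuous or discrete exponential decay.

This function applies an exponential decay function to a provided initial value. When count >= transition_begin the function returns the decayed value as:

rate_factor = ((count - transition_begin) / transition_steps)
decayed_value = init_value * (decay_rate ** rate_factor)

If the argument staircase is True, then (count - transition_begin) / transition_steps is an integer division and the decayed value follows a staircase function.

Parameters:

init_value – the initial learning rate.
transition_steps – must be positive. See the decay computation above.
decay_rate – must not be zero. The decay rate.
transition_begin – must be non-negative. After how many steps to start annealing (before then, the scalar value is held fixed at init_value).
staircase – if True, decay the values at discrete intervals.
end_value – the value at which the exponential decay stops. When decay_rate < 1, end_value is treated as a lower bound, otherwise as an upper bound. Has no effect when decay_rate = 0.

Returns:

schedule: A function that maps step counts to values.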
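
The decay computation, the staircase option and the end_value bound can be sketched in plain Python as follows. This is an illustrative reimplementation under the rules stated above, not the library code. It works on Python numbers rather than JAX arrays, and the sample values are hypothetical.

```python
def exponential_decay(step, init_value, transition_steps, decay_rate,
                      transition_begin=0, staircase=False, end_value=None):
    """Plain-Python sketch of exponential decay (illustrative only)."""
    count = step - transition_begin
    if count <= 0:
        # Before the transition begins the value is held at init_value.
        return init_value
    # Staircase mode uses integer division, so the value drops in discrete jumps.
    rate_factor = count // transition_steps if staircase else count / transition_steps
    value = init_value * decay_rate ** rate_factor
    if end_value is not None:
        # Lower bound when decaying (decay_rate < 1), upper bound when growing.
        value = max(value, end_value) if decay_rate < 1 else min(value, end_value)
    return value


# Hypothetical settings: halve the value every 10 steps.
continuous = exponential_decay(15, 1.0, transition_steps=10, decay_rate=0.5)
stepped = exponential_decay(15, 1.0, transition_steps=10, decay_rate=0.5, staircase=True)
bounded = exponential_decay(20, 1.0, transition_steps=10, decay_rate=0.5, end_value=0.3)
delayed = exponential_decay(3, 1.0, transition_steps=10, decay_rate=0.5, transition_begin=5)
```

At step 15 the continuous variant has already decayed past 0.5, while the staircase variant still sits at 0.5 until step 20.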

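Join schedules

optax.schedules.join_schedules(schedules: Sequence[base.Schedule], boundaries: Sequence[int]) → base.Schedule

Sequentially applies multiple schedules.

Parameters:

schedules – A list of callables (expected to be optax schedules). Each schedule will receive a step count indicating the number of steps since the previous boundary transition.
boundaries – A list of integers (of length one less than schedules) that indicate when to transition between schedules.

Returns:

A function that maps step counts to values.

Return type:

Schedule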
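
The following plain-Python sketch illustrates how the step count is re-based at each boundary, so that every schedule sees the number of steps since the previous transition. It is not the library implementation. Switching to the next schedule exactly when step >= boundary is an assumption, and the three toy schedules are hypothetical.

```python
def join_schedules(schedules, boundaries):
    """Plain-Python sketch: chain schedules, re-basing the step at each boundary."""
    def schedule(step):
        index, offset = 0, 0
        for i, boundary in enumerate(boundaries):
            # Assumption: the next schedule takes over once step >= boundary.
            if step >= boundary:
                index, offset = i + 1, boundary
        return schedules[index](step - offset)
    return schedule


def warmup(step):
    return 0.1 * step / 10          # linear ramp to 0.1 over 10 steps


def constant(step):
    return 0.1


def decay(step):
    return 0.1 * 0.5 ** (step / 100)  # halves every 100 steps


joined = join_schedules([warmup, constant, decay], boundaries=[10, 50])
values = [joined(s) for s in (5, 10, 50, 150)]
```

Note that decay receives 100 rather than 150 at global step 150, because its step count restarts at the boundary 50.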

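Inject hyperparameters

optax.schedules.inject_hyperparams(inner_factory: Callable[..., base.GradientTransformation], static_args: str | Iterable[str] = (), hyperparam_dtype: jnp.dtype | None = None) → Callable[..., base.GradientTransformationExtraArgs]

Wrapper to inject stateful hyperparameters into GradientTransformations.

This wrapper allows you to pass schedules (i.e. a function that returns a numeric value given a step count) instead of constants for hyperparameters. You may only schedule numeric hyperparameters (i.e. boolean flags cannot be scheduled).

This function supports both simple schedules that depend only on the step count and stateful schedules that depend on more complex internal state. The state update can depend on additional information passed to the gradient transformation via extra_args.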

For example, to use optax.scale_by_adam() with a piecewise linear schedule for beta_1 and a constant for beta_2:

>>> import optax
>>> import jax.numpy as jnp
>>> # create a schedule for b1 that increases linearly from 0.1 to 0.9
... # over 100 iterations (the scale at a boundary multiplies init_value)
>>> linear_schedule = optax.piecewise_interpolate_schedule(
...    'linear', init_value=0.1, boundaries_and_scales={100: 9.})
>>> scheduled_adam = optax.inject_hyperparams(optax.scale_by_adam)(
...     b1=linear_schedule, b2=0.99)

You may manually change numeric hyperparameters that were not scheduled through the hyperparams dict in the InjectHyperparamState:

>>> params, grads = jnp.array(0.), jnp.array(0.)
>>> state = scheduled_adam.init(params)
>>> updates, state = scheduled_adam.update(grads, state)
>>> state.hyperparams['b2'] = 0.95
>>> updates, state = scheduled_adam.update(updates, state)  # uses b2 = 0.95

Manually overriding scheduled hyperparameters will have no effect (e.g. in the code sample above, you cannot manually adjust b1).

Parameters:

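inner_factory – a function that returns the inner optax.GradientTransformation with dynamic hyperparameters.
static_args – a string or iterable of strings specifying which callable parameters are not schedules. inject_hyperparams treats all callables as schedules by default, so if a hyperparameter is a non-schedule callable, you must specify that using this argument.
hyperparam_dtype – optional datatype override. If specified, all float hyperparameters will be cast to this type.

Returns:

A callable that returns an optax.GradientTransformationExtraArgs. This callable accepts the same arguments as inner_factory, except that you may provide schedules in place of the constant arguments.

Changed in version 0.1.9: New parameter hyperparam_dtype; the returned callable outputs a GradientTransformationExtraArgs instead of a GradientTransformation.

class optax.schedules.InjectHyperparamsState(count: jnp.ndarray, hyperparams: dict[str, chex.Numeric], inner_state: base.OptState): Deprecated class kept for backwards compatibility.

Deprecated since version 0.1.9: Use InjectStatefulHyperparamsState instead.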

Linear schedules

optax.schedules.linear_onecycle_schedule(transition_steps: int, peak_value: float, pct_start: float = 0.3, pct_final: float = 0.85, div_factor: float = 25.0, final_div_factor: float = 10000.0) → base.Schedule

Returns a learning rate with three linear phases.

Phase 1, from iteration 0 to pct_start * transition_steps. The learning rate increases linearly from peak_value / div_factor to peak_value.
Phase 2, from iteration pct_start * transition_steps to pct_final * transition_steps. The learning rate decreases linearly from peak_value back to the initial peak_value/div_factor.
Phase 3: For the remaining steps, the learning rate interpolates between peak_value/div_factor and peak_value / final_div_factor. If final_div_factor is larger than div_factor, this is a decreasing phase.

Parameters:

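transition_steps – number of steps over which annealing takes place.
peak_value – maximum value attained by the schedule at pct_start of the cycle (in number of steps).
pct_start – the fraction of the cycle (in number of steps) spent increasing the learning rate.
pct_final – the fraction of the cycle (in number of steps) spent increasing to peak_value and then decreasing back to init_value.
div_factor – determines the initial value via init_value = peak_value / div_factor.
final_div_factor – determines the final value via final_value = init_value / final_div_factor.

Returns:

schedule: A function that maps step counts to values.

References

Smith et al., Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates, 2017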

optax.schedules.linear_schedule(init_value: chex.Scalar, end_value: chex.Scalar, transition_steps: int, transition_begin: int = 0) → base.Schedule

Schedule with linear transition from init_value to end_value.

More precisely, the learning rate at iteration \(t\) is given by:

\[\begin{cases} I, & \text{if } t < B \\ I + \frac{t - B}{T} (E - I), & \text{if } B \leq t < B + T \\ E, & \text{if } t \geq B + T \end{cases} \]

where \(I\) is the initial value, \(E\) is the end value, \(B\) is the transition begin, and \(T\) is the number of transition steps.

This schedule is equivalent to optax.polynomial_schedule() with power=1.

Parameters:

init_value – initial value for the scalar to be annealed.
end_value – end value of the scalar to be annealed.
transition_steps – number of steps over which annealing takes place. The scalar starts changing at transition_begin steps and completes the transition by transition_begin + transition_steps steps. If transition_steps <= 0, then the entire annealing process is disabled and the value is held fixed at init_value.
transition_begin – must be non-negative. After how many steps to start annealing (before then, the scalar value is held fixed at init_value).

Returns:

schedule: A function that maps step counts to values.

Examples

>>> schedule_fn = optax.linear_schedule(
...    init_value=1.0, end_value=0.01, transition_steps=100)
>>> schedule_fn(0)  # learning rate on the first iteration
Array(1., dtype=float32, weak_type=True)
>>> schedule_fn(100)  # learning rate on the last iteration
Array(0.01, dtype=float32, weak_type=True)

Piecewise schedules

optax.schedules.piecewise_constant_schedule(init_value: float, boundaries_and_scales: dict[int, float] | None = None) → base.Schedule

Returns a function which implements a piecewise constant schedule.

Parameters:

init_value – An initial value init_v.
boundaries_and_scales – A map from boundaries b_i to non-negative scaling factors f_i. For any step count s, the schedule returns init_v scaled by the product of all factors f_i such that b_i < s.

Returns:

schedule: A function that maps step counts to values.
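
The rule in the boundaries_and_scales description can be sketched in a few lines of plain Python: multiply init_v by every factor whose boundary lies strictly before the step. This is an illustrative helper, not the library code, and the boundaries and factors below are hypothetical.

```python
def piecewise_constant(step, init_value, boundaries_and_scales=None):
    """Plain-Python sketch of a piecewise constant schedule (illustrative only)."""
    value = init_value
    for boundary, scale in (boundaries_and_scales or {}).items():
        # Strict inequality, as in the parameter description: b_i < s.
        if boundary < step:
            value *= scale
    return value


# Hypothetical drops: x0.1 after step 100, then x0.5 after step 200.
drops = {100: 0.1, 200: 0.5}
at_boundary = piecewise_constant(100, 1.0, drops)
after_first = piecewise_constant(101, 1.0, drops)
after_second = piecewise_constant(201, 1.0, drops)
```

Because the comparison is strict, the value at step 100 is still the initial value. The first drop only shows up from step 101 onwards.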

optax.schedules.piecewise_interpolate_schedule(interpolate_type: str, init_value: float, boundaries_and_scales: dict[int, float] | None = None) → base.Schedule

Returns a function which implements a piecewise interpolated schedule.

Parameters:

interpolate_type – 'linear' or 'cosine', specifying the interpolation strategy.
init_value – An initial value init_v.
boundaries_and_scales – A map from boundaries b_i to non-negative scaling factors f_i. At boundary step b_i, the schedule returns init_v scaled by the product of all factors f_j such that b_j <= b_i. The values in between each boundary will be interpolated as per type.

Returns:

schedule: A function that maps step counts to values.
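
For the 'linear' case, the behaviour described above can be sketched in plain Python by building the value at each boundary as a running product of the factors and interpolating between neighbouring boundaries. This is only a sketch. It assumes the schedule starts at init_v at step 0 and holds the last value after the final boundary. The helper name and sample numbers are hypothetical.

```python
def piecewise_linear(step, init_value, boundaries_and_scales):
    """Plain-Python sketch of linear piecewise interpolation (illustrative only)."""
    # Knots: step 0 holds init_value (assumption); each boundary holds the
    # running product of all scales up to and including that boundary.
    knots = [(0, init_value)]
    for boundary in sorted(boundaries_and_scales):
        knots.append((boundary, knots[-1][1] * boundaries_and_scales[boundary]))
    step = max(step, 0)
    if step >= knots[-1][0]:
        # Assumption: the value is held constant after the last boundary.
        return knots[-1][1]
    for (b0, v0), (b1, v1) in zip(knots, knots[1:]):
        if b0 <= step < b1:
            return v0 + (v1 - v0) * (step - b0) / (b1 - b0)


# Hypothetical knots: 1.0 at step 0, 0.5 at step 100, 0.05 at step 300.
scales = {100: 0.5, 300: 0.1}
samples = [piecewise_linear(s, 1.0, scales) for s in (50, 100, 200, 400)]
```

The factors compound: the value at step 300 is 1.0 * 0.5 * 0.1, not 1.0 * 0.1.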

Polynomial schedules

optax.schedules.polynomial_schedule(init_value: chex.Scalar, end_value: chex.Scalar, power: chex.Scalar, transition_steps: int, transition_begin: int = 0) → base.Schedule

Constructs a schedule with polynomial transition from init to end value.

This function transitions the learning rate from an initial value (init_value) to a final value (end_value) over a specified number of steps (transition_steps) with a polynomial function of power power. The transition can optionally begin after a specified number of initial steps (transition_begin).

More precisely, the learning rate at iteration \(t\) is given by:

\[\begin{cases} I, & \text{if } t < B \\ (I - E) \left( 1 - \frac{t - B}{T} \right)^{P} + E, & \text{if } B \leq t < B + T \\ E, & \text{if } t \geq B + T \end{cases} \]

where \(I\) is the initial value, \(E\) is the end value, \(B\) is the transition begin, \(T\) is the number of transition steps, and \(P\) is the power used for the polynomial transition.

Parameters:

init_value – initial value for the scalar to be annealed.
end_value – end value of the scalar to be annealed.
power – the power of the polynomial used to transition from init to end.
transition_steps – number of steps over which annealing takes place. The scalar starts changing at transition_begin steps and completes the transition by transition_begin + transition_steps steps. If transition_steps <= 0, then the entire annealing process is disabled and the value is held fixed at init_value.
transition_begin – must be non-negative. After how many steps to start annealing (before then, the scalar value is held fixed at init_value).

Returns:

schedule: A function that maps step counts to values.

Examples

>>> schedule_fn = optax.polynomial_schedule(
...    init_value=1.0, end_value=0.01, transition_steps=100, power=2)
>>> schedule_fn(0)  # learning rate on the first iteration
Array(1., dtype=float32, weak_type=True)
>>> schedule_fn(100)  # learning rate on the last iteration
Array(0.01, dtype=float32, weak_type=True)

The following example uses a non-zero transition_begin. In this case the learning rate is kept constant for the first transition_begin iterations:

>>> schedule_fn = optax.polynomial_schedule(
...    init_value=1.0,
...    end_value=0.01,
...    transition_steps=100,
...    transition_begin=5,
...    power=2,
... )
>>> counts = [0, 5, 6, 104, 105, 110]
>>> print(
...    *[f'count:{i} value:{schedule_fn(i):.4f}' for i in counts],
...    sep='\n')
count:0 value:1.0000
count:5 value:1.0000
count:6 value:0.9803
count:104 value:0.0101
count:105 value:0.0100
count:110 value:0.0100

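Reduce on plateau

optax.contrib.reduce_on_plateau(factor: float = 0.1, patience: int = 10, rtol: float = 0.0001, atol: float = 0.0, cooldown: int = 0, accumulation_size: int = 1, min_scale: float = 0.0) → base.GradientTransformationExtraArgs

Reduce learning rate when a metric has stopped improving.

Models often benefit from reducing the learning rate once learning stagnates. This scheduler reads a metric quantity, and if no improvement is seen for a patience number of epochs, the learning rate is reduced.

Parameters:

factor – Factor by which to reduce the learning rate. new_scale = scale * factor.
patience – Number of iterations with no improvement after which the learning rate will be reduced.
rtol – Relative tolerance for measuring the new optimum.
atol – Absolute tolerance for measuring the new optimum.
cooldown – Number of iterations to wait before resuming normal operation after the scale has been reduced.
accumulation_size – Number of values to aggregate before applying the reduce-on-plateau logic. If the value fed to the optimizer is a test value, simply take 1 (default). If the value fed to the optimizer is the loss on the current minibatch, consider using a larger accumulation size.
min_scale – Scale at which the learning rate decay stops.

Returns:

An optax.GradientTransformationExtraArgs object.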
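
To illustrate how patience, factor, cooldown and min_scale interact, here is a simplified plain-Python sketch of a plateau rule. It is not the optax implementation. The class name PlateauScale is hypothetical, the improvement test (beating the best value by the relative and absolute tolerances) is an assumption, and accumulation_size is left out.

```python
class PlateauScale:
    """Simplified, stateful sketch of a reduce-on-plateau rule (illustrative)."""

    def __init__(self, factor=0.1, patience=10, rtol=1e-4, atol=0.0,
                 cooldown=0, min_scale=0.0):
        self.factor, self.patience = factor, patience
        self.rtol, self.atol = rtol, atol
        self.cooldown, self.min_scale = cooldown, min_scale
        self.scale = 1.0
        self.best = float("inf")
        self.bad_steps = 0       # consecutive values without improvement
        self.cooldown_left = 0   # values still to skip after a reduction

    def update(self, value):
        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            return self.scale
        # Assumed improvement test: beat the best value by the tolerances.
        if value < self.best * (1.0 - self.rtol) - self.atol:
            self.best, self.bad_steps = value, 0
        else:
            self.bad_steps += 1
        if self.bad_steps >= self.patience:
            # new_scale = scale * factor, but never below min_scale.
            self.scale = max(self.scale * self.factor, self.min_scale)
            self.bad_steps = 0
            self.cooldown_left = self.cooldown
        return self.scale


# Hypothetical loss trace that stalls at 0.9.
plateau = PlateauScale(factor=0.5, patience=2)
scales = [plateau.update(loss) for loss in (1.0, 0.9, 0.9, 0.9, 0.9, 0.9)]
```

In this trace the scale is halved after every two consecutive values that fail to improve on the best loss seen so far.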

See also

The Reduce on Plateau learning rate scheduler example.

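Warm-up schedules

optax.schedules.warmup_constant_schedule(init_value: float, peak_value: float, warmup_steps: int) → base.Schedule

Linear warmup followed by a constant schedule, i.e. no decay.

Parameters:

init_value – Initial value for the scalar to be annealed.
peak_value – Peak value for the scalar to be annealed at the end of warmup.
warmup_steps – Positive integer, the length of the linear warmup.

Returns:

schedule: A function that maps step counts to values.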

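optax.schedules.warmup_cosine_decay_schedule(init_value: float, peak_value: float, warmup_steps: int, decay_steps: int, end_value: float = 0.0, exponent: float = 1.0) → base.Schedule

Linear warmup followed by cosine decay.

Parameters:

init_value – Initial value for the scalar to be annealed.
peak_value – Peak value for the scalar to be annealed at the end of warmup.
warmup_steps – Positive integer, the length of the linear warmup.
decay_steps – Positive integer, the total length of the schedule. Note that this includes the warmup time, so the number of steps during which cosine annealing is applied is decay_steps - warmup_steps.
end_value – End value of the scalar to be annealed.
exponent – The default decay is 0.5 * (1 + cos(pi * t/T)), where t is the current timestep and T is decay_steps. The exponent modifies this to be (0.5 * (1 + cos(pi * t/T))) ** exponent. Defaults to 1.0.

Returns:

schedule: A function that maps step counts to values.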
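
The interplay between warmup_steps and decay_steps is easy to get wrong, so here is a plain-Python sketch of the shape described above: a linear ramp for warmup_steps, then a cosine decay over the remaining decay_steps - warmup_steps. It is an illustrative reimplementation under that reading, not the library code, and the sample values are hypothetical.

```python
import math


def warmup_cosine_decay(step, init_value, peak_value, warmup_steps,
                        decay_steps, end_value=0.0, exponent=1.0):
    """Plain-Python sketch of linear warmup plus cosine decay (illustrative)."""
    if step < warmup_steps:
        # Linear ramp from init_value to peak_value.
        return init_value + (peak_value - init_value) * step / warmup_steps
    # decay_steps is the total length, so the cosine part is what remains.
    cosine_steps = decay_steps - warmup_steps
    t = min(step - warmup_steps, cosine_steps)
    cosine = (0.5 * (1.0 + math.cos(math.pi * t / cosine_steps))) ** exponent
    return end_value + (peak_value - end_value) * cosine


# Hypothetical run: 10 warmup steps, 110 steps in total, ending at 0.1.
kwargs = dict(init_value=0.0, peak_value=1.0, warmup_steps=10,
              decay_steps=110, end_value=0.1)
trace = [warmup_cosine_decay(s, **kwargs) for s in (5, 10, 60, 110, 1000)]
```

The peak is reached at step 10, the halfway point of the cosine phase is step 60, and the value stays at end_value from step 110 onwards.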

optax.schedules.warmup_exponential_decay_schedule(init_value: float, peak_value: float, warmup_steps: int, transition_steps: int, decay_rate: float, transition_begin: int = 0, staircase: bool = False, end_value: float | None = None) → base.Schedule

Linear warmup followed by exponential decay.

Parameters:

init_value – Initial value for the scalar to be annealed.
peak_value – Peak value for the scalar to be annealed at the end of warmup.
warmup_steps – Positive integer, the length of the linear warmup.
transition_steps – must be positive. See optax.exponential_decay() for more details.
decay_rate – must not be zero. The decay rate.
transition_begin – must be non-negative. After how many steps to start annealing (before then, the scalar value is held fixed at peak_value).
staircase – if True, decay the values at discrete intervals.
end_value – the value at which the exponential decay stops. When decay_rate < 1, end_value is treated as a lower bound, otherwise as an upper bound. Has no effect when decay_rate = 0.

Returns:

schedule: A function that maps step counts to values.

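Warm restarts

optax.schedules.sgdr_schedule(cosine_kwargs: Iterable[dict[str, chex.Numeric]]) → base.Schedule

SGD with warm restarts.

This learning rate schedule applies multiple joined cosine decay cycles.

Parameters:

cosine_kwargs – An Iterable of dicts, where each element specifies the arguments to pass to each cosine decay cycle. The decay_steps kwarg specifies how long each cycle lasts, and therefore when to transition to the next cycle.

Returns:

schedule: A function that maps step counts to values.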
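
Since each cycle's decay_steps determines where the next cycle begins, the restart points are the cumulative sums of the cycle lengths. The plain-Python sketch below locates the cycle a global step falls into and the step count within that cycle. The helper name locate_cycle is hypothetical. Staying in the last cycle once all cycles are exhausted is an assumption.

```python
def locate_cycle(step, cosine_kwargs):
    """Return (cycle index, step within that cycle) for a global step."""
    cycles = list(cosine_kwargs)
    start = 0
    for index, kwargs in enumerate(cycles):
        end = start + kwargs["decay_steps"]
        # Assumption: steps past the final boundary stay in the last cycle.
        if step < end or index == len(cycles) - 1:
            return index, step - start
        start = end


# Hypothetical cycles that double in length after every restart.
cycles = [{"decay_steps": 100}, {"decay_steps": 200}, {"decay_steps": 400}]
positions = [locate_cycle(s, cycles) for s in (0, 99, 100, 300, 1000)]
```

Here the restarts happen at global steps 100 and 300, where the step count seen by the cycle resets to 0.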

References

Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts, 2017
