xffl.learning.optim
Optimization utilities
Attributes
logger | Default xFFL logger
Classes
XFFLOptimizer | Creates the specified XFFL optimizer configured for FM pretraining.
Functions
_is_optimizer_accumulation_step_time | Returns True if gradient accumulation is complete and it is time to step, False otherwise.
warmup_cosine_decay | Creates a Warmup + Cosine Decay learning rate scheduler.
Module Contents
- xffl.learning.optim.logger: logging.Logger
Default xFFL logger
- xffl.learning.optim._is_optimizer_accumulation_step_time(gradient_accumulation: int | None, step: int, total_steps_per_epoch: int) -> bool
Returns True if gradient accumulation is complete and it is time to step, False otherwise.
- Parameters:
gradient_accumulation (Optional[int]) – Gradient accumulation steps
step (int) – Current training epoch step
total_steps_per_epoch (int) – Number of training steps per epoch
- Returns:
True if gradient accumulation is complete and it is time to step, False otherwise
- Return type:
bool
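The check described above can be sketched in plain Python. This is a minimal illustration, not the xFFL source: the function name `is_accumulation_step_time` is hypothetical, and the sketch assumes that `step` is zero-based within the epoch, that `None` (or a value of 1 or less) disables accumulation, and that the last step of an epoch flushes a partially filled accumulation window.

```python
from typing import Optional


def is_accumulation_step_time(
    gradient_accumulation: Optional[int], step: int, total_steps_per_epoch: int
) -> bool:
    """Hypothetical re-implementation for illustration, not the xFFL source."""
    # Assumption: without accumulation, every training step is an optimizer step.
    if gradient_accumulation is None or gradient_accumulation <= 1:
        return True
    # Assumption: `step` is zero-based, so step + 1 counts the steps done so far.
    window_full = (step + 1) % gradient_accumulation == 0
    # Assumption: the final step of the epoch flushes a partial window.
    last_step = step + 1 == total_steps_per_epoch
    return window_full or last_step


# Training steps (out of 10 per epoch) that would trigger an optimizer step
# when accumulating over 4 steps.
steps = [s for s in range(10) if is_accumulation_step_time(4, s, 10)]
print(steps)
```

Under these assumptions, an epoch of 10 training steps with an accumulation of 4 yields three optimizer steps, the last one covering only two micro-batches.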
- xffl.learning.optim.warmup_cosine_decay(optimizer: torch.optim.Optimizer | Mapping[torch.nn.Parameter, torch.optim.Optimizer], total_steps_per_epoch: int, epochs: int | None = None, gradient_accumulation: int | None = None, lr_scheduler_params: Mapping[str, Any] | None = None, config: xffl.custom.config.XFFLConfig | None = None) -> torch.optim.lr_scheduler.LRScheduler | Mapping[torch.nn.Parameter, torch.optim.lr_scheduler.LRScheduler]
Creates a Warmup + Cosine Decay learning rate scheduler.
The scheduler operates on optimizer steps and is compatible with gradient accumulation. Training steps are internally converted into optimizer steps.
- Learning rate schedule:
Linear warmup from 0 -> peak learning rate
Cosine decay from peak learning rate -> final learning rate
- Parameters:
optimizer (Optimizer|Mapping[nn.Parameter, Optimizer]) – Optimizer or mapping of parameters and optimizers
total_steps_per_epoch (int) – Number of training steps per epoch (before gradient accumulation)
epochs (int, optional) – Number of epochs to train, defaults to None
gradient_accumulation (Optional[int], optional) – Gradient accumulation steps, defaults to None
lr_scheduler_params (Optional[Mapping[str, Any]], optional) – Learning rate parameters, defaults to None
config (Optional[XFFLConfig], optional) – xFFL training configuration, defaults to None
- Returns:
Configured warmup + cosine decay scheduler
- Return type:
LRScheduler|Mapping[nn.Parameter, LRScheduler]
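The schedule shape and the conversion from training steps to optimizer steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the xFFL implementation: the helper names `optimizer_steps` and `lr_multiplier`, the `warmup_steps` argument, and the `final_ratio` default of 0.1 are all hypothetical, and the conversion assumes that a partial accumulation window at the end of each epoch still produces one optimizer step.

```python
import math
from typing import Optional


def optimizer_steps(
    total_steps_per_epoch: int, epochs: int, gradient_accumulation: Optional[int]
) -> int:
    """Convert training steps into optimizer steps (hypothetical helper)."""
    accumulation = gradient_accumulation or 1
    # Assumption: a partial window at the end of an epoch still triggers a step.
    return math.ceil(total_steps_per_epoch / accumulation) * epochs


def lr_multiplier(
    step: int, warmup_steps: int, total_steps: int, final_ratio: float = 0.1
) -> float:
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return step / max(1, warmup_steps)
    # Cosine decay from the peak to final_ratio * peak.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_ratio + (1.0 - final_ratio) * cosine


total = optimizer_steps(total_steps_per_epoch=100, epochs=3, gradient_accumulation=4)
warmup = 5  # hypothetical warmup length, in optimizer steps
schedule = [lr_multiplier(s, warmup, total) for s in range(total + 1)]
print(total, schedule[0], schedule[warmup], schedule[-1])
```

A multiplier function of this kind could be wrapped with `functools.partial` and passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which scales each parameter group's initial learning rate by the returned factor.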
- class xffl.learning.optim.XFFLOptimizer(model: torch.nn.Module | torch.distributed.fsdp.FullyShardedDataParallel, optimizer: Callable | None = None, optimizer_params: Mapping[str, Any] | None = None, gradient_clipping: float | None = None, gradient_accumulation: int | None = None, interleaved_optim: bool | None = None, lr_scheduler: Callable | None = None, total_steps_per_epoch: int = -1, scaler: torch.GradScaler | None = None, config: xffl.custom.config.XFFLConfig | None = None)
Creates the specified XFFL optimizer configured for FM pretraining.
- Parameters:
model (nn.Module | FullyShardedDataParallel) – Model to train
optimizer (Optional[Callable], optional) – Optimizer class, defaults to None
optimizer_params (Optional[Mapping[str, Any]], optional) – Optimizer parameters, defaults to None
gradient_clipping (Optional[float], optional) – Gradient clipping value, defaults to None
gradient_accumulation (Optional[int], optional) – Gradient accumulation steps, defaults to None
interleaved_optim (Optional[bool], optional) – Interleave the optimizer and backward phases, defaults to None
lr_scheduler (Optional[Callable], optional) – Learning rate scheduler, defaults to None
total_steps_per_epoch (int) – Number of training steps per epoch (before gradient accumulation), defaults to -1
scaler (Optional[GradScaler], optional) – Gradient scaler, if necessary, defaults to None
config (Optional[XFFLConfig], optional) – XFFL configuration, defaults to None
- Raises:
ValueError – If some configuration values are incompatible with their expected values
- model: torch.nn.Module | torch.distributed.fsdp.FullyShardedDataParallel
- optimizer_class: Callable = None
- optimizer: torch.optim.Optimizer | Mapping[torch.nn.Parameter, torch.optim.Optimizer] | None = None
- optimizer_params: Mapping[str, Any]
- interleaved_optim: bool = False
- gradient_clipping: float | None = None
- gradient_accumulation: int | None = None
- lr_scheduler_class: Callable | None = None
- lr_scheduler: torch.optim.lr_scheduler.LRScheduler | Mapping[torch.nn.Parameter, torch.optim.lr_scheduler.LRScheduler] | None = None
- total_steps_per_epoch: int = -1
- scaler: torch.GradScaler | None = None
- optimizer_step: int = 0
- training_step: int = 0
- _create_optimizer() -> None
Creates the specified optimizer configured for LLM pretraining.
- _register_interleaving() -> None
Registers the optimizer as a post-gradient hook for every parameter.
- _get_clip_fn() -> Callable
Returns the appropriate gradient clipping function.
- Returns:
Clipping function
- Return type:
Callable
- get_lr() -> float
Returns the current learning rate.
- Returns:
Current learning rate
- Return type:
float
- get_optimizer() -> torch.optim.Optimizer | Sequence[torch.optim.Optimizer]
Returns the optimizer(s).
- Returns:
Optimizer(s)
- Return type:
Optimizer|Sequence[Optimizer]
- zero_grad(set_to_none: bool = True) -> None
Resets the gradients of the XFFL optimizer(s).
- Parameters:
set_to_none (bool, optional) – Set the gradients to None instead of zero, defaults to True
- step(closure: Callable | None = None, set_to_none: bool = True) -> None
Performs a single XFFL optimization step.
- Parameters:
closure (Optional[Callable], optional) – A closure that reevaluates the model and returns the loss, defaults to None
set_to_none (bool, optional) – Set the gradients to None instead of zero, defaults to True
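To illustrate how the `training_step` and `optimizer_step` counters listed above might relate under gradient accumulation, here is a toy stand-in written in plain Python. It is not `XFFLOptimizer` and makes no claim about its internals: the class `ToyAccumulatingStepper` and its `apply_update` callback are hypothetical, and the sketch assumes that the update is applied only when an accumulation window is full or the epoch ends.

```python
from typing import Callable, List, Optional


class ToyAccumulatingStepper:
    """Toy stand-in for illustration only, not XFFLOptimizer."""

    def __init__(
        self,
        apply_update: Callable[[], None],
        gradient_accumulation: Optional[int],
        total_steps_per_epoch: int,
    ) -> None:
        self.apply_update = apply_update
        self.gradient_accumulation = gradient_accumulation or 1
        self.total_steps_per_epoch = total_steps_per_epoch
        self.training_step = 0  # incremented on every call to step()
        self.optimizer_step = 0  # incremented only when the update is applied

    def step(self) -> None:
        # Position of this call within the current epoch (zero-based).
        epoch_step = self.training_step % self.total_steps_per_epoch
        self.training_step += 1
        window_full = (epoch_step + 1) % self.gradient_accumulation == 0
        last_step = epoch_step + 1 == self.total_steps_per_epoch
        if window_full or last_step:
            self.apply_update()
            self.optimizer_step += 1


updates: List[int] = []
stepper = ToyAccumulatingStepper(
    apply_update=lambda: updates.append(1),
    gradient_accumulation=4,
    total_steps_per_epoch=10,
)
for _ in range(20):  # two epochs of 10 training steps each
    stepper.step()
print(stepper.training_step, stepper.optimizer_step)
```

In this toy, calling `step()` once per micro-batch keeps the training loop uniform while the wrapper decides when the real update happens.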