xffl.learning.optim

Optimization utilities

Attributes

logger

Default xFFL logger

Classes

XFFLOptimizer

Creates the specified XFFL optimizer configured for FM pretraining.

Functions

_is_optimizer_accumulation_step_time(→ bool)

Returns True if gradient accumulation is complete and it is time to step, False otherwise.

warmup_cosine_decay(...)

Creates a Warmup + Cosine Decay learning rate scheduler.

Module Contents

xffl.learning.optim.logger: logging.Logger

Default xFFL logger

xffl.learning.optim._is_optimizer_accumulation_step_time(gradient_accumulation: int | None, step: int, total_steps_per_epoch: int) → bool

Returns True if gradient accumulation is complete and it is time to step, False otherwise.

Parameters:
  • gradient_accumulation (Optional[int]) – Gradient accumulation steps

  • step (int) – Current training epoch step

  • total_steps_per_epoch (int) – Number of training steps per epoch

Returns:

True if gradient accumulation is complete and it is time to step, False otherwise

Return type:

bool
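
As an illustration of the kind of check described above, the following is a minimal sketch, not the xFFL implementation. It assumes that `step` is zero-based, that an optimizer step is due every `gradient_accumulation` training steps, and that the last training step of an epoch always triggers an update so that leftover gradients are not carried over. The function name `is_accumulation_boundary` is hypothetical.

```python
from typing import Optional


def is_accumulation_boundary(
    gradient_accumulation: Optional[int],
    step: int,
    total_steps_per_epoch: int,
) -> bool:
    """Hypothetical stand-in for the check described above (assumed semantics)."""
    # Assumption: without accumulation, every training step is an optimizer step.
    if gradient_accumulation is None or gradient_accumulation <= 1:
        return True
    # Assumption: `step` is zero-based, so step + 1 counts completed micro-batches.
    completed = step + 1
    # Step on every full accumulation window and on the final step of the epoch.
    return completed % gradient_accumulation == 0 or completed == total_steps_per_epoch


# With 10 steps per epoch and 4 accumulation steps, collect the steps that update.
boundaries = [s for s in range(10) if is_accumulation_boundary(4, s, 10)]
```

Under these assumptions the last window of an epoch may be shorter than `gradient_accumulation`; whether xFFL handles the remainder this way is not stated in this reference.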

xffl.learning.optim.warmup_cosine_decay(optimizer: torch.optim.Optimizer | Mapping[torch.nn.Parameter, torch.optim.Optimizer], total_steps_per_epoch: int, epochs: int | None = None, gradient_accumulation: int | None = None, lr_scheduler_params: Mapping[str, Any] | None = None, config: xffl.custom.config.XFFLConfig | None = None) → torch.optim.lr_scheduler.LRScheduler | Mapping[torch.nn.Parameter, torch.optim.lr_scheduler.LRScheduler]

Creates a Warmup + Cosine Decay learning rate scheduler.

The scheduler operates on optimizer steps and is compatible with gradient accumulation. Training steps are internally converted into optimizer steps.

Learning rate schedule:
  1. Linear warmup from 0 -> peak learning rate

  2. Cosine decay from peak learning rate -> final learning rate

Parameters:
  • optimizer (Optimizer|Mapping[nn.Parameter, Optimizer]) – Optimizer or mapping of parameters and optimizers

  • total_steps_per_epoch (int) – Number of training steps per epoch (before gradient accumulation)

  • epochs (int, optional) – Number of epochs to train, defaults to None

  • gradient_accumulation (Optional[int], optional) – Gradient accumulation steps, defaults to None

  • lr_scheduler_params (Optional[Mapping[str, Any]], optional) – Learning rate parameters, defaults to None

  • config (Optional[XFFLConfig], optional) – xFFL training configuration, defaults to None

Returns:

Configured warmup + cosine decay scheduler

Return type:

LRScheduler|Mapping[nn.Parameter, LRScheduler]
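
The schedule described above can be sketched in plain Python as a multiplier on the peak learning rate. This is an assumption-laden illustration rather than the xFFL code: the helper names `optimizer_steps` and `warmup_cosine_multiplier`, the `warmup_steps` and `final_ratio` parameters, and the per-epoch ceiling used to convert training steps into optimizer steps are all hypothetical, since this reference does not document the keys accepted in `lr_scheduler_params`.

```python
import math
from typing import Optional


def optimizer_steps(
    total_steps_per_epoch: int, epochs: int, gradient_accumulation: Optional[int]
) -> int:
    """Convert training steps into optimizer steps (assumed: ceiling per epoch)."""
    accumulation = gradient_accumulation or 1
    return math.ceil(total_steps_per_epoch / accumulation) * epochs


def warmup_cosine_multiplier(
    step: int, warmup_steps: int, total_steps: int, final_ratio: float = 0.1
) -> float:
    """Factor applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to the peak learning rate.
        return step / max(1, warmup_steps)
    # Phase 2: cosine decay from the peak down to final_ratio * peak.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_ratio + (1.0 - final_ratio) * cosine


# 100 training steps per epoch, 2 epochs, 4 accumulation steps.
total = optimizer_steps(total_steps_per_epoch=100, epochs=2, gradient_accumulation=4)
peak_lr = 3e-4
schedule = [
    peak_lr * warmup_cosine_multiplier(s, warmup_steps=5, total_steps=total)
    for s in range(total + 1)
]
```

A function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor at each scheduler step.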

class xffl.learning.optim.XFFLOptimizer(model: torch.nn.Module | torch.distributed.fsdp.FullyShardedDataParallel, optimizer: Callable | None = None, optimizer_params: Mapping[str, Any] | None = None, gradient_clipping: float | None = None, gradient_accumulation: int | None = None, interleaved_optim: bool | None = None, lr_scheduler: Callable | None = None, total_steps_per_epoch: int = -1, scaler: torch.GradScaler | None = None, config: xffl.custom.config.XFFLConfig | None = None)

Creates the specified XFFL optimizer configured for FM pretraining.

Parameters:
  • model (nn.Module | FullyShardedDataParallel) – Model to train

  • optimizer (Optional[Callable], optional) – Optimizer class, defaults to None

  • optimizer_params (Optional[Mapping[str, Any]], optional) – Optimizer parameters, defaults to None

  • gradient_clipping (Optional[float], optional) – Gradient clipping value, defaults to None

  • gradient_accumulation (Optional[int], optional) – Gradient accumulation steps, defaults to None

  • interleaved_optim (Optional[bool], optional) – Interleave the optimizer and backward phases, defaults to None

  • lr_scheduler (Optional[Callable], optional) – Learning rate scheduler class or factory, defaults to None

  • total_steps_per_epoch (int) – Number of training steps per epoch (before gradient accumulation), defaults to -1

  • scaler (Optional[GradScaler], optional) – Gradient scaler, if necessary, defaults to None

  • config (Optional[XFFLConfig], optional) – XFFL configuration, defaults to None

Raises:

ValueError – If some configuration values are incompatible with their expected values

model: torch.nn.Module | torch.distributed.fsdp.FullyShardedDataParallel
optimizer_class: Callable = None
optimizer: torch.optim.Optimizer | Mapping[torch.nn.Parameter, torch.optim.Optimizer] | None = None
optimizer_params: Mapping[str, Any]
interleaved_optim: bool = False
gradient_clipping: float | None = None
gradient_accumulation: int | None = None
lr_scheduler_class: Callable | None = None
lr_scheduler: torch.optim.lr_scheduler.LRScheduler | Mapping[torch.nn.Parameter, torch.optim.lr_scheduler.LRScheduler] | None = None
total_steps_per_epoch: int = -1
scaler: torch.GradScaler | None = None
optimizer_step: int = 0
training_step: int = 0

_create_optimizer() → None

Creates the specified optimizer configured for LLM pretraining.

_register_interleaving() → None

Register optimizer as a post-gradient hook for every parameter.

_get_clip_fn() → Callable

Get the right gradient clip function.

Returns:

Clipping function

Return type:

Callable

get_lr() → float

Returns the current learning rate.

Returns:

Current learning rate

Return type:

float

get_optimizer() → torch.optim.Optimizer | Sequence[torch.optim.Optimizer]

Returns the optimizer(s).

Returns:

Optimizer(s)

Return type:

Optimizer|Sequence[Optimizer]

zero_grad(set_to_none: bool = True) → None

Reset the XFFL optimizer's gradients.

Parameters:

set_to_none (bool, optional) – Set the gradients to None instead of zero, defaults to True

step(closure: Callable | None = None, set_to_none: bool = True) → None

Perform a single XFFL optimization step.

Parameters:
  • closure (Optional[Callable], optional) – A closure that reevaluates the model and returns the loss, defaults to None

  • set_to_none (bool, optional) – Set the gradients to None instead of zero, defaults to True