iBOT


class stable_pretraining.methods.iBOT(encoder_name: str | Module = 'vit_small_patch16_224', projector_hidden_dim: int = 2048, projector_bottleneck_dim: int = 256, n_cls_prototypes: int = 65536, n_patch_prototypes: int = 8192, mask_ratio: float = 0.3, patch_loss_weight: float = 1.0, temperature_student: float = 0.1, temperature_teacher_warmup: float = 0.04, temperature_teacher: float = 0.07, warmup_epochs_temperature_teacher: int = 30, ema_decay_start: float = 0.996, ema_decay_end: float = 1.0, image_size: int = 224, pretrained: bool = False)

Bases: Module

iBOT: DINO-style self-distillation on the CLS token combined with masked patch self-distillation.

Architecture:
  • Backbone: a timm ViT exposing forward_features, wrapped with an EMA teacher.

  • Two prototype heads: CLS head and patch head, both wrapped with EMA.

  • Loss: DINOv1 loss on the CLS token plus the iBOT patch loss on masked patch positions.
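
The loss structure listed above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (`log_softmax`, `distill_ce`, `ibot_loss`) are hypothetical, logits are plain lists of floats rather than tensors, and teacher centering and cross-view pairing are omitted for brevity. The temperatures and `patch_loss_weight` default to the class defaults documented here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distill_ce(teacher_logits, student_logits, t_teacher=0.07, t_student=0.1) -> float:
    """Cross-entropy between teacher and student prototype distributions."""
    # The teacher distribution is the target (no gradient flows through it in practice).
    target = [math.exp(v) for v in log_softmax(teacher_logits, t_teacher)]
    log_q = log_softmax(student_logits, t_student)
    return -sum(p * lq for p, lq in zip(target, log_q))


def ibot_loss(cls_teacher, cls_student, patch_teacher, patch_student, mask,
              patch_loss_weight=1.0) -> float:
    """CLS distillation loss plus patch loss averaged over masked positions only."""
    cls_loss = distill_ce(cls_teacher, cls_student)
    masked = [distill_ce(t, s)
              for t, s, m in zip(patch_teacher, patch_student, mask) if m]
    patch_loss = sum(masked) / len(masked) if masked else 0.0
    return cls_loss + patch_loss_weight * patch_loss
```

Only positions where `mask` is true contribute to the patch term, so student outputs at unmasked patches do not affect the loss in this sketch.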

Parameters:
  • encoder_name – Name of a timm ViT model, or a Module to use as the encoder (default "vit_small_patch16_224").

  • projector_hidden_dim – Hidden dim for both heads (default 2048).

  • projector_bottleneck_dim – Bottleneck dim before prototypes (default 256).

  • n_cls_prototypes – Number of CLS prototypes (default 65536).

  • n_patch_prototypes – Number of patch prototypes (default 8192).

  • mask_ratio – Patch masking ratio for the student (default 0.3).

  • patch_loss_weight – Weight on the patch loss term (default 1.0).

  • temperature_student – Student softmax temperature (default 0.1).

  • temperature_teacher_warmup – Teacher softmax temperature at the start of warmup (default 0.04).

  • temperature_teacher – Teacher softmax temperature after warmup (default 0.07).

  • warmup_epochs_temperature_teacher – Number of epochs over which the teacher temperature warms up (default 30).

  • ema_decay_start – Initial EMA decay for the backbone and heads (default 0.996).

  • ema_decay_end – Final EMA decay (default 1.0).

  • image_size – Input image size in pixels (default 224).

  • pretrained – Whether to load pretrained timm weights (default False).
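
To make the schedule parameters concrete, here is a minimal sketch of how the teacher temperature and EMA decay might evolve during training. The schedule shapes are assumptions (linear warmup for the temperature, cosine ramp for the EMA decay, as is common in DINO-style training); this page does not state which shapes the class uses. The function names are hypothetical, and parameters are plain lists of floats.

```python
import math
from typing import List, Sequence


def teacher_temperature(epoch: int, warmup_temp: float = 0.04,
                        final_temp: float = 0.07, warmup_epochs: int = 30) -> float:
    """Teacher temperature at a given epoch (linear warmup is an assumption)."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return final_temp
    return warmup_temp + (final_temp - warmup_temp) * epoch / warmup_epochs


def ema_decay(step: int, total_steps: int, start: float = 0.996,
              end: float = 1.0) -> float:
    """EMA decay at a given step (cosine ramp from start to end is an assumption)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return end - (end - start) * (math.cos(math.pi * progress) + 1) / 2


def ema_update(teacher: Sequence[float], student: Sequence[float],
               decay: float) -> List[float]:
    """Standard EMA rule: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]
```

With ema_decay_end set to 1.0, the teacher stops moving at the end of the schedule, since the update then keeps the teacher weights unchanged.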

forward(global_views: Sequence[Tensor] | None = None, images: Tensor | None = None) → iBOTOutput

Forward pass. Returns an iBOTOutput.

Parameters:
  • global_views – List of n_global tensors, each of shape [B, C, H, W].

  • images – Single batch of images, used for evaluation.
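
The two arguments suggest two call modes: a list of global views for training and a single batch for evaluation. The sketch below illustrates that dispatch on shape tuples only. It is a hypothetical helper, not part of the library, and the rule that exactly one argument must be supplied is an assumption rather than documented behavior.

```python
from typing import Optional, Sequence, Tuple

# Stand-in for a tensor's [B, C, H, W] shape; real calls would pass tensors.
Shape = Tuple[int, int, int, int]


def pick_forward_mode(global_views: Optional[Sequence[Shape]] = None,
                      images: Optional[Shape] = None) -> str:
    """Return "train" for a list of global views, "eval" for a single batch."""
    # Assumption: exactly one of the two arguments is provided.
    if (global_views is None) == (images is None):
        raise ValueError("pass exactly one of global_views or images")
    if images is not None:
        return "eval"
    if len(global_views) == 0:
        raise ValueError("global_views must contain at least one view")
    # Assumption: all global views share the same [B, C, H, W] shape.
    if any(shape != global_views[0] for shape in global_views):
        raise ValueError("global views must share one [B, C, H, W] shape")
    return "train"
```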