HFDataset
- class stable_pretraining.data.HFDataset(*args, transform=None, rename_columns=None, remove_columns=None, **kwargs)
Bases:
Create a HuggingFace dataset wrapper.
Automatically chooses map-style or streaming mode based on streaming=True/False in kwargs.
The returned object is either an HFMapDataset (subclass of torch.utils.data.Dataset) or an HFIterableDataset (subclass of torch.utils.data.IterableDataset), so PyTorch DataLoader and Lightning Trainer handle both correctly out of the box.
- Parameters:
  - *args – Positional arguments forwarded to datasets.load_dataset (typically the dataset name/path).
  - transform – Optional transform applied to every sample dict.
  - rename_columns – Optional {old: new} mapping of columns to rename.
  - remove_columns – Optional list of column names to drop.
  - **kwargs – Keyword arguments forwarded to datasets.load_dataset (e.g. split, streaming, data_dir).
- Returns:
An HFMapDataset or HFIterableDataset instance.
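The choice between the two return types can be pictured with a small stand-in. This is a minimal sketch, not the library's implementation: MapSketch, IterableSketch and make_dataset are hypothetical names, and a plain Python list replaces the call to datasets.load_dataset so the snippet runs without any dependency. It assumes only that streaming defaults to False when the keyword is absent.

```python
from typing import Any, Dict, Iterator, List, Union

Sample = Dict[str, Any]


class MapSketch:
    """Hypothetical stand-in for a map-style dataset: supports len() and indexing."""

    def __init__(self, rows: List[Sample]) -> None:
        self.rows = rows

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Sample:
        return self.rows[index]


class IterableSketch:
    """Hypothetical stand-in for a streaming dataset: iteration only, no len()."""

    def __init__(self, rows: List[Sample]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[Sample]:
        return iter(self.rows)


def make_dataset(rows: List[Sample], **kwargs: Any) -> Union[MapSketch, IterableSketch]:
    """Pick the wrapper type from the streaming keyword (assumed default: False)."""
    if kwargs.get("streaming", False):
        return IterableSketch(rows)
    return MapSketch(rows)


# Toy rows standing in for a loaded dataset.
rows = [{"image": "a.png", "label": 0}, {"image": "b.png", "label": 1}]
map_ds = make_dataset(rows, split="train")
stream_ds = make_dataset(rows, split="train", streaming=True)
```

The practical consequence is the one the example below relies on: len() and indexing are available only on the map-style object, while the streaming object is consumed by iterating over it.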
Example:
    from stable_pretraining.data import HFDataset

    # Map-style
    ds = HFDataset("imagenet-1k", split="train")
    print(len(ds))  # works

    # Streaming
    ds = HFDataset("imagenet-1k", split="train", streaming=True)
    ds.shuffle(seed=42, buffer_size=10_000)
    for sample in ds:
        ...
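To make the transform, rename_columns and remove_columns parameters concrete, here is a minimal sketch of what they could do to a single sample dict. The helper prepare_sample is hypothetical and is not part of stable_pretraining; in particular, the order used here (rename, then remove, then transform) is an assumption, since the documentation above does not state the order in which the wrapper applies them.

```python
from typing import Any, Callable, Dict, Iterable, Mapping, Optional

Sample = Dict[str, Any]


def prepare_sample(
    sample: Sample,
    transform: Optional[Callable[[Sample], Sample]] = None,
    rename_columns: Optional[Mapping[str, str]] = None,
    remove_columns: Optional[Iterable[str]] = None,
) -> Sample:
    """Hypothetical helper: rename, then drop, then transform (assumed order)."""
    out = dict(sample)  # shallow copy so the caller's dict is left untouched
    for old, new in (rename_columns or {}).items():
        if old in out:
            out[new] = out.pop(old)
    for name in remove_columns or ():
        out.pop(name, None)  # ignore names that are not present
    return transform(out) if transform is not None else out


raw = {"img": "a.png", "label": 3, "id": 17}
prepared = prepare_sample(
    raw,
    transform=lambda s: {**s, "label": s["label"] + 1},
    rename_columns={"img": "image"},
    remove_columns=["id"],
)
```

Because the transform runs last in this sketch, it sees the renamed keys (image rather than img); if the real wrapper uses a different order, the transform would need to refer to the original column names instead.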