stable_datasets package
Subpackages
- stable_datasets.backends package
- Submodules
- stable_datasets.backends.arrow_shards module
- stable_datasets.backends.lance_rows module
- stable_datasets.backends.lance_video_frames module
  - LanceVideoFramesBackend: cache_dir, get_row(), is_file_backed, iter_batches(), num_rows, num_shards, prefer_batched_take, schema, segment_filename(), segment_filenames, segment_info(), slice(), table, take(), video_paths, worker_init()
  - reset_worker_state()
- stable_datasets.backends.protocol module
- Module contents
  - ArrowBackend
  - LanceBackend
  - LanceVideoFramesBackend (same members as listed under stable_datasets.backends.lance_video_frames)
  - StorageBackend
- stable_datasets.features package
- stable_datasets.images package
- Submodules
- stable_datasets.images.arabic_characters module
- stable_datasets.images.arabic_digits module
- stable_datasets.images.awa2 module
- stable_datasets.images.beans module
- stable_datasets.images.cars196 module
- stable_datasets.images.cars3d module
- stable_datasets.images.cassava module
- stable_datasets.images.celeb_a module
- stable_datasets.images.cifar10 module
- stable_datasets.images.cifar100 module
- stable_datasets.images.cifar100_c module
- stable_datasets.images.cifar10_c module
- stable_datasets.images.clevrer module
- stable_datasets.images.country211 module
- stable_datasets.images.cub200 module
- stable_datasets.images.dsprites module
- stable_datasets.images.dsprites_color module
- stable_datasets.images.dsprites_noise module
- stable_datasets.images.dsprites_scream module
- stable_datasets.images.dtd module
- stable_datasets.images.e_mnist module
- stable_datasets.images.face_pointing module
- stable_datasets.images.fashion_mnist module
- stable_datasets.images.fgvc_aircraft module
- stable_datasets.images.flowers102 module
- stable_datasets.images.food101 module
- stable_datasets.images.galaxy10 module
- stable_datasets.images.hasy_v2 module
- stable_datasets.images.imagenet_10 module
- stable_datasets.images.imagenet_100 module
- stable_datasets.images.imagenet_1k module
- stable_datasets.images.imagenette module
- stable_datasets.images.k_mnist module
- stable_datasets.images.linnaeus5 module
- stable_datasets.images.med_mnist module
- stable_datasets.images.mnist module
- stable_datasets.images.not_mnist module
- stable_datasets.images.patch_camelyon module
- stable_datasets.images.places365_small module
- stable_datasets.images.rock_paper_scissor module
- stable_datasets.images.shapes3d module
- stable_datasets.images.small_norb module
- stable_datasets.images.stl10 module
- stable_datasets.images.svhn module
- stable_datasets.images.tiny_imagenet module
- stable_datasets.images.tiny_imagenet_c module
- Module contents
  AWA2, ArabicCharacters, ArabicDigits, Beans, CIFAR10, CIFAR100, CIFAR100C, CIFAR10C, CLEVRER, CUB200, Cars196, Cars3D, Country211, DSprites, DSpritesColor, DSpritesNoise, DSpritesScream, DTD, EMNIST, FGVCAircraft, FacePointing, FashionMNIST, Flowers102, Food101, Galaxy10Decal, HASYv2, ImageNet100, ImageNet1K, Imagenette, KMNIST, Linnaeus5, MedMNIST, NotMNIST, RockPaperScissor, STL10, SVHN, Shapes3D, SmallNORB, TinyImagenet, TinyImagenetC
- stable_datasets.timeseries package
- Submodules
- stable_datasets.timeseries.CatsDogs module
- stable_datasets.timeseries.JapaneseVowels module
- stable_datasets.timeseries.MosquitoSound module
- stable_datasets.timeseries.Phoneme module
- stable_datasets.timeseries.RightWhaleCalls module
- stable_datasets.timeseries.TUTacousticscenes2017 module
- stable_datasets.timeseries.UCR_multivariate module
- stable_datasets.timeseries.UCR_univariate module
- stable_datasets.timeseries.UrbanSound module
- stable_datasets.timeseries.VoiceGenderDetection module
- stable_datasets.timeseries.audiomnist module
- stable_datasets.timeseries.birdvox_70k module
- stable_datasets.timeseries.birdvox_dcase_20k module
- stable_datasets.timeseries.brain_mnist module
- stable_datasets.timeseries.dcase_2019_task4 module
- stable_datasets.timeseries.dclde module
- stable_datasets.timeseries.esc module
- stable_datasets.timeseries.freefield1010 module
- stable_datasets.timeseries.fsd_kaggle_2018 module
- stable_datasets.timeseries.groove_MIDI module
- stable_datasets.timeseries.gtzan module
- stable_datasets.timeseries.high_gamma module
- stable_datasets.timeseries.irmas module
- stable_datasets.timeseries.picidae module
- stable_datasets.timeseries.seizures_neonatal module
- stable_datasets.timeseries.sonycust module
- stable_datasets.timeseries.speech_commands module
- stable_datasets.timeseries.vocalset module
- stable_datasets.timeseries.warblr module
- Module contents
- stable_datasets.video package
Submodules
stable_datasets.cache module
Generator-to-Arrow sharded caching pipeline.
Writes dataset examples to a directory of PyArrow IPC (Feather v2) shard files. Peak memory during writes is bounded to ~1 batch, and the sharded layout supports efficient sequential reads for training workloads.
- class CacheOpenResult(*, backend, num_rows: int, layout: str, metadata)[source]
Bases: object
Result of opening cache metadata into a backend.
- class LanceCacheMeta(cache_dir: Path, num_rows: int, schema_fingerprint: str)[source]
Bases: object
Lightweight descriptor for a Lance-format cache on disk.
- class ShardedCacheMeta(cache_dir: Path, num_rows: int, num_shards: int, shard_filenames: list[str], shard_row_counts: list[int], schema_fingerprint: str, compression: str | None = None)[source]
Bases: object
Lightweight descriptor for a sharded Arrow cache on disk.
- cache_fingerprint(cls_name: str, version: str, config_name: str, split: str, storage_format: str = 'arrow') str[source]
Deterministic cache directory name for a dataset variant + split.
storage_format is always included in the hash so Arrow and Lance caches for the same dataset coexist at different paths.
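A minimal sketch of how such a deterministic directory name could be derived. The hash function (SHA-256), the field separator, and the 16-character truncation are assumptions made for illustration, not the library's actual scheme, and `sketch_cache_fingerprint` is a hypothetical name.

```python
import hashlib


def sketch_cache_fingerprint(cls_name, version, config_name, split, storage_format="arrow"):
    """Hypothetical stand-in: hash every identifying field, storage_format included."""
    # A control character as separator keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join([cls_name, version, config_name, split, storage_format])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


arrow_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train", "arrow")
lance_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train", "lance")
```

Because the storage format is part of the hashed key, `arrow_dir` and `lance_dir` differ, while repeated calls with the same arguments always produce the same name.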
- detect_cache_format(cache_dir: Path) str[source]
Return "arrow" or "lance" based on the cache's metadata.
- detect_cache_layout(cache_dir: Path) str[source]
Return the physical cache layout recorded in cache metadata.
- encode_example(example: dict, features: Features, *, cache_dir: Path | None = None) dict[source]
Encode a single example dict into Arrow-compatible values.
- open_cache(cache_dir: Path, features: Features, *, backend_kwargs: dict | None = None) CacheOpenResult[source]
Open a cache directory and return the backend selected by its layout.
- read_lance_cache_meta(cache_dir: Path) LanceCacheMeta[source]
Read metadata from a Lance-format cache directory.
Returns a LanceCacheMeta with the cached row count and schema fingerprint populated from _metadata.json. Deliberately does NOT open the underlying Lance dataset: that would initialize Lance's tokio runtime in the caller's process, which is a DataLoader-fork footgun. Row count comes from the metadata file, not from ds.count_rows().
- read_shard(shard_path: Path) Table[source]
Memory-map a single shard file and return its table.
- read_sharded_cache_meta(cache_dir: Path) ShardedCacheMeta[source]
Read metadata from a sharded cache directory.
Validates that all shard files and metadata exist and are internally consistent. Raises ValueError on corruption.
- validate_sharded_cache(cache_dir: Path, features: Features) ShardedCacheMeta[source]
Read and validate a sharded cache, checking the schema fingerprint.
Raises ValueError if the cache is inconsistent or the schema has changed.
- write_lance_cache(generator, features: Features, cache_dir: Path, *, batch_size: int = 1000, num_encode_workers: int = 0, lineage: dict | None = None) LanceCacheMeta[source]
Consume a generator and write directly to a Lance dataset.
Mirrors write_sharded_arrow_cache() in shape (same generator contract, same features, same encode pipeline, same atomic-publish semantics) but writes a Lance dataset via lance.write_dataset instead of Arrow IPC shards. No intermediate Arrow IPC file is produced: the encoded rows stream into Lance via a pa.RecordBatchReader, so the native Lance write path is used end-to-end.
Writing is atomic: Lance writes to a temporary directory next to cache_dir and the directory is renamed on success. The completed cache directory contains:
- Lance dataset files (_versions/, data/, manifest)
- _metadata.json: row count, format marker, schema fingerprint
- Parameters:
batch_size (int) – Rows per pa.RecordBatch flushed to the Lance writer. Larger batches reduce per-call overhead; smaller batches reduce peak memory during writing.
num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool (same contract as the Arrow writer).
lineage (dict, optional) – Optional metadata blob written into _metadata.json.
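The write-to-temporary-then-rename pattern described above can be sketched with the standard library alone. This is an illustration of the general technique, not the library's code: `publish_atomically` and `write_demo` are hypothetical names, and the JSON fields are placeholders.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def publish_atomically(cache_dir, write_fn):
    """Write into a sibling temp directory, then rename it into place."""
    cache_dir = Path(cache_dir)
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    # Same parent directory means same filesystem, so the rename is one step.
    tmp_dir = Path(tempfile.mkdtemp(prefix=cache_dir.name + ".tmp-", dir=cache_dir.parent))
    try:
        write_fn(tmp_dir)
        os.rename(tmp_dir, cache_dir)
    except BaseException:
        # A failed write leaves no partially written cache behind.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise


def write_demo(tmp_dir):
    # Placeholder payload; a real writer would also emit the dataset files.
    (tmp_dir / "_metadata.json").write_text(json.dumps({"num_rows": 0, "format": "lance"}))


root = Path(tempfile.mkdtemp())
publish_atomically(root / "demo_cache", write_demo)
```

Readers either see no `demo_cache` directory at all or a complete one; an interrupted write only ever leaves (and then removes) the temporary sibling.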
- write_lance_video_frames_cache(generator, features: Features, cache_dir: Path, *, video_column: str | None = None, quality: int = 65, resize: int | None = None, workers: int | None = None, skip_corrupt: bool = True, lineage: dict | None = None) LanceCacheMeta[source]
Write a specialized Lance row-per-frame video cache.
Each input example contributes one source video. The physical Lance dataset stores one WebP-encoded frame per row. Segment sampling is a read-time concern handled by LanceVideoFramesBackend.
- write_sharded_arrow_cache(generator, features: Features, cache_dir: Path, *, shard_size_bytes: int = inf, batch_size: int = 1000, compression: str | None = None, num_encode_workers: int = 0, single_file: bool = False, lineage: dict | None = None) ShardedCacheMeta[source]
Consume a generator and write to a directory of Arrow IPC shards.
Batches are flushed every batch_size rows. After each flush the cumulative RecordBatch.nbytes for the current shard is checked; when it exceeds shard_size_bytes the shard is closed. The next shard is opened lazily when the next batch is ready, so there are never trailing empty shards.
Note: shard_size_bytes is an approximate target based on Arrow in-memory batch sizes, not exact on-disk file sizes. Actual shard files may be somewhat larger or smaller due to IPC framing, batch granularity, and compression differences.
An empty generator produces zero shards (num_shards == 0).
The completed cache directory contains:
- shard-NNNNN.arrow: zero or more IPC files
- _metadata.json: row counts, shard list, format version, schema fingerprint
Writing is atomic: shards are first written to a temporary directory and renamed into place on success.
- Parameters:
compression (str or None) – IPC buffer compression codec (e.g. "zstd", "lz4"). Decompression on read is automatic.
num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool.
Returns a ShardedCacheMeta describing the cache.
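The flush-and-roll-over rule for shards can be sketched without Arrow. In this illustration, `plan_shards` is a hypothetical helper, rows are plain `bytes` objects, and the summed byte length of a batch stands in for `RecordBatch.nbytes`; the real writer works on Arrow record batches.

```python
def plan_shards(rows, batch_size, shard_size_bytes):
    """Group rows into shards; each shard is a list of batches (lists of rows)."""
    shards = []        # closed shards
    current = None     # open shard, created lazily
    current_bytes = 0
    batch = []

    def flush(ready_batch):
        nonlocal current, current_bytes
        if current is None:
            # Open a shard only once a batch is actually ready to be written.
            current = []
            current_bytes = 0
        current.append(ready_batch)
        current_bytes += sum(len(row) for row in ready_batch)
        if current_bytes > shard_size_bytes:
            # Threshold exceeded: close the shard; the next one opens lazily.
            shards.append(current)
            current = None

    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)           # final partial batch
    if current is not None:
        shards.append(current)
    return shards


shards = plan_shards([b"x" * 10] * 7, batch_size=2, shard_size_bytes=30)
rows_per_shard = [sum(len(b) for b in shard) for shard in shards]
```

With seven 10-byte rows, batches of two, and a 30-byte target, the first shard closes after its second batch (40 bytes), so shards overshoot the target rather than stopping short of it, and an empty input yields no shards at all.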
stable_datasets.dataset module
Map-style dataset built on a pluggable storage backend.
Provides StableDataset (single split) and
StableDatasetDict (multi-split), exposing __len__,
__getitem__, __getitems__, __iter__, .features, and
.train_test_split().
Architecture: three layers with strict boundaries:
StorageBackend -> row access, iteration, pickling (returns Arrow types)
|
Formatter -> Arrow -> user type (PIL / torch / numpy / raw)
|
StableDataset -> orchestrates backend + formatter + indices + transform
StableDataset depends only on the StorageBackend
protocol, never on a concrete implementation or on-disk layout.
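The layering can be pictured with a toy, in-memory sketch. Everything here is hypothetical and only mirrors the responsibilities named above: `ListBackend`, `UpperFormatter`, and `TinyDataset` are illustrative stand-ins rather than classes from stable_datasets, and plain dicts take the place of Arrow values.

```python
class ListBackend:
    """Storage layer stand-in: row access only."""

    def __init__(self, rows):
        self._rows = list(rows)

    @property
    def num_rows(self):
        return len(self._rows)

    def get_row(self, index):
        return self._rows[index]


class UpperFormatter:
    """Formatting layer stand-in: converts stored values to a user-facing form."""

    def format_row(self, row):
        return {key: value.upper() if isinstance(value, str) else value
                for key, value in row.items()}


class TinyDataset:
    """Orchestration layer: backend + formatter + indices + transform."""

    def __init__(self, backend, formatter, indices=None, transform=None):
        self._backend = backend
        self._formatter = formatter
        self._indices = indices
        self._transform = transform

    def __len__(self):
        if self._indices is None:
            return self._backend.num_rows
        return len(self._indices)

    def __getitem__(self, i):
        physical = i if self._indices is None else self._indices[i]
        row = self._formatter.format_row(self._backend.get_row(physical))
        # The transform runs after format conversion.
        return self._transform(row) if self._transform else row

    def select(self, indices):
        # A view: same backend object, new index list, no rows copied.
        base = range(self._backend.num_rows) if self._indices is None else self._indices
        return TinyDataset(self._backend, self._formatter,
                           [base[j] for j in indices], self._transform)


backend = ListBackend([{"text": "a", "label": 0},
                       {"text": "b", "label": 1},
                       {"text": "c", "label": 2}])
ds = TinyDataset(backend, UpperFormatter())
view = ds.select([2, 0])
```

The dataset class never inspects how rows are stored; swapping `ListBackend` for another object with `num_rows` and `get_row` would leave `TinyDataset` unchanged.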
- class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]
Bases: object
A single-split dataset backed by Arrow.
Users interact with rows, columns, and transforms, never with files or shards. All storage details are delegated to ArrowBackend.
Construction:
- File-backed (typical): pass backend=ArrowBackend(shard_paths=...).
- In-memory: pass backend=ArrowBackend(table=table).
- Indexed view: pass _indices=array to create a virtual view sharing the same backend. Zero data copying.
- add_column(name: str, column) StableDataset[source]
Return a new dataset with an additional column.
column can be a pa.Array, a Python list, or a numpy array.
- as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]
Return a StableIterableDataset wrapping this dataset.
- property features: Features
- filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) StableDataset[source]
Return a view containing rows where fn returns True.
Non-batched (default): fn(row_dict) -> bool, applied per row.
Batched: fn(dict_of_lists) -> list[bool], applied per batch using sequential scan for better performance on large datasets.
Returns an indexed view; no data is materialized.
- flatten_indices(cache_dir: Path | None = None) StableDataset[source]
Materialize an indexed view into a new contiguous Arrow file.
- property info: DatasetInfo
- iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]
Iterate with optional shard-level shuffling.
- make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]
Return a backend-aware torch.utils.data.Sampler for this dataset.
Convenience wrapper around the classes in stable_datasets.samplers. Use as:
    sampler = ds.make_sampler("shard_shuffle", seed=42)
    loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)
DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend's I/O layout.
- Parameters:
kind (str, default "shard_shuffle") – Currently the only supported kind.
**kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).
- map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) StableDataset[source]
Apply a function to every row/batch and return a new dataset.
This is a materializing operation: output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.
Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True).
Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices)).
- remove_columns(columns: list[str] | str) StableDataset[source]
Return a new dataset without the specified columns.
- rename_column(old_name: str, new_name: str) StableDataset[source]
Return a new dataset with a column renamed.
- rename_columns(mapping: dict[str, str]) StableDataset[source]
Return a new dataset with columns renamed per the mapping.
- select(indices) StableDataset[source]
Return a view containing only the specified row indices.
- set_decode(decode: bool) StableDataset[source]
Control whether Image columns are decoded or left as raw bytes.
- set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDataset[source]
Return a view that decodes a video column at read time.
Passing None with no keyword arguments disables video decoding on the returned view.
- shuffle(seed: int = 42) StableDataset[source]
Return a shuffled view.
- property table: Table
Materialize and return the full Arrow table.
For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files, so prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.
- train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]
Random split via index indirection. No data materialization.
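Index indirection means the split only produces two index lists over the same storage. A minimal sketch, assuming a seeded permutation and a rounded test count; `split_indices` is a hypothetical helper and the exact rounding rule used by the library is not documented here.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Return disjoint train/test index lists; no rows are copied."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = int(round(num_rows * test_size))  # assumption: round to nearest row
    return {"train": indices[n_test:], "test": indices[:n_test]}


rows = [f"row-{i}" for i in range(100)]
parts = split_indices(len(rows), test_size=0.1, seed=42)
# A "view" is just a lookup through the index list into the shared rows.
test_view = [rows[i] for i in parts["test"]]
```

Each split costs one integer per row, regardless of how large the underlying rows are.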
- with_format(format_type: str | None) StableDataset[source]
Return a view with the specified output format.
Supported: None (PIL/numpy/Python), "torch", "numpy", "raw".
- with_transform(fn: Callable | None) StableDataset[source]
Return a view with a transform applied after format conversion.
- class StableDatasetDict[source]
Bases: dict
Dict of split_name -> StableDataset.
- set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDatasetDict[source]
Return a split dict where each split applies the same video decode view.
stable_datasets.formatting module
Formatters that convert Arrow-native values to user-facing types.
Middle layer of the three-layer split
(StorageBackend -> Formatter -> StableDataset).
Formatters consume Arrow values and emit PIL images, numpy arrays, torch
tensors, or raw Python, never touching files or storage themselves.
- class Formatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]
Bases: object
Base formatter. Subclasses convert Arrow-native values to user-facing types.
- format_batch(table) list[dict][source]
Format a batch (from backend.take). Returns list of row dicts.
Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.
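The column-first strategy can be illustrated with plain Python containers. This is a sketch under stated assumptions: `format_batch_column_first` is a hypothetical helper, a dict of lists stands in for the Arrow table, and `decoders` is an illustrative mapping from column name to a per-value decode function.

```python
def format_batch_column_first(columns, decoders):
    """columns: name -> list of raw values; decoders: name -> callable (optional)."""
    decoded = {}
    for name, values in columns.items():
        decode = decoders.get(name)
        # Decode each column in one pass instead of branching per row.
        decoded[name] = [decode(v) for v in values] if decode else list(values)
    names = list(decoded)
    # Zip the decoded columns into per-row dicts only at the very end.
    return [dict(zip(names, row)) for row in zip(*decoded.values())]


batch = {"label": [0, 1], "text": [b"a", b"b"]}
rows = format_batch_column_first(batch, {"text": bytes.decode})
```

Looking up the decoder once per column, rather than once per cell, is what makes the bulk path cheaper than calling a row formatter in a loop.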
- format_row(row: dict) dict[source]
Format a single row dict (from backend.get_row).
- class NumpyFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]
Bases: Formatter
Numpy format: Image -> HWC numpy array, rest as-is.
- class PythonFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]
Bases: Formatter
Default format: Image -> PIL, Array3D -> numpy, scalars -> Python native.
- class RawFormatter(features: Features, decode_images: bool = False, cache_dir: Path | None = None)[source]
Bases: Formatter
Raw format: all values as-is from Arrow (bytes for images, bytes for Array3D).
- format_batch(table) list[dict][source]
Format a batch (from backend.take). Returns list of row dicts.
Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.
- format_row(row: dict) dict[source]
Format a single row dict (from backend.get_row).
stable_datasets.iterable module
Iterable dataset for streaming with worker sharding and buffered shuffle.
Provides StableIterableDataset for efficient streaming in PyTorch
DataLoader with multiple workers. Supports shard-level worker partitioning
and reservoir-based row-level shuffle.
- class StableIterableDataset(dataset, *, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]
Bases: IterableDataset
An iterable-style dataset with worker sharding and buffered shuffle.
Wraps a StableDataset for efficient streaming in PyTorch DataLoader with multiple workers. Shards are partitioned across workers so each worker reads a disjoint subset.
- Parameters:
dataset (StableDataset) – The underlying map-style dataset (must be shard-backed).
shuffle (bool) – Whether to shuffle shard order and apply buffered row-level shuffle.
seed (int) – Base random seed.
buffer_size (int) – Size of the reservoir buffer for row-level shuffle.
transform (callable, optional) – Transform applied to each yielded row dict.
- set_epoch(epoch: int)[source]
Set the epoch for varying shuffle seed across epochs.
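Worker sharding and buffered shuffling can both be sketched in a few lines. The helpers below are hypothetical: round-robin shard assignment, a fixed-size shuffle buffer (one common way to realize the reservoir described above), and adding the epoch to the seed are all assumptions, not the library's documented behaviour.

```python
import random


def shards_for_worker(num_shards, worker_id, num_workers):
    """Assumed round-robin split: every shard goes to exactly one worker."""
    return list(range(worker_id, num_shards, num_workers))


def buffered_shuffle(rows, buffer_size, rng):
    """Yield rows in randomized order while holding at most buffer_size rows."""
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)        # fill phase
            continue
        j = rng.randrange(buffer_size)
        yield buffer[j]               # emit a random buffered row...
        buffer[j] = row               # ...and put the new row in its slot
    rng.shuffle(buffer)
    yield from buffer                 # drain what is left


def epoch_rng(seed, epoch):
    # Assumption: seed and epoch are combined by addition.
    return random.Random(seed + epoch)


assignment = [shards_for_worker(10, w, 4) for w in range(4)]
epoch0 = list(buffered_shuffle(range(20), 5, epoch_rng(0, 0)))
```

A larger buffer brings the order closer to a full shuffle at the cost of memory; a buffer of one row degenerates to sequential order.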
stable_datasets.samplers module
Backend-aware samplers for StableDataset.
PyTorch’s DataLoader constructs a
RandomSampler when shuffle=True is
passed. That sampler yields indices in a full-random permutation
regardless of the underlying storage backend. For file-backed
storage formats partitioned into shards (Arrow) or fragments
(Lance), full-random access destroys any per-shard I/O locality
the format was designed to exploit.
This module exposes samplers that yield indices in shard-aware
orderings, preserving the classical PyTorch API (DataLoader(ds,
sampler=...)) while providing a sampler that matches the
backend’s access-pattern preferences:
    from torch.utils.data import DataLoader
    from stable_datasets.images import CIFAR10
    from stable_datasets.samplers import ShardShuffleSampler
    ds = CIFAR10(split="train", storage_format="lance")
    sampler = ShardShuffleSampler(ds, seed=42)
    loader = DataLoader(ds, batch_size=128, sampler=sampler,
                        num_workers=8, persistent_workers=True,
                        multiprocessing_context="spawn")
DataLoader(ds, shuffle=True) continues to work unchanged for
users who need bit-exact full-random ordering (e.g. classification
reproduction). Samplers here are strictly opt-in.
See also
torch.utils.data.Sampler : base class.
lance.sampler.ShardedFragmentSampler : Lance's own fragment sampler for its native lance.torch.data.LanceDataset integration. ShardShuffleSampler is the nearest equivalent exposed through the StableDataset backend protocol.
- class ShardShuffleSampler(dataset, *, seed: int = 0, within_shard: Literal['random', 'sequential'] = 'random')[source]
Bases: Sampler[int]
Yield indices in shard-shuffled order.
The shard (or Lance fragment) order is randomized each epoch. Within each shard, indices are yielded in an order controlled by within_shard:
- "random" (default): indices inside a shard are themselves permuted. Shuffle quality is closer to full-random while still preserving per-shard I/O locality (all samples from shard k are emitted before any sample from shard k+1). Recommended for scientific training where shuffle quality matters.
- "sequential": indices inside a shard are yielded in on-disk order. Maximally I/O-friendly but shuffle quality is coarse at the shard level. Matches the behaviour of lance.sampler.ShardedFragmentSampler.
- Parameters:
dataset (StableDataset) – Must expose a StorageBackend-compatible ._backend with num_shards and a way to iterate per-shard row ranges. Non-file-backed datasets fall back to a single shard covering the full dataset.
seed (int, default 0) – Base seed; the epoch is XOR'd in via set_epoch().
within_shard ({"random", "sequential"}, default "random") – Within-shard row ordering.
Notes
Epoch handling: call set_epoch() before each epoch when using DistributedSampler or any other stateful epoch pattern, so the random permutation differs between epochs. Mirrors PyTorch's own convention.
Fork-safety: the sampler holds only integers and a seed; it pickles trivially and is safe to use with num_workers>0 and multiprocessing_context="spawn".
- set_epoch(epoch: int) None[source]
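The ordering this sampler produces can be sketched as a plain generator. `shard_shuffled_indices` is a hypothetical helper that takes per-shard row counts directly instead of a dataset; the XOR of seed and epoch follows the parameter description above, while the rest is an illustration rather than the library's implementation.

```python
import random


def shard_shuffled_indices(shard_row_counts, seed=0, epoch=0, within_shard="random"):
    """Yield global row indices shard by shard, in a shuffled shard order."""
    rng = random.Random(seed ^ epoch)   # epoch XOR'd into the base seed
    starts, offset = [], 0
    for count in shard_row_counts:
        starts.append(offset)           # first global index of each shard
        offset += count
    order = list(range(len(shard_row_counts)))
    rng.shuffle(order)                  # randomize which shard comes first
    for shard in order:
        rows = list(range(starts[shard], starts[shard] + shard_row_counts[shard]))
        if within_shard == "random":
            rng.shuffle(rows)           # permute inside the shard only
        yield from rows                 # finish this shard before the next


counts = [3, 2, 4]
random_order = list(shard_shuffled_indices(counts, seed=1, epoch=0))
sequential_order = list(shard_shuffled_indices(counts, seed=1, epoch=0,
                                               within_shard="sequential"))
```

Both modes emit every index exactly once and never interleave shards; they differ only in whether rows inside a shard keep their on-disk order.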
stable_datasets.schema module
Feature and metadata schema definitions.
Each feature type maps itself to a PyArrow type for Arrow IPC serialization.
- class Array3D(shape: tuple, dtype: str = 'uint8')[source]
Bases: FeatureType
Fixed-shape 3D array stored as flat bytes.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]
Bases: object
Base config for multi-variant datasets.
- version: Version | None = None
- class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]
Bases: FeatureType
Categorical label with name-to-int mapping.
- encode(value, *, cache_dir: Path | None = None)[source]
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- int2str(idx: int) str[source]
- str2int(name: str) int[source]
- to_arrow_type() DataType[source]
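The name-to-int mapping behind str2int and int2str can be sketched as a list plus a reverse lookup. `LabelMap` is a hypothetical stand-in for illustration; how ClassLabel handles unknown names or a bare num_classes is not shown here.

```python
class LabelMap:
    """Toy two-way mapping between class names and integer ids."""

    def __init__(self, names):
        self.names = list(names)
        # Reverse index built once so str2int is a dict lookup, not a scan.
        self._index = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name):
        return self._index[name]

    def int2str(self, idx):
        return self.names[idx]


labels = LabelMap(["cat", "dog", "bird"])
dog_id = labels.str2int("dog")
```

Storing the integer id keeps the on-disk column compact, while the names list makes the ids readable again at display time.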
- class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]
Bases: object
Metadata container for a dataset (description, features, citation, etc.).
- features: Features
- class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]
Bases: Mapping[str, object]
Typed source and download metadata for one dataset builder.
- assets: dict[str, DownloadInfo | str]
- get(k[, d]) -> D[k] if k in D, else d. d defaults to None.[source]
- class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]
Bases: object
Download source metadata for one raw asset.
url is attempted first. Any fallbacks are tried in order if the primary URL fails.
- all_urls() list[str][source]
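The primary-then-fallbacks ordering can be sketched as follows. `DownloadSpec` and `first_success` are hypothetical names for illustration, the URLs are placeholders, and treating `OSError` as the failure signal is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadSpec:
    """Toy stand-in: one primary URL plus ordered fallbacks."""
    url: str
    fallbacks: list = field(default_factory=list)

    def all_urls(self):
        # Primary first, then fallbacks in their declared order.
        return [self.url, *self.fallbacks]


def first_success(spec, fetch):
    """Try each URL in order and return the first successful fetch result."""
    errors = []
    for url in spec.all_urls():
        try:
            return fetch(url)
        except OSError as exc:
            errors.append((url, exc))   # remember the failure, try the next URL
    raise OSError(f"all sources failed: {errors}")


spec = DownloadSpec("https://primary.invalid/a.zip", ["https://mirror.invalid/a.zip"])
```

Passing the fetch function in as an argument keeps the ordering logic testable without touching the network.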
- class FeatureType[source]
Bases: object
Base class for feature type descriptors.
- encode(value, *, cache_dir: Path | None = None)[source]
- fingerprint_data() str[source]
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class Features[source]
Bases: OrderedDict
Ordered mapping of field_name -> FeatureType.
Generates a PyArrow schema via .to_arrow_schema().
- fingerprint_data() str[source]
- to_arrow_schema() schema[source]
- class Image(encode_format: str = 'PNG')[source]
Bases: FeatureType
Image feature stored as raw bytes in Arrow.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type()[source]
- class Sequence(feature: FeatureType)[source]
Bases: FeatureType
Variable-length list of a sub-feature.
- encode(value, *, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class Value(dtype: str)[source]
Bases: FeatureType
Scalar value type. Maps dtype strings to PyArrow types.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class Version(version_str: str)[source]
Bases: object
Semantic version string (major.minor.patch).
- class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]
Bases: FeatureType
Video feature with validated path, bytes, or specialized frame storage.
- encode(value, *, cache_dir: Path | None = None)[source]
- fingerprint_data() str[source]
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]
Bases: object
Read-time video decode configuration.
This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.
- decode_fn: VideoDecodeFn | None = None
- decode_fn_batched: VideoDecodeFnBatched | None = None
- class VideoDecodeFn(*args, **kwargs)[source]
Bases: Protocol
Per-sample video decode callback.
- class VideoDecodeFnBatched(*args, **kwargs)[source]
Bases: Protocol
Batched video decode callback used by StableDataset.__getitems__.
- class VideoRef(cell: Mapping[str, Any], cache_dir: Path | None = None)[source]
Bases: object
Lazy reference to a cached video asset.
- property bytes: bytes
- collect_dataset_citations(sources: Iterable[DatasetSource | Mapping[str, object]]) list[str][source]
Collect unique citation strings in stable first-seen order.
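Order-preserving de-duplication is a short idiom in Python. A minimal sketch, assuming each source behaves like a mapping with a "citation" key; `collect_unique_citations` is a hypothetical name and the sample strings are placeholders.

```python
def collect_unique_citations(sources):
    """Keep the first occurrence of each citation string, preserving order."""
    seen = {}
    for source in sources:
        # dicts remember insertion order, so first-seen order is kept for free.
        seen.setdefault(source["citation"], None)
    return list(seen)


sources = [{"citation": "A"}, {"citation": "B"}, {"citation": "A"}]
unique = collect_unique_citations(sources)
```

Using a dict instead of a set is what keeps the result stable between runs.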
stable_datasets.splits module
Split name constants and split generator.
- class Split[source]
Bases: object
- class SplitGenerator(name: str, gen_kwargs: dict = <factory>)[source]
Bases: object
Describes one split and the kwargs to pass to _generate_examples.
stable_datasets.utils module
- class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]
Bases: object
Base class for stable-datasets builders.
Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.
- SOURCE: DatasetSource | Mapping
- VERSION: Version
- __init__(config_name: str | None = None, **kwargs)[source]
Initialize builder, selecting a BuilderConfig if applicable.
- bulk_download(urls: Iterable[str | DownloadInfo], dest_folder: str | Path, checksums: dict[str, str] | None = None) list[Path][source]
Download multiple files concurrently and return their local paths.
- Parameters:
urls – Iterable of URL strings or DownloadInfo specs to download.
dest_folder – Destination folder for downloads.
checksums – Optional dict mapping primary URL -> checksum string.
- Returns:
Local file paths in the same order as the input URLs.
- Return type:
list[Path]
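Checksum strings passed to the download helpers take an "algorithm:hex" form such as "sha256:...". Verifying one can be sketched with hashlib alone; `verify_checksum` is a hypothetical helper, and the chunk size and the demo file are illustrative choices.

```python
import hashlib
import os
import tempfile


def verify_checksum(path, checksum):
    """Return True if the file matches an "algorithm:hex" checksum; delete it otherwise."""
    algorithm, sep, expected = checksum.partition(":")
    if not sep:
        raise ValueError("checksum must look like 'algorithm:hex'")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Hash in 1 MiB chunks so large downloads never sit fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected.lower():
        return True
    os.remove(path)   # a mismatching file is discarded
    return False


fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"hello")
good = "sha256:" + hashlib.sha256(b"hello").hexdigest()
```

Keeping the algorithm name inside the string lets one checksums dict mix digests of different types.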
- download(url: str | DownloadInfo, dest_folder: str | Path | None = None, progress_bar: bool = True, _progress_dict=None, _task_id=None, checksum: str | None = None, fallbacks: list[str] | None = None, filename: str | None = None, cache_key_url: str | None = None) Path[source]
Download a file to dest_folder, returning the local path.
Supports resumable downloads: if a .tmp file exists from a previous interrupted attempt, an HTTP Range header is sent. The server may respond with 206 (append) or 200 (start over).
checksum, when provided, is an "algorithm:hex" string (e.g. "sha256:a3f8..."). The downloaded file is verified after completion and deleted on mismatch.
- load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]
Load data from a .ts file into a Pandas DataFrame. Credit to https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379
- Parameters:
full_file_path_and_name (str) – The full pathname of the .ts file to read.
return_separate_X_and_y – True if X and y values should be returned as a separate DataFrame (X) and a numpy array (y), False otherwise. This is only relevant for data that
replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.
- Returns:
DataFrame, ndarray – If return_separate_X_and_y, a tuple containing a DataFrame and a numpy array with the relevant time series and corresponding class values.
DataFrame – If not return_separate_X_and_y, a single DataFrame containing all time series and (if relevant) a column "class_vals" with the associated class values.
Module contents
- class Array3D(shape: tuple, dtype: str = 'uint8')[source]
Bases: FeatureType
Fixed-shape 3D array stored as flat bytes.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() DataType[source]
- class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]
Bases: object
Base class for stable-datasets builders.
Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.
- SOURCE: DatasetSource | Mapping
- VERSION: Version
- __init__(config_name: str | None = None, **kwargs)[source]
Initialize builder, selecting a BuilderConfig if applicable.
- class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]
Bases: object
Base config for multi-variant datasets.
- version: Version | None = None
- class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]
Bases: FeatureType
Categorical label with name-to-int mapping.
- encode(value, *, cache_dir: Path | None = None)[source]
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- int2str(idx: int) → str[source]
- str2int(name: str) → int[source]
- to_arrow_type() → DataType[source]
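The name-to-int mapping behind str2int and int2str can be pictured with a small stand-in. `LabelMap` below is hypothetical and is not the library's ClassLabel; in particular, assigning ids by position in `names` is an assumption.

```python
class LabelMap:
    """Hypothetical stand-in for a name-to-int label mapping."""

    def __init__(self, names):
        self.names = list(names)
        # Assumption: ids are assigned by position in `names`.
        self._index = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name: str) -> int:
        return self._index[name]

    def int2str(self, idx: int) -> str:
        return self.names[idx]


labels = LabelMap(["cat", "dog", "bird"])
round_trip = labels.int2str(labels.str2int("dog"))
```

The two methods are inverses of each other, so a label survives a round trip through its integer id.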
- class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]
Bases: object
Metadata container for a dataset (description, features, citation, etc.).
- features: Features
- class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]
Bases: Mapping[str, object]
Typed source and download metadata for one dataset builder.
- assets: dict[str, DownloadInfo | str]
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.[source]
- class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]
Bases: object
Download source metadata for one raw asset.
url is attempted first. Any fallbacks are tried in order if the primary URL fails.
- all_urls() → list[str][source]
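The primary-then-fallbacks order can be sketched as follows. `DownloadInfoSketch` and `fetch_first` are illustrative stand-ins, not the library's classes or functions; the assumption that all_urls() returns the primary URL followed by the fallbacks is inferred from the description above, and the example.org URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DownloadInfoSketch:
    """Illustrative stand-in mirroring the documented fields."""
    url: str
    fallbacks: List[str] = field(default_factory=list)
    checksum: Optional[str] = None
    filename: Optional[str] = None

    def all_urls(self) -> List[str]:
        # Assumed ordering: primary URL first, then fallbacks as declared.
        return [self.url, *self.fallbacks]


def fetch_first(info: DownloadInfoSketch, fetch: Callable[[str], bytes]) -> bytes:
    """Try each URL in order and return the first successful result."""
    failures = []
    for candidate in info.all_urls():
        try:
            return fetch(candidate)
        except OSError as exc:
            failures.append((candidate, exc))
    raise OSError(f"all {len(failures)} URLs failed")


def fake_fetch(url: str) -> bytes:
    # Pretend the primary host is down so the fallback is exercised.
    if "primary" in url:
        raise OSError("unreachable")
    return b"payload from " + url.encode()


info = DownloadInfoSketch(
    url="https://primary.example.org/data.zip",
    fallbacks=["https://mirror.example.org/data.zip"],
)
payload = fetch_first(info, fake_fetch)
```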
- class Features[source]
Bases: OrderedDict
Ordered mapping of field_name -> FeatureType.
Generates a PyArrow schema via .to_arrow_schema().
- fingerprint_data() → str[source]
- to_arrow_schema() → schema[source]
- class Image(encode_format: str = 'PNG')[source]
Bases: FeatureType
Image feature stored as raw bytes in Arrow.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type()[source]
- class Sequence(feature: FeatureType)[source]
Bases: FeatureType
Variable-length list of a sub-feature.
- encode(value, *, cache_dir: Path | None = None)[source]
- to_arrow_type() → DataType[source]
- class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]
Bases: object
A single-split dataset backed by Arrow.
Users interact with rows, columns, and transforms, never with files or shards. All storage details are delegated to ArrowBackend.
Construction:
- File-backed (typical): pass backend=ArrowBackend(shard_paths=...).
- In-memory: pass backend=ArrowBackend(table=table).
- Indexed view: pass _indices=array to create a virtual view sharing the same backend. Zero data copying.
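The indexed-view idea can be illustrated without Arrow at all. In the sketch below a plain Python list stands in for the backend; `IndexedView` is a hypothetical class, not the library's implementation, and only the principle (a shared store plus an index array, no copying) is taken from the description above.

```python
class IndexedView:
    """Hypothetical view: a shared row store plus an index list, nothing copied."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared backend stand-in (a plain list here)
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Translate the view position into a position in the shared store.
        return self._rows[self._indices[i]]

    def select(self, indices):
        # Compose new indices with the existing ones; the row store is reused.
        return IndexedView(self._rows, [self._indices[i] for i in indices])


rows = [{"x": n} for n in range(5)]
base = IndexedView(rows)
view = base.select([4, 2, 0]).select([1])
```

Chaining views only composes index lists, which is why operations built on this pattern can avoid materializing data.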
- add_column(name: str, column) → StableDataset[source]
Return a new dataset with an additional column.
column can be a pa.Array, a Python list, or a numpy array.
- as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]
Return a StableIterableDataset wrapping this dataset.
- property features: Features
- filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) → StableDataset[source]
Return a view containing rows where fn returns True.
Non-batched (default): fn(row_dict) -> bool, applied per row. Batched: fn(dict_of_lists) -> list[bool], applied per batch using sequential scan for better performance on large datasets.
Returns an indexed view; no data is materialized.
- flatten_indices(cache_dir: Path | None = None) → StableDataset[source]
Materialize an indexed view into a new contiguous Arrow file.
- property info: DatasetInfo
- iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]
Iterate with optional shard-level shuffling.
- make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]
Return a backend-aware torch.utils.data.Sampler for this dataset.
Convenience wrapper around the classes in stable_datasets.samplers. Use as:
    sampler = ds.make_sampler("shard_shuffle", seed=42)
    loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)
DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend’s I/O layout.
- Parameters:
kind (str, default "shard_shuffle") – Currently the only supported kind.
**kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).
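One plausible reading of an iteration order "matched to the backend's I/O layout" is to shuffle the order of shards and, optionally, the rows inside each shard, so that reads stay local to one shard at a time. The sketch below implements that reading with the standard library. It is an assumption about what "shard_shuffle" and `within_shard` mean, not the library's sampler, and `shard_shuffle_order` is a hypothetical name.

```python
import random


def shard_shuffle_order(shard_row_counts, seed=0, within_shard=True):
    """Yield global row indices shard by shard, in a seeded random shard order."""
    rng = random.Random(seed)
    # Global start offset of each shard.
    offsets, total = [], 0
    for count in shard_row_counts:
        offsets.append(total)
        total += count
    shard_ids = list(range(len(shard_row_counts)))
    rng.shuffle(shard_ids)
    for shard in shard_ids:
        start = offsets[shard]
        rows = list(range(start, start + shard_row_counts[shard]))
        if within_shard:
            rng.shuffle(rows)  # randomize rows, but only inside this shard
        yield from rows


order = list(shard_shuffle_order([3, 2, 4], seed=42, within_shard=False))
```

Every row is visited exactly once, and all rows of a shard are emitted before the next shard starts.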
- map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) → StableDataset[source]
Apply a function to every row/batch and return a new dataset.
This is a materializing operation: output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.
Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True). Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices)).
- Parameters:
- remove_columns(columns: list[str] | str) → StableDataset[source]
Return a new dataset without the specified columns.
- rename_column(old_name: str, new_name: str) → StableDataset[source]
Return a new dataset with a column renamed.
- rename_columns(mapping: dict[str, str]) → StableDataset[source]
Return a new dataset with columns renamed per the mapping.
- select(indices) → StableDataset[source]
Return a view containing only the specified row indices.
- set_decode(decode: bool) → StableDataset[source]
Control whether Image columns are decoded or left as raw bytes.
- set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDataset[source]
Return a view that decodes a video column at read time.
Passing None with no keyword arguments disables video decoding on the returned view.
- shuffle(seed: int = 42) → StableDataset[source]
Return a shuffled view.
- property table: Table
Materialize and return the full Arrow table.
For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files, so prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.
- train_test_split(test_size: float = 0.1, seed: int = 42) → dict[str, StableDataset][source]
Random split via index indirection. No data materialization.
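A random split "via index indirection" amounts to partitioning a shuffled list of row indices and handing each part to a view. The sketch below shows only the index side; `split_indices` is a hypothetical helper, and the rounding rule used for the test size is an assumption.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Return (train_indices, test_indices) as a seeded random partition."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(num_rows * test_size))  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100, test_size=0.1, seed=42)
```

Because only integers are shuffled and sliced, no rows are read or copied until a view is actually indexed.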
- with_format(format_type: str | None) → StableDataset[source]
Return a view with the specified output format.
Supported: None (PIL/numpy/Python), "torch", "numpy", "raw".
- with_transform(fn: Callable | None) → StableDataset[source]
Return a view with a transform applied after format conversion.
- class StableDatasetDict[source]
Bases: dict
Dict of split_name -> StableDataset.
- set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDatasetDict[source]
Return a split dict where each split applies the same video decode view.
- class Value(dtype: str)[source]
Bases: FeatureType
Scalar value type. Maps dtype strings to PyArrow types.
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() → DataType[source]
- class Version(version_str: str)[source]
Bases: object
Semantic version string (major.minor.patch).
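Parsing a major.minor.patch string into integers matters because string comparison orders "1.10.0" before "1.9.3". The sketch below is a hypothetical `VersionSketch`, not the library's Version class; whether the real class validates input or supports ordering is not stated here.

```python
from functools import total_ordering


@total_ordering
class VersionSketch:
    """Hypothetical parser for a "major.minor.patch" string."""

    def __init__(self, version_str: str) -> None:
        parts = version_str.split(".")
        if len(parts) != 3 or not all(part.isdigit() for part in parts):
            raise ValueError(f"expected major.minor.patch, got {version_str!r}")
        self.major, self.minor, self.patch = (int(part) for part in parts)

    def _key(self):
        return (self.major, self.minor, self.patch)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        # Compare numerically, component by component.
        return self._key() < other._key()

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"


newer = VersionSketch("1.10.0") > VersionSketch("1.9.3")
```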
- class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]
Bases: FeatureType
Video feature with validated path, bytes, or specialized frame storage.
- encode(value, *, cache_dir: Path | None = None)[source]
- fingerprint_data() → str[source]
- format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
- to_arrow_type() → DataType[source]
- class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]
Bases: object
Read-time video decode configuration.
This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.
- decode_fn: VideoDecodeFn | None = None
- decode_fn_batched: VideoDecodeFnBatched | None = None
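The sampling, num_frames, and frame_stride options suggest that a decoder first picks which frame indices to read. The sketch below shows one possible interpretation of the "start", "center", and "uniform" modes. The exact semantics are assumptions, the "random" mode is omitted, and `sample_frame_indices` is a hypothetical helper rather than a library function.

```python
def sample_frame_indices(total_frames, num_frames, sampling="uniform", frame_stride=1):
    """Pick frame indices for a clip; raises if the video is too short."""
    if sampling == "uniform":
        # Assumed meaning: spread frames evenly over the whole video
        # (frame_stride is not used in this mode).
        if num_frames > total_frames:
            raise ValueError("video has fewer frames than requested")
        if num_frames == 1:
            return [0]
        step = (total_frames - 1) / (num_frames - 1)
        return [round(i * step) for i in range(num_frames)]

    # Frames covered by a strided window of num_frames samples.
    span = (num_frames - 1) * frame_stride + 1
    if span > total_frames:
        raise ValueError("video too short for this window")  # akin to pad="error"
    if sampling == "start":
        first = 0
    elif sampling == "center":
        first = (total_frames - span) // 2
    else:
        raise ValueError(f"unsupported sampling mode: {sampling!r}")
    return [first + i * frame_stride for i in range(num_frames)]


start_clip = sample_frame_indices(10, 4, sampling="start", frame_stride=2)
center_clip = sample_frame_indices(10, 4, sampling="center", frame_stride=2)
uniform_clip = sample_frame_indices(10, 4, sampling="uniform")
```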
- class VideoDecodeFn(*args, **kwargs)[source]
Bases: Protocol
Per-sample video decode callback.
- class VideoDecodeFnBatched(*args, **kwargs)[source]
Bases: Protocol
Batched video decode callback used by StableDataset.__getitems__.