stable_datasets package

Subpackages

Submodules

stable_datasets.cache module

Generator-to-Arrow sharded caching pipeline.

Writes dataset examples to a directory of PyArrow IPC (Feather v2) shard files. Peak memory during writes is bounded to ~1 batch, and the sharded layout supports efficient sequential reads for training workloads.

class CacheOpenResult(*, backend, num_rows: int, layout: str, metadata)[source]

Bases: object

Result of opening cache metadata into a backend.

backend
layout
metadata
num_rows
class LanceCacheMeta(cache_dir: Path, num_rows: int, schema_fingerprint: str)[source]

Bases: object

Lightweight descriptor for a Lance-format cache on disk.

cache_dir
num_rows
schema_fingerprint
class ShardedCacheMeta(cache_dir: Path, num_rows: int, num_shards: int, shard_filenames: list[str], shard_row_counts: list[int], schema_fingerprint: str, compression: str | None = None)[source]

Bases: object

Lightweight descriptor for a sharded Arrow cache on disk.

cache_dir
compression
num_rows
num_shards
schema_fingerprint
shard_filenames
property shard_paths: list[Path]
shard_row_counts
cache_fingerprint(cls_name: str, version: str, config_name: str, split: str, storage_format: str = 'arrow') str[source]

Deterministic cache directory name for a dataset variant + split.

storage_format is always included in the hash so Arrow and Lance caches for the same dataset coexist at different paths.
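As a minimal sketch of how such a deterministic name could be derived: the hashing scheme, the field separator, the digest length, and the name `sketch_cache_fingerprint` are all assumptions made for illustration, not the library's actual implementation.

```python
import hashlib


def sketch_cache_fingerprint(cls_name, version, config_name, split,
                             storage_format="arrow"):
    """Hypothetical stand-in: hash all identifying fields into a directory name."""
    # Join with a separator that is unlikely to occur inside a field, so that
    # ("ab", "c") and ("a", "bc") produce different keys.
    key = "\x1f".join([cls_name, version, config_name, split, storage_format])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"{cls_name}-{split}-{digest}"


arrow_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train")
lance_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train",
                                     storage_format="lance")
```

Because storage_format is part of the hashed key, the two calls yield different directory names for the same dataset and split, while repeated calls with identical arguments yield the same name.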

detect_cache_format(cache_dir: Path) str[source]

Return "arrow" or "lance" based on the cache’s metadata.

detect_cache_layout(cache_dir: Path) str[source]

Return the physical cache layout recorded in cache metadata.

encode_example(example: dict, features: Features, *, cache_dir: Path | None = None) dict[source]

Encode a single example dict into Arrow-compatible values.

open_cache(cache_dir: Path, features: Features, *, backend_kwargs: dict | None = None) CacheOpenResult[source]

Open a cache directory and return the backend selected by its layout.

read_lance_cache_meta(cache_dir: Path) LanceCacheMeta[source]

Read metadata from a Lance-format cache directory.

Returns a LanceCacheMeta with the cached row count and schema fingerprint populated from _metadata.json. Deliberately does NOT open the underlying Lance dataset: that would initialize Lance’s tokio runtime in the caller’s process, which is a DataLoader-fork footgun. Row count comes from the metadata file, not from ds.count_rows().

read_shard(shard_path: Path) Table[source]

Memory-map a single shard file and return its table.

read_sharded_cache_meta(cache_dir: Path) ShardedCacheMeta[source]

Read metadata from a sharded cache directory.

Validates that all shard files and metadata exist and are internally consistent. Raises ValueError on corruption.
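The kind of consistency check described above might look like the following sketch. The JSON key names (`num_rows`, `shard_filenames`, `shard_row_counts`) mirror the ShardedCacheMeta fields but are assumed here, and `sketch_check_sharded_meta` is a hypothetical helper, not the library function.

```python
import json
import tempfile
from pathlib import Path


def sketch_check_sharded_meta(cache_dir: Path) -> dict:
    """Hypothetical consistency check; the JSON key names are assumptions."""
    meta_path = cache_dir / "_metadata.json"
    if not meta_path.exists():
        raise ValueError(f"missing metadata file: {meta_path}")
    meta = json.loads(meta_path.read_text())
    names, counts = meta["shard_filenames"], meta["shard_row_counts"]
    if len(names) != len(counts):
        raise ValueError("shard list and row-count list differ in length")
    if sum(counts) != meta["num_rows"]:
        raise ValueError("per-shard row counts do not sum to num_rows")
    missing = [n for n in names if not (cache_dir / n).exists()]
    if missing:
        raise ValueError(f"missing shard files: {missing}")
    return meta


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "shard-00000.arrow").write_bytes(b"")
    (root / "_metadata.json").write_text(json.dumps({
        "num_rows": 3,
        "shard_filenames": ["shard-00000.arrow", "shard-00001.arrow"],
        "shard_row_counts": [2, 1],
    }))
    try:
        sketch_check_sharded_meta(root)
        error = None
    except ValueError as exc:
        error = str(exc)  # the second shard file was never created
```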

validate_sharded_cache(cache_dir: Path, features: Features) ShardedCacheMeta[source]

Read and validate a sharded cache, checking the schema fingerprint.

Raises ValueError if the cache is inconsistent or the schema has changed.

write_lance_cache(generator, features: Features, cache_dir: Path, *, batch_size: int = 1000, num_encode_workers: int = 0, lineage: dict | None = None) LanceCacheMeta[source]

Consume a generator and write directly to a Lance dataset.

Mirrors write_sharded_arrow_cache() in shape (same generator contract, same features, same encode pipeline, same atomic-publish semantics) but writes a Lance dataset via lance.write_dataset instead of Arrow IPC shards. No intermediate Arrow IPC file is produced – the encoded rows stream into Lance via a pa.RecordBatchReader, so the native Lance write path is used end-to-end.

Writing is atomic: Lance writes to a temporary directory next to cache_dir and the directory is renamed on success. The completed cache directory contains:

  • Lance dataset files (_versions/, data/, manifest)

  • _metadata.json – row count, format marker, schema fingerprint

Parameters:
  • batch_size (int) – Rows per pa.RecordBatch flushed to the Lance writer. Larger batches reduce per-call overhead; smaller batches reduce peak memory during writing.

  • num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool (same contract as the Arrow writer).

  • lineage (dict, optional) – Optional metadata blob written into _metadata.json.

write_lance_video_frames_cache(generator, features: Features, cache_dir: Path, *, video_column: str | None = None, quality: int = 65, resize: int | None = None, workers: int | None = None, skip_corrupt: bool = True, lineage: dict | None = None) LanceCacheMeta[source]

Write a specialized Lance row-per-frame video cache.

Each input example contributes one source video. The physical Lance dataset stores one WebP-encoded frame per row. Segment sampling is a read-time concern handled by LanceVideoFramesBackend.

write_sharded_arrow_cache(generator, features: Features, cache_dir: Path, *, shard_size_bytes: int = inf, batch_size: int = 1000, compression: str | None = None, num_encode_workers: int = 0, single_file: bool = False, lineage: dict | None = None) ShardedCacheMeta[source]

Consume a generator and write to a directory of Arrow IPC shards.

Batches are flushed every batch_size rows. After each flush the cumulative RecordBatch.nbytes for the current shard is checked; when it exceeds shard_size_bytes the shard is closed. The next shard is opened lazily when the next batch is ready, so there are never trailing empty shards.

Note

shard_size_bytes is an approximate target based on Arrow in-memory batch sizes, not exact on-disk file sizes. Actual shard files may be somewhat larger or smaller due to IPC framing, batch granularity, and compression differences.

An empty generator produces zero shards (num_shards == 0).
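The rollover rule can be sketched without any Arrow dependency by treating each batch as its byte size. This is an illustration of the check-after-flush logic described above, with `sketch_plan_shards` as a hypothetical helper name.

```python
def sketch_plan_shards(batch_nbytes, shard_size_bytes=float("inf")):
    """Group batch sizes into shards using the check-after-flush rule."""
    shards, current, current_bytes = [], None, 0
    for nbytes in batch_nbytes:
        if current is None:          # open the next shard lazily
            current, current_bytes = [], 0
            shards.append(current)
        current.append(nbytes)       # flush the batch into the open shard
        current_bytes += nbytes
        if current_bytes > shard_size_bytes:
            current = None           # close; nothing is opened until needed
    return shards


plan = sketch_plan_shards([40, 40, 40, 40, 40], shard_size_bytes=100)
```

Here the first shard closes only after its third batch pushes it to 120 bytes, past the 100-byte target, which is why shard sizes are approximate. An empty input yields an empty list, matching the zero-shard case.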

The completed cache directory contains:

  • shard-NNNNN.arrow — zero or more IPC files

  • _metadata.json — row counts, shard list, format version, schema fingerprint

Writing is atomic: shards are first written to a temporary directory and renamed into place on success.
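A write-to-temporary-then-rename pattern of this kind can be sketched with the standard library alone. The helper name, the temporary-directory naming, and the cleanup policy below are assumptions, not the library's code.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sketch_atomic_write_dir(final_dir: Path, files: dict) -> Path:
    """Write `files` (name -> bytes) so final_dir appears only when complete."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # The temporary directory lives next to the target so the rename stays
    # on one filesystem.
    tmp_dir = Path(tempfile.mkdtemp(prefix=final_dir.name + ".tmp-",
                                    dir=final_dir.parent))
    try:
        for name, payload in files.items():
            (tmp_dir / name).write_bytes(payload)
        os.rename(tmp_dir, final_dir)   # publish in one step
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)   # drop partial output
        raise
    return final_dir


with tempfile.TemporaryDirectory() as root:
    published = sketch_atomic_write_dir(
        Path(root) / "cache",
        {"shard-00000.arrow": b"abc", "_metadata.json": b"{}"},
    )
    published_names = sorted(p.name for p in published.iterdir())
    leftovers = [p.name for p in Path(root).iterdir() if ".tmp-" in p.name]
```

A reader that checks for the final directory therefore sees either no cache or a complete one, never a half-written set of shards.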

Parameters:
  • compression (str or None) – IPC buffer compression codec (e.g. "zstd", "lz4"). Decompression on read is automatic.

  • num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool.

Returns:

A ShardedCacheMeta describing the cache.

stable_datasets.dataset module

Map-style dataset built on a pluggable storage backend.

Provides StableDataset (single split) and StableDatasetDict (multi-split), exposing __len__, __getitem__, __getitems__, __iter__, .features, and .train_test_split().

Architecture: three layers with strict boundaries:

StorageBackend  -> row access, iteration, pickling (returns Arrow types)
    |
Formatter       -> Arrow -> user type (PIL / torch / numpy / raw)
    |
StableDataset   -> orchestrates backend + formatter + indices + transform

StableDataset depends only on the StorageBackend protocol, never on a concrete implementation or on-disk layout.
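The layering can be illustrated with toy classes. Everything below is a reduced, hypothetical stand-in (`SketchBackend`, `ListBackend`, `UpperFormatter`, `SketchDataset`); the real StorageBackend protocol has more methods than the two shown.

```python
from typing import Protocol


class SketchBackend(Protocol):
    """Hypothetical, reduced stand-in for the StorageBackend protocol."""
    def __len__(self) -> int: ...
    def get_row(self, idx: int) -> dict: ...


class ListBackend:
    """Toy backend holding rows in a list; satisfies SketchBackend structurally."""
    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def get_row(self, idx):
        return self._rows[idx]


class UpperFormatter:
    """Toy formatter: converts stored values to the user-facing type."""
    def format_row(self, row):
        return {k: v.upper() if isinstance(v, str) else v for k, v in row.items()}


class SketchDataset:
    """Orchestrates backend + formatter + transform; knows nothing about storage."""
    def __init__(self, backend: SketchBackend, formatter, transform=None):
        self._backend = backend
        self._formatter = formatter
        self._transform = transform

    def __len__(self):
        return len(self._backend)

    def __getitem__(self, idx):
        row = self._formatter.format_row(self._backend.get_row(idx))
        return self._transform(row) if self._transform else row


ds = SketchDataset(ListBackend([{"label": "cat", "id": 0}]), UpperFormatter())
first = ds[0]
```

Swapping `ListBackend` for any other object with the same two methods requires no change to `SketchDataset`, which is the point of depending on a protocol rather than a concrete class.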

class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]

Bases: object

A single-split dataset backed by Arrow.

Users interact with rows, columns, and transforms — never with files or shards. All storage details are delegated to ArrowBackend.

Construction:

  1. File-backed (typical) — pass backend=ArrowBackend(shard_paths=...).

  2. In-memory — pass backend=ArrowBackend(table=table).

  3. Indexed view — pass _indices=array to create a virtual view sharing the same backend. Zero data copying.
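The indexed-view idea in item 3 can be sketched as plain index indirection over shared storage. `SketchView` is a hypothetical class used only for illustration; it holds a Python list where the library holds a backend.

```python
class SketchView:
    """Hypothetical index-indirection view over shared row storage."""
    def __init__(self, rows, indices=None):
        self._rows = rows   # shared, never copied
        self._indices = (list(range(len(rows))) if indices is None
                         else list(indices))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]

    def select(self, indices):
        # Compose through the current indices so views of views stay correct
        return SketchView(self._rows, [self._indices[i] for i in indices])


rows = [{"x": v} for v in range(10)]
evens = SketchView(rows).select([0, 2, 4, 6, 8])
tail = evens.select([3, 4])
```

Each view stores only a list of integers, so chaining selections costs memory proportional to the number of selected rows rather than to the data itself.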

add_column(name: str, column) StableDataset[source]

Return a new dataset with an additional column.

column can be a pa.Array, a Python list, or a numpy array.

as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]

Return a StableIterableDataset wrapping this dataset.

property column_names: list[str]
property features: Features
filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) StableDataset[source]

Return a view containing rows where fn returns True.

Non-batched (default): fn(row_dict) -> bool, applied per row. Batched: fn(dict_of_lists) -> list[bool], applied per batch using sequential scan for better performance on large datasets.

Returns an indexed view — no data is materialized.
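The two predicate contracts can be compared on a small column dictionary. The helper `sketch_filter_indices` is hypothetical and works on plain lists; it shows only how per-row and per-batch predicates lead to the same set of kept indices.

```python
def sketch_filter_indices(columns, fn, *, batched=False, batch_size=1000):
    """Return indices of rows for which fn is True; `columns` is a dict of lists."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    keep = []
    if batched:
        for start in range(0, num_rows, batch_size):
            batch = {n: columns[n][start:start + batch_size] for n in names}
            mask = fn(batch)                      # one call per batch
            keep.extend(start + i for i, ok in enumerate(mask) if ok)
    else:
        for i in range(num_rows):
            if fn({n: columns[n][i] for n in names}):   # one call per row
                keep.append(i)
    return keep


data = {"label": [0, 1, 0, 1, 1], "score": [0.2, 0.9, 0.7, 0.1, 0.8]}
per_row = sketch_filter_indices(data, lambda row: row["score"] > 0.5)
per_batch = sketch_filter_indices(
    data, lambda b: [s > 0.5 for s in b["score"]], batched=True, batch_size=2)
```

The result is a list of row indices, which is all an indexed view needs to store.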

flatten_indices(cache_dir: Path | None = None) StableDataset[source]

Materialize an indexed view into a new contiguous Arrow file.

property info: DatasetInfo
iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]

Iterate with optional shard-level shuffling.

make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]

Return a backend-aware torch.utils.data.Sampler for this dataset.

Convenience wrapper around the classes in stable_datasets.samplers. Use as:

sampler = ds.make_sampler("shard_shuffle", seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)

DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend’s I/O layout.

Parameters:
  • kind (str, default "shard_shuffle") – Currently the only supported kind.

  • **kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).

map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) StableDataset[source]

Apply a function to every row/batch and return a new dataset.

This is a materializing operation — output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.

Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True). Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices)).

Parameters:
  • features (Features, optional) – Output schema. If None, columns matching input features keep their types; new columns are inferred from Arrow types. Provide explicitly when the output schema is ambiguous.

  • cache_dir (path, optional) – Where to write the output cache. If None, uses a temp directory.
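The function contracts accepted by map can be shown with plain Python callables. The function names and the `text` column are made up for the example; the calls below apply them by hand rather than through the library.

```python
def add_length(row):
    """Non-batched contract: row dict in, row dict out."""
    return {**row, "length": len(row["text"])}


def add_length_batched(batch):
    """Batched contract: dict of lists in, dict of lists out."""
    return {**batch, "length": [len(t) for t in batch["text"]]}


def add_position(row, idx):
    """Non-batched contract when with_indices=True."""
    return {**row, "position": idx}


rows = [{"text": "ab"}, {"text": "cde"}]
mapped_rows = [add_length(r) for r in rows]
mapped_batch = add_length_batched({"text": [r["text"] for r in rows]})
indexed_rows = [add_position(r, i) for i, r in enumerate(rows)]
```

Going by the signature above, these would be passed as `ds.map(add_length)`, `ds.map(add_length_batched, batched=True)`, and `ds.map(add_position, with_indices=True)` respectively.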

property num_rows: int
remove_columns(columns: list[str] | str) StableDataset[source]

Return a new dataset without the specified columns.

rename_column(old_name: str, new_name: str) StableDataset[source]

Return a new dataset with a column renamed.

rename_columns(mapping: dict[str, str]) StableDataset[source]

Return a new dataset with columns renamed per the mapping.

select(indices) StableDataset[source]

Return a view containing only the specified row indices.

set_decode(decode: bool) StableDataset[source]

Control whether Image columns are decoded or left as raw bytes.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDataset[source]

Return a view that decodes a video column at read time.

Passing None with no keyword arguments disables video decoding on the returned view.

shuffle(seed: int = 42) StableDataset[source]

Return a shuffled view.

property table: Table

Materialize and return the full Arrow table.

For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files — prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.

train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]

Random split via index indirection. No data materialization.
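A split by index indirection can be sketched as one shuffled index list cut in two. The rounding rule for the test-set size and the helper name are assumptions; the library may compute the boundary differently.

```python
import random


def sketch_split_indices(num_rows, test_size=0.1, seed=42):
    """Hypothetical split: shuffle row indices once, then cut the list in two."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(num_rows * test_size))   # rounding rule is an assumption
    return {"train": indices[n_test:], "test": indices[:n_test]}


split = sketch_split_indices(100, test_size=0.1, seed=42)
```

Each split is only a list of row indices into the same underlying storage, so the two resulting datasets share data and are disjoint by construction.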

with_format(format_type: str | None) StableDataset[source]

Return a view with the specified output format.

Supported: None (PIL/numpy/Python), "torch", "numpy", "raw".

with_transform(fn: Callable | None) StableDataset[source]

Return a view with a transform applied after format conversion.

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDatasetDict[source]

Return a split dict where each split applies the same video decode view.

stable_datasets.formatting module

Formatters that convert Arrow-native values to user-facing types.

Middle layer of the three-layer split (StorageBackend -> Formatter -> StableDataset). Formatters consume Arrow values and emit PIL images, numpy arrays, torch tensors, or raw Python, never touching files or storage themselves.

class Formatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: object

Base formatter. Subclasses convert Arrow-native values to user-facing types.

format_batch(table) list[dict][source]

Format a batch (from backend.take). Returns list of row dicts.

Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.
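The column-first strategy can be sketched on plain lists. `sketch_format_batch` and the per-column `decoders` mapping are hypothetical; the real formatter operates on an Arrow table and feature types.

```python
def sketch_format_batch(columns, decoders):
    """Decode each column in bulk, then zip the columns into row dicts."""
    decoded = {}
    for name, values in columns.items():
        decode = decoders.get(name)
        # One pass per column; columns without a decoder pass through as-is
        decoded[name] = [decode(v) for v in values] if decode else list(values)
    names = list(decoded)
    return [dict(zip(names, row)) for row in zip(*decoded.values())]


batch = {"image": [b"\x01\x02", b"\x03"], "label": [0, 1]}
rows = sketch_format_batch(batch, {"image": len})   # toy "decoder": byte length
```

Looking up the decoder once per column, rather than once per cell, is what makes the column-first order cheaper than formatting row by row.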

format_row(row: dict) dict[source]

Format a single row dict (from backend.get_row).

format_type = 'default'
class NumpyFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Numpy format: Image -> HWC numpy array, rest as-is.

format_type = 'numpy'
class PythonFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Default format: Image -> PIL, Array3D -> numpy, scalars -> Python native.

format_type = 'default'
class RawFormatter(features: Features, decode_images: bool = False, cache_dir: Path | None = None)[source]

Bases: Formatter

Raw format: all values as-is from Arrow (bytes for images, bytes for Array3D).

format_batch(table) list[dict][source]

Format a batch (from backend.take). Returns list of row dicts.

Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.

format_row(row: dict) dict[source]

Format a single row dict (from backend.get_row).

format_type = 'raw'
class TorchFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Torch format: Image -> CHW float32 tensor, Array3D -> float32 tensor, scalars -> tensors.

format_type = 'torch'
get_formatter(format_type: str | None, features: Features, decode_images: bool = True, cache_dir: Path | None = None) Formatter[source]

Factory for formatter instances.

stable_datasets.iterable module

Iterable dataset for streaming with worker sharding and buffered shuffle.

Provides StableIterableDataset for efficient streaming in PyTorch DataLoader with multiple workers. Supports shard-level worker partitioning and reservoir-based row-level shuffle.

class StableIterableDataset(dataset, *, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]

Bases: IterableDataset

An iterable-style dataset with worker sharding and buffered shuffle.

Wraps a StableDataset for efficient streaming in PyTorch DataLoader with multiple workers. Shards are partitioned across workers so each worker reads a disjoint subset.

Parameters:
  • dataset (StableDataset) – The underlying map-style dataset (must be shard-backed).

  • shuffle (bool) – Whether to shuffle shard order and apply buffered row-level shuffle.

  • seed (int) – Base random seed.

  • buffer_size (int) – Size of the reservoir buffer for row-level shuffle.

  • transform (callable, optional) – Transform applied to each yielded row dict.

set_epoch(epoch: int)[source]

Set the epoch for varying shuffle seed across epochs.
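Worker partitioning and buffered shuffling can be sketched as follows. The strided shard assignment, the way seed and epoch are combined, and the buffer scheme (a common shuffle-buffer approach, which may differ from the library's reservoir logic) are all assumptions made for illustration.

```python
import random


def sketch_worker_shards(shards, worker_id, num_workers):
    """Give each worker a disjoint, strided subset of the shard list."""
    return shards[worker_id::num_workers]


def sketch_buffered_shuffle(rows, buffer_size, rng):
    """Yield rows in randomized order using a fixed-size buffer."""
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)          # fill phase
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]              # emit a random buffered row...
        buffer[slot] = row              # ...and replace it with the new one
    rng.shuffle(buffer)
    yield from buffer                   # drain what is left at the end


seed, epoch = 0, 3
rng = random.Random(seed + epoch)       # combining rule is an assumption
shuffled = list(sketch_buffered_shuffle(range(20), buffer_size=5, rng=rng))
parts = [sketch_worker_shards(list(range(7)), w, 3) for w in range(3)]
```

A larger buffer moves the order closer to a full shuffle at the cost of holding more rows in memory; changing the epoch changes the generator state and therefore the order.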

stable_datasets.samplers module

Backend-aware samplers for StableDataset.

PyTorch’s DataLoader constructs a RandomSampler when shuffle=True is passed. That sampler yields indices in a full-random permutation regardless of the underlying storage backend. For file-backed storage formats partitioned into shards (Arrow) or fragments (Lance), full-random access destroys any per-shard I/O locality the format was designed to exploit.

This module exposes samplers that yield indices in shard-aware orderings, preserving the classical PyTorch API (DataLoader(ds, sampler=...)) while providing a sampler that matches the backend’s access-pattern preferences:

from stable_datasets.samplers import ShardShuffleSampler

ds = CIFAR10(split="train", storage_format="lance")
sampler = ShardShuffleSampler(ds, seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler,
                    num_workers=8, persistent_workers=True,
                    multiprocessing_context="spawn")

DataLoader(ds, shuffle=True) continues to work unchanged for users who need bit-exact full-random ordering (e.g. classification reproduction). Samplers here are strictly opt-in.

See also

torch.utils.data.Sampler: base class.

lance.sampler.ShardedFragmentSampler: Lance’s own fragment sampler for its native lance.torch.data.LanceDataset integration. ShardShuffleSampler is the nearest equivalent exposed through the StableDataset backend protocol.

class ShardShuffleSampler(dataset, *, seed: int = 0, within_shard: Literal['random', 'sequential'] = 'random')[source]

Bases: Sampler[int]

Yield indices in shard-shuffled order.

The shard (or Lance fragment) order is randomized each epoch. Within each shard, indices are yielded in an order controlled by within_shard:

  • "random" (default): indices inside a shard are themselves permuted. Shuffle quality is closer to full-random while still preserving per-shard I/O locality (all samples from shard k are emitted before any sample from shard k+1). Recommended for scientific training where shuffle quality matters.

  • "sequential": indices inside a shard are yielded in on-disk order. Maximally I/O-friendly but shuffle quality is coarse at the shard level. Matches the behaviour of lance.sampler.ShardedFragmentSampler.

Parameters:
  • dataset (StableDataset) – Must expose a StorageBackend-compatible ._backend with num_shards and a way to iterate per-shard row ranges. Non-file-backed datasets fall back to a single shard covering the full dataset.

  • seed (int, default 0) – Base seed; the epoch is XOR’d in via set_epoch().

  • within_shard ({"random", "sequential"}, default "random") – Within-shard row ordering.

Notes

Epoch handling: call set_epoch() before each epoch when using DistributedSampler or any other stateful epoch pattern, so the random permutation differs between epochs. Mirrors PyTorch’s own convention.

Fork-safety: the sampler holds only integers and a seed; it pickles trivially and is safe to use with num_workers>0 and multiprocessing_context="spawn".
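The ordering described above can be sketched in pure Python from a list of per-shard row counts. `sketch_shard_shuffle` is a hypothetical function, not the sampler's source; it follows the stated behaviour (shard order shuffled, shards emitted contiguously, epoch XOR'd into the seed), but the choice of random generator is an assumption.

```python
import random


def sketch_shard_shuffle(shard_row_counts, *, seed=0, epoch=0,
                         within_shard="random"):
    """Return row indices with shard order shuffled, shards kept contiguous."""
    rng = random.Random(seed ^ epoch)          # epoch XOR'd into the base seed
    # Global row ranges per shard: shard k covers [starts[k], starts[k] + count)
    starts, total = [], 0
    for count in shard_row_counts:
        starts.append(total)
        total += count
    order = list(range(len(shard_row_counts)))
    rng.shuffle(order)
    indices = []
    for k in order:
        rows = list(range(starts[k], starts[k] + shard_row_counts[k]))
        if within_shard == "random":
            rng.shuffle(rows)
        indices.extend(rows)
    return indices


epoch0 = sketch_shard_shuffle([3, 3, 2], seed=42, epoch=0)
sequential = sketch_shard_shuffle([3, 3, 2], seed=42, within_shard="sequential")
```

Every index appears exactly once, and all rows of one shard are emitted before the next shard begins, so a reader touches one shard at a time.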

set_epoch(epoch: int) None[source]

stable_datasets.schema module

Feature and metadata schema definitions.

Each feature type maps itself to a PyArrow type for Arrow IPC serialization.

class Array3D(shape: tuple, dtype: str = 'uint8')[source]

Bases: FeatureType

Fixed-shape 3D array stored as flat bytes.

encode(value, *, cache_dir: Path | None = None) bytes | None[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]

Bases: object

Base config for multi-variant datasets.

description: str = ''
name: str = 'default'
version: Version | None = None
class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]

Bases: FeatureType

Categorical label with name-to-int mapping.

encode(value, *, cache_dir: Path | None = None)[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
int2str(idx: int) str[source]
str2int(name: str) int[source]
to_arrow_type() DataType[source]
class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]

Bases: object

Metadata container for a dataset (description, features, citation, etc.).

citation: str = ''
config_name: str = ''
description: str = ''
features: Features
homepage: str = ''
license: str = ''
supervised_keys: tuple | None = None
class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]

Bases: Mapping[str, object]

Typed source and download metadata for one dataset builder.

assets: dict[str, DownloadInfo | str]
checksums: dict[str, str] | None = None
citation: str
get(k[, d])[source]

Return D[k] if k in D, else d. d defaults to None.
homepage: URL | str
license: str = ''
class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]

Bases: object

Download source metadata for one raw asset.

url is attempted first. Any fallbacks are tried in order if the primary URL fails.
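The try-in-order behaviour can be sketched with an injected fetch function. Both `sketch_fetch_with_fallbacks` and `fake_fetch` are hypothetical, the example URLs are placeholders, and treating OSError as the failure signal is an assumption.

```python
def sketch_fetch_with_fallbacks(url, fallbacks, fetch):
    """Try the primary URL, then each fallback in order; `fetch` is hypothetical."""
    errors = []
    for candidate in [url, *fallbacks]:
        try:
            return candidate, fetch(candidate)
        except OSError as exc:
            errors.append(f"{candidate}: {exc}")
    raise OSError("all sources failed: " + "; ".join(errors))


def fake_fetch(candidate):
    """Stand-in downloader for the example: only the mirror responds."""
    if "mirror" not in candidate:
        raise OSError("unreachable")
    return b"payload"


used, payload = sketch_fetch_with_fallbacks(
    "https://example.org/data.zip",
    ["https://mirror.example.org/data.zip"],
    fake_fetch,
)
```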

all_urls() list[str][source]
checksum: str | None = None
fallbacks: list[str]
filename: str | None = None
url: str
class FeatureType[source]

Bases: object

Base class for feature type descriptors.

arrow_metadata() dict[bytes, bytes][source]
encode(value, *, cache_dir: Path | None = None)[source]
fingerprint_data() str[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class Features[source]

Bases: OrderedDict

Ordered mapping of field_name -> FeatureType.

Generates a PyArrow schema via .to_arrow_schema().

fingerprint_data() str[source]
to_arrow_schema() schema[source]
class Image(encode_format: str = 'PNG')[source]

Bases: FeatureType

Image feature stored as raw bytes in Arrow.

encode(value, *, cache_dir: Path | None = None) bytes | None[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type()[source]
class Sequence(feature: FeatureType)[source]

Bases: FeatureType

Variable-length list of a sub-feature.

encode(value, *, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class Value(dtype: str)[source]

Bases: FeatureType

Scalar value type. Maps dtype strings to PyArrow types.

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class Version(version_str: str)[source]

Bases: object

Semantic version string (major.minor.patch).

class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]

Bases: FeatureType

Video feature with validated path, bytes, or specialized frame storage.

arrow_metadata() dict[bytes, bytes][source]
encode(value, *, cache_dir: Path | None = None)[source]
fingerprint_data() str[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]

Bases: object

Read-time video decode configuration.

This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.

column: str = 'video'
crop: Literal['none', 'center', 'random'] = 'none'
decode_fn: VideoDecodeFn | None = None
decode_fn_batched: VideoDecodeFnBatched | None = None
decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec'
dtype: Literal['float32', 'uint8'] = 'float32'
frame_stride: int = 1
layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW'
num_frames: int
output: Literal['torch', 'numpy'] = 'torch'
pad: Literal['error', 'repeat_last', 'loop'] = 'error'
resize: int | tuple[int, int] | None = None
sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform'
scale: Literal['zero_one', 'none'] = 'zero_one'
seed: int | None = None
class VideoDecodeFn(*args, **kwargs)[source]

Bases: Protocol

Per-sample video decode callback.

class VideoDecodeFnBatched(*args, **kwargs)[source]

Bases: Protocol

Batched video decode callback used by StableDataset.__getitems__.

class VideoRef(cell: Mapping[str, Any], cache_dir: Path | None = None)[source]

Bases: object

Lazy reference to a cached video asset.

property bytes: bytes
cache_dir: Path | None = None
cell: Mapping[str, Any]
property checksum: str | None
property extension: str
property media_type: str
property mode: str
property path: Path | None
property size: int
collect_dataset_citations(sources: Iterable[DatasetSource | Mapping[str, object]]) list[str][source]

Collect unique citation strings in stable first-seen order.

stable_datasets.splits module

Split name constants and split generator.

class Split[source]

Bases: object

TEST = 'test'
TRAIN = 'train'
VALIDATION = 'validation'
class SplitGenerator(name: str, gen_kwargs: dict = <factory>)[source]

Bases: object

Describes one split and the kwargs to pass to _generate_examples.

gen_kwargs: dict
name: str

stable_datasets.utils module

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []
DEFAULT_CONFIG_NAME: str | None = None
SOURCE: DatasetSource | Mapping
STORAGE_FORMAT: str = 'arrow'
VERSION: Version
__init__(config_name: str | None = None, **kwargs)[source]

Initialize builder, selecting a BuilderConfig if applicable.

property info
bulk_download(urls: Iterable[str | DownloadInfo], dest_folder: str | Path, checksums: dict[str, str] | None = None) list[Path][source]

Download multiple files concurrently and return their local paths.

Parameters:
  • urls – Iterable of URL strings or DownloadInfo specs to download.

  • dest_folder – Destination folder for downloads.

  • checksums – Optional dict mapping primary URL -> checksum string.

Returns:

Local file paths in the same order as the input URLs.

Return type:

list[Path]

download(url: str | DownloadInfo, dest_folder: str | Path | None = None, progress_bar: bool = True, _progress_dict=None, _task_id=None, checksum: str | None = None, fallbacks: list[str] | None = None, filename: str | None = None, cache_key_url: str | None = None) Path[source]

Download a file to dest_folder, returning the local path.

Supports resumable downloads: if a .tmp file exists from a previous interrupted attempt, an HTTP Range header is sent. The server may respond with 206 (append) or 200 (start over).
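The resume decision can be sketched as two small helpers: one builds the Range header from the size of the partial file, the other picks the file mode from the response status. The helper names are hypothetical and no network request is made here.

```python
import tempfile
from pathlib import Path


def sketch_resume_headers(tmp_path: Path) -> dict:
    """Ask for the remaining bytes when a partial .tmp file already exists."""
    offset = tmp_path.stat().st_size if tmp_path.exists() else 0
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def sketch_write_mode(status_code: int) -> str:
    """206 Partial Content: append to the partial file; 200 OK: overwrite it."""
    if status_code == 206:
        return "ab"
    if status_code == 200:
        return "wb"
    raise ValueError(f"unexpected HTTP status {status_code}")


with tempfile.TemporaryDirectory() as tmp:
    partial = Path(tmp) / "data.zip.tmp"
    fresh_headers = sketch_resume_headers(partial)      # no partial file yet
    partial.write_bytes(b"x" * 1024)
    resume_headers = sketch_resume_headers(partial)
```

A server that ignores the Range header answers 200 with the whole body, which is why the partial file must be overwritten rather than appended to in that case.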

checksum, when provided, is an "algorithm:hex" string (e.g. "sha256:a3f8..."). The downloaded file is verified after completion and deleted on mismatch.
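Verification of an "algorithm:hex" string can be sketched with hashlib. `sketch_verify_checksum` is a hypothetical helper; the chunk size and the boolean return value are choices made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_verify_checksum(path: Path, checksum: str) -> bool:
    """Check `path` against an "algorithm:hex" string; delete it on mismatch."""
    algorithm, _, expected = checksum.partition(":")
    hasher = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):   # 1 MiB at a time
            hasher.update(chunk)
    if hasher.hexdigest() != expected.lower():
        path.unlink()
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "asset.bin"
    target.write_bytes(b"hello")
    good = "sha256:" + hashlib.sha256(b"hello").hexdigest()
    ok = sketch_verify_checksum(target, good)
    bad = sketch_verify_checksum(target, "sha256:" + "0" * 64)
    still_there = target.exists()
```

Reading in fixed-size chunks keeps memory use constant for large downloads, and deleting on mismatch prevents a corrupt file from being reused by a later run.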

load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]

Load data from a .ts file into a Pandas DataFrame.

Credit to https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379

Parameters:
  • full_file_path_and_name (str) – The full pathname of the .ts file to read.

  • return_separate_X_and_y (bool) – True if X and y values should be returned as a separate DataFrame (X) and a numpy array (y), false otherwise. This is only relevant for data that

  • replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.

Returns:

  • tuple of (DataFrame, ndarray) – If return_separate_X_and_y is true, a tuple containing a DataFrame and a numpy array holding the relevant time-series and corresponding class values.

  • DataFrame – If return_separate_X_and_y is false, a single DataFrame containing all time-series and (if relevant) a column “class_vals” holding the associated class values.

Module contents

class Array3D(shape: tuple, dtype: str = 'uint8')[source]

Bases: FeatureType

Fixed-shape 3D array stored as flat bytes.

encode(value, *, cache_dir: Path | None = None) bytes | None[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []
DEFAULT_CONFIG_NAME: str | None = None
SOURCE: DatasetSource | Mapping
STORAGE_FORMAT: str = 'arrow'
VERSION: Version
__init__(config_name: str | None = None, **kwargs)[source]

Initialize builder, selecting a BuilderConfig if applicable.

property info
class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]

Bases: object

Base config for multi-variant datasets.

description: str = ''
name: str = 'default'
version: Version | None = None
class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]

Bases: FeatureType

Categorical label with name-to-int mapping.

encode(value, *, cache_dir: Path | None = None)[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
int2str(idx: int) str[source]
str2int(name: str) int[source]
to_arrow_type() DataType[source]
class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]

Bases: object

Metadata container for a dataset (description, features, citation, etc.).

citation: str = ''
config_name: str = ''
description: str = ''
features: Features
homepage: str = ''
license: str = ''
supervised_keys: tuple | None = None
class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]

Bases: Mapping[str, object]

Typed source and download metadata for one dataset builder.

assets: dict[str, DownloadInfo | str]
checksums: dict[str, str] | None = None
citation: str
get(k[, d])[source]

Return D[k] if k in D, else d. d defaults to None.
homepage: URL | str
license: str = ''
class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]

Bases: object

Download source metadata for one raw asset.

url is attempted first. Any fallbacks are tried in order if the primary URL fails.

all_urls() list[str][source]
checksum: str | None = None
fallbacks: list[str]
filename: str | None = None
url: str
class Features[source]

Bases: OrderedDict

Ordered mapping of field_name -> FeatureType.

Generates a PyArrow schema via .to_arrow_schema().

fingerprint_data() str[source]
to_arrow_schema() schema[source]
class Image(encode_format: str = 'PNG')[source]

Bases: FeatureType

Image feature stored as raw bytes in Arrow.

encode(value, *, cache_dir: Path | None = None) bytes | None[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type()[source]
class Sequence(feature: FeatureType)[source]

Bases: FeatureType

Variable-length list of a sub-feature.

encode(value, *, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]

Bases: object

A single-split dataset backed by Arrow.

Users interact with rows, columns, and transforms — never with files or shards. All storage details are delegated to ArrowBackend.

Construction:

  1. File-backed (typical) — pass backend=ArrowBackend(shard_paths=...).

  2. In-memory — pass backend=ArrowBackend(table=table).

  3. Indexed view — pass _indices=array to create a virtual view sharing the same backend. Zero data copying.
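
Option 3 (the indexed view) can be illustrated with plain Python lists. This is a minimal sketch of index indirection in general, not the library's code: `IndexedView` is a hypothetical class, and a list of dicts stands in for the Arrow backend.

```python
class IndexedView:
    """Hypothetical view: shares `rows` with its parent and stores only indices."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared backing storage, never copied
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Translate the view-local position into a position in the backing storage.
        return self._rows[self._indices[i]]

    def select(self, indices):
        # Compose index lists so nested views still point at the original rows.
        return IndexedView(self._rows, [self._indices[i] for i in indices])


rows = [{"x": i} for i in range(10)]
evens = IndexedView(rows).select([0, 2, 4, 6, 8])
last_two = evens.select([3, 4])  # rows 6 and 8 of the original data
```

Only the index list grows with each derived view; the row data itself is referenced, not duplicated.
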

add_column(name: str, column) StableDataset[source]

Return a new dataset with an additional column.

column can be a pa.Array, a Python list, or a numpy array.

as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]

Return a StableIterableDataset wrapping this dataset.

property column_names: list[str]
property features: Features
filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) StableDataset[source]

Return a view containing rows where fn returns True.

Non-batched (default): fn(row_dict) -> bool, applied per row.

Batched: fn(dict_of_lists) -> list[bool], applied per batch using a sequential scan for better performance on large datasets.

Returns an indexed view — no data is materialized.
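
The two predicate shapes can be compared side by side in a small stand-in. `filter_indices` is a hypothetical helper that works on a list of dicts and returns the kept row indices, which is the kind of information an indexed view would need; it is a sketch of the calling conventions, not the library's implementation.

```python
def filter_indices(rows, fn, *, batched=False, batch_size=1000):
    """Return indices of rows kept by `fn` (hypothetical helper)."""
    if not batched:
        # Per-row form: fn(row_dict) -> bool
        return [i for i, row in enumerate(rows) if fn(row)]

    kept = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched form: fn(dict_of_lists) -> list[bool]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        mask = fn(batch)
        kept.extend(start + j for j, keep in enumerate(mask) if keep)
    return kept


rows = [{"label": i % 3} for i in range(7)]
per_row = filter_indices(rows, lambda row: row["label"] == 0)
per_batch = filter_indices(
    rows,
    lambda batch: [value == 0 for value in batch["label"]],
    batched=True,
    batch_size=4,
)
```

Both calls select the same rows; the batched predicate simply receives columns instead of single rows.
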

flatten_indices(cache_dir: Path | None = None) StableDataset[source]

Materialize an indexed view into a new contiguous Arrow file.

property info: DatasetInfo
iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]

Iterate with optional shard-level shuffling.

make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]

Return a backend-aware torch.utils.data.Sampler for this dataset.

Convenience wrapper around the classes in stable_datasets.samplers. Use as:

sampler = ds.make_sampler("shard_shuffle", seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)

DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend’s I/O layout.

Parameters:
  • kind (str, default "shard_shuffle") – Sampler type. "shard_shuffle" is currently the only supported value.

  • **kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).
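
To give a feel for what a shard-level shuffle means, here is a minimal sketch under stated assumptions. It is not the code in stable_datasets.samplers: `shard_shuffle_order` is a hypothetical function, and treating `within_shard` as a boolean that also shuffles rows inside each shard is an assumption about that keyword.

```python
import random


def shard_shuffle_order(shard_row_counts, *, seed=0, within_shard=False):
    """Yield global row indices shard by shard, visiting shards in shuffled order."""
    rng = random.Random(seed)

    # First global row index of each shard.
    offsets, total = [], 0
    for count in shard_row_counts:
        offsets.append(total)
        total += count

    shard_ids = list(range(len(shard_row_counts)))
    rng.shuffle(shard_ids)
    for shard in shard_ids:
        local = list(range(shard_row_counts[shard]))
        if within_shard:
            # Assumed meaning of `within_shard`: also randomize row order inside the shard.
            rng.shuffle(local)
        for j in local:
            yield offsets[shard] + j


order = list(shard_shuffle_order([3, 2, 4], seed=42, within_shard=True))
```

Because every row of a shard is emitted before moving to the next shard, reads stay local to one file at a time while the epoch order still varies with the seed.
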

map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) StableDataset[source]

Apply a function to every row/batch and return a new dataset.

This is a materializing operation — output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.

Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True).

Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices) if with_indices=True).

Parameters:
  • features (Features, optional) – Output schema. If None, columns matching input features keep their types; new columns are inferred from Arrow types. Provide explicitly when the output schema is ambiguous.

  • cache_dir (path, optional) – Where to write the output cache. If None, uses a temp directory.
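
The calling conventions for `fn` can be tried out with an in-memory stand-in. `apply_map` below is a hypothetical helper operating on a list of dicts; unlike the real method it holds all output in memory and writes nothing to Arrow files, so it only illustrates the function signatures.

```python
def apply_map(rows, fn, *, batched=False, batch_size=1000, with_indices=False):
    """Hypothetical in-memory illustration of the two calling conventions."""
    out = []
    if not batched:
        for idx, row in enumerate(rows):
            out.append(fn(row, idx) if with_indices else fn(row))
        return out

    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        indices = list(range(start, start + len(chunk)))
        result = fn(batch, indices) if with_indices else fn(batch)
        # Convert the returned dict of lists back into row dicts.
        keys = list(result)
        out.extend(dict(zip(keys, values)) for values in zip(*(result[k] for k in keys)))
    return out


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
doubled = apply_map(rows, lambda row: {"x": row["x"] * 2})
tagged = apply_map(
    rows,
    lambda batch, idxs: {"x": batch["x"], "idx": idxs},
    batched=True,
    batch_size=2,
    with_indices=True,
)
```
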

property num_rows: int
remove_columns(columns: list[str] | str) StableDataset[source]

Return a new dataset without the specified columns.

rename_column(old_name: str, new_name: str) StableDataset[source]

Return a new dataset with a column renamed.

rename_columns(mapping: dict[str, str]) StableDataset[source]

Return a new dataset with columns renamed per the mapping.

select(indices) StableDataset[source]

Return a view containing only the specified row indices.

set_decode(decode: bool) StableDataset[source]

Control whether Image columns are decoded or left as raw bytes.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDataset[source]

Return a view that decodes a video column at read time.

Passing None with no keyword arguments disables video decoding on the returned view.

shuffle(seed: int = 42) StableDataset[source]

Return a shuffled view.

property table: Table

Materialize and return the full Arrow table.

For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files — prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.

train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]

Random split via index indirection. No data materialization.
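
A split by index indirection amounts to shuffling row indices once and cutting the list in two. The sketch below shows the idea with the standard library; `split_indices` is a hypothetical helper, and rounding `num_rows * test_size` to get the test size is an assumption, not documented behaviour.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Return (train_indices, test_indices) as disjoint index lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test split takes round(num_rows * test_size) rows.
    n_test = round(num_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100, test_size=0.1, seed=42)
```

Each index list could then back its own view over the same storage, so no rows are copied.
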

with_format(format_type: str | None) StableDataset[source]

Return a view with the specified output format.

Supported: None (PIL/numpy/Python), "torch", "numpy", "raw".

with_transform(fn: Callable | None) StableDataset[source]

Return a view with a transform applied after format conversion.

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) StableDatasetDict[source]

Return a split dict where each split applies the same video decode view.

class Value(dtype: str)[source]

Bases: FeatureType

Scalar value type. Maps dtype strings to PyArrow types.

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class Version(version_str: str)[source]

Bases: object

Semantic version string (major.minor.patch).
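
One reason to parse a major.minor.patch string rather than compare it as text is ordering: "1.10.0" sorts before "1.9.3" lexicographically. The helper below is a hypothetical sketch of such parsing and says nothing about how this class stores or compares versions.

```python
def parse_version(version_str: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into integers (hypothetical helper)."""
    parts = version_str.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.patch, got {version_str!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


# Integer tuples order correctly where raw strings do not.
assert parse_version("1.10.0") > parse_version("1.9.3")
assert "1.10.0" < "1.9.3"
```
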

class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]

Bases: FeatureType

Video feature with validated path, bytes, or specialized frame storage.

arrow_metadata() dict[bytes, bytes][source]
encode(value, *, cache_dir: Path | None = None)[source]
fingerprint_data() str[source]
format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]
to_arrow_type() DataType[source]
class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]

Bases: object

Read-time video decode configuration.

This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.

column: str = 'video'
crop: Literal['none', 'center', 'random'] = 'none'
decode_fn: VideoDecodeFn | None = None
decode_fn_batched: VideoDecodeFnBatched | None = None
decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec'
dtype: Literal['float32', 'uint8'] = 'float32'
frame_stride: int = 1
layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW'
num_frames: int
output: Literal['torch', 'numpy'] = 'torch'
pad: Literal['error', 'repeat_last', 'loop'] = 'error'
resize: int | tuple[int, int] | None = None
sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform'
scale: Literal['zero_one', 'none'] = 'zero_one'
seed: int | None = None
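
The `layout` strings name an axis order. Assuming the usual reading (T = frames, C = channels, H = height, W = width), the permutation between any two layouts can be derived from the strings themselves. The helpers below are hypothetical and work on shapes only; they are a sketch of the axis bookkeeping, not the library's decode path.

```python
def layout_permutation(src: str, dst: str) -> list[int]:
    """Axis permutation that reorders a `src`-layout array into `dst` layout."""
    if sorted(src) != sorted(dst):
        raise ValueError(f"layouts {src!r} and {dst!r} name different axes")
    return [src.index(axis) for axis in dst]


def permute_shape(shape, perm):
    """Apply an axis permutation to a shape tuple."""
    return tuple(shape[i] for i in perm)


# 8 frames of 224x224 RGB, stored frame-major with channels last (THWC).
thwc_shape = (8, 224, 224, 3)
perm = layout_permutation("THWC", "TCHW")
tchw_shape = permute_shape(thwc_shape, perm)
```

The same permutation list is what one would pass to an array library's transpose or permute call to move actual pixel data.
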
class VideoDecodeFn(*args, **kwargs)[source]

Bases: Protocol

Per-sample video decode callback.

class VideoDecodeFnBatched(*args, **kwargs)[source]

Bases: Protocol

Batched video decode callback used by StableDataset.__getitems__.

class VideoRef(cell: Mapping[str, Any], cache_dir: Path | None = None)[source]

Bases: object

Lazy reference to a cached video asset.

property bytes: bytes
cache_dir: Path | None = None
cell: Mapping[str, Any]
property checksum: str | None
property extension: str
property media_type: str
property mode: str
property path: Path | None
property size: int