stable_datasets package

Submodules

stable_datasets.cache module

Generator-to-Arrow sharded caching pipeline.

Writes dataset examples to a directory of PyArrow IPC (Feather v2) shard files. Peak memory during writes is bounded to roughly one batch, and the sharded layout supports efficient sequential reads for training workloads.

class CacheOpenResult(*, backend, num_rows: int, layout: str, metadata)[source]

Bases: object

Result of opening cache metadata into a backend.

backend

layout

metadata

num_rows

class LanceCacheMeta(cache_dir: Path, num_rows: int, schema_fingerprint: str)[source]

Bases: object

Lightweight descriptor for a Lance-format cache on disk.

cache_dir

num_rows

schema_fingerprint

class ShardedCacheMeta(cache_dir: Path, num_rows: int, num_shards: int, shard_filenames: list[str], shard_row_counts: list[int], schema_fingerprint: str, compression: str | None = None)[source]

Bases: object

Lightweight descriptor for a sharded Arrow cache on disk.

cache_dir

compression

num_rows

num_shards

schema_fingerprint

shard_filenames

property shard_paths: list[Path]

shard_row_counts

cache_fingerprint(cls_name: str, version: str, config_name: str, split: str, storage_format: str = 'arrow') → str[source]

Deterministic cache directory name for a dataset variant + split.

storage_format is always included in the hash so Arrow and Lance caches for the same dataset coexist at different paths.
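A minimal sketch of how such a deterministic name could be derived. The function name, the separator, the choice of SHA-256, and the output layout are all assumptions made for illustration; the source only states that storage_format is part of the hash.

```python
import hashlib


def sketch_cache_fingerprint(cls_name, version, config_name, split, storage_format="arrow"):
    """Hypothetical stand-in: hash every identifying field into a directory name."""
    # Join with a separator that is unlikely to appear inside any field.
    key = "\x1f".join([cls_name, version, config_name, split, storage_format])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    # Readable prefix plus a short digest; the real naming scheme may differ.
    return f"{cls_name.lower()}-{split}-{digest}"


arrow_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train", "arrow")
lance_dir = sketch_cache_fingerprint("CIFAR10", "1.0.0", "default", "train", "lance")
```

Because storage_format feeds the digest, arrow_dir and lance_dir differ even though every other field is identical.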

detect_cache_format(cache_dir: Path) → str[source]: Return "arrow" or "lance" based on the cache’s metadata.

detect_cache_layout(cache_dir: Path) → str[source]: Return the physical cache layout recorded in cache metadata.

encode_example(example: dict, features: Features, *, cache_dir: Path | None = None) → dict[source]: Encode a single example dict into Arrow-compatible values.

open_cache(cache_dir: Path, features: Features, *, backend_kwargs: dict | None = None) → CacheOpenResult[source]: Open a cache directory and return the backend selected by its layout.

read_lance_cache_meta(cache_dir: Path) → LanceCacheMeta[source]

Read metadata from a Lance-format cache directory.

Returns a LanceCacheMeta with the cached row count and schema fingerprint populated from _metadata.json. Deliberately does NOT open the underlying Lance dataset: doing so would initialize Lance’s tokio runtime in the caller’s process, which is unsafe if that process later forks DataLoader workers. The row count comes from the metadata file, not from ds.count_rows().

read_shard(shard_path: Path) → Table[source]: Memory-map a single shard file and return its table.

read_sharded_cache_meta(cache_dir: Path) → ShardedCacheMeta[source]

Read metadata from a sharded cache directory.

Validates that all shard files and metadata exist and are internally consistent. Raises ValueError on corruption.
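The kind of consistency check described above might look like the following sketch. The JSON key names are assumed to mirror the ShardedCacheMeta fields; the actual on-disk keys are not shown in the source, and check_sharded_meta is a hypothetical helper.

```python
import json
from pathlib import Path


def check_sharded_meta(cache_dir):
    """Validate _metadata.json against the shard files; raise ValueError on mismatch."""
    cache_dir = Path(cache_dir)
    meta_path = cache_dir / "_metadata.json"
    if not meta_path.is_file():
        raise ValueError(f"missing metadata file: {meta_path}")
    meta = json.loads(meta_path.read_text())
    names = meta["shard_filenames"]      # assumed key name
    counts = meta["shard_row_counts"]    # assumed key name
    if len(names) != len(counts):
        raise ValueError("shard list and row counts differ in length")
    if sum(counts) != meta["num_rows"]:
        raise ValueError("per-shard row counts do not sum to num_rows")
    missing = [n for n in names if not (cache_dir / n).is_file()]
    if missing:
        raise ValueError(f"missing shard files: {missing}")
    return meta
```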

validate_sharded_cache(cache_dir: Path, features: Features) → ShardedCacheMeta[source]

Read and validate a sharded cache, checking the schema fingerprint.

Raises ValueError if the cache is inconsistent or the schema has changed.

write_lance_cache(generator, features: Features, cache_dir: Path, *, batch_size: int = 1000, num_encode_workers: int = 0, lineage: dict | None = None) → LanceCacheMeta[source]

Consume a generator and write directly to a Lance dataset.

Mirrors write_sharded_arrow_cache() in shape (same generator contract, same features, same encode pipeline, same atomic-publish semantics) but writes a Lance dataset via lance.write_dataset instead of Arrow IPC shards. No intermediate Arrow IPC file is produced – the encoded rows stream into Lance via a pa.RecordBatchReader, so the native Lance write path is used end-to-end.

Writing is atomic: Lance writes to a temporary directory next to cache_dir and the directory is renamed on success. The completed cache directory contains:

Lance dataset files (_versions/, data/, manifest)
_metadata.json – row count, format marker, schema fingerprint

Parameters:

batch_size (int) – Rows per pa.RecordBatch flushed to the Lance writer. Larger batches reduce per-call overhead; smaller batches reduce peak memory during writing.
num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool (same contract as the Arrow writer).
lineage (dict, optional) – Optional metadata blob written into _metadata.json.

write_lance_video_frames_cache(generator, features: Features, cache_dir: Path, *, video_column: str | None = None, quality: int = 65, resize: int | None = None, workers: int | None = None, skip_corrupt: bool = True, lineage: dict | None = None) → LanceCacheMeta[source]

Write a specialized Lance row-per-frame video cache.

Each input example contributes one source video. The physical Lance dataset stores one WebP-encoded frame per row. Segment sampling is a read-time concern handled by LanceVideoFramesBackend.

write_sharded_arrow_cache(generator, features: Features, cache_dir: Path, *, shard_size_bytes: int = inf, batch_size: int = 1000, compression: str | None = None, num_encode_workers: int = 0, single_file: bool = False, lineage: dict | None = None) → ShardedCacheMeta[source]

Consume a generator and write to a directory of Arrow IPC shards.

Batches are flushed every batch_size rows. After each flush the cumulative RecordBatch.nbytes for the current shard is checked; when it exceeds shard_size_bytes the shard is closed. The next shard is opened lazily when the next batch is ready, so there are never trailing empty shards.

Note

shard_size_bytes is an approximate target based on Arrow in-memory batch sizes, not exact on-disk file sizes. Actual shard files may be somewhat larger or smaller due to IPC framing, batch granularity, and compression differences.

An empty generator produces zero shards (num_shards == 0).
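The flush-then-check rollover described above can be sketched in pure Python. Here plan_shards is a hypothetical helper that takes per-row byte sizes in place of real RecordBatch.nbytes values and returns the row count of each resulting shard; it illustrates the control flow only, not the writer itself.

```python
def plan_shards(row_nbytes, batch_size, shard_size_bytes):
    """Return the row count of each shard for a stream of per-row byte sizes."""
    shards = []        # row counts of closed shards
    shard_rows = 0     # rows in the currently open shard
    shard_bytes = 0    # cumulative batch bytes in the open shard
    batch = []

    def flush():
        nonlocal shard_rows, shard_bytes
        # A shard is "opened" lazily: counters only become non-zero on a flush.
        shard_rows += len(batch)
        shard_bytes += sum(batch)
        batch.clear()
        # The size check runs after the flush, so a shard may overshoot the target.
        if shard_bytes > shard_size_bytes:
            shards.append(shard_rows)
            shard_rows = 0
            shard_bytes = 0

    for nbytes in row_nbytes:
        batch.append(nbytes)
        if len(batch) == batch_size:
            flush()
    if batch:          # final partial batch
        flush()
    if shard_rows:     # close the last open shard, if any
        shards.append(shard_rows)
    return shards
```

With ten 10-byte rows, batch_size=2 and shard_size_bytes=35, each shard closes after its second batch (40 bytes), which shows why the target is approximate.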

The completed cache directory contains:

shard-NNNNN.arrow — zero or more IPC files
_metadata.json — row counts, shard list, format version, schema fingerprint

Writing is atomic: shards are first written to a temporary directory and renamed into place on success.
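A common way to get this write-then-rename behavior is sketched below. publish_atomically and write_fn are hypothetical names, and the sketch assumes cache_dir does not exist yet; it is not the library's implementation.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_atomically(cache_dir, write_fn):
    """Run write_fn(tmp_dir), then rename tmp_dir to cache_dir on success."""
    cache_dir = Path(cache_dir)
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    # Same parent directory, so the rename stays on one filesystem.
    tmp_dir = Path(tempfile.mkdtemp(prefix=cache_dir.name + ".tmp-", dir=cache_dir.parent))
    try:
        write_fn(tmp_dir)
        os.rename(tmp_dir, cache_dir)  # readers never observe a partial cache
    except BaseException:
        # Clean up the partial output and re-raise the original error.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return cache_dir
```

If write_fn raises, the temporary directory is removed and cache_dir is never created, so a later run starts from a clean state.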

Parameters:

compression (str or None) – IPC buffer compression codec (e.g. "zstd", "lz4"). Decompression on read is automatic.
num_encode_workers (int) – When > 0, encode examples in parallel using a thread pool.

Returns:

A ShardedCacheMeta describing the cache.

stable_datasets.dataset module

Map-style dataset built on a pluggable storage backend.

Provides StableDataset (single split) and StableDatasetDict (multi-split), exposing __len__, __getitem__, __getitems__, __iter__, .features, and .train_test_split().

Architecture: three layers with strict boundaries:

StorageBackend  -> row access, iteration, pickling (returns Arrow types)
    |
Formatter       -> Arrow -> user type (PIL / torch / numpy / raw)
    |
StableDataset   -> orchestrates backend + formatter + indices + transform

StableDataset depends only on the StorageBackend protocol, never on a concrete implementation or on-disk layout.
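The layering can be illustrated with a toy version of each layer. Every class below is hypothetical and heavily reduced (plain Python lists instead of Arrow, a protocol with only __len__ and get_row); it is meant to show the boundaries, not the real interfaces.

```python
from typing import Callable, Optional, Protocol, Sequence


class Backend(Protocol):
    """Storage layer: row access only (hypothetical, reduced protocol)."""

    def __len__(self) -> int: ...
    def get_row(self, index: int) -> dict: ...


class ListBackend:
    """Toy backend holding rows in a Python list instead of Arrow files."""

    def __init__(self, rows: list):
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def get_row(self, index: int) -> dict:
        return dict(self._rows[index])


class UpperFormatter:
    """Toy formatter: converts stored strings to upper case."""

    def format_row(self, row: dict) -> dict:
        return {k: v.upper() if isinstance(v, str) else v for k, v in row.items()}


class TinyDataset:
    """Orchestrates backend + formatter + indices + transform."""

    def __init__(self, backend: Backend, formatter,
                 indices: Optional[Sequence[int]] = None,
                 transform: Optional[Callable[[dict], dict]] = None):
        self._backend = backend
        self._formatter = formatter
        self._indices = indices
        self._transform = transform

    def __len__(self) -> int:
        return len(self._backend) if self._indices is None else len(self._indices)

    def __getitem__(self, i: int) -> dict:
        # Map the logical position to a physical row, then format, then transform.
        physical = i if self._indices is None else self._indices[i]
        row = self._formatter.format_row(self._backend.get_row(physical))
        return self._transform(row) if self._transform else row


ds = TinyDataset(ListBackend([{"name": "a"}, {"name": "b"}, {"name": "c"}]),
                 UpperFormatter(), indices=[2, 0])
```

TinyDataset never inspects how ListBackend stores its rows, so swapping in another object with the same two methods requires no change to the dataset class.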

class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]

Bases: object

A single-split dataset backed by Arrow.

Users interact with rows, columns, and transforms — never with files or shards. All storage details are delegated to ArrowBackend.

Construction:

File-backed (typical) — pass backend=ArrowBackend(shard_paths=...).
In-memory — pass backend=ArrowBackend(table=table).
Indexed view — pass _indices=array to create a virtual view sharing the same backend. Zero data copying.

add_column(name: str, column) → StableDataset[source]

Return a new dataset with an additional column.

column can be a pa.Array, a Python list, or a numpy array.

as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]: Return a StableIterableDataset wrapping this dataset.

property column_names: list[str]

property features: Features

filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) → StableDataset[source]

Return a view containing rows where fn returns True.

Non-batched (default): fn(row_dict) -> bool, applied per row. Batched: fn(dict_of_lists) -> list[bool], applied per batch using sequential scan for better performance on large datasets.

Returns an indexed view — no data is materialized.
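The two callable shapes, and the idea of keeping only row positions, can be sketched as follows. filter_indices is a hypothetical helper operating on a list of row dicts; the real method works against the storage backend.

```python
def filter_indices(rows, fn, *, batched=False, batch_size=1000):
    """Return the positions of rows for which fn is truthy."""
    keep = []
    if not batched:
        for i, row in enumerate(rows):
            if fn(row):
                keep.append(i)
        return keep
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-oriented batch: one list per column name.
        batch = {name: [r[name] for r in chunk] for name in chunk[0]}
        mask = fn(batch)
        keep.extend(start + j for j, ok in enumerate(mask) if ok)
    return keep


rows = [{"label": i % 3} for i in range(7)]
per_row = filter_indices(rows, lambda r: r["label"] == 0)
per_batch = filter_indices(rows, lambda b: [v == 0 for v in b["label"]],
                           batched=True, batch_size=4)
```

Both calls select the same positions; only the list of indices is stored, which is what makes the result a view.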

flatten_indices(cache_dir: Path | None = None) → StableDataset[source]: Materialize an indexed view into a new contiguous Arrow file.

property info: DatasetInfo

iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]: Iterate with optional shard-level shuffling.

make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]

Return a backend-aware torch.utils.data.Sampler for this dataset.

Convenience wrapper around the classes in stable_datasets.samplers. Use as:

sampler = ds.make_sampler("shard_shuffle", seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)

DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend’s I/O layout.

Parameters:

kind (str, default "shard_shuffle") – Which sampler to build. "shard_shuffle" is currently the only supported kind.
**kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).

map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) → StableDataset[source]

Apply a function to every row/batch and return a new dataset.

This is a materializing operation — output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.

Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True). Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices)).

Parameters:

features (Features, optional) – Output schema. If None, columns matching input features keep their types; new columns are inferred from Arrow types. Provide explicitly when the output schema is ambiguous.
cache_dir (path, optional) – Where to write the output cache. If None, uses a temp directory.
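The callable contracts for map can be illustrated with plain functions. The column names text, length, and position are made up for this example, and the functions are applied by hand here, without calling map itself.

```python
def add_length(row):
    """Non-batched: one row dict in, one row dict out."""
    return {**row, "length": len(row["text"])}


def add_length_batched(batch):
    """Batched: dict of column lists in, dict of column lists out."""
    return {**batch, "length": [len(t) for t in batch["text"]]}


def add_position(row, idx):
    """Non-batched with indices: the row's position arrives as a second argument."""
    return {**row, "position": idx}


rows = [{"text": "ab"}, {"text": "cde"}]
per_row = [add_length(r) for r in rows]
per_batch = add_length_batched({"text": [r["text"] for r in rows]})
with_idx = [add_position(r, i) for i, r in enumerate(rows)]
```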

property num_rows: int

remove_columns(columns: list[str] | str) → StableDataset[source]: Return a new dataset without the specified columns.

rename_column(old_name: str, new_name: str) → StableDataset[source]: Return a new dataset with a column renamed.

rename_columns(mapping: dict[str, str]) → StableDataset[source]: Return a new dataset with columns renamed per the mapping.

select(indices) → StableDataset[source]: Return a view containing only the specified row indices.

set_decode(decode: bool) → StableDataset[source]: Control whether Image columns are decoded or left as raw bytes.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDataset[source]

Return a view that decodes a video column at read time.

Passing None with no keyword arguments disables video decoding on the returned view.

shuffle(seed: int = 42) → StableDataset[source]: Return a shuffled view.

property table: Table

Materialize and return the full Arrow table.

For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files — prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.

train_test_split(test_size: float = 0.1, seed: int = 42) → dict[str, StableDataset][source]: Random split via index indirection. No data materialization.
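A split by index indirection can be sketched as a seeded permutation cut in two. split_indices is a hypothetical helper, and the rounding rule for the test size is an assumption; the source does not say how fractional sizes are resolved.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Return disjoint train/test index lists from one seeded permutation."""
    order = list(range(num_rows))
    random.Random(seed).shuffle(order)
    n_test = int(round(num_rows * test_size))  # assumed rounding rule
    return {"train": order[n_test:], "test": order[:n_test]}


parts = split_indices(100, test_size=0.1, seed=42)
```

Each split is then just a list of row positions into the same underlying storage, so no rows are copied.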

with_format(format_type: str | None) → StableDataset[source]

Return a view with the specified output format.

Supported: None (PIL/numpy/Python), "torch", "numpy", "raw".

with_transform(fn: Callable | None) → StableDataset[source]: Return a view with a transform applied after format conversion.

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDatasetDict[source]: Return a split dict where each split applies the same video decode view.

stable_datasets.formatting module

Formatters that convert Arrow-native values to user-facing types.

Middle layer of the three-layer split (StorageBackend -> Formatter -> StableDataset). Formatters consume Arrow values and emit PIL images, numpy arrays, torch tensors, or raw Python, never touching files or storage themselves.

class Formatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: object

Base formatter. Subclasses convert Arrow-native values to user-facing types.

format_batch(table) → list[dict][source]

Format a batch (from backend.take). Returns list of row dicts.

Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.
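The column-first strategy can be sketched without Arrow. In this illustration format_batch_column_first is a hypothetical function, columns are plain lists, and len stands in for a real image decoder.

```python
def format_batch_column_first(columns, decoders):
    """columns: name -> list of raw values; decoders: name -> per-value function."""
    decoded = {}
    for name, values in columns.items():
        decode = decoders.get(name)
        # Decode a whole column in one pass; columns without a decoder pass through.
        decoded[name] = [decode(v) for v in values] if decode else list(values)
    names = list(decoded)
    # Zip into per-row dicts only at the very end.
    return [dict(zip(names, row)) for row in zip(*decoded.values())]


rows = format_batch_column_first(
    {"image": [b"\x00\x01", b"\x02"], "label": [3, 7]},
    {"image": len},  # stand-in "decoder": byte length instead of real image decoding
)
```

The decoder lookup happens once per column rather than once per cell, which is the point of working column-first.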

format_row(row: dict) → dict[source]: Format a single row dict (from backend.get_row).

format_type = 'default'

class NumpyFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Numpy format: Image -> HWC numpy array, rest as-is.

format_type = 'numpy'

class PythonFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Default format: Image -> PIL, Array3D -> numpy, scalars -> Python native.

format_type = 'default'

class RawFormatter(features: Features, decode_images: bool = False, cache_dir: Path | None = None)[source]

Bases: Formatter

Raw format: all values as-is from Arrow (bytes for images, bytes for Array3D).

format_batch(table) → list[dict][source]

Format a batch (from backend.take). Returns list of row dicts.

Column-first: extract columns once, decode each column in bulk, then zip into per-row dicts at the end.

format_row(row: dict) → dict[source]: Format a single row dict (from backend.get_row).

format_type = 'raw'

class TorchFormatter(features: Features, decode_images: bool = True, cache_dir: Path | None = None)[source]

Bases: Formatter

Torch format: Image -> CHW float32 tensor, Array3D -> float32 tensor, scalars -> tensors.

format_type = 'torch'

get_formatter(format_type: str | None, features: Features, decode_images: bool = True, cache_dir: Path | None = None) → Formatter[source]: Factory for formatter instances.

stable_datasets.iterable module

Iterable dataset for streaming with worker sharding and buffered shuffle.

Provides StableIterableDataset for efficient streaming in PyTorch DataLoader with multiple workers. Supports shard-level worker partitioning and reservoir-based row-level shuffle.

class StableIterableDataset(dataset, *, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]

Bases: IterableDataset

An iterable-style dataset with worker sharding and buffered shuffle.

Wraps a StableDataset for efficient streaming in PyTorch DataLoader with multiple workers. Shards are partitioned across workers so each worker reads a disjoint subset.

Parameters:

dataset (StableDataset) – The underlying map-style dataset (must be shard-backed).
shuffle (bool) – Whether to shuffle shard order and apply buffered row-level shuffle.
seed (int) – Base random seed.
buffer_size (int) – Size of the reservoir buffer for row-level shuffle.
transform (callable, optional) – Transform applied to each yielded row dict.
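Worker sharding and buffered shuffling can be sketched as two small functions. Both names are hypothetical, the round-robin partition is one plausible scheme rather than the documented one, and buffer_size is assumed to be at least 1.

```python
import random


def shards_for_worker(shard_ids, worker_id, num_workers):
    """Round-robin partition: each worker gets a disjoint subset of shards."""
    return shard_ids[worker_id::num_workers]


def buffered_shuffle(rows, buffer_size, rng):
    """Yield rows in shuffled order while holding at most buffer_size rows."""
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)
            continue
        # Buffer is full: emit a random resident row and put the new one in its slot.
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = row
    rng.shuffle(buffer)  # drain whatever is left
    yield from buffer
```

A larger buffer gives a more thorough shuffle at the cost of memory; deriving the generator's seed from the base seed and the epoch number is one way to vary the order between epochs.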

set_epoch(epoch: int)[source]: Set the epoch for varying shuffle seed across epochs.

stable_datasets.samplers module

Backend-aware samplers for StableDataset.

PyTorch’s DataLoader constructs a RandomSampler when shuffle=True is passed. That sampler yields indices in a full-random permutation regardless of the underlying storage backend. For file-backed storage formats partitioned into shards (Arrow) or fragments (Lance), full-random access destroys any per-shard I/O locality the format was designed to exploit.

This module exposes samplers that yield indices in shard-aware orderings, preserving the classical PyTorch API (DataLoader(ds, sampler=...)) while providing a sampler that matches the backend’s access-pattern preferences:

from stable_datasets.samplers import ShardShuffleSampler

ds = CIFAR10(split="train", storage_format="lance")
sampler = ShardShuffleSampler(ds, seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler,
                    num_workers=8, persistent_workers=True,
                    multiprocessing_context="spawn")

DataLoader(ds, shuffle=True) continues to work unchanged for users who need bit-exact full-random ordering (e.g. classification reproduction). Samplers here are strictly opt-in.
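The index ordering such a sampler produces can be sketched as follows. shard_shuffle_order is a hypothetical function, and the meaning given to within_shard here (shuffle rows inside each shard) is an assumption based on the parameter name.

```python
import random


def shard_shuffle_order(shard_row_counts, seed, within_shard=True):
    """Visit shards in random order, keeping each shard's rows together."""
    rng = random.Random(seed)
    # First global row index of every shard.
    starts, total = [], 0
    for n in shard_row_counts:
        starts.append(total)
        total += n
    shard_order = list(range(len(shard_row_counts)))
    rng.shuffle(shard_order)
    order = []
    for s in shard_order:
        idx = list(range(starts[s], starts[s] + shard_row_counts[s]))
        if within_shard:
            rng.shuffle(idx)  # randomize rows, but only inside this shard
        order.extend(idx)
    return order


order = shard_shuffle_order([3, 2, 4], seed=0)
```

Every index appears exactly once, yet consecutive indices stay within one shard, so reads remain local to a single file or fragment at a time.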

stable_datasets.schema module

Feature and metadata schema definitions.

Each feature type maps itself to a PyArrow type for Arrow IPC serialization.

class Array3D(shape: tuple, dtype: str = 'uint8')[source]

Bases: FeatureType

Fixed-shape 3D array stored as flat bytes.

encode(value, *, cache_dir: Path | None = None) → bytes | None[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]

Bases: object

Base config for multi-variant datasets.

description: str = ''

name: str = 'default'

version: Version | None = None

class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]

Bases: FeatureType

Categorical label with name-to-int mapping.

encode(value, *, cache_dir: Path | None = None)[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

int2str(idx: int) → str[source]

str2int(name: str) → int[source]

to_arrow_type() → DataType[source]
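The name-to-int mapping behind ClassLabel can be pictured with a small stand-in class. LabelMap is hypothetical and covers only the two lookup directions, not encoding or Arrow types.

```python
class LabelMap:
    """Toy bidirectional mapping between class names and integer ids."""

    def __init__(self, names):
        self.names = list(names)
        # Position in the list doubles as the integer id.
        self._index = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name):
        return self._index[name]

    def int2str(self, idx):
        return self.names[idx]


labels = LabelMap(["cat", "dog", "bird"])
```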

class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]

Bases: object

Metadata container for a dataset (description, features, citation, etc.).

citation: str = ''

config_name: str = ''

description: str = ''

features: Features

homepage: str = ''

license: str = ''

supervised_keys: tuple | None = None

class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]

Bases: Mapping[str, object]

Typed source and download metadata for one dataset builder.

assets: dict[str, DownloadInfo | str]

checksums: dict[str, str] | None = None

citation: str

get(k[, d])[source]: Return D[k] if k is in D, else d. d defaults to None.

homepage: URL | str

license: str = ''

class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]

Bases: object

Download source metadata for one raw asset.

url is attempted first. Any fallbacks are tried in order if the primary URL fails.

all_urls() → list[str][source]

checksum: str | None = None

fallbacks: list[str]

filename: str | None = None

url: str
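The primary-then-fallback order described for DownloadInfo can be sketched with a small dataclass. DownloadSpec, first_success, and the fetch callback are hypothetical, and treating OSError as the failure signal is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadSpec:
    """Toy counterpart of DownloadInfo with only the URL fields."""

    url: str
    fallbacks: list = field(default_factory=list)

    def all_urls(self):
        # Primary URL first, then fallbacks in declared order.
        return [self.url, *self.fallbacks]


def first_success(spec, fetch):
    """Try each URL in order and return the first successful fetch result."""
    failed = []
    for url in spec.all_urls():
        try:
            return fetch(url)
        except OSError:
            failed.append(url)
    raise OSError(f"all sources failed: {failed}")
```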

class FeatureType[source]

Bases: object

Base class for feature type descriptors.

arrow_metadata() → dict[bytes, bytes][source]

encode(value, *, cache_dir: Path | None = None)[source]

fingerprint_data() → str[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class Features[source]

Bases: OrderedDict

Ordered mapping of field_name -> FeatureType.

Generates a PyArrow schema via .to_arrow_schema().

fingerprint_data() → str[source]

to_arrow_schema() → schema[source]

class Image(encode_format: str = 'PNG')[source]

Bases: FeatureType

Image feature stored as raw bytes in Arrow.

encode(value, *, cache_dir: Path | None = None) → bytes | None[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type()[source]

class Sequence(feature: FeatureType)[source]

Bases: FeatureType

Variable-length list of a sub-feature.

encode(value, *, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class Value(dtype: str)[source]

Bases: FeatureType

Scalar value type. Maps dtype strings to PyArrow types.

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class Version(version_str: str)[source]

Bases: object

Semantic version string (major.minor.patch).

class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]

Bases: FeatureType

Video feature with validated path, bytes, or specialized frame storage.

arrow_metadata() → dict[bytes, bytes][source]

encode(value, *, cache_dir: Path | None = None)[source]

fingerprint_data() → str[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]

Bases: object

Read-time video decode configuration.

This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.

column: str = 'video'

crop: Literal['none', 'center', 'random'] = 'none'

decode_fn: VideoDecodeFn | None = None

decode_fn_batched: VideoDecodeFnBatched | None = None

decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec'

dtype: Literal['float32', 'uint8'] = 'float32'

frame_stride: int = 1

layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW'

num_frames: int

output: Literal['torch', 'numpy'] = 'torch'

pad: Literal['error', 'repeat_last', 'loop'] = 'error'

resize: int | tuple[int, int] | None = None

sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform'

scale: Literal['zero_one', 'none'] = 'zero_one'

seed: int | None = None

class VideoDecodeFn(*args, **kwargs)[source]

Bases: Protocol

Per-sample video decode callback.

class VideoDecodeFnBatched(*args, **kwargs)[source]

Bases: Protocol

Batched video decode callback used by StableDataset.__getitems__.

class VideoRef(cell: Mapping[str, Any], cache_dir: Path | None = None)[source]

Bases: object

Lazy reference to a cached video asset.

property bytes: bytes

cache_dir: Path | None = None

cell: Mapping[str, Any]

property checksum: str | None

property extension: str

property media_type: str

property mode: str

property path: Path | None

property size: int

collect_dataset_citations(sources: Iterable[DatasetSource | Mapping[str, object]]) → list[str][source]: Collect unique citation strings in stable first-seen order.
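Order-preserving deduplication of this kind is often written with dict.fromkeys. The sketch below assumes each source exposes its citation under a "citation" key; collect_unique is a hypothetical helper.

```python
def collect_unique(citations):
    """Drop repeats while keeping the order in which citations first appear."""
    # dict preserves insertion order, so fromkeys doubles as an ordered set.
    return list(dict.fromkeys(citations))


sources = [{"citation": "B"}, {"citation": "A"}, {"citation": "B"}]
unique = collect_unique(s["citation"] for s in sources)
```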

stable_datasets.splits module

Split name constants and split generator.

class Split[source]

Bases: object

TEST = 'test'

TRAIN = 'train'

VALIDATION = 'validation'

class SplitGenerator(name: str, gen_kwargs: dict = <factory>)[source]

Bases: object

Describes one split and the kwargs to pass to _generate_examples.

gen_kwargs: dict

name: str

stable_datasets.utils module

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []

DEFAULT_CONFIG_NAME: str | None = None

SOURCE: DatasetSource | Mapping

STORAGE_FORMAT: str = 'arrow'

VERSION: Version

__init__(config_name: str | None = None, **kwargs)[source]: Initialize builder, selecting a BuilderConfig if applicable.

property info

bulk_download(urls: Iterable[str | DownloadInfo], dest_folder: str | Path, checksums: dict[str, str] | None = None) → list[Path][source]

Download multiple files concurrently and return their local paths.

Parameters:

urls – Iterable of URL strings or DownloadInfo specs to download.
dest_folder – Destination folder for downloads.
checksums – Optional dict mapping primary URL -> checksum string.

Returns:

Local file paths in the same order as the input URLs.

Return type:

list[Path]

Download a file to dest_folder, returning the local path.

Supports resumable downloads: if a .tmp file exists from a previous interrupted attempt, an HTTP Range header is sent. The server may respond with 206 (append) or 200 (start over).

checksum, when provided, is an "algorithm:hex" string (e.g. "sha256:a3f8..."). The downloaded file is verified after completion and deleted on mismatch.
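The resume header and the "algorithm:hex" check can be sketched with the standard library. Both function names are hypothetical; the sketch omits the HTTP request itself.

```python
import hashlib
from pathlib import Path


def resume_headers(tmp_path):
    """Build a Range header that asks the server to continue after existing bytes."""
    tmp_path = Path(tmp_path)
    if tmp_path.exists() and tmp_path.stat().st_size > 0:
        return {"Range": f"bytes={tmp_path.stat().st_size}-"}
    return {}


def verify_checksum(path, checksum):
    """Check path against an "algorithm:hex" string; delete the file on mismatch."""
    algorithm, _, expected = checksum.partition(":")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected.lower():
        Path(path).unlink()
        raise ValueError(f"checksum mismatch for {path}")
    return Path(path)
```

A 206 response to the Range request means the new bytes can be appended to the partial file, while a 200 response means the server ignored the range and the download starts over.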

load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]

Load data from a .ts file into a Pandas DataFrame. Credit to https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379

Parameters:

full_file_path_and_name (str) – The full pathname of the .ts file to read.
return_separate_X_and_y (bool) – True if X and y values should be returned as a separate DataFrame (X) and numpy array (y), False otherwise. This is only relevant for data that
replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.

Returns:

DataFrame, ndarray – If return_separate_X_and_y (the default), a tuple containing a DataFrame and a numpy array holding the time series and the corresponding class values.
DataFrame – If not return_separate_X_and_y, a single DataFrame containing all time series and (if relevant) a column "class_vals" with the associated class values.

Module contents

class Array3D(shape: tuple, dtype: str = 'uint8')[source]

Bases: FeatureType

Fixed-shape 3D array stored as flat bytes.

encode(value, *, cache_dir: Path | None = None) → bytes | None[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, storage_format=None, backend_kwargs=None, decode_video=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []

DEFAULT_CONFIG_NAME: str | None = None

SOURCE: DatasetSource | Mapping

STORAGE_FORMAT: str = 'arrow'

VERSION: Version

__init__(config_name: str | None = None, **kwargs)[source]: Initialize builder, selecting a BuilderConfig if applicable.

property info

class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]

Bases: object

Base config for multi-variant datasets.

description: str = ''

name: str = 'default'

version: Version | None = None

class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]

Bases: FeatureType

Categorical label with name-to-int mapping.

encode(value, *, cache_dir: Path | None = None)[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

int2str(idx: int) → str[source]

str2int(name: str) → int[source]

to_arrow_type() → DataType[source]

class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]

Bases: object

Metadata container for a dataset (description, features, citation, etc.).

citation: str = ''

config_name: str = ''

description: str = ''

features: Features

homepage: str = ''

license: str = ''

supervised_keys: tuple | None = None

class DatasetSource(homepage: URL | str, assets: dict[str, DownloadInfo | str], citation: str, license: str = '', checksums: dict[str, str] | None = None)[source]

Bases: Mapping[str, object]

Typed source and download metadata for one dataset builder.

assets: dict[str, DownloadInfo | str]

checksums: dict[str, str] | None = None

citation: str

get(k[, d])[source]: Return D[k] if k is in D, else d. d defaults to None.

homepage: URL | str

license: str = ''

class DownloadInfo(url: str, fallbacks: list[str] = <factory>, checksum: str | None = None, filename: str | None = None)[source]

Bases: object

Download source metadata for one raw asset.

url is attempted first. Any fallbacks are tried in order if the primary URL fails.

all_urls() → list[str][source]

checksum: str | None = None

fallbacks: list[str]

filename: str | None = None

url: str

class Features[source]

Bases: OrderedDict

Ordered mapping of field_name -> FeatureType.

Generates a PyArrow schema via .to_arrow_schema().

fingerprint_data() → str[source]

to_arrow_schema() → schema[source]

class Image(encode_format: str = 'PNG')[source]

Bases: FeatureType

Image feature stored as raw bytes in Arrow.

encode(value, *, cache_dir: Path | None = None) → bytes | None[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type()[source]

class Sequence(feature: FeatureType)[source]

Bases: FeatureType

Variable-length list of a sub-feature.

encode(value, *, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class StableDataset(features: Features, info: DatasetInfo, *, backend: StorageBackend | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, table: Table | None = None, num_rows: int | None = None, _indices: ndarray | None = None, _format_type: str | None = None, _decode_images: bool = True, _video_decode_config: VideoDecodeConfig | None = None, _transform: Callable | None = None, _cache_dir: Path | None = None)[source]

Bases: object

A single-split dataset backed by Arrow.

Users interact with rows, columns, and transforms — never with files or shards. All storage details are delegated to ArrowBackend.

Construction:

File-backed (typical) — pass backend=ArrowBackend(shard_paths=...).
In-memory — pass backend=ArrowBackend(table=table).
Indexed view — pass _indices=array to create a virtual view sharing the same backend. Zero data copying.

add_column(name: str, column) → StableDataset[source]

Return a new dataset with an additional column.

column can be a pa.Array, a Python list, or a numpy array.

as_iterable(*, shuffle: bool = False, seed: int = 0, buffer_size: int = 10000, transform: Callable | None = None)[source]: Return a StableIterableDataset wrapping this dataset.

property column_names: list[str]

property features: Features

filter(fn: Callable, *, batched: bool = False, batch_size: int = 1000) → StableDataset[source]

Return a view containing rows where fn returns True.

Non-batched (default): fn(row_dict) -> bool, applied per row. Batched: fn(dict_of_lists) -> list[bool], applied per batch using sequential scan for better performance on large datasets.

Returns an indexed view — no data is materialized.

flatten_indices(cache_dir: Path | None = None) → StableDataset[source]: Materialize an indexed view into a new contiguous Arrow file.

property info: DatasetInfo

iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]: Iterate with optional shard-level shuffling.

make_sampler(kind: str = 'shard_shuffle', **kwargs)[source]

Return a backend-aware torch.utils.data.Sampler for this dataset.

Convenience wrapper around the classes in stable_datasets.samplers. Use as:

sampler = ds.make_sampler("shard_shuffle", seed=42)
loader = DataLoader(ds, batch_size=128, sampler=sampler, ...)

DataLoader(ds, shuffle=True) (full-random via RandomSampler) continues to work unchanged; this is strictly opt-in for users who want an iteration order matched to the backend’s I/O layout.

Parameters:

kind (str, default "shard_shuffle") – Sampler kind. "shard_shuffle" is currently the only supported value.
**kwargs – Forwarded to the underlying sampler class (e.g. seed, within_shard).
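To make the idea of an iteration order matched to the I/O layout concrete, here is a minimal sketch of one plausible shard-level shuffle. It assumes that shard order is shuffled and that within_shard additionally shuffles rows inside each shard; the actual sampler in stable_datasets.samplers may behave differently, and shard_shuffle_order is a hypothetical name.

```python
import random


def shard_shuffle_order(shard_row_counts, *, seed=0, within_shard=True):
    """Yield global row indices one shard at a time (illustrative only)."""
    rng = random.Random(seed)
    # Global row offset at which each shard starts.
    offsets, total = [], 0
    for count in shard_row_counts:
        offsets.append(total)
        total += count
    shard_ids = list(range(len(shard_row_counts)))
    rng.shuffle(shard_ids)  # randomize the order in which shards are visited
    for shard in shard_ids:
        local = list(range(shard_row_counts[shard]))
        if within_shard:
            rng.shuffle(local)  # assumed meaning of within_shard
        for row in local:
            yield offsets[shard] + row


order = list(shard_shuffle_order([3, 2, 4], seed=42))
```

Every row is visited exactly once, and all rows of one shard are emitted before moving to the next, so reads within a shard stay local.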

map(fn: Callable, *, batched: bool = False, batch_size: int = 1000, with_indices: bool = False, remove_columns: list[str] | None = None, features: Features | None = None, cache_dir: Path | str | None = None) → StableDataset[source]

Apply a function to every row/batch and return a new dataset.

This is a materializing operation — output is written incrementally to Arrow IPC files via the sharded cache pipeline, so memory usage stays bounded regardless of dataset size. Use with_transform for lazy per-row transforms during iteration.

Non-batched: fn(row_dict) -> row_dict (or fn(row_dict, idx) if with_indices=True).
Batched: fn(dict_of_lists) -> dict_of_lists (or fn(dict_of_lists, list_of_indices) if with_indices=True).

Parameters:

features (Features, optional) – Output schema. If None, columns matching input features keep their types; new columns are inferred from Arrow types. Provide explicitly when the output schema is ambiguous.
cache_dir (path, optional) – Where to write the output cache. If None, uses a temp directory.
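The function shapes accepted by map can be shown on plain Python data, without calling the library. In this sketch add_length and add_length_batched are hypothetical user functions that add a "length" column; the "text" column name is also an assumption.

```python
def add_length(row):
    # Non-batched shape: one row dict in, one row dict out.
    return {**row, "length": len(row["text"])}


def add_length_batched(batch):
    # Batched shape: dict of column lists in, dict of column lists out.
    return {**batch, "length": [len(text) for text in batch["text"]]}


rows = [{"text": "ab"}, {"text": "cde"}]
batch = {"text": [row["text"] for row in rows]}

row_out = [add_length(row) for row in rows]
batch_out = add_length_batched(batch)
print([row["length"] for row in row_out])  # [2, 3]
print(batch_out["length"])  # [2, 3]
```

Since "length" is a new column, this is the kind of case where passing features explicitly removes any ambiguity about the output schema.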

property num_rows: int

remove_columns(columns: list[str] | str) → StableDataset[source]: Return a new dataset without the specified columns.

rename_column(old_name: str, new_name: str) → StableDataset[source]: Return a new dataset with a column renamed.

rename_columns(mapping: dict[str, str]) → StableDataset[source]: Return a new dataset with columns renamed per the mapping.

select(indices) → StableDataset[source]: Return a view containing only the specified row indices.

set_decode(decode: bool) → StableDataset[source]: Control whether Image columns are decoded or left as raw bytes.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDataset[source]

Return a view that decodes a video column at read time.

Passing None with no keyword arguments disables video decoding on the returned view.

shuffle(seed: int = 42) → StableDataset[source]: Return a shuffled view.

property table: Table

Materialize and return the full Arrow table.

For single-file datasets this is a cheap mmap. For multi-file datasets this concatenates all files — prefer __getitem__ or __iter__ for row access. Use this for bulk operations like column mutations.

train_test_split(test_size: float = 0.1, seed: int = 42) → dict[str, StableDataset][source]: Random split via index indirection. No data materialization.
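A split via index indirection amounts to partitioning a shuffled list of row positions. The sketch below assumes the test count is round(num_rows * test_size); the rounding rule and the helper name split_indices are assumptions, not taken from the library.

```python
import random


def split_indices(num_rows, test_size=0.1, seed=42):
    """Partition row positions into train/test index lists (illustrative only)."""
    order = list(range(num_rows))
    random.Random(seed).shuffle(order)
    n_test = int(round(num_rows * test_size))  # assumed rounding rule
    return {"train": order[n_test:], "test": order[:n_test]}


parts = split_indices(100, test_size=0.1, seed=42)
print(len(parts["train"]), len(parts["test"]))  # 90 10
```

Each split is just an index list over the same stored rows, which is why no data needs to be materialized.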

with_format(format_type: str | None) → StableDataset[source]

Return a view with the specified output format.

Supported values: None (PIL/numpy/Python objects), "torch", "numpy", "raw".

with_transform(fn: Callable | None) → StableDataset[source]: Return a view with a transform applied after format conversion.

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.

set_video_decode(config: VideoDecodeConfig | Mapping | None = None, **kwargs) → StableDatasetDict[source]: Return a split dict where each split applies the same video decode view.

class Value(dtype: str)[source]

Bases: FeatureType

Scalar value type. Maps dtype strings to PyArrow types.

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class Version(version_str: str)[source]

Bases: object

Semantic version string (major.minor.patch).
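A major.minor.patch string is usually compared component by component, not as text. A minimal sketch, assuming purely numeric components (parse_version is a hypothetical helper, not a method of Version):

```python
def parse_version(version_str):
    """Split 'major.minor.patch' into a tuple of ints (illustrative only)."""
    parts = version_str.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected major.minor.patch, got {version_str!r}")
    return tuple(int(part) for part in parts)


print(parse_version("1.10.0") > parse_version("1.9.2"))  # True
print("1.10.0" > "1.9.2")  # False: plain string comparison gets this wrong
```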

class Video(storage: str = 'path', allowed_extensions: tuple[str, ...] = ('.mp4', '.avi', '.mov', '.webm', '.mkv'))[source]

Bases: FeatureType

Video feature with validated path, bytes, or specialized frame storage.

arrow_metadata() → dict[bytes, bytes][source]

encode(value, *, cache_dir: Path | None = None)[source]

fingerprint_data() → str[source]

format(value, *, format_type: str, decode_images: bool = True, cache_dir: Path | None = None)[source]

to_arrow_type() → DataType[source]

class VideoDecodeConfig(num_frames: int, column: str = 'video', sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform', frame_stride: int = 1, decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec', output: Literal['torch', 'numpy'] = 'torch', layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW', dtype: Literal['float32', 'uint8'] = 'float32', scale: Literal['zero_one', 'none'] = 'zero_one', resize: int | tuple[int, int] | None = None, crop: Literal['none', 'center', 'random'] = 'none', pad: Literal['error', 'repeat_last', 'loop'] = 'error', seed: int | None = None, decode_fn: VideoDecodeFn | None = None, decode_fn_batched: VideoDecodeFnBatched | None = None)[source]

Bases: object

Read-time video decode configuration.

This is retrieval policy only: it does not affect cache construction, cache fingerprints, or the persisted schema.

column: str = 'video'

crop: Literal['none', 'center', 'random'] = 'none'

decode_fn: VideoDecodeFn | None = None

decode_fn_batched: VideoDecodeFnBatched | None = None

decoder: Literal['torchcodec', 'decord', 'cv2'] = 'torchcodec'

dtype: Literal['float32', 'uint8'] = 'float32'

frame_stride: int = 1

layout: Literal['TCHW', 'CTHW', 'THWC'] = 'TCHW'

num_frames: int

output: Literal['torch', 'numpy'] = 'torch'

pad: Literal['error', 'repeat_last', 'loop'] = 'error'

resize: int | tuple[int, int] | None = None

sampling: Literal['uniform', 'random', 'center', 'start'] = 'uniform'

scale: Literal['zero_one', 'none'] = 'zero_one'

seed: int | None = None
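How num_frames, frame_stride, and pad might interact can be sketched for the simplest case, sampling from the start of a clip. Everything here is an assumption made for illustration: the source does not define these semantics, and start_indices is a hypothetical helper.

```python
def start_indices(total_frames, num_frames, *, frame_stride=1, pad="error"):
    """Pick frame indices 0, stride, 2*stride, ... (illustrative only)."""
    if num_frames < 1 or total_frames < 1:
        raise ValueError("num_frames and total_frames must be positive")
    wanted = [i * frame_stride for i in range(num_frames)]
    if wanted[-1] < total_frames:
        return wanted  # the clip is long enough, no padding needed
    if pad == "error":
        raise ValueError("clip is too short for the requested frames")
    if pad == "repeat_last":
        # Assumed meaning: clamp out-of-range indices to the final frame.
        return [min(i, total_frames - 1) for i in wanted]
    if pad == "loop":
        # Assumed meaning: wrap around to the beginning of the clip.
        return [i % total_frames for i in wanted]
    raise ValueError(f"unknown pad mode: {pad!r}")


print(start_indices(10, 4, frame_stride=2))  # [0, 2, 4, 6]
print(start_indices(5, 4, frame_stride=2, pad="repeat_last"))  # [0, 2, 4, 4]
print(start_indices(5, 4, frame_stride=2, pad="loop"))  # [0, 2, 4, 1]
```

Because this is read-time policy, changing any of these settings would only change which frames are returned, not what is stored in the cache.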

class VideoDecodeFn(*args, **kwargs)[source]

Bases: Protocol

Per-sample video decode callback.

class VideoDecodeFnBatched(*args, **kwargs)[source]

Bases: Protocol

Batched video decode callback used by StableDataset.__getitems__.

class VideoRef(cell: Mapping[str, Any], cache_dir: Path | None = None)[source]

Bases: object

Lazy reference to a cached video asset.

property bytes: bytes

cache_dir: Path | None = None

cell: Mapping[str, Any]

property checksum: str | None

property extension: str

property media_type: str

property mode: str

property path: Path | None

property size: int