stable_datasets package

Subpackages

Submodules

stable_datasets.arrow_dataset module

PyArrow-backed dataset with optional TensorDict conversion.

Provides StableDataset (single split) and StableDatasetDict (multi-split) with __len__, __getitem__, __iter__, .features, and .train_test_split() for downstream benchmarks.

StableDataset supports two construction modes:

  1. Shard-backed — directory of Arrow IPC shards. Only the needed shard is memory-mapped for __getitem__; __iter__ reads one shard at a time with bounded memory.

  2. In-memory — for small derived subsets (slices, train_test_split).

All modes keep pickle size tiny (paths only) so DataLoader workers share OS pages via mmap instead of copying data.

class StableDataset(features: Features, info: DatasetInfo, *, table: Table | None = None, num_rows: int | None = None, shard_dir: Path | str | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, max_open_shards: int = 4)[source]

Bases: object

A single-split dataset backed by a directory of Arrow IPC shards.

Two construction modes:

  1. Shard-backed — StableDataset(features, info, shard_dir=..., shard_paths=[...], shard_row_counts=[...]). Only the needed shard is memory-mapped; __iter__ streams one shard at a time.

  2. In-memory — StableDataset(features, info, table=table). For small derived subsets (slices, splits). Pickle serialises the full table.
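The shard-backed mode's row lookup can be sketched in plain Python: given the per-shard row counts, a global row index maps to a (shard, local row) pair via cumulative offsets, so only one shard file ever needs to be opened per `__getitem__`. The helper name `locate_row` is illustrative, not part of the library's API.

```python
import bisect
import itertools

def locate_row(index: int, shard_row_counts: list[int]) -> tuple[int, int]:
    """Map a global row index to (shard_index, local_row_index).

    Mirrors the shard-backed lookup described above: only the shard
    containing the row needs to be memory-mapped.
    """
    # Cumulative end offsets, e.g. counts [3, 5, 2] -> [3, 8, 10].
    offsets = list(itertools.accumulate(shard_row_counts))
    if index < 0 or not offsets or index >= offsets[-1]:
        raise IndexError(index)
    shard_idx = bisect.bisect_right(offsets, index)
    start = offsets[shard_idx - 1] if shard_idx > 0 else 0
    return shard_idx, index - start

# Row 4 falls in the second shard (global rows 3..7), local position 1.
assert locate_row(4, [3, 5, 2]) == (1, 1)
```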

property features: Features
property info: DatasetInfo
iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]

Iterate over all rows with optional shard-level shuffling.

For non-sharded datasets, this is equivalent to __iter__.
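Shard-level shuffling reorders which shard is read next, not the rows inside a shard, so each shard is still streamed sequentially with bounded memory. A minimal sketch of that ordering logic (the function name is hypothetical; iter_epoch itself yields rows, not indices):

```python
from __future__ import annotations

import random

def iter_epoch_order(num_shards: int, *, shuffle_shards: bool = True,
                     seed: int | None = None) -> list[int]:
    """Return the shard visit order for one epoch.

    Shuffling happens at shard granularity only: rows within a shard
    keep their on-disk order, so memory stays bounded to one shard.
    """
    order = list(range(num_shards))
    if shuffle_shards:
        random.Random(seed).shuffle(order)
    return order

# A fixed seed gives a reproducible shard order across runs.
assert iter_epoch_order(4, seed=0) == iter_epoch_order(4, seed=0)
assert sorted(iter_epoch_order(4, seed=0)) == [0, 1, 2, 3]
```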

property table: Table

Return the underlying Arrow table, memory-mapping from disk if needed.

For shard-backed datasets this concatenates all shards — prefer __getitem__ or __iter__ for large datasets.

to_tensordict(columns: list[str] | None = None)[source]

Convert numeric columns to a tensordict.TensorDict.

Image and Video columns are skipped (they stay lazy-decoded). Requires tensordict to be installed.

train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]

Random split. Returns {"train": StableDataset, "test": StableDataset}.
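The split logic can be sketched as a seeded shuffle over row indices; the real method additionally wraps each index list in a new in-memory StableDataset. The helper name `split_indices` is illustrative only.

```python
import random

def split_indices(num_rows: int, test_size: float = 0.1,
                  seed: int = 42) -> dict[str, list[int]]:
    """Seeded random partition of row indices into train/test.

    Sketch of the partitioning only, not the library's implementation.
    """
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(num_rows * test_size)
    return {"train": sorted(indices[n_test:]),
            "test": sorted(indices[:n_test])}

splits = split_indices(100, test_size=0.2, seed=42)
assert len(splits["test"]) == 20 and len(splits["train"]) == 80
assert not set(splits["train"]) & set(splits["test"])
```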

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.

stable_datasets.cache module

Generator-to-Arrow sharded caching pipeline.

Writes dataset examples to a directory of PyArrow IPC (Feather v2) shard files. Peak memory during writes is bounded to ~1 batch, and the sharded layout supports efficient sequential reads for training workloads.

class ShardedCacheMeta(cache_dir: Path, num_rows: int, num_shards: int, shard_filenames: list[str], shard_row_counts: list[int], schema_fingerprint: str)[source]

Bases: object

Lightweight descriptor for a sharded Arrow cache on disk.

cache_dir
num_rows
num_shards
schema_fingerprint
shard_filenames
property shard_paths: list[Path]
shard_row_counts
cache_fingerprint(cls_name: str, version: str, config_name: str, split: str) str[source]

Deterministic cache directory name for a dataset variant + split.
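Any injective, deterministic encoding of the four identifying fields yields a usable cache name. One plausible scheme, hashing the fields with a stable digest (the library's actual naming scheme may differ):

```python
import hashlib

def fingerprint(cls_name: str, version: str, config_name: str,
                split: str) -> str:
    """Hash the identifying fields into a stable directory name.

    Illustrative only: any change to class name, version, config, or
    split produces a different name and therefore a fresh cache.
    """
    key = "\x1f".join([cls_name, version, config_name, split])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"{cls_name}-{config_name}-{split}-{digest}"

# Same inputs always produce the same name; any change busts the cache.
a = fingerprint("MNIST", "1.0.0", "default", "train")
assert a == fingerprint("MNIST", "1.0.0", "default", "train")
assert a != fingerprint("MNIST", "1.0.1", "default", "train")
```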

encode_example(example: dict, features: Features) dict[source]

Encode a single example dict into Arrow-compatible values.

read_shard(shard_path: Path) Table[source]

Memory-map a single shard file and return its table.

read_sharded_cache_meta(cache_dir: Path) ShardedCacheMeta[source]

Read metadata from a sharded cache directory.

Validates that all shard files and metadata exist and are internally consistent. Raises ValueError on corruption.
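The consistency invariants follow directly from the ShardedCacheMeta fields; a sketch of the checks (the function name `check_meta` is hypothetical):

```python
from pathlib import Path

def check_meta(cache_dir: Path, num_rows: int, num_shards: int,
               shard_filenames: list[str],
               shard_row_counts: list[int]) -> None:
    """Raise ValueError when the cache metadata is inconsistent.

    Sketch of the invariants described above, using the field names
    from ShardedCacheMeta.
    """
    if len(shard_filenames) != num_shards:
        raise ValueError("shard list length does not match num_shards")
    if len(shard_row_counts) != num_shards:
        raise ValueError("row-count list length does not match num_shards")
    if sum(shard_row_counts) != num_rows:
        raise ValueError("shard row counts do not sum to num_rows")
    for name in shard_filenames:
        if not (cache_dir / name).exists():
            raise ValueError(f"missing shard file: {name}")
```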

validate_sharded_cache(cache_dir: Path, features: Features) ShardedCacheMeta[source]

Read and validate a sharded cache, checking the schema fingerprint.

Raises ValueError if the cache is inconsistent or the schema has changed.

write_sharded_arrow_cache(generator, features: Features, cache_dir: Path, *, shard_size_bytes: int = 268435456, batch_size: int = 1000) ShardedCacheMeta[source]

Consume a generator and write to a directory of Arrow IPC shards.

Batches are flushed every batch_size rows. After each flush the cumulative RecordBatch.nbytes for the current shard is checked; when it exceeds shard_size_bytes the shard is closed. The next shard is opened lazily when the next batch is ready, so there are never trailing empty shards.

Note

shard_size_bytes is an approximate target based on Arrow in-memory batch sizes, not exact on-disk file sizes. Actual shard files may be somewhat larger or smaller due to IPC framing, batch granularity, and compression differences.

An empty generator produces zero shards (num_shards == 0).

The completed cache directory contains:

  • shard-NNNNN.arrow — zero or more IPC files

  • _metadata.json — row counts, shard list, format version, schema fingerprint

Writing is atomic: shards are first written to a temporary directory and renamed into place on success.

Returns a ShardedCacheMeta describing the cache.
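The flush-and-rotate rule above can be sketched without PyArrow: plain integers stand in for RecordBatch.nbytes, and lists of row sizes stand in for shards. The helper name `plan_shards` is illustrative, not part of the library.

```python
def plan_shards(row_sizes: list[int], *, shard_size_bytes: int,
                batch_size: int) -> list[list[int]]:
    """Group rows into shards following the flush/rotate rule above.

    Rows are flushed in batches of `batch_size`; after each flush the
    shard's cumulative size is checked, and the shard is closed once it
    exceeds `shard_size_bytes`. Sizes are abstract stand-ins for
    RecordBatch.nbytes, so shard boundaries are approximate, as noted.
    """
    shards: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for start in range(0, len(row_sizes), batch_size):
        batch = row_sizes[start:start + batch_size]
        current.extend(batch)               # flush the batch into the shard
        current_bytes += sum(batch)
        if current_bytes > shard_size_bytes:
            shards.append(current)          # close the shard; the next one
            current, current_bytes = [], 0  # opens lazily on the next batch
    if current:                             # final partial shard, if any
        shards.append(current)
    return shards

# 10 rows of 100 "bytes", flushed 4 at a time, ~300-byte shards.
shards = plan_shards([100] * 10, shard_size_bytes=300, batch_size=4)
assert [len(s) for s in shards] == [4, 4, 2]
# An empty generator produces zero shards.
assert plan_shards([], shard_size_bytes=300, batch_size=4) == []
```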

stable_datasets.schema module

Feature and metadata schema definitions.

Each feature type maps itself to a PyArrow type for Arrow IPC serialization.

class Array3D(shape: tuple, dtype: str = 'uint8')[source]

Bases: FeatureType

Fixed-shape 3D array (e.g. 3D medical volumes). Stored as flat bytes.

to_arrow_type() DataType[source]
class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]

Bases: object

Base config for multi-variant datasets.

description: str = ''
name: str = 'default'
version: Version | None = None
class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]

Bases: FeatureType

Categorical label with name-to-int mapping.

Preserves the .names, .num_classes, .str2int(), .int2str() API that downstream code relies on.

int2str(idx: int) str[source]
names: list[str]
num_classes: int
str2int(name: str) int[source]
to_arrow_type() DataType[source]
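A minimal stand-in showing the name-to-int mapping API that ClassLabel preserves (`MiniClassLabel` is a sketch, not the library class):

```python
class MiniClassLabel:
    """Minimal stand-in for ClassLabel's name<->int mapping API."""

    def __init__(self, names: list[str]):
        self.names = names
        self.num_classes = len(names)
        # Precompute the reverse mapping for O(1) str2int lookups.
        self._str2int = {n: i for i, n in enumerate(names)}

    def str2int(self, name: str) -> int:
        return self._str2int[name]

    def int2str(self, idx: int) -> str:
        return self.names[idx]

labels = MiniClassLabel(["cat", "dog", "bird"])
assert labels.str2int("dog") == 1
assert labels.int2str(2) == "bird"
assert labels.num_classes == 3
```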
class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]

Bases: object

Metadata container for a dataset (description, features, citation, etc.).

citation: str = ''
config_name: str = ''
description: str = ''
features: Features
homepage: str = ''
license: str = ''
supervised_keys: tuple | None = None
class FeatureType[source]

Bases: object

Base class for feature type descriptors.

to_arrow_type() DataType[source]
class Features[source]

Bases: dict

Ordered dict of field_name -> FeatureType.

Generates a PyArrow schema via .to_arrow_schema().

to_arrow_schema() schema[source]
class Image[source]

Bases: FeatureType

Image feature. Stored as raw bytes (PNG-encoded) in Arrow.

to_arrow_type() DataType[source]
class Sequence(feature: FeatureType)[source]

Bases: FeatureType

Variable-length list of a sub-feature.

to_arrow_type() DataType[source]
class Value(dtype: str)[source]

Bases: FeatureType

Scalar value type. Maps dtype strings to PyArrow types.

to_arrow_type() DataType[source]
class Version(version_str: str)[source]

Bases: object

Semantic version string (major.minor.patch).
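A sketch of parsing and comparing a major.minor.patch string (`MiniVersion` and `as_tuple` are illustrative names, not the library's API):

```python
class MiniVersion:
    """Sketch of a major.minor.patch version; compares by tuple."""

    def __init__(self, version_str: str):
        parts = version_str.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected major.minor.patch, got {version_str!r}")
        self.major, self.minor, self.patch = (int(p) for p in parts)

    def as_tuple(self) -> tuple[int, int, int]:
        return (self.major, self.minor, self.patch)

v = MiniVersion("1.2.3")
assert v.as_tuple() == (1, 2, 3)
# Tuple comparison handles multi-digit components correctly.
assert MiniVersion("1.10.0").as_tuple() > MiniVersion("1.9.9").as_tuple()
```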

class Video[source]

Bases: FeatureType

Video feature. Stored as file path string in Arrow (metadata-only).

Video bytes are never inlined into the Arrow cache. The path points to the source media file; decoding happens lazily at access time.

to_arrow_type() DataType[source]

stable_datasets.splits module

Split name constants and split generator.

class Split[source]

Bases: object

TEST = 'test'
TRAIN = 'train'
VALIDATION = 'validation'
class SplitGenerator(name: str, gen_kwargs: dict = <factory>)[source]

Bases: object

Describes one split and the kwargs to pass to _generate_examples.

gen_kwargs: dict
name: str
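The `<factory>` default above means each instance gets its own gen_kwargs dict. A sketch mirroring that shape (`MiniSplitGenerator` is illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class MiniSplitGenerator:
    """Sketch of SplitGenerator: a split name plus the kwargs that are
    forwarded to the builder's _generate_examples."""
    name: str
    gen_kwargs: dict = field(default_factory=dict)

train = MiniSplitGenerator(name="train",
                           gen_kwargs={"data_path": "train.csv"})
# Each instance gets a fresh dict (the <factory> default above), so
# mutating one split's kwargs cannot leak into another's.
assert MiniSplitGenerator("test").gen_kwargs == {}
assert MiniSplitGenerator("a").gen_kwargs is not MiniSplitGenerator("b").gen_kwargs
```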

stable_datasets.utils module

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []
DEFAULT_CONFIG_NAME: str | None = None
SOURCE: Mapping
VERSION: Version
__init__(config_name: str | None = None, **kwargs)[source]

Initialize builder, selecting a BuilderConfig if applicable.

property info
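The three-hook contract (_info, _split_generators, _generate_examples) can be sketched with a toy builder; here plain dicts stand in for the DatasetInfo and SplitGenerator objects a real subclass would return:

```python
class MiniBuilder:
    """Toy builder mirroring the three-hook contract described above.

    Not a real BaseDatasetBuilder subclass: dicts stand in for
    DatasetInfo and SplitGenerator, and no downloading or Arrow
    caching happens here.
    """

    def _info(self) -> dict:
        return {"description": "toy dataset", "features": {"x": "int64"}}

    def _split_generators(self) -> list[dict]:
        # Each entry names a split and the kwargs for _generate_examples.
        return [{"name": "train", "gen_kwargs": {"n": 3}},
                {"name": "test", "gen_kwargs": {"n": 2}}]

    def _generate_examples(self, n: int):
        for i in range(n):
            yield {"x": i}

# The base class drives the hooks roughly like this:
builder = MiniBuilder()
splits = {s["name"]: list(builder._generate_examples(**s["gen_kwargs"]))
          for s in builder._split_generators()}
assert len(splits["train"]) == 3 and len(splits["test"]) == 2
```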
bulk_download(urls: Iterable[str], dest_folder: str | Path) list[Path][source]

Download multiple files concurrently and return their local paths.

Parameters:
  • urls – Iterable of URL strings to download.

  • dest_folder – Destination folder for downloads.

Returns:

Local file paths in the same order as the input URLs.

Return type:

list[Path]
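The order-preserving concurrency can be sketched with executor.map, which yields results in submission order even when downloads finish out of order. Here `fetch` is a stand-in for the per-URL download step, and `bulk_fetch` is an illustrative name:

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_fetch(urls: list[str], fetch) -> list[str]:
    """Run `fetch` over urls concurrently, preserving input order.

    executor.map yields results in submission order regardless of
    completion order, matching bulk_download's ordering guarantee.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))

# A stub "download" that just derives a local filename from the URL.
results = bulk_fetch(["http://x/a.bin", "http://x/b.bin"],
                     lambda u: u.rsplit("/", 1)[-1])
assert results == ["a.bin", "b.bin"]
```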

download(url: str, dest_folder: str | Path | None = None, progress_bar: bool = True, _progress_dict=None, _task_id=None) Path[source]
load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]

Load data from a .ts file into a Pandas DataFrame. Adapted from sktime: https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379

Parameters:
  • full_file_path_and_name (str) – The full pathname of the .ts file to read.

  • return_separate_X_and_y (bool) – If true, return the time-series as a DataFrame (X) and the class values as a numpy array (y) separately; otherwise return a single combined DataFrame.

  • replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.

Returns:

  • If return_separate_X_and_y – A tuple of a DataFrame containing the time-series and a numpy array containing the corresponding class values.

  • Otherwise – A single DataFrame containing all time-series and, if relevant, a "class_vals" column with the associated class values.

Module contents

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, **kwargs)[source]

Bases: object

Base class for stable-datasets builders.

Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.

BUILDER_CONFIGS: list = []
DEFAULT_CONFIG_NAME: str | None = None
SOURCE: Mapping
VERSION: Version
__init__(config_name: str | None = None, **kwargs)[source]

Initialize builder, selecting a BuilderConfig if applicable.

property info
class StableDataset(features: Features, info: DatasetInfo, *, table: Table | None = None, num_rows: int | None = None, shard_dir: Path | str | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, max_open_shards: int = 4)[source]

Bases: object

A single-split dataset backed by a directory of Arrow IPC shards.

Two construction modes:

  1. Shard-backed — StableDataset(features, info, shard_dir=..., shard_paths=[...], shard_row_counts=[...]). Only the needed shard is memory-mapped; __iter__ streams one shard at a time.

  2. In-memory — StableDataset(features, info, table=table). For small derived subsets (slices, splits). Pickle serialises the full table.

property features: Features
property info: DatasetInfo
iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]

Iterate over all rows with optional shard-level shuffling.

For non-sharded datasets, this is equivalent to __iter__.

property table: Table

Return the underlying Arrow table, memory-mapping from disk if needed.

For shard-backed datasets this concatenates all shards — prefer __getitem__ or __iter__ for large datasets.

to_tensordict(columns: list[str] | None = None)[source]

Convert numeric columns to a tensordict.TensorDict.

Image and Video columns are skipped (they stay lazy-decoded). Requires tensordict to be installed.

train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]

Random split. Returns {"train": StableDataset, "test": StableDataset}.

class StableDatasetDict[source]

Bases: dict

Dict of split_name -> StableDataset.