stable_datasets package
Subpackages
- stable_datasets.images package
- Submodules
- stable_datasets.images.arabic_characters module
- stable_datasets.images.arabic_digits module
- stable_datasets.images.awa2 module
- stable_datasets.images.beans module
- stable_datasets.images.cars196 module
- stable_datasets.images.cars3d module
- stable_datasets.images.cassava module
- stable_datasets.images.celeb_a module
- stable_datasets.images.cifar10 module
- stable_datasets.images.cifar100 module
- stable_datasets.images.cifar100_c module
- stable_datasets.images.cifar10_c module
- stable_datasets.images.clevrer module
- stable_datasets.images.country211 module
- stable_datasets.images.cub200 module
- stable_datasets.images.dsprites module
- stable_datasets.images.dsprites_color module
- stable_datasets.images.dsprites_noise module
- stable_datasets.images.dsprites_scream module
- stable_datasets.images.dtd module
- stable_datasets.images.e_mnist module
- stable_datasets.images.face_pointing module
- stable_datasets.images.fashion_mnist module
- stable_datasets.images.fgvc_aircraft module
- stable_datasets.images.flowers102 module
- stable_datasets.images.food101 module
- stable_datasets.images.galaxy10 module
- stable_datasets.images.hasy_v2 module
- stable_datasets.images.imagenet_10 module
- stable_datasets.images.imagenet_100 module
- stable_datasets.images.imagenet_1k module
- stable_datasets.images.k_mnist module
- stable_datasets.images.linnaeus5 module
- stable_datasets.images.med_mnist module
- stable_datasets.images.mnist module
- stable_datasets.images.not_mnist module
- stable_datasets.images.patch_camelyon module
- stable_datasets.images.places365_small module
- stable_datasets.images.rock_paper_scissor module
- stable_datasets.images.shapes3d module
- stable_datasets.images.small_norb module
- stable_datasets.images.stl10 module
- stable_datasets.images.svhn module
- stable_datasets.images.tiny_imagenet module
- stable_datasets.images.tiny_imagenet_c module
- Module contents
AWA2, ArabicCharacters, ArabicDigits, Beans, CIFAR10, CIFAR100, CIFAR100C, CIFAR10C, CLEVRER, CUB200, Cars196, Cars3D, Country211, DSprites, DSpritesColor, DSpritesNoise, DSpritesScream, DTD, EMNIST, FGVCAircraft, FacePointing, FashionMNIST, Flowers102, Food101, Galaxy10Decal, HASYv2, ImageNet100, ImageNet1K, Imagenette, KMNIST, Linnaeus5, MedMNIST, NotMNIST, RockPaperScissor, STL10, SVHN, Shapes3D, SmallNORB, TinyImagenet, TinyImagenetC
- stable_datasets.timeseries package
- Submodules
- stable_datasets.timeseries.CatsDogs module
- stable_datasets.timeseries.JapaneseVowels module
- stable_datasets.timeseries.MosquitoSound module
- stable_datasets.timeseries.Phoneme module
- stable_datasets.timeseries.RightWhaleCalls module
- stable_datasets.timeseries.TUTacousticscenes2017 module
- stable_datasets.timeseries.UCR_multivariate module
- stable_datasets.timeseries.UCR_univariate module
- stable_datasets.timeseries.UrbanSound module
- stable_datasets.timeseries.VoiceGenderDetection module
- stable_datasets.timeseries.audiomnist module
- stable_datasets.timeseries.birdvox_70k module
- stable_datasets.timeseries.birdvox_dcase_20k module
- stable_datasets.timeseries.brain_mnist module
- stable_datasets.timeseries.dcase_2019_task4 module
- stable_datasets.timeseries.dclde module
- stable_datasets.timeseries.esc module
- stable_datasets.timeseries.freefield1010 module
- stable_datasets.timeseries.fsd_kaggle_2018 module
- stable_datasets.timeseries.groove_MIDI module
- stable_datasets.timeseries.gtzan module
- stable_datasets.timeseries.high_gamma module
- stable_datasets.timeseries.irmas module
- stable_datasets.timeseries.picidae module
- stable_datasets.timeseries.seizures_neonatal module
- stable_datasets.timeseries.sonycust module
- stable_datasets.timeseries.speech_commands module
- stable_datasets.timeseries.vocalset module
- stable_datasets.timeseries.warblr module
- Module contents
Submodules
stable_datasets.arrow_dataset module
PyArrow-backed dataset with optional TensorDict conversion.
Provides StableDataset (single split) and StableDatasetDict (multi-split)
with __len__, __getitem__, __iter__, .features, and
.train_test_split() for downstream benchmarks.
StableDataset supports two construction modes:
Shard-backed — directory of Arrow IPC shards. Only the needed shard is memory-mapped for __getitem__; __iter__ reads one shard at a time with bounded memory.
In-memory — for small derived subsets (slices, train_test_split).
All modes keep pickle size tiny (paths only) so DataLoader workers share
OS pages via mmap instead of copying data.
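The shard-backed mode only needs to resolve a global row index to a (shard, local row) pair before memory-mapping a single shard file. A minimal pure-Python sketch of that lookup (illustrative, not the library's own code; `shard_row_counts` mirrors the constructor argument of the same name):

```python
from itertools import accumulate


def locate_row(index: int, shard_row_counts: list[int]) -> tuple[int, int]:
    """Map a global row index to (shard_index, row_within_shard)."""
    offsets = list(accumulate(shard_row_counts))  # cumulative row counts
    for shard_index, end in enumerate(offsets):
        if index < end:
            start = end - shard_row_counts[shard_index]
            return shard_index, index - start
    raise IndexError(f"row {index} out of range")


# Three shards holding 4, 2, and 5 rows: global row 5 is row 1 of shard 1.
print(locate_row(5, [4, 2, 5]))  # → (1, 1)
```

Because a lookup touches only one shard, pickling the dataset can carry just the shard paths and row counts, which is what keeps DataLoader workers cheap.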
- class StableDataset(features: Features, info: DatasetInfo, *, table: Table | None = None, num_rows: int | None = None, shard_dir: Path | str | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, max_open_shards: int = 4)[source]
Bases: object
A single-split dataset backed by a directory of Arrow IPC shards.
Two construction modes:
Shard-backed — StableDataset(features, info, shard_dir=..., shard_paths=[...], shard_row_counts=[...]). Only the needed shard is memory-mapped; __iter__ streams one shard at a time.
In-memory — StableDataset(features, info, table=table). For small derived subsets (slices, splits). Pickle serialises the full table.
- property features: Features
- property info: DatasetInfo
- iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]
Iterate over all rows with optional shard-level shuffling.
For non-sharded datasets, this is equivalent to __iter__.
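Shard-level shuffling visits the shards in a seeded random order while keeping each shard's rows in storage order, which randomises the stream without losing sequential reads. A stand-in sketch (plain lists replace the memory-mapped shard tables; not the library's implementation):

```python
import random


def iter_epoch(shards, *, shuffle_shards=True, seed=None):
    """Yield every row, optionally visiting shards in a seeded random order.

    Rows inside a shard keep their storage order, so each shard is still
    read sequentially; only the shard visiting order is randomised.
    """
    order = list(range(len(shards)))
    if shuffle_shards:
        random.Random(seed).shuffle(order)  # deterministic for a fixed seed
    for shard_index in order:
        yield from shards[shard_index]


shards = [[0, 1], [2, 3], [4, 5]]
rows = list(iter_epoch(shards, seed=0))
assert sorted(rows) == [0, 1, 2, 3, 4, 5]  # every row seen exactly once
```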
- property table: Table
Return the underlying Arrow table, memory-mapping from disk if needed.
For shard-backed datasets this concatenates all shards — prefer __getitem__ or __iter__ for large datasets.
- to_tensordict(columns: list[str] | None = None)[source]
Convert numeric columns to a tensordict.TensorDict.
Image and Video columns are skipped (they stay lazy-decoded). Requires tensordict to be installed.
- train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]
Random split. Returns {"train": StableDataset, "test": StableDataset}.
- class StableDatasetDict[source]
Bases: dict
Dict of split_name -> StableDataset.
stable_datasets.cache module
Generator-to-Arrow sharded caching pipeline.
Writes dataset examples to a directory of PyArrow IPC (Feather v2) shard files. Peak memory during writes is bounded to ~1 batch, and the sharded layout supports efficient sequential reads for training workloads.
- class ShardedCacheMeta(cache_dir: Path, num_rows: int, num_shards: int, shard_filenames: list[str], shard_row_counts: list[int], schema_fingerprint: str)[source]
Bases: object
Lightweight descriptor for a sharded Arrow cache on disk.
- cache_fingerprint(cls_name: str, version: str, config_name: str, split: str) str[source]
Deterministic cache directory name for a dataset variant + split.
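One common way to get a deterministic, filesystem-safe directory name is to hash the identifying fields; the sketch below assumes such a hashing scheme (the library's actual naming scheme may differ):

```python
import hashlib


def cache_fingerprint(cls_name: str, version: str, config_name: str, split: str) -> str:
    """Deterministic cache directory name for a dataset variant + split.

    Joins the identifying fields with an unambiguous separator and hashes
    them, so identical inputs always map to the same directory.
    """
    key = "|".join([cls_name, version, config_name, split])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"{cls_name}-{split}-{digest}"


a = cache_fingerprint("MNIST", "1.0.0", "default", "train")
assert a == cache_fingerprint("MNIST", "1.0.0", "default", "train")  # stable
assert a != cache_fingerprint("MNIST", "1.0.0", "default", "test")   # split-sensitive
```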
- encode_example(example: dict, features: Features) dict[source]
Encode a single example dict into Arrow-compatible values.
- read_shard(shard_path: Path) Table[source]
Memory-map a single shard file and return its table.
- read_sharded_cache_meta(cache_dir: Path) ShardedCacheMeta[source]
Read metadata from a sharded cache directory.
Validates that all shard files and metadata exist and are internally consistent. Raises ValueError on corruption.
- validate_sharded_cache(cache_dir: Path, features: Features) ShardedCacheMeta[source]
Read and validate a sharded cache, checking the schema fingerprint.
Raises ValueError if the cache is inconsistent or the schema has changed.
- write_sharded_arrow_cache(generator, features: Features, cache_dir: Path, *, shard_size_bytes: int = 268435456, batch_size: int = 1000) ShardedCacheMeta[source]
Consume a generator and write to a directory of Arrow IPC shards.
Batches are flushed every batch_size rows. After each flush the cumulative RecordBatch.nbytes for the current shard is checked; when it exceeds shard_size_bytes the shard is closed. The next shard is opened lazily when the next batch is ready, so there are never trailing empty shards.
Note
shard_size_bytes is an approximate target based on Arrow in-memory batch sizes, not exact on-disk file sizes. Actual shard files may be somewhat larger or smaller due to IPC framing, batch granularity, and compression differences.
An empty generator produces zero shards (num_shards == 0).
The completed cache directory contains:
- shard-NNNNN.arrow — zero or more IPC files
- _metadata.json — row counts, shard list, format version, schema fingerprint
Writing is atomic: shards are first written to a temporary directory and renamed into place on success.
Returns a ShardedCacheMeta describing the cache.
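The flush-then-check rotation policy above can be sketched with plain lists: a `nbytes` callback stands in for RecordBatch.nbytes, rows stand in for records, and shards are lists rather than IPC files. This is a behavioural sketch only, not the library's writer:

```python
def write_shards(rows, *, batch_size=3, shard_size_bytes=8, nbytes=len):
    """Group rows into shards, rotating when flushed bytes pass a target.

    A batch is flushed every `batch_size` rows; the shard's cumulative
    size is checked after each flush and the shard closes once it passes
    `shard_size_bytes`. A new shard only materialises when more rows
    arrive, so an empty input yields zero shards and there is never a
    trailing empty shard.
    """
    shards, current, current_bytes, batch = [], [], 0, []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            current.extend(batch)
            current_bytes += nbytes(batch)  # proxy for RecordBatch.nbytes
            batch = []
            if current_bytes >= shard_size_bytes:
                shards.append(current)  # close the full shard
                current, current_bytes = [], 0
    if batch:  # final partial batch
        current.extend(batch)
    if current:  # never emit a trailing empty shard
        shards.append(current)
    return shards


# 10 rows, 3-row batches, ~8-"byte" shards → one 9-row shard plus a remainder.
print(write_shards(list(range(10))))  # → [[0, 1, 2, 3, 4, 5, 6, 7, 8], [9]]
```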
stable_datasets.schema module
Feature and metadata schema definitions.
Each feature type maps itself to a PyArrow type for Arrow IPC serialization.
- class Array3D(shape: tuple, dtype: str = 'uint8')[source]
Bases: FeatureType
Fixed-shape 3D array (e.g. 3D medical volumes). Stored as flat bytes.
- to_arrow_type() DataType[source]
- class BuilderConfig(name: str = 'default', version: Version | None = None, description: str = '')[source]
Bases: object
Base config for multi-variant datasets.
- version: Version | None = None
- class ClassLabel(names: list[str] | None = None, num_classes: int | None = None)[source]
Bases: FeatureType
Categorical label with name-to-int mapping.
Preserves the .names, .num_classes, .str2int(), .int2str() API that downstream code relies on.
- int2str(idx: int) str[source]
- str2int(name: str) int[source]
- to_arrow_type() DataType[source]
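The documented API surface (.names, .num_classes, .str2int(), .int2str()) amounts to a bidirectional name/index mapping; a pure-Python stand-in, not the library's own class:

```python
class ClassLabel:
    """Sketch of a categorical label with name-to-int mapping."""

    def __init__(self, names: list[str]):
        self.names = list(names)
        self.num_classes = len(self.names)
        self._name_to_idx = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name: str) -> int:
        return self._name_to_idx[name]

    def int2str(self, idx: int) -> str:
        return self.names[idx]


label = ClassLabel(names=["cat", "dog", "bird"])
print(label.str2int("dog"))  # → 1
print(label.int2str(2))      # → bird
```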
- class DatasetInfo(features: Features, description: str = '', supervised_keys: tuple | None = None, homepage: str = '', citation: str = '', license: str = '', config_name: str = '')[source]
Bases: object
Metadata container for a dataset (description, features, citation, etc.).
- features: Features
- class FeatureType[source]
Bases: object
Base class for feature type descriptors.
- to_arrow_type() DataType[source]
- class Features[source]
Bases: dict
Ordered dict of field_name -> FeatureType. Generates a PyArrow schema via .to_arrow_schema().
- to_arrow_schema() schema[source]
- class Image[source]
Bases: FeatureType
Image feature. Stored as raw bytes (PNG-encoded) in Arrow.
- to_arrow_type() DataType[source]
- class Sequence(feature: FeatureType)[source]
Bases: FeatureType
Variable-length list of a sub-feature.
- to_arrow_type() DataType[source]
- class Value(dtype: str)[source]
Bases: FeatureType
Scalar value type. Maps dtype strings to PyArrow types.
- to_arrow_type() DataType[source]
- class Version(version_str: str)[source]
Bases: object
Semantic version string (major.minor.patch).
- class Video[source]
Bases: FeatureType
Video feature. Stored as file path string in Arrow (metadata-only).
Video bytes are never inlined into the Arrow cache. The path points to the source media file; decoding happens lazily at access time.
- to_arrow_type() DataType[source]
stable_datasets.splits module
Split name constants and split generator.
- class Split[source]
Bases:
object
- class SplitGenerator(name: str, gen_kwargs: dict = <factory>)[source]
Bases: object
Describes one split and the kwargs to pass to _generate_examples.
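A builder's _split_generators typically returns one such descriptor per split, each carrying the kwargs its _generate_examples call will receive. A dataclass stand-in mirroring the documented fields (the file layout and helper below are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class SplitGenerator:
    """Stand-in for the documented fields: a split name plus the kwargs
    forwarded to _generate_examples for that split."""
    name: str
    gen_kwargs: dict = field(default_factory=dict)


def example_split_generators(extracted_dir: str) -> list[SplitGenerator]:
    # Hypothetical layout: one CSV file per split under `extracted_dir`.
    return [
        SplitGenerator("train", {"path": f"{extracted_dir}/train.csv"}),
        SplitGenerator("test", {"path": f"{extracted_dir}/test.csv"}),
    ]


splits = example_split_generators("/tmp/data")
print([s.name for s in splits])  # → ['train', 'test']
```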
stable_datasets.utils module
- class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, **kwargs)[source]
Bases: object
Base class for stable-datasets builders.
Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.
- VERSION: Version
- __init__(config_name: str | None = None, **kwargs)[source]
Initialize builder, selecting a BuilderConfig if applicable.
- bulk_download(urls: Iterable[str], dest_folder: str | Path) list[Path][source]
Download multiple files concurrently and return their local paths.
- Parameters:
urls – Iterable of URL strings to download.
dest_folder – Destination folder for downloads.
- Returns:
Local file paths in the same order as the input URLs.
- Return type:
list[Path]
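The ordering guarantee (results aligned with input URLs even though downloads finish out of order) is the natural behaviour of `Executor.map`. A sketch of that pattern with a stand-in `fetch` callable in place of the real downloader:

```python
from concurrent.futures import ThreadPoolExecutor


def bulk_fetch(urls, fetch, max_workers=8):
    """Run `fetch` over many URLs concurrently, results in input order.

    executor.map yields results aligned with its inputs regardless of
    completion order, which is how the documented ordering guarantee
    falls out for free.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


urls = ["u1", "u2", "u3"]
paths = bulk_fetch(urls, lambda u: f"/downloads/{u}")
print(paths)  # → ['/downloads/u1', '/downloads/u2', '/downloads/u3']
```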
- download(url: str, dest_folder: str | Path | None = None, progress_bar: bool = True, _progress_dict=None, _task_id=None) Path[source]
- load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]
Load data from a .ts file into a Pandas DataFrame. Credit to https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379 :param full_file_path_and_name: The full pathname of the .ts file to read. :type full_file_path_and_name: str :param return_separate_X_and_y: true if X and Y values should be returned as separate Data Frames (
X) and a numpy array (y), false otherwise. This is only relevant for data that
- Parameters:
replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.
- Returns:
DataFrame (default) or ndarray (i – If return_separate_X_and_y then a tuple containing a DataFrame and a numpy array containing the relevant time-series and corresponding class values.
DataFrame – If not return_separate_X_and_y then a single DataFrame containing all time-series and (if relevant) a column “class_vals” the associated class values.
Module contents
- class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, **kwargs)[source]
Bases: object
Base class for stable-datasets builders.
Handles downloading, Arrow caching, and split generation. Subclasses implement _info, _split_generators, and _generate_examples.
- VERSION: Version
- __init__(config_name: str | None = None, **kwargs)[source]
Initialize builder, selecting a BuilderConfig if applicable.
- class StableDataset(features: Features, info: DatasetInfo, *, table: Table | None = None, num_rows: int | None = None, shard_dir: Path | str | None = None, shard_paths: list[Path] | None = None, shard_row_counts: list[int] | None = None, max_open_shards: int = 4)[source]
Bases: object
A single-split dataset backed by a directory of Arrow IPC shards.
Two construction modes:
Shard-backed — StableDataset(features, info, shard_dir=..., shard_paths=[...], shard_row_counts=[...]). Only the needed shard is memory-mapped; __iter__ streams one shard at a time.
In-memory — StableDataset(features, info, table=table). For small derived subsets (slices, splits). Pickle serialises the full table.
- property features: Features
- property info: DatasetInfo
- iter_epoch(*, shuffle_shards: bool = True, seed: int | None = None)[source]
Iterate over all rows with optional shard-level shuffling.
For non-sharded datasets, this is equivalent to __iter__.
- property table: Table
Return the underlying Arrow table, memory-mapping from disk if needed.
For shard-backed datasets this concatenates all shards — prefer __getitem__ or __iter__ for large datasets.
- to_tensordict(columns: list[str] | None = None)[source]
Convert numeric columns to a tensordict.TensorDict.
Image and Video columns are skipped (they stay lazy-decoded). Requires tensordict to be installed.
- train_test_split(test_size: float = 0.1, seed: int = 42) dict[str, StableDataset][source]
Random split. Returns {"train": StableDataset, "test": StableDataset}.
- class StableDatasetDict[source]
Bases: dict
Dict of split_name -> StableDataset.