stable_datasets.backends package
Submodules
stable_datasets.backends.arrow_shards module
Arrow IPC implementation of StorageBackend.
ArrowBackend owns mmap lifetime, shard routing, and
pickle/unpickle for Arrow IPC shard files. Returns Arrow-native types
(pa.Table, pa.RecordBatch) and plain Python dicts;
carries no dependency on PIL, torch, or numpy beyond what Arrow itself
requires. Decoding to user-facing types is the formatter’s job.
- class ArrowBackend(*, shard_paths: list[Path] | None = None, table: Table | None = None, shard_row_counts: list[int] | None = None, schema: Schema | None = None)
Bases: object
Arrow-native storage layer.
Construction modes:
File-backed (typical): receives shard file paths and row counts. Mmaps lazily on first access; each DataLoader worker re-mmaps after fork.
In-memory: receives a pa.Table directly (for derived subsets, column mutations, flatten_indices output).
StableDataset should never inspect shard internals; it delegates all storage concerns here.
- get_row(idx: int) -> dict
Single row access. Uses slice(), which avoids copying binary data.
- iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) -> Iterator[RecordBatch]
Yield record batches from shard files sequentially.
Only one shard is mmap’d at a time, which can reduce peak Python-managed memory for multi-shard datasets.
- slice(start: int, length: int) -> Table
Contiguous range access.
For single-shard datasets, slices directly from the mmap’d table without triggering full materialization.
- property table: Table
Materialize and return the full Arrow table.
For single-file datasets this is a cheap mmap. For multi-shard datasets this concatenates all shards into one table, so use it only when full materialization is intended (e.g. column mutations). Hot paths should use get_row, take, or iter_batches instead.
- take(indices: ndarray | list[int]) -> Table
Batched row access without using pa.Table.take.
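How ArrowBackend routes rows to shards is not shown in this reference. As a rough sketch of the idea, assuming rows are numbered shard by shard and located through cumulative row counts, a global index can be mapped to a (shard, local offset) pair, and a batch of indices grouped so that each shard is visited once. The helpers locate and group_by_shard are hypothetical names, not part of the package.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(idx: int, shard_row_counts: list[int]) -> tuple[int, int]:
    """Map a global row index to (shard_index, row_within_shard).

    Hypothetical helper: assumes rows are numbered shard by shard.
    """
    ends = list(accumulate(shard_row_counts))  # exclusive end offset of each shard
    total = ends[-1] if ends else 0
    if not 0 <= idx < total:
        raise IndexError(f"row {idx} out of range for {total} rows")
    shard = bisect_right(ends, idx)  # first shard whose end offset exceeds idx
    shard_start = ends[shard] - shard_row_counts[shard]
    return shard, idx - shard_start


def group_by_shard(
    indices: list[int], shard_row_counts: list[int]
) -> dict[int, list[tuple[int, int]]]:
    """Group requested rows per shard, keeping each row's output position."""
    groups: dict[int, list[tuple[int, int]]] = {}
    for out_pos, idx in enumerate(indices):
        shard, local = locate(idx, shard_row_counts)
        groups.setdefault(shard, []).append((out_pos, local))
    return groups


# Three shards holding 3, 2, and 4 rows; request rows 8, 0, and 3.
plan = group_by_shard([8, 0, 3], [3, 2, 4])
```

Keeping the output position next to each local offset lets a caller read every shard once and still return rows in the order they were requested.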
stable_datasets.backends.lance_rows module
Lance implementation of StorageBackend.
Lance is built on Arrow: its Python API returns pa.Table,
pa.RecordBatch, and pa.Schema directly, with no adapter layer.
LanceBackend is a thin wrapper over lance.LanceDataset that
exposes those Arrow return values through the same protocol as
ArrowBackend, so StableDataset consumes either
interchangeably.
Read-only. Lance is a storage format, not a mutable in-memory
table. In-memory operations on StableDataset
(rename_column, add_column, map, derived subsets) always
produce a fresh ArrowBackend over a pa.Table regardless of
the source backend; LanceBackend has no table=...
construction mode.
Shards = fragments. Lance partitions datasets internally into
fragments, which are the I/O units StableIterableDataset uses for
worker sharding. num_shards returns the fragment count, and
iter_batches(shard_indices=...) iterates only those fragment
indices.
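The splitting strategy used by StableIterableDataset is not stated here; the only property the text relies on is that workers receive disjoint fragment indices. A minimal sketch, assuming a round-robin split (partition_shards is a hypothetical helper), could look like this:

```python
def partition_shards(num_shards: int, num_workers: int, worker_id: int) -> list[int]:
    """Return the shard (fragment) indices assigned to one worker.

    Assumption: round-robin assignment. Any split that yields disjoint
    sets covering all shards would serve the same purpose.
    """
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return list(range(worker_id, num_shards, num_workers))


# Example: 7 fragments split across 3 workers.
assignments = [partition_shards(7, 3, w) for w in range(3)]
```

Each resulting list has the shape expected by iter_batches(shard_indices=...), so a worker would iterate only its own fragments.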
Blob encoding is off by default. Lance’s blob encoding stores large
binary columns out-of-line and only pays off when paired with
take_blobs and to_batches(blob_handling="all_binary") at read
time, plus per-column field metadata at write time. The read methods
here use plain take / to_batches, which work for any Lance
dataset regardless of whether the column was blob-encoded.
Pickling is URI-based. __getstate__ serializes only the dataset
URI plus cached row/shard counts; __setstate__ reopens by URI via
lance.dataset(...).
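The URI-based pickling pattern can be illustrated without Lance itself. The sketch below is an assumption-laden toy: FakeHandle and open_dataset stand in for the real dataset handle and for lance.dataset(...), UriPickledBackend is a hypothetical class, and the path is made up. It only shows the shape of the two hooks: the state contains the URI and cached counts, and the handle is reopened on restore.

```python
class FakeHandle:
    """Stand-in for an open dataset handle that should not be serialized."""

    def __init__(self, uri: str) -> None:
        self.uri = uri


def open_dataset(uri: str) -> FakeHandle:
    """Hypothetical opener playing the role of lance.dataset(...)."""
    return FakeHandle(uri)


class UriPickledBackend:
    """Toy backend whose pickled state is just a URI plus cached counts."""

    def __init__(self, uri: str, num_rows: int, num_shards: int) -> None:
        self.uri = uri
        self.num_rows = num_rows
        self.num_shards = num_shards
        self._handle = open_dataset(uri)

    def __getstate__(self) -> dict:
        # Drop the live handle; keep only cheap, serializable fields.
        return {"uri": self.uri, "num_rows": self.num_rows, "num_shards": self.num_shards}

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        self._handle = open_dataset(self.uri)  # reopen in the receiving process


backend = UriPickledBackend("data/train.lance", num_rows=1000, num_shards=4)

# Spell out what pickle does with these hooks: capture the state,
# create a bare instance, then restore the state into it.
state = backend.__getstate__()
clone = UriPickledBackend.__new__(UriPickledBackend)
clone.__setstate__(state)
```

pickle.dumps followed by pickle.loads would invoke the same two hooks, which is what happens when a DataLoader ships a dataset to a worker process.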
- class LanceBackend(*, uri: str | Path, batch_readahead: int = 8)
Bases: object
- __init__(*, uri: str | Path, batch_readahead: int = 8)
- Parameters:
uri (str or Path) – Path to the Lance dataset directory.
batch_readahead (int, default 8) – Number of RecordBatches Lance reads ahead in the scanner when iter_batches is called. Matches Lance’s own lance.torch.data.LanceDataset example, which uses batch_readahead=8. Higher values increase memory use during iteration but improve throughput on high-latency storage. Ignored by take / get_row / slice.
- get_row(idx: int) -> dict
- iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) -> Iterator[RecordBatch]
Yield record batches from Lance fragments.
shard maps to LanceFragment. Worker partitioning in StableIterableDataset works the same way as for the Arrow backend: each worker receives a disjoint set of fragment indices and iterates only those.
- prefer_batched_take: bool = True
Hint to StableDataset.__getitems__ that this backend’s batched take(indices) path should be used for random index reads. Batching amortizes the cost of crossing Lance’s Python/Rust call boundary.
- slice(start: int, length: int) -> Table
- property table: Table
Full materialization as a single pa.Table. Expensive for large datasets. Use get_row, take, slice, or iter_batches on hot paths.
- take(indices: ndarray | list[int]) -> Table
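How StableDataset.__getitems__ consults prefer_batched_take is not shown in this reference. One plausible reading, sketched below with toy classes, is a simple attribute check that picks one batched call over many per-row calls. RowBackend, BatchedBackend, and fetch_many are hypothetical, and lists of dicts stand in for the pa.Table that the real take returns.

```python
class RowBackend:
    """Toy backend without the hint: rows are fetched one at a time."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows
        self.calls = 0  # number of backend calls made so far

    def get_row(self, idx: int) -> dict:
        self.calls += 1
        return self.rows[idx]


class BatchedBackend(RowBackend):
    """Toy backend advertising that a single take() call is preferable."""

    prefer_batched_take = True

    def take(self, indices: list[int]) -> list[dict]:
        self.calls += 1
        return [self.rows[i] for i in indices]


def fetch_many(backend, indices: list[int]) -> list[dict]:
    """Hypothetical dispatcher in the spirit of a __getitems__ method."""
    if getattr(backend, "prefer_batched_take", False):
        return backend.take(indices)  # one call for the whole batch
    return [backend.get_row(i) for i in indices]  # one call per row


rows = [{"id": i} for i in range(5)]
per_row, batched = RowBackend(rows), BatchedBackend(rows)
same_result = fetch_many(per_row, [4, 0, 2]) == fetch_many(batched, [4, 0, 2])
```

Both paths return the same rows; the call counters differ, which is the cost the hint is meant to reduce.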
stable_datasets.backends.lance_video_frames module
Lance-backed random-access video frame segment layout.
- class LanceVideoFramesBackend(*, uri: str | Path, window_length: int = 1, frame_skip: int = 0, hop_size: int = 1, min_video_frames: int | None = None, batch_readahead: int = 8)
Bases: object
StorageBackend for the lance-video-frames layout.
The physical Lance dataset stores one WebP-encoded frame per row. The logical dataset exposes deterministic frame windows as samples.
- get_row(idx: int) -> dict
- iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None)
- segment_filename(idx: int) -> str
- segment_info(idx: int) -> dict
- slice(start: int, length: int) -> Table
- take(indices: ndarray | list[int]) -> Table
- static worker_init(worker_id: int) -> None
- reset_worker_state() -> None
Reset process-local decoder/backend state after a DataLoader fork.
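The exact meaning of window_length, frame_skip, and hop_size is not spelled out in this reference. The sketch below assumes one common interpretation: each window holds window_length frames, frame_skip frames are skipped between consecutive frames of a window, and hop_size is the distance between the first frames of successive windows. window_frames is a hypothetical helper, and min_video_frames is left out because its semantics are not described.

```python
def window_frames(
    num_frames: int,
    window_length: int = 1,
    frame_skip: int = 0,
    hop_size: int = 1,
) -> list[list[int]]:
    """Enumerate frame indices for every window that fits in one video.

    Assumed semantics (not confirmed by the source):
    stride between frames in a window = frame_skip + 1,
    stride between window starts = hop_size.
    """
    stride = frame_skip + 1
    span = (window_length - 1) * stride + 1  # frames covered from first to last
    windows = []
    for start in range(0, num_frames - span + 1, hop_size):
        windows.append([start + k * stride for k in range(window_length)])
    return windows


# A 10-frame video, 3 frames per window, every other frame, hop of 2.
windows = window_frames(10, window_length=3, frame_skip=1, hop_size=2)
```

Because the enumeration depends only on the frame count and the three parameters, the same logical sample index always maps to the same frames, which matches the "deterministic frame windows" wording above.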
stable_datasets.backends.protocol module
Read-side storage backend protocol.
Defines StorageBackend, the interface StableDataset
depends on for row access, iteration, and materialization. Concrete
backends (e.g. ArrowBackend) conform structurally.
Arrow types (pa.Table, pa.RecordBatch,
pa.Schema) are the boundary types. Members not declared on the
protocol are backend-private.
- class StorageBackend(*args, **kwargs)
Bases: Protocol
Read-side storage interface consumed by StableDataset.
- get_row(idx: int) -> dict
- iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) -> Iterator[RecordBatch]
- slice(start: int, length: int) -> Table
- property table: Table
Full materialization as a single pa.Table. Expensive for multi-shard datasets. Hot paths should prefer get_row, take, slice, or iter_batches.
- take(indices: ndarray | list[int]) -> Table
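Structural conformance means a backend never has to inherit from the protocol; it only needs matching members. The sketch below illustrates this with a simplified stand-in: RowSource and ListBackend are hypothetical, lists of dicts replace the Arrow boundary types, and the table property and the shard_indices / shuffle / seed arguments are omitted for brevity.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class RowSource(Protocol):
    """Simplified stand-in for StorageBackend (lists replace Arrow types)."""

    def get_row(self, idx: int) -> dict: ...
    def take(self, indices: list[int]) -> list[dict]: ...
    def slice(self, start: int, length: int) -> list[dict]: ...
    def iter_batches(self) -> Iterator[list[dict]]: ...


class ListBackend:
    """Conforms structurally: note there is no RowSource base class."""

    def __init__(self, rows: list[dict], batch_size: int = 2) -> None:
        self._rows = rows
        self._batch_size = batch_size

    def get_row(self, idx: int) -> dict:
        return self._rows[idx]

    def take(self, indices: list[int]) -> list[dict]:
        return [self._rows[i] for i in indices]

    def slice(self, start: int, length: int) -> list[dict]:
        return self._rows[start:start + length]

    def iter_batches(self) -> Iterator[list[dict]]:
        for start in range(0, len(self._rows), self._batch_size):
            yield self._rows[start:start + self._batch_size]


backend = ListBackend([{"x": i} for i in range(5)])
conforms = isinstance(backend, RowSource)  # checks method names only
batch_sizes = [len(batch) for batch in backend.iter_batches()]
```

The isinstance check enabled by runtime_checkable only verifies that the methods exist, not their signatures; a static type checker is what validates the full shape.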
Module contents
Storage backend implementations and protocol exports.
The package re-exports ArrowBackend, LanceBackend, LanceVideoFramesBackend, and StorageBackend; each is documented under its submodule above.