stable_datasets.backends package

Submodules

stable_datasets.backends.arrow_shards module

Arrow IPC implementation of StorageBackend.

ArrowBackend owns mmap lifetime, shard routing, and pickle/unpickle for Arrow IPC shard files. Returns Arrow-native types (pa.Table, pa.RecordBatch) and plain Python dicts; carries no dependency on PIL, torch, or numpy beyond what Arrow itself requires. Decoding to user-facing types is the formatter’s job.

class ArrowBackend(*, shard_paths: list[Path] | None = None, table: Table | None = None, shard_row_counts: list[int] | None = None, schema: Schema | None = None)[source]

Bases: object

Arrow-native storage layer.

Construction modes:

File-backed (typical): receives shard file paths + row counts. Mmaps lazily on first access; each DataLoader worker re-mmaps after fork.
In-memory: receives a pa.Table directly (for derived subsets, column mutations, flatten_indices output).

StableDataset should never inspect shard internals — it delegates all storage concerns here.
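The shard routing that a file-backed backend performs can be pictured as a lookup over cumulative row counts. The sketch below is illustrative only: `locate_row` is a hypothetical helper, not part of ArrowBackend, and it assumes shards are laid out back to back in the order given by shard_row_counts.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(idx: int, shard_row_counts: list[int]) -> tuple[int, int]:
    """Map a global row index to (shard index, row offset inside that shard).

    Hypothetical helper for illustration; not part of the documented API.
    """
    # Cumulative end offsets, e.g. [3, 5, 9] for counts [3, 2, 4].
    ends = list(accumulate(shard_row_counts))
    if not ends or not 0 <= idx < ends[-1]:
        raise IndexError(f"row {idx} out of range")
    # First shard whose end offset is strictly greater than idx.
    shard = bisect_right(ends, idx)
    start = ends[shard] - shard_row_counts[shard]
    return shard, idx - start
```

With row counts [3, 2, 4], global row 3 resolves to the first row of the second shard, so only that shard would need to be opened.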

property cache_dir: Path | None

get_row(idx: int) → dict[source]

Single-row access. Uses slice(), which avoids copying binary data.

property is_file_backed: bool

iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) → Iterator[RecordBatch][source]

Yield record batches from shard files sequentially.

Only one shard is mmap’d at a time, which can reduce peak Python-managed memory for multi-shard datasets.
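The one-shard-at-a-time pattern can be sketched with a plain generator. This is a minimal sketch under stated assumptions: `load_shard` is a hypothetical callable standing in for opening one shard file, and applying `shuffle` to the shard order (rather than to rows) is an assumption of this example, not something the source confirms.

```python
import random
from collections.abc import Callable, Iterator
from typing import Optional


def iter_shards(
    num_shards: int,
    load_shard: Callable[[int], list],
    shard_indices: Optional[list[int]] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator:
    """Yield batches shard by shard, holding one shard at a time.

    `load_shard` is a hypothetical stand-in that returns the batches of
    one shard. Shuffling the shard order is an assumption of this sketch.
    """
    order = list(range(num_shards)) if shard_indices is None else list(shard_indices)
    if shuffle:
        # A private Random instance keeps the order reproducible per seed.
        random.Random(seed).shuffle(order)
    for shard in order:
        batches = load_shard(shard)  # only this shard is resident now
        yield from batches
        del batches  # drop the reference before opening the next shard
```

Because the generator drops its reference to each shard before loading the next, peak memory is bounded by the largest shard rather than by the whole dataset.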

property num_rows: int

Row count without forcing table load.

property num_shards: int

prefer_batched_take: bool = True

property schema: Schema

slice(start: int, length: int) → Table[source]

Contiguous range access.

For single-shard datasets, slices directly from the mmap’d table without triggering full materialization.

property table: Table

Materialize and return the full Arrow table.

For single-file datasets this is a cheap mmap. For multi-shard datasets this concatenates all shards into one table — use only when full materialization is intended (e.g. column mutations). Hot paths should use get_row, take, or iter_batches instead.

take(indices: ndarray | list[int]) → Table[source]

Batched row access without using pa.Table.take.
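The source says only that take avoids pa.Table.take; it does not describe the mechanism. One plausible shape for such a path is to group the requested indices by shard, visit each shard once, and write rows back in the caller's order. The sketch below uses plain Python lists as stand-ins for shard tables, and `batched_take` is a hypothetical name.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


def batched_take(indices: list[int], shards: list[list]) -> list:
    """Gather rows for `indices`, touching each needed shard once.

    `shards` holds in-memory row lists standing in for shard tables.
    This illustrates grouping by shard; it is not ArrowBackend's code.
    """
    ends = list(accumulate(len(s) for s in shards))
    total = ends[-1] if ends else 0
    # shard index -> [(position in output, offset inside shard), ...]
    per_shard = defaultdict(list)
    for out_pos, idx in enumerate(indices):
        if not 0 <= idx < total:
            raise IndexError(f"row {idx} out of range")
        shard = bisect_right(ends, idx)
        offset = idx - (ends[shard] - len(shards[shard]))
        per_shard[shard].append((out_pos, offset))
    out = [None] * len(indices)
    for shard, wanted in sorted(per_shard.items()):
        rows = shards[shard]  # one visit per shard
        for out_pos, offset in wanted:
            out[out_pos] = rows[offset]
    return out
```

Output order follows the request order, including repeated indices, which is what a random-access sampler typically expects.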

stable_datasets.backends.lance_rows module

Lance implementation of StorageBackend.

Lance is built on Arrow: its Python API returns pa.Table, pa.RecordBatch, and pa.Schema directly, with no adapter layer. LanceBackend is a thin wrapper over lance.LanceDataset that exposes those Arrow return values through the same protocol as ArrowBackend, so StableDataset consumes either interchangeably.

Read-only. Lance is a storage format, not a mutable in-memory table. In-memory operations on StableDataset (rename_column, add_column, map, derived subsets) always produce a fresh ArrowBackend over a pa.Table regardless of the source backend; LanceBackend has no table=... construction mode.

Shards = fragments. Lance partitions datasets internally into fragments, which are the I/O units StableIterableDataset uses for worker sharding. num_shards returns the fragment count, and iter_batches(shard_indices=...) iterates only those fragment indices.

Blob encoding is off by default. Lance’s blob encoding stores large binary columns out-of-line and only pays off when paired with take_blobs and to_batches(blob_handling="all_binary") at read time, plus per-column field metadata at write time. The read methods here use plain take / to_batches, which work for any Lance dataset regardless of whether the column was blob-encoded.

Pickling is URI-based. __getstate__ serializes only the dataset URI plus cached row/shard counts; __setstate__ reopens by URI via lance.dataset(...).
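The URI-based pickling described above follows a common pattern for handles that cannot be serialized directly. The sketch below is a stand-in, not LanceBackend itself: `UriHandle` and `_open_dataset` are hypothetical names, and `_open_dataset` only imitates what a call such as lance.dataset(...) would do.

```python
def _open_dataset(uri: str) -> dict:
    # Placeholder for opening a real dataset handle by URI.
    return {"opened_from": uri}


class UriHandle:
    """Hypothetical backend-like object that serializes by URI only."""

    def __init__(self, uri: str, num_rows: int, num_shards: int):
        self.uri = uri
        self.num_rows = num_rows      # cached so it survives pickling
        self.num_shards = num_shards  # cached so it survives pickling
        self._dataset = _open_dataset(uri)  # live handle, never pickled

    def __getstate__(self) -> dict:
        # Only cheap, plain values cross the process boundary.
        return {
            "uri": self.uri,
            "num_rows": self.num_rows,
            "num_shards": self.num_shards,
        }

    def __setstate__(self, state: dict) -> None:
        self.uri = state["uri"]
        self.num_rows = state["num_rows"]
        self.num_shards = state["num_shards"]
        # Reopen in the receiving process instead of shipping the handle.
        self._dataset = _open_dataset(self.uri)
```

pickle calls these two hooks automatically, so a DataLoader worker that receives such an object ends up with its own freshly opened handle.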

class LanceBackend(*, uri: str | Path, batch_readahead: int = 8)[source]

Bases: object

__init__(*, uri: str | Path, batch_readahead: int = 8)[source]

Parameters:

uri (str or Path) – Path to the Lance dataset directory.
batch_readahead (int, default 8) – Number of RecordBatches Lance reads ahead in the scanner when iter_batches is called. The default matches Lance’s own lance.torch.data.LanceDataset example, which uses batch_readahead=8. Higher values increase memory use during iteration but improve throughput on high-latency storage. Ignored by take, get_row, and slice.

property cache_dir: Path

get_row(idx: int) → dict[source]

property is_file_backed: bool

iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) → Iterator[RecordBatch][source]

Yield record batches from Lance fragments.

Each shard maps to one Lance fragment. Worker partitioning in StableIterableDataset works the same way as for the Arrow backend: each worker receives a disjoint set of fragment indices and iterates only those.
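A disjoint split of fragment indices across workers can be sketched as a round-robin assignment. The source states only that the sets are disjoint; the striding scheme and the name `shards_for_worker` are assumptions of this example.

```python
def shards_for_worker(num_shards: int, worker_id: int, num_workers: int) -> list[int]:
    """Round-robin assignment of shard (fragment) indices to one worker.

    Hypothetical helper; the actual partitioning scheme may differ.
    """
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Worker w takes shards w, w + num_workers, w + 2 * num_workers, ...
    return list(range(worker_id, num_shards, num_workers))
```

The resulting list is the kind of value that would be passed as `shard_indices` to iter_batches inside each worker.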

property num_rows: int

property num_shards: int

prefer_batched_take: bool = True

Hint to StableDataset.__getitems__ that this backend’s batched take(indices) path should be used for random index reads. Batching amortizes Lance’s Python/Rust call boundary.
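How a consumer might act on this hint can be sketched as a simple dispatch. Everything here is illustrative: `fetch_rows` and `ToyBackend` are hypothetical, and the toy `take` returns a list of dicts, whereas the documented take returns a pa.Table.

```python
class ToyBackend:
    """Hypothetical backend that records which access path was used."""

    def __init__(self, rows: list[dict], prefer_batched_take: bool):
        self.rows = rows
        self.prefer_batched_take = prefer_batched_take
        self.calls: list[str] = []

    def take(self, indices: list[int]) -> list[dict]:
        self.calls.append("take")
        return [self.rows[i] for i in indices]

    def get_row(self, idx: int) -> dict:
        self.calls.append("get_row")
        return self.rows[idx]


def fetch_rows(backend, indices: list[int]) -> list[dict]:
    """Use one batched call when the backend asks for it, else go row by row."""
    if getattr(backend, "prefer_batched_take", False):
        return backend.take(indices)  # a single call for the whole batch
    return [backend.get_row(i) for i in indices]  # one call per index
```

With the hint set, a batch of N indices costs one backend call instead of N, which is where the amortization comes from.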

property schema: Schema

slice(start: int, length: int) → Table[source]

property table: Table

Full materialization as a single pa.Table.

Expensive for large datasets. Use get_row, take, slice, or iter_batches on hot paths.

take(indices: ndarray | list[int]) → Table[source]

stable_datasets.backends.lance_video_frames module

Lance-backed random-access video frame segment layout.

class LanceVideoFramesBackend(*, uri: str | Path, window_length: int = 1, frame_skip: int = 0, hop_size: int = 1, min_video_frames: int | None = None, batch_readahead: int = 8)[source]

Bases: object

StorageBackend for the lance-video-frames layout.

The physical Lance dataset stores one WebP-encoded frame per row. The logical dataset exposes deterministic frame windows as samples.
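The mapping from physical frame rows to logical windows is not spelled out in the source. The sketch below shows one plausible reading of the constructor parameters and should be treated as a guess: it assumes frames inside a window are spaced `frame_skip + 1` apart, that window starts advance by `hop_size`, that partial windows are dropped, and that videos shorter than `min_video_frames` contribute nothing. `window_frame_indices` is a hypothetical name.

```python
from typing import Optional


def window_frame_indices(
    num_frames: int,
    window_length: int = 1,
    frame_skip: int = 0,
    hop_size: int = 1,
    min_video_frames: Optional[int] = None,
) -> list[list[int]]:
    """Enumerate frame indices of each window for a single video.

    All semantics here are assumptions made for illustration.
    """
    if window_length < 1 or hop_size < 1 or frame_skip < 0:
        raise ValueError("invalid window parameters")
    if min_video_frames is not None and num_frames < min_video_frames:
        return []  # assumed: short videos are skipped entirely
    stride = frame_skip + 1
    span = (window_length - 1) * stride + 1  # frames covered by one window
    return [
        list(range(start, start + span, stride))
        for start in range(0, num_frames - span + 1, hop_size)
    ]
```

Because the enumeration depends only on the frame count and the parameters, the same logical sample index always resolves to the same frames, which is what makes the windows deterministic.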

property cache_dir: Path

get_row(idx: int) → dict[source]

property is_file_backed: bool

iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None)[source]

property num_rows: int

property num_shards: int

prefer_batched_take: bool = False

property schema: Schema

segment_filename(idx: int) → str[source]

property segment_filenames: list[str]

segment_info(idx: int) → dict[source]

slice(start: int, length: int) → Table[source]

property table: Table

take(indices: ndarray | list[int]) → Table[source]

property video_paths: list[str]

static worker_init(worker_id: int) → None[source]

reset_worker_state() → None[source]

Reset process-local decoder/backend state after a DataLoader fork.

stable_datasets.backends.protocol module

Read-side storage backend protocol.

Defines StorageBackend, the interface StableDataset depends on for row access, iteration, and materialization. Concrete backends (e.g. ArrowBackend) conform structurally.

Arrow types (pa.Table, pa.RecordBatch, pa.Schema) are the boundary types. Members not declared on the protocol are backend-private.
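Structural conformance means a backend satisfies the protocol by having the right members, without inheriting from it. The sketch below demonstrates the idea with typing.Protocol on a deliberately trimmed-down interface. `RowSource` and `ListBackend` are hypothetical and much smaller than StorageBackend.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RowSource(Protocol):
    """Hypothetical, trimmed-down protocol; not the real StorageBackend."""

    @property
    def num_rows(self) -> int: ...

    def get_row(self, idx: int) -> dict: ...


class ListBackend:
    """Conforms structurally: note there is no inheritance from RowSource."""

    def __init__(self, rows: list[dict]):
        self._rows = rows

    @property
    def num_rows(self) -> int:
        return len(self._rows)

    def get_row(self, idx: int) -> dict:
        return self._rows[idx]


def first_row(source: RowSource) -> dict:
    # Accepts anything that exposes the protocol's members.
    return source.get_row(0)
```

A static type checker accepts `first_row(ListBackend(...))` because the member names and signatures line up. The `runtime_checkable` decorator additionally allows an isinstance check, which verifies only that the members exist.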

class StorageBackend(*args, **kwargs)[source]

Bases: Protocol

Read-side storage interface consumed by StableDataset.

get_row(idx: int) → dict[source]

property is_file_backed: bool

iter_batches(shard_indices: list[int] | None = None, shuffle: bool = False, seed: int | None = None) → Iterator[RecordBatch][source]

property num_rows: int

property num_shards: int

property schema: Schema

slice(start: int, length: int) → Table[source]

property table: Table

Full materialization as a single pa.Table.

Expensive for multi-shard datasets. Hot paths should prefer get_row, take, slice, or iter_batches.

take(indices: ndarray | list[int]) → Table[source]

Module contents

Storage backend implementations and protocol exports.

class ArrowBackend(*, shard_paths: list[Path] | None = None, table: Table | None = None, shard_row_counts: list[int] | None = None, schema: Schema | None = None)[source]

class LanceBackend(*, uri: str | Path, batch_readahead: int = 8)[source]

class LanceVideoFramesBackend(*, uri: str | Path, window_length: int = 1, frame_skip: int = 0, hop_size: int = 1, min_video_frames: int | None = None, batch_readahead: int = 8)[source]

class StorageBackend(*args, **kwargs)[source]
