RunScanner
- class stable_pretraining.web.scan.RunScanner(root: Path, poll_interval: float = 1.0)
Bases: object
Polls a directory tree for sidecar and metrics changes and fans out events.
- log_content(run_id: str, stream_id: str, max_bytes: int = 4194304) -> bytes | None
Read the (last max_bytes of the) log identified by stream_id.
Returns None if the run / stream is unknown. Truncates from the front so the most recent output is preserved when the file exceeds the cap.
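The front-truncation behaviour can be illustrated with a minimal sketch. `read_log_tail` is a hypothetical helper, not part of `RunScanner`; it only assumes the log is a regular file on disk and shows one way to keep the newest `max_bytes` bytes.

```python
from pathlib import Path
from typing import Optional


def read_log_tail(path: Path, max_bytes: int = 4 * 1024 * 1024) -> Optional[bytes]:
    """Return at most the last ``max_bytes`` bytes of ``path`` (hypothetical helper)."""
    if not path.is_file():
        # Mirrors the documented "unknown run / stream" result.
        return None
    size = path.stat().st_size
    with path.open("rb") as fh:
        if size > max_bytes:
            # Skip the oldest bytes so the most recent output survives.
            fh.seek(size - max_bytes)
        return fh.read(max_bytes)
```

Seeking to `size - max_bytes` avoids reading the whole file just to discard its beginning, which matters for multi-gigabyte training logs.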
- logs_index(run_id: str) -> dict | None
Discover .out / .err / .log files for a run.
Search order:
1. Anything inside {run_dir}/ matching *.out / *.err.
2. Files in hp.output_dir (Hydra often points training logs here).
3. submitit layout (common for spt + slurm): {output_dir}/../{sweep_id}_{task_id}/.submitit/{sweep_id}_{task_id}_{rank}_log.{out,err}
sweep_id comes from the sweep:N tag and task_id from hp.slurm.task_id (or the trailing _<task> part of run_id). Multiple ranks are returned individually so DDP runs get a per-rank selector.
Returns a dict {"streams": [{name, kind, rank, size, stream_id}, ...]} or None if the run is unknown.
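As a rough illustration of the first two search steps and of per-rank entries, the sketch below globs two directories and reads the rank from the submitit-style file name. `find_streams` and `describe_stream` are hypothetical names; using the file name as `stream_id`, and taking the rank as the last integer before `_log`, are assumptions made for this example. The submitit directory lookup itself is omitted.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the rank is the last integer before "_log" in names such as
# {sweep_id}_{task_id}_{rank}_log.out
_RANK_RE = re.compile(r"^.+_(\d+)_log\.(?:out|err)$")


def describe_stream(path: Path) -> dict:
    """Build one stream entry with the keys listed in the documented return value."""
    match = _RANK_RE.match(path.name)
    return {
        "name": path.name,
        "kind": path.suffix.lstrip("."),
        "rank": int(match.group(1)) if match else None,
        "size": path.stat().st_size,
        "stream_id": path.name,  # assumption: any stable identifier would do
    }


def find_streams(run_dir: Path, output_dir: Optional[Path] = None) -> dict:
    """Collect *.out / *.err files from run_dir first, then output_dir."""
    seen, streams = set(), []
    for directory in (run_dir, output_dir):
        if directory is None or not directory.is_dir():
            continue
        for pattern in ("*.out", "*.err"):
            for path in sorted(directory.glob(pattern)):
                resolved = path.resolve()
                if path.is_file() and resolved not in seen:
                    seen.add(resolved)  # avoid listing the same file twice
                    streams.append(describe_stream(path))
    return {"streams": streams}
```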
- media_file_path(run_id: str, rel_path: str) -> Path | None
Resolve a media file path safely, with .. traversal blocked.
The resolved file must live under {run_dir}/media/ and must actually exist as a file. Returns None otherwise; the caller should respond with 404.
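A common way to implement this kind of containment check is to resolve both paths and verify that one lies under the other. The sketch below is an assumption about how such a check could look, not the library's actual code; `resolve_media_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_media_path(run_dir: Path, rel_path: str) -> Optional[Path]:
    """Return the file under {run_dir}/media/ named by rel_path, or None."""
    media_root = (run_dir / "media").resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below sees the real target. An absolute rel_path
    # replaces media_root entirely and is rejected by the same check.
    candidate = (media_root / rel_path).resolve()
    try:
        candidate.relative_to(media_root)
    except ValueError:
        return None  # escaped the media directory
    return candidate if candidate.is_file() else None
```

Checking the resolved path rather than scanning the string for `..` also covers symlinks that point outside the media directory.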
- media_json(run_id: str) -> dict | None
Return the media events for a run by parsing media.jsonl.
Returns None if the run is unknown. Returns {"events": []} if there is no media yet (empty/missing file). Each event has at least step, tag, type, path; videos may also have fps and format.
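The three documented outcomes (unknown run, no media yet, parsed events) can be sketched as follows. The location of `media.jsonl` directly inside the run directory, the skipping of unparsable lines, and the name `load_media_events` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional

REQUIRED_KEYS = {"step", "tag", "type", "path"}


def load_media_events(run_dir: Optional[Path]) -> Optional[dict]:
    """Parse {run_dir}/media.jsonl into {"events": [...]} (hypothetical helper)."""
    if run_dir is None or not run_dir.is_dir():
        return None  # unknown run
    path = run_dir / "media.jsonl"  # assumption: file sits at the run root
    if not path.is_file():
        return {"events": []}  # no media yet
    events = []
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: tolerate a partially written trailing line.
                continue
            if isinstance(event, dict) and REQUIRED_KEYS.issubset(event):
                events.append(event)
    return {"events": events}
```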
- metrics_json_bytes(run_id: str) -> bytes | None
Return the metrics response as pre-serialised, NaN-safe JSON bytes.
The first call materialises the dict (reusing the structured cache via metrics_json()) and serialises it once; subsequent calls return the cached bytes directly, so the HTTP layer is reduced to a memcpy.
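The serialise-once pattern can be sketched as below. How the library makes the output NaN-safe is not stated here; mapping non-finite floats to `null` is an assumption of this example, as are the names `BytesCache` and `_sanitize`.

```python
import json
import math
from typing import Any, Callable, Dict, Optional


def _sanitize(value: Any) -> Any:
    """Replace NaN / inf with None so the output is strictly valid JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: _sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_sanitize(item) for item in value]
    return value


class BytesCache:
    """Serialise each run's payload once and hand back the same bytes afterwards."""

    def __init__(self, loader: Callable[[str], Optional[dict]]) -> None:
        self._loader = loader  # stands in for the structured-dict lookup
        self._bytes: Dict[str, bytes] = {}

    def get(self, run_id: str) -> Optional[bytes]:
        cached = self._bytes.get(run_id)
        if cached is not None:
            return cached
        payload = self._loader(run_id)
        if payload is None:
            return None  # unknown run
        # allow_nan=False makes json.dumps raise if a NaN slipped through.
        data = json.dumps(
            _sanitize(payload), allow_nan=False, separators=(",", ":")
        ).encode("utf-8")
        self._bytes[run_id] = data
        return data
```

Python's `json.dumps` emits a bare `NaN` token by default, which strict JSON parsers reject, so some sanitising step is needed before the bytes reach a browser.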
- metrics_stream(run_id: str)
Yield metrics for run_id in chunks as the CSV is parsed.
Each yielded value is a dict with shape:
{"chunk": int, "metrics": {<name>: {"step": [...], "epoch": [...], "y": [...]}}}
and the final yielded value is {"done": true}. If the metrics are already cached we emit them as a single chunk followed by done, so callers don't need a separate fast-path. If the run is unknown we yield None (the HTTP layer turns this into a 404).
While streaming we accumulate the parsed structure in memory and publish it to self._metrics_cache on completion, so successive reads of the same run hit the warm cache.
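The chunking protocol can be sketched with a plain generator. The CSV layout (a `step` column, an `epoch` column, and one column per metric), the `rows_per_chunk` parameter, and the name `stream_metrics` are assumptions for illustration; a plain dict stands in for the metrics cache.

```python
import csv
import io
from typing import Dict, Iterator, Optional


def stream_metrics(
    csv_text: Optional[str], cache: Dict[str, dict], run_id: str, rows_per_chunk: int = 2
) -> Iterator[Optional[dict]]:
    """Yield chunked metrics, then {"done": True}; yield None for an unknown run."""
    if run_id in cache:
        # Warm cache: a single chunk followed by the terminator.
        yield {"chunk": 0, "metrics": cache[run_id]}
        yield {"done": True}
        return
    if csv_text is None:
        yield None  # unknown run
        return
    full: dict = {}
    chunk: dict = {}
    chunk_index = rows_in_chunk = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        step, epoch = int(row["step"]), int(row["epoch"])
        for name, raw in row.items():
            if name in ("step", "epoch") or raw in ("", None):
                continue
            for target in (full, chunk):
                series = target.setdefault(name, {"step": [], "epoch": [], "y": []})
                series["step"].append(step)
                series["epoch"].append(epoch)
                series["y"].append(float(raw))
        rows_in_chunk += 1
        if rows_in_chunk == rows_per_chunk:
            yield {"chunk": chunk_index, "metrics": chunk}
            chunk, chunk_index, rows_in_chunk = {}, chunk_index + 1, 0
    if chunk:
        yield {"chunk": chunk_index, "metrics": chunk}
    cache[run_id] = full  # publish only after a complete parse
    yield {"done": True}
```

Publishing to the cache only on completion means an interrupted stream never leaves a partial entry behind.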
- start() -> None
Start the background scanner thread.
Returns immediately. The first scan runs in the background so that callers (e.g. the HTTP server) become responsive before a potentially slow NFS walk over thousands of sidecars completes. Clients learn about discovered runs via the SSE /api/stream channel.
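The non-blocking start can be illustrated with a daemon thread and a stop event. `BackgroundPoller` and its `stop()` method are hypothetical and only show the general pattern; the scan callback stands in for the directory walk.

```python
import threading
from typing import Callable, Optional


class BackgroundPoller:
    """Run scan_once() repeatedly on a background thread."""

    def __init__(self, scan_once: Callable[[], None], poll_interval: float = 1.0) -> None:
        self._scan_once = scan_once
        self._poll_interval = poll_interval
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Return immediately; the first scan happens on the worker thread."""
        if self._thread is not None:
            return  # already started
        self._thread = threading.Thread(target=self._loop, name="run-scanner", daemon=True)
        self._thread.start()

    def stop(self) -> None:
        """Hypothetical shutdown hook: signal the loop and wait for it to exit."""
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def _loop(self) -> None:
        while not self._stop.is_set():
            self._scan_once()
            # wait() doubles as an interruptible sleep between scans.
            self._stop.wait(self._poll_interval)
```

Because the first scan runs inside `_loop`, a slow initial walk delays only the worker thread, not the caller of `start()`.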