Running ML Infrastructure Without Losing Your Mind

Notes from Building and Operating NexaCompute


One thing I didn’t fully appreciate early in my ML career is how much infrastructure shapes thinking.
Not in an abstract sense, but very concretely: what experiments you try, how often you iterate, how confident you are in results, and how much mental energy you burn just keeping things alive.

A lot of ML today works because engineers are smart and willing to stitch things together under pressure.
That’s impressive. It’s also fragile.

NexaCompute grew out of wanting something more mechanical, more explicit, and easier to reason about over time.
This post is a walkthrough of how the system works, how I operate it day-to-day, and the principles that shaped it.


The Problem Space

Most ML infra pain doesn’t come from models.
It comes from:

When something crashes mid-run, or when you want to understand months later why something worked, you often realize how much was never recorded.

NexaCompute is designed to eliminate as much of that ambiguity as possible.


The Core Mental Model

Everything is organized around three environments, each with a single responsibility.

Control Plane

This is the local environment.

There is no large data, no checkpoints, and no training here.


Compute Plane

This is where execution happens.

If a machine dies, it’s reprovisioned. Nothing critical lives only here.


Storage Plane

This is the system of record.

Object storage (e.g. Wasabi) is canonical.
Local disks are treated as scratch space only.
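
To make the scratch-versus-canonical split concrete, here is a minimal sketch in Python. It is not NexaCompute's actual code: the `scratch_run` helper, the `upload` callback, and the `runs/<run_id>/...` key layout are all assumptions made for illustration. The job writes into a throwaway directory, every file is handed to an uploader on clean exit, and the directory is deleted either way.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator

# Hypothetical uploader signature: (local_file, object_key) -> None.
# In practice this would wrap whatever object-storage client is in use.
Uploader = Callable[[Path, str], None]


@contextmanager
def scratch_run(run_id: str, upload: Uploader) -> Iterator[Path]:
    """Yield a throwaway local directory; persist its files on clean exit."""
    scratch = Path(tempfile.mkdtemp(prefix=f"{run_id}-"))
    try:
        yield scratch
        # Only reached when the body did not raise: push every file to storage.
        for path in sorted(scratch.rglob("*")):
            if path.is_file():
                rel = path.relative_to(scratch).as_posix()
                upload(path, f"runs/{run_id}/{rel}")  # assumed key layout
    finally:
        # Scratch is always deleted, so nothing can live only on local disk.
        shutil.rmtree(scratch, ignore_errors=True)


# Usage with an in-memory stand-in for object storage.
store: dict = {}


def upload_to_dict(path: Path, key: str) -> None:
    store[key] = path.read_bytes()


with scratch_run("demo", upload_to_dict) as workdir:
    (workdir / "metrics.json").write_text('{"loss": 0.1}')
```

In this sketch a failed run uploads nothing. A real system would probably still want to ship partial logs on failure, which is a design choice this example leaves out.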


A Typical Workflow

The workflow is intentionally boring.

  1. Define intent
    Decide what to train, the architecture, and constraints.

  2. Prepare data with NexaData
    Data is curated, validated, sharded, and written to storage with a dataset manifest.

  3. Provision compute
    GPU and CPU resources are allocated based on the job.

  4. Bootstrap the environment
    Dependencies are pinned. Known-good stacks are used. No live debugging.

  5. Launch via NexaCompute
    A single command launches the full pipeline.

  6. Observe
    Logs and metrics are monitored. No manual intervention.

This flow does not change with scale.
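
One way to picture "the flow does not change with scale" is a fixed, ordered tuple of stages where only the config varies. The sketch below is illustrative only: the stage names mirror the six steps above, but `RunContext`, `make_stage`, and `run_pipeline` are hypothetical names, and each stage just records that it ran instead of doing real work.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass
class RunContext:
    """State passed from stage to stage (all fields are illustrative)."""
    config: dict
    events: list = field(default_factory=list)


Stage = Callable[[RunContext], None]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def run(ctx: RunContext) -> None:
        ctx.events.append(name)  # a real stage would do the actual work here
    run.__name__ = name
    return run


# The order is fixed in code; there is no second way to launch a run.
PIPELINE: Tuple[Stage, ...] = tuple(
    make_stage(name)
    for name in (
        "define_intent",
        "prepare_data",
        "provision_compute",
        "bootstrap_environment",
        "launch",
        "observe",
    )
)


def run_pipeline(config: dict) -> RunContext:
    ctx = RunContext(config=config)
    for stage in PIPELINE:
        stage(ctx)  # any exception aborts the run; nothing is patched mid-flight
    return ctx


small = run_pipeline({"gpus": 1})
large = run_pipeline({"gpus": 64})
# Same stages in the same order; only the config differs.
```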


Opinionated by Design

NexaCompute is opinionated because entropy is the real enemy.

Every extra way to launch a job, every implicit convention, every undocumented assumption adds cognitive load.

So the system enforces:

These choices reduce surface area for mistakes.


Logging and Manifests

Rule: No manifest, no run.

Every run produces:

This enables:

Reproducibility here is practical, not ceremonial.
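
A rough sketch of how a "no manifest, no run" gate could look in Python. The manifest fields (`run_id`, `config_sha256`, `dataset_manifest`), the `manifest.json` filename, and the function names are assumptions chosen for the example, not the real NexaCompute schema.

```python
import hashlib
import json
from pathlib import Path

# Assumed minimal schema; a real manifest would likely carry more.
REQUIRED_FIELDS = ("run_id", "config_sha256", "dataset_manifest")


def config_digest(config: dict) -> str:
    """Hash a config deterministically (sorted keys, compact separators)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def write_manifest(run_dir: Path, run_id: str, config: dict,
                   dataset_manifest: str) -> Path:
    """Record what is about to run before anything executes."""
    manifest = {
        "run_id": run_id,
        "config_sha256": config_digest(config),
        "dataset_manifest": dataset_manifest,
        "config": config,
    }
    path = run_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path


def require_manifest(run_dir: Path) -> dict:
    """Gate a launch: raise unless a complete manifest exists."""
    path = run_dir / "manifest.json"
    if not path.is_file():
        raise RuntimeError(f"no manifest in {run_dir}; refusing to run")
    manifest = json.loads(path.read_text())
    missing = [key for key in REQUIRED_FIELDS if key not in manifest]
    if missing:
        raise RuntimeError(f"manifest missing fields: {missing}")
    return manifest
```

Hashing the canonical JSON form means two configs with the same values but different key order get the same digest, which makes "is this the same run?" a string comparison.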


Hyperparameter Sweeps

Hyperparameters are treated as a search problem, not a hunch.

This replaces guessing with evidence and builds real intuition over time.
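
As a small illustration of treating hyperparameters as search, here is a plain grid sweep in Python. Grid search is simply the easiest strategy to show and is not necessarily what NexaCompute uses; the search space, the toy objective, and the function names are all made up for the example.

```python
import itertools
from typing import Callable


def grid(space: dict) -> list:
    """Expand {name: [values]} into one dict per combination."""
    keys = sorted(space)
    combos = itertools.product(*(space[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


def sweep(space: dict, objective: Callable[[dict], float]) -> list:
    """Evaluate every point and return all trials, best (lowest) score first."""
    trials = [{"params": p, "score": objective(p)} for p in grid(space)]
    return sorted(trials, key=lambda trial: trial["score"])


def toy_objective(params: dict) -> float:
    """Stand-in for a real training run; lower is better."""
    return abs(params["lr"] - 0.01) + abs(params["batch_size"] - 64) / 1000


# Hypothetical search space.
space = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64]}
trials = sweep(space, toy_objective)
best = trials[0]["params"]
```

Every trial is kept, not just the winner, so the full table of results can be stored alongside the run and inspected later.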


Repository Structure (High-Level)

The monorepo is modular but tightly integrated.

Core Modules

Core Infrastructure

Tooling


Rust, Python, Go: Clear Boundaries

Each language has a role.

Rust is used where determinism, performance, and memory behavior matter.
Python remains the control layer.


Monorepo Tradeoffs

A monorepo works here because:

In larger organizations, a polyrepo may make sense.
Here, a monorepo reduces friction.


Documentation and Agents

Because everything is explicit—configs, manifests, DAGs—the system is teachable.

This benefits:

Ambiguous systems don’t scale well. Explicit ones do.


What This Enables

The difficult parts are no longer operational.

The bottlenecks become:

That’s the right place for difficulty to live.


Closing Thoughts

NexaCompute isn’t about novelty.
It’s about applying discipline to the unglamorous parts of ML.

When execution becomes mechanical, attention shifts to architecture and decisions.
That’s where ML work becomes sustainable.