Running ML Infrastructure Without Losing Your Mind

17 Jan, 2026

Notes from Building and Operating NexaCompute

One thing I didn’t fully appreciate early in my ML career is how much infrastructure shapes thinking.
Not in an abstract sense, but very concretely: what experiments you try, how often you iterate, how confident you are in results, and how much mental energy you burn just keeping things alive.

A lot of ML today works because engineers are smart and willing to stitch things together under pressure.
That’s impressive. It’s also fragile.

NexaCompute grew out of wanting something more mechanical, more explicit, and easier to reason about over time.
This post is a walkthrough of how the system works, how I operate it day-to-day, and the principles that shaped it.

The Problem Space

Most ML infra pain doesn’t come from models.
It comes from:

Implicit state
One-off scripts
Runs that can’t be reconstructed
Knowledge living in people’s heads

When something crashes mid-run, or months later you want to understand why something worked, you often realize how much was never recorded.

NexaCompute is designed to eliminate as much of that ambiguity as possible.

The Core Mental Model

Everything is organized around three environments, each with a single responsibility.

Control Plane

This is the local environment.

Holds the NexaCompute monorepo
Contains configs (YAML), launch scripts, documentation
Where design and decision-making happen

No large datasets, checkpoints, or training runs live here.

Compute Plane

This is where execution happens.

Provisioned GPU machines (single GPU → multi-H100 clusters)
All work runs inside tmux
Nodes are treated as ephemeral

If a machine dies, it’s reprovisioned. Nothing critical lives only here.

Storage Plane

This is the system of record.

Raw datasets
Processed datasets
Distillation outputs
Checkpoints
Evaluation artifacts
Manifests

Object storage (e.g. Wasabi) is canonical.
Local disks are treated as scratch space only.
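To make "object storage is canonical" concrete, here is a minimal sketch of how artifact locations could be derived. The artifact kind names, the helper names, and the bucket name are all hypothetical, not taken from NexaCompute itself. The `s3://` scheme is an assumption based on Wasabi exposing an S3-compatible API.

```python
from pathlib import PurePosixPath

# Hypothetical artifact kinds, mirroring the Storage Plane list above.
ARTIFACT_KINDS = {"raw", "processed", "distill", "checkpoints", "eval", "manifests"}


def canonical_uri(bucket: str, kind: str, *parts: str) -> str:
    """Build the object-storage URI under which an artifact is recorded."""
    if kind not in ARTIFACT_KINDS:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    key = PurePosixPath(kind, *parts)
    return f"s3://{bucket}/{key}"


def is_canonical(uri: str) -> bool:
    """Only object-storage URIs count as the system of record; local paths are scratch."""
    return uri.startswith("s3://")


# "example-bucket" and "run-001" are placeholders.
uri = canonical_uri("example-bucket", "checkpoints", "run-001", "step-1000.pt")
```

A check like `is_canonical` can run before a job is marked complete, so nothing is considered saved while it exists only on a node's local disk.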

A Typical Workflow

The workflow is intentionally boring.

1. Define intent
   Decide what to train, the architecture, and constraints.
2. Prepare data with NexaData
   Data is curated, validated, sharded, and written to storage with a dataset manifest.
3. Provision compute
   GPUs and CPU resources are allocated based on the job.
4. Bootstrap the environment
   Dependencies are pinned. Known-good stacks are used. No live debugging.
5. Launch via NexaCompute
   A single command launches the full pipeline.
6. Observe
   Logs and metrics are monitored. No manual intervention.

This flow does not change with scale.
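A rough sketch of what "the same flow at any scale" can look like in code. The stage names and the `run_pipeline` helper are illustrative assumptions, not the real NexaCompute launcher. Defining intent is a human step, so it is left out. The stage list is fixed, and only the handlers behind each stage know whether they are targeting one GPU or a cluster.

```python
import json
import time

# Hypothetical stage names for the automated part of the workflow.
STAGES = ["prepare_data", "provision", "bootstrap", "launch", "observe"]


def run_pipeline(handlers, log=print):
    """Run every stage in a fixed order; stop at the first failure.

    `handlers` maps each stage name to a zero-argument callable.
    One JSON log line is emitted per stage that was attempted.
    """
    for name in STAGES:
        started = time.time()
        try:
            handlers[name]()
            status = "ok"
        except Exception as exc:
            status = f"failed: {exc}"
        record = {
            "stage": name,
            "status": status,
            "seconds": round(time.time() - started, 3),
        }
        log(json.dumps(record))
        if status != "ok":
            return False
    return True
```

Passing `log` as a parameter keeps the sketch testable. In practice the records would go to whatever structured log sink the run uses.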

Opinionated by Design

NexaCompute is opinionated because entropy is the real enemy.

Every extra way to launch a job, every implicit convention, every undocumented assumption adds cognitive load.

So the system enforces:

Explicit DAGs
Mandatory manifests
Clear module boundaries
Standardized launch patterns

These choices reduce surface area for mistakes.
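As an illustration of what an explicit DAG buys you, here is a small validation sketch using Python's standard-library `graphlib` (Python 3.9+). The stage names and the `validate_dag` function are assumptions for the example. The point is that a pipeline with a missing or circular dependency is rejected before any GPU time is spent.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph is broken.

    `deps` maps each stage to the set of stages it depends on.
    """
    known = set(deps)
    for stage, parents in deps.items():
        missing = parents - known
        if missing:
            raise ValueError(
                f"{stage!r} depends on undeclared stages: {sorted(missing)}"
            )
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"cycle in pipeline: {exc.args[1]}") from exc


# Hypothetical pipeline: each stage lists what must finish before it.
pipeline = {
    "data": set(),
    "distill": {"data"},
    "train": {"distill"},
    "eval": {"train"},
}
order = validate_dag(pipeline)
```

For this linear chain the only valid order is data, distill, train, eval. A typo such as `"train": {"distil"}` fails immediately with a readable error instead of surfacing hours into a run.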

Logging and Manifests

Rule: No manifest, no run.

Every run produces:

A run manifest (config, code version, dataset versions)
Structured logs
Artifact references

This enables:

Run reconstruction after crashes
Reliable comparisons
Post-hoc analysis
Auditable decisions

Reproducibility here is practical, not ceremonial.
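A minimal sketch of the "no manifest, no run" rule, assuming a manifest with exactly the three fields listed above. The `RunManifest` class, the `launch` function, and the idea of deriving a run ID from a content hash are my assumptions for illustration, not the actual NexaCompute schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    config: dict            # hyperparameters and pipeline settings
    code_version: str       # e.g. a git commit hash
    dataset_versions: dict  # dataset name -> version identifier

    def to_json(self) -> str:
        # Sorted keys make the serialization stable across runs.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))

    def run_id(self) -> str:
        # Same inputs always produce the same ID, which makes comparisons easy.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


def launch(manifest, train_fn):
    """Refuse to start unless a complete manifest is supplied."""
    if not isinstance(manifest, RunManifest):
        raise RuntimeError("no manifest, no run")
    if not manifest.code_version or not manifest.dataset_versions:
        raise RuntimeError("manifest is incomplete: code and dataset versions required")
    return {"run_id": manifest.run_id(), "result": train_fn(manifest.config)}
```

Because the ID is derived from the manifest contents, two runs with identical config, code, and data share an ID, and any difference between two runs can be found by diffing two small JSON documents.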

Hyperparameter Sweeps

Hyperparameters are treated as a search problem, not a hunch.

Slurm for distributed scheduling
Optuna-style structured sweeps
One manifest per trial

This replaces guessing with evidence and builds real intuition over time.
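To show the "one manifest per trial" idea without depending on any particular library, here is a sketch that uses seeded random search from the standard library as a stand-in for an Optuna-style sampler. The search space, the field names, and `sample_trials` are hypothetical. Slurm submission is out of scope for the example.

```python
import json
import random


def sample_trials(space, n_trials, seed):
    """Draw `n_trials` parameter sets and return one manifest dict per trial.

    `space` holds a log10 range for the learning rate and a list of batch sizes.
    """
    rng = random.Random(seed)  # a fixed seed makes the sweep itself reproducible
    trials = []
    for index in range(n_trials):
        low, high = space["log10_lr"]
        params = {
            "lr": 10 ** rng.uniform(low, high),  # log-uniform learning rate
            "batch_size": rng.choice(space["batch_size"]),
        }
        trials.append({"trial": index, "sweep_seed": seed, "params": params})
    return trials


space = {"log10_lr": (-5.0, -3.0), "batch_size": [16, 32, 64]}
manifests = sample_trials(space, n_trials=4, seed=0)

# Each trial serializes to its own JSON document, ready to be written to storage.
serialized = [json.dumps(m, sort_keys=True) for m in manifests]
```

Recording the sweep seed in every trial manifest means the full set of sampled configurations can be regenerated later, even if individual trial outputs are lost.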

Repository Structure (High-Level)

The monorepo is modular but tightly integrated.

Core Modules

NexaData – data curation, validation, sharding
NexaDistill – synthetic data generation and filtering
NexaTrain – fine-tuning, pre-training, post-training
NexaEval – evaluation as a reproducible pipeline
NexaInference – serving, caching, embeddings, monitoring

Core Infrastructure

Manifests
DAG validation
Storage abstraction
Retry, timeout, and circuit-breaker logic
Policy enforcement
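For readers unfamiliar with the retry and circuit-breaker pattern named above, here is a compact sketch. The class and function names are mine, the thresholds are arbitrary, and timeouts are omitted to keep it short. It is an illustration of the pattern, not the NexaCompute implementation.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when too many consecutive failures have been seen."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures so far

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            raise CircuitOpenError("circuit open: refusing to call a failing dependency")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the counter
        return result


def retry(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call `fn` up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying cannot help once the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retries absorb transient faults such as a dropped storage connection, while the breaker stops a job from hammering a dependency that is clearly down. A fuller version would also reopen the circuit after a cool-down period.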

Tooling

CLI utilities
Tests-as-CLI
Documentation and runbooks
Agent-compatible interfaces

Rust, Python, Go: Clear Boundaries

Each language has a role.

Python – orchestration, glue, pipeline logic
Rust – data processing, validation, batching, memory-sensitive paths
Go / UI tooling – operator experience, TUIs, dashboards

Rust is used where determinism, performance, and memory behavior matter.
Python remains the control layer.

Monorepo Tradeoffs

A monorepo works here because:

Development is mostly solo
Dependency management must be centralized
Cross-module reasoning matters
Automation is simpler

In larger organizations, a poly-repo may make sense.
Here, a monorepo reduces friction.

Documentation and Agents

Because everything is explicit (configs, manifests, DAGs), the system is teachable.

This benefits:

Humans onboarding
Automation
Agents

Ambiguous systems don’t scale well. Explicit ones do.

What This Enables

The difficult parts are no longer operational.

The bottlenecks become:

Securing compute
Choosing good problems

That’s the right place for difficulty to live.

Closing Thoughts

NexaCompute isn’t about novelty.
It’s about applying discipline to the unglamorous parts of ML.

When execution becomes mechanical, attention shifts to architecture and decisions.
That’s where ML work becomes sustainable.