Running ML Infrastructure Without Losing Your Mind
Notes from Building and Operating NexaCompute
One thing I didn’t fully appreciate early in my ML career is how much infrastructure shapes thinking.
Not in an abstract sense, but very concretely: what experiments you try, how often you iterate, how confident you are in results, and how much mental energy you burn just keeping things alive.
A lot of ML today works because engineers are smart and willing to stitch things together under pressure.
That’s impressive. It’s also fragile.
NexaCompute grew out of wanting something more mechanical, more explicit, and easier to reason about over time.
This post is a walkthrough of how the system works, how I operate it day-to-day, and the principles that shaped it.
The Problem Space
Most ML infra pain doesn’t come from models.
It comes from:
- Implicit state
- One-off scripts
- Runs that can’t be reconstructed
- Knowledge living in people’s heads
When something crashes mid-run, or when you try to understand months later why something worked, you realize how much was never recorded.
NexaCompute is designed to eliminate as much of that ambiguity as possible.
The Core Mental Model
Everything is organized around three environments, each with a single responsibility.
Control Plane
This is the local environment.
- Holds the NexaCompute monorepo
- Contains configs (YAML), launch scripts, documentation
- Where design and decision-making happens
There is no large data, no checkpoints, and no training here.
Compute Plane
This is where execution happens.
- Provisioned GPU machines (single GPU → multi-H100 clusters)
- All work runs inside tmux
- Nodes are treated as ephemeral
If a machine dies, it’s reprovisioned. Nothing critical lives only here.
Storage Plane
This is the system of record.
- Raw datasets
- Processed datasets
- Distillation outputs
- Checkpoints
- Evaluation artifacts
- Manifests
Object storage (e.g. Wasabi) is canonical.
Local disks are treated as scratch space only.
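To make the "scratch only" rule concrete, here is a minimal sketch of publishing a local file to the canonical store and then deleting the local copy. Everything in it is an assumption for illustration: `ObjectStore`, `InMemoryStore`, `publish`, and the key layout are hypothetical names, and an in-memory dict stands in for a real bucket so the example runs without credentials.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical interface for the canonical store (e.g. an S3-compatible bucket)."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend so the sketch runs anywhere; not a real object store."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def publish(store: ObjectStore, local_path: Path, key: str) -> str:
    """Upload a scratch file, verify it by reading it back, then delete the local copy.

    Returns the SHA-256 digest so a manifest can reference the exact bytes.
    """
    data = local_path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    store.put(key, data)
    if hashlib.sha256(store.get(key)).hexdigest() != digest:
        raise IOError(f"upload of {key} did not round-trip")
    local_path.unlink()  # the scratch copy is disposable once the store has it
    return digest


if __name__ == "__main__":
    store = InMemoryStore()
    with tempfile.TemporaryDirectory() as scratch:
        ckpt = Path(scratch) / "step_100.pt"
        ckpt.write_bytes(b"fake checkpoint bytes")
        digest = publish(store, ckpt, "checkpoints/run-001/step_100.pt")
        print(digest[:12], ckpt.exists())
```

The read-back check is the design choice that matters here: the local file is only removed after the store has proven it holds the same bytes.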
A Typical Workflow
The workflow is intentionally boring.
1. Define intent
   Decide what to train, the architecture, and constraints.
2. Prepare data with NexaData
   Data is curated, validated, sharded, and written to storage with a dataset manifest.
3. Provision compute
   GPUs and CPU resources are allocated based on the job.
4. Bootstrap the environment
   Dependencies are pinned. Known-good stacks are used. No live debugging.
5. Launch via NexaCompute
   A single command launches the full pipeline.
6. Observe
   Logs and metrics are monitored. No manual intervention.
This flow does not change with scale.
Opinionated by Design
NexaCompute is opinionated because entropy is the real enemy.
Every extra way to launch a job, every implicit convention, every undocumented assumption adds cognitive load.
So the system enforces:
- Explicit DAGs
- Mandatory manifests
- Clear module boundaries
- Standardized launch patterns
These choices reduce surface area for mistakes.
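As an illustration of what "explicit DAGs" can mean in practice, the sketch below validates a pipeline graph before anything runs. It uses Python's standard `graphlib` module; the stage names and the `validate_dag` helper are hypothetical and are not taken from the NexaCompute codebase.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "prepare_data": set(),
    "distill": {"prepare_data"},
    "train": {"prepare_data", "distill"},
    "evaluate": {"train"},
}


def validate_dag(stages: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the graph is malformed."""
    # Every dependency must itself be a declared stage.
    unknown = {dep for deps in stages.values() for dep in deps} - set(stages)
    if unknown:
        raise ValueError(f"undeclared stages referenced: {sorted(unknown)}")
    try:
        # static_order() raises CycleError while iterating if a cycle exists.
        return list(TopologicalSorter(stages).static_order())
    except CycleError as exc:
        # Per the graphlib docs, the detected cycle is the second element of args.
        raise ValueError(f"cycle detected: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(validate_dag(PIPELINE))
    # ['prepare_data', 'distill', 'train', 'evaluate']
```

Failing at validation time, before any GPU is provisioned, is the point: a typo in a stage name or an accidental cycle costs seconds instead of a wasted run.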
Logging and Manifests
Rule: No manifest, no run.
Every run produces:
- A run manifest (config, code version, dataset versions)
- Structured logs
- Artifact references
This enables:
- Run reconstruction after crashes
- Reliable comparisons
- Post-hoc analysis
- Auditable decisions
Reproducibility here is practical, not ceremonial.
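A run manifest along these lines might look like the following sketch. The field names, the `launch` guard, and the `current_commit` helper are assumptions made for illustration; the post only states that a manifest records config, code version, and dataset versions.

```python
import hashlib
import json
import subprocess
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical run manifest; field names are illustrative."""

    run_id: str
    config: dict
    code_version: str
    dataset_versions: dict
    artifacts: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def config_hash(self) -> str:
        # Canonical JSON (sorted keys) so key order does not change the hash.
        canonical = json.dumps(self.config, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def to_json(self) -> str:
        payload = asdict(self) | {"config_hash": self.config_hash()}
        return json.dumps(payload, sort_keys=True, indent=2)


def current_commit() -> str:
    """Best-effort git commit hash; returns 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def launch(manifest: Optional[RunManifest], job: Callable[[], None]) -> None:
    """Enforce 'no manifest, no run' before the job is allowed to start."""
    if manifest is None:
        raise RuntimeError("no manifest, no run")
    if not manifest.dataset_versions:
        raise RuntimeError("manifest must pin at least one dataset version")
    print(manifest.to_json())  # a real system would persist this to storage first
    job()


if __name__ == "__main__":
    manifest = RunManifest(
        run_id="run-001",
        config={"lr": 1e-4, "epochs": 3},
        code_version=current_commit(),
        dataset_versions={"example-dataset": "v1"},
    )
    launch(manifest, lambda: print("training would start here"))
```

Hashing a canonical form of the config gives a cheap way to tell whether two runs really used the same settings, which is what makes comparisons between runs trustworthy.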
Hyperparameter Sweeps
Hyperparameters are treated as a search problem, not a hunch.
- Slurm for distributed scheduling
- Optuna-style structured sweeps
- One manifest per trial
This replaces guessing with evidence and builds real intuition over time.
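The "one manifest per trial" idea can be sketched without any scheduler at all. The example below deliberately uses plain seeded random search from the standard library as a stand-in; it does not use Optuna or Slurm, and the search space, the toy `objective`, and the `TrialManifest` fields are all hypothetical.

```python
import json
import math
import random
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TrialManifest:
    """Hypothetical per-trial record; enough to rerun the trial exactly."""

    sweep_id: str
    trial_id: int
    seed: int
    params: dict
    metric: Optional[float] = None


def sample_params(rng: random.Random) -> dict:
    """Illustrative search space: log-uniform learning rate, categorical batch size."""
    return {
        "lr": 10 ** rng.uniform(-5, -3),
        "batch_size": rng.choice([16, 32, 64]),
    }


def objective(params: dict) -> float:
    """Toy stand-in for a training run; lower is better, best near lr = 1e-4."""
    lr_term = (math.log10(params["lr"]) + 4) ** 2
    batch_term = 0.01 * abs(params["batch_size"] - 32) / 16
    return lr_term + batch_term


def run_sweep(sweep_id: str, n_trials: int, base_seed: int = 0) -> list[TrialManifest]:
    trials = []
    for trial_id in range(n_trials):
        seed = base_seed + trial_id  # each trial is reproducible on its own
        rng = random.Random(seed)
        manifest = TrialManifest(sweep_id, trial_id, seed, sample_params(rng))
        manifest.metric = objective(manifest.params)
        trials.append(manifest)
    return trials


if __name__ == "__main__":
    trials = run_sweep("sweep-demo", n_trials=20)
    best = min(trials, key=lambda t: t.metric)
    print(json.dumps(asdict(best), indent=2))
```

Because each trial derives its sampler from its own seed, any single trial can be rerun from its manifest alone, without replaying the rest of the sweep.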
Repository Structure (High-Level)
The monorepo is modular but tightly integrated.
Core Modules
- NexaData – data curation, validation, sharding
- NexaDistill – synthetic data generation and filtering
- NexaTrain – fine-tuning, pre-training, post-training
- NexaEval – evaluation as a reproducible pipeline
- NexaInference – serving, caching, embeddings, monitoring
Core Infrastructure
- Manifests
- DAG validation
- Storage abstraction
- Retry, timeout, and circuit-breaker logic
- Policy enforcement
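For readers unfamiliar with the retry and circuit-breaker items above, here is a minimal sketch of both patterns. It is a generic illustration, not NexaCompute's implementation: the class and function names, defaults, and the injectable `clock` and `sleep` parameters are assumptions, and timeout handling is left out to keep it short.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; refusing call")
            self._opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit again
        return result


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` up to `attempts` times with exponential backoff between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying against an open circuit is pointless
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A typical combination would be `retry(lambda: breaker.call(upload))`, where `upload` is whatever flaky operation is being protected: transient errors get retried, while a persistently failing dependency trips the breaker and stops the retries early.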
Tooling
- CLI utilities
- Tests-as-CLI
- Documentation and runbooks
- Agent-compatible interfaces
Rust, Python, Go: Clear Boundaries
Each language has a role.
- Python – orchestration, glue, pipeline logic
- Rust – data processing, validation, batching, memory-sensitive paths
- Go / UI tooling – operator experience, TUIs, dashboards
Rust is used where determinism, performance, and memory behavior matter.
Python remains the control layer.
Monorepo Tradeoffs
A monorepo works here because:
- Development is mostly solo
- Dependency management must be centralized
- Cross-module reasoning matters
- Automation is simpler
In larger organizations, a poly-repo may make sense.
Here, a monorepo reduces friction.
Documentation and Agents
Because everything is explicit—configs, manifests, DAGs—the system is teachable.
This benefits:
- Humans onboarding
- Automation
- Agents
Ambiguous systems don’t scale well. Explicit ones do.
What This Enables
The difficult parts are no longer operational.
The bottlenecks become:
- Securing compute
- Choosing good problems
That’s the right place for difficulty to live.
Closing Thoughts
NexaCompute isn’t about novelty.
It’s about applying discipline to the unglamorous parts of ML.
When execution becomes mechanical, attention shifts to architecture and decisions.
That’s where ML work becomes sustainable.